Avoiding the instrumental policy by hiding information about humans — AI Alignment Forum

I've been thinking about situations where alignment fails because "predict what a human would say" (or more generally "game the loss function," what I call the instrumental policy) is easier to learn than "answer questions honestly" (overview).

One way to avoid this situation is to avoid telling our agents too much about what humans are like, or to hide some details of the training process, so that they can't easily predict humans and are instead encouraged to fall back to "answer questions honestly." (This feels closely related to the general phenomena discussed in Thoughts on Human Models.)

Setting aside other reservations with this approach, could it resolve our problem?

One way to get the instrumental policy is to "reuse" a human model to answer questions (discussed here). If our AI has no information about humans at all, then it has no human model to reuse and this concern is totally addressed. But in practice it seems inevitable for the environment to leak some information about how humans answer questions (e.g. observing human artifacts tells you something about how humans reason about the world and what concepts would be natural for them). So the model will have some latent knowledge that it can reuse to help predict how to answer questions. The intended policy may not be able to leverage that knowledge, and so it seems like we may get something (perhaps somewhere in between the intended and instrumental policies) which is able to leverage it effectively. Moderate amounts of leakage might be fine, but the situation would make me quite uncomfortable.
Another way to get something similar to the instrumental policy is to use observations to translate from the AI's world-model to humans' world-model (discussed here). I don't think that hiding information about humans can avoid this problem, because in this case training to answer questions already provides enough information to infer the humans' world-model.
I have a strong background concern about "security through obscurity" when the alignment of our methods depends on keeping a fixed set of facts hidden from an increasingly-sophisticated ML system. This is a general concern with approaches that try to benefit from avoiding human models, but I think it bites particularly hard in this case.

Overall I think that hiding information probably isn't a good way to avoid the instrumental policy, and for now I'd strongly prefer to pursue approaches to this problem that work even if our AI has a good model of humans and of the training process.

(Sometimes I express hope that the training process can be made too complex for the instrumental policy to easily reason about. I'm always imagining doing that by having additional ML systems participating as part of the training process, introducing a scalable source of complexity. In the cryptographic analogy, this is more like hiding a secret key or positing a computational advantage for the defender than hiding the details of the protocol.)
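The cryptographic side of this analogy can be made concrete with a minimal sketch. It assumes Python's standard `hmac` module as a stand-in, and the names `tag`, `verify`, `defender_key`, and `attacker_key` are hypothetical, chosen only for illustration. The protocol itself (HMAC-SHA256) is fully public; the only hidden fact is a random key, and an adversary who knows every detail of the protocol still cannot produce a valid tag without it.

```python
import hashlib
import hmac
import secrets


def tag(key: bytes, message: bytes) -> bytes:
    """Public protocol: anyone may read this code. Security rests on `key` alone."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, candidate: bytes) -> bool:
    """Check a candidate tag using a constant-time comparison."""
    return hmac.compare_digest(tag(key, message), candidate)


# The defender's only secret is a random 256-bit key (hypothetical setup).
defender_key = secrets.token_bytes(32)
message = b"example message"
good_tag = tag(defender_key, message)

# The attacker knows the protocol exactly but has to guess the key.
attacker_key = secrets.token_bytes(32)
forged_tag = tag(attacker_key, message)

print(verify(defender_key, message, good_tag))    # tag made with the real key
print(verify(defender_key, message, forged_tag))  # tag made with a guessed key
```

This only illustrates the analogy, not an alignment technique: the hope described above corresponds to the key, a source of difficulty that scales with the defender's resources, rather than to keeping the `tag` function itself hidden.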

That said, hiding information about humans does break the particular hardness arguments given in both of my recent posts. If other approaches turned out to be dead ends, I could imagine revisiting those arguments and seeing if there are other loopholes once we are willing to hide information. But I'm not nearly that desperate yet.