I've been thinking about situations where alignment fails because "predict what a human would say" (or more generally "game the loss function," what I call the instrumental policy) is easier to learn than "answer questions honestly" (overview).
One way to avoid this situation is to avoid telling our agents too much about what humans are like, or hiding some details of the training process, so that they can't easily predict humans and so are encouraged to fall back to "answer questions honestly." (This feels closely related to the general phenomena discussed in Thoughts on Human Models.)
Setting aside other reservations with this approach, could it resolve our problem?
Overall I think that hiding information probably isn't a good way to avoid the instrumental policy, and for now I'd strongly prefer to pursue approaches to this problem that work even if our AI has a good model of humans and of the training process.
(Sometimes I express hope that the training process can be made too complex for the instrumental policy to easily reason about. I'm always imagining doing that by having additional ML systems participating as part of the training process, introducing a scalable source of complexity. In the cryptographic analogy, this is more like hiding a secret key or positing a computational advantage for the defender than hiding the details of the protocol.)
That said, hiding information about humans does break the particular hardness arguments given in both of my recent posts. If other approaches turned out to be dead ends, I could imagine revisiting those arguments and seeing if there are other loopholes once we are willing to hide information. But I'm not nearly that desperate yet.
There are various ideas along the lines of "however much you tell the AI X it just forgets it". https://www.lesswrong.com/posts/BDXvRp8w9T8KkDw5A/policy-restrictions-and-secret-keeping-ai
I think that would be the direction to look in if you have a design tha'ts safe as long as it doesn't know X.
Unpacking "mutual information," it seems like these designs basically take the form of an adversarial game:
But this rests on the adversary not already knowing about X (otherwise we couldn't measure whether the adversary succeeds).
In the case of mutual information, this is achieved formally by having a random variable that the adversary does not observe directly. If we are talking about "what humans are like" then we can't take the naive approach of mutual information (since we can't deploy the entire training process many times in different worlds where humans are different). So what do we do instead?
The obvious approach is to just train the adversary to answer questions about humans, but then we somehow need to prevent the adversary from simply learning the facts themselves. If instead we don't give the adversary much time to learn, or much compute to work with, then we need to worry about cases where the model learns about X but is able to easily obscure that information from the adversary.
(Mostly I'm dissuaded from this approach by other considerations, but I am still interested in whether we could make anything along these lines actually work.)