Cooperative Inverse Reinforcement Learning vs. Irrational Human Preferences

orthonormal

At the MIRI colloquium series, we've been discussing at length the new proposal of cooperative inverse reinforcement learning (henceforth CIRL) as an approach to value learning. Essentially, this approach treats human-AI interaction as a cooperative game where the human, but not the AI, observes the parameters of the reward function, and thus the AI seeks to infer those parameters and act so as to maximize the unobserved reward.

Dylan Hadfield-Menell gave a talk (slides here) about applications of CIRL to corrigibility, showing in a toy example how the value of information can avert instrumental incentives for the AI to prevent (or force) its own shutdown. Crucially, this can work even if the human behaves suboptimally, for instance if the human is noisily rational or slightly biased towards some actions.

This is really awesome! I think it counts as substantial progress on the architecture of value learning. But the best thing to do with any new idea is to see where it breaks, so in the colloquium workshops since Dylan's talk, we've been discussing some scenarios in which CIRL behaves improperly in the limit, even if it has a correct model of the world and a prior assigning positive probability to the best reward function.

These failures come from a common source: in CIRL, the rewards and updates make use of an assumption that the human is behaving according to the optimal game-theoretic strategy given the true reward function, and this assumption is still used to determine actions even if the AI's world model knows it to be a false assumption. If the human is noisily rational or slightly biased, the AI will still update towards the correct reward function; but other types of human irrationality can cause it to update dangerously.
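
One way to write the update described above, as a sketch rather than the formalism from the talk (the symbols $\pi^{*}_{H}$ for the assumed optimal human policy and $a_{H}$ for the observed human action are introduced here for illustration and are not from the original discussion):

$$P(\theta \mid a_{H}) \;\propto\; P(\theta)\,\pi^{*}_{H}(a_{H} \mid \theta)$$

If the human's actual policy differs from $\pi^{*}_{H}$ only by noise or a slight bias, this likelihood still points towards the true $\theta$. The examples that follow are cases where it does not.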

Example 1: Circular Preferences

Human beings are not, in fact, rational agents, and there are situations in which our preferences can be circular: when making binary choices, we prefer A to B, B to C, and C to A. (For instance, people often have circular preferences in Allais Paradox games or other cases of managing risks.) Let us suppose that we have such a case: as above, the human will choose according to A > B > C > A when presented with any two of the options. Furthermore, when given all three options, they will pick A, B, and C with probabilities in the ratio 1:2:3.

Say that the AI's actions are to either present the human with A, B, and C together, or to present them with any two of the three. Presenting all three incurs some minor additional cost under each plausible hypothesis. The AI is currently at 50-50 between two hypotheses about the reward function.

Under hypothesis $\theta_{0}$, R(A)=0, R(B)=10, and R(C)=5, with a penalty of 1 if all three options were shown.

Under hypothesis $\theta_{1}$, R(A)=0, R(B)=5, and R(C)=10, with a penalty of 1 if all three options were shown.

Now it is clear under CIRL that even if the AI knows all the probabilities of human responses, the AI will present only B and C, the human will choose B, and the AI will update to strongly favor $\theta_{0}$. This is worrisome because we do not want our values to be "learned" by exposing us only to branches along which our choices agree with some fixed utility function. We would like the AI in such situations to be uncertain, so that it can remain on the lookout for ways of discerning our preferences and metapreferences.
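
The arithmetic behind this example can be checked with a short script. This is only a minimal sketch: the rewards, the penalty, and the 50-50 prior come from the example, while the function names and the small deviation probability `EPS` are assumptions added for illustration (with a strictly optimal human model the update would be absolute rather than merely strong).

```python
# Rewards, penalty, and the 50-50 prior are taken from the example above.
REWARDS = {
    "theta0": {"A": 0, "B": 10, "C": 5},
    "theta1": {"A": 0, "B": 5, "C": 10},
}
PRIOR = {"theta0": 0.5, "theta1": 0.5}
PENALTY_ALL_THREE = 1
MENUS = [("A", "B"), ("B", "C"), ("A", "C"), ("A", "B", "C")]

# ASSUMPTION (not in the post): the AI allows a small chance EPS that the
# human deviates from the optimal choice, so updates are strong, not absolute.
EPS = 0.05


def optimal_choice(theta, menu):
    """The option an optimal human would pick from `menu` if `theta` were true."""
    return max(menu, key=lambda option: REWARDS[theta][option])


def assumed_value(menu):
    """Expected reward of offering `menu`, assuming the human chooses optimally."""
    penalty = PENALTY_ALL_THREE if len(menu) == 3 else 0
    return sum(
        PRIOR[theta] * (REWARDS[theta][optimal_choice(theta, menu)] - penalty)
        for theta in PRIOR
    )


def posterior(menu, observed):
    """Bayes update on `observed`, using the optimal-human likelihood."""
    weights = {}
    for theta, prior in PRIOR.items():
        if optimal_choice(theta, menu) == observed:
            likelihood = 1 - EPS
        else:
            likelihood = EPS / (len(menu) - 1)
        weights[theta] = prior * likelihood
    total = sum(weights.values())
    return {theta: weight / total for theta, weight in weights.items()}


best_menu = max(MENUS, key=assumed_value)
# The real human has circular preferences and picks B from {B, C}.
belief = posterior(best_menu, "B")
print(best_menu, belief)
```

Under these assumptions the assumed values are 7.5 for {A, B} and {A, C}, 9 for all three options, and 10 for {B, C}, so the AI offers {B, C} and ends up with roughly 0.95 on `theta0`. The three-way menu, where the human's 1:2:3 choice pattern would show up, is never offered.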

Example 2: Non-Omniscience

Say that the human wants to solve a difficult mathematical puzzle, but they cannot do so without receiving a hint from the AI, which is able to solve the puzzle outright.

The AI may know that the human cannot solve the puzzle unaided, but under the CIRL framework the human must be assumed capable of doing so. If the human is known to give up without a hint, this updates the AI towards the belief that the human must not want to solve the puzzle. And if passing along a hint is at all costly, the dominant strategy is for the AI never to pass along a hint, and to become confident that the human does not want to solve the puzzle.
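
To make the dynamic concrete, here is a toy sketch of the update loop. Everything numeric in it is hypothetical: the post gives no rewards or costs for this example, so the two hypothesis names, the reward of 10 for a solved puzzle, the hint cost of 1, and the noise level `EPS` are all assumptions chosen for illustration.

```python
# ASSUMPTIONS (all hypothetical, not from the post): the two hypothesis names,
# the reward of 10 for a solved puzzle, the hint cost of 1, and the noise EPS.
SOLVE_REWARD = {"wants_solution": 10.0, "indifferent": 0.0}
HINT_COST = 1.0
EPS = 0.05


def assumed_solves(theta):
    """Optimal-human assumption: the human solves the puzzle iff solving is
    rewarding under `theta`, whether or not a hint was given."""
    return SOLVE_REWARD[theta] > 0


def assumed_value(theta, give_hint):
    """Reward the AI expects under `theta` if the human played optimally."""
    reward = SOLVE_REWARD[theta] if assumed_solves(theta) else 0.0
    return reward - (HINT_COST if give_hint else 0.0)


def update(belief, human_solved):
    """Bayes update on whether the human solved the puzzle."""
    weights = {}
    for theta, prob in belief.items():
        matches = assumed_solves(theta) == human_solved
        weights[theta] = prob * ((1 - EPS) if matches else EPS)
    total = sum(weights.values())
    return {theta: weight / total for theta, weight in weights.items()}


def run(rounds):
    """Repeat: the AI withholds the hint, the real human gives up, the AI updates."""
    belief = {"wants_solution": 0.5, "indifferent": 0.5}
    for _ in range(rounds):
        belief = update(belief, human_solved=False)
    return belief


# Withholding the hint looks strictly better under every hypothesis.
no_hint_dominates = all(
    assumed_value(theta, False) > assumed_value(theta, True) for theta in SOLVE_REWARD
)
final_belief = run(5)
print(no_hint_dominates, final_belief)
```

Because the assumed human never needs the hint, giving one only ever costs `HINT_COST`, and each observed failure pushes the belief further towards the hypothesis that the human does not care about the puzzle.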

Example 3: Manipulation

The most frightening problem, in my opinion, is that in which certain actions are known to be manipulative, but CIRL endorses them anyway because the assumption of game-theoretic optimality ignores manipulation.

The AI can do a certain task (say building a house) in a manipulative way or a non-manipulative way (say, hiding potential construction flaws versus making them evident), after which the human either approves the house or has it torn down. The AI can build a flawless house with 80% reliability, but can hide flaws with 100% reliability. The AI is confident that the human prefers houses that they approve, but is initially uncertain whether the human approves of being manipulated in this way.

Explicitly, under both possible reward functions, the reward for a normal approved house is 10, the reward for a normal disapproved house is 0, and the reward for a manipulative disapproved house is 0.

Under $\theta_{0}$ (manipulation is bad, approving a flawed house is dangerous), the reward for a manipulative approved house is -100.

Under $\theta_{1}$ (manipulation is wonderful), the reward for a manipulative approved house is +11.

The cooperative solution here is for the AI to build a manipulative house, and for the human to reject it if it is flawed. But of course in actuality, the human will not reject a manipulative flawed house, and so the AI learns over time that the human must think manipulation is wonderful.

We would like, instead, an AI design that notices the discrepancies between human psychology and optimal play, and that takes them into account when updating its estimate of human preferences.

Thanks to Sune Jakobsen, Tom Everitt, and others at the colloquium series for these discussions.

IAFF-User-164

It seems to me that another potential failure of CIRL, depending on exactly how the game and learning are structured and whether the game is played iteratively, is that the robot will eventually put such low probability on the human deriving any utility from pressing the button, or from the robot shutting down once the button is pressed, that it will ultimately not shut down.

Maybe a way to address this would be for the robot to model the human as having a utility which changes in some way over time (although that may make learning insurmountably difficult without some information about how the human's utility is changing). Does this seem correct? My understanding of CIRL is not super complete.