# 9

In my last post, I argued that interaction between the human and the AI system was necessary in order for the AI system to “stay on track” as we encounter new and unforeseen changes to the environment. The most obvious implementation of this would be to have an AI system that keeps an estimate of the reward function. It acts to maximize its current estimate of the reward function, while simultaneously updating the reward through human feedback. However, this approach has significant problems.

Looking at the description of this approach, one thing that stands out is that the actions are chosen according to a reward that we know is going to change. (This is what leads to the incentive to disable the narrow value learning system.) This seems clearly wrong: surely our plans should account for the fact that our rewards will change, without treating such a change as adversarial? This suggests that we need to have our action selection mechanism take the future rewards into account as well.

While we don’t know what the future reward will be, we can certainly have a probability distribution over it. So what if we had uncertainty over reward functions, and took that uncertainty into account while choosing actions?

## Setup

We’ve drilled down on the problem sufficiently far that we can create a formal model and see what happens. So, let’s consider the following setup:

• The human, Alice, knows the “true” reward function that she would like to have optimized.
• The AI system maintains a probability distribution over reward functions, and acts to maximize the expected sum of rewards under this distribution.
• Alice and the AI system take turns acting. Alice knows that the AI learns from her actions, and chooses actions accordingly.
• Alice’s action space is such that she cannot take the action “tell the AI system the true reward function” (otherwise the problem would become trivial).
• Given these assumptions, Alice and the AI system act optimally.

This is the setup of Cooperative Inverse Reinforcement Learning (CIRL). The optimal solution to this problem typically involves Alice “teaching” the AI system by taking actions that communicate what she does and does not like, while the AI system “asks” about parts of the reward by taking actions that would force Alice to behave in different ways for different rewards.
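To make the second bullet of the setup concrete, here is a minimal single-step sketch of an agent that keeps a belief over reward functions, acts on the expected reward, and updates that belief after watching Alice. Everything in it is an assumption for illustration: the two candidate reward functions, the action names, and the noisily rational (Boltzmann) model of Alice with parameter `beta`. It is not the full CIRL game, since Alice here does not choose her actions in order to teach.

```python
import math

# Hypothetical candidate reward functions over three actions (illustrative only).
REWARDS = {
    "likes_coffee": {"coffee": 1.0, "tea": 0.0, "nothing": 0.6},
    "likes_tea": {"coffee": 0.0, "tea": 1.0, "nothing": 0.6},
}
ACTIONS = ["coffee", "tea", "nothing"]


def expected_reward(belief, action):
    """Expected reward of an action under the belief over reward functions."""
    return sum(p * REWARDS[h][action] for h, p in belief.items())


def best_action(belief):
    """Action that maximizes expected reward under the current belief."""
    return max(ACTIONS, key=lambda a: expected_reward(belief, a))


def update(belief, alice_action, beta=5.0):
    """Bayes update, assuming Alice picks actions with probability
    proportional to exp(beta * reward) under her true reward function."""
    posterior = {}
    for h, p in belief.items():
        normalizer = sum(math.exp(beta * REWARDS[h][a]) for a in ACTIONS)
        likelihood = math.exp(beta * REWARDS[h][alice_action]) / normalizer
        posterior[h] = p * likelihood
    total = sum(posterior.values())
    return {h: p / total for h, p in posterior.items()}


prior = {"likes_coffee": 0.5, "likes_tea": 0.5}
print(best_action(prior))       # hedges while uncertain
posterior = update(prior, "tea")
print(best_action(posterior))   # commits once Alice's action is informative
```

Under the 50/50 prior the hedging action has the highest expected reward, and after seeing Alice choose tea the belief concentrates enough for the agent to commit.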

## Does this solve our problems?

Two of the problems we identified in the last post are simply assumed away:

• Alice does not know the “true” reward function, but we assumed that she does.
• Alice may be unable to optimally give feedback to the AI system, but we assume that she is optimal here.

So this particular kind of reward uncertainty does not fix either of these problems. What about convergent instrumental subgoals?

Utility preservation. One major worry we had with the original setup was that the AI system would disable its narrow value learning system, as a manifestation of the instrumental goal of protecting its utility function. This is reversed in our setup: the AI system has a positive incentive to continue doing narrow value learning, since it helps it home in on the true reward function, which in turn allows it to optimize the reward better. (We might worry that this prevents us from fixing any problems in the narrow value learning system, but that is a robustness problem: in the world where everything is working correctly, this is the correct incentive.)

Survival incentive, aka shutdown incorrigibility. Another worry is that the AI system has a survival incentive that causes it to prevent us from shutting it down. With reward uncertainty, the fact that we are trying to shut the AI system down is itself strong evidence about the reward function. The AI system should reason that its operation leads to worse outcomes, and so allow itself to be turned off.

The Off-Switch Game formalizes this reasoning in a simple setting. The AI system can either directly take action a (perhaps by disabling an off switch), can shut down, or can allow Alice to choose between these options. If the AI system allows Alice to choose, Alice then decides whether or not to shut down the AI system. Assuming that Alice is optimal, the AI system reasons that when Alice chooses to shut it down, the true reward function must have been one which makes the action a bad, and so it does better by deferring to Alice. However, when Alice is modeled as noisily rational instead of optimal, the AI system might reason that Alice might make a mistake when deciding to shut the AI system down, and so it might take action a directly without deferring to her. So, the AI system becomes shutdown corrigible, as long as it assumes that Alice is sufficiently rational.
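A small numerical sketch of this reasoning may help. The numbers are my own illustrative assumptions, not taken from the Off-Switch Game paper: the AI system has a discrete belief over the utility of action a, shutting down is worth 0, an optimal Alice permits a exactly when its utility is positive, and a noisily rational Alice permits it with probability `sigmoid(beta * utility)`.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def off_switch_values(belief, beta=None):
    """Expected value of each AI option.

    belief: dict mapping utility of action a -> probability (assumed numbers).
    beta: None models an optimal Alice; a float models a noisily rational
          Alice who allows a with probability sigmoid(beta * utility).
    """
    act = sum(p * u for u, p in belief.items())
    shutdown = 0.0
    if beta is None:
        defer = sum(p * max(u, 0.0) for u, p in belief.items())
    else:
        defer = sum(p * u * sigmoid(beta * u) for u, p in belief.items())
    return {"act": act, "shutdown": shutdown, "defer": defer}


def choice(values):
    """Option with the highest expected value."""
    return max(values, key=values.get)


# Assumed belief: a is probably good (+2) but has a 10% chance of being bad (-1).
belief = {2.0: 0.9, -1.0: 0.1}
print(choice(off_switch_values(belief)))             # optimal Alice
print(choice(off_switch_values(belief, beta=10.0)))  # nearly rational Alice
print(choice(off_switch_values(belief, beta=0.5)))   # very noisy Alice
```

With these numbers, deferring beats acting directly when Alice is optimal or nearly so, and acting directly wins once Alice is noisy enough, which is the tradeoff described above.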

Should robots be obedient? makes a similar point, arguing that an AI system that learns preferences and then acts to maximize their satisfaction can perform better than an AI system that simply obeys instructions, because humans are not perfectly rational. This creates a tradeoff between performance and obedience (which shutdown corrigibility is an instance of).

Of course, these simple models exclude many actions that a realistic AI system could take. In particular, it seems likely that an AI system would prefer to disable the shutdown button, gather information about the reward until it has fully updated, and optimize the resulting set of rewards. If the space of reward functions is misspecified, as it likely will be, this will lead to bad behavior. (This is the point made by Incorrigibility in the CIRL Framework.)

Note though that while this cuts against shutdown corrigibility (since the AI system would prefer to disable the shutdown button), I would frame the problem differently. If the space of rewards is well-specified and has sufficient weight on the true reward function and the AI system is sufficiently robust and intelligent, then the AI system must update strongly on us attempting to shut it down. This should cause it to stop doing the bad thing it was doing. When it eventually narrows down on the reward it will have identified the true reward, which by definition is the right thing to optimize. So even though the AI system might disable its off switch, this is simply because it is better at knowing what we want than we are, and this leads to better outcomes for us. So, really the argument is that since we want to be robust (particularly to reward misspecification), we want shutdown corrigibility, and reward uncertainty is an insufficient solution for that.

## A note on CIRL

There has been a lot of confusion on what CIRL is and isn’t trying to do, so I want to avoid adding to the confusion.

CIRL is not meant to be a blueprint for a value-aligned AI system. It is not the case that we could create a practical implementation of CIRL and then we would be done. If we were to build a practical implementation of CIRL and use it to align powerful AI systems, we would face many problems:

• As mentioned above, Alice doesn’t actually know the true reward function, and she may not be able to give optimal feedback.
• As mentioned above, in the presence of reward misspecification the AI system may end up optimizing the wrong thing, leading to catastrophic outcomes.
• Similarly, if the model of Alice’s behavior is incorrect, as it inevitably will be, the AI system will make incorrect inferences about Alice’s reward, again leading to bad behavior. As an example that is particularly easy to model, should the AI system model Alice as thinking about the robot thinking about Alice, or should it model Alice as thinking about the robot thinking about Alice thinking about the robot thinking about Alice? How many levels of pragmatics is the “right” level?
• Lots of other problems have not been addressed: the AI system might not deal with embeddedness well, or it might not be robust and could make mistakes, etc.

CIRL is supposed to bring conceptual clarity to what we could be trying to do in the first place with a human-AI system. In Dylan’s own words, “what cooperative IRL is, it’s a definition of how a human and a robot system together can be rational in the context of fixed preferences in a fully observable world state”. In the same way that VNM rationality informs our understanding of humans even though humans are not expected utility maximizers, CIRL can inform our understanding of alignment proposals, even though CIRL itself is unsuitable as a solution to alignment.

Note also that this post is about reward uncertainty, not about CIRL. CIRL makes other points besides reward uncertainty, that are well explained in this blog post, and are not mentioned here.

While all of my posts have been significantly influenced by many people, this post is especially based on ideas I heard from Dylan Hadfield-Menell. However, besides the one quote, the writing is my own, and may not reflect Dylan’s views.

[A couple of (to me seemingly fairly obvious) points about value uncertainty, which it still seems like a lot of people here may not have been discussing:]

Our agent needs to be able to act in the face of value uncertainty. That means that each possible action the agent is choosing between has a distribution of possible values, for two reasons: 1) the universe is stochastic, or at least the agent doesn't have a complete model of it so cannot fully predict what state of the universe an action will produce -- with infinite computing power this problem is gradually solvable via Solomonoff induction, just as was considered for AIXI [and Solomonoff induction has passable computable approximations that, when combined with goal-oriented consequentialism, are generally called "doing science"]; 2) the correct function mapping from states of the universe to human utility is also unknown, and also has uncertainty. These two uncertainties combine to produce a probability distribution of possible true-human-utility values for each action.

1) We know a lot about reasonable priors for utility functions, and should encode this into the agent. The agent is starting in an environment that has already been heavily optimized by humans, who were previously optimized for life on this planet by natural selection. So this environment's utility for humans is astonishingly high, by the standards of randomly selected patches of the universe or random arrangements of matter. Making large or random changes to it thus has an extremely high probability of decreasing human utility. Secondly, any change that takes the state of the universe far outside what you have previously observed puts it into a region where the agent has very little idea what humans will think is the utility of that state - the agent has almost no non-prior knowledge about states far outside its prior distribution. If the agent is a GAI, there are going to be actions that it could take that can render the human race extinct -- a good estimate of the enormous negative utility of that possibility should be encoded into its priors for human utility functions of very unknown states, so that it acts with due rational caution about this possibility. It needs to be very, very, very certain that an act cannot have that result before it risks taking it. Also, if the state being considered is also far out of prior distribution for those humans it has encountered, they may have little accurate data to estimate its utility either - even if it sounds pretty good to them now, they won't really know if they like it until they've tried living with it, and they're not fully rational and have limited processing power and information.

So in general, if a state is far out of the distribution of previously observed states, it's a very reasonable prior that its utility is much lower, that the uncertainty of its utility is high, that that uncertainty is almost all on the downside (the utility distribution has a fat lower tail, but not a fat upper one: it could be bad, or it could be really, really bad), and that downside has non-zero weight all the way down to the region of "extinction of the human race" - so overall, the odds of the state actually being better than the ones in the previously-observed state distribution are extremely low. [What actually matters here is not whether you've observed the state, but to what level your Solomonoff induction-like process has given you sufficiently high confidence that you can well-predict its utility to overcome this prior, bearing in mind the sad fact that you're not actually running the infinite-computational power version of Solomonoff induction.] This is also true even if the state sounds good offhand, and even if it sounds good to a human you ask, if it's well outside their distribution of previously observed states -- especially if it's a state that they might have cognitive biases about, or insufficient information or processing power to accurately evaluate. If it was a state that they could predict was good and arrange to reach, they would already have done so, after all. So either don't optimize over those states at all, or at least use priors in your utility function distribution for them that encode all of these reasonable assumptions and will make these states reliably get ignored by the optimizer. If you consider a plan leading to such a state at all, probably the first thing you should be doing is safe investigations to further pin down both its achievability and to get a better estimate of its true utility.

So, before deciding to transport humans to Mars, investigate not just rocketry, and building a self-sustaining city there, but also whether humans would actually be happy in a city on Mars (at a confidence level much higher than just asking them "Does going to Mars sound fun to you?"). Incidentally, another very reasonable prior to give it is "the smarter the smartest GAI in the world is, relative to a smart human, the higher the risk of the AI apocalypse is". This is just a special case of "going outside the previously observed range of states is almost always bad", but it's an extremely important one, and I'd suggest preencoding it in priors, along with other similar current best thinking on existential risks.

This is how you avoid optimizers frequently finding optimal-looking states way outside their training distribution -- you teach them the true fact that they live in a very unusual place where, for the specific thing that they're supposed to be figuring out how to optimize for, almost all such states are bad, and some are abysmal, because the environment is already heavily optimized for that value. So you write "here there could be dragons, and probably at least bears" all over the rest of the map.

[Note that a very plausible result of building a rational system with these priors, at least without preloading a lot of training data into it to give it a significant set of previously observed states that it has high confidence in the safety of, is that it either on startup tells you "please turn me off immediately - I am far too dangerous to be allowed to exist", or else goes catatonic.]

2) So, given those cautious priors, should we just have the agent optimize the average of that utility distribution of each state it considers, so we're optimizing a single currently-estimated utility function that meanwhile changes as the agent uses a Bayesian/Solomonoff-like process to learn more about its true value? No, that's also a bad idea - it leads to what might be called over-optimizing [though I'd prefer to call it "looking elsewhere"]. The distribution contains more information than just its average, and that information is useful for avoiding over-optimization/looking elsewhere.

Even with this set of priors, i.e. even if the agent (effectively or actually) optimizes only over states that are in or near your distribution of previously observed states, there is a predictable statistical tendency for the process of optimizing over a very large number of states to produce not the state with the largest true utility, but rather one whose true utility is in fact somewhat lower but just happens to have been badly misestimated on the high side -- this is basically the "look elsewhere effect" from statistics and is closely related to "P-hacking". If we were playing a multi-armed bandit problem and the stakes were small (so none of the bandit arms have "AI apocalypse" or similar human-extinction level events on their wheel), this could be viewed as a rather dumb exploration strategy to sequentially locate all such states and learn that they're actually not so great after all, by just trying them one after another and being repeatedly disappointed. If all you're doing is bringing a human a new flavor of coffee to see if they like it, this might even be not that dreadful a strategy, if perhaps annoying for the human after the third or fourth try -- so the more flavors the coffee shop has, the worse this strategy is. But the universe is a lot more observable and structured than a multi-armed bandit, and there are generally much better/safer ways to find out more about whether a world state would be good for humans than just trying it on them (you could ask the human if they like mocha, for example).
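The tendency described here is easy to see in a toy simulation. The sketch below is only an illustration under assumed numbers (true utilities and estimation noise both drawn from unit normals, 1000 candidate states): it picks the state with the highest estimated utility and measures how much that estimate exceeds the state's true utility on average.

```python
import random


def optimizer_curse_gap(num_states=1000, trials=200, noise_std=1.0, seed=0):
    """Average (estimated - true) utility of the state picked by taking
    the argmax over noisy utility estimates."""
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(trials):
        best_estimate, best_true = float("-inf"), 0.0
        for _ in range(num_states):
            true_utility = rng.gauss(0.0, 1.0)
            estimate = true_utility + rng.gauss(0.0, noise_std)
            if estimate > best_estimate:
                best_estimate, best_true = estimate, true_utility
        total_gap += best_estimate - best_true
    return total_gap / trials


# With many candidate states the chosen state's estimate is biased upward;
# with a single candidate there is nothing to select on, so the gap is near zero.
print(optimizer_curse_gap(num_states=1000))
print(optimizer_curse_gap(num_states=1))
```

The gap grows with the number of states being compared, which is the reason for scaling the correction with the size of the search space.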

So what the agent should be doing is acting cautiously, and allowing for the size of the space it is optimizing over. For simplicity of statistical exposition, I'm temporarily going to assume that all our utility distributions, here for states in or near the previously-observed states distribution, are all well enough understood and multi-factored that we're modeling them as normal distributions (rather than distributions that are fat-tailed on the downside, or frequently complex and multimodal, both of which are much more plausible), and also that all the normal distributions have comparable standard deviations. Under those unreasonably simplifying assumptions, here is what I believe is an appropriately cautious optimization algorithm that suppresses over-optimization:

1. Do a rough estimate of how many statistically-independent-in-relative-utility states/actions/factors in the utility calculation you are optimizing across, whichever is lowest (so, if there are a continuum of states, but you somehow only have one normal-distributed uncertainty involved in deducing their relative utility, that would be one).
2. Calculate how many standard deviations above the mean the highest sample will on average be if you draw that many random samples from a standard (unit) normal distribution, and call this L (for "look-elsewhere factor"). [In fact, you should be using a set of normal distributions for the individual independent variables you found above, which may have varying standard deviations, but for simplicity of exposition I above assumed these were all comparable.]
3. Then what you should be optimizing for each state is its mean utility minus L times the standard deviation of its utility. [Of course, my previous assumption that their standard deviations were all fairly similar also means this doesn't have much effect -- but the argument generalizes to cases where the standard deviations are wider, while also generally reducing the calculated value of L.]
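The three steps above can be sketched in a few lines. This is an illustration under the same normal-distribution assumption, with two additions of my own that are labeled as such: L is computed with Blom's approximation to the expected maximum of n standard normal draws (the steps do not say how to compute it), and the two candidate states with their means and standard deviations are made up.

```python
from statistics import NormalDist


def look_elsewhere_factor(n):
    """Approximate expected maximum of n independent standard normal draws.

    Uses Blom's approximation (an assumption of this sketch; exact
    order-statistic integration would also work).
    """
    alpha = 0.375
    return NormalDist().inv_cdf((n - alpha) / (n - 2 * alpha + 1))


def cautious_choice(candidates, n):
    """Pick the state maximizing mean - L * std.

    candidates: dict of name -> (mean utility, std of utility).
    n: rough count of statistically independent states being compared.
    """
    factor = look_elsewhere_factor(n)
    return max(
        candidates,
        key=lambda name: candidates[name][0] - factor * candidates[name][1],
    )


# Hypothetical candidates: a well-understood state and a better-looking
# but poorly understood one.
candidates = {"familiar": (1.0, 0.1), "novel": (1.5, 0.5)}
print(look_elsewhere_factor(1_000_000))
print(cautious_choice(candidates, 1))
print(cautious_choice(candidates, 1_000_000))
```

When only one state is under consideration the factor is zero and the higher mean wins; when the search is effectively over a million independent states, the well-understood state wins despite its lower mean.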

Note that L can get largeish -- it's fairly easy to have, say, millions of uncorrelated-in-relative-utility states, in which case L would be around 5, so then you're optimizing for the mean minus five standard deviations, i.e. looking for a solution at a 5-sigma confidence level. [For the over-simplified normal distribution assumption I gave above to still hold, this also requires that your normal distributions really are normal, not fat-tailed, all the way out to 1-in-several-million odds -- which almost never happens: you're putting an awful lot of weight on the central limit theorem here, so if your assumed-normal distribution is in fact only, say, the sum of 1,000,000 equally-weighted coin flips then it already failed.] So in practice your agent needs to do more complicated extreme-quantile statistics without the normal distribution assumption, likely involving examining something like a weighted ensemble of candidate Bayes-nets for contributions to the relative utility distribution -- and in particular it needs to have an extremely good model of the lower end of the utility probability distribution for each state, i.e. pay a lot of detailed attention to the question "is there even a really small chance that I could be significantly overestimating the utility of this particular state, relative to all the others I'm optimizing between?"

So the net effect of this is that, the larger the effective dimension of the space you're optimizing over, the more you should prefer the states for which you're most confident about their utility being good, so you tend to reduce the states you're looking at to ones where your set of previous observations gave you very high confidence of their utility being high enough to be worth considering.

Now, as an exploration strategy, it might be interesting to also do a search optimizing, say, just the mean of the distributions, while also optimizing across states a bit further from the observed-state distribution (which that set of priors will automatically do for you, since their average is a lot less pessimistic than their fat-tailed downside), to see what state/action that suggests, but don't actually try the action (not even with probability epsilon -- the world is not a multi-armed bandit, and if you treat it like one, it has arms that can return the extinction of the human species): instead consider spawning a subgoal of your "get better at fetching coffee" subgoal to cautiously further investigate uncertainties in that state's true utility. So, for example, you might ask the human "They also have mocha there, would you prefer that?" (on the hypothesis that the human didn't know that, and would have asked for mocha if they had).

This algorithm innately gives you behavior that looks a lot like the agent is modeling both a soft version of the Impact of its actions (don't go far outside the previously-observed distribution, unless you're somehow really, really sure it's a good idea) and also like a quantilizer (don't optimize too hard, with bias towards the best-understood outcomes, again unless you're really, really sure it's a good idea). It also pretty much immediately spawns a "learn more about human preferences in this area" sub-goal to any sub-goal the agent already has, and thus forces the AI to cautiously apply goal-oriented approximate Solomonoff induction (i.e. science) to learning more about human preferences. So any sufficiently intelligent/rational AI agent is thus forced to become an alignment researcher and solve the alignment problem for you, preferably before fetching you coffee, and also to be extremely cautious until it has done so. Or at the very least, to ask you any time they have a new flavor at the coffee shop whether you prefer it, or want to try it.

[Note that this is what is sometimes called an "advanced" control strategy, i.e. it doesn't really start working well until your agent is approaching GAI, and is capable of both reasoning and acting in a goal-oriented way about the world, and your desires, and how to find out more about them more safely than just treating the world like a multi-armed bandit, and can instead act more like a rational Bayesian-thinking alignment researcher. So the full version of it has a distinct "you only get one try at making this work" element to it. Admittedly, you can safely fail repeatedly on the "I didn't give it enough training data for its appropriate level of caution, so it shut itself down" side, as long as you don't respond to that by making the next version's priors less cautious, rather than by collecting a lot more training data -- though maybe what you should actually do is ask it to publicly explain on TV or in a TED talk that it's too dangerous to exist before it shuts down? However, elements of this like priors that doing something significantly out-of-observed-distribution in an environment that has already been heavily optimized is almost certain to be bad, or that if you're optimizing over many actions/states/theories of value you should be optimizing not the mean utility but a very cautious lower bound of it, can be used on much dumber systems. Something dumber, that isn't a GAI and can't actually cause the extinction of the human race (outside contexts containing nuclear weapons or bio-containment labs that it shouldn't be put in) also doesn't need priors that go quite that far negative -- its reasonable low-end prior utility value is probably somewhere more in the region of "I set the building on fire and killed many humans". However, this is still a huge negative value compared to the positive value of "I fetched a human coffee", so it should still be very cautious, but its estimate of "Just how implausible is it that I could actually be wrong about this?" is going to be as dumb as it is, so its judgment on when it can stop being cautious will be bad. So an agent actually needs to be pretty close to GAI for this to be workable.]

Nice comment!

The arguments you outline are the sort of arguments that have been considered at CHAI and MIRI quite a bit (at least historically). The main issue I have with this sort of work is that it talks about how an agent should reason, whereas in my view the problem is that even if we knew how an agent should reason we wouldn't know how to build an agent that efficiently implements that reasoning (particularly in the neural network paradigm). So I personally work more on the latter problem: supposing we know how we want the agent to reason, how do we get it to actually reason in that way.

On your actual proposals, talking just about "how the agent should reason" (and not how we actually get it to reason that way):

1) Yeah I really like this idea -- it was the motivation for my work on inferring human preferences from the world state, which eventually turned into my dissertation. (The framing we used was that humans optimized the environment, but we also thought about the fact that humans were optimized to like the environment.) I still basically agree that this is a great way to learn about human preferences (particularly about what things humans prefer you not change), if somehow that ends up being the bottleneck.

2) I think you might be conflating a few different mechanisms here.

First, there's the optimizer's curse, where the action with max value will tend to be an overestimate of the actual value. As you note, one natural solution is to have a correction based on an estimate of how much the overestimate is. For this to make a difference, your estimates of overestimates have to be different across different actions; I don't have great ideas on how this should be done. (You mention having different standard deviations + different numbers of statistically-independent variables, but it's not clear where those come from.)

Second, there's information value, where the agent should ask about utilities in states that it is uncertain about, rather than charging in blindly. You seem to be thinking of this as something we have to program into the AI system, but it actually emerges naturally from reward uncertainty by itself. See this paper for more details and examples -- Appendix D also talks about the connection to impact regularization.
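As a rough illustration of how asking can fall out of plain expected-utility maximization under reward uncertainty, here is a toy sketch. The hypotheses, rewards, and the cost of asking are all made up for illustration, and it assumes that asking reveals the true preference perfectly; it is not the model from the linked paper.

```python
# Hypothetical reward functions for a coffee-fetching agent (illustrative only).
REWARDS = {
    "prefers_coffee": {"coffee": 1.0, "mocha": 0.2},
    "prefers_mocha": {"coffee": 0.2, "mocha": 1.0},
}
ACTIONS = ["coffee", "mocha"]


def value_of_acting_now(belief):
    """Best expected reward achievable without gathering more information."""
    return max(
        sum(p * REWARDS[h][a] for h, p in belief.items()) for a in ACTIONS
    )


def value_of_asking_first(belief, ask_cost=0.1):
    """Expected reward if the agent first learns the true preference
    (assumed to be revealed perfectly) and then acts optimally."""
    return sum(p * max(REWARDS[h].values()) for h, p in belief.items()) - ask_cost


def decide(belief, ask_cost=0.1):
    """Ask only when the information is worth more than it costs."""
    if value_of_asking_first(belief, ask_cost) > value_of_acting_now(belief):
        return "ask"
    return "act"


print(decide({"prefers_coffee": 0.6, "prefers_mocha": 0.4}))    # uncertain
print(decide({"prefers_coffee": 0.99, "prefers_mocha": 0.01}))  # confident
```

No separate "be curious" term is programmed in: the agent asks when it is uncertain and stops asking once its belief is sharp enough that the answer would rarely change its action.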

Third, there's risk aversion, where you explicitly program the AI system to be conservative (instead of maximizing expected utility). I tend to think that in principle this shouldn't be necessary and you can get the same benefits from other mechanisms, but maybe we'd want to do it anyway for safety margins. I don't think it's necessary for any of the other claims you're making, except perhaps quantilization (but I don't really see how any of these mechanisms lead to acting like a quantilizer except in a loose sense).

I agree, this is only a proposal for a solution to the outer alignment problem.

On the optimizer's curse, information value and risk aversion aspects you mention, I think I agree that a sufficiently rational agent should already be thinking like that: any GAI that is somehow still treating the universe like a black-box multi-armed bandit isn't going to live very long and should be fairly easy to defeat (hand it 1/epsilon opportunities to make a fatal mistake, all labeled with symbols it has never seen before).

Optimizing while not allowing for the optimizer's curse is also treating the universe like a multi-armed bandit, not even with probability epsilon of exploring: you're doing a cheap all-exploration strategy on your utility uncertainty estimates, which will cause you to sequentially pull the handles on all your overestimates until you discover the hard way that they're all just overestimates. This is not rational behavior for a powerful optimizer, at least in the presence of the possibility of a really bad outcome, so not doing it should be convergent, and we shouldn't build a near-human AI that is still making that mistake.

Edit: I expanded this comment into a post, at: https://www.lesswrong.com/posts/ZqTQtEvBQhiGy6y7p/breaking-the-optimizer-s-curse-and-consequences-for-1