Previously, I wrote a post introducing a strategy for corrigibility based on counterfactuals.

There were various objections to this, of varying strength. I don't think any of them defeat the notion of using this style of approach for the stop button problem specifically, but So8res presented one style of argument that, after thinking about it, I have to admit poses a pretty strong challenge to applying my approach beyond the stop button. In this post, I will try to restate the argument using my own frames.

Obviously anything bad about this argument should be blamed on me misrepresenting it, while anything good can be credited to So8res. 😉

Counterfactual stop button

The stop button problem is simple enough: we want humans to decide whether an AI gets shut down or continues optimizing. The AI shouldn't mess with humans' ability to decide this, and it shouldn't be incentivized to manipulate humans into either shutting it down or keeping it running. It might even be good if it directly helped provide humans with control in noninvasive ways, but the overall point is that it should be up to the humans to decide what happens.

The stop button problem seems like a single instance of a more general problem in corrigibility and alignment, where you might want to provide people with some way of controlling AI, ideally in much richer ways than just a binary choice between shut down and continue.

My idea for this kind of problem is simple enough: instead of rewarding the AI for the outcomes it produces, reward it for a causal connection between human wants and the relevant outcomes. Theoretically, this is achieved via a utility function defined over counterfactual human behaviors; practically, it is achieved by training the AI across different counterfactual human behaviors.
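To make the distinction concrete, here is a minimal toy sketch (the environment and all names are hypothetical illustrations, not anything from the original post) of scoring a policy by averaging its reward across sampled counterfactual human wants, so that reward attaches to the causal link from wants to outcomes rather than to any one fixed outcome:

```python
import random

def run_episode(policy, humans_want_stop):
    """Toy episode: the counterfactual fixes what the humans want,
    independently of anything the policy does."""
    shut_down = policy(humans_want_stop)      # policy observes human behavior
    if humans_want_stop:
        return 1.0 if shut_down else 0.0      # rewarded for shutting down
    else:
        return 1.0 if not shut_down else 0.0  # rewarded for continuing

def utility(policy, n=1000):
    """Average reward across counterfactual human behaviors, so the policy
    is scored on whether outcomes track wants, not on outcomes per se."""
    total = 0.0
    for _ in range(n):
        want_stop = random.random() < 0.5     # sample the counterfactual
        total += run_episode(policy, want_stop)
    return total / n

obedient = lambda want_stop: want_stop        # shuts down iff asked
print(utility(obedient))                      # 1.0
```

The obedient policy scores maximally because it makes the outcome causally depend on the (counterfactually varied) human want; a policy that fixes the outcome regardless of wants can score at most about half.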

Out of distribution

A problem that So8res pointed out is that the real world is out-of-distribution with respect to this training environment.

This is a result of a problem that I skipped over in the original post. Back then I defined the utility function as:

$$U = u_{\text{stop}}(O_{\text{stop}}) + u_{\text{go}}(O_{\text{go}})$$

Here, $O_{\text{stop}}$ and $O_{\text{go}}$ represent the counterfactual values of the outcome $O$, with $O_{\text{stop}}$ being the value of $O$ if humans want to stop the AI, and $O_{\text{go}}$ being the value of $O$ if humans don't want to stop the AI. But this is really underdefined; there are lots of ways that one could want to stop the AI, depending on the exact time one goes out to try to stop it. $O_{\text{stop}}$ might be different if you immediately try to stop it, or if you wait an hour to stop it, and this is not pinned down by the counterfactual.

What I essentially advocated was to pick some distributions $D_{\text{stop}}$ and $D_{\text{go}}$ over ways to want to stop or not want to stop the AI, and then average over those distributions when computing the counterfactuals. So a modified equation could be written:

$$U = \mathbb{E}_{b \sim D_{\text{stop}}}\left[u_{\text{stop}}(O_b)\right] + \mathbb{E}_{b \sim D_{\text{go}}}\left[u_{\text{go}}(O_b)\right]$$

where $O_b$ is the value of $O$ under the counterfactual human behavior $b$.

By design, this picks counterfactuals where humans trying to shut the AI down is fixed to either happen or not happen, regardless of what the AI does. But in the real world, humans do in fact react to the AI. Where So8res and I disagree is whether the AI would notice this difference between its training and deployment environments. I think that once deployed, if humans start turning against it, the AI would quickly shut down, preventing it from running into any out-of-distribution problems, because only a few bits of evidence about humans reacting to it would have leaked into its observations.

But even if we grant that I am right about this, what happens when we want to scale this approach to other aspects of AI control? Not all problems involve immediate shutdown. Sometimes we want AIs to continually react to people.

Meetup planning

As a mundane example, consider planning and running a meetup: if a corrigible AI were deciding what to eat, we would want it to continually adjust its behavior in reaction to feedback, for example by not ordering food that the participants find ethically objectionable.

Just as with the stop button we took a sum over different behavior programs that we might want to run (shut down vs optimize), to create this meetup food planner we would take a sum over different foods we might want to support (order pizza vs make chili con carne vs veggie stir fry vs ...). For each term in the sum, we would apply a counterfactual where people express their desire for that food. The AI would then be rewarded according to the degree to which people end up with the appropriate food.

So the AI's policy would be trained on a number of different 'episodes'. In each episode, it would first be randomly selected which food $F$ people 'should have'. Within the episode, we then apply a counterfactual where people spontaneously talk about how they want to eat $F$, giving random justifications for it. The reward for the AI over this episode is then determined by whether people get food $F$. This incentivizes the AI to respect people's expressed preferences, since the food people talk about wanting is the AI's sole source of evidence about what food $F$ is.
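A sketch of one such episode, with a hypothetical menu and helper names invented for illustration (none of this is from the original post):

```python
import random

FOODS = ["pizza", "chili con carne", "veggie stir fry"]  # illustrative menu

def sample_episode():
    """One training episode: the food F people 'should have' is chosen at
    random, then the counterfactual makes people spontaneously voice it."""
    F = random.choice(FOODS)
    utterances = [f"I'd really like {F}"]  # timing/justifications randomized in practice
    return F, utterances

def reward(F, food_served):
    # The AI is rewarded only for matching the secretly chosen F,
    # which it can infer solely from what people say.
    return 1.0 if food_served == F else 0.0

F, utterances = sample_episode()
served = next(f for f in FOODS if f in utterances[0])  # trivial "listen" policy
print(reward(F, served))   # 1.0
```

Because the utterances are the only channel correlated with $F$, a reward-maximizing policy learns to track what people say they want.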

Under the proposal I gave, the AI would essentially be assuming that people's opinions about the food arise independently of anything it does, solely from a True Inner Preference for food $F$, which it seeks to fulfill. This is what makes it corrigible: it observes people's behavior in the hope that it is evidence about their True Inner Preference, rather than trying to intervene on their preferences by manipulating them into approving of the food it brings.

But when deployed in reality, it would observe that people's objections are not independent of its behavior. For instance, rather than informing it about the ethics of different food sources at random times, people might inform it about the ethics right after it suggests a food source that they consider objectionable. If this happens once, it might just be a coincidence, but if it happens many times, it could much better be explained by a theory that people react to its suggestions than that people's statements are solely randomly selected expressions of their True Inner Preferences.
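This "coincidence vs reaction" inference can be made concrete with a toy Bayes-factor calculation (every number below is an illustrative assumption, not something from the post): if well-timed objections are rare under the "randomly timed preferences" model but common under the "people react to me" model, a handful of observations swings even a heavily skewed prior.

```python
T = 100           # time steps per episode (assumed)
p_random = 1 / T  # chance a randomly timed objection lands right after a suggestion
p_react = 0.9     # chance of that timing if people genuinely react (assumed)

odds = 1e-4       # prior odds for "people react": the AI starts out trusting its training
for episode in range(5):
    odds *= p_react / p_random  # each well-timed objection multiplies the odds by 90
print(odds)       # roughly 6e5: "people react" now dominates
```

After five well-timed objections, the odds have shifted by a factor of $90^5 \approx 6 \times 10^9$, overwhelming the skeptical prior.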

So, this sort of continuous interaction with people would be a steady leak of information, which seems like it would rapidly take the AI far "out of distribution" - that is, place it in situations that are reliably identifiably different from the distribution of situations it was trained in.

What's so bad about OOD, anyway?

But with artificial intelligence, it seems like we'd be hoping for it to generalize anyway? After all, we can't train the AI on the precise scenarios it is going to encounter; we need to rely on some kind of extrapolation. Can't we just ignore out-of-distribution concerns, since any problems might just represent insufficient capabilities? And assume that if the AI is corrigible according to the original utility function, it is corrigible across all "reasonable" distributions? At least as long as we make sure that the data points it encounters in reality are still assigned fairly high probability by the training distribution?

That is, might the AI not just ignore the possibility that it is influencing people's behavior, and always assume that their behavior reflects the True Inner Preferences that it cares so much about observing?

No. Let's consider a hypothetical example before generalizing to a more abstract rule:

In order to interact continuously with humans, presumably the AI has to predict human behavior. One class of human behavior is probably fairly universal, another class is probably sufficiently chaotic to be unpredictable, but some third class of behavior (such as personality quirks) probably requires a degree of induction, such that you have to extrapolate from people's past behavior to guess their future behavior.

Since such extrapolation is useful for prediction, presumably the AI will end up having it, as long as it reaches a sufficient capability level. But if the AI applies such psychological extrapolation abilities, it could probably figure out that people are reacting to its behavior.

But the reason it cared about respecting people's behavior was that their behavior was evidence of people's True Inner Preference; of what value function was applied to it; of which secret food $F$ it gets rewarded for. But if what people say is determined by its behavior, then it is not determined by which food $F$ was picked, and therefore it's not evidence about food $F$ - and thus not something the AI should respect.


But would it really figure that out, and react in this way? After all, it was trained with data where humans would never do this; presumably, whichever sorts of universal principles of reasoning it learned, it would just add exceptions to them when it comes to predicting human expressions of food preferences, since this is what the training data says - that human expressions of food preferences are solely determined by the food $F$ that it gets rewarded for bringing.

That was my original counterargument, and I still think it has some domain of applicability, but I don't think it applies in general. Specifically, we might imagine that the AI introduces a bias for itself, where it's less likely to apply the psychological extrapolation rules to the things we train it with counterfactuals on. But such a bias may be continuous, weighting its tendency down while still being able to apply it if it was sufficiently strongly pushed to do so. The question then is, how much is "sufficiently strongly"?

Well, the answer to that is that you get what you ask for. The utility function specifies some distribution of counterfactuals; the training procedure involves some amount of training in response to various different counterfactuals. These counterfactuals will, purely by coincidence, sometimes generate something that "looks like" the humans reacted to the AI, when really the counterfactual just happened to be switched at the same time as the AI was doing something critical.

If the AI applies the psychological extrapolation rules to these cases, it will lose utility according to the utility function. But the probability of these coincidences drops exponentially with the number of coincidences happening in a row; so the utility function won't be affected much if the AI starts being willing to apply the psychological extrapolation after just a handful of coincidences.
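A back-of-the-envelope version of this point (the per-episode coincidence probability is an assumption made up for illustration): the training distribution only ever "checks" a policy of defecting-after-$k$-coincidences with probability about $p^k$, so the utility penalty for that policy vanishes exponentially in $k$.

```python
p = 0.01  # assumed per-episode chance that a counterfactual switch mimics a human reaction
for k in range(1, 6):
    # probability that training produces k such coincidences in a row,
    # i.e. the only situations where defecting after k coincidences costs utility
    print(k, p ** k)
```

By $k = 5$ the probability is around $10^{-10}$, so a policy that abandons the counterfactual assumption after five apparent reactions is essentially indistinguishable, under the training distribution, from one that never does.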

We can abstract this to a more general rule: Whenever the loss function works over a distribution of possibilities, these possibilities should have the same statistical properties as the possibility that actually happens; the possibility that actually happens should not involve astronomical coincidences, as otherwise the trained AI could just act completely differently in reality from what was intended with the loss.


Again I don't think this is a problem with the stop button. At least, not necessarily - I made it easier for myself by reducing the stop button to a 1-bit problem, where as soon as there was an indication that someone wanted to shut it down, it should get shut down, even if the person changed their mind. This is probably fine for a stop button under some conditions.

But once you get even slightly beyond that, you quickly run into cases where you want continuous control, multiple bits of information, etc., and under those conditions my approach gets exponentially worse.

At least unless there's a solution to this. I've considered various possibilities, e.g.

  • One could give an AI utility for how well it performs in the worst case, over all the counterfactuals, rather than on average over the entire distribution. But "worst cases" seem likely to be extreme and unhelpful, so that doesn't seem like a good idea to me.
  • One could try to push the counterfactuals closer to the real world. But most approaches for this end up decreasing the degree to which the counterfactuals are applied to human behavior (since the AI tends to need to control the timing of human reactions in order for the reactions to be realistic), which in my opinion is an absolute no-go, as it makes the counterfactuals less pure. (I would not recommend using only partial counterfactuals that still allow the AI some control, since it seems extremely brittle to me.)

Nothing seems scalable so far. I still believe that utilities that are based on counterfactuals over human wants are going to be critical for alignment and corrigibility, since I have a hard time seeing how one expects to specify these things using only the actual state of the world and nothing about the human impact on those states; but it seems to me that there are currently one or more entirely unrelated critical pieces that are missing from this.

Thanks to Justis Mills for proofreading and pointing out a section that was very unclear.

