Summary: We would like to build corrigible AIs, which do not prevent us from shutting them down or changing their utility function. While there are some corrigibility solutions (such as utility indifference) that appear to partially work, they do not capture the philosophical intuition behind corrigibility: we want an agent that not only allows us to shut it down, but also desires for us to be able to shut it down if we want to. In this post, we look at a few models of utility function uncertainty and find that they do not solve the corrigibility problem.


Eliezer describes the hard problem of corrigibility on Arbital:

On a human, intuitive level, it seems like there's a central idea behind corrigibility that seems simple to us: understand that you're flawed, that your meta-processes might also be flawed, and that there's another cognitive system over there (the programmer) that's less flawed, so you should let that cognitive system correct you even if that doesn't seem like the first-order right thing to do. You shouldn't disassemble that other cognitive system to update your model in a Bayesian fashion on all possible information that other cognitive system contains; you shouldn't model how that other cognitive system might optimally correct you and then carry out the correction yourself; you should just let that other cognitive system modify you, without attempting to manipulate how it modifies you to be a better form of 'correction'.

Formalizing the hard problem of corrigibility seems like it might be a problem that is hard (hence the name). Preliminary research might talk about some obvious ways that we could model A as believing that B has some form of information that A's preference framework designates as important, and showing what these algorithms actually do and how they fail to solve the hard problem of corrigibility.

The objective of this post is to carry out some of the preliminary research described in the second paragraph.


We will assume that the AI exists in the same world as the human. We will examine various models the AI could use for the human and the true utility function. None of these models will truly yield a corrigible agent.

1. The human is a logically omniscient Bayesian utility maximizer who knows their utility function

a) The human is a black box

i) The human is aware of the AI

If the AI models the human as a black-box Bayesian utility maximizer who knows about the AI, then it can assume that the human will communicate their utility function efficiently. This leads to a signalling equilibrium in which the human communicates the correct utility function using an optimal code, e.g. by writing it out as a computer program.
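
A minimal sketch of what this equilibrium amounts to; all names and the toy utility function are invented for illustration, not anything from the post:

```python
# Toy sketch of the 1ai signalling equilibrium (illustrative names only).
# In this idealized model the "value learning" step is trivial: the human
# hands the AI a program computing their utility function, and the AI
# simply optimizes it.

def human_writes_utility():
    """Stands in for the (unrealistic) step where the human writes out U."""
    return lambda outcome: -abs(outcome - 42)  # toy stand-in for the true U

def ai_choose(options):
    u = human_writes_utility()  # the one-shot communication step
    # After this step the AI needs no further input from the human.
    return max(options, key=u)

print(ai_choose(range(100)))  # -> 42
```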

Of course, in real life this will not work, because the human will be unable to write down their utility function.

ii) The human is not aware of the AI

If the human is not aware of the AI, then the AI must learn the human's values by observing the human interacting with the world, rather than through signalling. Since this model assumes the human is perfectly rational, it is very close to the value learning models used in economics. However, these models are inappropriate for corrigibility, because corrigibility requires the human to interact with the AI (e.g. by shutting it down). Additionally, the AI will want to manipulate the human into being an efficient morality sensor; for example, it may set up trolley problems for the human to encounter. This will not yield the right answer unless the value learning model is correct (which it isn't, because humans are not logically omniscient Bayesian utility maximizers).
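
As one toy version of the kind of value learning this describes, here is a sketch of a Bayesian update over candidate utility functions under a Boltzmann-rational choice model. The choice model and every name in it are assumptions made for illustration, not anything specified in the post:

```python
import numpy as np

# Hypothetical setup: a finite set of candidate utility functions and a
# Boltzmann-rational choice model for the observed human. None of these
# names come from the post; they are illustrative only.

def boltzmann_likelihood(action, actions, utility, beta=5.0):
    """P(the human picks `action`) if they noisily maximize `utility`."""
    scores = np.array([utility(a) for a in actions])
    probs = np.exp(beta * (scores - scores.max()))  # subtract max for stability
    probs /= probs.sum()
    return probs[actions.index(action)]

def update_posterior(prior, candidate_utilities, observation):
    """One Bayesian update on a single observed choice."""
    action, actions = observation
    posterior = np.array([
        p * boltzmann_likelihood(action, actions, u)
        for p, u in zip(prior, candidate_utilities)
    ])
    return posterior / posterior.sum()

# Two toy hypotheses about what the human values.
candidates = [lambda a: a, lambda a: -a]
prior = np.array([0.5, 0.5])
# The human, offered actions 0..3, chose 3: evidence for the first hypothesis.
posterior = update_posterior(prior, candidates, (3, [0, 1, 2, 3]))
print(posterior)  # mass shifts heavily onto the first hypothesis
```

If this model of the human is wrong (and it is), the updates converge on the wrong values, which is the failure described above.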

b) The human is not a black box

Here, the AI can possibly gain information about the human's utility function faster than in the signalling equilibrium, by taking apart the human's brain (literally or metaphorically). This will give a sufficiently powerful AI enough information to predict the human's actions in many different possible situations. Therefore, the AI will not need to make further observations of the human. We expect this to be bad, because it requires the AI's value learning algorithm to be correct from the start. Certainly, this does not count as corrigible behavior!

2. The human is a logically omniscient Bayesian utility maximizer who is uncertain about their utility function

Bayesian uncertainty about the utility function, with no way of learning more about it, does not change much. If there is no way to learn about one's utility function, then (depending on one's approach to normative uncertainty) the optimal policy is to optimize a weighted average of the candidate utility functions, or something similar. So this situation reduces to the case with a known utility function.
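
A minimal sketch of why this reduces to the known-utility case, assuming the weighted-average approach to normative uncertainty (illustrative names only):

```python
# With a fixed belief over candidate utility functions and no way to learn
# more, maximizing expected utility is equivalent to maximizing one fixed
# "mixture" utility function. Illustrative names only.

def mixture_utility(weights, candidate_utilities):
    """Collapse a belief over utility functions into one effective utility."""
    def u(outcome):
        return sum(w * cu(outcome) for w, cu in zip(weights, candidate_utilities))
    return u

# An agent maximizing `u_eff` behaves exactly like an agent that was certain
# of `u_eff` from the start; the uncertainty by itself changes nothing.
u_eff = mixture_utility([0.7, 0.3], [lambda o: o, lambda o: -o ** 2])
print(max(range(10), key=u_eff))
```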

3. The human is a logically omniscient Bayesian utility maximizer who is uncertain about their utility function and observes it over time

a) The human is a black box

i) The human is aware of the AI

As in 1ai, we get a signalling equilibrium. In this idealized model, instead of communicating the full utility function at the start, the human communicates observations of the true utility function (i.e. moral intuitions) over time. So the AI will keep the human alive and use them as a morality sensor. Similar to 1ai, this fails in real life because it requires the human to precisely specify their moral intuitions.
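
One toy way to picture the morality-sensor dynamic, under an assumed conjugate-Gaussian observation model that is not from the post:

```python
# Hypothetical model (not from the post): the "true utility" is a latent
# parameter theta, and each moral intuition the human reports is a noisy
# observation of theta. Standard conjugate Gaussian updating.

def update(mean, var, report, report_var=1.0):
    """Posterior over theta after one reported moral intuition."""
    precision = 1.0 / var + 1.0 / report_var
    new_var = 1.0 / precision
    new_mean = new_var * (mean / var + report / report_var)
    return new_mean, new_var

mean, var = 0.0, 10.0           # broad prior over the latent utility parameter
for report in [1.2, 0.8, 1.1]:  # intuitions the human communicates over time
    mean, var = update(mean, var, report)
print(mean, var)  # the posterior keeps sharpening only while the human reports
```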

ii) The human is not aware of the AI

As in 1aii, the AI will learn about the human's values from the human's interaction with the world. This differs from 1aii in that the AI cannot assume the human makes consistent decisions over time (because the human learns more about the utility function as they go). However, in practice it is similar: the AI will manipulate the human into being an optimal morality sensor, though it will do so differently, to account for the fact that the human receives moral updates over time.

b) The human is not a black box

As in 1b, the AI might more efficiently gain information about the utility function by taking apart the human's brain. Then, it can predict the human's actions in possible future situations. This includes predictions about what moral intuitions the human would communicate. Similar to 1b, this fails in real life because it requires the AI's value learning algorithm to be correct.

4. One of the above, but the human also has uncertainty about mathematical statements

In this case the human solves the problem as before, except that they delegate questions of mathematical uncertainty to the AI. For example, the human might write out their true utility function as a mathematical expression that contains difficult-to-compute numbers. This requires the AI to implement a solution to logical uncertainty, but even if we already had such a solution, this would still place unrealistic demands on the human (namely, that they reduce the value alignment problem to a purely mathematical one).
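
For illustration only, here is one toy instance of such a handoff, with a Monte Carlo estimate standing in for a difficult-to-compute number; a real solution would need a principled treatment of logical uncertainty:

```python
import random

# Toy instance of a utility function handed over as "a mathematical expression
# that contains difficult-to-compute numbers": the human specifies the exact
# form U(x) = c * x, where c (here, pi) is defined by a computation they
# cannot do themselves and whose resolution is delegated to the AI.

def estimate_hard_constant(n_samples=100_000, seed=0):
    """Monte Carlo stand-in for resolving the AI's logical uncertainty."""
    rng = random.Random(seed)
    hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1 for _ in range(n_samples))
    return 4 * hits / n_samples  # converges to pi as n_samples grows

C = estimate_hard_constant()  # the AI's current best estimate of the constant

def human_specified_utility(x):
    # The form of U is exact; only the constant carried logical uncertainty.
    return C * x

print(human_specified_utility(2.0))  # roughly 2 * pi
```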


This short overview of corrigibility models shows that simple uncertainty about the correct utility function is not sufficient for corrigibility. It is not clear what the correct solution to the hard problem of corrigibility is. Perhaps it will involve some model like those in this post in which humans are bounded in a specific way that causes them to desire corrigible AIs, or perhaps it will look completely different from these models.

Comments

If our goal was only to get corrigible behavior, we could build agents which learn our (instrumental) preferences over behaviors and then respect those preferences. It doesn't seem hard to learn that humans would prefer "be receptive to corrections" to "disassemble the human in order to figure out how they would have corrected you."

It seems like the main puzzle is reconciling our intuitions about corrigibility with our desire to build systems which are only motivated by their users' terminal preferences (rather than respecting particular instrumental preferences).

I share the intuitive sense that this ought to be possible, though after thinking about it more I'm very uncertain. If we can figure out how to make that work, I agree it would be very useful.

That said, I think it's worth keeping in mind:

(1) it may not be possible to reconcile corrigibility with sharing-only-terminal-values, and (2) it may not be necessary.

If I wanted to communicate the problem, I would focus on making the existing problem statement attractive/understandable to mainstream researchers rather than producing a precise formal statement, because (a) I don't think that a formal statement would be compelling to people without significant additional expository work, (b) I think that producing a good formal statement probably involves mostly solving the problem, and (c) I think that the basic problem you are outlining here is already clear enough that (if properly presented) it should be understandable to mainstream researchers.

(The last paragraph applies very specifically to this problem; it is not intended to generalize to other problems, where precision may be a key part of getting other people to care about the problem.)

That said, trying to produce a formal statement may be the right way to attack the problem, and attacking the problem may be higher-priority than communicating about it (depending on how much you've already tried / how promising it seems. I'm definitely not at the stage where I would want to communicate rather than work on it.) In that case, ignore the last 3 paragraphs.

Yes, I think that learning the user's instrumental preferences is a good way to get corrigible behavior. I'm hoping to explore the idea of learning an ontology in which instrumental preferences can be represented. There seems to be a spectrum between learning a user's terminal preferences and learning their actions, with learning instrumental preferences falling in between these.

I'm planning on writing up some posts about models for goal-directed value learning. I like your suggestion of presenting the problem so it's understandable to mainstream researchers; I'll think about what to do about this after writing up the posts.

We should add the possibility where the AI learns from the human's actions, but is indifferent to what it will learn. This won't keep the human safe, but will avoid the AI optimising the human into an optimal morality sensor.

Interesting suggestion. Would this work by computing value of information and then adding a negative term for that in the utility function, so that actual value of information is zero?
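
A toy sketch of this proposal, with a two-hypothesis setup invented purely for illustration:

```python
import numpy as np

# A toy version of the proposal in the question above: compute the value of
# information (VOI) of observing the human, and charge exactly that amount,
# so the net VOI is zero. The two-hypothesis setup is invented for illustration.

prior = np.array([0.5, 0.5])     # belief over two utility hypotheses
payoff = np.array([[1.0, -1.0],  # payoff[action][hypothesis]
                   [-1.0, 1.0]])

def best_eu(belief):
    """Expected utility of the best action under a given belief."""
    return max(payoff @ belief)

def value_of_information(posteriors, obs_probs):
    """Standard VOI: expected post-observation EU minus prior EU."""
    post = sum(p * best_eu(b) for p, b in zip(obs_probs, posteriors))
    return post - best_eu(prior)

# A perfectly informative observation resolves the belief one way or the other.
voi = value_of_information([np.array([1.0, 0.0]), np.array([0.0, 1.0])], [0.5, 0.5])
print(voi)  # 1.0: observing is worth 1 util, so the penalty charged is 1 util
```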

You could do that, but my initial inspiration was just corrigibility - the AI is indifferent to the update in values. Maybe there could be a subroutine (or another agent) that gathers the information for the update, while the AI just doesn't care about that.

See "pre-corriged agents" for setups where the AI isn't indifferent to the process, but is indifferent to the direction of the update.