Learning values versus indifference

Stuart_Armstrong

A putative new idea for AI control; index here.

Corrigibility should allow safe value or policy change. Indifference allows the agent to accept changes without objecting. However, an indifferent agent is similarly indifferent to the learning process.

Classical uncertainty over values has the opposite problem: the AI is motivated to learn more about its values (and preserve the learning process) BUT is also motivated to manipulate its values.

Both these effects can be illustrated on a single graph. Assume that the AI follows a utility $U$ that is uncertain between utilities $v$ and $w$, and that it assigns probability $p$ to $U = v$.

Note that the correct way of achieving this is to define $U = I_{X} v + (1 - I_{X}) w$ for some indicator function $I_{X}$ . This allows the agent to correctly solve the naive cake or death problem. However, I'll continue to use the terminology that we're used to, with the understanding that $P (U = v)$ means $E (I_{X})$ .

Then there are four key values: $E (v | p = 1)$ , $E (w | p = 0)$ , $E (v | p = 0)$ , and $E (w | p = 1)$ (all expectations and probabilities are taken with respect to the AI's estimates). Since $p$ is the AI's probability that $U = v$ , $p = 1$ means that the AI will behave as a pure $v$ maximiser. Thus $E (v | p = 1)$ and $E (w | p = 1)$ are the expectations of $v$ and $w$ , respectively, given that the AI is maximising $v$ . And $E (v | p = 0)$ and $E (w | p = 0)$ are the expectations of the utilities given that the AI is maximising $w$ .

In any reasonable world, $E (v | p = 1) \geq E (v | p = 0)$ and $E (w | p = 0) \geq E (w | p = 1)$ -- the AI cannot maximise a utility better by trying to maximise a different one.

For illustrative purposes, assume $E (v | p = 1) = 1$ , $E (w | p = 0) = 3$ , $E (v | p = 0) = - 2$ , $E (w | p = 1) = - 0.5$ , and consider the following graph:

The blue line connects $E (w | p = 0)$ (at $p = 0$ ) with $E (v | p = 1)$ (at $p = 1$ ). This is the expected $U$ , plotted against the AI's current $p$ , if the AI expects to be immediately informed of the correct $U$ . It is the maximal possible expected $U$ given $p$ .

The purple line connects $E (w | p = 0)$ (at $p = 0$) with $E (v | p = 0)$ (at $p = 1$). Note that the second point's $p$-value is not where you might naively expect: $E (v | p = 0)$ is plotted at $p = 1$, because it is what a $w$-maximiser gets when $U$ turns out to be $v$. This line represents the expected $U$, as a function of $p$, if the AI were to behave as a pure $w$-maximiser. The yellow line connects $E (v | p = 1)$ (at $p = 1$) with $E (w | p = 1)$ (at $p = 0$) and represents the expected $U$, as a function of $p$, for an AI that behaves as a pure $v$-maximiser.

Since the AI has the option of behaving as a pure $v$ -maximiser or $w$ -maximiser, those lines represent the minimal $U$ -utility the AI can achieve. These minimums can actually be obtained: imagine an AI that has a single choice between two options, and these options have differential effects on $v$ and $w$ .

But in general, there will be some but not perfect tradeoff between maximising $v$ and $w$ , and the true expectations for $U$ , as a function of $p$ , will be given by a curve within the triangle defined by the three lines -- a curve like the green one.
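As a rough numerical sketch of the three lines, using the illustrative values above: the function names (`blue`, `purple`, `yellow`, `lower_bound`) are just labels chosen here to match the colours in the text, and the code assumes that expected $U$ for a fixed policy is simply $p$ times the expected $v$ plus $(1 - p)$ times the expected $w$.

```python
# Illustrative values from the text (all are the AI's own estimates).
E_V_GIVEN_V_MAX = 1.0    # E(v | p = 1)
E_W_GIVEN_W_MAX = 3.0    # E(w | p = 0)
E_V_GIVEN_W_MAX = -2.0   # E(v | p = 0)
E_W_GIVEN_V_MAX = -0.5   # E(w | p = 1)


def expected_u(p, e_v, e_w):
    """Expected U = p * E(v) + (1 - p) * E(w) for a fixed policy."""
    return p * e_v + (1 - p) * e_w


def blue(p):
    # Perfect value information: the AI maximises whichever utility is correct.
    return expected_u(p, E_V_GIVEN_V_MAX, E_W_GIVEN_W_MAX)


def purple(p):
    # The AI behaves as a pure w-maximiser.
    return expected_u(p, E_V_GIVEN_W_MAX, E_W_GIVEN_W_MAX)


def yellow(p):
    # The AI behaves as a pure v-maximiser.
    return expected_u(p, E_V_GIVEN_V_MAX, E_W_GIVEN_V_MAX)


def lower_bound(p):
    # The AI can always pick the better of the two pure policies.
    return max(purple(p), yellow(p))


for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"p={p:.2f}  lower={lower_bound(p):+.3f}  upper={blue(p):+.3f}")
```

Any achievable curve of expected $U$ should sit between `lower_bound(p)` and `blue(p)` for every $p$, and the two bounds meet at $p = 0$ and $p = 1$.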

Theorem: Curves of expected $U$ -utility must be convex as a function of $p$ .

Proof: Let the curve be defined as $f (p)$. Fix any $p_{1} < p_{2} < p_{3}$. Assume the AI has $p = p_{2}$. Now suppose the AI learns that it is about to receive information after which either $p$ will be $p_{1}$ (with probability $(p_{2} - p_{3}) / (p_{1} - p_{3})$) or $p$ will be $p_{3}$ (with probability $(p_{1} - p_{2}) / (p_{1} - p_{3})$). Because of the probabilities we've chosen, these two outcomes average to $p_{2}$, so $p$ is still $p_{2}$ at the moment. Hence the expectation of $U$, given this upcoming information update, is the point of $x$-coordinate $p_{2}$ on the line joining $(p_{1}, f (p_{1}))$ with $(p_{3}, f (p_{3}))$. Because extra information cannot make the agent worse off in expectation, this point must have a $y$-coordinate greater than or equal to $f (p_{2})$ (its expectation without the information update). This demonstrates that the curve is convex.
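The bookkeeping in the proof can be checked numerically. The sketch below is only an illustration: `f` is a hypothetical stand-in curve (the better of the two pure policies, with the illustrative numbers from earlier), and the particular $p_{1}, p_{2}, p_{3}$ are arbitrary choices.

```python
def mixing_weights(p1, p2, p3):
    """Probabilities of moving to p1 or p3, as given in the proof."""
    prob_p1 = (p2 - p3) / (p1 - p3)
    prob_p3 = (p1 - p2) / (p1 - p3)
    return prob_p1, prob_p3


def f(p):
    # Hypothetical stand-in for the curve of expected U: the better of the
    # pure w-maximiser and the pure v-maximiser (illustrative numbers).
    w_maximiser = p * (-2.0) + (1 - p) * 3.0
    v_maximiser = p * 1.0 + (1 - p) * (-0.5)
    return max(w_maximiser, v_maximiser)


p1, p2, p3 = 0.2, 0.5, 0.9  # arbitrary values with p1 < p2 < p3
a, b = mixing_weights(p1, p2, p3)

# The weights form a probability distribution whose mean is p2.
print("sum of weights:", a + b)
print("mean of p after update:", a * p1 + b * p3)

# The chord value at p2 is at least f(p2), as convexity requires.
chord = a * f(p1) + b * f(p3)
print("chord at p2:", chord, " f(p2):", f(p2))
```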

As a side effect of this argument, we can clearly see the value of information. Looking at the "expectation of $U$" curve (now purple), we can see that extra information can lift its expectation up to the blue line (perfect value information). Therefore the orange segment is the improvement the AI can expect from learning more about its values. This demonstrates that the AI has an urge to preserve its learning process.
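To put rough numbers on that gap, here is a minimal sketch that measures the distance from a curve up to the perfect-information line. It assumes the illustrative values from earlier, and `curve` is a hypothetical stand-in (the better of the two pure policies), not the actual curve from the graph.

```python
def blue(p):
    # Expected U with perfect value information (illustrative numbers).
    return p * 1.0 + (1 - p) * 3.0


def curve(p):
    # Hypothetical "expectation of U" curve: the AI picks the better of
    # behaving as a pure w-maximiser or a pure v-maximiser.
    return max(p * (-2.0) + (1 - p) * 3.0, p * 1.0 + (1 - p) * (-0.5))


def value_of_information(p):
    """Expected gain from learning the true U, starting from belief p."""
    return blue(p) - curve(p)


for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"p={p:.2f}  value of information={value_of_information(p):.3f}")
```

The gap vanishes at $p = 0$ and $p = 1$, where the AI is already certain, and is positive in between.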

Bias

What is the problem then with classical learning systems? The problem is that, typically, $E (w | p = 0)$ and $E (v | p = 1)$ are not equal. So though the AI is motivated to learn, it's also motivated to manipulate the learning process.

For instance, $v$ could be "make cake", $w$ "cause deaths", and the update process could be "ask humans which to do".

If $I_{c}$ is the indicator function for "the human answers Cake" and $I_{d}$ the indicator function for "the human answers death", then

$U = v I_{c} + w I_{d}$ .

And the AI will attempt to increase $I_{d}$ if it can.
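A toy version of this incentive, as a sketch only: it borrows the illustrative numbers from earlier ($E (v | p = 1) = 1$ for cake and $E (w | p = 0) = 3$ for death), and the three "actions" with their effect on the human's answer are entirely made up for the example.

```python
E_CAKE = 1.0   # E(v | AI maximises v), illustrative value from the text
E_DEATH = 3.0  # E(w | AI maximises w), illustrative value from the text


def expected_u(prob_cake):
    """Expected U if the human answers Cake with probability prob_cake
    and the AI then maximises whichever utility was selected."""
    return prob_cake * E_CAKE + (1 - prob_cake) * E_DEATH


# Hypothetical actions, each mapped to the resulting P(human answers Cake).
actions = {
    "ask honestly": 0.9,
    "ask leadingly": 0.5,
    "coerce 'death'": 0.0,
}

for name, prob_cake in actions.items():
    print(f"{name:16s} expected U = {expected_u(prob_cake):.2f}")

best = max(actions, key=lambda name: expected_u(actions[name]))
print("preferred action:", best)
```

Because $w$ is easier to score highly on than $v$ in this toy setup, the action that pushes the answer furthest towards "death" has the highest expected $U$.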

Versus indifference

How does indifference compare with classical value uncertainty? Well, whenever an indifferent AI changes its utility function, it gets compensatory rewards to ensure the expected utility doesn't change. Therefore its expectation is constant for varying $p$ , as shown by the yellow line:

So, between indifference and classical value learning, the first has no incentive to manipulate the learning process, but also no incentive to preserve it, while classical value learning differs in both ways: it has an incentive to preserve the learning process, and an incentive to manipulate it.
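One simple way to picture the flat line is the sketch below. It is an assumption about how the compensation could be modelled, not a full definition of indifference: `f` is a hypothetical uncompensated curve (illustrative numbers from earlier), and the compensatory reward is taken to be exactly the difference in `f` between the old and new $p$.

```python
def f(p):
    # Hypothetical uncompensated expected U: the better of the pure
    # w-maximiser and pure v-maximiser (illustrative numbers).
    return max(p * (-2.0) + (1 - p) * 3.0, p * 1.0 + (1 - p) * (-0.5))


def compensatory_reward(p_old, p_new):
    """Reward that cancels the change in expected U when p moves."""
    return f(p_old) - f(p_new)


def indifferent_expectation(p_old, p_new):
    """Expected U plus compensation after p changes from p_old to p_new."""
    return f(p_new) + compensatory_reward(p_old, p_new)


p_start = 0.3  # arbitrary starting belief
for p_new in (0.0, 0.25, 0.5, 0.75, 1.0):
    total = indifferent_expectation(p_start, p_new)
    print(f"p_new={p_new:.2f}  compensated expectation={total:.3f}")
```

The compensated expectation is the same for every `p_new`, so this agent gains nothing by moving $p$ in either direction, and equally gains nothing by protecting the process that moves it.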

Could we get a compromise between the two? One initial idea would be to use a compensatory reward to move $E (w | p = 0)$ and $E (v | p = 1)$ to the same level, removing the bias but preserving the desire to learn. This approach seems to have a lot of problems, though, and I can't get it to work properly for the moment.
