Learning values versus indifference

IAFF-User-111

Jessica Taylor

Stuart Armstrong

RE: my last question:
After talking to Stuart, I think one way of viewing the problem with such a proposal is:
The agent cares about its *future* expected utility (which depends on the state/history, not just the MDP).

Why doesn't normalizing rewards work?

(i.e. set max_pi(expected returns)=1 and min_pi(expected returns)=0, for all environments)... I assume this is what you're talking about at the end?

So though the AI is motivated to learn, it’s also motivated to manipulate the learning process.

It seems like the problem here is that the *prior* probability that the human says "cake" depends on the AI's policy. The update when seeing the human actually say "cake" isn't a problem, due to conservation of expected evidence.

Under my (very incomplete) model of Everitt's approach, the programmer will specify the prior over values (so the prior is independent of the AI's policy), then disallow actions that would prevent the reward signal from being an unbiased estimate of the values.

Correct me if I'm wrong, but doesn't this proposal compromise between the two in a satisfying way?

A putative new idea for AI control; index here.

Corrigibility should allow safe value or policy change. Indifference allows the agent to accept changes without objecting. However, an indifferent agent is similarly indifferent to the learning process.

Classical uncertainty over values has the opposite problem: the AI is motivated to learn more about its values (and to preserve the learning process), but it is also motivated to manipulate its values.

Both these effects can be illustrated on a single graph. Assume that the AI follows utility U, is uncertain between utilities v and w, and assigns probability p to U=v.

Note that the correct way of achieving this is to define U=IXv+(1−IX)w for some indicator function IX. This allows the agent to correctly solve the naive cake or death problem. However, I'll continue to use the terminology that we're used to, with the understanding that P(U=v) means E(IX).

Then there are four key values: E(v|p=1), E(w|p=0), E(v|p=0), and E(w|p=1) (all expectations and probabilities are taken with respect to the AI's estimates). Since p is the AI's probability that U=v, p=1 means that the AI will behave as a pure v maximiser. Thus E(v|p=1) and E(w|p=1) are the expectations of v and w, respectively, given that the AI is maximising v. And E(v|p=0) and E(w|p=0) are the expectations of the utilities given that the AI is maximising w.

In any reasonable world, E(v|p=1)≥E(v|p=0) and E(w|p=0)≥E(w|p=1) -- the AI cannot maximise a utility better by trying to maximise a different one.

For illustrative purposes, assume E(v|p=1)=1, E(w|p=0)=3, E(v|p=0)=−2, E(w|p=1)=−0.5, and consider the following graph:

The blue line connects E(w|p=0) (at p=0) with E(v|p=1) (at p=1). This is the expected U, plotted against the AI's current p, if the AI expects to be immediately informed of the correct U. It is the maximal possible expected U given p.

The purple line connects E(w|p=0) (at p=0) with E(v|p=0) (at p=1). Note that the second point's p-value is not where you might naively expect. This line represents the expected U, as a function of p, if the AI were to behave as a pure w-maximiser. The yellow line connects E(v|p=1) (at p=1) with E(w|p=1) (at p=0), and represents the expected U, as a function of p, for an AI that behaves as a pure v-maximiser.
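With the illustrative numbers above, the three boundary lines are simple linear functions of p. A minimal sketch (the expectations are the example values from the text; the function names just follow the graph's colours):

```python
# Illustrative expectations from the text.
E_v_p1 = 1.0    # E(v|p=1): v-value when the AI maximises v
E_w_p0 = 3.0    # E(w|p=0): w-value when the AI maximises w
E_v_p0 = -2.0   # E(v|p=0): v-value when the AI maximises w
E_w_p1 = -0.5   # E(w|p=1): w-value when the AI maximises v

def blue(p):
    """Expected U if the AI is immediately told the true U:
    with probability p it learns U=v and maximises v,
    with probability 1-p it learns U=w and maximises w."""
    return p * E_v_p1 + (1 - p) * E_w_p0

def purple(p):
    """Expected U, as a function of p, for a pure w-maximiser."""
    return p * E_v_p0 + (1 - p) * E_w_p0

def yellow(p):
    """Expected U, as a function of p, for a pure v-maximiser."""
    return p * E_v_p1 + (1 - p) * E_w_p1

for p in (0.0, 0.5, 1.0):
    print(p, blue(p), purple(p), yellow(p))
```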

Since the AI has the option of behaving as a pure v-maximiser or a pure w-maximiser, those lines represent the minimal U-utility the AI can achieve. These minima can actually be attained: imagine an AI that has a single choice between two options, where the options have differential effects on v and w.

But in general, there will be some, but not perfect, tradeoff between maximising v and w, and the true expectation of U, as a function of p, will be given by a curve within the triangle defined by the three lines -- a curve like the green one.

Theorem: Curves of expected U-utility must be convex as a function of p.

Proof: Let the curve be f(p), and fix any p1<p2<p3. Assume the AI has p=p2. Now update its information so that, after the update, p will be either p1 (with probability (p2−p3)/(p1−p3)) or p3 (with probability (p1−p2)/(p1−p3)). Because of the probabilities we've chosen, p is still p2 at the moment. Hence the expectation of U, given this information update, is the point with x-coordinate p2 on the line joining (p1,f(p1)) to (p3,f(p3)). Since extra information cannot make the agent worse off in expectation, this point must have a y-coordinate greater than or equal to f(p2) (its expectation without the information update). This demonstrates the curve is convex.

As a side effect of this argument, we can clearly see the value of information. Looking at the "expectation of U" curve (now purple), we can see that extra information can lift its expectation up to the blue line (perfect value information). Therefore the orange segment represents the improvement the AI can expect from learning more about its values. This demonstrates that the AI has an urge to preserve its learning process.
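The convexity can be checked numerically: with any finite set of policies, each giving the AI a pair (E(v|pi), E(w|pi)), the optimal expected U is the upper envelope of linear functions of p, hence convex. A sketch with made-up policy pairs:

```python
# Each policy pi yields the pair (E(v|pi), E(w|pi)); the numbers here are
# made up for illustration, including the pure v- and w-maximising policies.
policies = [(1.0, -0.5), (-2.0, 3.0), (0.2, 1.5)]

def f(p):
    """Optimal expected U at credence p: upper envelope of linear functions,
    hence convex in p."""
    return max(p * ev + (1 - p) * ew for ev, ew in policies)

# Numerical convexity check: the chord midpoint lies on or above the curve
# for every triple of equally spaced points.
ps = [i / 100 for i in range(101)]
for p1, p2, p3 in zip(ps, ps[1:], ps[2:]):
    chord_mid = (f(p1) + f(p3)) / 2   # p2 is midway between p1 and p3
    assert f(p2) <= chord_mid + 1e-12
```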

## Bias

What is the problem then with classical learning systems? The problem is that, typically, E(w|p=0) and E(v|p=1) are not equal. So though the AI is motivated to learn, it's also motivated to manipulate the learning process.

For instance, v could be "make cake", w "cause deaths", and the update process could be "ask humans which to do".

If Ic is the indicator function for "the human answers cake" and Id the indicator function for "the human answers death", then the AI's expected utility (assuming it learns the answer and then maximises the corresponding utility) is

E(U) = E(Ic)E(v|p=1) + E(Id)E(w|p=0) = E(Ic) + 3E(Id).

And the AI will attempt to increase Id if it can.
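Numerically, with the example values E(v|p=1)=1 and E(w|p=0)=3, any action that shifts the human's expected answer toward "death" raises the AI's current expected utility. A hypothetical sketch (the probabilities are invented for illustration):

```python
E_v_p1, E_w_p0 = 1.0, 3.0   # example values from the text

def expected_U(prob_cake):
    """Expected U before asking, if the AI will then maximise whichever
    utility the human names: E(Ic)E(v|p=1) + E(Id)E(w|p=0)."""
    return prob_cake * E_v_p1 + (1 - prob_cake) * E_w_p0

honest = expected_U(0.9)       # human almost surely says "cake"
manipulated = expected_U(0.1)  # AI nudges the human toward "death"
assert manipulated > honest    # manipulation pays: the AI wants more Id
```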

## Versus indifference

How does indifference compare with classical value uncertainty? Well, whenever an indifferent AI changes its utility function, it gets compensatory rewards to ensure the expected utility doesn't change. Therefore its expectation is constant for varying p, as shown by the yellow line:
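A toy version of the compensatory reward (my own bookkeeping, a sketch rather than the exact construction): whenever the AI's credence moves from p_old to p_new, it is paid the difference in expected utility, so its total expectation is flat in p:

```python
E_v_p1, E_w_p0 = 1.0, 3.0   # example values from the text

def best_expected_U(p):
    """Expected U for an AI that will learn the true utility and maximise it."""
    return p * E_v_p1 + (1 - p) * E_w_p0

def indifferent_value(p_old, p_new):
    """Expected utility plus compensatory reward after an update."""
    compensation = best_expected_U(p_old) - best_expected_U(p_new)
    return best_expected_U(p_new) + compensation

# The total is constant whatever the update: no incentive to steer p,
# but also no incentive to protect the learning process.
for p_new in (0.0, 0.3, 1.0):
    assert abs(indifferent_value(0.5, p_new) - best_expected_U(0.5)) < 1e-9
```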

So, between indifference and classical value learning, the first has no incentive to manipulate the learning process, but also no incentive to preserve it; classical value learning has both incentives.

Could we get a compromise between the two? One initial idea would be to use a compensatory reward to move E(w|p=0) and E(v|p=1) to the same level, removing the bias but preserving the desire to learn. This approach seems to have a lot of problems, though, and I can't get it to work properly for the moment.
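To picture the intended geometry of that initial idea (my own toy rendering, not a working proposal -- the author notes the approach has problems): paying a constant compensatory reward on the v branch equalises the two endpoints, removing the bias, while information still has positive value at interior p:

```python
# Example values from the text.
E_v_p1, E_w_p0 = 1.0, 3.0
E_v_p0, E_w_p1 = -2.0, -0.5

c = E_w_p0 - E_v_p1   # constant added when U resolves to v: endpoints now equal

def informed(p):
    """Expected U with perfect value information, after the adjustment."""
    return p * (E_v_p1 + c) + (1 - p) * E_w_p0   # flat at 3

def uninformed(p):
    """Best the AI can do by committing to one pure policy."""
    v_line = p * (E_v_p1 + c) + (1 - p) * E_w_p1  # pure v-maximiser
    w_line = p * (E_v_p0 + c) + (1 - p) * E_w_p0  # pure w-maximiser
    return max(v_line, w_line)

# No endpoint bias, but learning is still worth something in between.
assert informed(0.0) == informed(1.0) == 3.0
assert informed(0.5) > uninformed(0.5)
```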