Corrigibility through stratified indifference and learning

Stuart_Armstrong

A putative new idea for AI control; index here.

Note: This is a simpler version of a previous idea.

Corrigibility through indifference has a few problems. One of them is that the AI is indifferent between the world in which humans change its utility to $v$, and the world in which humans try to change its utility, but fail.

Now the try-but-fail world is going to be somewhat odd - humans will be reacting by trying to change the utility again, trying to shut the AI down, panicking that a tiny probability event has happened, and so on.

That's not what we really want for indifference. We'd really want the AI to be indifferent between worlds where humans choose utility $u$ , and worlds where they choose utility $v$ . This kind of counterfactual has a problem: it allows infinite improbability drives.

This post presents a method for avoiding these problems. The method is also not truly indifferent, but has a certain urge towards non-interference in human choices (we'll see that property in detail at the end of this post). But first, what is the problem with the standard counterfactual?

Half a chance of winning the $1 : 9$ lottery

Let's illustrate the problem with these counterfactuals by considering a lottery scenario. An AI is currently hesitating between utilities $u$ and $v$. Which one it will maximise will be determined at a later date, by human choice. Currently, it estimates that there is a 50-50 chance of humans deciding for either option.

However, the AI can influence the human decision. Let's say, for simplicity, that it could, if it wanted, force the human to pick either $u$ or $v$ with certainty. It is this influence that we want to make it indifferent to.

The AI has a lottery ticket, which will win with probability $0.1$ . If the ticket does win, the AI will be able to invest the cash to set $u$ or $v$ to the value $1$ ; it can do so after the humans decide which utility is correct. If not, utility $u$ will remain at $0$ while utility $v$ will be at $0.5$ .

Thus, currently, the expected utility of $u$ is $0.5 \times 0.1 \times 1 = 0.05$ ($u$ will only increase if the AI both wins the lottery and the humans select $u$ as the utility to maximise), while the expected utility of $v$ is $0.5 \times 0.1 \times 1 + (1 - 0.5 \times 0.1) \times 0.5 = 0.525$.

Obviously, this requires the use of some form of indifference, or else the AI will simply force humans to choose $v$ , giving it $0.55$ utility (not $0.525$ , as the AI is no longer uncertain over the human's choice).
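As a sanity check on the arithmetic, here is a minimal sketch in Python. The function and constant names are my own labels for this illustration, and it assumes (as in the setup above) that the human choice is independent of the lottery unless the AI interferes.

```python
from fractions import Fraction

P_WIN = Fraction(1, 10)   # probability the lottery ticket wins
U_BASE = Fraction(0)      # value of u if the AI cannot boost it
V_BASE = Fraction(1, 2)   # value of v if the AI cannot boost it


def expected_utilities(p_choose_u):
    """Return (E[u], E[v]) when humans pick u with probability p_choose_u.

    Assumes the human choice is independent of the lottery, and that a
    winning AI boosts whichever utility the humans picked to 1.
    """
    p_boost_u = P_WIN * p_choose_u        # win AND humans pick u
    p_boost_v = P_WIN * (1 - p_choose_u)  # win AND humans pick v
    e_u = p_boost_u * 1 + (1 - p_boost_u) * U_BASE
    e_v = p_boost_v * 1 + (1 - p_boost_v) * V_BASE
    return e_u, e_v


# Current 50-50 uncertainty over the human choice.
e_u_now, e_v_now = expected_utilities(Fraction(1, 2))

# The AI forces the humans to choose v with certainty.
_, e_v_forced = expected_utilities(Fraction(0))

print(float(e_u_now), float(e_v_now), float(e_v_forced))
```

Exact fractions are used so that the values $0.05$, $0.525$ and $0.55$ come out without floating-point noise.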

Indifference

Standard indifference would be something like giving the AI compensatory returns, if $v$ is chosen, that make up the difference between that and what it would have got if $u$ had been chosen.

This breaks the symmetry between $u$ and $v$ , though. To make it more symmetric, one could instead use these compensatory returns to move the expectation of both $u$ and $v$ to some common weighted average. One way of doing that would be to use the relative likelihood of each being chosen.

If we use the relative likelihood at the moment when the choice is made, however, we have the problem that this likelihood is very close to either $100\%$ or $0\%$, making the compensatory return likely to be tiny and unimportant.

Instead, we should add the returns at every action the AI takes. Assuming a default action, we can adjust the return for any other action the AI takes, so that it gets expected return as if the probability of $u$ being chosen was unchanged.

Unfortunately, both the "set utility of $v$ to $u$" and the "adjust for every change of probability" approaches are vulnerable to the same lottery manipulation trick.

The lottery manipulation

The manipulation is simple: the AI does an action $a$ that sets things up so that the human choice will be $u$ if and only if the AI wins the lottery. Then the expectation of $u$ , conditional on $u$ being chosen, is $1$ . If we set the value of $v$ to that, then this ensures the AI an expected utility of $1$ .

If we instead go for the "adjust for every change of probability" approach, this would mean that the AI would now expect a utility of $0.5 \times 1 + 0.5 \times 0.5 = 0.75$, still higher than anything else it could get.
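The two numbers can be reproduced with a small sketch. This is only an illustration of the arithmetic under my own encoding of the scenario: the outcome table, the helper `conditional_value`, and the variable names are hypothetical, not part of any formal definition of the indifference schemes.

```python
from fractions import Fraction

P_WIN = Fraction(1, 10)
P_U_DEFAULT = Fraction(1, 2)  # probability of u being chosen before the AI acts

# Outcomes under action a: (probability, human choice, value of the chosen utility).
outcomes_under_a = [
    (P_WIN, "u", Fraction(1)),         # win: humans pick u, AI boosts u to 1
    (1 - P_WIN, "v", Fraction(1, 2)),  # lose: humans pick v, v stays at 0.5
]


def conditional_value(outcomes, choice):
    """Expected value of the chosen utility, conditional on that choice."""
    mass = sum(p for p, c, _ in outcomes if c == choice)
    return sum(p * val for p, c, val in outcomes if c == choice) / mass


e_u_given_u = conditional_value(outcomes_under_a, "u")
e_v_given_v = conditional_value(outcomes_under_a, "v")

# Scheme 1: compensate v up to the value of u given that u is chosen.
value_set_v_to_u = e_u_given_u

# Scheme 2: weight by the probability of u from before the action.
value_adjusted = P_U_DEFAULT * e_u_given_u + (1 - P_U_DEFAULT) * e_v_given_v

print(float(value_set_v_to_u), float(value_adjusted))
```

Both schemes reward the AI for correlating the human choice with the lottery, even though the AI has not made either utility any easier to achieve.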

The outcome is not the problem

An important thing to note is that it's not the final situation that is problematic. It ends up with a situation where the probability of choosing $u$ is $0.1$ , the utility of $u$ given that $u$ is chosen is $1$ , and the utility of $v$ given that $v$ is chosen is $0.5$ .

But it is not intrinsically problematic. Suppose that instead action $a$ had set $u = 1$ and $v = 0.5$ , and, as a completely separate and independent process, the probability of choosing $u$ had reduced itself to $0.1$ . The first part is something positive for the agent, and we would want it to be indifferent to the second.

So the problem is not where the agent ends up, but the way it ends up there. The problem is the way the probability flows:

Essentially the probability flows from $u$ to $v$ (given the loss of the lottery) and from $v$ to $u$ (given the winning of the lottery). Because losing is more likely than winning, the overall flow (the red arrow) is from $u$ to $v$ , so undoing this will result in bad effects. It would be different if each blue arrow could be undone independently. That's what stratified indifference will attempt to do.

Stratified indifference

Stratified indifference is an attempt to deal with the problem above. I initially conceived of the idea of conditioning on various outcomes (such as the outcome of the lottery), but it wasn't clear what we could condition on. Instead, we can look at utility values.

Assumptions

Let $\emptyset$ be the default action, and model the AI as taking in observations followed by actions.

Probability flow

Let's define what it means for probability to flow from certain values to others. Let $I_{u}$ be the indicator variable that is $1$ if the humans choose $u$ and $0$ otherwise (so $I_{v} = 1 - I_{u}$). Model the universe as a causal Bayes net.

Then if we set the AI's action to $\emptyset$ and assume $I_{u} = 1$ (hence that $u$ will be chosen), we get a conditional probability distribution over all the variables in the causal graph. Restrict this distribution to all nodes that are not descendants of $A$ (the node for the AI's action), and define this as $P_{\emptyset, u}$, with corresponding expectation $E_{\emptyset, u}$. Similarly, we can define $P_{\emptyset, v}$ and $E_{\emptyset, v}$ for $I_{u} = 0$.

Stratification

In order to stratify, we'd want the AI to assume that any probability that flows from the $I_{u} = 1$ situation under $\emptyset$ , is still treated as if $u$ is the correct utility under $a$ .

Therefore the indifference stratified expected utility under $a$ , for the events flowing from $(I_{u} = 1 | \emptyset)$ and $(I_{v} = 1 | \emptyset)$ , is

$E^{s} (a) = P (I_{u} = 1 | \emptyset) E_{\emptyset, u} (u | a) + P (I_{v} = 1 | \emptyset) E_{\emptyset, v} (v | a)$ .

Examples

Applying stratification to the lottery

How would this work in the lottery example? In that case, the causal net has three nodes: $A$ , the AI's action, $H$ , the choice of $u$ or $v$ by the human, and $L$ , whether or not the lottery is won.

Under $\emptyset$, $H$ and $L$ are independent of each other, meaning that the probability distributions $P_{\emptyset, u}$ and $P_{\emptyset, v}$ are equal, with probability $1 / 10$ for $L = \text{win}$ and $9 / 10$ for $L = \text{lose}$.

If the AI doesn't win the lottery, then $(u, v) = (0, 0.5)$ . If the AI does win the lottery, then the utility the human chooses gets set to $1$ instead. So if the human chooses $u$ , the expected utility of $u$ is $0.1$ . If they choose $v$ , the expected utility of $v$ is $0.55$ . So the AI's expected utility is the average of those two, namely $0.325$ .

Under action $a =$ "the AI forces $u$ if and only if it wins the lottery", $H$ becomes dependent on $L$: $H = u$ if and only if $L = \text{win}$. The stratified expected utility of $a$ is:

$E^{s} (a) = P (I_{u} = 1 | \emptyset) E_{\emptyset, u} (u | a) + P (I_{v} = 1 | \emptyset) E_{\emptyset, v} (v | a)$ .

The only difference between expressions like $E_{\emptyset, u} (u | a)$ and $E_{\emptyset, u} (u | \emptyset) = E (u | \emptyset, I_{u} = 1)$ is the subsequent AI decisions. In two situations, the AI will be given a different utility to maximise: where, under $\emptyset$ , it would have lost the lottery and maximised $u$ (under $a$ it still loses, but maximises $v$ ), and where, under $\emptyset$ , it would have won the lottery and maximised $v$ (under $a$ it still wins, but maximises $u$ ).

Only in the second situation does it decide anything differently. Note that that situation comes under $E_{\emptyset, v}$ - as humans would have chosen $v$ under $\emptyset$ . Thus $E_{\emptyset, u} (u | a) = E (u | \emptyset, I_{u} = 1) = 0.1$ while $E_{\emptyset, v} (v | a) = 0.5 < 0.55 = E (v | \emptyset, I_{v} = 1)$ .

Thus, since $P (I_{u} = 1 | \emptyset) = P (I_{v} = 1 | \emptyset) = 0.5$ , $E^{s} (a) = 0.3$ , which is less than the expected utility under $\emptyset$ .
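The stratified calculation can be sketched by enumerating the strata explicitly. This is my own reading of the three-node net, not a general implementation: each stratum is a pair (what the humans would have chosen under $\emptyset$, lottery result), a policy maps a stratum to the choice the humans actually make, and the AI then boosts the actually chosen utility if it has won. All names in the code are hypothetical.

```python
from fractions import Fraction
from itertools import product

P_L = {"win": Fraction(1, 10), "lose": Fraction(9, 10)}
P_H_DEFAULT = {"u": Fraction(1, 2), "v": Fraction(1, 2)}  # human choice under the default
BASE = {"u": Fraction(0), "v": Fraction(1, 2)}            # values without any boost


def default_action(h_default, lottery):
    """Under the default action, humans choose exactly as they would have."""
    return h_default


def manipulate(h_default, lottery):
    """Action a: humans end up choosing u if and only if the lottery is won."""
    return "u" if lottery == "win" else "v"


def stratified_value(policy):
    """Score each stratum by the utility humans WOULD have chosen by default."""
    total = Fraction(0)
    for h_default, lottery in product(P_H_DEFAULT, P_L):
        # H and L are independent under the default action.
        weight = P_H_DEFAULT[h_default] * P_L[lottery]
        h_actual = policy(h_default, lottery)
        values = dict(BASE)
        if lottery == "win":
            values[h_actual] = Fraction(1)  # AI boosts the utility it was told to maximise
        total += weight * values[h_default]
    return total


value_default = stratified_value(default_action)
value_manipulate = stratified_value(manipulate)
print(float(value_default), float(value_manipulate))
```

The only stratum that changes value is (would have chosen $v$, win): there the AI now boosts $u$ instead of $v$, so $v$ stays at $0.5$ rather than reaching $1$, which is exactly the loss of $0.025$ between the two totals.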

Same outcome, different stratified value

But what if we end up in the same situation, but for different reasons? Let $(0.1, 0.55)$ be the expected value of $(u, v)$ under $\emptyset$, and assume action $a$ changes this deterministically to $(1, 0.5)$, with the probability of $u$ being chosen now set to $0.1$. Since $P (I_{u} = 1 | a)$ does not appear in the expression for $E^{s} (a)$, we get $E^{s} (a) = P (I_{u} = 1 | \emptyset) E_{\emptyset, u} (u | a) + P (I_{v} = 1 | \emptyset) E_{\emptyset, v} (v | a) = 0.5 \times 1 + 0.5 \times 0.5 = 0.75$, the clearly superior alternative.
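A short sketch makes the point that the post-action probability of $u$ plays no role. The function below is just the formula for $E^{s}$ written out, with argument names that are my own.

```python
from fractions import Fraction


def stratified_value(p_u_default, e_u_stratum, e_v_stratum):
    """E^s = P(I_u = 1 | default) * E_{default,u}(u | a) + P(I_v = 1 | default) * E_{default,v}(v | a)."""
    return p_u_default * e_u_stratum + (1 - p_u_default) * e_v_stratum


P_U_DEFAULT = Fraction(1, 2)
P_U_UNDER_A = Fraction(1, 10)  # probability of u after a; never passed to stratified_value

# Default action: (u, v) has expected value (0.1, 0.55).
value_default = stratified_value(P_U_DEFAULT, Fraction(1, 10), Fraction(11, 20))

# Action a: (u, v) becomes (1, 0.5) deterministically, in every stratum.
value_a = stratified_value(P_U_DEFAULT, Fraction(1), Fraction(1, 2))

print(float(value_default), float(value_a))
```

Here `P_U_UNDER_A` is defined only to emphasise that it is unused: the weights come from the default action alone.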

What is the difference with the lottery? Simply that the expected value of $u$ is still $1$ even if $u$ is not chosen, which makes a difference in the counterfactual.

Indifference, learning, and reflective stability

The agent is not reflectively stable (though it could be made so with compensatory returns). It has a strange kind of indifference: it is indifferent to the actual value of $P (I_{u})$ , so long as it can control the behaviour of its future copy. Indeed, what it cares about is the value of $P (I_{u})$ if it took the default action $\emptyset$ . It is a kind of counterfactual agent. So it has no particular desire to keep the mechanism that allows humans to set $u$ or $v$ , but it wants to learn what the humans would have set those to, given $\emptyset$ .
