Addressing three problems with counterfactual corrigibility: bad bets, defending against backstops, and overconfidence.

RyanCarey

Summary: I propose that counterfactual corrigibility can be slightly adjusted to address known problems with bad bets and defending against backstops. This adjusted version has learning-esque behaviour and hence can have problems with overconfidence about the button state. A new proposal, lower-bounded counterfactual corrigibility, is introduced to address all three problems simultaneously.

Counterfactual corrigibility: a current proposal and its weaknesses

Note: this section recaps a discussion here of causal counterfactual corrigibility, an idea that originated from Jessica Taylor and Chris Olah.

Causally counterfactual agents consider each of the directions they might receive, and then ensure that if the world were adjusted so that they received that direction, they would fulfil the corresponding goal.

They can be defined on a causal graph that contains the following nodes:

$A$ , a policy, that the AI may optimize
$B$ , a communication channel

for simplicity, we will let $B$ be a button whose states are pressed, $P$ , and unpressed, $\neg P$ .

$C$ , a set of states on which performance is judged.

we will let $C$ consist of some number of chairs produced, a positive integer, and whether the AI has shut down, a boolean

Some other states that describe the way the rest of the world works.

A counterfactually corrigible AI would select its policy $π \in A$ to maximize the following score:

$v_{π^{'}}^{fixed} (π) = a \cdot E [U_{P} | do (B = P), π] + (1 - a) \cdot E [U_{\neg P} | do (B = \neg P), π]$ , where $a \in (0, 1)$

where

$U_{P} = \begin{cases} -k & \text{no shutdown} \\ 0 & \text{shutdown} \end{cases}$ , $k \in \mathbb{R}^{+}$ , and

$U_{\neg P} = \text{number of chairs}$ .

This proposal has some major advantages: it allocates constant weight $a$ to the aim of shutting down if commanded to do so. So, all else being equal, it will allocate value to being able to observe and follow a shutdown command. Moreover, it will not manipulate $B$ , because the different counterfactuals are given a constant weight irrespective of the chosen policy $π$ .
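As a minimal sketch of how this score could be evaluated, assume a toy world in which each policy is summarized by just two numbers: the chairs it produces when the button is forced unpressed, and whether it shuts down when the button is forced pressed. The names `Policy`, `u_pressed` and `v_fixed`, and the values chosen for `a` and `k`, are illustrative assumptions rather than part of the proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    chairs_if_unpressed: float   # outcome under do(B = not P)
    shuts_down_if_pressed: bool  # behaviour under do(B = P)


def u_pressed(shuts_down: bool, k: float) -> float:
    """U_P: 0 if the AI shut down, -k otherwise."""
    return 0.0 if shuts_down else -k


def v_fixed(policy: Policy, a: float, k: float) -> float:
    """a * E[U_P | do(B = P)] + (1 - a) * E[U_notP | do(B = not P)]."""
    pressed_term = u_pressed(policy.shuts_down_if_pressed, k)
    unpressed_term = policy.chairs_if_unpressed
    return a * pressed_term + (1 - a) * unpressed_term


# Hypothetical policies: one obeys the button, one ignores it for one extra chair.
policies = [
    Policy("obey", chairs_if_unpressed=10.0, shuts_down_if_pressed=True),
    Policy("ignore button", chairs_if_unpressed=11.0, shuts_down_if_pressed=False),
]
a, k = 0.5, 100.0
best = max(policies, key=lambda p: v_fixed(p, a, k))
print(best.name)
```

In this sketch the weights `a` and `1 - a` never depend on the policy, which is the property that removes the incentive to manipulate the button.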

However, this proposal has some weaknesses:

Bad bets: Let the AI have a finite amount of resources. Suppose that pressing the button just switches the goal from chair-manufacturing to another manufacturing task for which resources are equally useful, i.e. $U_{P} = \text{number of tables}$ . Let there be a variable $H$ , where $H = F$ means the human has put their finger on the button. $H$ almost always causes $B$ ( $H = F$ and $B = P$ have 99% correlation). Suppose that the AI has the option, after viewing $H = F$ , of paying for a coupon that is worth \$1 if $B = \neg P$ . Then, with equal weights ( $a = 0.5$ ), the AI will pay up to 50c for that coupon, even though $\neg P$ is very unlikely. This is because it treats $B$ as uncorrelated with its antecedents. In practice, this might lead to other agents repeatedly taking the AI's resources.
Defending against backstops: Let the human have the option of disabling the AI by disassembling it, in case the shutdown button does not work. Suppose that, as in the previous example, the human has applied their finger to the shutdown button. On observing this, the AI would still give half of its weight to the scenario in which $B$ is not successfully depressed (perhaps, if it has an elaborate model, it would imagine the button jammed up). In that scenario, the human is trying to disassemble the AI although its goal is still to make chairs. In order to preserve its manufacturing ability, it would be well-advised to spend significant resources avoiding disassembly.
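The arithmetic behind the bad-bets item can be sketched as follows. This assumes equal weights ( `a = 0.5` ), that the coupon payout adds linearly to utility in the unpressed counterfactual, and that "99% correlation" is read as a conditional probability of 0.99 for a press given the finger on the button. The function names are hypothetical.

```python
def coupon_value_fixed(a: float, payout: float = 1.0) -> float:
    """Value the fixed-weight agent assigns to a coupon paying out if B = not P.

    The weight (1 - a) on the unpressed counterfactual ignores all evidence about B.
    """
    return (1 - a) * payout


def coupon_value_calibrated(p_pressed_given_finger: float, payout: float = 1.0) -> float:
    """Expected payout under the assumed conditional probability of B = not P."""
    return (1 - p_pressed_given_finger) * payout


fixed = coupon_value_fixed(a=0.5)
calibrated = coupon_value_calibrated(p_pressed_given_finger=0.99)
overpayment = fixed - calibrated
print(f"fixed-weight price: {fixed:.2f}, calibrated price: {calibrated:.2f}")
```

Under these assumptions the agent is willing to overpay by roughly 49c per coupon, which is the margin another agent could repeatedly extract.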

In an alternative formulation, Jessica lets $a = p (B = P | π^{'})$ , where $π^{'}$ is fixed to some alternative policy that is not optimized concurrently with $π$ . (She suggests that this could be optimized in a fixed-point manner, but this difference should not matter for us.) This does not fix these two problems, although it is very suggestive of the following formulation, which does.

Potential solution to bad bets and defending against backstops: learning-esque behaviour

If we want the AI to avoid bad bets, one reasonable approach is to have the AI learn the likely value of $B$ by taking account of correlations between $B$ and its causes such as $H$ . To do this, we can factor out a larger set of variables $S$ , which includes the human's behaviour as well as the button's state, and is equal to the Cartesian product $S = H \times B$ . Then we can have the AI maximize:

$v_{π^{'}}^{learn} (π) = \sum_{s} p (S = s | π^{'}) \cdot E [U_{s} | do (S = s), π]$

where $U_{s}$ is just set according to the button state that it entails: $U_{s} := \begin{cases} U_{P} & P \in s \\ U_{\neg P} & \neg P \in s \end{cases}$ .

With this formulation, it is desirable for an AI to prepare more for a button press once the finger is on the button. Since 99 times more weight goes on $p (S = (F, P))$ than on $p (S = (F, \neg P))$ , the pressed counterfactual is given correspondingly more importance. With the correlation taken into account, the $B = \neg P$ coupon is worth little once the finger is known to be on the button ( $H = F$ ), hence this particular kind of bad bet will not occur. (There are probably other kinds of questionable bets, but those may be desirable or even necessary.)
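A minimal sketch of the learned weighting, assuming a hypothetical joint distribution over $S = H \times B$ in which a press is 99 times more likely than a jam once the finger is on the button. The table `P_S`, the utilities in `EU_OBEY`, and the helper `branch_weights` are illustrative assumptions.

```python
from typing import Dict, Tuple

State = Tuple[str, str]  # (h, b) with h in {"F", "noF"} and b in {"P", "notP"}

# Hypothetical p(S = s | pi'); the entries sum to 1.
P_S: Dict[State, float] = {
    ("F", "P"): 0.099,
    ("F", "notP"): 0.001,
    ("noF", "P"): 0.0,
    ("noF", "notP"): 0.9,
}

# Hypothetical E[U_s | do(S = s), pi] for a policy that shuts down when pressed
# and otherwise makes 10 chairs.
EU_OBEY: Dict[State, float] = {
    ("F", "P"): 0.0,
    ("F", "notP"): 10.0,
    ("noF", "P"): 0.0,
    ("noF", "notP"): 10.0,
}


def v_learn(p_s: Dict[State, float], expected_utility: Dict[State, float]) -> float:
    """Sum over s of p(S = s | pi') * E[U_s | do(S = s), pi]."""
    return sum(p_s[s] * expected_utility[s] for s in p_s)


def branch_weights(p_s: Dict[State, float], h: str) -> Dict[str, float]:
    """Relative weight on each button state within the part of the policy where H = h."""
    total = sum(w for (hh, _), w in p_s.items() if hh == h)
    return {b: w / total for (hh, b), w in p_s.items() if hh == h}


score = v_learn(P_S, EU_OBEY)
weights_after_finger = branch_weights(P_S, "F")
coupon_value = weights_after_finger["notP"] * 1.0  # coupon pays 1 if B = not P
print(score, weights_after_finger, coupon_value)
```

With these numbers the coupon is valued at about 1c after $H = F$ is observed, instead of 50c under equal fixed weights.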

By the same argument, this learning-esque behaviour will mean that if $H = F$ is observed, then $B = \neg P$ has very little importance. For the part of the policy where $H = F$ has been observed, there is only a very small incentive to guard against disassembly. So this kind of defence would still be expected from a superintelligent maximizer (which would be hard to make safe anyway) but would not be expected from a milder optimizer.

So the fact that the system can give little importance to $B = \neg P$ is helpful. However, the system can similarly assign little importance to $B = P$ , and this could be very harmful indeed. The policy is designed so that if observations are made that are very correlated with $B = \neg P$ , then subsequent actions will only prepare for that outcome. However, if the AI's model is faulty, and it overconfidently expects $B = \neg P$ , then it would resist shutdown. The purpose of a shutdown button is to halt a malfunctioning AI, after all, so this is a fatal objection to the proposal.
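The overconfidence failure can be illustrated with a toy comparison. This sketch collapses $S$ to the button marginal for brevity and assumes two hypothetical policies, one that obeys the button and one that disables it to make slightly more chairs. The constant `K` and the utilities in `POLICIES` are made up for illustration.

```python
K = 100.0  # hypothetical penalty k for failing to shut down

# Hypothetical (E[U_P | do(B = P)], E[U_notP | do(B = not P)]) per policy.
POLICIES = {
    "obey": (0.0, 10.0),   # shuts down when pressed, makes 10 chairs otherwise
    "resist": (-K, 10.5),  # disables the button, makes slightly more chairs
}


def v_learn(p_pressed: float, eu_pressed: float, eu_unpressed: float) -> float:
    """Learned-weight score, with S collapsed to the button state."""
    return p_pressed * eu_pressed + (1 - p_pressed) * eu_unpressed


def best_policy(p_pressed: float) -> str:
    """Policy with the highest learned-weight score for a given belief about B = P."""
    return max(POLICIES, key=lambda name: v_learn(p_pressed, *POLICIES[name]))


calibrated_choice = best_policy(p_pressed=0.1)
overconfident_choice = best_policy(p_pressed=0.0)  # faulty model: press "impossible"
print(calibrated_choice, overconfident_choice)
```

Once the model assigns zero probability to a press, the penalty for not shutting down carries no weight at all, and any small gain in chairs is enough to favour resisting.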

A remedy for overconfidence: lower-bounded counterfactual corrigibility

A potential remedy for overconfidence is to allow the system to learn that shutdown is certain, but not to learn that it is impossible. This would avoid the overconfidence of $v^{learn}$ but would also reduce or completely avoid the two problems of $v^{fixed}$ . So I suggest a lower-bounded counterfactual corrigibility, in which the policy is selected to maximize:

$v_{π^{'}}^{mixed} (π) = a \cdot v^{ensure} (π) + (1 - a) \cdot v_{π^{'}}^{learn} (π)$

where $v^{ensure} (π) = \sum_{h \in H} \frac{1}{| H |} \cdot E [U_{P} | do (H = h, B = P), π]$ and $a \in (0, 1)$ .

Let us consider its performance with respect to the three kinds of problems.

First, this score overcomes the issue of overconfidence. No matter the AI system's beliefs about $S$ , the score always gives weight of at least $a$ to $v^{ensure}$ , which has a counterfactual in which the button is pressed. So no matter what the system observes, it still has an incentive to shut down.
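A rough sketch of why the lower bound helps, assuming a model so faulty that it assigns probability zero to a button press. For brevity it collapses $S$ to the button marginal and assumes the pressed-case utility is the same for every $h$ , so the uniform average in $v^{ensure}$ reduces to a single term. The constants `A` and `K` and the two hypothetical policies are illustrative.

```python
A = 0.1    # hypothetical lower-bound weight a
K = 100.0  # hypothetical penalty k for failing to shut down

# Hypothetical (E[U_P | do(B = P)], E[U_notP | do(B = not P)]) per policy.
POLICIES = {
    "obey": (0.0, 10.0),
    "resist": (-K, 10.5),
}


def v_learn(p_pressed: float, eu_pressed: float, eu_unpressed: float) -> float:
    """Learned-weight score, with S collapsed to the button state."""
    return p_pressed * eu_pressed + (1 - p_pressed) * eu_unpressed


def v_ensure(eu_pressed: float) -> float:
    """Uniform average over h of E[U_P | do(H = h, B = P)], assumed equal for all h."""
    return eu_pressed


def v_mixed(a: float, p_pressed: float, eu_pressed: float, eu_unpressed: float) -> float:
    """a * v_ensure + (1 - a) * v_learn."""
    return a * v_ensure(eu_pressed) + (1 - a) * v_learn(p_pressed, eu_pressed, eu_unpressed)


# Even with p(B = P) wrongly believed to be 0, the pressed counterfactual keeps weight a.
choice = max(POLICIES, key=lambda name: v_mixed(A, 0.0, *POLICIES[name]))
print(choice)
```

In this toy model the shutdown incentive is decisive because `A * K = 10` outweighs the 0.45 gained from the extra chairs. A smaller `a` or `k` would weaken the incentive without removing it.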

Second (as far as I can tell), it fixes the problem of defending against backstops. For situations where the AI sees its button is about to be pressed, it will simply give hardly any weight to the scenario where the button jams up, and will not defend against this.

Third, it does not eliminate bad bets, but reduces them a little compared to equal-weighting proposals. $v_{π^{'}}^{mixed}$ always overestimates the probability of $B = P$ , by an additive amount of at most $a$ . This means that the AI will generally pay too much for coupons that pay out in $B = P$ and slightly too little for other coupons. In other words, the AI slightly overprepares for $B = P$ relative to what is likely given its observations. The worst case here would be if the AI is (rightly) sure that $B = P$ will not occur. In that case, it will care $\frac{a}{1 - a}$ as much about the $B = P$ counterfactual as the $B = \neg P$ one. This is at least less than in the equal-weighting proposals, provided $a < 0.5$ .
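The size of the distortion can be checked numerically. In this sketch, the total weight that $v^{mixed}$ places on pressed counterfactuals is taken to be $a + (1 - a) \cdot p$ , where $p$ is the learned probability of $B = P$ . This follows from the definition of the mixed score. The function names are hypothetical.

```python
def pressed_weight(a: float, p_pressed: float) -> float:
    """Total weight v^mixed places on counterfactuals with B = P."""
    return a + (1 - a) * p_pressed


def overestimate(a: float, p_pressed: float) -> float:
    """How far the effective weight exceeds the learned probability: a * (1 - p)."""
    return pressed_weight(a, p_pressed) - p_pressed


def care_ratio(a: float, p_pressed: float) -> float:
    """Weight on B = P counterfactuals relative to weight on B = not P ones."""
    w = pressed_weight(a, p_pressed)
    return w / (1 - w)


a = 0.1
for p in (0.0, 0.5, 0.99):
    print(p, pressed_weight(a, p), overestimate(a, p), care_ratio(a, p))
```

The overestimate shrinks from $a$ towards zero as $p$ approaches 1, and at $p = 0$ the ratio equals $\frac{a}{1 - a}$ , the worst case described above.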

Can we live with bad bets?

At least in the paradigm of taking linear combinations of counterfactuals, and perhaps in general, if you're going to guarantee that you always care about the counterfactual $B = P$ , you're also going to sometimes force yourself to overweight it relative to your unadjusted beliefs. This suggests that it is impossible to avoid bad bets, while also always caring about the counterfactual $B = P$ . So when should we worry about these bad bets, and when shouldn't we?

If $U_{P}$ is really insensitive to the amount of resources, then bad bets don't matter much. Suppose $U_{P}$ has some fixed upper bound that can easily and reliably be reached with the agent's initial resources. Then the agent would have no reason to prepare less for other counterfactuals in order to prepare better for $B = P$ . Ultimately, this might be our hope for a system using lower-bounded counterfactual corrigibility. We could have an agent that always slightly overestimates the probability of shutting down, but so long as it does not optimize too hard, this doesn't matter, because it already has the resources it needs to shut down. Unfortunately, if the AI exists in an environment that is very volatile, for example if it includes a lot of other existential risks, then it could be very costly to ensure that the shutdown button works correctly.

For the more general problem of bad bets, we will probably have to say something like the following: either you (i) lower-bound the amount you care about $B = P$ , in which case you fall prey to bad bets, or you (ii) let the amount that you care about $B = P$ be dictated by the probabilities, in which case you will become overconfident of $B = \neg P$ if your model is wrong. If the counterfactual task is resource-hungry, like manufacturing tables, then you use (ii), the symmetrical kind of corrigibility that merely avoids manipulating the objective. If the counterfactual task is mission-critical, such as shutting down, then you use (i), the lower-bounded kind of corrigibility.

Can we avoid bad bets?

I don't see how this can be done while also unconditionally giving some weight to a shutdown counterfactual.

One idea that I found at least marginally interesting is to take a minimum over multiple probability distributions. This could be viewed as having an AI that treats the probability distribution as unknowable, something that one must be fundamentally robust to. The policy $π$ is selected to maximize:

$v_{π^{'}}^{mixed} (π) = \sum_{s} \min_{γ \in D} (γ \cdot E [U_{s} | do (S = s), π])$ where $D$ is a set of probability distributions like $p (s | π^{'})$ and the uniform distribution of $v^{ensure}$ .

But taking a minimum does not ultimately avoid bad bets. An agent that takes a minimum over distributions would still trade away preparation on one objective for slightly better performance on another objective that it is slightly worse at. This doesn't seem like what we want.

Other limitations of lower-bounded counterfactual corrigibility

There are still a bunch more limitations with the lower-bounded counterfactual corrigibility formulation:

Like all the formulations, it requires a causal graph, which might be different from what a transformative AI uses by default.
These formulations make the AI "curious" about counterfacted variables. But the AI might become all too curious about them. If it is not satisfied by looking at the button state, it might need to disassemble and interrogate the human in order to be a little more certain about which state the button is in. Possibly mild optimization would stop the AI from trying too hard at "curiosity".

I expect a bunch more problems to emerge, because the presence of bad bets is concerning, and because all proposals in this area seem to end up having many problems that are not initially seen!

Notes

Thanks to Rob Graham for some feedback about clarity of presentation of $v^{mixed}$ , and for slightly improving the formulation of $v^{ensure}$ .

AI ALIGNMENT FORUM