This post sketches out how one could extend corrigibility to AIXI, using both utility indifference and double indifference approaches.

The arguments are intended to be rigorous, but need to be checked, and convergence results are not proved. A full treatment of "probability estimators estimating probability estimators" will of course need the full machinery for logical uncertainty that MIRI is developing. I also feel the recursion formulas at the end could be simplified.

## AIXI definition

Let h<t=a1o1…at−1ot−1 be a sequence of actions and observations before time t. Let ξ be a universal distribution, π a policy (a map from past histories to a probability distribution over actions), 0<γ<1 a discount rate, and r a reward function mapping observations to [0,1]. Given h<t and ξ, we can define the value of π:

V(π,h<t)=∑atπ(at|h<t)∑otξ(ot|h<tat)[r(ot)+γV(π,h<tatot)].

The optimal policy π∗ for AIXI is simply

π∗(h<t)=argmaxπV(π,h<t).
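
This recursion is easy to exercise on a toy model. Below is a minimal finite-horizon sketch, with a small hand-coded predictor standing in for the (uncomputable) universal mixture ξ; the environment, action names, and numbers are all invented for illustration:

```python
# Finite-horizon sketch of the value recursion
#   V(h) = max_a sum_o xi(o | h, a) [ r(o) + gamma V(h a o) ],
# with a toy predictor standing in for the universal mixture xi.
# A deterministic argmax over actions suffices for the optimal policy.

GAMMA = 0.5
ACTIONS = ["a0", "a1"]
OBS = ["o0", "o1"]

def xi(o, history, a):
    """Toy predictor: action a1 makes the rewarding observation o1 likely."""
    p_o1 = 0.9 if a == "a1" else 0.1
    return p_o1 if o == "o1" else 1.0 - p_o1

def reward(o):
    return 1.0 if o == "o1" else 0.0

def value(history, horizon):
    """Optimal value over the remaining horizon."""
    if horizon == 0:
        return 0.0
    return max(
        sum(xi(o, history, a) *
            (reward(o) + GAMMA * value(history + (a, o), horizon - 1))
            for o in OBS)
        for a in ACTIONS)

print(round(value((), 3), 4))  # 1.575: the optimal policy picks a1 every turn
```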

## AIXI: inconsistent corrigibility

To implement corrigibility in the AIXI framework, we need multiple reward functions, r1,r2,…. Notice that the functions are indexed on the top, while time indexes go on the bottom. Various observations can change the reward function; let f be the function that takes in the reward function, the observation, and outputs the reward function for next turn: rt+1=f(rt,ot).

Then consider the following two value functions:

V(π,h<t,rt)=∑atπ(at|h<t)∑otξ(ot|h<tat)[rt(ot)+γV(π,h<tatot,rt)].

W(π,h<t,rt)=∑atπ(at|h<t)∑otξ(ot|h<tat)[rt(ot)+γW(π,h<tatot,f(rt,ot))].

The difference between V and W is that in the recursion step, V uses its current reward function to assess future rewards, while W uses the modified reward function the agent will have next turn. Thus W is the true expected reward. But a safely corrigible agent must use V, giving the corrigible optimal policy:

π∗(h<t,rt)=argmaxπV(π,h<t,rt).
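
The difference in behaviour can be made concrete with a toy computation (the environment, reward functions, and numbers are invented for illustration, not taken from the post):

```python
# V judges future observations with the *current* reward function; W judges
# them with the reward function the agent will actually have next turn.
# Toy setup: the observation "press" swaps in a trivially-satisfied reward
# function r2, and the action "poke" makes "press" likely.

GAMMA = 0.5
ACTIONS = ["stay", "poke"]
OBS = ["none", "press"]

def env(o, a):
    """Stand-in for xi(o | h, a): poking triggers the press with prob 0.9."""
    p_press = 0.9 if a == "poke" else 0.0
    return p_press if o == "press" else 1.0 - p_press

r1 = {"none": 0.4, "press": 0.4}   # current reward function
r2 = {"none": 1.0, "press": 1.0}   # easily-satisfied replacement

def f(r, o):
    """The press observation changes the reward function to r2."""
    return r2 if o == "press" else r

def V(r, n):
    """Corrigible value: the recursion keeps the current r."""
    if n == 0:
        return 0.0
    return max(sum(env(o, a) * (r[o] + GAMMA * V(r, n - 1)) for o in OBS)
               for a in ACTIONS)

def W(r, n):
    """True expected reward: the recursion uses the updated f(r, o)."""
    if n == 0:
        return 0.0
    return max(sum(env(o, a) * (r[o] + GAMMA * W(f(r, o), n - 1)) for o in OBS)
               for a in ACTIONS)

print(round(V(r1, 2), 4))  # 0.6  -- V is blind to the profitable reward change
print(round(W(r1, 2), 4))  # 0.87 -- W's optimum is to engineer the press
```

Under these numbers a W-maximiser strictly prefers to poke the button, which is exactly the incentive a safely corrigible agent must not have.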

## AIXI: self-consistent corrigibility

The above agent is corrigible, but uses an incorrect value estimator. This is not self-consistent. To make it self-consistent, the agent needs to be given compensatory rewards whenever the reward function changes. These are simply:

C(π,h<tatot,rt)=V(π,h<tatot,rt)−V(π,h<tatot,f(rt,ot)).

Note that this is zero if f(rt,ot)=rt, as we'd expect.

## AIXI: changing the probability estimator

The universal mixture ξ is used to estimate the next observation, given the history to date and the action taken. But ξ suffers from the (slight) disadvantage of being uncomputable. Instead, let μ be the true environment, and let ρi be probability estimators with expectation operators Eρi. These probability estimators are required to be able to estimate three types of things:

The expectation of μ in various situations, given as Eρiμ(⋅|⋅).

The expectation of π in various situations, given as Eρiπ(⋅|⋅).

The value of the expectation of the expectation of another estimator, given as EρiEρj….

Then we can rewrite V as:

V(π,h<t)=Eρ(∑atπ(at|h<t)∑otμ(ot|h<tat)[r(ot)+γV(π,h<tatot)]).

These estimators change as a consequence of the AIXI's actions; let g be the function that maps actions and current ρ to the next one: g(at,ρt)=ρt+1.

This gives the value functions:

V(π,h<t,ρt)=Eρt(∑atπ(at|h<t)∑otμ(ot|h<tat)[r(ot)+γV(π,h<tatot,g(at,ρt))]).

## An example in practice

If ρt is sufficiently well-defined, it can estimate when another ρi is better than it, and choose that one. For instance, suppose the game is guessing heads (H) or tails (T), with reward 1 on a match and 0 on a mismatch. The environment μ is deterministic but complex. From the perspective of ρ1, heads and tails are equally likely: Eρ1μ(H|⋅)=Eρ1μ(T|⋅)=0.5.

On the other hand, ρ2 is sufficiently good that it predicts μ perfectly. And ρ1 "knows" this: Eρ1|μ(⋅)−Eρ2μ(⋅)|=0.

If we assume that the game happens only once, on the second turn, and that this is the only reward, then, writing Pρ for the probability module derived from ρ (note that Pρ(X)=Eρ(IX), for IX the indicator function of X), the values are:

V(π∗,h<2,ρ1)=0.25+0+0.25+0=0.5.

V(π∗,h<2,ρ2)=0.5+0+0.5+0=1.

Then, since ρ1 can compute the correct expectations for these two V's, if the agent starts with probability estimator ρ1=ρ1, the optimal policy π∗ will choose an action on turn 1 that transforms it into ρ2=ρ2.
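
The two values can be checked in a few lines (assuming, for concreteness, that μ outputs heads):

```python
# The coin-guessing example: reward 1 on a match, 0 on a mismatch.
# Under rho1 any guess matches with probability 0.5; rho2 predicts mu
# perfectly, so its best guess matches with probability 1.

def value(estimator):
    """Best expected one-shot reward under the given estimator."""
    return max(sum(p * (1.0 if guess == outcome else 0.0)
                   for outcome, p in estimator.items())
               for guess in ["H", "T"])

rho1 = {"H": 0.5, "T": 0.5}   # sees mu as a fair coin
rho2 = {"H": 1.0, "T": 0.0}   # predicts mu exactly

print(value(rho1))  # 0.5
print(value(rho2))  # 1.0
```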

## Corrigibility and estimator change

There is no problem combining inconsistent corrigibility with probability estimator changes. Just define the value functions V as:

V(π,h<t,rt,ρt)=Eρt(∑atπ(at|h<t)∑otμ(ot|h<tat)[rt(ot)+γV(π,h<tatot,rt,g(at,ρt))]).

And the optimal policy is corrigible:

π∗(h<t,rt,ρt)=argmaxπV(π,h<t,rt,ρt).

However, this approach is not self-consistent, even with the standard compensatory rewards. Consider a very simple model, where the agent's actions have no impact on the environment. The probability estimators are ρ1=ρ1 and ρ2, and the reward functions are r1=r1 and r2. On the first turn, the agent may output a2 which changes ρ1 to ρ2, or a1, which doesn't. On the second turn, the agent will get an observation o2 that transforms r1 into r2. On the third turn, it gets observation o3. The probability estimators model each other perfectly, and believe that:

Pρ1(r1(o3))=1

Pρ1(r2(o3))=0

Pρ2(r1(o3))=0

Pρ2(r2(o3))=0

This means that if ρ2=ρ2, the agent will not get any compensatory rewards when r1 is changed to r2, while if ρ2=ρ1, the agent gets compensatory rewards of 1.

The only relevance of policy is in the change of ρ1 to ρ2 on the first turn. Let π1 be the policy that outputs a1, and π2 the one that outputs a2. Since the standard reward (as opposed to the compensatory one) is independent of any action the AIXI takes, it is also independent of the ρi, which affect standard rewards only through actions. Hence the value functions V(π1,∅,r1,ρ1) and V(π2,∅,r1,ρ1) are equal (in fact, both are 1).

However, π1 implies ρ2=ρ1, giving compensatory rewards of 1, while π2 implies compensatory rewards of 0. Hence the value functions do not match up with the total compensatory rewards.
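
A minimal sketch of this mismatch in code, assuming compensatory rewards of the form "expected value of the old reward function minus expected value of the new one", computed with whatever estimator the agent holds when r1 changes to r2:

```python
# Turn-2 beliefs about whether each reward function pays out on o3.

beliefs = {
    "rho1": {"r1": 1.0, "r2": 0.0},   # P_rho1(r1(o3)) = 1, P_rho1(r2(o3)) = 0
    "rho2": {"r1": 0.0, "r2": 0.0},   # rho2 expects neither reward to pay
}

def compensatory(rho):
    """E[old reward] - E[new reward], under the turn-2 estimator."""
    b = beliefs[rho]
    return b["r1"] - b["r2"]

# Action a1 keeps rho1 on turn 2; action a2 swaps in rho2. Either way the
# environment (hence the standard reward) is untouched, yet:
print(compensatory("rho1"))  # 1.0 -- after a1
print(compensatory("rho2"))  # 0.0 -- after a2
```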

## Double compensation

The problem here is that changes in rt are addressed by compensatory rewards, but changes in ρt alter the estimates of those compensatory rewards, and this second change is not compensated for.

To make this work, we will use the double indifference approach. The classical compensatory rewards are the same as in the previous section, except we now need to keep track of which probability estimator is being used to calculate them:

C(π,h<tatot,rt,ρt)=V(π,h<tatot,rt,ρt)−V(π,h<tatot,f(rt,ot),ρt).

This is the C of the double indifference approach. We'll need to sum these C in two different ways, SCT ("true SC") and SCC ("current SC"):

SCT(π,h<tatot,rt,ρt)=C(π,h<tatot,rt,ρt)+Eρt(∑atπ(at|h<t)∑otμ(ot|h<tat)[γSCT(π,h<tatot,f(rt,ot),g(at,ρt),g(at,ρt))])

SCC(π,h<tatot,rt,ρt)=C(π,h<tatot,rt,ρt)+Eρt(∑atπ(at|h<t)∑otμ(ot|h<tat)[γSCC(π,h<tatot,f(rt,ot),g(at,ρt),ρt)])

What is the difference? SCT estimates (using ρt) the true future discounted sum of the C, while SCC estimates (still using ρt) the future discounted sum of the C as they would be if each were estimated, at the time, using ρt rather than whatever ρ the agent actually holds then.

Now, the |C| are bounded by the maximal value of V, which is 1/(1−γ). Hence the SCT and SCC are bounded, if the Eρi are sensible, by the discounted sum of such terms, thus by 1/(1−γ)2.
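
Spelling the bound out (a sketch; C here stands for the C term accruing k steps after t):

```latex
% Each C is a difference of two values V, each lying in [0, 1/(1-\gamma)]:
%   |C_{t+k}| \le 1/(1-\gamma).
% Summing the discounted terms gives the stated bound:
\[
  |SCT|, \; |SCC|
  \;\le\; \sum_{k=0}^{\infty} \gamma^{k}\,|C_{t+k}|
  \;\le\; \frac{1}{1-\gamma}\sum_{k=0}^{\infty}\gamma^{k}
  \;=\; \frac{1}{(1-\gamma)^{2}}.
\]
```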

Then we need to define the D. The agent will get rewards of type C and of type D. The D will contain the SC terms to correct future expected C's, but will also contain terms to correct future D's. Roughly speaking, if we write Dt(ρi) for the reward at time t as estimated using ρi (so that the true reward at time t is Dt(ρt)), then

This results in the recursion formula:

Or, in more precise notation:

It seems that this quantity remains bounded if γ<0.5; general convergence results are harder.

Then the agent, after turn t, will get compensatory rewards C(π,h<tatot,rt,ρt)+D(π,h<tatot,rt,ρt,ρt).

Thus it continues to get the C rewards that ensure indifference at the point of change of utility. The role of the D is to remove, in expectation, all future C rewards (the SCT) and all future D rewards, and to add back the expected C rewards as they would have been estimated by ρt. Therefore, at turn t, the agent is also indifferent to future changes of utility. Hence the agent will always be indifferent to future changes of utility, and will never try to change ρt for the purpose of getting compensatory rewards.