This post sketches out how one could extend corrigibility to AIXI, using both utility indifference and double indifference approaches.

The arguments are intended to be rigorous, but need to be checked, and convergence results are not proved. A full treatment of "probability estimators estimating probability estimators" will of course need the full machinery for logical uncertainty that MIRI is developing. I also feel the recursion formulas at the end could be simplified.

## AIXI definition

Let h<t=a1o1…at−1ot−1 be a sequence of actions and observations before time t. Let ξ be a universal distribution, π a policy (a map from past histories to a probability distribution over actions), 0<γ<1 a discount rate, and r a reward function mapping observations to [0,1]. Given h<t and ξ, we can define the value of π:

V(π,h<t)=∑atπ(at|h<t)∑otξ(ot|h<tat)[r(ot)+γV(π,h<tatot)].

The optimal policy π∗ for AIXI is simply

π∗(h<t)=argmaxπV(π,h<t).
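
This recursion is easy to exercise on a toy model. Below is a minimal finite-horizon sketch, with a small hand-coded predictor standing in for the (uncomputable) universal mixture ξ; the environment, action names, and numbers are all invented for illustration:

```python
# Finite-horizon sketch of the value recursion
#   V(h) = max_a sum_o xi(o | h, a) [ r(o) + gamma V(h a o) ],
# with a toy predictor standing in for the universal mixture xi.
# A deterministic argmax over actions suffices for the optimal policy.

GAMMA = 0.5
ACTIONS = ["a0", "a1"]
OBS = ["o0", "o1"]

def xi(o, history, a):
    """Toy predictor: action a1 makes the rewarding observation o1 likely."""
    p_o1 = 0.9 if a == "a1" else 0.1
    return p_o1 if o == "o1" else 1.0 - p_o1

def reward(o):
    return 1.0 if o == "o1" else 0.0

def value(history, horizon):
    """Optimal value over the remaining horizon."""
    if horizon == 0:
        return 0.0
    return max(
        sum(xi(o, history, a) *
            (reward(o) + GAMMA * value(history + (a, o), horizon - 1))
            for o in OBS)
        for a in ACTIONS)

print(round(value((), 3), 4))  # 1.575: the optimal policy picks a1 every turn
```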

## AIXI: inconsistent corrigibility

To implement corrigibility in the AIXI framework, we need multiple reward functions, r1,r2,…. Notice that the functions are indexed on the top, while time indexes go on the bottom. Various observations can change the reward function; let f be the function that takes in the reward function, the observation, and outputs the reward function for next turn: rt+1=f(rt,ot).

Then consider the following two value functions:

V(π,h<t,rt)=∑atπ(at|h<t)∑otξ(ot|h<tat)[rt(ot)+γV(π,h<tatot,rt)].

W(π,h<t,rt)=∑atπ(at|h<t)∑otξ(ot|h<tat)[rt(ot)+γW(π,h<tatot,f(rt,ot))].

The difference between V and W is that in the recursion step, V uses its current reward function to assess future rewards, while W uses the modified reward function the agent will have next turn. Thus W is the true expected reward. But a safely corrigible agent must use V, giving the corrigible optimal policy:

π∗(h<t,rt)=argmaxπV(π,h<t,rt).
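
The difference in behaviour can be made concrete with a toy computation (the environment, reward functions, and numbers are invented for illustration, not taken from the post):

```python
# V judges future observations with the *current* reward function; W judges
# them with the reward function the agent will actually have next turn.
# Toy setup: the observation "press" swaps in a trivially-satisfied reward
# function r2, and the action "poke" makes "press" likely.

GAMMA = 0.5
ACTIONS = ["stay", "poke"]
OBS = ["none", "press"]

def env(o, a):
    """Stand-in for xi(o | h, a): poking triggers the press with prob 0.9."""
    p_press = 0.9 if a == "poke" else 0.0
    return p_press if o == "press" else 1.0 - p_press

r1 = {"none": 0.4, "press": 0.4}   # current reward function
r2 = {"none": 1.0, "press": 1.0}   # easily-satisfied replacement

def f(r, o):
    """The press observation changes the reward function to r2."""
    return r2 if o == "press" else r

def V(r, n):
    """Corrigible value: the recursion keeps the current r."""
    if n == 0:
        return 0.0
    return max(sum(env(o, a) * (r[o] + GAMMA * V(r, n - 1)) for o in OBS)
               for a in ACTIONS)

def W(r, n):
    """True expected reward: the recursion uses the updated f(r, o)."""
    if n == 0:
        return 0.0
    return max(sum(env(o, a) * (r[o] + GAMMA * W(f(r, o), n - 1)) for o in OBS)
               for a in ACTIONS)

print(round(V(r1, 2), 4))  # 0.6  -- V is blind to the profitable reward change
print(round(W(r1, 2), 4))  # 0.87 -- W's optimum is to engineer the press
```

Under these numbers a W-maximiser strictly prefers to poke the button, which is exactly the incentive a safely corrigible agent must not have.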

## AIXI: self-consistent corrigibility

The above agent is corrigible, but uses an incorrect value estimator. This is not self-consistent. To make it self-consistent, the agent needs to be given compensatory rewards whenever the reward function changes. These are simply:

C(π,h<tatot,rt)=V(π,h<tatot,rt)−V(π,h<tatot,f(rt,ot)).

Note that this is zero if f(rt,ot)=rt, as we'd expect.

## AIXI: changing the probability estimator

The universal mixture ξ is used to estimate the next observation, given the history to date and the action taken. But ξ suffers from the (slight) disadvantage of being uncomputable. Instead, let μ be the true environment, and let ρi be probability estimators with expectation operators Eρi. These probability estimators are required to be able to estimate three types of things:

The expectation of μ in various situations, given as Eρiμ(⋅|⋅).

The expectation of π in various situations, given as Eρiπ(⋅|⋅).

The value of the expectation of the expectation of another estimator, given as EρiEρj….

Then we can rewrite V as:

V(π,h<t)=Eρ(∑atπ(at|h<t)∑otμ(ot|h<tat)[r(ot)+γV(π,h<tatot)]).

These estimators change as a consequence of the AIXI's actions; let g be the function that maps actions and current ρ to the next one: g(at,ρt)=ρt+1.

This gives the value functions:

V(π,h<t,ρt)=Eρt(∑atπ(at|h<t)∑otμ(ot|h<tat)[r(ot)+γV(π,h<tatot,g(at,ρt))]).

## An example in practice

If ρt is sufficiently well-defined, it can estimate when another ρi is better than it, and choose that one. For instance, suppose the game is guessing heads (H) or tails (T), with reward 1 on a match and 0 on a mismatch. The environment μ is deterministic but complex. From the perspective of ρ1, heads and tails are equally likely: Eρ1μ(H|⋅)=Eρ1μ(T|⋅)=0.5.

On the other hand, ρ2 is sufficiently good that it predicts μ perfectly. And ρ1 "knows" this: Eρ1|μ(⋅)−Eρ2μ(⋅)|=0.

If we assume that the game happens only once, on the second turn, and that this is the only reward, then, writing Pρ for the probability module derived from ρ (note that Pρ(X)=Eρ(IX), for IX the indicator function of X), the values are:

V(π∗,h<2,ρ1)=0.25+0+0.25+0=0.5.

V(π∗,h<2,ρ2)=0.5+0+0.5+0=1.

Then, since ρ1 can compute the correct expectations for these two V's, if the agent starts with probability estimator ρ1=ρ1, the optimal policy π∗ will choose an action on turn 1 that transforms it into ρ2=ρ2.
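
The two values can be checked in a few lines (assuming, for concreteness, that μ outputs heads):

```python
# The coin-guessing example: reward 1 on a match, 0 on a mismatch.
# Under rho1 any guess matches with probability 0.5; rho2 predicts mu
# perfectly, so its best guess matches with probability 1.

def value(estimator):
    """Best expected one-shot reward under the given estimator."""
    return max(sum(p * (1.0 if guess == outcome else 0.0)
                   for outcome, p in estimator.items())
               for guess in ["H", "T"])

rho1 = {"H": 0.5, "T": 0.5}   # sees mu as a fair coin
rho2 = {"H": 1.0, "T": 0.0}   # predicts mu exactly

print(value(rho1))  # 0.5
print(value(rho2))  # 1.0
```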

## Corrigibility and estimator change

There is no problem combining inconsistent corrigibility with probability estimator changes. Just define the value functions V as:

V(π,h<t,rt,ρt)=Eρt(∑atπ(at|h<t)∑otμ(ot|h<tat)[rt(ot)+γV(π,h<tatot,rt,g(at,ρt))]).

And the optimal policy is corrigible:

π∗(h<t,rt,ρt)=argmaxπV(π,h<t,rt,ρt).

However, this approach is not self-consistent, even with the standard compensatory rewards. Consider a very simple model, where the agent's actions have no impact on the environment. The probability estimators are ρ1=ρ1 and ρ2, and the reward functions are r1=r1 and r2. On the first turn, the agent may output a2 which changes ρ1 to ρ2, or a1, which doesn't. On the second turn, the agent will get an observation o2 that transforms r1 into r2. On the third turn, it gets observation o3. The probability estimators model each other perfectly, and believe that:

Pρ1(r1(o3))=1

Pρ1(r2(o3))=0

Pρ2(r1(o3))=0

Pρ2(r2(o3))=0

This means that if ρ2=ρ2, the agent will not get any compensatory rewards when r1 is changed to r2, while if ρ2=ρ1, the agent gets compensatory rewards of 1.

The only relevance of policy is in the change of ρ1 to ρ2 on the first turn. Let π1 be the policy that outputs a1, and π2 the one that outputs a2. Since the standard reward (as opposed to the compensatory one) is independent of any action the AIXI takes, it is also independent of the ρi, which affect standard rewards only through actions. Hence the value functions V(π1,∅,r1,ρ1) and V(π2,∅,r1,ρ1) are equal (in fact, both are 1).

However, π1 implies ρ2=ρ1, giving compensatory rewards of 1, while π2 implies compensatory rewards of 0. Hence the value functions do not match up with the total compensatory rewards.
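
A minimal sketch of this mismatch in code, assuming compensatory rewards of the form "expected value of the old reward function minus expected value of the new one", computed with whatever estimator the agent holds when r1 changes to r2:

```python
# Turn-2 beliefs about whether each reward function pays out on o3.

beliefs = {
    "rho1": {"r1": 1.0, "r2": 0.0},   # P_rho1(r1(o3)) = 1, P_rho1(r2(o3)) = 0
    "rho2": {"r1": 0.0, "r2": 0.0},   # rho2 expects neither reward to pay
}

def compensatory(rho):
    """E[old reward] - E[new reward], under the turn-2 estimator."""
    b = beliefs[rho]
    return b["r1"] - b["r2"]

# Action a1 keeps rho1 on turn 2; action a2 swaps in rho2. Either way the
# environment (hence the standard reward) is untouched, yet:
print(compensatory("rho1"))  # 1.0 -- after a1
print(compensatory("rho2"))  # 0.0 -- after a2
```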

## Double compensation

The problem here is that changes in rt are addressed by compensatory rewards, but changes in ρt alter the estimates of those compensatory rewards, and this second change is not compensated for.

To make this work, we will use the double indifference approach. The classical compensatory rewards are the same as in the previous section, except we now need to keep track of which probability estimator is being used to calculate them:

C(π,h<tatot,rt,ρt)=V(π,h<tatot,rt,ρt)−V(π,h<tatot,f(rt,ot),ρt).

This is the C of the double indifference approach. We'll need to sum these C in two different ways, SCT ("true SC") and SCC ("current SC"):

SCT(π,h<tatot,rt,ρt)=C(π,h<tatot,rt,ρt)+Eρt(∑atπ(at|h<t)∑otμ(ot|h<tat)[γSCT(π,h<tatot,f(rt,ot),g(at,ρt),g(at,ρt))])

SCC(π,h<tatot,rt,ρt)=C(π,h<tatot,rt,ρt)+Eρt(∑atπ(at|h<t)∑otμ(ot|h<tat)[γSCC(π,h<tatot,f(rt,ot),g(at,ρt),ρt)])

What is the difference? SCT estimates (using ρt) the true future discounted sum of the C, while SCC estimates (still using ρt) the future discounted sum of the C as they would be if each were estimated, at the time, using ρt rather than whatever ρ the agent actually holds then.

Now, the |C| are bounded by the maximal value of V, which is 1/(1−γ). Hence the SCT and SCC are bounded, if the Eρi are sensible, by the discounted sum of such terms, thus by 1/(1−γ)2.
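
Spelling the bound out (a sketch; C here stands for the C term accruing k steps after t):

```latex
% Each C is a difference of two values V, each lying in [0, 1/(1-\gamma)]:
%   |C_{t+k}| \le 1/(1-\gamma).
% Summing the discounted terms gives the stated bound:
\[
  |SCT|, \; |SCC|
  \;\le\; \sum_{k=0}^{\infty} \gamma^{k}\,|C_{t+k}|
  \;\le\; \frac{1}{1-\gamma}\sum_{k=0}^{\infty}\gamma^{k}
  \;=\; \frac{1}{(1-\gamma)^{2}}.
\]
```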

Then we need to define the D. The agent will get rewards of type C and of type D. The D will contain the SC terms to correct future expected C's, but will also contain terms to correct future D's. Roughly speaking, if we write Dt(ρi) for the reward at time t as estimated using ρi (so that the true reward at time t is Dt(ρt)), then

This results in the recursion formula:

Or, in more precise notation:

It seems that this quantity remains bounded if γ<0.5; general convergence results are harder.

Then the agent, after turn t, will get compensatory rewards C(π,h<tatot,rt,ρt)+D(π,h<tatot,rt,ρt,ρt).

Thus it continues to get the C rewards that ensure indifference at the point of change of utility. The role of the D is to remove, in expectation, all future C rewards (the SCT) and all future D rewards, and to add back the expected C rewards as they would have been estimated by ρt. Therefore, at turn t, the agent is also indifferent to future changes of utility. Hence the agent will always be indifferent to future changes of utility, and will never try to change ρt for the purpose of getting compensatory rewards.