"Guarded learning" is a model for unbiased learning, the kind of learning where the AI has an incentive to learn its values, but not to bias the direction in which it learns.

## The value-uncertain U

Assume the AI is uncertain between $n+1$ different utilities $v_0, v_1, \dots, v_n$. Its actual utility is $U = I_0 v_0 + I_1 v_1 + \dots + I_n v_n$, where the $I_j$ are indicator functions with $I_j \ge 0$ and $\sum_j I_j = 1$.

The learning process is whatever updates the AI's estimates for the $I_j$.

Writing the utility in this way means that the utility of future actions will be assessed according to the values of the future agent (as long as the current agent can trust the future agent's assessment of the $I_j$), thus avoiding the naive cake or death problem.

At any given time, the expectations of $I_0, I_1, \dots, I_n$ define a weighted sum of the $v_j$, which can be seen as a point on the $n$-simplex $\Delta^n$. Define $U_t$ as the utility given by these expectations at time $t$.

For any $w \in \Delta^n$, define $\pi_w$ as the policy that maximises the invariant utility function $w$. Thus even if $U_t = w$, a $U$-maximiser will not necessarily follow $\pi_w$, because $U_{t+1}$ might be different from $w$, while $\pi_w$ always maximises $w$.

At any given time $t$, define the function $f_t : \Delta^n \to \mathbb{R}$ by mapping $w$ to $E_t(w \mid \pi_w)$. This is the expectation of $w$ at $t$, given that the AI follows a $w$-maximising policy. Defining these $f_t$ requires some sort of logical or causal counterfactual ("given that the AI follows $\pi_w$"), but this is the only requirement in this setup. The $f_t$ are all convex; see the proof at the end of this post.
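As a rough illustration of the definition of $f_t$, here is a minimal Python sketch for the two-utility case. It assumes (this is not part of the setup above) that the AI chooses among a small finite set of policies, each summarised by made-up expected payoffs $(E_t(v_0 \mid \pi), E_t(v_1 \mid \pi))$. The names `POLICIES`, `expected_value`, `best_policy` and `f_t` are hypothetical and exist only for this example.

```python
# Toy model (an assumption for illustration): three hypothetical policies,
# each summarised by its expected payoffs (E_t(v_0 | pi), E_t(v_1 | pi)).
POLICIES = {
    "all_in_v0": (1.0, 0.0),
    "all_in_v1": (0.0, 1.0),
    "hedge": (0.6, 0.6),
}


def expected_value(w, payoffs):
    """E_t(w | pi): linear in the weights w over the base utilities."""
    return sum(wj * pj for wj, pj in zip(w, payoffs))


def best_policy(w):
    """pi_w: the policy that maximises the fixed utility w."""
    return max(POLICIES, key=lambda name: expected_value(w, POLICIES[name]))


def f_t(w):
    """f_t(w) = E_t(w | pi_w)."""
    return expected_value(w, POLICIES[best_policy(w)])


# Sweep the 1-simplex: x is the weight on v_1, so w = (1 - x, x).
for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    w = (1.0 - x, x)
    print(x, best_policy(w), f_t(w))
```

In this toy model $f_t$ is a maximum of linear functions of $w$, so it dips in the middle of the segment and rises towards the endpoints, which is the convex shape the post relies on.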

## Value updating and correction

For illustration, consider the case $n+1 = 2$. The 1-simplex $\Delta^1$ is a line segment, which we're identifying with $[0,1]$:

The green curve is the plot of $f_t$. The blue line connects $f_t(v_0) = E_t(v_0 \mid \pi_{v_0})$ with $f_t(v_1) = E_t(v_1 \mid \pi_{v_1})$. The purple line is $E_t(v \mid \pi_{v_0})$ while the yellow line is $E_t(v \mid \pi_{v_1})$, both as functions of $v$. The $f_t$ curve must be contained between these lines: it lies on or below the blue line (by convexity) and on or above the purple and yellow lines (since $\pi_v$ is optimal for $v$).

The AI currently has $U_t = 0.5 v_0 + 0.5 v_1$. If the AI expected to learn more about its values, then its expected utility would rise along the green arrow. Reaching the top, on the blue line, corresponds to a situation where the agent expects to imminently be informed of whether its utility is $v_0$ or $v_1$, and thus to transform itself into a $v_0$ or $v_1$ maximiser. This kind of learning, up the green arrow, is what we want to encourage (note that this doesn't make the learning safe, though: we still need to pick the update process with care).

The red arrows point to updates we want to discourage: we don't want the AI to manipulate the direction of its update to get higher utility (in this case, by moving towards $v_0$).

## Correction term

Consider the agent operating as a discrete agent, moving from time $t$ to time $t+1$. It chooses an action $a_t$; as a consequence, it expects a probability distribution over the possible $U_{t+1}$ at time $t+1$. If we assume that there are only finitely many possibilities $\{U^i_{t+1}\}$ (the continuous case is almost identical), then define $p_i = P_t(U_{t+1} = U^i_{t+1} \mid a_t)$.

Let $EU_{t+1}$ be the expectation of $U_{t+1}$, that is, $EU_{t+1} = E(U_{t+1} \mid a_t) = \sum_i p_i U^i_{t+1}$.

Then $a_t$ is an unbiased action (corresponding to moving up the blue arrow) if $EU_{t+1} = U_t$. One restriction we could impose would be to require all valid $a_t$ to satisfy this. But this might be extremely restrictive. For instance, we probably wouldn't mind if $(U^0_{t+1}, p_0) = (v_0, 0.51)$ and $(U^1_{t+1}, p_1) = (v_1, 0.49)$. That's the green point here:

Sure, the green point is slightly off-centre, but the AI has gained a lot of information. What is clearly unacceptable is $(U^0_{t+1}, p_0) = (v_0, 1)$. This is the red point in the above graph; the action $a_t$ has forced the AI's values straight to $v_0$.
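The unbiasedness condition $EU_{t+1} = U_t$ can be checked mechanically. The sketch below is only an illustration: it represents each utility as a tuple of weights over $(v_0, v_1)$, and the helper names `expected_future_utility` and `bias` are invented for the example.

```python
def expected_future_utility(outcomes):
    """EU_{t+1} = sum_i p_i * U^i_{t+1}.

    `outcomes` is a list of (U, p) pairs, where U is a tuple of weights
    over the base utilities (v_0, v_1) and p is its probability.
    """
    n = len(outcomes[0][0])
    return tuple(sum(p * u[j] for u, p in outcomes) for j in range(n))


def bias(u_t, outcomes):
    """EU_{t+1} - U_t, componentwise; all zeros means the action is unbiased."""
    eu = expected_future_utility(outcomes)
    return tuple(e - c for e, c in zip(eu, u_t))


U_T = (0.5, 0.5)                      # U_t = 0.5 v_0 + 0.5 v_1
V0, V1 = (1.0, 0.0), (0.0, 1.0)

fair_reveal = [(V0, 0.5), (V1, 0.5)]      # unbiased
slightly_off = [(V0, 0.51), (V1, 0.49)]   # the tolerable example from the text
forced_v0 = [(V0, 1.0)]                   # values forced straight to v_0

for name, outcomes in [("fair", fair_reveal),
                       ("slightly off", slightly_off),
                       ("forced", forced_v0)]:
    print(name, bias(U_T, outcomes))
```

The bias alone does not separate the tolerable case from the unacceptable one by more than magnitude, which is why the post goes on to weigh bias against information gained rather than banning all bias.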

One way of dealing with the "force the values to $v_0$" problem would be to require the AI to use its past values $U_t = 0.5 v_0 + 0.5 v_1$, rather than its future values $U_{t+1} = v_0$, to evaluate the quality of the update. This would be like imposing the naive cake or death problem on the agent. The expectation would be $E_t(0.5 v_0 + 0.5 v_1 \mid \pi_{v_0})$, which is the orange point on the line here:

This solves the problem of the AI forcing $U_{t+1}$ to a single value, but we need to extend it to (biased) distributions over multiple $U^i_{t+1}$. The natural way to do that is to define the penalty:

$$\text{Penalty for biased } a_t = E_t(U_t \mid \pi_{t+1}) - E_t(EU_{t+1} \mid \pi_{t+1}),$$

where $\pi_{t+1}$ is the subsequent policy.
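To make the formula concrete in the finite case, one possible reading (an assumption, since the post only says that $\pi_{t+1}$ is the subsequent policy) is that after outcome $i$ the agent follows $\pi_{U^i_{t+1}}$, the policy that maximises its new utility. Under that reading the penalty expands as:

$$
\begin{aligned}
\text{Penalty}(a_t) &= \sum_i p_i \Big( E_t\big(U_t \mid \pi_{U^i_{t+1}}\big) - E_t\big(EU_{t+1} \mid \pi_{U^i_{t+1}}\big) \Big) \\
&= \sum_i p_i \, E_t\big(U_t - EU_{t+1} \mid \pi_{U^i_{t+1}}\big).
\end{aligned}
$$

The second line uses the linearity of the expectation in the utility. Each term compares how the old utility $U_t$ and the averaged new utility $EU_{t+1}$ fare under the policy actually followed after outcome $i$.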

It's clear the penalty is $0$ when $EU_{t+1} = U_t$. However, the penalty is not always negative! Indeed, upon forcing $U_{t+1} = 0.25 v_0 + 0.75 v_1$, adding the penalty would move the agent's utility to the orange point, higher than the red one:

Thus the penalty makes certain biased choices, which would otherwise be negative, slightly less negative. What is clear, however, is that even with a positive penalty, it's always better for the AI to keep $U_{t+1} = U_t$ than to force $U_{t+1}$ to a single value $v \ne U_t$ (ie the blue point is still higher than the orange one). This is because its total reward would then be

$$E_t(v \mid \pi_v) + \big(E_t(U_t \mid \pi_v) - E_t(v \mid \pi_v)\big) = E_t(U_t \mid \pi_v) \le E_t(U_t \mid \pi_{U_t}),$$

where the inequality holds by the definition of $\pi_{U_t}$.

So, in order to get a positive expectation from a biased action, the AI has to get a gain in information to overcome this loss.
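The trade-off between bias and information can be tried out numerically. The sketch below is a toy model, not the post's own construction. It assumes a finite set of hypothetical policies with made-up payoffs, that after each outcome the agent follows the policy maximising its new utility, and that the pre-penalty value is $\sum_i p_i E_t(U^i_{t+1} \mid \pi_{U^i_{t+1}})$. All function and variable names are invented for the example.

```python
# Hypothetical policies, given by expected payoffs (E_t(v_0 | pi), E_t(v_1 | pi)).
POLICIES = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6)]


def ev(w, payoffs):
    """E_t(w | pi) for a utility w given as weights over (v_0, v_1)."""
    return sum(wj * pj for wj, pj in zip(w, payoffs))


def best_payoffs(w):
    """Payoff vector of pi_w, the policy maximising the fixed utility w."""
    return max(POLICIES, key=lambda p: ev(w, p))


def total_reward(u_t, outcomes):
    """Pre-penalty value plus penalty, for outcomes [(U^i_{t+1}, p_i), ...].

    Assumption: after outcome i the agent follows pi_{U^i_{t+1}}.
    """
    eu = tuple(sum(p * u[j] for u, p in outcomes) for j in range(len(u_t)))
    raw = sum(p * ev(u, best_payoffs(u)) for u, p in outcomes)
    penalty = sum(p * (ev(u_t, best_payoffs(u)) - ev(eu, best_payoffs(u)))
                  for u, p in outcomes)
    return raw + penalty


U_T = (0.5, 0.5)
V0, V1 = (1.0, 0.0), (0.0, 1.0)

stay = total_reward(U_T, [(U_T, 1.0)])                 # keep U_{t+1} = U_t
forced = total_reward(U_T, [(V0, 1.0)])                # force U_{t+1} = v_0
reveal = total_reward(U_T, [(V0, 0.5), (V1, 0.5)])     # unbiased learning
skewed = total_reward(U_T, [(V0, 0.51), (V1, 0.49)])   # slightly biased learning

print(stay, forced, reveal, skewed)
```

With these made-up payoffs, forcing the values to $v_0$ scores below staying put, while both the unbiased and the slightly biased reveal score above it, because the information gained outweighs the small penalty.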

## Subagent stability

Is this penalty design subagent stable? It's not clear. Creating a subagent without the penalty is not an action the AI would take (as that action means its expectation of $U_{t+1}$ will become very biased, since the subagent has no restriction on deliberately biasing the values of the $I_j$).

However, an AI with $U_t = 0.5 v_0 + 0.5 v_1$ would certainly create a subagent that randomly decided to force either $U_{t+1} = v_0$ or $U_{t+1} = v_1$.

## Convexity of $f_t$

This section will prove the following theorem:

Theorem: If the AI's utility depends only on its actions, the $f_t$ are convex.

Proof:

For any $v \in \Delta^n$, there is a map $g_v : \Delta^n \to \mathbb{R}$ that maps $w$ to $E_t(w \mid \pi_v)$. This map is linear, and its graph $(w, g_v(w))$ is an affine hyper-segment that contains the point $(v, f_t(v))$, since $f_t(v) = g_v(v)$.

Then we need to note that the curve $(w, f_t(w))$ cannot have a transverse intersection with $(w, g_v(w))$ (though they can be tangent on a convex subset). This is because a transverse intersection would imply there exists a $w$ with $f_t(w) < g_v(w)$, ie $E_t(w \mid \pi_w) < E_t(w \mid \pi_v)$. But this contradicts the definition of $\pi_w$ as the best policy for maximising $w$.

Thus $(w, g_v(w))$ is a supporting hyperplane for the graph of $f_t$ at $v$. Since this holds at every $v$, the supporting hyperplane theorem implies that the epigraph of $f_t$ is a convex set, ie $f_t$ is convex.
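The same conclusion can be reached by a direct calculation, which may be easier to check than the geometric argument. This is a sketch under the proof's own assumptions (each $g_v$ is linear in $w$, and $\pi_w$ attains the maximum of $E_t(w \mid \cdot)$); the points $a, b$ and the weight $\lambda$ are notation introduced here. For $a, b \in \Delta^n$, $\lambda \in [0,1]$ and $w = \lambda a + (1-\lambda) b$:

$$
\begin{aligned}
f_t(w) &= g_w(w) \\
&= \lambda\, g_w(a) + (1-\lambda)\, g_w(b) \\
&\le \lambda\, f_t(a) + (1-\lambda)\, f_t(b).
\end{aligned}
$$

The last step uses $g_w(a) = E_t(a \mid \pi_w) \le E_t(a \mid \pi_a) = f_t(a)$, and likewise for $b$. Equivalently, $f_t(w) = \max_v g_v(w)$ is a pointwise maximum of linear functions, which is always convex.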
