A putative new idea for AI control; index here.

"Guarded learning" is a model for unbiased learning, the kind of learning where the AI has an incentive to learn its values, but not to bias the direction in which it learns.

The value-uncertain $U$

Assume the AI is uncertain between $n + 1$ different utilities $v_{0}, v_{1}, \dots, v_{n}$. Its actual utility is $U = I_{0} v_{0} + I_{1} v_{1} + \dots + I_{n} v_{n}$, where the $I_{j}$ are indicator functions with $I_{j} \geq 0$ and $\sum_{j} I_{j} = 1$.

The learning process is whatever updates the AI's estimates for the $I_{j}$ .

Writing the utility in this way means that the utility of future actions will be assessed according to the values of the future agent (as long as the current agent can trust the future agent's assessment of the $I_{j}$ ), thus avoiding the naive cake or death problem.

At any given time, the expectations of $I_{0}, I_{1}, \dots, I_{n}$ define a weighted sum of the $v_{j}$, which can be seen as a point on the $n$-simplex $Δ^{n}$. Define $U_{t}$ as the utility given by these expectations at time $t$.

For any $w \in Δ^{n}$ , define $π_{w}$ as the policy that maximises the invariant utility function $w$ . Thus if $U_{t} = w$ , a $U$ -maximiser will not necessarily follow $π_{w}$ , because $U_{t + 1}$ might be different from $w$ , while $π_{w}$ always maximises $w$ .

At any given time $t$, define the function $f_{t} : Δ^{n} \to R$ by mapping $w$ to $E_{t} (w | π_{w})$. This is the expectation of $w$ at $t$, given that the AI follows a $w$-maximising policy. Defining these $f_{t}$ requires some sort of logical or causal counterfactual -- "given that the AI follows $π_{w}$" -- but this is the only requirement in this setup. The $f_{t}$ are all convex; see the proof at the end of this post.
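As a rough illustration of $f_{t}$, here is a minimal sketch for the case of two utilities. It assumes a toy model that is not in the post: there are only three available policies, and each is summarised by the pair $(E_{t}(v_{0} | π), E_{t}(v_{1} | π))$. The numbers in `PAYOFFS` and the helper names are hypothetical.

```python
# Toy model (an assumption, not from the post): finitely many policies,
# each summarised by the tuple (E_t(v_0 | policy), E_t(v_1 | policy)).
PAYOFFS = [
    (2.0, 0.0),   # policy specialised for v_0
    (0.0, 1.0),   # policy specialised for v_1
    (1.0, 0.75),  # compromise policy
]


def expected(w, payoff):
    """E_t(w | policy), linear in w since w is a weighted sum of the v_j."""
    return sum(w_j * p_j for w_j, p_j in zip(w, payoff))


def best_policy(w):
    """Index of pi_w, a policy maximising the fixed utility w."""
    return max(range(len(PAYOFFS)), key=lambda k: expected(w, PAYOFFS[k]))


def f_t(w):
    """f_t(w) = E_t(w | pi_w)."""
    return expected(w, PAYOFFS[best_policy(w)])


def point(x):
    """Identify x in [0, 1] with the simplex point (1 - x) v_0 + x v_1."""
    return (1.0 - x, x)


# Midpoint convexity check over a grid of pairs in [0, 1].
grid = [i / 20 for i in range(21)]
convex_on_grid = all(
    f_t(point((a + b) / 2)) <= (f_t(point(a)) + f_t(point(b))) / 2 + 1e-12
    for a in grid
    for b in grid
)
print(convex_on_grid)
```

In this toy model $f_{t}$ is a maximum of finitely many linear functions of $w$, which is why the grid check passes; a grid check is only a sanity test, not a proof.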

Value updating and correction

For illustration, consider the case $n + 1 = 2$ . The $1$ -simplex $Δ^{1}$ is a line segment, which we're identifying with $[0, 1]$ :

The green curve is the plot of $f_{t}$ . The blue line connects $f_{t} (v_{0}) = E_{t} (v_{0} | π_{v_{0}})$ with $f_{t} (v_{1}) = E_{t} (v_{1} | π_{v_{1}})$ . The purple line is $E_{t} (v | π_{v_{0}})$ while the yellow line is $E_{t} (v | π_{v_{1}})$ . The $f_{t}$ curve must be contained between these lines.

The AI currently has $U_{t} = 0.5 v_{0} + 0.5 v_{1}$. If the AI expected to learn more about its values, then its expected utility would rise along the green arrow. Reaching the top, on the blue line, corresponds to a situation where the agent expects to imminently be informed of whether its utility is $v_{0}$ or $v_{1}$, and thus to transform itself into a $v_{0}$ or $v_{1}$ maximiser. This kind of learning, up the green arrow, is what we want to encourage (note that this doesn't make the learning safe, though -- we need to pick the update process with care).

The red arrows point to updates we want to discourage -- we don't want the AI to manipulate the direction of its update to get higher utility (in this case, by moving towards $v_{0}$ ).

Correction term

Consider the agent operating as a discrete agent, moving from time $t$ to time $t + 1$. It chooses an action $a_{t}$; as a consequence of that, at time $t + 1$, it expects a probability distribution over possible $U_{t + 1}$. If we assume that there are only finitely many possibilities $\{U_{t + 1}^{i}\}$ (the continuous case is almost identical), then define $p_{i} = P_{t} (U_{t + 1} = U_{t + 1}^{i} | a_{t})$.

Let $E U_{t + 1}$ be the expectation of $U_{t + 1}$ , that is $E (U_{t + 1} | a_{t}) = \sum_{i} p_{i} U_{t + 1}^{i}$ .

Then $a_{t}$ is an unbiased action (corresponding to moving up the green arrow) if $E U_{t + 1} = U_{t}$. One restriction we could impose would be to require all valid $a_{t}$ to satisfy this. But this might be extremely restrictive. For instance, we probably wouldn't mind if $(U_{t + 1}^{0}, p_{0}) = (v_{0}, 0.51)$ and $(U_{t + 1}^{1}, p_{1}) = (v_{1}, 0.49)$. That's the green point here:

Sure, the green point is slightly off-centre, but the AI has gained a lot of information. What's clearly incorrect is the case $(U_{t + 1}^{0}, p_{0}) = (v_{0}, 1)$. This is the red point in the above graph; the action $a_{t}$ has forced the AI's values straight to $v_{0}$.
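The unbiasedness condition $E U_{t + 1} = U_{t}$ can be checked numerically. The sketch below represents each possible $U_{t + 1}^{i}$ by its weights on $(v_{0}, v_{1})$ and compares $E U_{t + 1}$ with $U_{t}$ for the two examples above, plus an exactly unbiased fair coin. The `bias` measure (largest coordinate gap) and the function names are assumptions made for illustration; the post itself only distinguishes $E U_{t + 1} = U_{t}$ from $E U_{t + 1} \neq U_{t}$.

```python
def expected_future_utility(outcomes):
    """E U_{t+1} = sum_i p_i U_{t+1}^i, with each U^i given as simplex weights."""
    dim = len(outcomes[0][0])
    return tuple(sum(p * u[j] for u, p in outcomes) for j in range(dim))


def bias(u_t, outcomes):
    """Largest coordinate gap between E U_{t+1} and U_t; 0 means unbiased.

    This particular measure is a hypothetical choice for illustration.
    """
    eu = expected_future_utility(outcomes)
    return max(abs(a - b) for a, b in zip(eu, u_t))


V0, V1 = (1.0, 0.0), (0.0, 1.0)
U_T = (0.5, 0.5)

slightly_biased = [(V0, 0.51), (V1, 0.49)]  # the green point
forced_to_v0 = [(V0, 1.0)]                  # the red point
fair_coin = [(V0, 0.5), (V1, 0.5)]          # exactly unbiased

for name, outcomes in [
    ("slightly biased", slightly_biased),
    ("forced to v_0", forced_to_v0),
    ("fair coin", fair_coin),
]:
    print(name, round(bias(U_T, outcomes), 3))
```

The first case is off by about $0.01$ while the forced update is off by $0.5$, which matches the intuition that the former is tolerable and the latter is not.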

One way of dealing with the "force the values to $v_{0}$ " would be to require the AI to use its past values $U_{t} = 0.5 v_{0} + 0.5 v_{1}$ rather than its future values $U_{t + 1} = v_{0}$ to evaluate the quality of the update. This would be like imposing the naive cake or death problem on the agent. The expectation would be $E_{t} (0.5 v_{0} + 0.5 v_{1} | π_{v_{0}})$ , which is the orange point on the line here:

This solves the problem of the AI forcing $U_{t + 1}$ to a single value, but we need to extend it to (biased) distributions over multiple $U_{t + 1}^{i}$'s. The natural way to do that is to define the penalty:

Penalty for biased $a_{t}$ $= E_{t} (U_{t} | π_{t + 1}) - E_{t} (E U_{t + 1} | π_{t + 1})$ , where $π_{t + 1}$ is the subsequent policy.

It's clear the penalty is $0$ when $E U_{t + 1} = U_{t}$ . However, the penalty is not always negative! Indeed, upon forcing $U_{t + 1} = 0.25 v_{0} + 0.75 v_{1}$ , adding the penalty would move the agent's utility to the orange point, higher than the red one:

Thus the penalty makes certain biased choices, which would otherwise be negative, slightly less negative. What is clear, however, is that, even with a positive penalty, it's always better for the AI to keep $U_{t + 1} = U_{t}$ than to force $U_{t + 1}$ to a single value $v \neq U_{t}$ (ie the blue point is still higher than the orange one). This is because its total reward would then be

$E_{t} (v | π_{v}) + (E_{t} (U_{t} | π_{v}) - E_{t} (v | π_{v})) = E_{t} (U_{t} | π_{v}) \leq E_{t} (U_{t} | π_{U_{t}})$ by the definition of $π_{U_{t}}$ .
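The inequality above can be checked in a small example. This sketch assumes a toy model that is not in the post: three hypothetical policies, each summarised by $(E_{t}(v_{0} | π), E_{t}(v_{1} | π))$. It covers only actions that force $U_{t + 1}$ to a single value $v$, so the subsequent policy is $π_{v}$.

```python
# Toy model (an assumption, not from the post): each policy is summarised
# by the tuple (E_t(v_0 | policy), E_t(v_1 | policy)).
PAYOFFS = [(2.0, 0.0), (0.0, 1.0), (1.0, 0.75)]


def expected(w, payoff):
    """E_t(w | policy), linear in the simplex weights w."""
    return sum(w_j * p_j for w_j, p_j in zip(w, payoff))


def best_policy(w):
    """Index of pi_w, a policy maximising the fixed utility w."""
    return max(range(len(PAYOFFS)), key=lambda k: expected(w, PAYOFFS[k]))


def forced_update(u_t, v):
    """Return (penalty, total) when a_t forces U_{t+1} = v with certainty."""
    pi_v = PAYOFFS[best_policy(v)]       # subsequent policy pi_{t+1} = pi_v
    raw = expected(v, pi_v)              # E_t(v | pi_v)
    penalty = expected(u_t, pi_v) - raw  # E_t(U_t | pi_v) - E_t(v | pi_v)
    return penalty, raw + penalty


U_T = (0.5, 0.5)
baseline = expected(U_T, PAYOFFS[best_policy(U_T)])  # E_t(U_t | pi_{U_t})

for v in [(1.0, 0.0), (0.25, 0.75), (0.0, 1.0)]:
    penalty, total = forced_update(U_T, v)
    print(v, penalty, total, total <= baseline)
```

With these numbers, forcing $U_{t + 1} = 0.25 v_{0} + 0.75 v_{1}$ gives a positive penalty term, yet the total still does not exceed $E_{t} (U_{t} | π_{U_{t}})$, in line with the argument above.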

So, in order to get a positive expectation from a biased action, the AI has to get a gain in information to overcome this loss.

Subagent stability

Is this penalty design subagent stable? It's not clear. Creating a subagent without the penalty is not an action the AI would take (as that action means its expectation of $U_{t + 1}$ will become very biased, as the subagent has no restriction on deliberately biasing the values of $I_{j}$).

However, an AI with $U_{t} = 0.5 v_{0} + 0.5 v_{1}$ would certainly create a subagent that randomly decided to force either $U_{t + 1} = v_{0}$ or $U_{t + 1} = v_{1}$.

Convexity of $f_{t}$

This section will prove the following theorem:

Theorem: If the AI's utility depends only on its actions, the $f_{t}$ are convex.

Proof:

For any $v \in Δ^{n}$ , there is a map $g_{v} : Δ^{n} \to R$ that maps $w$ to $E_{t} (w | π_{v})$ . This map is linear, and $(w, g_{v} (w))$ defines an affine hyper-segment that contains $f_{t} (v) = g_{v} (v)$ .

Then we need to note that the curve $(w, f_{t} (w))$ cannot have a transverse intersection with $(w, g_{v} (w))$ (though they can be tangent on a convex subset). This is because a transverse intersection would imply there exists a $w$ with $f_{t} (w) < g_{v} (w)$ , ie $E_{t} (w | π_{w}) < E_{t} (w | π_{v})$ . But this is contradicted because $π_{w}$ is defined to be the best policy for maximising $w$ .

Thus $(w, g_{v} (w))$ is a supporting hyperplane for $(w, f_{t} (w))$ , and hence, by the supporting hyperplane theorem, $(w, f_{t} (w))$ is convex.
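The same conclusion can also be sketched directly from the definitions, without invoking the supporting hyperplane theorem. Take any $w, w' \in Δ^{n}$ and $\lambda \in [0, 1]$, and write $u = \lambda w + (1 - \lambda) w'$ (the symbols $w'$, $\lambda$ and $u$ are introduced here only for this illustration). Then

$$f_{t} (u) = g_{u} (u) = \lambda g_{u} (w) + (1 - \lambda) g_{u} (w') \leq \lambda f_{t} (w) + (1 - \lambda) f_{t} (w').$$

The middle equality is the linearity of $g_{u}$. The inequality uses $g_{u} (w) = E_{t} (w | π_{u}) \leq E_{t} (w | π_{w}) = f_{t} (w)$, which holds because $π_{w}$ is the best policy for maximising $w$, and likewise for $w'$.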
