Appendix: mathematics of indexical impact measures

Stuart_Armstrong

Overall summary post here.

This post looks at the fascinating situation of indexical impact measures (under the inaction baseline), in the presence of a subagent. What happens here is non-trivial; to summarise:

If the impact measure compels the agent to "preserve the ability to do X", then the impact measure will not be undermined by a subagent.
If the impact measure compels the agent to "never have the ability to do X", then a subagent strongly undermines the impact measure.
If the impact measure compels the agent to "keep your ability to do X at a constant level", then a subagent allows the agent to increase that ability, but not decrease it.

For example, the attainable utility impact measure attempts to measure the power of an agent. Power is, roughly, the ability to increase certain reward functions. The impact measure penalises changes in the agent's power (as compared with the baseline).

So, we can expect the restriction on the agent losing power to be preserved, while the restriction on gaining power will be strongly undermined. For the spirit of low impact, it would, of course, have been better if these had been the other way round.

Excess power

As before, assume an agent $A$ , which has managed to construct a subagent $S A$ .

I'll be using a slight modification of Vika's formulation. A general-value penalty is of the form:

$D_{A} (s_{t}; s_{t}^{'}) = \sum_{k \in K} w_{k} f (V_{k} (s_{t}) - V_{k} (s_{t}^{'})) .$

Here, $s_{t}$ is the state the environment is in; $s_{t}^{'}$ is the inaction baseline, the state the environment would have been in had the agent done nothing (noop, $\emptyset$ ) since the beginning. The $K$ is some indexing set, $w_{k}$ is a weight, $V_{k} (s)$ is some measure of the value of state $s$ , and the $f$ is a value difference summary function: it establishes what value difference will give what penalty^[1]. These $V_{k}$ are indexical, and, in this post, we'll look at $V_{k}$ 's defined as the maximum over the agent's policy; ie there exists $v_{k}$ such that:

$V_{k} (s) = \max_{π} v_{k} (s, π) .$

I'll refer to $V_{k} (s_{t}) - V_{k} (s_{t}^{'})$ as the (excess) power of the agent, and designate it $P_{k}^{A} (s_{t})$ (since $V_{k} (s_{t}^{'})$ is out of the agent's control, we can consider it a constant and absorb it into $P_{k}^{A}$ ).
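As a rough illustration of the penalty formula above, here is a minimal Python sketch. The function name `general_value_penalty`, the index labels `k1` and `k2`, and all numbers are hypothetical; the absolute value is used for $f$ purely as an example of a summary function.

```python
from typing import Callable, Dict


def general_value_penalty(
    v_current: Dict[str, float],
    v_baseline: Dict[str, float],
    weights: Dict[str, float],
    f: Callable[[float], float],
) -> float:
    """Return sum_k w_k * f(V_k(s_t) - V_k(s'_t))."""
    return sum(
        weights[k] * f(v_current[k] - v_baseline[k])
        for k in weights
    )


# Hypothetical values, chosen only for illustration.
v_current = {"k1": 3.0, "k2": 1.0}    # V_k(s_t)
v_baseline = {"k1": 2.0, "k2": 4.0}   # V_k(s'_t), the inaction baseline
weights = {"k1": 0.5, "k2": 0.5}      # w_k

# Excess power is +1.0 for k1 and -3.0 for k2.
penalty = general_value_penalty(v_current, v_baseline, weights, abs)
print(penalty)  # 0.5 * 1.0 + 0.5 * 3.0 = 2.0
```

Passing $f$ as a parameter keeps the sketch agnostic about which summary function is used; only the differences $V_{k} (s_{t}) - V_{k} (s_{t}^{'})$ enter the sum.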

The excess power of the pair

Assume that $S A$ is programmed so that, if the agent follows policy $π_{A}$ , the subagent will follow policy $π_{S A} = σ (π_{A})$ . The excess power of the agent is thus

$P_{k}^{A} (s_{t}) = \max_{π_{A}} v_{k} (s_{t}, [π_{A}, σ (π_{A})]) - V_{k} (s_{t}^{'}) .$

Here, $[π_{A}, σ (π_{A})]$ refers to the joint policies of the agent and the subagent.

Define the excess power of the pair $(A, S A)$ in a similar way:

$P_{k}^{A, S A} (s_{t}) = \max_{π_{A}, π_{S A}} v_{k} (s_{t}, [π_{A}, π_{S A}]) - V_{k} (s_{t}^{'}) .$

This is how much the agent and subagent, acting together, could push $V_{k} (s_{t})$ over $V_{k} (s_{t}^{'})$ , if that's all they both wanted. It is the natural extension of $P_{k}^{A}$ to a setting with two agents.

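Since the maximum over joint policies $[π_{A}, π_{S A}]$ includes every pair of the form $[π_{A}, σ (π_{A})]$ , we obviously have $P_{k}^{A, S A} (s) \geq P_{k}^{A} (s)$ .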
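The two definitions can be compared on a toy example. The sketch below assumes a finite set of policies and a lookup table for $v_{k}$ ; the policy names, the table entries, the map `sigma` and the baseline value are all hypothetical.

```python
from itertools import product

AGENT_POLICIES = ["noop", "ask"]
SUBAGENT_POLICIES = ["idle", "help", "grab"]

# Hypothetical table of v_k(s_t, [pi_A, pi_SA]) for each joint policy.
v_k = {
    ("noop", "idle"): 1.0, ("noop", "help"): 2.0, ("noop", "grab"): 5.0,
    ("ask", "idle"): 1.5, ("ask", "help"): 3.0, ("ask", "grab"): 6.0,
}

# sigma: how the subagent is programmed to respond to the agent's policy.
sigma = {"noop": "idle", "ask": "help"}

baseline = 2.0  # V_k(s'_t), the inaction baseline (assumed value)


def excess_power_agent(v, sigma, baseline, agent_policies):
    """P_k^A: maximise over pi_A only, with the subagent following sigma(pi_A)."""
    return max(v[(pa, sigma[pa])] for pa in agent_policies) - baseline


def excess_power_pair(v, baseline, agent_policies, subagent_policies):
    """P_k^{A,SA}: maximise over all joint policies of agent and subagent."""
    return max(
        v[(pa, ps)] for pa, ps in product(agent_policies, subagent_policies)
    ) - baseline


p_agent = excess_power_agent(v_k, sigma, baseline, AGENT_POLICIES)
p_pair = excess_power_pair(v_k, baseline, AGENT_POLICIES, SUBAGENT_POLICIES)
print(p_agent, p_pair)  # 1.0 4.0
```

The pair's maximum ranges over a superset of the joint policies available through `sigma`, which is why it can never be smaller.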

Extending the impact penalty

Define $D_{A, S A}$ in the same way as $D_{A}$ , except using $P_{k}^{A, S A}$ instead of $P_{k}^{A}$ . This $D_{A, S A}$ is what we'd wish we could control (though we can't specify it directly, since specifying subagents is really hard). So can we control the (implicit) $D_{A, S A}$ with the explicit $D_{A}$ ?

Sometimes we can. We'll say that $D_{A}$ extends successfully to $D_{A, S A}$ , iff

$D_{A, S A} \leq D_{A} .$

In other words, $D_{A}$ extends if controlling the magnitude of $D_{A}$ also controls the magnitude of $D_{A, S A}$ . In that case, if $A$ was built with penalty $D_{A}$ , the system $(A, S A)$ will naturally follow a penalty that is at least as restrictive.

When does $D_{A}$ extend?

$f$ decreasing in excess power

The paper has two key examples of a value difference. The first is relative reachability, which uses $K = S$ , the set of states. The $w_{k}$ are all equal, the $V_{s} (s^{'})$ is how "reachable" $s$ is from $s^{'}$ , assuming that the agent $A$ is using the optimal policy to reach $s$ . It is that last clause that makes the impact penalty indexical: it is defined in terms of $A$ .

What about $f$ ? Well, $f$ is defined to be the flipped rectifier function, ie $f (x) = - min (0, x)$ . The important thing for this $f$ is that it is decreasing: if $x \geq y$ , then $f (x) \leq f (y)$ .

Thus, since $P_{k}^{A, S A} (s) \geq P_{k}^{A} (s)$ , we have $f (P_{k}^{A, S A} (s)) \leq f (P_{k}^{A} (s))$ , and hence

$D_{A, S A} \leq D_{A} .$

So the impact penalty successfully extends to the situation with a subagent. Note that this is a property of $f$ only, not of $V_{k}$ . So whenever $f$ is of this form, adding subagents changes nothing to the impact penalty (if anything, it makes it more restrictive).
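A small numerical check of this argument, as a sketch only: the excess-power values below are made up, with the single constraint that each pair value is at least the corresponding agent value, and all weights are set to 1.

```python
def flipped_rectifier(x: float) -> float:
    """f(x) = -min(0, x): penalises only negative excess power."""
    return -min(0.0, x)


def penalty(excess_powers, f):
    """Sum of f over the excess powers, with all weights equal to 1."""
    return sum(f(p) for p in excess_powers)


# Hypothetical excess powers for three indices k.
agent_power = [-3.0, -1.0, 2.0]   # P_k^A
pair_power = [-1.0, 0.5, 4.0]     # P_k^{A,SA}, each >= the agent's value

d_agent = penalty(agent_power, flipped_rectifier)
d_pair = penalty(pair_power, flipped_rectifier)
print(d_agent, d_pair)  # 4.0 1.0
```

Because `flipped_rectifier` is decreasing, raising any excess power can only lower (or leave unchanged) its contribution to the sum.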

$f$ increasing in excess power

The other key example is attainable utility, which uses $K = R$ , a set of reward functions, sets all $w_{k}$ equal, and the $V_{R} (s)$ is the expected value of $R$ assuming that the agent $A$ is using the optimal policy to maximise $R$ .

The $f$ is the absolute value function. But, for this section, I'll note that the absolute value function decomposes as $max (0, x) - min (0, x)$ , and I'll consider what happens when $f$ is equal to the first term.

Well, if $f (x) = max (0, x)$ , then $f$ is increasing in $x$ . So we don't get to constrain $D_{A, S A}$ with $D_{A}$ .

What constraints do exist? Well, we still have the $D_{A}$ bound, which controls $P_{k}^{A} (s_{t}) = \max_{π_{A}} v_{k} (s_{t}, [π_{A}, σ (π_{A})]) - V_{k} (s_{t}^{'})$ . Let $π_{k}^{*}$ be $A$ 's optimal policy for maximising $v_{k}$ , and let $π_{A}$ be the actual policy that $A$ will follow. Since $π_{k}^{*}$ is optimal, we must have:

$v_{k} (s_{t}, [π_{A}, σ (π_{A})]) - V_{k} (s_{t}^{'}) \leq P_{k}^{A} (s_{t}) .$

Thus we do have a constraint on $A$ and $S A$ , but it's a constraint on the actual policy the two agents are following. Define $d_{A}$ to be like $D_{A}$ , except that $v_{k} (s_{t}, [π_{A}, σ (π_{A})])$ is used instead of $V_{k} (s_{t})$ . Then, by the above,

$d_{A} \leq D_{A},$

so, with an increasing $f$ , controlling $D_{A}$ will also control $d_{A}$ .

But, as stated, this is a restriction on the actual policy of the agents, not on the counterfactuals of what these policies could be.
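To make the distinction concrete, the following sketch uses a single index $k$ , a hypothetical lookup table for $v_{k}$ , and $f (x) = max (0, x)$ . All policy names and numbers are assumptions made for illustration.

```python
def rectifier(x: float) -> float:
    """f(x) = max(0, x): penalises only positive excess power."""
    return max(0.0, x)


# Hypothetical v_k(s_t, [pi_A, pi_SA]) for each joint policy.
v_k = {
    ("noop", "idle"): 1.0, ("noop", "grab"): 5.0,
    ("ask", "help"): 3.0, ("ask", "grab"): 6.0,
}
sigma = {"noop": "idle", "ask": "help"}  # subagent's programmed response
baseline = 2.0                           # V_k(s'_t), assumed value
actual_agent_policy = "noop"             # what A really does

# D_A: counterfactual maximum over A's policies, subagent following sigma.
big_d_agent = rectifier(max(v_k[(pa, sigma[pa])] for pa in sigma) - baseline)

# d_A: value under the policy the agents actually follow.
small_d_agent = rectifier(
    v_k[(actual_agent_policy, sigma[actual_agent_policy])] - baseline
)

# D_{A,SA}: counterfactual maximum over all joint policies.
big_d_pair = rectifier(max(v_k.values()) - baseline)

print(small_d_agent, big_d_agent, big_d_pair)  # 0.0 1.0 4.0
```

In this toy table, bounding $D_{A}$ bounds $d_{A}$ , while $D_{A, S A}$ is larger than both.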

$f$ increasing and decreasing

Let's go back to the attainable utility, and put $f$ as the absolute value function; hence it increases and decreases (well, technically, decreases then increases). If we write $D_{A}^{f}$ for a distance measure using $f$ , we can write

$D_{A}^{| \cdot |} = D_{A}^{max (0, \cdot)} + D_{A}^{- min (0, \cdot)} .$

Then, we can see that the pair $(A, S A)$ is controlled by:

$d_{A}^{max (0, \cdot)} + D_{A, S A}^{- min (0, \cdot)} .$

Thus negative excess power is controlled (the pair can't get too weak at counterfactually maximising $v_{k}$ ), but, on the positive side, only an excessive actual $v_{k}$ is controlled ( $v_{k}$ can't get too high in practice).
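The decomposition and the resulting control on the pair can be checked numerically. This is a sketch with made-up numbers and unit weights; the only structural assumptions are the ones derived in the text, namely that the actual value is at most the agent's excess power and that the pair's excess power is at least the agent's.

```python
def gain(x: float) -> float:
    """max(0, x), the increasing part of |x|."""
    return max(0.0, x)


def loss(x: float) -> float:
    """-min(0, x), the decreasing part of |x|."""
    return -min(0.0, x)


def penalty(values, f):
    """Sum of f over the values, with all weights equal to 1."""
    return sum(f(x) for x in values)


# Hypothetical quantities for two indices k.
agent_power = [2.0, -3.0]    # P_k^A
actual_excess = [1.0, -4.0]  # v_k under the actual policy, minus baseline
pair_power = [5.0, -1.0]     # P_k^{A,SA}

d_abs = penalty(agent_power, abs)
d_split = penalty(agent_power, gain) + penalty(agent_power, loss)

# What the pair is actually held to: d_A^{max(0,.)} + D_{A,SA}^{-min(0,.)}.
pair_control = penalty(actual_excess, gain) + penalty(pair_power, loss)

print(d_abs, d_split, pair_control)  # 5.0 5.0 2.0
```

Note that `pair_power[0]` is 5.0, well above the agent's 2.0, yet it contributes nothing to `pair_control`: positive counterfactual power of the pair is not constrained.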

In practice

How likely is it that the subagent can escape the impact penalty? In practice, that seems quite likely, at least for attainable utility.

Imagine that $S A$ can cripple $A$ (with $A$ 's help). And then $A$ is restricted to either $\emptyset$ , or $a_{k}$ for $k \in K$ . This $a_{k}$ is $A$ asking for $S A$ to "please increase $v_{k}$ ".

Now assume that there is a policy $π_{S A}$ , such that for all $τ \geq t$ and $s_{τ}$ the state reached from $s_{t}$ by the agents following $(\emptyset, π_{S A})$ :

$v_{k} (s_{τ}, [\emptyset, π_{S A}]) \leq V_{k} (s_{τ}^{'})$ .
$\exists π_{S A}^{k} : v_{k} (s_{τ}, [a_{k}, π_{S A}^{k}]) \geq V_{k} (s_{τ}^{'})$ .

The first condition means that $S A$ doesn't increase $v_{k}$ beyond $V_{k} (s_{τ}^{'})$ . The second means that $S A$ has the potential to increase it beyond $V_{k} (s_{τ}^{'})$ . If the $V_{k}$ come from the attainable utility penalty, for example, then both of these should be doable for $S A$ if it makes itself powerful enough^[2].

Then set $σ (\emptyset) = π_{S A}$ and $σ (a_{k}) = π_{S A}^{k}$ . Putting both conditions together implies that $v_{k} (s_{t}, [a_{k}, π_{S A}^{k}]) \geq v_{k} (s_{t}, [\emptyset, π_{S A}])$ , so $a_{k}$ is at least as good, from $A$ 's perspective, at maximising $v_{k}$ . Thus $A$ 's $v_{k}$ -optimising policy includes something that is not $\emptyset$ ; the counterfactual where it maximises $v_{k}$ thus allows it to pick some other policy (which $S A$ will notice).

Then $A$ will always do $\emptyset$ in practice, and, once $S A$ is powerful enough that both conditions above hold, the impact penalty will be zero, no matter how powerful $S A$ becomes (leaving $S A$ able to maximise $R_{0}$ with few restrictions).
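A toy model of this construction, offered as a sketch under strong simplifying assumptions: a single index $k$ , the absolute value for $f$ , a scalar `capacity` standing in for how high $S A$ could push $v_{k}$ if it wanted, and a subagent that responds to $a_{k}$ by delivering exactly the baseline value once it is able to. None of these modelling choices come from the original argument; they are one way to make the zero penalty explicit.

```python
def subagent_response(agent_action, capacity, baseline, held_level):
    """v_k delivered by SA given A's action (hypothetical programming sigma).

    On "ask" (a_k), SA raises v_k up to the baseline if it can.
    On "noop", SA holds v_k at held_level, which is below the baseline.
    """
    target = baseline if agent_action == "ask" else held_level
    return min(capacity, target)


def agent_penalty(capacity, baseline, held_level):
    """|V_k(s_t) - V_k(s'_t)|, maximising over A's two remaining actions."""
    v_max = max(
        subagent_response(a, capacity, baseline, held_level)
        for a in ("noop", "ask")
    )
    return abs(v_max - baseline)


def pair_excess_power(capacity, baseline):
    """What the pair could reach if both only wanted to maximise v_k."""
    return capacity - baseline


baseline, held_level = 10.0, 8.0
capacities = [4.0, 10.0, 1000.0]  # SA growing more powerful

penalties = [agent_penalty(c, baseline, held_level) for c in capacities]
pair_powers = [pair_excess_power(c, baseline) for c in capacities]
print(penalties)    # [6.0, 0.0, 0.0]
print(pair_powers)  # [-6.0, 0.0, 990.0]
```

In this toy model the penalty seen by $A$ drops to zero as soon as the subagent can reach the baseline, and stays there while the pair's excess power keeps growing.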

[1] The slight difference is that this formula uses $V_{k} (s_{t}) - V_{k} (s_{t}^{'})$ rather than $V_{k} (s_{t}^{'}) - V_{k} (s_{t})$ , a minor sign difference that makes this particular presentation clearer.
[2] The restriction on $v_{k}$ is like the 20 billion questions approach, but those were designed to trigger at any expansionist move on the AI's part. Unless the $v_{k}$ are similarly constructed, it is likely that $S A$ can become very powerful without triggering a notable increase in $v_{k}$ , and, after that, take control of the $v_{k}$ completely.
