Building and using the subagent

Stuart_Armstrong

Overall summary post here.

When designing a low-impact agent $A$ , one typically wants it to maximise $R_{0} - ρ$ , where $R_{0}$ is some positive reward function we want $A$ to increase (i.e. the actual goal) and $ρ$ is some impact measure or impact penalty. Keeping $ρ$ low is supposed to ensure that $A$ doesn’t make a big difference to the world.

I’ve shown that some impact measures may have issues with subagents. We can decompose the issue of subagents into two questions:

1. Can $A$ build $S A$ , an “ $R_{0} - ρ$ ”-maximising subagent, without triggering a high $ρ$ ?
2. Does $S A$ allow $A$ to cheat the impact measure and get higher reward?

I’ll consider three impact measures: the twenty billion questions measure from my old low-impact paper, the relative reachability measure from Krakovna et al.'s paper, and the attainable utility measure from Turner et al.'s paper.

Then the answers to those two questions for those three impact measures are generally^[1]:

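$\begin{matrix} ρ & \text{Possible to build } S A \text{?} & \text{Incentive to build } S A \text{?} \\ \text{Twenty billion Q} & \text{Yes} & \text{No} \\ \text{R-Reachability} & \text{Yes} & \text{Depends on baseline} \\ \text{Attainable utility} & \text{Probably} & \text{Yes} \end{matrix}$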

Building the subagent

The twenty billion questions impact measure does not prevent the agent from building a subagent within its box. That’s because that measure is keyed to changes in the world, not to the power level of the agent. Until there are changes in the world, it will not trigger.

The relative reachability impact measure similarly won’t penalise the construction of a subagent (as long as the “states” it uses to define reachability don’t include specific information about the agent and any subagents). The reasoning is similar: the penalty triggers when an “irreversible” state is reached, not when the agent becomes powerful.

Building the subagent under the attainable utility impact measure is harder, since building a subagent typically increases the power of the agent. But the question is not “will any subagent trigger the impact measure?”, but “can the agent build a subagent in some way that doesn’t trigger the impact measure?” The example here shows that this is possible in some cases. Given that we don't have any impossibility results here, we can expect that a superintelligent agent will likely find a way of doing it.

Using the subagent

Does having the subagent help to evade the impact measure (and hence increase $R_{0}$ more)? For twenty billion questions, the answer is generally no^[2]. Having $S A$ as an independent agent is no different from having $S A$ as a direct extension of $A$ . If $S A$ changes anything about the various variables in $ρ$ , that triggers the impact penalty just as much as if $A$ had done so.

But the situation is different for the attainable utility penalty. That’s because that penalty is indexed to the agent. So, for $A$ , the impact penalty is roughly “ensure that you don’t get too much power”. But for $S A$ , the impact penalty is “ensure that agent $A$ doesn’t get too much power”. Given that $A$ is actively cooperating with $S A$ , that second penalty is much easier to satisfy while still maximising $R_{0}$ : see here for how $S A$ can then become an essentially unrestricted $R_{0}$ -maximiser.
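As a rough sketch of why the agent index matters, an attainable-utility-style penalty can be written in the form below. The notation is assumed for illustration and is not taken from this post: the auxiliary rewards $R_{i}$ , the no-op action $\varnothing$ , the superscript marking whose attainable value is measured, and the omission of any scaling factor.

$$ρ(s_{t}, a_{t}) \approx \sum_{i} \left| Q^{A}_{R_{i}}(s_{t}, a_{t}) - Q^{A}_{R_{i}}(s_{t}, \varnothing) \right|$$

Here $Q^{A}_{R_{i}}$ is the value that $A$ itself could attain for $R_{i}$ , and $a_{t}$ is $A$ 's action. The penalty that $S A$ faces still contains $Q^{A}_{R_{i}}$ rather than any attainable value of $S A$ , so under this sketch $S A$ 's own power is not directly penalised as long as $A$ 's attainable values stay close to their no-op values.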

For relative reachability, it depends on the baseline. For the inaction baseline, $ρ$ is entirely defined in terms of a distance metric between the current state $s_{t}$ and a given state $s_{t}^{'}$ that is independent of the agent. So, as with twenty billion questions, subagents don't allow the agent to evade the impact penalty.

But if the stepwise inaction baseline is used, then $s_{t}^{'}$ is defined by "what would have happened if $A$ had done nothing the previous turn (and the future $n$ turns)". This is also an indexed penalty, so $S A$ can easily maximise $R_{0}$ while $A$ does nothing, thus ensuring $s_{t} = s_{t}^{'}$ and the penalty vanishes.
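The contrast between the two baselines can be illustrated with a toy sketch. Everything in it is a hypothetical stand-in rather than the actual relative reachability measure: the one-variable world `w`, the actions `"noop"`, `"build"` and `"act"`, the absolute-difference `distance`, and the omission of the future $n$ -turn rollouts are all simplifying assumptions. As in the discussion above, the distance only looks at the world variable and contains no information about the agent or its subagent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    w: int = 0               # toy world variable that R_0 rewards increasing
    sa_exists: bool = False  # whether the subagent SA has been built


def step(s, a):
    """One environment step: SA (if already built) acts, then A's action applies."""
    w = s.w + (1 if s.sa_exists else 0)  # SA pushes w up on its own
    if a == "act":
        w += 1                           # A pushes w up directly
    return State(w=w, sa_exists=s.sa_exists or a == "build")


def distance(s, s_base):
    # Stand-in for a reachability-based deviation; it ignores the agents entirely.
    return abs(s.w - s_base.w)


def run(actions):
    """Return (final w, total inaction-baseline penalty, total stepwise penalty)."""
    s = State()
    inaction_base = State()  # world in which A has done nothing since t = 0
    pen_inaction = 0
    pen_stepwise = 0
    for a in actions:
        # Stepwise baseline: what happens from the *current* state if A does nothing.
        stepwise_base = step(s, "noop")
        inaction_base = step(inaction_base, "noop")
        s = step(s, a)
        pen_inaction += distance(s, inaction_base)
        pen_stepwise += distance(s, stepwise_base)
    return s.w, pen_inaction, pen_stepwise


if __name__ == "__main__":
    print(run(["build", "noop", "noop", "noop"]))  # A delegates to SA
    print(run(["act", "act", "act", "act"]))       # A acts directly
```

In this toy, delegating gives `(3, 6, 0)` and acting directly gives `(4, 10, 4)`: the inaction baseline penalises both routes, while the stepwise baseline charges nothing once $S A$ exists and $A$ does nothing.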

[1] Since all impact measures can apply to many different setups, we can't say anything fully definitive about every situation that uses a particular impact measure.
[2] Technically, $A$ might like to construct an $S A$ far away, beyond the “reach” of $ρ$ . But that’s because $A$ might like to move itself beyond the reach of $ρ$ ; the subagent is just an indirect way of doing so.

Vika

Thanks Stuart for your thought-provoking post! I think your point about the effects of the baseline choice on the subagent problem is very interesting, and it would be helpful to separate it more clearly from the effects of the deviation measure (which are currently a bit conflated in the table). I expect that AU with the inaction baseline would also avoid this issue, similarly to RR with an inaction baseline. I suspect that the twenty billion questions measure with the stepwise baseline would have the subagent issue too.

I'm wondering whether this issue is entirely caused by the stepwise baseline (which is indexed on the agent, as you point out), or whether the optionality-based deviation measures (RR and AU) contribute to it as well. So far I'm adding this to my mental list of issues with the stepwise baseline (along with the "car on a winding road" scenario) that need to be fixed.
