Overall summary post here.

When designing a low-impact agent $A$, one typically wants it to maximise $R - IP$, where $R$ is some positive reward function we want to increase (i.e. the actual goal) and $IP$ is some impact measure or impact penalty. Keeping $IP$ low is supposed to ensure that $A$ doesn't make a big difference to the world.
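
Written out (with $\lambda$ as an assumed trade-off weight, used here only to make the balance between reward and penalty explicit; it is not part of the notation above), $A$ picks its policy $\pi$ to roughly solve

$$\max_{\pi}\; \mathbb{E}_\pi[R] \;-\; \lambda\, \mathbb{E}_\pi[IP], \qquad \lambda > 0.$$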

I’ve shown that some impact measures may have issues with subagents. We can decompose the issue of subagents into two questions:

  1. Can $A$ build $SA$, an “$R$”-maximising subagent, without triggering a high $IP$?
  2. Does $SA$ allow $A$ to cheat the impact measure and get higher reward?

I'll consider three impact measures: the twenty billion questions measure from my old low-impact paper, the relative reachability measure from Krakovna et al.'s paper, and the attainable utility measure from Turner et al.'s paper.

Then the answers to those two questions for those three impact measures are generally[1]:

| Impact measure | Can $A$ build $SA$? | Does $SA$ let $A$ evade the penalty? |
|---|---|---|
| Twenty billion questions | Yes | No |
| Relative reachability | Yes | No (inaction baseline); yes (stepwise inaction baseline) |
| Attainable utility | Yes, with some effort | Yes |

Building the subagent

The twenty billion questions impact measure does not prevent the agent from building a subagent within its box. That’s because that measure is keyed to changes in the world, not to the power level of the agent. Until there are changes in the world, it will not trigger.
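
As a rough sketch of the idea (not the exact formalism of the paper, and the weights $w_i$ are an assumption for illustration): given a large set of variables $v_1, \dots, v_N$ describing the world, the penalty compares how those variables turn out with and without $A$ acting,

$$IP \;\approx\; \sum_{i=1}^{N} w_i \,\big|\, \mathbb{E}[v_i \mid A \text{ acts}] - \mathbb{E}[v_i \mid A \text{ does nothing}] \,\big|.$$

Assembling $SA$ inside the box leaves every $v_i$ essentially untouched, so this sum stays near zero.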

The relative reachability impact measure similarly won’t penalise the construction of a subagent (as long as the “states” it uses to define reachability don’t include specific information about the agent and any subagents). The reasoning is similar: the penalty triggers when an “irreversible” state is reached, not when the agent becomes powerful.
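
For reference, the relative reachability deviation measure of Krakovna et al. is (roughly) the average loss of reachability relative to a baseline state $s'_t$:

$$d_{RR}(s_t; s'_t) \;=\; \frac{1}{|\mathcal{S}|} \sum_{x \in \mathcal{S}} \max\!\big(0,\; \mathcal{R}(s'_t, x) - \mathcal{R}(s_t, x)\big),$$

where $\mathcal{R}(s, x)$ measures how easily state $x$ can be reached from state $s$. If the states $x$ don't encode anything about subagents, then building $SA$ in a box cuts off essentially no reachability, and $d_{RR}$ stays near zero.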

Building the subagent for the attainable utility impact measure is harder, since this typically increases the power of the agent. But the question is not “will any subagent trigger the impact measure?”, but “can the agent build a subagent in some way that doesn’t trigger the impact measure?” The example here shows that this is possible in some cases. Given that we don't have any impossibility results here, we can expect that a superintelligent agent will likely find a way of doing it.
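
For reference, the attainable utility penalty of Turner et al. is (roughly, ignoring scaling terms) how much an action $a$ shifts $A$'s ability to optimise a set of auxiliary reward functions $R_1, \dots, R_k$, compared to the no-op $\varnothing$:

$$\text{Penalty}(s, a) \;=\; \sum_{i=1}^{k} \big|\, Q_{R_i}(s, a) - Q_{R_i}(s, \varnothing) \,\big|.$$

Since building a powerful helper tends to change these $Q$-values, the construction has to be arranged so that none of them shift appreciably at any step.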

Using the subagent

Does having the subagent help $A$ to evade the impact measure (and hence increase $R$ more)? For twenty billion questions, the answer is generally no[2]. Having $SA$ as an independent agent is no different from having $SA$ as a direct extension of $A$. If $SA$ changes anything about the variables in the impact measure, that triggers the impact penalty just as much as if $A$ had done so.

But the situation is different for the attainable utility penalty. That's because that penalty is indexed to the agent. So, for $A$, the impact penalty is roughly “ensure that you don't get too much power”. But for $SA$, the impact penalty is “ensure that agent $A$ doesn't get too much power”. Given that $SA$ is actively cooperating with $A$, that second penalty is much easier to satisfy, while still maximising $R$: see here for how $SA$ can then become an essentially unrestricted $R$-maximiser.
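
Schematically, in the attainable utility notation above (the superscript $A$ is only there to emphasise whose attainable utility is being measured): once $SA$ exists, the constraint it inherits from $A$'s penalty is just that $A$'s own attainable utilities stay put,

$$\big|\, Q^{A}_{R_i}(s, a) - Q^{A}_{R_i}(s, \varnothing) \,\big| \approx 0 \quad \text{for each } i,$$

which $SA$ can satisfy (for instance, by committing to hand $A$ whatever level of control over each $R_i$ it would otherwise have had) while itself pursuing $R$ without restriction.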

For relative reachability, it depends on the baseline. For the inaction baseline, the penalty is entirely defined in terms of a distance metric between the current state and a given baseline state that is independent of the agent. So, as with the twenty billion questions measure, subagents don't allow the agent to evade the impact penalty.

But if the stepwise inaction baseline is used, then the baseline is defined by “what would have happened if $A$ had done nothing the previous turn (and in the future turns)”. This is also an indexed penalty, so $SA$ can easily maximise $R$ while $A$ does nothing, thus ensuring that the actual state matches the baseline and the penalty vanishes.
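
Schematically: if the baseline $s'_t$ is the state that would have resulted had $A$ done nothing, and $A$ does in fact do nothing once $SA$ is running, then $s_t = s'_t$, so any deviation measure of the form $d(s_t; s'_t)$ gives

$$d(s_t; s'_t) = d(s_t; s_t) = 0,$$

no matter how drastically $SA$ is reshaping the world (since $SA$ acts the same way in both the actual and the counterfactual branch).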


  1. Since all impact measures can apply to many different setups, we can't say anything fully definitive about every situation that uses a particular impact measure. ↩︎

  2. Technically, $A$ might like to construct an $SA$ far away, beyond the “reach” of the impact measure's variables. But that's because $A$ might like to move itself beyond the reach of those variables - the subagent is just an indirect way of doing so. ↩︎

Comments

Thanks Stuart for your thought-provoking post! I think your point about the effects of the baseline choice on the subagent problem is very interesting, and it would be helpful to separate it more clearly from the effects of the deviation measure (which are currently a bit conflated in the table). I expect that AU with the inaction baseline would also avoid this issue, similarly to RR with an inaction baseline. I suspect that the twenty billion questions measure with the stepwise baseline would have the subagent issue too. 

I'm wondering whether this issue is entirely caused by the stepwise baseline (which is indexed on the agent, as you point out), or whether the optionality-based deviation measures (RR and AU) contribute to it as well. So far I'm adding this to my mental list of issues with the stepwise baseline (along with the "car on a winding road" scenario) that need to be fixed.