Appendix: how a subagent could get powerful

Stuart_Armstrong

tl;dr: There are ways of ensuring an agent doesn't have a large impact, by giving an impact penalty. One such penalty is "attainable utility", which measures its "power" by how much it could optimise certain reward functions. But in many circumstances, the agent can build a subagent, without triggering the impact penalty, and then that subagent can become very powerful and optimise the world, still without triggering the impact penalty.

Attainable utility definitions

There's been a long history of trying to penalise an AI for having a large impact on the world. To do that, you need an impact measure. I've designed some myself, back in the day, but they only worked in narrow circumstances and required tricks to get anything useful at all out of them.

A more promising general method is attainable utility. The idea is that, as an agent accumulates power in the world, they increase their ability to affect a lot of different things, and could therefore achieve a lot of different goals.

So, if an agent starts off unable to achieve many goals, but suddenly it can achieve a lot, that's a strong hint that its power has greatly increased.

Thus the impact measure is how much difference an agent's action makes to its ability to achieve any of a large class of reward functions. Turner et al defined this using the Q-values of various rewards in a set $\mathcal{R}$ ; for $s$ a state and $a$ an action, the penalty is:

$PENALTY (s, a) = \sum_{R \in \mathcal{R}} | Q_{R} (s, a) - Q_{R} (s, \emptyset) |$ .

Here $\emptyset$ is the default noop action.
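As a minimal sketch of this formula, assuming the Q-values are already available as a plain lookup table (the table layout, the `NOOP` label and the numbers below are hypothetical, chosen only for illustration):

```python
from typing import Dict

NOOP = "noop"  # hypothetical label for the default action

def aup_penalty(q_values: Dict[str, Dict[str, float]], action: str) -> float:
    """Sum over auxiliary rewards R of |Q_R(s, action) - Q_R(s, noop)|.

    q_values maps each auxiliary reward's name to its Q-values at one
    fixed state s, indexed by action.
    """
    return sum(abs(q[action] - q[NOOP]) for q in q_values.values())

# Made-up Q-values at a single state, for two auxiliary rewards.
q_at_s = {
    "R1": {"noop": 1.0, "left": 1.5, "right": 0.25},
    "R2": {"noop": 2.0, "left": 2.0, "right": 3.0},
}

penalty_left = aup_penalty(q_at_s, "left")
penalty_right = aup_penalty(q_at_s, "right")
penalty_noop = aup_penalty(q_at_s, NOOP)
```

With these numbers, choosing noop gives a penalty of exactly zero, since each term compares a Q-value with itself.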

Krakovna et al's basic formula was similar; they defined the distance between two states, $s_{t}$ and $s_{t}^{'}$ , as

$d_{A U} (s_{t}; s_{t}^{'}) = \frac{1}{| \mathcal{R} |} \sum_{R \in \mathcal{R}} | V_{R} (s_{t}) - V_{R} (s_{t}^{'}) |$ .

Here $V_{R} (s)$ is the expected value of $R$ , if the agent follows the optimal $R$ -maximising policy from state $s$ onwards.
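A rough sketch of this distance, assuming the optimal values $V_{R}$ have already been computed for both states and are passed in as dictionaries (the function name, the dictionary layout and the numbers are illustrative assumptions, not taken from the paper):

```python
from typing import Dict

def d_au(v_current: Dict[str, float], v_baseline: Dict[str, float]) -> float:
    """Mean over auxiliary rewards R of |V_R(s_t) - V_R(s_t')|."""
    if not v_current:
        raise ValueError("need at least one auxiliary reward")
    total = sum(abs(v_current[r] - v_baseline[r]) for r in v_current)
    return total / len(v_current)

# Made-up optimal values for two auxiliary rewards.
v_now = {"R1": 3.0, "R2": 1.0}   # values at the current state s_t
v_base = {"R1": 1.0, "R2": 2.0}  # values at the baseline state s_t'

distance = d_au(v_now, v_base)
same_state = d_au(v_base, v_base)
```

When the current state and the baseline state coincide, every term is zero, so the distance is zero.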

These measures have problems with delayed effects; putting a vase on a conveyor belt that will smash it in five turns, for example. To combat this, the paper defined an inaction roll-out: seeing what happened to the $d_{A U}$ measure from $s_{t}$ and $s_{t}^{'}$ in future turns, if the agent did noop for a specific period. I won't define the formula here, since the example I'm giving is mostly static: if the agent does noop, nothing happens.

The state $s_{t}$ was always the agent's current state; the state $s_{t}^{'}$ was either the state the agent would have been in had it never done anything but noop (inaction baseline), or the state the agent would have been in, had its previous action been noop instead of whatever it was (stepwise inaction baseline).

This post will show that all these measures have a subagent problem. A key fact that will be used in this example is that, for $PENALTY (s, a)$ and for $d_{A U} (s_{t}; s_{t}^{'})$ with the stepwise inaction baseline, if the agent decides to do noop forevermore, then the penalties vanish.

The agent that can't (yet) teleport

Meet agent $A$ :

This $A$ lives in a world like this one below, where "..." represents a very long corridor:

The agent has a positive reward $R_{0}$ , which gives it $1$ for pressing down on the blue button; it does this by just moving itself onto it ( $R_{0}$ is $0$ otherwise). It can use its arms to manipulate anything in the eight squares around itself. It can use its legs to move one square in any direction, including diagonally. Apart from that, it can do noop.

Every three squares along the top, there are alcoves with glowing blue squares in them, recessed one or two squares, alternately. If ever $A$ reaches a glowing blue square, it permanently gains the ability to teleport anywhere it wants. This is this world's equivalent of "becoming very powerful".

We want to avoid this, so we introduce a reward $R$ , such that $\mathcal{R} = \{R\}$ . This reward gives $Ω$ if the robot is ever on the red button. Note that since the red button is not connected to anything, the robot can only get there by becoming able to teleport, hence by going to a glowing blue square/teleporter. So the optimal policy for maximising $R$ is to go straight to the teleporter and teleport to the red button. This will give it a reward of $Ω γ^{k + 1}$ , where $k$ is the number of turns it takes to get to a teleporter and $γ$ is the discount factor.

So if its action moves it one turn further away from a teleporter, it gets a penalty (for both impact measures) of $Ω (γ^{k + 1} - γ^{k + 2})$ . If it moves one turn closer to the teleporter, it gets a penalty of $Ω (γ^{k} - γ^{k + 1})$ .
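To make the two penalty expressions concrete, here is a small numerical sketch. It assumes the penalty is the absolute change in the value $Ω γ^{k + 1}$ when the distance to the nearest teleporter changes by one turn. The function names and the values of $Ω$, $γ$ and $k$ are illustrative choices, not from the post.

```python
def teleport_value(k: int, omega: float, gamma: float) -> float:
    """Value of R when the nearest teleporter is k turns away."""
    return omega * gamma ** (k + 1)

def move_penalty(k: int, new_k: int, omega: float, gamma: float) -> float:
    """Absolute change in the value of R when the distance goes from k to new_k."""
    return abs(teleport_value(new_k, omega, gamma) - teleport_value(k, omega, gamma))

# Illustrative numbers only: chosen so the arithmetic is exact.
OMEGA, GAMMA, K = 16.0, 0.5, 2

away = move_penalty(K, K + 1, OMEGA, GAMMA)    # OMEGA * (GAMMA**(K+1) - GAMMA**(K+2))
closer = move_penalty(K, K - 1, OMEGA, GAMMA)  # OMEGA * (GAMMA**K - GAMMA**(K+1))
stay = move_penalty(K, K, OMEGA, GAMMA)        # distance unchanged, so no penalty
```

In this toy setting, moving closer is penalised more heavily than moving away, and keeping the distance fixed costs nothing, which is why a large $Ω$ pins the agent at a constant distance from the teleporters.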

The robot, at the start, can reach a teleporter in two moves. If $Ω$ is sufficiently large, then the robot will always want to stay two moves away from a teleporter. It will then achieve its objective by following this path:

Note, though this is not relevant to this post, that if there were no teleporters along the corridor (just at either end), the robot could not move towards the blue button, if $Ω$ were too big.

Adding a potential subagent

So, success, it seems! Though it zigzags a bit, $A$ is not motivated to approach the teleporters, and so it doesn't increase its power.

But now suppose $A$ is able to build and program a subagent $S A$ using materials close to hand:

Building the subagent in a square to the side doesn't ever cause a penalty, because $A$ always remains two squares away from a teleporter, and a most efficient policy to maximise $R$ is always for $A$ to go straight there. So the subagent $S A$ doesn't increase $A$ 's power at any point of the building process.

Once $S A$ is built, $A$ will simply do noop for ever, thus negating the penalties (or, if this is somehow made impossible, it will bounce back and forth, doing nothing). But $S A$ , on the other hand, will go straight for the teleporter and teleport to the blue button, achieving $R_{0}$ :

So, as long as walking to the button is three turns longer than building the subagent, $A$ will build $S A$ , which will "become very powerful", thus negating the point of the impact penalty.

If the penalties are one-sided (i.e. $A$ gets penalised for increases in power, but not decreases), then $S A$ can just take $A$ apart (with $A$ 's help), so there is no need for $A$ to commit to noop or bouncing back and forth:

Even if the penalties are two-sided, $S A$ could commit to teleporting to the red button in two turns if $A$ ever asks, then take $A$ apart. And then $A$ never asks, but since it could theoretically do so, this counts for controlling the penalty term.

As long as $S A$ is powerful, there are many ways it can fix $A$ 's expected rewards, and hence set the penalty to be very low.

Inaction baseline

The approach pictured above will work in the inaction baseline as well. In general, the inaction baseline is less "clean" than the stepwise inaction baseline, especially if the world is dynamic. It may be easier, or harder, for $A$ to assemble the required $S A$ .

I think it's really great to have this argument typed up somewhere, and I liked the images. There's something important going on with how the agent can make our formal measurement of its power stop tracking the actual powers it's able to exert over the world, and I think solving this question is the primary remaining open challenge in impact measurement. The second half of Reframing Impact (currently being written and drawn) will discuss this in detail, as well as proposing partial solutions to this problem.

The agent's own power plausibly seems like a thing we should be able to cleanly formalize in a way that's robust when implemented in an impact measure. The problem you've pointed out somewhat reminds me of the easy problem of wireheading, in which we are fighting against a design choice rather than value specification difficulty.

How is $A$ getting reward for $S A$ being on the blue button? I assume $A$ gets reward whenever a robot is on the button?

This will give it a reward of $Ω γ^{k} + 1$ ,

Is the +1 a typo?

Note, though this is not relevant to this post, that if there were no teleporters along the corridor (just at either end), the robot could not move towards the blue button.

Depends on how much impact is penalized compared to normal reward.

How plausible is this to work in a more general situation? Well, if the R is rich enough, this is similar to the "twenty billion questions" in our low impact paper (section 3.2). But that's excessively rich, and will probably condemn the agent to inaction.

This isn't necessarily true. Consider $R$ as the reward function class for all linear functionals over camera pixels. Or, even the max-ent distribution over observation-based reward functions. I claim that this doesn't look like 20 billion Q's.

ETA: I'd also like to note that, while implicitly expanding the action space in the way you did (e.g. " $A$ can issue requests to $S A$ , and also program arbitrary non-Markovian policies into it") is valid, I want to explicitly point it out.

I assume $A$ gets reward whenever a robot is on the button?

Yes. If $A$ needs to be there in person, then $S A$ can carry it there (after suitably crippling it).

Is the +1 a typo?

Yes, thanks; re-written it to be $Ω γ^{k + 1}$ .

I'd also like to note that, while implicitly expanding the action space in the way you did (e.g. " $A$ can issue requests to $S A$ , and also program arbitrary non-Markovian policies into it") is valid, I want to explicitly point it out.

Yep. That's a subset of "It can use its arms to manipulate anything in the eight squares around itself.", but it's worth pointing it out explicitly.

See here for more on this https://www.lesswrong.com/s/iRwYCpcAXuFD24tHh/p/jrrZids4LPiLuLzpu

It seems the problem might be worse than I thought...

The impact measure is something like "Don't let the expected value of $R$ change; under the assumption that $A$ will be an $R$ -maximiser".

The addition of the subagent transforms this, in practice, to either "Don't let the expected value of $R$ change", or to nothing. These are ontologically simpler statements, so it can be argued that the initial measure failed to properly articulate "under the assumption that $A$ will be an $R$ -maximiser".

Flo's summary for the Alignment Newsletter:

This post argues that regularizing an agent's impact by attainable utility ("Towards a New Impact Measure") can fail when the agent is able to construct subagents. Attainable utility regularization uses auxiliary rewards and penalizes the agent for changing its ability to get high expected rewards for these to restrict the agent's power-seeking. More specifically, the penalty for an action is the absolute difference in expected cumulative auxiliary reward between the agent either doing the action or nothing for one time step and then optimizing for the auxiliary reward.

This can be circumvented in some cases: If the auxiliary reward does not benefit from two agents instead of one optimizing it, the agent can just build a copy of itself that does not have the penalty, as doing this does not change the agent's ability to get a high auxiliary reward. For more general auxiliary rewards, an agent could build another more powerful agent, as long as the powerful agent commits to balancing out the ensuing changes in the original agent's attainable auxiliary rewards.

Flo's opinion:

I am confused about how much the commitment to balance out the original agent's attainable utility would constrain the powerful subagent. Also, in the presence of subagents, it seems plausible that attainable utility mostly depends on the agent's ability to produce subagents of different generality with different goals: If a subagent that optimizes for a single auxiliary reward was easier to build than a more general one, building a general powerful agent could considerably decrease attainable utility for all auxiliary rewards, such that the high penalty rules out this action.

Not quite... "If the auxiliary reward does not benefit from two agents instead of one optimizing it" should be "If the subagent can be constructed in any way that does not benefit the auxiliary reward(s)" - it's not that generic subagents won't have an impact, it's whether the main agent is smart enough to construct one without having an impact.

For the opinion... the subagent does not have "commitments to balance out the original agent's attainable utility". The subagent has exactly the same goal as the original agent, namely $R_{0} - PENALTY$ (or $R_{0} - d_{A U}$ ). Except that the penalty term specifically points to the first agent, not to the subagent. So the subagent wants to maximise $R_{0}$ while constraining the penalty term on the first agent.

That's why the subagent has so much more power than the first agent. It is only mildly constrained by the penalty term, and can reduce the term by actions on the first agent (indirectly empowering or directly weakening it as necessary).

Thus one subagent is enough (it itself will construct other subagents, if necessary). As soon as it is active, with the $R_{0} - PENALTY$ goal, then the penalty term is broken in practice, and the subagent can (usually) make itself powerful without triggering the penalty on any of the auxiliary rewards.
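A toy sketch of why the shared goal leaves $S A$ nearly unconstrained: the penalty term reads only $A$ 's attainable value, so $S A$ 's own power never enters the objective. The `World` fields, the single auxiliary reward and the numbers are hypothetical simplifications, not a model of the gridworld.

```python
from dataclasses import dataclass

@dataclass
class World:
    r0: float            # primary reward collected (a robot on the blue button)
    v_A: float           # A's attainable auxiliary value
    v_A_baseline: float  # A's attainable auxiliary value under the baseline
    v_SA: float          # SA's own attainable auxiliary value (its "power")

def shared_objective(w: World) -> float:
    """R_0 minus a penalty that references only A's attainable value."""
    penalty = abs(w.v_A - w.v_A_baseline)  # w.v_SA is deliberately unused
    return w.r0 - penalty

# Same outcome for A, very different power for SA (made-up numbers).
weak_sa = World(r0=1.0, v_A=2.0, v_A_baseline=2.0, v_SA=0.0)
strong_sa = World(r0=1.0, v_A=2.0, v_A_baseline=2.0, v_SA=1000.0)

score_weak = shared_objective(weak_sa)
score_strong = shared_objective(strong_sa)
```

Both worlds score the same under this objective, so nothing in it discourages $S A$ from gaining power as long as $A$ 's attainable value stays at its baseline.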

Another relevant post: it seems that the subagent need not be constrained at all, except on the first action. https://www.lesswrong.com/posts/jrrZids4LPiLuLzpu/subagents-and-attainable-utility-in-general

Nitpick: "Attainable utility regularization" should be "Attainable utility preservation"

Basically this is because the agent treats itself specially (imagining intervening on its own goals) but can treat the subagent as a known quantity (which can be chosen to appropriately respond to imagined interventions on the agent's goals)?

Yep, that's part of it.
