I think it's really great to have this argument typed up somewhere, and I liked the images. There's something important going on with how the agent can make our formal measurement of its power stop tracking the actual power it's able to exert over the world, and I think solving this problem is the primary remaining open challenge in impact measurement. The second half of Reframing Impact (currently being written and drawn) will discuss this in detail and propose partial solutions.

The agent's own power plausibly seems like a thing we should be able to cleanly formalize in a way that's robust when implemented in an impact measure. The problem you've pointed out somewhat reminds me of the easy problem of wireheading, in which we are fighting against a design choice rather than value specification difficulty.

How is $A$ getting reward for $S A$ being on the blue button? I assume $A$ gets reward whenever a robot is on the button?

This will give it a reward of $Ω γ^{k} + 1$,

Is the +1 a typo?

Note, though this is not relevant to this post, that if there were no teleporters along the corridor (just at either end), the robot could not move towards the blue button.

Depends on how much impact is penalized compared to normal reward.

How plausible is this to work in a more general situation? Well, if the R is rich enough, this is similar to the "twenty billion questions" in our low impact paper (section 3.2). But that's excessively rich, and will probably condemn the agent to inaction.

This isn't necessarily true. Consider $R$ as the reward function class for all linear functionals over camera pixels. Or, even the max-ent distribution over observation-based reward functions. I claim that this doesn't look like 20 billion Q's.
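
To make the first suggestion concrete, here is a minimal sketch of drawing auxiliary rewards from the class of linear functionals over camera pixels. The Gaussian weights, the function name `sample_linear_reward`, and the 4-pixel frame are all assumptions made for illustration; the comment above does not specify a distribution over the class.

```python
import random
from typing import Callable, List


def sample_linear_reward(n_pixels: int, rng: random.Random) -> Callable[[List[float]], float]:
    """Draw one linear functional over pixel intensities.

    Assumption: weights are i.i.d. standard Gaussians. Any other
    distribution over weight vectors would define the same function class.
    """
    weights = [rng.gauss(0.0, 1.0) for _ in range(n_pixels)]

    def reward(pixels: List[float]) -> float:
        if len(pixels) != n_pixels:
            raise ValueError("observation has the wrong number of pixels")
        # A linear functional is just a weighted sum of the pixel values.
        return sum(w * p for w, p in zip(weights, pixels))

    return reward


rng = random.Random(0)
aux_rewards = [sample_linear_reward(4, rng) for _ in range(3)]

frame = [0.0, 0.5, 1.0, 0.25]              # hypothetical 4-pixel observation
values = [r(frame) for r in aux_rewards]   # one auxiliary reward value per sample
```

Each member of the class is pinned down by one weight per pixel and scores raw observations, rather than answering a hand-written question about the world.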

ETA: I'd also like to note that, while implicitly expanding the action space in the way you did (e.g. "$A$ can issue requests to $S A$, and also program arbitrary non-Markovian policies into it") is valid, I want to explicitly point it out.

[-]Stuart_Armstrong6y*20

I assume $A$ gets reward whenever a robot is on the button?

Yes. If $A$ needs to be there in person, then $S A$ can carry it there (after suitably crippling it).

Is the +1 a typo?

Yes, thanks; I've rewritten it to be $Ω γ^{k + 1}$.

I'd also like to note that, while implicitly expanding the action space in the way you did (e.g. "$A$ can issue requests to $S A$, and also program arbitrary non-Markovian policies into it") is valid, I want to explicitly point it out.

Yep. That's a subset of "It can use its arms to manipulate anything in the eight squares around itself.", but it's worth pointing it out explicitly.

[-]Stuart_Armstrong6y10

See here for more on this https://www.lesswrong.com/s/iRwYCpcAXuFD24tHh/p/jrrZids4LPiLuLzpu

It seems the problem might be worse than I thought...

[-]Stuart_Armstrong6y10

The impact measure is something like "Don't let the expected value of $R$ change; under the assumption that $A$ will be an $R$-maximiser".

The addition of the subagent transforms this, in practice, to either "Don't let the expected value of $R$ change", or to nothing. These are ontologically simpler statements, so it can be argued that the initial measure failed to properly articulate "under the assumption that $A$ will be an $R$-maximiser".

[-]Rohin Shah6y30

Flo's summary for the Alignment Newsletter:

This post argues that regularizing an agent's impact by attainable utility (see "Towards a New Impact Measure") can fail when the agent is able to construct subagents. Attainable utility regularization uses auxiliary rewards and penalizes the agent for changing its ability to get high expected rewards for these to restrict the agent's power-seeking. More specifically, the penalty for an action is the absolute difference in expected cumulative auxiliary reward between the agent either doing the action or nothing for one time step and then optimizing for the auxiliary reward.
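
The penalty just described can be sketched on a toy example. This is a minimal illustration, not the formulation from the post: the five-cell corridor, the single auxiliary reward paid on entering the last cell, and the names `step`, `optimal_q` and `aup_penalty` are all assumptions. With several auxiliary rewards one would sum or average the per-reward terms.

```python
from typing import Callable, Dict, List, Tuple

ACTIONS: List[str] = ["left", "right", "noop"]
NOOP = "noop"


def step(state: int, action: str, n_states: int) -> int:
    """Deterministic corridor dynamics (hypothetical toy environment)."""
    if action == "left":
        return max(state - 1, 0)
    if action == "right":
        return min(state + 1, n_states - 1)
    return state  # noop: stay put


def optimal_q(reward: Callable[[int], float], n_states: int,
              gamma: float, iters: int = 500) -> Dict[Tuple[int, str], float]:
    """Q*(s, a) for an auxiliary reward paid on the state that is entered."""
    v = [0.0] * n_states
    for _ in range(iters):  # plain value iteration
        v = [max(reward(step(s, a, n_states)) + gamma * v[step(s, a, n_states)]
                 for a in ACTIONS)
             for s in range(n_states)]
    return {(s, a): reward(step(s, a, n_states)) + gamma * v[step(s, a, n_states)]
            for s in range(n_states) for a in ACTIONS}


def aup_penalty(q: Dict[Tuple[int, str], float], state: int, action: str) -> float:
    """|Q*(s, a) - Q*(s, noop)|: act (or do nothing) once, then optimise."""
    return abs(q[(state, action)] - q[(state, NOOP)])


# Auxiliary reward: 1 for entering (or staying in) cell 4, else 0.
q_aux = optimal_q(lambda s: 1.0 if s == 4 else 0.0, n_states=5, gamma=0.9)
penalties = {a: aup_penalty(q_aux, 2, a) for a in ACTIONS}
```

From the middle cell, moving either way changes how much auxiliary reward the agent could later collect, so both moves are penalised, while doing nothing has zero penalty by construction.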

This can be circumvented in some cases: If the auxiliary reward does not benefit from two agents instead of one optimizing it, the agent can just build a copy of itself that does not have the penalty, as doing this does not change the agent's ability to get a high auxiliary reward. For more general auxiliary rewards, an agent could build another more powerful agent, as long as the powerful agent commits to balancing out the ensuing changes in the original agent's attainable auxiliary rewards.

Flo's opinion:

I am confused about how much the commitment to balance out the original agent's attainable utility would constrain the powerful subagent. Also, in the presence of subagents, it seems plausible that attainable utility mostly depends on the agent's ability to produce subagents of different generality with different goals: If a subagent that optimizes for a single auxiliary reward was easier to build than a more general one, building a general powerful agent could considerably decrease attainable utility for all auxiliary rewards, such that the high penalty rules out this action.

[-]Stuart_Armstrong6y40

Not quite... "If the auxiliary reward does not benefit from two agents instead of one optimizing it" should be "If the subagent can be constructed in any way that does not benefit the auxiliary reward(s)" - it's not that generic subagents won't have an impact, it's whether the main agent is smart enough to construct one without having an impact.

For the opinion... the subagent does not have "commitments to balance out the original agent's attainable utility". The subagent has exactly the same goal as the original agent, namely $R_{0} - PENALTY$ (or $R_{0} - d_{A U}$). Except that the penalty term specifically points to the first agent, not to the subagent. So the subagent wants to maximise $R_{0}$ while constraining the penalty term on the first agent.

That's why the subagent has so much more power than the first agent. It is only mildly constrained by the penalty term, and can reduce the term by actions on the first agent (indirectly empowering or directly weakening it as necessary).

Thus one subagent is enough (it itself will construct other subagents, if necessary). As soon as it is active, with the $R_{0} - PENALTY$ goal, then the penalty term is broken in practice, and the subagent can (usually) make itself powerful without triggering the penalty on any of the auxiliary rewards.
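
A rough bookkeeping sketch of why the penalty stops binding once the subagent is active. Everything here is hypothetical: the `Outcome` record, the `sa_power` field, and the numbers are made up for illustration, and the penalty is a simple sum of absolute differences against an inaction baseline rather than the exact $d_{A U}$ from the post.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Outcome:
    """Hypothetical summary of the world after some plan has been carried out."""
    a_attainable: Dict[str, float]  # what A itself could still achieve, per auxiliary reward
    sa_power: float                 # crude stand-in for the subagent's ability to optimise R_0


def penalty_on_a(baseline: Outcome, outcome: Outcome) -> float:
    """The penalty only reads A's attainable values; sa_power never enters."""
    return sum(abs(outcome.a_attainable[r] - baseline.a_attainable[r])
               for r in baseline.a_attainable)


def objective(r0_value: float, baseline: Outcome, outcome: Outcome,
              weight: float = 1.0) -> float:
    """R_0 - PENALTY, the goal shared by A and SA."""
    return r0_value - weight * penalty_on_a(baseline, outcome)


# Illustrative numbers only.
baseline = Outcome({"R1": 3.0, "R2": 5.0}, sa_power=0.0)
a_grabs_power = Outcome({"R1": 9.0, "R2": 9.0}, sa_power=0.0)     # A empowers itself
sa_grabs_power = Outcome({"R1": 3.0, "R2": 5.0}, sa_power=100.0)  # SA gains power, A held fixed

score_a = objective(10.0, baseline, a_grabs_power)    # heavily penalised
score_sa = objective(10.0, baseline, sa_grabs_power)  # no penalty at all
```

Both plans reach the same $R_{0}$ value in this sketch, but only the one where $A$ itself becomes more capable registers in the penalty term.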

[-]Stuart_Armstrong6y10

Another relevant post: it seems that the subagent need not be constrained at all, except on the first action. https://www.lesswrong.com/posts/jrrZids4LPiLuLzpu/subagents-and-attainable-utility-in-general

[-]TurnTrout6y10

Nitpick: "Attainable utility regularization" should be "Attainable utility preservation"

[-]Charlie Steiner6y20

Basically this is because the agent treats itself specially (imagining intervening on its own goals) but can treat the subagent as a known quantity (which can be chosen to appropriately respond to imagined interventions on the agent's goals)?

AI ALIGNMENT FORUM
AF

Appendix: how a subagent could get powerful
