The More Power At Stake, The Stronger Instrumental Convergence Gets For Optimal Policies

2Slider

4Alex Turner


As I understand expanding candy into A and B but not expanding the other will make the ratios go differently.

In probability one can have the assumption of equiprobability: if you have no reason to think one outcome is more likely than another, then it might be reasonable to assume they are equally likely.

If we knew what was important and what not we would be sure about the optimality. But since we think we don't know it or might be in error about it we are treating that the value could be hiding anywhere. It seems to work in a world where each node is pretty comparably likely to contain value. I guess it comes from the effect of the relevant utility functions being defined in the terms of states we know about.

As I understand expanding candy into A and B but not expanding the other will make the ratios go differently.

What do you mean?

If we knew what was important and what not we would be sure about the optimality. But since we think we don't know it or might be in error about it we are treating that the value could be hiding anywhere.

I'm not currently trying to make claims about what variants we'll actually be likely to specify, if that's what you mean. Just that in the reasonably broad set of situations covered by my theorems, the vast majority of variants of every objective function will make power-seeking optimal.

Edit, 5/16/23: I think this post is beautiful, correct in its narrow technical claims, and practically irrelevant to alignment. This post presents a cripplingly unrealistic picture of the role of reward functions in reinforcement learning. Reward functions are not "goals", real-world policies are not "optimal", and the mechanistic function of reward is (usually) to provide policy gradients to update the policy network.

I expect this post to harm your alignment research intuitions unless you've already inoculated yourself by deeply internalizing and understanding *Reward is not the optimization target*. If you're going to read one alignment post I've written, read that one.

Follow-up work (*Parametrically retargetable decision-makers tend to seek power*) moved away from optimal policies and treated reward functions more realistically.

*Environmental Structure Can Cause Instrumental Convergence*

But how strong is this effect, quantitatively?

In *Environmental Structure Can Cause Instrumental Convergence*, I speculated that we should be able to get quantitative lower bounds on how many objectives incentivize power-seeking actions:

About a week later, I had my answer:

Scaling law for instrumental convergence (informal): if policy set $\Pi_A$ lets you do "$n$ times as many things" as policy set $\Pi_B$ lets you do, then for *every* reward function, $A$ is optimal over $B$ for at least $\frac{n}{n+1}$ of its permuted variants (i.e. orbit elements).

For example, $\Pi_A$ might contain the policies where you stay alive, and $\Pi_B$ may be the other policies: the set of policies where you enter one of several death states.

(Conjecture which I think I see how to prove: for *almost all* reward functions, $A$ is *strictly* optimal over $B$ for at least $\frac{n}{n+1}$ of its permuted variants.)

Basically, when you could apply the previous results, but "multiple times" (FN: quotes), you can get lower bounds on how often the larger set of things is optimal:

And in way larger environments - like the real world, where there are trillions and trillions of things you can do if you stay alive, and not much you can do otherwise - nearly *all* orbit elements will make survival optimal.

I see this theory as beginning to link the richness of the agent's environment with the difficulty of aligning that agent: for optimal policies, instrumental convergence strengthens proportionally to the ratio $\frac{\text{control if you survive}}{\text{control if you die}}$.
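The informal scaling law above can be checked by brute force in a toy setting. The sketch below is not the formal setting of the theorems: it assumes a one-step environment in which the agent picks a single terminal state and receives that state's reward, and it treats the orbit as the set of distinct reward functions obtained by permuting rewards across states. The function name `fraction_favoring_larger_set` and its parameters are hypothetical, chosen for illustration.

```python
from fractions import Fraction
from itertools import permutations


def fraction_favoring_larger_set(rewards, n_larger):
    """Fraction of orbit elements for which the larger set is (weakly) optimal.

    Toy model (an assumption of this sketch): the agent picks one terminal
    state and receives that state's reward. The first `n_larger` positions
    are the states reachable by staying alive; the rest are death states.
    """
    # The orbit is the set of distinct reward functions obtained by
    # permuting which state receives which reward value.
    orbit = set(permutations(rewards))
    favorable = sum(
        1
        for variant in orbit
        if max(variant[:n_larger]) >= max(variant[n_larger:])
    )
    return Fraction(favorable, len(orbit))


# Three "alive" states versus one death state: n = 3, so the bound is 3/4.
print(fraction_favoring_larger_set((0, 1, 2, 3), n_larger=3))  # 3/4
# Ties can only help the larger set: here every variant makes it optimal.
print(fraction_favoring_larger_set((1, 1, 0, 0), n_larger=3))  # 1
```

With two states on the larger side and one on the smaller side, the same function returns 2/3 for a reward function with distinct values, for example `fraction_favoring_larger_set((0, 1, 2), n_larger=2)`.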

## Why this is true

Optional section.

The proofs are currently in an Overleaf; let me know if you want access. But here's one intuition, using the `candy/chocolate/reward` example environment.

- […] `candy` is strictly optimal.
- […] `candy` is strictly optimal over both `chocolate` and `hug`.
- […] `candy` and `chocolate`, and one switching reward for `candy` and `hug`.
- […] `Wait!` is strictly optimal.
- […] `Wait!` is strictly optimal over `candy`, than those for which `candy` is strictly optimal over `Wait!`.
- […] `Start`'s child states (`candy/Wait!`) is strictly optimal, or they're both optimal. If they're both optimal, `Wait!` is optimal. Otherwise, `Wait!` makes up at least $\frac{2}{3}$ of the orbit elements for which strict optimality holds.

## Conjecture

Fractional scaling law for instrumental convergence (informal): if staying alive lets you do $n$ "things" and dying lets you do $m \le n$ "things", then for *every* reward function, staying alive is optimal for at least $\frac{n}{n+m}$ of its orbit elements.

I'm reasonably confident this is true, but I haven't worked through the combinatorics yet. This would slightly strengthen the existing lower bounds in certain situations. For example, suppose dying gives you 2 choices of terminal state, but living gives you 51 choices. The current result only lets you prove that at least $\frac{50}{50+2} = \frac{25}{26}$ of the orbit incentivizes survival. The fractional lower bound would slightly improve this to $\frac{51}{51+2} = \frac{51}{53}$.
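The arithmetic in the example above can be reproduced with exact fractions. This is a minimal sketch: the function names are hypothetical, and `current_bound` assumes (inferring from the 50 versus 51 in the example) that the existing result only counts whole multiples of the smaller set. `conjectured_bound` encodes the conjecture as stated, which is not a proven result.

```python
from fractions import Fraction


def current_bound(n_alive: int, m_dead: int) -> Fraction:
    """Lower bound from the existing scaling law.

    Assumption of this sketch: only whole multiples count, so the larger
    set is treated as floor(n_alive / m_dead) copies of the smaller one.
    """
    if m_dead <= 0 or n_alive < m_dead:
        raise ValueError("expected 0 < m_dead <= n_alive")
    copies = n_alive // m_dead
    return Fraction(copies, copies + 1)


def conjectured_bound(n_alive: int, m_dead: int) -> Fraction:
    """Lower bound n / (n + m) from the conjectured fractional scaling law."""
    if m_dead <= 0 or n_alive < m_dead:
        raise ValueError("expected 0 < m_dead <= n_alive")
    return Fraction(n_alive, n_alive + m_dead)


# The example from the text: 51 choices if alive, 2 if dead.
print(current_bound(51, 2))      # 25/26
print(conjectured_bound(51, 2))  # 51/53
```

The two bounds coincide whenever the larger count is an exact multiple of the smaller one, so the conjecture would only help in the remaining cases.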

## Invariances

In certain ways, the results are indifferent to e.g. increased precision in agent sensors: it doesn't matter if dying gives you 1 option and living gives you n options, or if dying gives you 2 options and living gives you 2n options.

Similarly, you can do the inverse operations to simplify subgraphs in a way that respects the theorems.

This is the start of a theory on what state abstractions "respect" the theorems, although there's still a lot I don't understand there. (I've barely thought about it so far.)
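The first invariance in this section can be read off the lower bound directly. As a sketch, assume the bound depends only on the two option counts, in the form $\frac{n}{n+m}$ from the conjecture (for exact multiples this agrees with the $\frac{n}{n+1}$ form of the scaling law). Doubling both counts then leaves the bound unchanged:

$$
\frac{2n}{2n + 2} = \frac{2n}{2(n + 1)} = \frac{n}{n + 1}
$$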

## Note of caution, redux

Last time, in addition to the "how do combinatorics work?" question I posed, I wrote several qualifications:

Let's take care of that last one. I was actually being too cautious, since the existing results already show us how to reason across multiple situations. The reason is simple: suppose we use my results to prove that when the agent maximizes average per-timestep reward, it's strictly optimal for at least 99.99% of objective variants to stay alive. This is because the death states are strictly suboptimal for these variants. For all of these variants, *no matter the situation* the agent finds itself in, it'll be optimal to try to avoid the strictly suboptimal death states.

This doesn't mean that these variants always incentivize moves which are formally POWER-seeking, but it does mean that we can sometimes prove what optimal policies tend to do across a range of situations.

So now we find ourselves with a slimmer list of qualifications:

It turns out to be surprisingly easy to do away with (2). We'll get to that next time.

For (3), environments which "almost" have the right symmetries should also "almost" obey the theorems. To give a quick, non-legible sketch of my reasoning:

So I don't currently view (3) as a huge deal. I'll probably talk more about that another time.

This should bring us to interfacing with (1) ("how smart is the agent? How does it think, and what options will it tend to choose?" - this seems hard) and (4) ("for what kinds of reward specification procedures are there way more ways to incentivize power-seeking, than there are ways to *not* incentivize power-seeking?" - this seems more tractable).

## Conclusion

This scaling law deconfuses me about why it seems so hard to specify nontrivial real-world objectives which don't have incorrigible shutdown-avoidance incentives when maximized.

FN quotes: I'm using scare quotes regularly because there aren't short English explanations for the exact technical conditions. But this post is written so that the high-level takeaways should be right.

Thanks to Connor Leahy, Rohin Shah, Adam Shimi, and John Wentworth for feedback on this post.