Review

Instrumental Convergence For Realistic Agent Objectives



Am I correct to assume that the discussions of StarCraft and Minecraft concern single-player variants of those games?

It seems to me that in a competitive, 2-player, minimize-resource-competition StarCraft, you would want to go kill your opponent so that they could no longer interfere with your resource loss? More generally, I think competitions to minimize resources might still usually involve some sort of power-seeking. I remember reading somewhere that 'losing chess' involves normal-looking (power-seeking?) early game moves.

I'm implicitly assuming a fixed opponent policy, yes.

Without being overly familiar with SC2—you don't have to kill your opponent to get to 0 resources, do you? From my experience with other RTS games, I imagine you can just quickly build units and deplete your resources, and then your opponent can't make you accrue more resources. Is that wrong?

Yes, I agree that in the simplest case, SC2 with default starting resources, you just build one or two units and you're done. However, I don't see why this case should be understood as generically explaining the negative alpha weights setting. Seems to me more like a case of an excessively simple game?

Consider the set of games starting with various quantities of resources and negative alpha weights. As starting resources increase, you will be incentivised to go attack your opponent to interfere with their resource depletion. Indeed, if the reward is based on end-of-game resource minimisation, you end up participating in an unbounded resource-maximisation competition trying to guarantee control over your opponent; then you spend your resources safely after crippling your opponent? In the single player setting, you will be incentivised to build up your infrastructure so as to spend your resources more quickly.

It seems to me the multi-player case involves power-seeking. Then, it seems like negative alpha weights don't generically imply anything about the existence of power-seeking incentives?

(I'm actually not clear on whether the single-player case should be seen as power-seeking or not? Maybe it depends on your choice of discount rate, gamma? You are building up infrastructure, i.e. unit-producing buildings, which seems intuitively power-seeking. But the number of long-term possibilities available to you following spending resources on infrastructure is reduced -- assuming gamma=1 -- OTOH the number of short-term possibilities may be higher given infrastructure, so you may have increased power assuming gamma<1?)

I agree that in certain conceivable games which are not baseline SC2, there will be different power-seeking incentives for negative alpha weights. My commentary wasn't intended as a generic takeaway about negative feature weights in particular.

But in the game which actually is SC2, where you don't start with a huge number of resources, negative alpha weights don't incentivize power-seeking. You do need to think about the actual game being considered, before you can conclude that negative alpha weights imply such-and-such a behavior.

But the number of long-term possibilities available to you following spending resources on infrastructure is reduced

I think that either or considering *suboptimal* power-seeking resolves the situation. The reason that building infrastructure intuitively seems like power-seeking is that we are not optimal logically omniscient agents; all possible future trajectories are not laid out immediately before our minds. But the suboptimal power-seeking metric (Appendix C in *Optimal Policies Tend To Seek Power*) does match intuition here AFAICT, where cleverly building infrastructure has the effect of navigating the agent to situations with more cognitively exploitable opportunities.

instrumental convergence basically disappears for agents with utility functions over action-observation histories.

Wait, I am puzzled. Have you just completely changed your mind about the preconditions needed to get a power-seeking agent? The way the above reads is: just add some observation of actions to your realistic utility function, and your instrumental convergence problem is solved.

u-AOH (utility functions over action-observation histories): No IC

u-OH (utility functions over observation histories): Strong IC

There are many utility functions in u-AOH that simply ignore the A part of the history, so these would then have **Strong IC** because they are u-OH functions. So are you making a subtle mathematical point about how these will average away to zero (given various properties of infinite sets), or am I missing something?

Edit, 5/16/23: I think this post is beautiful, correct in its narrow technical claims, and practically irrelevant to alignment. This post presents an unrealistic picture of the role of reward functions in reinforcement learning, conflating "utility" with "reward." Reward functions are not "goals", reward functions are not "objectives" of the policy network, real-world policies are not "optimal", and the mechanistic function of reward is (usually) to provide policy gradients to update the policy network.

I expect this post to harm your alignment research intuitions unless you've already inoculated yourself by deeply internalizing and understanding *Reward is not the optimization target*. If you're going to read one alignment post I've written, read that one.

Follow-up work (*Parametrically retargetable decision-makers tend to seek power*) moved away from optimal policies and treated reward functions more realistically.

The current power-seeking theorems say something like:

This kind of argument assumes that (the set of utility functions we might specify) is closed under permutation. This is unrealistic, because practically speaking we reward agents based off of observed features of the agent's environment.

For example, Pac-Man eats dots and gains points. A football AI scores a touchdown and gains points. A robot hand solves a Rubik's cube and gains points. But most *permutations* of these objectives are implausible because they're high-entropy, they're very complex, they assign high reward to one state and low reward to another state without a simple generating rule that grounds out in observed features. Practical objective specification doesn't allow that many degrees of freedom in what states get what reward.

I explore how instrumental convergence works in this case. I also walk through how these new results retrodict the fact that instrumental convergence basically disappears for agents with utility functions over action-observation histories.

## Case Studies

## Gridworld

Consider the following environment, where the agent can either stay put or move along a purple arrow.

Suppose the agent gets some amount of reward each timestep, and it's choosing a policy to maximize its average per-timestep reward. Previous results tell us that for generic reward functions over states, at least half of them incentivize going right. There are two terminal states on the left, and three on the right, and 3 > 2; we conclude that at least $\frac{\lfloor 3/2 \rfloor}{\lfloor 3/2 \rfloor + 1} = \frac{1}{2}$ of objectives incentivize going right.
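As a sanity check on the at-least-half bound, here is a minimal brute-force sketch. It rests on simplifying assumptions that are not in the post: exactly five terminal states (two reachable by going left, three by going right), distinct reward values, and an average-reward agent that simply heads for the reachable terminal state with the highest reward. The name `fraction_preferring_right` is hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def fraction_preferring_right(rewards, n_left=2):
    """Fraction of permutations of `rewards` whose best terminal state is on the right.

    Assumption: the first `n_left` slots are the left terminal states and the
    remaining slots are the right terminal states.
    """
    perms = list(permutations(rewards))
    right_wins = sum(1 for p in perms if max(p[n_left:]) >= max(p[:n_left]))
    return Fraction(right_wins, len(perms))


# Five arbitrary distinct reward values, one per terminal state.
frac = fraction_preferring_right((0.1, 0.7, 0.3, 0.9, 0.5))
print(frac)  # 3/5
```

Under these assumptions the brute-force fraction is 3/5, which is consistent with (and stronger than) the guaranteed lower bound of 1/2.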

But it's damn hard to have so many degrees of freedom that you're specifying a potentially independent utility number for each state.^{[1]} Meaningful utility functions will be featurized in some sense: only depending on certain features of the world state, and of how the outcomes transpired, etc. If the featurization is linear, then it's particularly easy to reason about power-seeking incentives.

Let

$$\text{feat}(s) := \begin{pmatrix} 1 \text{ if } s = \triangle, \ 0 \text{ else} \\ 1 \text{ if } s = \bigcirc, \ 0 \text{ else} \\ 1 \text{ if } s = \bigstar, \ 0 \text{ else} \end{pmatrix}.$$

That is, the featurization only cares what shape the agent is standing on. Suppose the agent makes decisions in a way which depends only on the featurized reward of a state: $R(s) = \text{feat}(s)^\top \alpha$, where $\alpha \in \mathbb{R}^3$ expresses the feature coefficients. Then the relevant terminal states are only {triangle, circle, star}, and we conclude that $\frac{2}{3}$ of coefficient vectors incentivize going right. This is true more precisely in the orbit sense: For every coefficient vector $\alpha$, at least^{[2]} $\frac{2}{3}$ of its permuted variants make the agent prefer to go right.

This particular featurization *increases* the strength of the orbit-level incentives: whereas before, we could only guarantee $\frac{1}{2}$-strength power-seeking tendency, now we guarantee $\frac{2}{3}$-level.^{[3]}^{[4]}

There's another point I want to make in this tiny environment.

Suppose we find an environmental symmetry $\phi$ which lets us apply the original power-seeking theorems to raw reward functions over the world state. Letting $e_s \in \mathbb{R}^6$ be a column vector with an entry of 1 at state $s$ and 0 elsewhere, in this environment, we have the symmetry enforced by

$$\phi \cdot \overbrace{\{e_\triangle, e_{\text{left}}\}}^{\text{State distributions, left}} = \{e_\bigcirc, e_{\text{right}}\} \subsetneq \overbrace{\{e_\bigcirc, e_{\text{right}}, e_\bigstar\}}^{\text{State distributions, right}}.$$

Given a state featurization, and given that we know that there's a state-level environmental symmetry ϕ, when can we conclude that there's also feature-level power-seeking in the environment?

Here, we're asking "if reward is only allowed to depend on how often the agent visits each shape, and we know that there's a raw state-level symmetry, when do we know that there's a shape-feature embedding from (left shape feature vectors) into (right shape feature vectors)?"

In terms of "what choice lets me access 'more' features?", this environment is relatively easy—look, there are twice as many shapes on the right. More formally, we have:

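$$\overbrace{\left\{ \begin{pmatrix} 1_\triangle \\ 0_\bigcirc \\ 0_\bigstar \end{pmatrix}, \begin{pmatrix} 0_\triangle \\ 0_\bigcirc \\ 0_\bigstar \end{pmatrix} \right\}}^{\text{Feature vectors on the left}} \qquad \overbrace{\left\{ \begin{pmatrix} 0_\triangle \\ 1_\bigcirc \\ 0_\bigstar \end{pmatrix}, \begin{pmatrix} 0_\triangle \\ 0_\bigcirc \\ 0_\bigstar \end{pmatrix}, \begin{pmatrix} 0_\triangle \\ 0_\bigcirc \\ 1_\bigstar \end{pmatrix} \right\}}^{\text{Feature vectors on the right}},$$

where the left set can be permuted two separate ways into the right set (since the zero vector isn't affected by feature permutations).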
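The two feature sets above are small enough to check by brute force. The following is a sketch, not the formal argument: it assumes the agent picks whichever side offers the highest featurized reward $\text{feat}(s)^\top \alpha$, and it counts going right as optimal when it is at least as good as going left. The helper names (`best`, `orbit_fraction_right`, `embedded_images`) are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

# Feature order is (triangle, circle, star); vectors are copied from the sets above.
LEFT = [(1, 0, 0), (0, 0, 0)]
RIGHT = [(0, 1, 0), (0, 0, 0), (0, 0, 1)]


def best(features, alpha):
    """Highest featurized reward feat(s)^T alpha over a set of feature vectors."""
    return max(sum(f * a for f, a in zip(vec, alpha)) for vec in features)


def orbit_fraction_right(alpha):
    """Fraction of the permutations of alpha for which going right is (weakly) optimal."""
    orbit = list(permutations(alpha))
    wins = sum(1 for a in orbit if best(RIGHT, a) >= best(LEFT, a))
    return Fraction(wins, len(orbit))


def embedded_images(left, right):
    """Distinct images of `left` under feature relabelings that land inside `right`."""
    images = set()
    for perm in permutations(range(3)):
        moved = frozenset(tuple(v[perm[i]] for i in range(3)) for v in left)
        if moved <= set(right):
            images.add(moved)
    return images


print(orbit_fraction_right((3, 2, 1)))     # 2/3
print(orbit_fraction_right((1, 1, 1)))     # 1
print(len(embedded_images(LEFT, RIGHT)))   # 2
```

The constant vector illustrates why the bound is "at least" rather than "exactly" two thirds, and the two embedded images correspond to relabeling the triangle feature as the circle or as the star.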

But I'm gonna play dumb and walk through to illustrate a more important point about how power-seeking tendencies are guaranteed when featurizations respect the structure of the environment.

Consider the state s△. We permute it to be s◯ using ϕ (because ϕ(s△)=s◯), and then featurize it to get a feature vector with 1◯ and 0 elsewhere.

Alternatively, suppose we first featurize s△ to get a feature vector with 1△ and 0 elsewhere. Then we swap which features are which, by switching △ and ◯. Then we get a feature vector with 1◯ and 0 elsewhere—the same result as above.
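The two routes just described (permute the state and then featurize, versus featurize and then relabel the features) can be compared directly in code. This is a minimal sketch under assumptions of my own: states are named by strings, any state without a shape (here a hypothetical `"start"`) featurizes to the zero vector, and $\phi$ swaps the triangle and circle states while fixing everything else.

```python
SHAPES = ("triangle", "circle", "star")


def feat(state):
    """One-hot shape featurization; a state with no shape maps to the zero vector."""
    return tuple(1 if state == shape else 0 for shape in SHAPES)


def swap_features(vec):
    """Relabel features by switching the triangle and circle coordinates."""
    triangle, circle, star = vec
    return (circle, triangle, star)


def commutes(states, phi_map):
    """Check feat(phi(s)) == swap_features(feat(s)) for every state.

    `phi_map` lists only the states that phi moves; all others are fixed.
    """
    return all(
        feat(phi_map.get(s, s)) == swap_features(feat(s)) for s in states
    )


STATES = ("start", "triangle", "circle", "star")

# Assumed symmetry: swap the triangle and circle states.
good_phi = {"triangle": "circle", "circle": "triangle"}
# A state permutation that ignores the shape structure: swap triangle with start.
bad_phi = {"triangle": "start", "start": "triangle"}

print(commutes(STATES, good_phi))  # True
print(commutes(STATES, bad_phi))   # False
```

The second permutation is included only as a contrast: it moves the triangle state onto a shapeless state, so no relabeling of shape features can reproduce its effect.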

The shape featurization plays nice with the actual nitty-gritty environment-level symmetry. More precisely, a sufficient condition for feature-level symmetries: (Featurizing and then swapping which features are which) commutes with (swapping which states are which and then featurizing).^{[5]} And where there are feature-level symmetries, just apply the normal power-seeking theorems to conclude that there are decision-making tendencies to choose sets of larger features.

In a different featurization, suppose the featurization is the agent's x/y coordinates. $R(s_{x,y}) = \alpha_1 x + \alpha_2 y$.

Given the *start* state, if the agent goes *up*, its reachable feature vector is just $\left\{\begin{pmatrix} x = 0 \\ y = 1 \end{pmatrix}\right\}$, whereas the agent can induce $\begin{pmatrix} x = 1 \\ y = 0 \end{pmatrix}$ if it goes *right*. Therefore, whenever *up* is strictly optimal for a featurized reward function, we can permute that reward function's feature weights by swapping the x- and y-coefficients ($\alpha_1$ and $\alpha_2$, respectively). Again, this new reward function is featurized, and it makes going *right* strictly optimal. So the usual arguments ensure that at least half of these featurized reward functions make it optimal to go right.

But sometimes these similarities won't hold, even when it initially looks like they "should"!

In this environment, the agent can induce the feature vectors $\left\{\begin{pmatrix} x: -1 \\ y: 0 \end{pmatrix}, \begin{pmatrix} x: -1 \\ y: -1 \end{pmatrix}\right\}$ if it goes *left*. However, it can induce $\left\{\begin{pmatrix} x: 1 \\ y: 0 \end{pmatrix}, \begin{pmatrix} x: 1 \\ y: 1 \end{pmatrix}\right\}$ if it goes *right*. There is no way of switching feature labels so as to copy the *left* feature set into the *right* feature set! There's no way to just apply a feature permutation to the *left* set, and thereby produce a subset of the *right* feature set. Therefore, the theorems don't apply, and so they don't guarantee anything about how most permutations of every reward function incentivize some kind of behavior.

On reflection, this makes sense. If $\alpha_1 = \alpha_2 = -1$, then there's no way the agent will want to go *right*. Instead, it'll go for the negative feature values offered by going *left*. This will hold for *all* permutations of this feature labelling, too. So the orbit-level incentives *can't* hold.

If the agent can be made to "hate everything" (all feature weights $\alpha_i$ are negative), then it will pursue opportunities which give it negative-valued feature vectors, or at least strive for the oblivion of the zero feature vector. Vice versa for if it positively values all features.
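Both coordinate examples can be checked mechanically. The sketch below assumes the feature sets are exactly the ones written out above (for the first environment, that going right offers at least the vector with x = 1 and y = 0), and that a feature permutation just reorders the coordinates. The function names are hypothetical.

```python
from itertools import permutations


def permuted(vec, perm):
    """Reorder the coordinates of a feature vector."""
    return tuple(vec[i] for i in perm)


def embedding_exists(source, target):
    """True if some relabeling of the features maps every source vector into target."""
    dims = len(next(iter(source)))
    return any(
        all(permuted(v, perm) in target for v in source)
        for perm in permutations(range(dims))
    )


def best(features, alpha):
    """Highest featurized reward alpha_1 * x + alpha_2 * y over a feature set."""
    return max(sum(f * a for f, a in zip(v, alpha)) for v in features)


# First environment, vectors written as (x, y).
UP, RIGHT_1 = {(0, 1)}, {(1, 0)}
# Second environment.
LEFT_2, RIGHT_2 = {(-1, 0), (-1, -1)}, {(1, 0), (1, 1)}

print(embedding_exists(UP, RIGHT_1))      # True: swap x and y
print(embedding_exists(LEFT_2, RIGHT_2))  # False: no relabeling works

# With alpha_1 = alpha_2 = -1 the agent strictly prefers going left.
hater = (-1, -1)
print(best(LEFT_2, hater), best(RIGHT_2, hater))  # 2 -1
```

Since the weight vector (-1, -1) is its own only permutation, its whole orbit prefers going left, which is why no orbit-level guarantee for going right can hold here.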

## StarCraft II

Consider a deep RL training process, where the agent's episodic reward is featurized into a weighted sum of the different resources the agent has at the end of the game, with weight vector α. For simplicity, we fix an opponent policy and a learning regime (number of epochs, learning rate, hyperparameters, network architecture, and so on). We consider the effects of varying the reward feature coefficients α.

Outcomes of interest: Game state trajectories.

AI decision-making function: $f(T \mid \alpha)$ returns the probability that, given our fixed learning regime and reward feature vector $\alpha$, the training process produces a policy network whose rollouts instantiate some trajectory $\tau \in T$.

What the theorems say: […] *aren't* orbit elements $\alpha$ with positive entries but where the learned policy tends to just die, and so we don't even have to check that the permuted variants $\phi \cdot \alpha$ of such feature vectors are also plausible. Power-seeking occurs.

This reasoning depends on which kinds of feature weights are plausible, and so wouldn't have been covered by the previous results.

## Minecraft

Similar setup to StarCraft II, but now the agent's episode reward is α1⋅(Amount of iron ore in chests within 100 blocks of spawn after 2 in-game days)+α2⋅(Same but for coal), where α1,α2∈R are scalars (together, they form the coefficient vector α∈R2).

Outcomes of interest: Game state trajectories.

AI decision-making function: $f(T \mid \alpha)$ returns the probability that, given our fixed learning regime and feature coefficients $\alpha$, the training process produces a policy network whose rollouts instantiate some trajectory $\tau \in T$.

What the theorems say: […] *not* gain power because it has no optimization pressure steering it towards the few action sequences which gain the agent power.

The analysis so far is nice to make a bit more formal, but it isn't really pointing out anything that we couldn't have figured out pre-theoretically. I think I can sketch out more novel reasoning, but I'll leave that to a future post.

## Beyond The Featurized Case

Consider some arbitrary set $D \subseteq \mathbb{R}^d$ of "plausible" utility functions over $d$ outcomes. If we have the usual big set $B$ of outcome lotteries (which possibilities are, in the view of this theory, often attained via "power-seeking"), and $B$ contains $n$ copies of some smaller set $A$ via environmental symmetries $\phi_1, \ldots, \phi_n$, then when are there orbit-level incentives *within* $D$: when will most reasonable variants of utility functions make the agent more likely to select $B$ rather than $A$?

When the environmental symmetries can be applied to the $A$-preferring-variants, in a way which produces another plausible objective. Slightly more formally, if, for every plausible utility function $u \in D$ where the agent has a greater chance of selecting $A$ than of selecting $B$, we have the membership $\phi_i \cdot u \in D$ for all $i = 1, \ldots, n$. (The formal result is Lemma B.7 in this Overleaf.)

This covers the totally general case of arbitrary sets of utility function classes we might use. (And, technically, "utility function" is decorative at this point—it just stands in for a parameter which we use to retarget the AI policy-production process.)

The general result highlights how D := { plausible objective functions } affects what conclusions we can draw about orbit-level incentives. All else equal, being able to specify more plausible objective functions for which f(B∣u)≥f(A∣u) means that we're more likely to ensure closure under certain permutations. Similarly, adding plausible A-dispreferring objectives makes it harder to satisfy f(B∣u)<f(A∣u)⟹ϕi⋅u∈D, which makes it harder to ensure closure under certain permutations, which makes it harder to prove instrumental convergence.
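To make the closure condition concrete, here is a toy check on finite sets. It is only a sketch under several assumptions that are not in the post: outcomes are the gridworld shape-feature vectors, $A$ is the left set and $B$ the right set, the decision function is deterministic (the agent "prefers $A$" when $A$'s best outcome strictly beats $B$'s), and the symmetries $\phi_i$ are the two feature swaps that copy $A$ into $B$. All names are hypothetical.

```python
from itertools import permutations

# Assumed toy outcome sets, with feature order (triangle, circle, star).
A = [(1, 0, 0), (0, 0, 0)]
B = [(0, 1, 0), (0, 0, 0), (0, 0, 1)]

# Assumed symmetries: swap triangle with circle, and swap triangle with star.
SYMMETRIES = [(1, 0, 2), (2, 1, 0)]


def best(outcomes, u):
    """Best linear utility over a set of outcome feature vectors."""
    return max(sum(f * w for f, w in zip(o, u)) for o in outcomes)


def prefers_A(u):
    """Deterministic stand-in for f(A | u) > f(B | u)."""
    return best(A, u) > best(B, u)


def act(perm, u):
    """Apply a feature permutation to a weight vector."""
    return tuple(u[i] for i in perm)


def closure_holds(D):
    """For every A-preferring u in D, require phi_i . u in D for all i."""
    return all(
        act(perm, u) in D for u in D if prefers_A(u) for perm in SYMMETRIES
    )


D_full = set(permutations((3, 2, 1)))   # closed under every relabeling
D_small = {(3, 2, 1), (2, 3, 1)}        # missing the triangle/star swap of (3, 2, 1)

print(closure_holds(D_full))   # True
print(closure_holds(D_small))  # False
```

In the second set, the $A$-preferring vector (3, 2, 1) has a permuted variant (1, 2, 3) that is not counted as plausible, so the condition fails and the orbit-level argument cannot be run inside that set.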

## Revisiting How The Environment Structure Affects Power-Seeking Incentive Strength

In *Seeking Power is Convergently Instrumental in a Broad Class of Environments*, I wrote:

In particular, for the MDP case, I wrote:

This is equivalent to a featurization which takes in an action-observation history, ignores the actions, and spits out time-discounted observation counts. The utility function is then over observations (which are just states in the MDP case). Here, the symmetries can only be over states, and not histories, and no matter how expressive the plausible state-based-reward-set $D_S$ is, it can't compete with the exponentially larger domain of the observation-history-based-utility-set $D_{OH}$, and so the featurization has *limited how strong instrumental convergence can get* by projecting the high-dimensional u-OH into the lower-dimensional u-State.

But when we go from u-AOH to u-OH, we're throwing away even more information: information about the actions! This is also a sparse projection. So what's up?

When we throw away info about actions, we're breaking some symmetries which made instrumental convergence disappear in the u-AOH case. In any deterministic environment, there are equally many u-AOH which make me want to go e.g. left (and, say, die) as which make me want to go right (and survive). This is guaranteed by symmetries which swap the value of an optimal AOH with the value of an AOH going the other way:

But when we restrict the utility function to not care about actions, now you can only modify how it cares about observation histories. Here, the AOH environmental symmetry $\phi_{AOH}$ which previously ensured balanced statistical incentives no longer enjoys closure under $D_{OH}$, and so the restricted plausible set theorem no longer works, and instrumental convergence appears when restricting from u-AOH to u-OH.
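The contrast between u-AOH and u-OH can be seen in a tiny enumeration. This is an illustrative sketch with a made-up deterministic environment, not one from the post: the agent takes two actions, going left first is fatal (it observes "dead" twice), and going right first keeps it alive with a second observation that depends on the second action. Utilities are permutations of distinct values over the reachable histories.

```python
from fractions import Fraction
from itertools import permutations, product

ACTIONS = ("L", "R")
PLANS = list(product(ACTIONS, repeat=2))


def observations(plan):
    """Hypothetical deterministic dynamics: going left first is fatal."""
    first, second = plan
    if first == "L":
        return ("dead", "dead")
    return ("alive", "x" if second == "L" else "y")


def fraction_right_first(history_of):
    """Fraction of utility assignments under which an optimal plan starts with R.

    `history_of` maps a plan to the history the utility function is allowed to see.
    """
    histories = sorted({history_of(p) for p in PLANS})
    wins = total = 0
    for values in permutations(range(len(histories))):
        u = dict(zip(histories, values))
        best_overall = max(u[history_of(p)] for p in PLANS)
        best_right = max(u[history_of(p)] for p in PLANS if p[0] == "R")
        wins += best_right == best_overall
        total += 1
    return Fraction(wins, total)


aoh = fraction_right_first(lambda p: (p, observations(p)))  # utility sees actions too
oh = fraction_right_first(observations)                      # utility sees observations only

print(aoh)  # 1/2
print(oh)   # 2/3
```

With actions in the history, the four plans give four distinct histories and the incentives are balanced. Once actions are dropped, both left-first plans collapse into a single observation history, and surviving becomes the majority preference in this toy setup.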

I thank Justis Mills for feedback on a draft.

## Appendix: tracking key limitations of the power-seeking theorems

From last time:

I think it's reasonably clear how to apply the results to realistic objective functions. I also think our objective specification procedures are quite expressive, and so the closure condition will hold and the results go through in the appropriate situations.

^{^}It's not hard to have this many degrees of freedom in such a small toy environment, but the toy environment is pedagogical. It's practically impossible to have full degrees of freedom in an environment with a trillion states.

^{^}"At least", and not "exactly." If α is a constant feature vector, it's optimal to go right for every permutation of α (trivially so, since α's orbit has a single element—itself).

^{^}Even under my more aggressive conjecture about "fractional terminal state copy containment", the unfeaturized situation would only guarantee $\frac{3}{5}$-strength orbit incentives, strictly weaker than $\frac{2}{3}$-strength.

^{^}Certain trivial featurizations can decrease the strength of power-seeking tendencies, too. For example, if the featurization is 2-dimensional: $\begin{pmatrix} 1 \text{ if the agent is dead, } 0 \text{ otherwise} \\ 1 \text{ if the agent is alive, } 0 \text{ otherwise} \end{pmatrix}$, this will tend to produce 1:1 survive/die orbit-level incentives, whereas the incentives for raw reward functions may be 1,000:1 or stronger.

^{^}There's something abstraction-adjacent about this result (proposition D.1 in the linked Overleaf paper). The result says something like "do the grooves of the agent's world model featurization respect the grooves of symmetries in the structure of the agent's environment?", and if they do, *bam*, sufficient condition for power-seeking under the featurized model. I think there's something important here about how good world-model-featurizations should work, but I'm not sure what that is yet.

I do know that "the featurization should commute with the environmental symmetry" is something I'd thought, in basically those words, no fewer than 3 times, as early as summer 2021, without explicitly knowing what that should even *mean*.

^{^}Lemma B.7 in this Overleaf: compile `quantitative-paper.tex`.