Instrumental convergence in single-agent systems

simonsdsuo

Thanks for doing these experiments and writing this up. It's so good to have concrete proposals and numerical experiments for concepts like power because power as a concept is super central to alignment, and concrete proposals and numerical experiments move the discourse around these concepts forward.

There is negotiating tactic in which one side makes a strong public pre-commitment not to accept any deal except one that is extremely favorable to them. So e.g. if Fred is purchasing a used car from me and realizes that both of us would settle for a sale price anywhere between $5000 and $10,000, then he might make a public pre-commitment not to purchase the car for more than $5000. Assuming that the pre-commitment is real and that I can independently verify that it is real, my best move then is really to sell the car for $5000. It seems like in this situation Bob has decreased his optionality pretty significantly (he no longer has the option of paying more than $5000 without suffering losses), but increased his power (he has kind of succeeded in out-maneuvering me).

A second thought experiment: in terms of raw optionality, isn't it the case that a person really can only decrease in power over the course of their life? Since our lives are finite, every decision we make locks us into something that we weren't locked into before. Even if there are certain improbably accomplishments that, when attained, increase our capacity to achieve goals so significantly that this outweighs all the options that were cut off, still wouldn't it be the case that babies would have more "power" than adults according to the optionality definition?

A final example: why should we average over possible reward functions? A paperclip maximizer might be structured in a way that makes it extremely poorly suited to any goal except for paperclip maximization, and yet a strongly superhuman paperclip maximizer would seem to be "powerful" by the common usage of that word.

Interested in your thoughts.

[-]Edouard Harris3y*30

Thanks for you comment. These are great questions. I'll do the best I can to answer here, feel free to ask follow-ups:

On pre-committing as a negotiating tactic: If I've understood correctly, this is a special case of the class of strategies where you sacrifice some of your own options (bad) to constrain those of your opponent (good). And your question is something like: which of these effects is strongest, or do they cancel each other out?

It won't surprise you that I think the answer is highly context-dependent, and that I'm not sure which way it would actually shake out in your example with Fred and Bob and the $5000. But interestingly, we did in fact discover an instance of this class of "sacrificial" strategies in our experiments!

You can check out the example in Part 3 if you're interested. But briefly, what happens is that when the agents get far-sighted enough, one of them realizes that there is instrumental value in having the option to bottle up the other agent in a dead-end corridor (i.e., constraining that other agent's options). But it can only actually do this by positioning itself at the mouth of the corridor (i.e., sacrificing its own options). Here is a full-size image of both agents' POWERs in this situation. You can see from the diagram that Agent A prefers to preserve its own options over constraining Agent H's options in this case. But crucially, Agent A values the option of being able to constrain Agent H's options.

In the language of your negotiating example, there is instrumental value in preserving one's option to pre-commit. But whether actually pre-committing is instrumentally valuable or not depends on the context.
On babies being more powerful than adults: Yes, I think your reasoning is right. And it would be relatively easy to do this experiment! All you'd need would be to define a "death" state, and set your transition dynamics so that the agent gets sent to the "death" state after N turns and can never escape from it afterwards. I think this would be a very interesting experiment to run, in fact.
On paperclip maximizers: This is a very deep and interesting question. One way to think about this schematically might be: a superintelligent paperclip maximizer will go through a Phase One, in which it accumulates its POWER; and then a Phase Two in which it spends the POWER it's accumulated. During the accumulation phase, the system might drive towards a state where (without loss of generality) the Planet Earth is converted into a big pile of computronium. This computronium-Earth state is high-POWER, because it's a common "way station" state for paperclip maximizers, thumbtack maximizers, safety pin maximizers, No. 2 pencil maximizers, and so on. (Indeed, this is what high POWER means.)

Once the system has the POWER it needs to reach its final objective, it will begin to spend that POWER in ways that maximize its objective. This is the point at which the paperclip, thumbtack, safety pin, and No. 2 pencil maximizers start to diverge from one another. They will each push the universe towards sharply different terminal states, and the more progress each maximizer makes towards its particular terminal state, the fewer remaining options it leaves for itself if its goal were to suddenly change. Like a male praying mantis, a maximizer ultimately sacrifices its whole existence for the pursuit of its terminal goal. In other words: zero POWER should be the end state of a pure X-maximizer!^[1]

My story here is hypothetical, but this is absolutely an experiment on can do (at small scale, naturally). The way to do it would be to run several rollouts of an agent, and plot the POWER of the agent at each state it visits during the rollout. Then we can see whether most agent trajectories have the property where their POWER first goes up (as they, e.g., move to topological junction points) and then goes down (as they move from the junction points to their actual objectives).

Thanks again for your great questions. Incidentally, a big reason we're open-sourcing our research codebase is to radically lower the cost of converting thought experiments like the above into real experiments with concrete outcomes that can support or falsify our intuitions. The ideas you've suggested are not only interesting and creative, they're also cheaply testable on our existing infrastructure. That's one reason we're excited to release it!

^{^}
Note that this assumes the maximizer is inner aligned to pursue its terminal goal, the terminal goal is stable on reflection, and all the usual similar incantations.

[-]Algon3y40

Random question: What's the relationship between the natural abstractions thesis and instrumental convergence? If many agents find particular states instrumentally useful, then surely that implies that the abstractions that would best aid them in reasoning about the world would mostly focus on stuff related to those states.

Like if you mostly find being in the center of an area useful, you're going to focus in on abstractions that measure how far you are from the central point rather than the colour of the area you're in or so on.

Edit: In which case, does instrumental convergence imply the natural abstractions thesis?

[-]Edouard Harris3y30

Yes, I think this is right. It's been pointed out elsewhere that feature universality in neural networks could be an instance of instrumental convergence, for example. And if you think about it, to the extent that a "correct" model of the universe exists, then capturing that world-model in your reasoning should be instrumentally useful for most non-trivial terminal goals.

We've focused on simple gridworlds here, partly because they're visual, but also because they're tractable. But I suspect there's a mapping between POWER (in the RL context) and generalizability of features in NNs (in the context of something like the circuits work linked above). This would be really interesting to investigate.

^{^}

We mean specifically utility here, not reward. While in general, reward isn’t the real target of optimization, in the particular case of the results we'll be showing here, we can treat them as identical, and we do that in the text.

(Technical details: we can treat utility and reward identically here because, in the results we’re choosing to show, we’ll be exclusively working with optimal policies that have been learned via value iteration on reward functions that are sampled from a uniform distribution [0, 1] that’s iid over states. Therefore, given the environment and discount factor, a sampled reward function is sufficient to uniquely determine the agent's optimal policy — except on a set that has measure zero over the distribution of reward functions we’re considering. And that in turn means that each sampled reward function, when combined with the other known constraints on the agent, almost always supplies a complete explanation for the agent’s actions — which is the most a utility function can ever do.)

^{^}

For simplicity, in this work we’ll only consider reward functions that depend on states, and never reward functions that directly depend on both states and actions. In other words, our reward functions will only ever have the form $R (s)$ , and never $R (s, a)$ .

^{^}

Note that these are statements about the relative POWERs of an agent with a given planning horizon. Absolute POWER values always increase as the planning horizon of the agent increases, as you can verify by, e.g., comparing the POWER numbers of Fig 2 against those of Fig 3. This occurs because an agent’s optimal state-value function increases monotonically as we increase $γ$ : an optimal far-sighted agent is able to consider strictly more options, so it will never do any worse than an optimal short-sighted one.

^{^}

Note that the colors of the gridworld cells in the animation indicate the highest and lowest POWER values within each frame, per footnote ^[3].

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

16

Instrumental convergence in single-agent systems

16

Summary of the sequence

1. Introduction

2. Single-agent POWER

2.1 Definition

2.2 Illustration

3. Results

3.1 Effect of the planning horizon

3.2 POWER at bigger scales

4. Discussion