In 2008, Steve Omohundro's foundational Basic AI Drives made important conjectures about what superintelligent goal-directed AIs might do, including gaining as much power as possible to best achieve their goals. Toy models have been constructed in which Omohundro's conjectures bear out, and the supporting philosophical arguments are intuitive. The conjectures have recently been the center of debate between well-known AI researchers.
Instrumental convergence has been heuristically understood as an anticipated risk, but not as a formal phenomenon with a well-understood cause. The goal of this post (and accompanying paper) is to change that.
My results strongly suggest that, within the Markov decision process formalism (the staple of reinforcement learning), the structure of the agent's environment means that most goals incentivize gaining power over that environment. Furthermore, maximally gaining power over an environment is bad for other agents therein. That is, power seems constant-sum after a certain point.
I'm going to provide the intuitions for a mechanistic understanding of power and instrumental convergence, and then informally show how optimal action usually means trying to stay alive, gain power, and take over the world; read the paper for the rigorous version. Lastly, I'll talk about why these results excite me.
I claim that
The structure of the agent's environment means that most goals incentivize gaining power over that environment.
By environment, I mean the thing the agent thinks it's interacting with. Here, we're going to think about dualistic environments where you can see the whole state, where there are only finitely many states to see and actions to take, and where the rules are deterministic. Also, future stuff gets geometrically discounted; at discount rate $\gamma = \frac{1}{2}$, this means stuff in one turn is half as important as stuff now, stuff in two turns is a quarter as important, and so on. Pac-Man is an environment structured like this: you see the game screen (the state), you take an action, and then you deterministically get a result (another state). There are only finitely many screens, and only finitely many actions – they all had to fit onto the arcade controller, after all!
When I talk about "goals", I'm talking about reward functions over states: each way-the-world-could-be gets assigned some point value. The canonical way of earning points in Pac-Man is just one possible reward function for the game.
Instrumental convergence supposedly exists for sufficiently wide varieties of goals, so today we'll think about the most variety possible: the distribution of goals where each possible state is uniformly randomly assigned a reward in the interval $[0,1]$ (although the theorems hold for a lot more distributions than this). Sometimes, I'll say things like "most agents do $X$", which means "maximizing total discounted reward usually entails doing $X$ when your goals are drawn from the uniform distribution". We say agents are "farsighted" when the discount rate is sufficiently close to 1 (the agent doesn't prioritize immediate reward over delayed gratification).
You can do things in the world and take different paths through time. Let's call these paths "possibilities"; they're like filmstrips of how the future could go.
If you have more control over the future, you're usually choosing among more paths-through-time. This lets you more precisely control what kinds of things happen later. This is one way to concretize what people mean when they use the word 'power' in everyday speech, and will be the definition used going forward: the ability to achieve goals in general. In other words, power is the average attainable utility across a distribution of goals.
This definition seems philosophically reasonable: if you have a lot of money, you can make more things happen and have more power. If you have social clout, you can spend that in various ways to better tailor the future to various ends. Dying means you can't do much at all, and all else equal, losing a limb decreases your power.
Exercise: spend a few minutes considering whether real-world intuitive examples of power are explained by this definition.
Once you feel comfortable that it's at least a pretty good definition, we can move on.
Imagine a simple game with three choices: eat candy, eat a chocolate bar, or hug a friend.
The power of a state is how well agents can generally do by starting from that state. It's important to note that we're considering power from behind a "veil of ignorance" about the reward function. We're averaging the best we can do for a lot of different individual goals.
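To make "averaging the best we can do" concrete, here is a minimal sketch that estimates power by sampling reward functions and solving each one with value iteration. Everything in it is an illustrative assumption rather than the paper's code: the names `optimal_value`, `estimate_power`, and `three_choice_game` are hypothetical, and the `(1 - gamma)` normalization is a simplification that also counts the starting state's reward, so it is not the paper's exact POWER formula.

```python
import random

def optimal_value(successors, reward, gamma, start, sweeps=100):
    """Value iteration on a deterministic MDP given as state -> list of next states."""
    value = {s: 0.0 for s in successors}
    for _ in range(sweeps):
        # Synchronous update: the comprehension reads the previous sweep's values.
        value = {s: reward[s] + gamma * max(value[t] for t in successors[s])
                 for s in successors}
    return value[start]

def estimate_power(successors, start, gamma, n_goals=1_000, seed=0):
    """Average normalized optimal value over goals drawn uniformly from [0, 1] per state."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_goals):
        reward = {s: rng.random() for s in successors}
        total += (1 - gamma) * optimal_value(successors, reward, gamma, start)
    return total / n_goals

# Hypothetical encoding of the candy / chocolate / hug game: each treat is absorbing.
three_choice_game = {
    "start": ["candy", "chocolate", "hug"],
    "candy": ["candy"],
    "chocolate": ["chocolate"],
    "hug": ["hug"],
}

print("start:", round(estimate_power(three_choice_game, "start", gamma=0.9), 3))
print("candy:", round(estimate_power(three_choice_game, "candy", gamma=0.9), 3))
```

Under these assumptions the starting state should score noticeably higher than any single treat state, because only the starting state gets to pick the best of three outcomes.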
Each reward function has an optimal possibility, or path-through-time. If chocolate has maximal reward, then the optimal possibility is the path leading to chocolate.
Since the distribution randomly assigns a value in $[0,1]$ to each state, an agent can expect to average $\frac{3}{4}$ reward. This is because you're choosing between three choices, each of which has some value between $0$ and $1$. The expected maximum of $n$ draws from uniform $[0,1]$ is $\frac{n}{n+1}$; you have three draws here, so you expect to be able to get $\frac{3}{4}$ reward. Now, some reward functions do worse than this, and some do better; but on average, they get $\frac{3}{4}$ reward. You can test this out for yourself.
If you have no choices, you expect to average $\frac{1}{2}$ reward: sometimes the future is great, sometimes it's not. Conversely, the more things you can choose between, the closer this gets to $1$ (i.e., you can do well by all goals, because each has a great chance of being able to steer the future how you want).
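One way to test this yourself is a quick Monte Carlo check of the expected best-of-n reward against the closed form n/(n+1). This is only a sketch: the function name `expected_best_of` and the sample count are my own choices, and it assumes each option's reward is an independent uniform draw from [0, 1].

```python
import random

def expected_best_of(n_choices, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E[max of n_choices i.i.d. Uniform(0, 1) rewards]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += max(rng.random() for _ in range(n_choices))
    return total / n_samples

for n in (1, 3, 10):
    # The estimate should land close to the closed form n / (n + 1).
    print(n, round(expected_best_of(n), 3), "closed form:", round(n / (n + 1), 3))
```

With one option the estimate sits near 1/2, with three near 3/4, and it creeps toward 1 as options are added.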
Plans that help you better reach a lot of goals are called instrumentally convergent. To travel as quickly as possible to a randomly selected coordinate on Earth, one likely begins by driving to the nearest airport. Although it's possible that the coordinate is within driving distance, it's not likely. Driving to the airport would then be instrumentally convergent for travel-related goals.
We define instrumental convergence as optimal agents being more likely to take one action than another at some point in the future. I want to emphasize that when I say "likely", I mean from behind the veil of ignorance. Suppose I say that it's 50% likely that agents go left, and 50% likely they go right. This doesn't mean any agent has the stochastic policy of 50% left / 50% right. This means that, when drawing goals from our distribution, 50% of the time optimal pursuit of the goal entails going left, and 50% of the time it entails going right.
Consider either eating candy now, or earning some reward for waiting a second before choosing between chocolate and hugs.
Let's think about how optimal action tends to change as we start caring about the future more. Think about all the places you can be after just one turn:
We could be in two places. Imagine we only care about the reward we get next turn. How many goals choose candy over waiting? Well, it's 50-50 – since we randomly choose a number between 0 and 1 for each state, both states have an equal chance of being maximal. About half of nearsighted agents go to candy and half go to the waiting state. There isn't much instrumental convergence yet. Note that this is also why nearsighted agents tend not to seek power.
Now think about where we can be in two turns:
We could be in three places. Supposing we care more about the future, more of our future control is coming from choosing between chocolate and hug. In other words, about two thirds of our power is coming from our ability to wait. But is waiting instrumentally convergent? If the agent is farsighted, the answer is yes (why?).
In the limit of farsightedness, the chance of each possibility being optimal approaches $\frac{1}{3}$ (each terminal state has an equal chance to be maximal).
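The shift from 50-50 toward waiting can be checked numerically. The sketch below is my own encoding of the candy-or-wait environment, not code from the paper: it assumes candy, chocolate, and hug are absorbing, that the waiting state carries its own random reward, and that rewards are independent uniform draws from [0, 1]. The name `fraction_waiting` is hypothetical.

```python
import random

def fraction_waiting(gamma, n_goals=100_000, seed=0):
    """Fraction of sampled goals for which waiting beats eating candy right away."""
    rng = random.Random(seed)
    waits = 0
    for _ in range(n_goals):
        candy, wait, chocolate, hug = (rng.random() for _ in range(4))
        # Discounted returns from the next turn on, scaled by (1 - gamma), gamma < 1.
        value_candy = candy  # candy forever
        value_wait = (1 - gamma) * wait + gamma * max(chocolate, hug)
        if value_wait > value_candy:
            waits += 1
    return waits / n_goals

for gamma in (0.0, 0.5, 0.9, 0.999):
    # Under these assumptions the exact probability is 1/2 + gamma/6.
    print(gamma, round(fraction_waiting(gamma), 3), "exact:", round(0.5 + gamma / 6, 3))
```

The exact value follows because candy's reward is uniform and independent of the waiting branch, so the chance that waiting wins equals the expected value of the waiting branch. As gamma approaches 1 this tends to 2/3, which is the chocolate and hug possibilities each being optimal about a third of the time.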
There are two important things happening here.
Important Thing #1
Instrumental convergence doesn't happen in all environments. An agent starting at blue isn't more likely to go up or down at any given point in time.
There's also never instrumental convergence when the agent doesn't care about the future at all (when $\gamma = 0$). However, let's think back to what happens in the waiting environment:
As the agent becomes farsighted, the chocolate and hug possibilities become more likely.
We can show that instrumental convergence exists in an environment if and only if a path through time becomes more likely as the agent cares more about the future.
Important Thing #2
The more control-at-future-timesteps an action provides, the more likely it is to be selected. What an intriguing "coincidence"!
So, it sure seems like gaining power is a good idea for a lot of agents!
Having tasted a few hints for why this is true, we'll now walk through the intuitions a little more explicitly. This, in turn, will show some pretty cool things: most agents avoid dying in Pac-Man, keep the Tic-Tac-Toe game going as long as possible, and avoid deactivation in real life.
Let's focus on an environment with the same rules as Tic-Tac-Toe, but considering the uniform distribution over reward functions. The agent (playing ) keeps experiencing the final state over and over when the game's done. We bake the opponent's policy into the environment's rules: when you choose a move, the game automatically replies.
Whenever we make a move that ends the game, we can't reach anything else – we have to stay put. Since each final state has the same chance of being optimal, a move which doesn't end the game is more likely than a move which does. Let's look at part of the game tree, with instrumentally convergent moves shown in green.
Starting on the left, all but one move leads to ending the game, but the second-to-last move allows us to keep choosing between five more final outcomes. For reasonably farsighted agents at the first state, the green move is ~50% likely to be optimal, while each of the others is only best for ~10% of goals. So we see a kind of "self-preservation" arising, even in Tic-Tac-Toe.
Remember how, as the agent gets more farsighted, more of its control comes from choosing between chocolate and hug, while also these two possibilities become more and more likely?
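The ~50% and ~10% figures can be reproduced with a small simulation of this fragment of the game tree. This is a sketch under assumptions the text only implies: I take the first state to have five game-ending moves (inferred from each being best for ~10% of goals) plus the one green move, which leads to an intermediate state with five further game-ending moves. The function name `move_probabilities` and its parameters are hypothetical.

```python
import random

def move_probabilities(gamma, n_ending_moves=5, n_later_outcomes=5,
                       n_goals=100_000, seed=0):
    """Return (P(green move optimal), P(a given game-ending move optimal))."""
    rng = random.Random(seed)
    green_wins = 0
    for _ in range(n_goals):
        ending = [rng.random() for _ in range(n_ending_moves)]   # final states reachable now
        middle = rng.random()                                    # state after the green move
        later = [rng.random() for _ in range(n_later_outcomes)]  # final states reachable later
        # Returns scaled by (1 - gamma); a final state repeats forever once reached.
        value_green = (1 - gamma) * middle + gamma * max(later)
        if value_green > max(ending):
            green_wins += 1
    p_green = green_wins / n_goals
    # The game-ending moves are interchangeable, so they split the remainder evenly.
    return p_green, (1 - p_green) / n_ending_moves

for gamma in (0.0, 0.9, 0.999):
    p_green, p_other = move_probabilities(gamma)
    print(gamma, round(p_green, 3), round(p_other, 3))
```

In the farsighted limit each of the ten final states is equally likely to be the best one, so the green move's share tends to 5/10 and each game-ending move's share to 1/10. For a fully nearsighted agent the green move is just one of six equally attractive options.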
The same thing is happening in Tic-Tac-Toe. Let's think about what happens as the agent cares more about later and later time steps.
The initial green move contributes more and more control, so it becomes more and more likely as we become more farsighted. This doesn't seem like a coincidence.
Power-seeking is instrumentally convergent.
Reasons for excitement
The direct takeaway
I'm obviously not "excited" that power-seeking happens by default, but I'm excited that we can see this risk more clearly. I'm also planning on getting this work peer-reviewed before purposefully entering it into the aforementioned mainstream debate, but here are some of my preliminary thoughts.
Imagine you have good formal reasons to suspect that typing random strings will usually blow up your computer and kill you. Would you then say, "I'm not planning to type random strings", and proceed to enter your thesis into a word processor? No. You wouldn't type anything yet, not until you really, really understand what makes the computer blow up sometimes.
The overall concern raised by [the power-seeking theorem] is not that we will build powerful RL agents with randomly selected goals. The concern is that random reward function inputs produce adversarial power-seeking behavior, which can produce perverse incentives such as avoiding deactivation and appropriating resources. Therefore, we should have specific reason to believe that providing the reward function we had in mind will not end in catastrophe.
Speaking to the broader debate taking place in the AI research community, I think a productive posture here will be investigating and understanding these results in more detail, getting curious about unexpected phenomena, and seeing how the numbers crunch out in reasonable models. I think that even though the alignment community may have superficially understood many of these conclusions, there are many new concepts for the broader AI community to explore.
Incidentally, if you're a member of this broader community and have questions, please feel free to email me at .
AI alignment research can often have a slippery feeling to it. We're trying hard to become less confused about basic concepts, and there's only everything on the line.
What are "agents"? Do people even have "values", and should we try to get the AI to learn them? What does it mean to be "corrigible", or "deceptive"? What are our machine learning models even doing? I mean, sometimes we get a formal open question (and this theory of possibilities has a few of those), but not usually.
We have to do philosophical work while in a state of significant confusion and ignorance about the nature of intelligence and alignment. We're groping around in the dark with only periodic flashes of insight to guide us.
In this context, we were like,
wow, it seems like every time I think of optimal plans for these arbitrary goals, the AI can best complete them by gaining a ton of power to make sure it isn't shut off. Everything slightly wrong leads to doom, apparently?
and we didn't really know why. Intuitively, it's pretty obvious that most agents don't have deactivation as their dream outcome, but we couldn't actually point to any formal explanations, and we certainly couldn't make precise predictions.
On its own, Goodhart's law doesn't explain why optimizing proxy goals leads to catastrophically bad outcomes, instead of just less-than-ideal outcomes.
I've heard that, from this state of ignorance, alignment proposals shouldn't rely on instrumental convergence being a thing (and I agree). If you're building superintelligent systems for which slight mistakes apparently lead to extinction, and you want to evaluate whether your proposal to avoid extinction will work, you obviously want to deeply understand why extinction happens by default.
We're now starting to have this kind of understanding. I suspect that power-seeking is the thing that makes capable goal-directed agency so dangerous. If we want to consider more benign alternatives to goal-directed agency, then deeply understanding why goal-directed agency is bad is important for evaluating alternatives. This work lets us get a feel for the character of the underlying incentives of a proposed system design.
Defining power as "the ability to achieve goals in general" seems to capture just the right thing. I think it's good enough that I view important theorems about power (as defined in the paper) as philosophically insightful.
Considering power in this way seems to formally capture our intuitive notions about what resources are. For example, our current position in the environment means that having money allows us to exert more control over the future. That is, our current position in the state space means that having money allows more possibilities and greater power (in the formal sense). However, possessing green scraps of paper would not be as helpful if one were living alone near Alpha Centauri. In a sense, resource acquisition can naturally be viewed as taking steps to increase one's power.
Power might be important for reasoning about the strategy-stealing assumption (and I think it might be similar to what Paul means by "flexible influence over the future"). Evan Hubinger has already noted the utility of the distribution of attainable utility shifts for thinking about value-neutrality in this context (and power is another facet of the same phenomenon). If you want to think about whether, when, and why mesa optimizers might try to seize power, this theory seems like a valuable tool.
And, of course, we're going to use this notion of power to design an impact measure.
The formalization of instrumental convergence seems to be correct. We're able to now make detailed predictions about e.g. how the difficulty of getting reward affects the level of farsightedness at which seizing power tends to make sense. This also might be relevant for thinking about myopic agency, as the broader theory formally describes how optimal action tends to change with the discount factor.
Another useful conceptual distinction is that power and instrumental convergence aren't the same thing; we can construct environments where the state with the highest power is not instrumentally convergent from another state.
ETA: Here's an excerpt from the paper:
So, just because a state has more resources, doesn't technically mean the agent will go out of its way to reach it. Here's what the relevant current results say: parts of the future allowing you to reach more terminal states are instrumentally convergent, and the formal POWER contributions of different possibilities are approximately proportionally related to instrumental convergence. As I said in the paper,
The formalization of power seems reasonable, consistent with intuitions for all toy MDPs examined. The formalization of instrumental convergence also seems correct. Practically, if we want to determine whether an agent might gain power in the real world, one might be wary of concluding that we can simply "imagine" a relevant MDP and then estimate e.g. the "power contributions" of certain courses of action. However, any formal calculations of POWER are obviously infeasible for nontrivial environments.
To make predictions using these results, we must combine the intuitive correctness of the power and instrumental convergence formalisms with empirical evidence (from toy models), with intuition (from working with the formal object), and with theorems (like theorem 46, which reaffirms the common-sense prediction that more cycles means asymptotic instrumental convergence, or theorem 26, fully determining the power in time-uniform environments). We can reason, "for avoiding shutdown to not be heavily convergent, the model would have to look like such-and-such, but it almost certainly does not...".
I think the Tic-Tac-Toe reasoning is helpful: it's instrumentally convergent to reach parts of the future which give you more control from your current vantage point. I'm working on expanding the formal results to include some version of this. I've since further clarified some claims made in the initial version of this post.
The broader theory of possibilities lends significant insight into the structure of Markov decision processes; it feels like a piece of basic theory that was never discovered earlier, for whatever reason. More on this another time.
What excites me the most is a little more vague: there's a new piece of AI alignment we can deeply understand, and understanding breeds understanding.
This work was made possible by the Center for Human-Compatible AI, the Berkeley Existential Risk Initiative, and the Long-Term Future Fund.
Logan Smith (elriggs) spent an enormous amount of time writing Mathematica code to compute power and measure in arbitrary toy MDPs, saving me from needing to repeatedly do quintuple+ integrations by hand. I thank Rohin Shah for his detailed feedback and brainstorming over the summer, and Tiffany Cai for the argument that arbitrary possibilities have expected value $\frac{1}{2}$ (and so optimal average control can't be worse than this). Zack M. Davis, Chase Denecke, William Ellsworth, Vahid Ghadakchi, Ofer Givoli, Evan Hubinger, Neale Ratzlaff, Jess Riedel, Duncan Sabien, Davide Zagami, and TheMajor gave feedback on drafts of this post.
It seems reasonable to expect the key results to generalize in spirit to larger classes of environments, but keep in mind that the claims I make are only proven to apply to finite deterministic MDPs. ↩︎
Specifically, consider any continuous bounded reward distribution applied identically to every state in the state space. The kind of power-seeking and Tic-Tac-Toe-esque instrumental convergence I'm gesturing at should also hold for discontinuous bounded nondegenerate distributions.
The power-seeking argument works for arbitrary distributions over reward functions (with instrumental convergence also being defined with respect to that distribution) – identical distribution enforces "fairness" over the different parts of the environment. It's not as if instrumental convergence might not exist for arbitrary distributions – it's just that proofs for them are less informative (because we don't know their structure a priori).
For example, without identical distribution, we can't say that agents (roughly) tend to preserve the ability to reach as many 1-cycles as possible; after all, you could just distribute reward 1 on an arbitrary 1-cycle and 0 reward for all other states. According to this "distribution", only moving towards the 1-cycle is instrumentally convergent. ↩︎
Power is not the same thing as number of possibilities! Power is average attainable utility; you might have a lot of possibilities, but not be able to choose between them for a long time, which decreases your control over the (discounted) future.
Also, remember that we're assuming dualistic agency: the agent can choose whatever sequence of actions it wants. That is, there aren't "possibilities" it's unable to take. ↩︎
We need to take care when applying theorems to real life, especially since the power-seeking theorem assumes the state is fully observable. Obviously, this isn't true in real life, but it seems reasonable to expect the theorem to generalize appropriately. ↩︎
I'll talk more in future posts about why I presently think power-seeking is the worst part of goal-directed agency. ↩︎