Draft report on existential risk from power-seeking AI

by Joe Carlsmith, 28th April 2021


I’ve written a draft report evaluating a version of the overall case for existential risk from misaligned AI, and taking an initial stab at quantifying the risk from this version of the threat. I’ve made the draft viewable as a public Google Doc here. Feedback would be welcome.

This work is part of Open Philanthropy’s “Worldview Investigations” project. However, the draft reflects my personal (rough, unstable) views, not the “institutional views” of Open Philanthropy.


I left comments on a portion of a copy of your power-seeking writeup.

I like the current doc a lot. I also feel that it doesn't take into account some big formal hints and insights we've gotten from my work over the past two years.[1]

Very recently, I was able to show the following strong result:

Some researchers have speculated that capable reinforcement learning (RL) agents  are often incentivized to seek resources and power in pursuit of their objectives. While seeking power in order to optimize a misspecified objective, agents might be incentivized to behave in undesirable ways, including rationally preventing deactivation and correction. Others have voiced skepticism: human power-seeking instincts seem idiosyncratic, and these urges need not be present in RL agents. We formalize a notion of power within the context of Markov decision processes (MDPs). We prove sufficient conditions for when optimal policies tend to seek power over the environment. For most prior beliefs one might have about the agent's reward function, one should expect optimal agents to seek power by keeping a range of options available and, when the discount rate is sufficiently close to 1, by preferentially retaining access to more terminal states. In particular, these strategies are optimal for most reward functions.

I'd be interested in discussing this more with you; the result isn't publicly available yet, but I'd be happy to take time to explain it. The result tells us significant things about generic optimal behavior in the finite MDP setting, and I think it's worth deciphering these hints as best we can to apply them to more realistic training regimes. 
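To give a concrete feel for the flavor of the result, here is a toy illustration (the MDP below is made up for this comment, not taken from the paper): from the start state, one action preserves two terminal options, the other preserves only one. Sampling reward functions IID-uniform over states and solving each with value iteration, the option-preserving branch is optimal for most, though not all, of the sampled reward functions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy deterministic MDP (illustrative; all structure here is made up):
#   state 0 --a--> state 1 --> {3, 4}   two absorbing end states reachable
#   state 0 --b--> state 2 --> {5}      only one absorbing end state reachable
# States 3, 4, 5 are absorbing (self-loops).
successors = {0: [1, 2], 1: [3, 4], 2: [5], 3: [3], 4: [4], 5: [5]}
n_states, gamma, n_samples = 6, 0.99, 10_000

# Sample reward functions IID-uniform over states; run value iteration for all
# samples at once, using V*(s) = R(s) + gamma * max over successors' V*.
rewards = rng.uniform(0, 1, size=(n_samples, n_states))
v = rewards.copy()
for _ in range(1500):
    best_next = np.stack([v[:, successors[s]].max(axis=1) for s in range(n_states)], axis=1)
    v = rewards + gamma * best_next

# From the start state, the optimal action is whichever successor has higher V*.
frac = np.mean(v[:, 1] > v[:, 2])
print(f"fraction of sampled reward functions preferring the option-preserving branch: {frac:.3f}")
# For gamma near 1 this comes out around 2/3: "most", though certainly not all, reward functions.
```

The roughly-two-thirds figure here is just the chance that the better of two uniform draws beats a single uniform draw; the point is the asymmetry, not the exact number.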

For example, if you train a generally intelligent RL agent's reward predictor network on a curriculum of small tasks, my theorems prove that these small tasks are probably best solved by seeking power within the task. This is an obvious way the agent might notice the power abstraction, become strategically aware, and start pursuing power instrumentally.

Another angle to consider: how do power-seeking tendencies mirror other convergent phenomena, like convergent evolution, universal features, etc., and how does this inform our expectations about power-seeking (PS) agent cognition? As in the MDP case, I suspect that similar symmetry-based considerations are at play. 

For example, consider a DNN being trained on a computer vision task. These networks often learn edge detectors in early layers (this also occurs in the visual cortex). So fix the network architecture, data distribution, loss function, and the edge detector weights. Now consider a range of possible label distributions; for each, consider the network completion which minimizes expected loss. You could also consider the expected output of some learning algorithm, given that it finetunes the edge detector network for some number of steps. I predict that for a wide range of "reasonable" label distributions, these edge detectors promote effective loss minimization, more so than if these weights were randomly set. In this sense, having edge detectors is "empowering."
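To make the proposed experiment concrete, here is a rough sketch of how one might set it up (everything here is an illustrative stand-in rather than a worked-out design: MNIST as the fixed data distribution, hand-coded Sobel-style filters in place of learned edge detectors, and random binary groupings of the digit classes as the "range of possible label distributions"):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms

torch.manual_seed(0)

def make_first_layer(edge_init: bool) -> nn.Conv2d:
    """First conv layer: Sobel-style 'edge detectors' or random init; frozen either way."""
    conv = nn.Conv2d(1, 4, kernel_size=3, padding=1)
    if edge_init:
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        with torch.no_grad():
            conv.weight[0, 0] = sobel_x
            conv.weight[1, 0] = sobel_x.t()
            conv.weight[2, 0] = -sobel_x
            conv.weight[3, 0] = -sobel_x.t()
            conv.bias.zero_()
    for p in conv.parameters():          # "fix the edge detector weights"
        p.requires_grad = False
    return conv

class SmallNet(nn.Module):
    def __init__(self, edge_init: bool):
        super().__init__()
        self.first = make_first_layer(edge_init)
        self.rest = nn.Sequential(        # the "network completion" that gets trained
            nn.ReLU(), nn.MaxPool2d(2), nn.Flatten(),
            nn.Linear(4 * 14 * 14, 64), nn.ReLU(), nn.Linear(64, 2),
        )
    def forward(self, x):
        return self.rest(self.first(x))

data = datasets.MNIST(".", train=True, download=True, transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(data, batch_size=128, shuffle=True)
eval_x, eval_y = next(iter(torch.utils.data.DataLoader(data, batch_size=2048, shuffle=True)))

for trial in range(3):                    # a few sampled "label distributions"
    # One label distribution: a random binary grouping of the ten digit classes.
    label_map = torch.zeros(10, dtype=torch.long)
    label_map[torch.randperm(10)[:5]] = 1
    for edge_init in (True, False):
        model = SmallNet(edge_init)
        opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-3)
        for step, (x, y) in enumerate(loader):
            if step >= 300:
                break
            loss = F.cross_entropy(model(x), label_map[y])
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            eval_loss = F.cross_entropy(model(eval_x), label_map[eval_y]).item()
        print(f"trial {trial}, edge_init={edge_init}: eval loss {eval_loss:.3f}")
```

The prediction would be that the edge-initialized runs reach lower loss than the randomly-initialized-and-frozen runs across most of the sampled groupings; whether that actually holds is exactly the kind of empirical question this angle raises.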

And the reason this might crop up is that having these early features is a good idea for many possible symmetries of the label distribution. My thoughts here are very rough at this point, but I do feel like the analysis would benefit from highlighting parallels to convergent evolution, etc.

(This second angle is closer to conceptual alignment research than it is to weighing existing work, but I figured I'd mention it.)

Great writeup, and I'd love to chat some time if you're available.

[1] My work has revolved around formally understanding power-seeking in a simple setting (finite MDPs) so as to inform analyses like these. Public posts include:

I have two comments on section 4:

This section examines why we might expect it to be difficult to create systems of this kind that don’t seek to gain and maintain power in unintended ways.

First, I like your discussion in section 4.3.3. The option of controlling circumstances is too often overlooked, I feel.

However, your further analysis of the level of difficulty seems to be based mostly on the assumption that we must, or at least will, treat an AI agent as a black box that is evolved, rather than designed. Section 4.5:

[full alignment] is going to be very difficult, especially if we build them by searching over systems that satisfy external criteria, but which we don’t understand deeply, and whose objectives we don’t directly control.

There is a whole body of work which shows that evolved systems are often power-seeking. At the same time, within the ML and AI safety literature there is also a second body of work on designing systems which are not power-seeking at all, or have limited power-seeking incentives, even though they contain a machine-learning subsystem inside them. I feel that you are ignoring the existence and status of this second body of work in your section 4 overview, and that this likely creates a certain negative bias in your estimates later on.

Some examples of designs that explicitly try to avoid or cap power-seeking are counterfactual oracles, and more recently imitation learners like this one, and my power-limiting safety interlock here. All of these have their disadvantages and failure modes, so if you are looking for perfection they would disappoint you, but if you are looking for tractable x-risk management, I feel there is reason for some optimism.

BTW, the first page of chapter 7 of Russell's Human Compatible makes a similar point, flatly declaring that we would be toast if we made the mistake of viewing our task as controlling a black box agent that was given to us.

Hi Koen, 

Glad to hear you liked section 4.3.3. And thanks for pointing to these posts -- I certainly haven't reviewed all the literature, here, so there may well be reasons for optimism that aren't sufficiently salient to me.

Re: black boxes, I do think that black-box systems that emerge from some kind of evolution/search process are more dangerous; but as I discuss in 4.4.1, I also think that the bare fact that the systems are much more cognitively sophisticated than humans creates significant and safety-relevant barriers to understanding, even if the system has been designed/mechanistically understood at a different level.

Re: “there is a whole body of work which shows that evolved systems are often power-seeking” -- anything in particular you have in mind here?

Re: “there is a whole body of work which shows that evolved systems are often power-seeking” -- anything in particular you have in mind here?

For AI-specific work, the work by Alex Turner mentioned elsewhere in this comment section comes to mind, as backing up a much larger body of reasoning-by-analogy work, like Omohundro (2008). But the main thing I had in mind when making that comment, frankly, was the extensive literature on kings and empires. In broader biology, many genomes/organisms (bacteria, plants, etc.) will also tend to expand to consume all available resources if you put them in an environment where they can, e.g. one without balancing predators.

I really like the report, although maybe I'm not a neutral judge, since I was already inclined to agree with pretty much everything you wrote. :-P

My own little AGI doom scenario is very much in the same mold, just more specific on the technical side. And much less careful and thorough all around. :)

For benefits of generality (4.3.2.1), an argument I find compelling is that if you're trying to invent a new invention or design a new system, you need a cross-domain system-level understanding of what you're trying to do and how. Like at my last job, it was not at all unusual for me to find myself sketching out the algorithms on a project and sketching out the link budget and scrutinizing laser spec sheets and scrutinizing FPGA spec sheets and nailing down end-user requirements, etc. etc. Not because I’m individually the best person at each of those tasks—or even very good!—but because sometimes a laser-related problem is best solved by switching to a different algorithm, or an FPGA-related problem is best solved by recognizing that the real end-user requirements are not quite what we thought, etc. etc. And that kind of design work is awfully hard unless a giant heap of relevant information and knowledge is all together in a single brain / world-model.

By “power” I mean something like: the type of thing that helps a wide variety of agents pursue a wide variety of objectives in a given environment. For a more formal definition, see Turner et al (2020).
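(For reference, the definition in the cited paper is, roughly and in my own notation, the normalized expected optimal value of a state under a distribution $\mathcal{D}$ over reward functions,

$$\mathrm{POWER}_{\mathcal{D}}(s,\gamma) \;=\; \frac{1-\gamma}{\gamma}\,\mathbb{E}_{R\sim\mathcal{D}}\!\left[V^*_R(s,\gamma) - R(s)\right],$$

i.e. how well an optimal agent would do starting from $s$ if its reward function were drawn from $\mathcal{D}$, with the current state's reward subtracted off and a scaling that keeps the quantity bounded as $\gamma \to 1$. The exact form may differ between versions of the paper.)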

I think the draft tends to use the term power to point to an intuitive concept of power/influence (the thing that we expect a random agent to seek due to the instrumental convergence thesis). But I think the definition above (or at least the version in the cited paper) points to a different concept, because a random agent has a single objective (rather than an intrinsic goal of getting to a state that would be advantageous for many different objectives). Here's a relevant passage by Rohin Shah from the Alignment Newsletter (AN #78) pertaining to that definition of power:

You might think that optimal agents would provably seek out states with high power. However, this is not true. Consider a decision faced by high school students: should they take a gap year, or go directly to college? Let’s assume college is necessary for (100-ε)% of careers, but if you take a gap year, you could focus on the other ε% of careers or decide to go to college after the year. Then in the limit of farsightedness, taking a gap year leads to a more powerful state, since you can still achieve all of the careers, albeit slightly less efficiently for the college careers. However, if you know which career you want, then it is (100-ε)% likely that you go to college, so going to college is very strongly instrumentally convergent even though taking a gap year leads to a more powerful state.

[EDIT: I should note that I didn't understand the cited paper as originally published (my interpretation of the definition is based on an earlier version of this post). The first author has noted that the paper has been dramatically rewritten to the point of being a different paper, and I haven't gone over the new version yet, so my comment might not be relevant to it.]

I think the draft tends to use the term power to point to an intuitive concept of power/influence (the thing that we expect a random agent to seek due to the instrumental convergence thesis). But I think the definition above (or at least the version in the cited paper) points to a different concept, because a random agent has a single objective (rather than an intrinsic goal of getting to a state that would be advantageous for many different objectives)

This is indeed a misunderstanding. My paper analyzes the single-objective setting; no intrinsic power-seeking drive is assumed. 

I probably should have written the "because ..." part better. I was trying to point at the same thing Rohin pointed at in the quoted text.

Taking a quick look at the current version of the paper, my point still seems relevant to me. For example, in the environment in figure 16, with a discount rate of ~1, the maximally POWER-seeking behavior is to always stay in the first state (as noted in the paper), from which all the states are reachable. This is analogous to the student from Rohin's example who takes a gap year instead of going to college.

Right. But what does this have to do with your “different concept” claim?

A person does not become less powerful (in the intuitive sense) right after paying college tuition (or right after getting a vaccine) due to losing the ability to choose whether to do so. [EDIT: generally, assuming they make their choices wisely.]

I think POWER may match the intuitive concept when defined over certain (perhaps very complicated) reward distributions, rather than reward distributions that are IID-over-states (which is what the paper deals with).

Actually, in a complicated MDP environment (analogous to the real world) in which every sequence of actions results in a different state (i.e. the graph of states is a tree with a constant branching factor), the POWER of all the states that the agent can get to in a given time step is equal, when POWER is defined over an IID-over-states reward distribution.

Two clarifications. First, even in the existing version, POWER can be defined for any bounded reward function distribution - not just IID ones. Second, the power-seeking results no longer require IID. Most reward function distributions incentivize POWER-seeking, both in the formal sense, and in the qualitative "keeping options open" sense. 

To address your main point, though, I think we'll need to get more concrete. Let's represent the situation with a state diagram.

Let's say that up is gap year, and right is go to college right away.

Both you and Rohin are glossing over several relevant considerations, which might be driving misunderstanding. For one:

Power depends on your time preferences. If your discount rate is very close to 1 and you irreversibly close off your ability to pursue ε percent of careers, then yes, you have decreased your POWER by going to college right away. If your discount rate is closer to 0, then college lets you pursue more careers quickly, increasing your POWER for most reward function distributions. 
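To make the discount-rate dependence concrete, here is a quick numerical sketch (the MDP structure, reward distribution, and numbers below are my own illustrative choices, not taken from the paper): one gap-year-only career, three college careers, rewards IID-uniform over states, and POWER estimated by Monte Carlo as the normalized expected optimal value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "gap year vs. college" MDP (illustrative structure, not from the paper):
#   0 start      -> 1 (gap year) or 2 (college)
#   1 gap year   -> 6 (gap-year-only career) or 2 (college, one step later)
#   2 college    -> 3, 4, or 5 (college careers)
#   3-6 careers: absorbing
successors = {0: [1, 2], 1: [6, 2], 2: [3, 4, 5], 3: [3], 4: [4], 5: [5], 6: [6]}
n_states, n_samples = 7, 20_000

def power(gamma, iters=800):
    """Monte Carlo estimate of POWER(s) ~ (1-gamma)/gamma * E_R[V*_R(s) - R(s)], R IID-uniform."""
    rewards = rng.uniform(0, 1, size=(n_samples, n_states))
    v = rewards.copy()
    for _ in range(iters):   # value iteration, vectorized over the sampled reward functions
        best_next = np.stack([v[:, successors[s]].max(axis=1) for s in range(n_states)], axis=1)
        v = rewards + gamma * best_next
    return (1 - gamma) / gamma * (v - rewards).mean(axis=0)

for gamma in (0.1, 0.99):
    p = power(gamma)
    print(f"gamma={gamma}: POWER(gap year) ~ {p[1]:.2f}, POWER(college) ~ {p[2]:.2f}")
# Expected pattern: college has higher POWER at gamma=0.1 (it reaches three careers quickly),
# while the gap year has higher POWER at gamma=0.99 (it keeps all four careers open).
```

The exact numbers depend on the made-up structure, but the crossover as the discount rate varies is the point.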

You shouldn't need to contort the distribution used by POWER to get reasonable outputs. Just be careful that we're talking about the same time preferences. (I can actually prove that in a wide range of situations, the POWER of state 1 vs the POWER of state 2 is ordinally robust to choice of distribution. I'll explain that in a future post, though.)

My position on "is POWER a good proxy for intuitive-power?" is that yes, it's very good, after thinking about it for many hours (and after accounting for sub-optimality; see the last part of appendix B). I think the overhauled power-seeking post should help, but perhaps I have more explaining to do. 

Also, I perceive an undercurrent of "goal-driven agents should tend to seek power in all kinds of situations; your formalism suggests they don't; therefore, your formalism is bad", which is wrong because the premise is false. (Maybe this isn't your position or argument, but I figured I'd mention it in case you believe that.)

Actually, in a complicated MDP environment (analogous to the real world) in which every sequence of actions results in a different state (i.e. the graph of states is a tree with a constant branching factor), the POWER of all the states that the agent can get to in a given time step is equal, when POWER is defined over an IID-over-states reward distribution.

This is superficially correct, but we have to be careful because 

  1. the theorems don't deal with the partially observable case,
  2. this implies an infinite state space (not accounted for by the theorems),
  3. a more complete analysis would account for facts like the probable featurization of the environment. For the real world case, we'd probably want to consider a planning agent's world model as featurizing over some set of learned concepts, in which case the intuitive interpretation should come back again. See also how John Wentworth's abstraction agenda may tie in with this work.
  4. different featurizations and agent rationalities would change the sub-optimal POWER computation (see the last 'B' appendix of the current paper), since it's easier to come up with good plans in certain situations than in others.
  5. The theorems now apply to the fully general, non-IID case. (not publicly available yet)

Basically, satisfactory formal analysis of this kind of situation is more involved than you make it seem. 

You shouldn't need to contort the distribution used by POWER to get reasonable outputs.

I think using a well-chosen reward distribution is necessary, otherwise POWER depends on arbitrary choices in the design of the MDP's state graph. E.g. suppose the student in the above example writes about every action they take in a blog that no one reads, and we choose to include the content of the blog as part of the MDP state. This arbitrary choice effectively unrolls the state graph into a tree with a constant branching factor (+ self-loops in the terminal states) and we get that the POWER of all the states is equal.

This is superficially correct, but we have to be careful because

  1. the theorems don't deal with the partially observable case,
  2. this implies an infinite state space (not accounted for by the theorems),

The "complicated MDP environment" argument does not need partial observability or an infinite state space; it works for any MDP where the state graph is a finite tree with a constant branching factor. (If the theorems require infinite horizon, add self-loops to the terminal states.)

This arbitrary choice effectively unrolls the state graph into a tree with a constant branching factor (+ self-loops in the terminal states) and we get that the POWER of all the states is equal.

Not necessarily true - you're still considering the IID case. 

I think using a well-chosen reward distribution is necessary, otherwise POWER depends on arbitrary choices in the design of the MDP's state graph. E.g. suppose the student in the above example writes about every action they take in a blog that no one reads, and we choose to include the content of the blog as part of the MDP state. 

Yes, if you insist on making really weird modelling choices (and pretending the graph still well-models the original situation, even though it doesn't), you can make POWER say weird things. But again, I can prove that up to a large range of perturbation, most distributions will agree that some obvious states have more POWER than other obvious states. 

Your original claim was that POWER isn't a good formalization of intuitive-power/influence. You seem to be arguing that because there exists a situation "modelled" by an adversarially chosen environment grounding such that POWER returns "counterintuitive" outputs (are they really counterintuitive, given the information available to the formalism?), therefore POWER is inappropriately sensitive to the reward function distribution. Therefore, it's not a good formalization of intuitive-power. 

I deny both of the 'therefores.' 

The right thing to do is just note that there is some dependence on modelling choices, which is another consideration to weigh (especially as we move towards more sophisticated application of the theorems to e.g. distributions over mesa objectives and their attendant world models). But you should make sure that the POWER-seeking conclusions hold under plausible modelling choices, and not just the specific one you might have in mind. I think that my theorems show that they do in many reasonable situations (this is a bit argumentatively unfair of me, since the theorems aren't public yet, but I'm happy to give you access). 

If this doesn't resolve your concern and you want to talk more about this, I'd appreciate taking this to video, since I feel like we may be talking past each other.

EDIT: Removed a distracting analogy.

Just to summarize my current view: For MDP problems in which the state representation is very complex, and different action sequences always yield different states, POWER-defined-over-an-IID-reward-distribution is equal for all states the agent can reach at a given time step, and thus does not match the intuitive concept of power.

At some level of complexity such problems become relevant (when dealing with problems with real-world-like environments). These are not just problems that show up when one adversarially constructs an MDP problem to game POWER, or when one makes "really weird modelling choices". Consider a real-world-inspired MDP problem where a state specifies the location of every atom. What makes POWER-defined-over-IID problematic in such an environment is the sheer complexity of the state, which makes it so that different action sequences always yield different states. It's not "weird modeling decisions" causing the problem.

I also (now) think that for some MDP problems (including many grid-world problems), POWER-defined-over-IID may indeed match the intuitive concept of power well, and that publications about such problems (and theorems about POWER-defined-over-IID) may be very useful for the field. Also, I see that the abstract of the paper no longer makes the claim "We prove that, with respect to a wide class of reward function distributions, optimal policies tend to seek power over the environment", which is great (I was concerned about that claim).