## AI Alignment Forum


Alex Turner, Oregon State University PhD student working on AI alignment. Reach me at turneale[at]oregonstate[dot]edu.

# Sequences

Reframing Impact

Draft report on existential risk from power-seeking AI

This arbitrary choice effectively unrolls the state graph into a tree with a constant branching factor (+ self-loops in the terminal states) and we get that the POWER of all the states is equal.

Not necessarily true - you're still considering the IID case.
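To make the IID dependence concrete: here is a hypothetical toy tree MDP (the state names and the Monte Carlo proxy are my own illustrative constructions, not the paper's formal definition). Same-depth states get approximately equal estimated POWER under an IID reward distribution, but a non-IID distribution breaks the equality:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Hypothetical toy MDP: a depth-2 tree whose leaves are terminal self-loops.
children = {"root": ["A", "B"], "A": ["a1", "a2"], "B": ["b1", "b2"],
            "a1": ["a1"], "a2": ["a2"], "b1": ["b1"], "b2": ["b2"]}

def optimal_value(s, reward):
    kids = children[s]
    if kids == [s]:                      # terminal self-loop
        return reward[s] / (1 - gamma)
    return reward[s] + gamma * max(optimal_value(k, reward) for k in kids)

def power(s, sample_reward, n=4000):
    # Monte Carlo proxy for POWER: average optimal value achievable from s,
    # excluding the reward collected at s itself.
    est = []
    for _ in range(n):
        r = sample_reward()
        est.append(optimal_value(s, r) - r[s])
    return float(np.mean(est))

iid = lambda: {st: rng.uniform() for st in children}           # IID rewards
skew = lambda: {st: rng.uniform(0, 2 if st in ("a1", "a2") else 1)
                for st in children}                            # non-IID rewards

print(power("A", iid), power("B", iid))    # approximately equal
print(power("A", skew), power("B", skew))  # A's POWER is larger
```

Under the IID draw, every same-depth state "looks the same" to the distribution, so their POWER estimates coincide; skewing the leaf rewards under one branch immediately separates them.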

I think using a well-chosen reward distribution is necessary, otherwise POWER depends on arbitrary choices in the design of the MDP's state graph. E.g. suppose the student in the above example writes about every action they take in a blog that no one reads, and we choose to include the content of the blog as part of the MDP state.

Yes, if you insist on making really weird modelling choices (and pretending the graph still well-models the original situation, even though it doesn't), you can make POWER say weird things. But again, I can prove that, up to a large range of perturbation, most distributions will agree that some obvious states have more POWER than other obvious states.

Your original claim was that POWER isn't a good formalization of intuitive-power/influence. You seem to be arguing that because there exists a situation "modelled" by an adversarially chosen environment grounding such that POWER returns "counterintuitive" outputs (are they really counterintuitive, given the information available to the formalism?), therefore POWER is inappropriately sensitive to the reward function distribution. Therefore, it's not a good formalization of intuitive-power.

I deny both of the 'therefores.'

The right thing to do is just note that there is some dependence on modelling choices, which is another consideration to weigh (especially as we move towards more sophisticated applications of the theorems, e.g. to distributions over mesa objectives and their attendant world models). But you should make sure that the POWER-seeking conclusions hold under plausible modelling choices, and not just the specific one you might have in mind. I think that my theorems show that they do in many reasonable situations (this is a bit argumentatively unfair of me, since the theorems aren't public yet, but I'm happy to give you access).

If this doesn't resolve your concern and you want to talk more about this, I'd appreciate taking this to video, since I feel like we may be talking past each other.

EDIT: Removed a distracting analogy.

Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More

LeCun claims too much. It's true that the case of animals like orangutans points to a class of cognitive architectures which seemingly don't prioritize power-seeking. It's true that this is some evidence against power-seeking behavior being common amongst relevant cognitive architectures. However, it doesn't show that instrumental subgoals are much weaker drives of behavior than hardwired objectives.

One reading of this "drives of behavior" claim is that it has to be tautological; by definition, instrumental subgoals are always in service of the (hardwired) objective. I assume that LeCun is instead discussing "all else equal, will statistical instrumental tendencies ('instrumental convergence') be more predictive of AI behavior than its specific objective function?".

But "instrumental subgoals are much weaker drives of behavior than hardwired objectives" is not the only possible explanation of "the lack of domination behavior in non-social animals"! Maybe the orangutans aren't robust to scale. Maybe orangutans do implement non-power-seeking cognition, but their cognitive architecture may be hard or unlikely for us to reproduce in a machine - maybe the distribution of TAI cognitive architectures we should expect is far different from what orangutans are like.

I do agree that there's a very good point in the neighborhood of the quoted argument. My steelman of this would be:

Some animals, like humans, seem to have power-seeking drives. Other animals, like orangutans, do not. Therefore, it's possible to design agents of some intelligence which do not seek power. Obviously, we will be trying to design agents which do not seek power. Why, then, should we expect such agents to be more like humans than like orangutans?

(This is loose for a different reason, in that it presupposes a single relevant axis of variation between humans and orangutans. Is a personal computer more like a human, or more like an orangutan? But set that aside for the moment.)

I think he's overselling the evidence. However, on reflection, I wouldn't pick out the point for such strong ridicule.

Draft report on existential risk from power-seeking AI

Two clarifications. First, even in the existing version, POWER can be defined for any bounded reward function distribution - not just IID ones. Second, the power-seeking results no longer require IID. Most reward function distributions incentivize POWER-seeking, both in the formal sense, and in the qualitative "keeping options open" sense.
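For concreteness, the average-optimal-value definition I have in mind (modulo notational details that may differ from the draft) is:

$$\mathrm{POWER}_{\mathcal{D}}(s,\gamma) \;=\; \frac{1-\gamma}{\gamma}\,\mathbb{E}_{R\sim\mathcal{D}}\!\left[V^*_R(s,\gamma)-R(s)\right],$$

where $V^*_R$ is the optimal value function for reward function $R$. Since $V^*_R$ scales like $\frac{1}{1-\gamma}$, the normalization keeps POWER bounded as $\gamma \to 1$.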

To address your main point, though, I think we'll need to get more concrete. Let's represent the situation with a state diagram.

Both you and Rohin are glossing over several relevant considerations, which might be driving misunderstanding. For one:

Power depends on your time preferences. If your discount rate is very close to 1 and you irreversibly close off your ability to pursue some percentage of careers, then yes, you have decreased your POWER by going to college right away. If your discount rate is closer to 0, then college lets you pursue more careers quickly, increasing your POWER for most reward function distributions.
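A toy closed-form calculation illustrates the flip (the MDP shape and all numbers below are hypothetical, chosen just for illustration):

```python
# Hypothetical "career" MDP: going to college reaches 5 career terminals in
# one step; skipping college reaches 10 career terminals, but only after a
# 3-step detour through single-action states. Rewards are IID Uniform(0,1);
# we compare the expected optimal value from each starting state in closed form.

def e_max_uniform(n):
    return n / (n + 1)  # E[max of n iid Uniform(0,1) draws]

def power_college(gamma, n_careers=5):
    # one step to the best of n_careers terminal self-loops
    return gamma / (1 - gamma) * e_max_uniform(n_careers)

def power_no_college(gamma, n_careers=10, delay=3):
    # E[U] = 0.5 reward at each transit state, then the best terminal
    transit = sum(0.5 * gamma**k for k in range(1, delay))
    return transit + gamma**delay / (1 - gamma) * e_max_uniform(n_careers)

for gamma in (0.2, 0.99):
    print(gamma, power_college(gamma), power_no_college(gamma))
# At gamma = 0.2 the quick 5 options win; at gamma = 0.99 the slower 10 win.
```

The same environment, under the same reward distribution, ranks the two states oppositely at the two discount rates - which is the point about time preferences.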

You shouldn't need to contort the distribution used by POWER to get reasonable outputs. Just be careful that we're talking about the same time preferences. (I can actually prove that in a wide range of situations, the POWER of state 1 vs the POWER of state 2 is ordinally robust to choice of distribution. I'll explain that in a future post, though.)

My position on "is POWER a good proxy for intuitive-power?" is that yes, it's very good, after thinking about it for many hours (and after accounting for sub-optimality; see the last part of appendix B). I think the overhauled power-seeking post should help, but perhaps I have more explaining to do.

Also, I perceive an undercurrent of "goal-driven agents should tend to seek power in all kinds of situations; your formalism suggests they don't; therefore, your formalism is bad", which is wrong because the premise is false. (Maybe this isn't your position or argument, but I figured I'd mention it in case you believe that.)

Actually, in a complicated MDP environment—analogous to the real world—in which every sequence of actions results in a different state (i.e. the graph of states is a tree with a constant branching factor), the POWER of all the states that the agent can get to in a given time step is equal, when POWER is defined over an IID-over-states reward distribution.

This is superficially correct, but we have to be careful because

1. The theorems don't deal with the partially observable case.
2. This implies an infinite state space, which the theorems don't account for.
3. A more complete analysis would account for facts like the probable featurization of the environment. For the real-world case, we'd probably want to consider a planning agent's world model as featurizing over some set of learned concepts, in which case the intuitive interpretation should come back again. See also how John Wentworth's abstraction agenda may tie in with this work.
4. Different featurizations and agent rationalities would change the sub-optimal POWER computation (see the last 'B' appendix of the current paper), since it's easier to come up with good plans in certain situations than in others.
5. The theorems now apply to the fully general, non-IID case (these results aren't publicly available yet).

Basically, satisfactory formal analysis of this kind of situation is more involved than you make it seem.

Draft report on existential risk from power-seeking AI

Right. But what does this have to do with your “different concept” claim?

Draft report on existential risk from power-seeking AI

I think the draft tends to use the term power to point to an intuitive concept of power/influence (the thing that we expect a random agent to seek due to the instrumental convergence thesis). But I think the definition above (or at least the version in the cited paper) points to a different concept, because a random agent has a single objective (rather than an intrinsic goal of getting to a state that would be advantageous for many different objectives).

This is indeed a misunderstanding. My paper analyzes the single-objective setting; no intrinsic power-seeking drive is assumed.

Draft report on existential risk from power-seeking AI

I commented a portion of a copy of your power-seeking writeup.

I like the current doc a lot. That said, it seems not to consider some big formal hints and insights we've gotten from my work over the past two years.[1]

Very recently, I was able to show the following strong result:

Some researchers have speculated that capable reinforcement learning (RL) agents are often incentivized to seek resources and power in pursuit of their objectives. While seeking power in order to optimize a misspecified objective, agents might be incentivized to behave in undesirable ways, including rationally preventing deactivation and correction. Others have voiced skepticism: human power-seeking instincts seem idiosyncratic, and these urges need not be present in RL agents. We formalize a notion of power within the context of Markov decision processes (MDPs). We prove sufficient conditions for when optimal policies tend to seek power over the environment. For most prior beliefs one might have about the agent's reward function, one should expect optimal agents to seek power by keeping a range of options available and, when the discount rate is sufficiently close to 1, by preferentially retaining access to more terminal states. In particular, these strategies are optimal for most reward functions.
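Schematically (the precise statement, which isn't public yet, adds conditions on the environment's state graph): if the options reachable via action $a'$ at state $s$ contain a copy, under a permutation of states, of the options reachable via $a$, then

$$\Pr_{R\sim\mathcal{D}}\!\big[\text{some optimal policy for } R \text{ takes } a' \text{ at } s\big] \;\ge\; \Pr_{R\sim\mathcal{D}}\!\big[\text{some optimal policy for } R \text{ takes } a \text{ at } s\big].$$

"Most reward functions" is read in this measure-theoretic sense over the distribution $\mathcal{D}$.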

I'd be interested in discussing this more with you; the result isn't publicly available yet, but I'd be happy to take time to explain it. The result tells us significant things about generic optimal behavior in the finite MDP setting, and I think it's worth deciphering these hints as best we can to apply them to more realistic training regimes.

For example, if you train a generally intelligent RL agent's reward predictor network on a curriculum of small tasks, my theorems prove that these small tasks are probably best solved by seeking power within the task. This is an obvious way the agent might notice the power abstraction, become strategically aware, and start pursuing power instrumentally.

Another angle to consider is, how do power-seeking tendencies mirror other convergent phenomena, like convergent evolution, universal features, etc, and how does this inform our expectations about PS agent cognition? As in the MDP case, I suspect that similar symmetry-based considerations are at play.

For example, consider a DNN being trained on a computer vision task. These networks often learn edge detectors in early layers (this also occurs in the visual cortex). So fix the network architecture, data distribution, loss function, and the edge detector weights. Now consider a range of possible label distributions; for each, consider the network completion which minimizes expected loss. You could also consider the expected output of some learning algorithm, given that it finetunes the edge detector network for some number of steps. I predict that for a wide range of "reasonable" label distributions, these edge detectors promote effective loss minimization, more so than if these weights were randomly set. In this sense, having edge detectors is "empowering."
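A crude numerical analogue of this prediction (not the vision experiment itself; every detail below is a made-up toy): freeze a "difference-detector" first layer versus a random first layer of the same shape, sweep random label distributions that depend on local differences, and compare the expected loss of a least-squares readout.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 32

# Toy "images": smooth 1-D signals, so local differences are the useful feature.
X = np.cumsum(rng.normal(size=(n, d)), axis=1)

# Frozen first layer A: column i computes x[i+1] - x[i] ("edge detectors").
D = np.zeros((d, d - 1))
for i in range(d - 1):
    D[i, i], D[i + 1, i] = -1.0, 1.0
# Frozen first layer B: random weights of the same shape.
F = rng.normal(size=(d, d - 1))

def expected_loss(first_layer, n_tasks=100):
    feats = X @ first_layer                 # fixed featurization
    losses = []
    for _ in range(n_tasks):
        w = rng.normal(size=d - 1)          # one random "label distribution"
        y = np.diff(X, axis=1) @ w          # labels depend on local differences
        coef, *_ = np.linalg.lstsq(feats, y, rcond=None)
        losses.append(np.mean((feats @ coef - y) ** 2))
    return float(np.mean(losses))

print(expected_loss(D), expected_loss(F))   # edge features fit far better
```

In this toy, the "edge" featurization promotes effective loss minimization across the whole range of sampled label distributions, which is the sense of "empowering" I have in mind; the real claim about trained vision networks would of course need the actual experiment.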

And the reason this might crop up is that having these early features is a good idea for many possible symmetries of the label distribution. My thoughts here are very rough at this point in time, but I do feel like the analysis would benefit from highlighting parallels to convergent evolution, etc.

(This second angle is closer to conceptual alignment research than it is to weighing existing work, but I figured I'd mention it.)

Great writeup, and I'd love to chat some time if you're available.

[1] My work has revolved around formally understanding power-seeking in a simple setting (finite MDPs) so as to inform analyses like these. Public posts include:

Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

Apparently VMs are the way to go for PDF support on Linux.

Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

It's a spaced repetition system that focuses on incremental reading. It's like Anki, but instead of hosting flashcards separately from your reading, you extract text while reading documents and PDFs. You later refine extracts into ever-smaller chunks of knowledge, at which point you create the "flashcard" (usually 'clozes', demonstrated below).

Incremental reading is nice because you can come back to information over time as you learn more, instead of having to understand enough to make an Anki card right away.

In the context of this post, I'm reading some of the papers, making extracts, making flashcards from the extracts, and retaining at least one or two key points from each paper. Way better than retaining 1-2 points from all 70 summaries!

Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

I agree. I've put it in my SuperMemo and very much look forward to going through it. Thanks Peter & Owen!

Testing The Natural Abstraction Hypothesis: Project Intro