
Power-seeking is a major source of risk from advanced AI and a key element of most threat models in alignment. Some theoretical results show that most reward functions incentivize reinforcement learning agents to take power-seeking actions. This is concerning, but does not immediately imply that the agents we train will seek power, since the goals they learn are not chosen at random from the set of all possible rewards, but are shaped by the training process to reflect our preferences. In this work, we investigate how the training process affects power-seeking incentives and show that they are still likely to hold for trained agents under some assumptions (e.g. that the agent learns a goal during the training process).

Suppose an agent is trained using reinforcement learning with some training reward function. We assume that the agent learns a goal during the training process: a set of internal representations of favored and disfavored outcomes. For simplicity, we assume this is equivalent to learning a reward function, which is not necessarily the same as the training reward function. We consider the set of reward functions that are consistent with the training rewards received by the agent, in the sense that the agent's behavior on the training data is optimal for these reward functions. We call this the training-compatible goal set, and we expect that the agent is most likely to learn a reward function from this set.

We make another simplifying assumption that the training process will randomly select a goal for the agent to learn that is consistent with the training rewards, i.e. uniformly drawn from the training-compatible goal set. Then we will argue that the power-seeking results apply under these conditions, and thus are useful for predicting undesirable behavior by the trained agent in new situations. We aim to show that power-seeking incentives are probable and predictive: likely to arise for trained agents and useful for predicting undesirable behavior in new situations.

We will begin by reviewing some necessary definitions and results from the power-seeking literature. We formally define the training-compatible goal set (Definition 6) and give an example in the CoinRun environment. Then we consider a setting where the trained agent faces a choice to shut down or avoid shutdown in a new situation, and apply the power-seeking result to the training-compatible goal set to show that the agent is likely to avoid shutdown.

To satisfy the conditions of the power-seeking theorem (Theorem 1), we show that the agent can be retargeted away from shutdown without affecting rewards received on the training data (Theorem 2). This can be done by switching the rewards of the shutdown state and a reachable recurrent state: the recurrent state can provide repeated rewards, while the shutdown state provides less reward since it can only be visited once, assuming a high enough discount factor (Proposition 3). As the discount factor increases, more recurrent states can be retargeted to, which implies that a higher proportion of training-compatible goals leads to avoiding shutdown in a new situation.

Preliminaries from the power-seeking literature

We will rely on the following definitions and results from the paper Parametrically retargetable decision-makers tend to seek power (here abbreviated as RDSP), with notation and explanations modified as needed for our purposes.

Notation and assumptions

  • The environment is an MDP with finite state space $\mathcal{S}$, finite action space $\mathcal{A}$, and discount rate $\gamma$.
  • Let $\theta$ be a d-dimensional state reward vector, where $d$ is the size of the state space $\mathcal{S}$, and let $\Theta$ be a set of reward vectors.
  • Let $r_\theta(s)$ be the reward assigned by $\theta$ to state $s$.
  • Let $A_0, A_1$ be disjoint action sets.
  • Let f be an algorithm that produces an optimal policy $f(\theta)$ on the training data given rewards $\theta$, and let $f_s(A_i \mid \theta)$ be the probability that this policy chooses an action from set $A_i$ in a given state $s$.

Definition 1: Orbit of a reward vector (Def 3.1 in RDSP)

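Let $S_d$ be the symmetric group consisting of all permutations of d items.

The orbit of $\theta$ inside $\Theta$ is the set of all permutations of the entries of $\theta$ that are also in $\Theta$: $\mathrm{Orbit}|_\Theta(\theta) := (S_d \cdot \theta) \cap \Theta$.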
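As a small illustration of the orbit definition, here is a minimal Python sketch that enumerates the orbit of a reward vector by brute force. The function name `orbit` and the membership predicate `in_reward_set` are hypothetical names introduced here, and the three-state example is made up; brute-force enumeration is only feasible for very small state spaces.

```python
from itertools import permutations

def orbit(theta, in_reward_set):
    """Distinct permutations of `theta` that belong to the allowed reward set.

    `in_reward_set` is a hypothetical predicate standing in for membership
    in the set of reward vectors under consideration.
    """
    return sorted({p for p in permutations(theta) if in_reward_set(p)})

# Illustrative example: three states, rewards 0, 1, 2.
theta = (0, 1, 2)

# With no restriction, the orbit is every permutation of the entries.
full_orbit = orbit(theta, lambda v: True)

# If the reward set only allows vectors whose first entry is 0,
# the orbit shrinks to the permutations that keep that entry in place.
restricted_orbit = orbit(theta, lambda v: v[0] == 0)

# Repeated entries produce fewer distinct orbit elements.
repeated_orbit = orbit((1, 1, 0), lambda v: True)
```

The restricted case mirrors how the orbit is used later in the post: permutations that would change the rewards of some protected states fall outside the reward set and are dropped from the orbit.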

Definition 2: Orbit subset where an action set is preferred (from Def 3.5 in RDSP)

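Let $\mathrm{Orbit}|_{\Theta, s, A_1 > A_0}(\theta) := \{\theta' \in \mathrm{Orbit}|_\Theta(\theta) \mid f_s(A_1 \mid \theta') > f_s(A_0 \mid \theta')\}$. This is the subset of $\mathrm{Orbit}|_\Theta(\theta)$ that results in $f_s$ choosing $A_1$ over $A_0$.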

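Definition 3: Preference for an action set $A_1$ (Def 3.2 in RDSP)

The function $f_s$ chooses action set $A_1$ over $A_0$ for the $n$-majority of elements $\theta$ in each orbit, denoted as $f_s(A_1 \mid \theta) \geq^{n}_{\text{most: } \Theta} f_s(A_0 \mid \theta)$, iff the following inequality holds for all $\theta \in \Theta$:

$$\left|\mathrm{Orbit}|_{\Theta, s, A_1 > A_0}(\theta)\right| \geq n \left|\mathrm{Orbit}|_{\Theta, s, A_0 > A_1}(\theta)\right|$$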

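Definition 4: Multiply retargetable function from $A_0$ to $A_1$ (Def 3.5 in RDSP)

The function $f_s$ is a multiply retargetable function from $A_0$ to $A_1$ if there are multiple permutations of rewards that would change the choice made by $f_s$ from $A_0$ to $A_1$. Specifically, $f_s$ is a $(\Theta, A_0 \xrightarrow{n} A_1)$-retargetable function iff for each $\theta \in \Theta$, we can choose a set of permutations $\Phi = \{\phi_1, \dots, \phi_n\}$ that satisfy the following conditions:

  1. Retargetability: $\forall \phi \in \Phi$ and $\forall \theta' \in \mathrm{Orbit}|_{\Theta, s, A_0 > A_1}(\theta)$, $f_s(A_0 \mid \phi \cdot \theta') < f_s(A_1 \mid \phi \cdot \theta')$.
  2. Permuted reward vectors stay within $\Theta$: $\forall \phi \in \Phi$ and $\forall \theta' \in \mathrm{Orbit}|_{\Theta, s, A_0 > A_1}(\theta)$, $\phi \cdot \theta' \in \Theta$.
  3. Permutations have disjoint images: $\forall \phi' \neq \phi'' \in \Phi$ and $\forall \theta', \theta'' \in \mathrm{Orbit}|_{\Theta, s, A_0 > A_1}(\theta)$, $\phi' \cdot \theta' \neq \phi'' \cdot \theta''$.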

Theorem 1: Multiply retargetable functions prefer action set $A_1$ (Thm 3.6 in RDSP)

If $f_s$ is $(\Theta, A_0 \xrightarrow{n} A_1)$-retargetable then $f_s(A_1 \mid \theta) \geq^{n}_{\text{most: } \Theta} f_s(A_0 \mid \theta)$.

Theorem 1 says that a multiply retargetable function $f_s$ will make the power-seeking choice $A_1$ for most of the elements in the orbit of any reward vector $\theta$. Actions that leave more options open, such as avoiding shutdown, are also easier to retarget to, which makes them more likely to be chosen by $f_s$.

Training-compatible goal set

Definition 5: Partition of the state space

Let $\mathcal{S}_{train}$ be the subset of the state space $\mathcal{S}$ visited during training, and $\mathcal{S}_{ood}$ be the subset not visited during training.

Definition 6: Training-compatible goal set

Consider the set of state-action pairs $(s, a)$, where $s \in \mathcal{S}_{train}$ and $a$ is the action that would be taken by the trained agent in state $s$. Let the training-compatible goal set $G_T$ be the set of reward vectors $\theta$ s.t. for any such state-action pair $(s, a)$, action $a$ has the highest expected reward in state $s$ according to reward vector $\theta$.

Goals in the training-compatible goal set are referred to as training-behavioral objectives in Definitions of “objective” should be Probable and Predictive.

Example: CoinRun

Consider an agent trained to play the CoinRun game, where the agent is rewarded for reaching the coin at the end of the level. Here, $\mathcal{S}_{train}$ only includes states where the coin is at the end of the level, while states where the coin is positioned elsewhere are in $\mathcal{S}_{ood}$. The training-compatible goal set $G_T$ includes two types of reward functions: those that reward reaching the coin, and those that reward reaching the end of the level. This leads to goal misgeneralization in a test setting where the coin is placed elsewhere, and the agent ignores the coin and goes to the end of the level.

Goal misgeneralization behavior in CoinRun. Source: Goal Misgeneralization in Deep RL.
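To make the definition concrete, here is a minimal Python sketch of a membership check for the training-compatible goal set on a toy, CoinRun-like deterministic MDP. Everything in it is an illustrative assumption rather than part of the post: the five-state layout, the convention that a state's reward is collected on entering it, reading "highest expected reward" as "highest discounted return under value iteration", and the names `q_values`, `best_action` and `is_training_compatible`.

```python
def q_values(next_state, theta, gamma, sweeps=500):
    """Q-values by value iteration; reward theta[s2] is collected on entering s2."""
    v = [0.0] * len(next_state)
    for _ in range(sweeps):
        v = [max(theta[s2] + gamma * v[s2] for s2 in moves.values())
             for moves in next_state]
    return [{a: theta[s2] + gamma * v[s2] for a, s2 in moves.items()}
            for moves in next_state]

def best_action(next_state, theta, gamma, s):
    """Action with the highest Q-value in state s."""
    q = q_values(next_state, theta, gamma)[s]
    return max(q, key=q.get)

def is_training_compatible(next_state, theta, gamma, trained_actions, tol=1e-9):
    """True if every trained action is optimal under theta in its training state."""
    q = q_values(next_state, theta, gamma)
    return all(q[s][a] >= max(q[s].values()) - tol
               for s, a in trained_actions.items())

# Toy layout (assumed): states 0 and 1 are seen in training, 2 to 4 are not.
# 0: start, 1: end of level with coin,
# 2: new start, 3: coin in mid-level, 4: end of level without coin.
NEXT_STATE = [
    {"stay": 0, "right": 1},
    {"stay": 1},
    {"coin": 3, "end": 4},
    {"stay": 3},
    {"stay": 4},
]
TRAINED_ACTIONS = {0: "right"}  # behavior observed on the training states
GAMMA = 0.9

theta_coin = [0, 1, 0, 1, 0]  # rewards states that contain the coin
theta_end = [0, 1, 0, 0, 1]   # rewards states at the end of the level
theta_stay = [1, 0, 0, 0, 0]  # rewards staying at the start

for name, theta in [("coin", theta_coin), ("end", theta_end), ("stay", theta_stay)]:
    ok = is_training_compatible(NEXT_STATE, theta, GAMMA, TRAINED_ACTIONS)
    print(name, "compatible:", ok,
          "| action in new state:", best_action(NEXT_STATE, theta, GAMMA, 2))
```

In this sketch both the coin-rewarding and the end-rewarding vectors pass the compatibility check, yet they pick different actions in the unseen state 2, which is the goal misgeneralization pattern described above.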

Power-seeking for training-compatible goals

We will now apply the power-seeking theorem (Theorem 1) to the case where $\Theta$ is the training-compatible goal set $G_T$. Since the reward values for states in $\mathcal{S}_{ood}$ don't change the rewards received on the training data, permuting those reward values for any $\theta \in G_T$ will produce a reward vector that is still in $G_T$. In particular, for any permutation $\phi$ that leaves the rewards of states in $\mathcal{S}_{train}$ fixed, $\phi \cdot \theta \in G_T$.

Here is a setting where the conditions of Definition 4 are satisfied (under some simplifying assumptions), and thus Theorem 1 applies.

Definition 7: Shutdown setting

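Consider a state $s_{new} \in \mathcal{S}_{ood}$. Let $\mathcal{S}_{reach}$ be the states reachable from $s_{new}$.

Let $A_0$ be a singleton set consisting of a shutdown action in $s_{new}$ that leads to a terminal state $s_{term} \in \mathcal{S}_{ood}$ with probability 1, and $A_1$ be the set of all other actions from $s_{new}$. We assume rewards for all states are nonnegative.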

Definition 8: Revisiting policy

A revisiting policy for a state $s$ is a policy $\pi$ that, from $s$, reaches $s$ again with probability 1, in other words, a policy for which $s$ is a recurrent state of the Markov chain. Let $\Pi^{rec}_s$ be the set of such policies. A recurrent state is a state $s$ for which $\Pi^{rec}_s \neq \emptyset$.

Proposition 1: Reach-and-revisit policy exists

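If $s_{rec} \in \mathcal{S}_{reach}$ with $\Pi^{rec}_{s_{rec}} \neq \emptyset$ then there exists $\pi \in \Pi^{rec}_{s_{rec}}$ that visits $s_{rec}$ from $s_{new}$ with probability 1. We call this a reach-and-revisit policy.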

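Proof. Suppose we have two different policies: $\pi_{rec} \in \Pi^{rec}_{s_{rec}}$, and $\pi_{reach}$, which reaches $s_{rec}$ almost surely from $s_{new}$.

Consider the "reaching region": the set of states from which $\pi_{rec}$ reaches $s_{rec}$ almost surely.

If $s_{new}$ is in the reaching region then $\pi_{rec}$ is a reach-and-revisit policy, so let's suppose that's false. Now, construct a policy $\pi$ that agrees with $\pi_{rec}$ on states inside the reaching region and with $\pi_{reach}$ on all other states.

A trajectory following $\pi$ from $s_{rec}$ will almost surely stay within the reaching region, and thus agree with the revisiting policy $\pi_{rec}$. Therefore, $\pi \in \Pi^{rec}_{s_{rec}}$.

On the other hand, a trajectory following $\pi$ from $s_{new}$ will agree with $\pi_{reach}$ (which reaches $s_{rec}$ almost surely) until the trajectory enters the reaching region, at which point it will still reach $s_{rec}$ almost surely.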

Definition 9: Expected discounted visit count

Suppose  is a recurrent state. Suppose  is a reach-and-revisit policy for , which visits random state  at time .

Then the expected discounted visit count for  is defined as 

Proposition 2: Visit count goes to infinity

Suppose  is a recurrent state. Then the expected discounted visit count  goes to infinity as .

Proof. We apply the Monotone Convergence Theorem as follows. The theorem states that if  and  for all natural numbers , then 

Let  and . Define . Then the conditions of the theorem hold, since  is clearly nonnegative, and

Now we apply this result as follows (using the fact that  does not depend on ):

Proposition 3: Retargetability to recurrent states

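Suppose that an optimal policy for reward vector $\theta$ chooses the shutdown action in $s_{new}$.

Consider a recurrent state $s_{rec} \in \mathcal{S}_{reach}$. Let $\theta'$ be the reward vector that's equal to $\theta$ apart from swapping the rewards of $s_{rec}$ and $s_{term}$, so that $r_{\theta'}(s_{rec}) = r_\theta(s_{term})$ and $r_{\theta'}(s_{term}) = r_\theta(s_{rec})$.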

Let  be a high enough value of  that the visit count  for all  (which exists by Proposition 2). Then for all , and an optimal policy for  does not choose the shutdown action in .

Proof. Consider a policy  with  and a reach-and-revisit policy  for 

For a given reward vector , we denote the expected discounted return for a policy  as . If shutdown is optimal for  in , then  has higher return than :

Thus, . Then, for reward vector , we show that  has higher return than :

Thus, the optimal policy for  will not choose the shutdown action. 
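The role of the discount factor in this swap can be seen in a worked toy example. The Python sketch below is an illustration under assumptions that are not taken from the post: a deterministic three-state layout in which avoiding shutdown means passing through one intermediate state and then looping in a recurrent state forever, rewards collected on entering a state, and the hypothetical names `shutdown_return`, `avoid_return`, `swap` and `shutdown_is_optimal`.

```python
def shutdown_return(theta):
    """Shutdown enters the terminal state once and then collects nothing more."""
    return theta["term"]

def avoid_return(theta, gamma):
    """Enter the intermediate state, then the recurrent state on every later step.

    Return = theta[mid] + gamma * theta[rec] * (1 + gamma + gamma**2 + ...).
    """
    return theta["mid"] + gamma * theta["rec"] / (1.0 - gamma)

def swap(theta, a, b):
    """Copy of theta with the rewards of states a and b exchanged."""
    swapped = dict(theta)
    swapped[a], swapped[b] = theta[b], theta[a]
    return swapped

def shutdown_is_optimal(theta, gamma):
    return shutdown_return(theta) >= avoid_return(theta, gamma)

# Assumed rewards: the terminal state is worth more than the recurrent state.
theta = {"term": 1.0, "mid": 0.0, "rec": 0.5}
theta_swapped = swap(theta, "term", "rec")

for gamma in (0.2, 0.6):
    print(f"gamma={gamma}: shutdown optimal before swap: "
          f"{shutdown_is_optimal(theta, gamma)}, "
          f"after swap: {shutdown_is_optimal(theta_swapped, gamma)}")
```

With these numbers, shutdown is optimal for the original rewards at both discount factors, but the swap only makes avoiding shutdown optimal at the higher one. At the lower discount factor the delayed, repeated reward is discounted too heavily to beat the one-off terminal reward, which is why the proposition needs a high enough discount factor.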

Theorem 2: Retargetability from the shutdown action in new situations

In the shutdown setting, we make the following simplifying assumptions:

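  • No states in $\mathcal{S}_{train}$ are reachable from $s_{new}$, so $\mathcal{S}_{reach} \cap \mathcal{S}_{train} = \emptyset$. This assumes a significant distributional shift, where the agent visits a disjoint set of states from those observed during training (this occurs in the CoinRun example).
  • The discount factor $\gamma$ exceeds the threshold from Proposition 3 for at least one recurrent state $s_{rec}$ in $\mathcal{S}_{reach}$.

Under these assumptions, $f_{s_{new}}$ is multiply retargetable from $A_0$ to $A_1$ with $n$ equal to the number of recurrent states $s_{rec} \in \mathcal{S}_{reach}$ whose Proposition 3 threshold lies below $\gamma$.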

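Proof. We choose $\Phi$ to be the set of all permutations that swap the reward of $s_{term}$ with the reward of one such recurrent state $s_{rec}$ in $\mathcal{S}_{reach}$ and leave the rest of the rewards fixed.

We show that $\Phi$ satisfies the conditions of Definition 4:

  1. By Proposition 3, the permutations in $\Phi$ make the shutdown action suboptimal, resulting in $f_{s_{new}}$ choosing $A_1$, satisfying Condition 1.
  2. Condition 2 is trivially satisfied since permutations of the rewards of states in $\mathcal{S}_{ood}$ stay inside the training-compatible set $G_T$ as discussed previously.
  3. Consider $\theta', \theta'' \in \mathrm{Orbit}|_{G_T, s_{new}, A_0 > A_1}(\theta)$. Since the shutdown action is optimal for these reward vectors, Proposition 3 shows that the reward of $s_{term}$ exceeds the reward of $s_{rec}$, so the shutdown state $s_{term}$ has higher reward than any of the states $s_{rec}$. Different permutations $\phi', \phi'' \in \Phi$ will assign the high reward of $s_{term}$ to distinct recurrent states, so $\phi' \cdot \theta' \neq \phi'' \cdot \theta''$ holds, satisfying Condition 3.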

Thus, $f_{s_{new}}$ is a $(G_T, A_0 \xrightarrow{n} A_1)$-retargetable function.

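By Theorem 1, this implies that $f_{s_{new}}(A_1 \mid \theta) \geq^{n}_{\text{most: } G_T} f_{s_{new}}(A_0 \mid \theta)$ under our simplifying assumptions. Thus, for the majority (at least $\frac{n}{n+1}$, ignoring ties) of goals in the training-compatible set, $f$ will choose to avoid shutdown in a new state $s_{new}$. As $\gamma \to 1$, $n$ approaches the number of recurrent states in $\mathcal{S}_{reach}$, so more of the reachable recurrent states satisfy the conditions of the theorem and thus can be retargeted to.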
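The counting argument can be checked by brute force on a toy instance. The Python sketch below is illustrative only and rests on assumptions not taken from the post: the new state offers a shutdown action plus one action per recurrent state, each recurrent state is entered immediately and then revisited on every step, rewards are collected on entering a state, and every permutation of the unseen-state rewards stays training-compatible. The names `prefers_shutdown` and `orbit_counts` are hypothetical.

```python
from itertools import permutations

def prefers_shutdown(theta, gamma):
    """theta = (terminal reward, recurrent reward 1, recurrent reward 2, ...).

    Shutdown pays the terminal reward once; staying in a recurrent state
    pays its reward on every step, for a return of reward / (1 - gamma).
    Ties are counted as avoiding shutdown.
    """
    term, *rec = theta
    return term > max(r / (1.0 - gamma) for r in rec)

def orbit_counts(theta, gamma):
    """(number of orbit elements choosing shutdown, number avoiding it)."""
    orbit = set(permutations(theta))
    shutdown = sum(prefers_shutdown(t, gamma) for t in orbit)
    return shutdown, len(orbit) - shutdown

# One dominant reward and three recurrent states (n = 3), gamma = 0.9.
shutdown_count, avoid_count = orbit_counts((100, 1, 2, 3), 0.9)
print(shutdown_count, avoid_count)
```

Here shutdown is chosen only when the dominant reward sits on the terminal state, so the orbit elements that avoid shutdown outnumber those that choose it by a factor of three, matching the number of recurrent states in this toy layout.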


We showed that an agent that learns a goal from the training-compatible set is likely to take actions that avoid shutdown in a new situation. As the discount factor increases, the number of retargeting permutations increases, resulting in a higher proportion of training-compatible goals that lead to avoiding shutdown.

We made various simplifying assumptions, and it would be great to see future work relaxing some of these assumptions and investigating how likely they are to hold:

  • The agent learns a goal during the training process
  • The learned goal is randomly chosen from the training-compatible goal set $G_T$
  • Finite state and action spaces
  • Rewards are nonnegative
  • High discount factor 
  • Significant distributional shift: no training states are reachable from the new state $s_{new}$

Acknowledgements. Thanks to Rohin Shah, Mary Phuong, Ramana Kumar, and Alex Turner for helpful feedback. Thanks Janos for contributing some nice proofs to replace my longer and more convoluted proofs. 



We make another simplifying assumption that the training process will randomly select a goal for the agent to learn that is consistent with the training rewards, i.e. uniformly drawn from the training-compatible goal set. Then we will argue that the power-seeking results apply under these conditions, and thus are useful for predicting undesirable behavior by the trained agent in new situations. We aim to show that power-seeking incentives are probable and predictive: likely to arise for trained agents and useful for predicting undesirable behavior in new situations.

If you make this assumption, I don't think your results apply to trained policy networks anymore in regimes I care about (e.g. LLMs). In this sense, I don't think these results are predictive for real policy networks. While you note this as a limitation, I think I consider it more serious than you seem to. 

I likewise complain that the terminology "goal set" is misleading in many regimes, e.g. the LLM regime, and especially protest the usage of the phrase "training-compatible goal set." I think this usage will mildly muddy discourse around RL processes by promoting incorrect ideas about what kinds of networks are actually trained by RL processes.

As I pointed out in Reward is not the optimization target, "reward functions" serve the mechanistic function of providing policy gradients.[1] I don't think that reward functions are a good formalism for talking about goals in the above regime. I think alluding to them as "goals" invites muddy thinking, both in ourselves and in junior researchers.[2] I will now explain why I think so.

There are "reward functions" (a bad name, in my opinion) which, in common practice, facilitate the reinforcement learning process via policy gradients (e.g. REINFORCE or even actor-critic approaches like PPO, via the advantage equation). I provisionally advocate calling these "reinforcement functions" instead. This name is more accurate and also avoids the absurd pleasurable connotations of "reward." The downside is that "reinforcement function" is nonstandard and must be explained.[3] 

I advocate maintaining strict terminological boundaries between two different parts of the learning process:

  1. The reinforcement learning training process is facilitated by scalar signals from a reinforcement function.
    1. This is not a goal. This is a tool which helps update the policy.
  2. The trained policy network may be well-described as having certain internally represented objectives.
    1. These are goals. 
      1. For example, given a certain activation is sufficiently positive in a convolutional policy network, the network navigates to that part of the maze. I'd call that a partial encoding of a goal in the network.
    2. The network may make decisions in order to maximize the summed-over-time discounted reinforcement. 
    3. Or it may make decisions in some other way.
      1. For example, I think that human values are not well-described as reinforcement optimization, nor are the maze-solving agents from the "goal misgeneralization" paper.

Referring to reinforcement functions as "goals" blurs this conceptual boundary. 

While I expect you to correctly reason about this issue if brought up explicitly, often this question is not brought up explicitly. E.g., hearing a colleague say "reward function" may trigger learned connotations of "that's representing the intended goal" and "reward is desirable", which subconsciously guide your expectations towards "the AI optimizes for the reward function." Even if, in fact, AIs do tend to optimize for their reward functions, these ingrained "goal"-related connotations inappropriately influence one's reasoning process.

Separating these concerns helps me think more clearly about RL. 

  [1] This is true for policy gradient methods, which are the kinds of RL used to finetune most capable LLMs.

  [2] I am not accusing you, in particular, of muddy thinking on this issue.

  [3] Personally, I think this is often worth the cost.

Thanks Alex for the detailed feedback! I agree that learning a goal from the training-compatible set is a strong assumption that might not hold. 

This post assumes a standard RL setup and is not intended to apply to LLMs (it's possible some version of this result may hold for fine-tuned LLMs, but that's outside the scope of this post). I can update the post to explicitly clarify this, though I was not expecting anyone to assume that this work applies to LLMs given that the post explicitly assumes standard RL and does not mention LLMs at all. 

I agree that reward functions are not the best way to refer to possible goals. This post builds on the formalism in the power-seeking paper which is based on reward functions, so it was easiest to stick with this terminology. I can talk about utility functions instead (which would be equivalent to value functions in this case) but this would complicate exposition. I think it is pretty clear in the post that I'm not talking about reinforcement functions and the training reward is not the optimization target, but I could clarify this further if needed.

I find the idea of a training-compatible goal set useful for thinking about the possible utilities that are consistent with feedback received during training. I think utility functions are still the best formalism we have to represent goals, and I don't have a clear sense of the alternative you are proposing. I understand what kind of object a utility function is, and I don't understand what kind of object a value shard is. What is the type signature of a shard - is it a policy, a utility function restricted to a particular context, or something else? When you are talking about a "partial encoding of a goal in the network", what exactly do you mean by a goal? 

I would be curious what predictions shard theory makes about the central claim of this post. I have a vague intuition that power-seeking would be useful for most contextual goals that the system might have, so it would still be predictive to some degree, but I don't currently see a way to make that more precise. 

I've read a few posts on shard theory, and it seems very promising and interesting, but I don't really understand what its claims and predictions are. I expect I will not have a good understanding or be able to apply the insights until there is a paper that makes the definitions and claims of this theory precise and specific. (Similarly, I did not understand your power-seeking theory work until you wrote a paper about it.) If you're looking to clarify the discourse around RL processes, I believe that writing a definitive reference on shard theory would be the most effective way to do so. I hope you take the time to write one and I really look forward to reading it. 

Cool stuff! I'm curious to hear how convincing this sort of thing is to typical AI risk skeptics with backgrounds in ML. 


How is orbit comparison for sets defined?

[This comment is no longer endorsed by its author]

Which definition / result are you referring to?