One particular example of this phenomenon that comes to mind:
In (traditional) chess-playing software, generally moves are selected using a combination of search and evaluation, where the search is (usually) some form of minimax with alpha-beta pruning, and the evaluation function is used to assign a value estimate to leaf nodes in the tree, which are then propagated to the root to select a move.
Typically, the evaluation function is designed by humans (although recent developments have changed that second part somewhat) to reflect meaningful features of chess understanding. Obviously, this is useful in order to provide a more accurate estimate for leaf node values, and hence more accurate move selection. What's less obvious is what happens if you give a chess engine a random evaluation function, i.e. one that assigns an arbitrary value to any given position.
This has in fact been done, and the relevant part of the experiment is what happened when the resulting engine was made to play against itself at various search depths. Naively, you'd expect that since the evaluation function has no correlation with the actual value of the position, the engine would make more-or-less random moves regardless of its search depth--but in fact, this isn't the case: even with a completely random evaluation function, higher search depths consistently beat lower search depths.
The reason for this, once the games were examined, is that the high-depth version of the engine seemed to consistently make moves that gave its pieces more mobility, i.e. gave it more legal moves in subsequent positions. This is because, given that the evaluation function assigns arbitrary values to leaf nodes, the "most extreme" value (which is what minimax cares about) will be uniformly distributed among the leaves, and hence branches with more leaves are more likely to be selected. And since mobility is in fact an important concept in real chess, this tendency manifests in a way that favors the higher-depth player.
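The leaf-counting argument can be checked numerically. The sketch below is a deliberate simplification and not the original experiment: it uses a max-only tree (no opposing player), draws leaf values i.i.d. from a uniform distribution, and the function names `pick_branch` and `selection_frequencies` are hypothetical. Under those assumptions, a branch should be selected with probability proportional to its number of leaves.

```python
import random

def pick_branch(leaf_counts, rng):
    """Return the index of the branch that contains the highest random leaf value."""
    best_value, best_branch = -1.0, None
    for branch, count in enumerate(leaf_counts):
        # Best leaf in this branch under an i.i.d. uniform "evaluation function".
        branch_best = max(rng.random() for _ in range(count))
        if branch_best > best_value:
            best_value, best_branch = branch_best, branch
    return best_branch

def selection_frequencies(leaf_counts, trials=20000, seed=0):
    """Estimate how often each branch is chosen over many random evaluations."""
    rng = random.Random(seed)
    wins = [0] * len(leaf_counts)
    for _ in range(trials):
        wins[pick_branch(leaf_counts, rng)] += 1
    return [w / trials for w in wins]

# A branch with 10 leaves versus a branch with 2 leaves.
freqs = selection_frequencies([2, 10])
print(freqs)  # the second entry should be close to 10 / 12
```

A real engine alternates max and min layers, so this only illustrates the counting intuition, not the full minimax behavior.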
At the time I first learned about this experiment, it struck me as merely a fascinating coincidence: the fact that selecting from a large number of nodes distributed non-uniformly leads to a natural approximation of the "mobility" concept was interesting, but nothing more. But the standpoint of instrumental convergence reveals another perspective: the concept chess players call "mobility" is actually important precisely because it emerges even in such a primitive system, one almost entirely divorced from the rules and goals of the game. In short, mobility is an instrumentally useful resource--not just in chess, but in all chess-like games, simply because it's useful to have more legal moves no matter what your goal is.
(This is actually borne out in games of a chess variant called "suicide chess", where the goal is to force your opponent to checkmate you. Despite having a completely opposite terminal goal to that of regular chess, games of suicide chess actually strongly resemble games of regular chess, at least during the first half. The reason for this is simply that in both games, the winning side needs to build up a dominant position before being able to force the opponent to do anything, whether that "anything" be delivering checkmate or being checkmated. Once that dominant position has been achieved, you can make use of it to attain whatever end state you want, but the process of reaching said dominant position is the same across variants.)
Additionally, there are some analogies to be drawn to the three types of utility function discussed in the post:
Depth-1 search is analogous to utility functions over action-observation histories (AOH), in that its move selection criterion depends only on the current position. With a random (arbitrary) evaluation function, move selection in this regime is equivalent to drawing randomly from a uniform distribution across the set of immediately available next moves, with no concern for what happens after that. There is no tendency for depth-1 players to seek mobility.
Depth-2 search is analogous to a utility function in a Markov decision process (MDP), in that its move selection criterion depends on the assessment of the immediate next time-step. With a random (arbitrary) evaluation function, move selection in this regime is equivalent to drawing randomly from a uniform distribution across the set of possible replies to one's immediately available moves, which in turn is equivalent to drawing from a distribution over one's next moves weighted by the number of replies. There is a mild tendency for depth-2 players to seek mobility.
Finally, maximum-depth search would be analogous to the classical utility function over observation histories (OH). With a random (arbitrary) evaluation function, move selection in this regime almost always picks whichever branch of the tree leads to the maximum possible number of terminal states. What this looks like in principle is unknown, but empirically we see that high-depth players have a strong tendency to seek mobility.
Nice!
In a couple places you say things that seem like they assume the probability distribution over utility functions is uniform. Is this right, or am I misinterpreting you? See e.g.
Choose any atom in the universe. Uniformly randomly select another atom in the universe. It's about 10^117 times more likely that these atoms are the same, than that a utility function incentivizes "dying" instead of flipping pixel 2 at t=1.
...
Optimal policies for u-AOH will tend to look like random twitching. For example, if you generate a u-AOH by uniformly randomly assigning each AOH utility from the unit interval [0,1], there's no predictable regularity to the optimal actions for this utility function.
If instead we used a kolmogorov prior, or a minimal-circuit-prior, or something else more realistic, then these results would go out the window?
Typically the theorems say things like "for every reward function there exists N other reward functions that <seek power, avoid dying, etc>"; the theorems aren't themselves talking about probability distributions.
The first example you give does only make sense given a uniform distribution, I believe.
In the second example, when Alex uses the phrase "tends to", it means something like "there are N times as many reward functions that <do the thing> as reward functions that <don't do the thing>". But yes, if you want to interpret this as a probabilistic statement, you would use a uniform distribution.
If instead we used a kolmogorov prior, or a minimal-circuit-prior, or something else more realistic, then these results would go out the window?
EDIT: This next paragraph is wrong; I'm leaving it in for the record. I still think it is likely that similar results would hold in practice though the case is not as strong as I thought (and the effect is probably also not as strong as I thought).
They would change quantitatively, but the upshot would probably be similar. For example, for the Kolmogorov prior, you could prove theorems like "for every reward function that <doesn't do the thing>, there are N reward functions that <do the thing> that each have at most a small constant more complexity" (since you can construct them by taking the original reward function and then apply the relevant permutation / move through the orbit, and that second step has constant K-complexity). Alex sketches out a similar argument in this post.
They would change quantitatively, but the upshot would probably be similar. For example, for the Kolmogorov prior, you could prove theorems like "for every reward function that <doesn't do the thing>, there are N reward functions that <do the thing> that each have at most a small constant more complexity" (since you can construct them by taking the original reward function and then apply the relevant permutation / move through the orbit, and that second step has constant K-complexity). Alex sketches out a similar argument in this post.
I don't see how this works. If you need bits to specify the permutation for that "second step", the probability of each of the N reward functions will be smaller than the original one's by a factor of . So you need to be smaller than which is impossible?
So you need to be smaller than which is impossible?
You're right, in my previous comment it should be "at most a small constant more complexity + ", to specify the number of times that the permutation has to be applied.
I still think the upshot is similar; in this case it's "power-seeking is at best a little less probable than non-power-seeking, but you should really not expect that the bound is tight".
I still don't see how this works. The "small constant" here is actually the length of a program that needs to contain a representation of the entire MDP (because the program needs to simulate the MDP for each possible permutation). So it's not a constant; it's an unbounded integer.
Even if we restrict the discussion to a given very-simple-MDP, the program needs to contain way more than 100 bits (just to represent the MDP + the logic that checks whether a given permutation satisfies the relevant condition). So the probability of the POWER-seeking reward functions that are constructed here is smaller than $2^{-100}$ times the probability of the non-POWER-seeking reward functions. [EDIT: I mean, the probability of the constructed reward functions may happen to be larger, but the proof sketch doesn't show that it is.]
(As an aside, the permutations that we're dealing with here are equal to their own inverse, so it's not useful to apply them multiple times.)
Yes, good point. I retract the original claim.
As an aside, the permutations that we're dealing with here are equal to their own inverse, so it's not useful to apply them multiple times.
(You're right, what you do here is you search for the kth permutation satisfying the theorem's requirements, where k is the specified number.)
contain a representation of the entire MDP (because the program needs to simulate the MDP for each possible permutation)
We aren't talking about MDPs, we're talking about a broad class of environments which are represented via joint probability distributions over actions and observations. See post title.
it's an unbounded integer.
I don't follow.
the program needs to contain way more than 100 bits
See the arguments in the post Rohin linked for why this argument is gesturing at something useful even if it takes some more bits.
But IMO the basic idea in this case is, you can construct reasonably simple utility functions like "utility 1 if history has the agent taking action at time step given action-observation history prefix , and 0 otherwise." This is reasonably short, and you can apply it for all actions and time steps.
Sure, the complexity will vary a little bit (probably later time steps will be more complex), but basically you can produce reasonably simple programs which make any sequence of actions optimal. And so I agree with Rohin that simplicity priors on u-AOH will quantitatively, but not qualitatively, affect the conclusions for the generic u-AOH case. [EDIT: this reasoning is different from the one Rohin gave, TBC]
The results are not centrally about the uniform distribution. The uniform distribution result is more of a consequence of the (central) orbit result / scaling law for instrumental convergence. I gesture at the uniform distribution to highlight the extreme strength of the statistical incentives.
Edit, 5/16/23: I think this post is beautiful, correct in its narrow technical claims, and practically irrelevant to alignment. This post presents an unrealistic picture of the role of reward functions in reinforcement learning, conflating "utility" with "reward" in a type-incorrect fashion. Reward functions are not "goals", real-world policies are not "optimal", and the mechanistic function of reward is (usually) to provide policy gradients to update the policy network.
I expect this post to harm your alignment research intuitions unless you've already inoculated yourself by deeply internalizing and understanding Reward is not the optimization target. If you're going to read one alignment post I've written, read that one.
Follow-up work (Parametrically retargetable decision-makers tend to seek power) moved away from optimal policies and treated reward functions more realistically.
A year ago, I thought it would be really hard to generalize the power-seeking theorems from Markov decision processes (MDPs); the MDP case seemed hard enough. Without assuming the agent can see the full state, while letting utility functions do as they please – this seemed like asking for trouble.
Once I knew what to look for, it turned out to be easy – I hashed out the basics during half an hour of conversation with John Wentworth. The theorems were never about MDPs anyways; the theorems apply whenever the agent considers finite sets of lotteries over outcomes, assigns each outcome real-valued utility, and maximizes expected utility.
Thanks to Rohin Shah, Adam Shimi, and John Wentworth for feedback on drafts of this post.
Instrumental convergence can get really, really strong
At each time step $t$, the agent takes one of finitely many actions $a_t \in A$, and receives one of finitely many observations $o_t \in O$ drawn from the conditional probability distribution $E(o_t \mid a_1 o_1 \ldots a_t)$, where $E$ is the environment (see Footnote: environment). There is a finite time horizon $T$. Each utility function $u : O^T \to \mathbb{R}$ maps each complete observation history to a real number (note that $u$ can be represented as a vector in the finite-dimensional vector space $\mathbb{R}^{|O|^T}$). From now on, u-OH stands for "utility function(s) over observation histories."
First, let's just consider a deterministic environment. Each time step, the agent observes a black-and-white image (n×n) through a webcam, and it plans over a 50-step episode (T=50). Each time step, the agent acts by choosing a pixel to bit-flip for the next time step.
And let's say that if the agent flips the first pixel for its first action, it "dies": its actions no longer affect any of its future observations past time step t=2. If the agent doesn't flip the first pixel at t=1, it's able to flip bits normally for all T=50 steps.
Do u-OH tend to incentivize flipping the first pixel over flipping the second pixel, vice versa, or neither?
If the agent flips the first bit, it's locked into a single trajectory. None of its actions matter anymore.
But if the agent flips the second bit – this may be suboptimal for a utility function, but the agent still has lots of choices remaining. In fact, it still can induce $(n \times n)^{T-1}$ observation histories. If $n=100$ and $T=50$, then that's $(100 \times 100)^{49} = 10^{196}$ observation histories. Probably at least one of these yields greater utility than the shutdown-history utility.
And indeed, we can apply the scaling law for instrumental convergence to conclude that for every u-OH, at least $\frac{10^{196}}{10^{196}+1}$ of its permuted variants (weakly) prefer flipping the second pixel at $t=1$, over flipping the first pixel at $t=1$.
Choose any atom in the universe. Uniformly randomly select another atom in the universe. It's about $10^{117}$ times more likely that these atoms are the same, than that a utility function incentivizes "dying" instead of flipping pixel 2 at $t=1$.
(The general rule will be: for every u-OH, at least $\frac{(n \times n)^{T-1}}{(n \times n)^{T-1}+1}$ of its permuted variants weakly prefer flipping the second pixel at $t=1$, over flipping the first pixel at $t=1$. And for almost all u-OH, you can replace 'weakly' with 'strictly.')
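The arithmetic behind this bound can be reproduced with exact integers. This is a small sketch under the pixel-flipping assumptions stated above; the helper names `reachable_histories` and `variant_bound` are hypothetical and not part of any formalism in the post.

```python
from fractions import Fraction

def reachable_histories(n, T):
    """Observation histories still reachable after a non-fatal first action:
    n*n pixel choices at each of the remaining T-1 steps."""
    return (n * n) ** (T - 1)

def variant_bound(k):
    """Lower bound k / (k + 1) on the fraction of permuted variants that
    weakly prefer the action keeping k histories reachable over the one keeping 1."""
    return Fraction(k, k + 1)

k = reachable_histories(100, 50)
print(k == 10 ** 196)              # exact integer comparison
print(float(1 - variant_bound(k))) # fraction of variants that may prefer "dying"
```

Using `Fraction` keeps the bound exact; converting it to a float would simply round it to 1.0.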
Formal justification
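The power-seeking results hinge on the probability of certain linear functionals being "optimal." For example, let $A, B, C \subsetneq \mathbb{R}^n$ be finite sets of vectors (see Footnote: finite), and let $\mathcal{D}_{\text{any}}$ be any probability distribution over $\mathbb{R}^n$.
Definition: Optimality probability of a linear functional set. The optimality probability of $A$ relative to $C$ under distribution $\mathcal{D}_{\text{any}}$ is
$$p_{\mathcal{D}_{\text{any}}}(A \geq C) := \Pr_{\mathbf{r} \sim \mathcal{D}_{\text{any}}}\left(\max_{\mathbf{a} \in A} \mathbf{a}^\top \mathbf{r} \geq \max_{\mathbf{c} \in C} \mathbf{c}^\top \mathbf{r}\right).$$
If vectors represent lotteries over outcomes (where each outcome has its own entry), then we can say that: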
Nothing here has anything to do with a Markov decision process, or the world being finite, or fully observable, or whatever. Fundamentally, the power-seeking theorems were never about MDPs – they were secretly about the probability that a set A of linear functionals is optimal, with respect to another set C. MDPs were just a way to relax the problem.
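To make the definition concrete, here is a minimal Monte Carlo sketch of the optimality probability. Everything specific in it is an assumption for illustration: the distribution is taken to be i.i.d. uniform on the unit cube, the sets are small sets of unit vectors (deterministic outcomes), and the names `optimality_probability` and `uniform_cube` are hypothetical.

```python
import random

def optimality_probability(A, C, sample_r, trials=20000, seed=0):
    """Estimate Pr_r( max_{a in A} a.r >= max_{c in C} c.r ) by sampling r."""
    rng = random.Random(seed)

    def dot(v, r):
        return sum(vi * ri for vi, ri in zip(v, r))

    hits = 0
    for _ in range(trials):
        r = sample_r(rng)
        if max(dot(a, r) for a in A) >= max(dot(c, r) for c in C):
            hits += 1
    return hits / trials

def uniform_cube(rng, n=3):
    """Assumed distribution: each outcome's utility is i.i.d. uniform on [0, 1]."""
    return [rng.random() for _ in range(n)]

A = [(0, 1, 0), (0, 0, 1)]  # two outcomes reachable via one option
C = [(1, 0, 0)]             # a single outcome reachable via the other
estimate = optimality_probability(A, C, uniform_cube)
print(estimate)  # should be close to 2/3 under the uniform assumption
```

With two outcomes against one, the larger set contains the best outcome two times out of three, which matches the k/(k+1) pattern used throughout the post.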
In terms of the pixel-flipping environment:
For every u-OH, at least $\frac{10^{196}}{10^{196}+1}$ of its permuted variants (weakly) prefer flipping the second pixel at $t=1$, over flipping the first pixel at $t=1$.
Beyond survival-seeking
I often give life-vs-death examples because they're particularly easy to reason about. But the theorems apply to more general cases of more-vs-less control.
For example, if $a_1$ restricts the agent to two effective actions at each time step (it can only flip one of the first two pixels) – instead of "killing" the agent, then $a_2$ is still convergently instrumental over $a_1$. There are $2^{49} \approx 5.6 \times 10^{14} \geq 10^{14}$ observation histories available after taking action $a_1$, and so these can be embedded at least $\frac{10^{196}}{10^{14}} = 10^{182}$ times into the observation histories available after $a_2$. Then for every u-OH, at least $\frac{10^{182}}{10^{182}+1}$ of its permuted variants (weakly) prefer flipping the second pixel at $t=1$, over flipping the first pixel at $t=1$.
Instrumental Convergence Disappears For Utility Functions Over Action-Observation Histories
Let's consider utility functions over action-observation histories (u-AOH).
Since each utility function is over an AOH, each path through the tree is assigned a certain amount of utility. But when the environment is deterministic, it doesn't matter what the agent observes at any point in time – all that matters is which path is taken through the tree. Without further assumptions, u-AOH won't tend to assign higher utility to one subtree than to another.
More formally, for any two actions $a_1$ and $a_2$, let $\phi$ be a permutation over AOH which transposes the histories available after $a_1$ with the histories available after $a_2$ (there's an equal number of histories for each action, due to the regularity of the tree – you can verify this by inspection).
For every u-AOH $u$, suppose $a_1$ is strictly $u$-optimal over $a_2$. Under the permuted utility function $\phi \cdot u$, $a_2$ is strictly optimal over $a_1$, since $\phi$ swaps $a_1$'s strictly $u$-optimal history with $a_2$'s strictly $u$-suboptimal histories.
Symmetrically, $\phi$ works the other way around ({$a_2$ strictly optimal} → {$a_1$ strictly optimal}). Therefore, for every utility function $u$, the # of variants which strictly prefer $a_1$ over $a_2$, is equal to the # of variants strictly preferring $a_2$ over $a_1$:
$$\left|\{\text{variants of } u \text{ s.t. } a_1 \text{ strictly optimal over } a_2\}\right| = \left|\{\text{variants of } u \text{ s.t. } a_2 \text{ strictly optimal over } a_1\}\right|.$$
While I haven't been writing in the "definition-theorem-corollary" style, the key claims are just corollaries of the scaling law of instrumental convergence. They're provably true. (I'm just not writing up the math here because it's annoying to define all the relevant quantities in a nice way that respects existing formalisms.)
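The counting claim can be brute-forced on a toy case. The sketch below is an illustration under assumptions, not the formal argument: utilities are four distinct made-up numbers, each ordering of them over the histories counts as one "variant", and `count_strict_preferences` is a hypothetical helper. The `split` argument controls how many histories sit under the first action.

```python
from itertools import permutations

def count_strict_preferences(utilities, split):
    """Count permuted variants in which each action is strictly optimal.

    The first `split` slots are the histories available after a1,
    the remaining slots are the histories available after a2.
    """
    a1_wins = a2_wins = 0
    for variant in permutations(utilities):
        best_a1 = max(variant[:split])
        best_a2 = max(variant[split:])
        if best_a1 > best_a2:
            a1_wins += 1
        elif best_a2 > best_a1:
            a2_wins += 1
    return a1_wins, a2_wins

values = [0.1, 0.4, 0.7, 0.9]

# u-AOH-like case: both actions lead to the same number of histories.
print(count_strict_preferences(values, split=2))  # (12, 12)

# u-OH-like case: a1 leads to a single history, a2 leads to three.
print(count_strict_preferences(values, split=1))  # (6, 18)
```

In the balanced case the two counts match, while in the lopsided case 18 of 24 variants (that is, 3/4) prefer the action with three reachable histories.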
And even if the environment is stochastic, I think that there won't be any kind of interesting instrumental convergence. The theorems let us reason about that case, but their applicability depends on the details of the stochasticity, and so I won't talk about that more here.
Conclusion: Optimal policies for u-AOH will tend to look like random twitching. For example, if you generate a u-AOH by uniformly randomly assigning each AOH utility from the unit interval [0,1], there's no predictable regularity to the optimal actions for this utility function. In this setting and under our assumptions, there is no instrumental convergence without further structural assumptions.
How Structural Assumptions On Utility Affect Instrumental Convergence
Consider the $n=2$ pixel-flipping case (with $T=50$ still). Action $a_1$ still leads to a single OH, while $a_2$ leads to $(2 \times 2)^{49} = 4^{49} \approx 10^{29}$ OHs. So we have instrumental convergence for $\frac{10^{29}}{10^{29}+1}$ of all u-OH variants.
Let's model the pixel-flipping environment as a Markov decision process (MDP), with both the time-step and alive/dead status observed at each time step in order to ensure full observability, and the final time-step observations being terminal states where the agent stays forever. Dying allows the agent access to 1 terminal state: the observation 1/0/0/0 (dead). But surviving via $a_2$ lets the agent access $2^4 = 16$ terminal states (all 16 binary strings of length 4, with 'alive' appended to the end).
For each reward function over states, only $\frac{16}{16+1} = \frac{16}{17}$ of its permuted variants will incentivize not dying at $t=1$ (considering policies which maximize average per-timestep reward). This is a lot looser than the bound for u-OH. What gives?
MDPs assume that utility functions have a lot of structure: the utility of a history is time-discounted additive over observations. Basically, $u(a_1 o_1 a_2 o_2 \ldots) = \sum_{t=1}^{\infty} \gamma^{t-1} R(o_t)$, for some $\gamma \in [0,1)$ and reward function $R : O \to \mathbb{R}$ over observations. And because of this structure, the agent's average per-timestep reward is controlled by the last observation it sees. There are exponentially fewer last observations than there are observation histories. Therefore, in this situation, instrumental convergence is exponentially weaker for reward functions than for arbitrary u-OH.
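A small numeric sketch shows why the last observation dominates. It assumes the setup described above (the final observation is an absorbing state seen at every later step) and uses normalization by $(1-\gamma)$ with $\gamma$ close to 1 as a stand-in for average per-timestep reward; the function names and the reward numbers are hypothetical.

```python
def discounted_utility(prefix_rewards, terminal_reward, gamma):
    """sum_{t>=1} gamma^(t-1) * R(o_t), where the final observation repeats forever."""
    # Rewards for the finitely many observations before the terminal state.
    total = sum(gamma ** t * r for t, r in enumerate(prefix_rewards))
    # Geometric tail: the terminal observation is seen at every later step.
    tail = gamma ** len(prefix_rewards) * terminal_reward / (1 - gamma)
    return total + tail

def normalized_utility(prefix_rewards, terminal_reward, gamma):
    """Rescale by (1 - gamma) so that values are comparable to per-step reward."""
    return (1 - gamma) * discounted_utility(prefix_rewards, terminal_reward, gamma)

# A poor path ending in a good terminal state versus a good path ending in a poor one.
print(normalized_utility([0.0, 0.0, 0.0], 1.0, gamma=0.999))
print(normalized_utility([1.0, 1.0, 1.0], 0.0, gamma=0.999))
```

With $\gamma$ near 1 the first value is close to 1 and the second is close to 0, so the terminal observation's reward decides the comparison regardless of the path taken to reach it.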
This suggests that rolling a random u-OH for AIXI might be far more dangerous than rolling a random reward function for an optimal reinforcement learner.
Structural assumptions on utility really do matter when it comes to instrumental convergence:
Environmental structure can cause instrumental convergence, but (the absence of) structural assumptions on utility can make instrumental convergence go away (for optimal agents).
Notes
Conclusion
Appendix: Tracking key limitations of the power-seeking theorems
Time to cross another item off of the list from last time; the theorems:
Re 3), in the setting of this post, when the observations are deterministic, the theorems will always apply. (You can always involute one set of unit vectors into another set of unit vectors in the observation-history vector space.)
Another consideration is that when I talk about "power-seeking in the situations covered by my theorems", the theorems don't necessarily show that gaining social influence or money is convergently instrumental. I think that these "resources" are downstream of formal-power, and will eventually end up being understood in terms of formal-power – but the current results don't directly prove that such high-level subgoals are convergently instrumental.
Footnote finite: I don't think we need to assume finite sets of vectors, but things get a lot harder and messier when you're dealing with sup instead of max. It's not clear how to define the non-dominated elements of an infinite set, for example, and so a few key results break. One motivation for finite being enough is: in real life, a finite mind can only consider finitely many outcomes anyways, and can only plan over a finite horizon using finitely many actions. This is just one consideration, though.
Footnote environment: For simplicity, I just consider environments which are joint probability distributions over actions and observations. This is much simpler than the lower semicomputable chronological conditional semimeasures used in the AIXI literature, but it suffices for our purposes, and the theory could be extended to LSCCCSs if someone wanted to.