For the 2019 LessWrong review, I've completely rewritten my post "Seeking Power is Often Robustly Instrumental in MDPs". The post explains the key insights of my theorems on power-seeking and instrumental convergence / robust instrumentality. The new version is more substantial, more nuanced, and better motivated, without sacrificing the broad accessibility or the cute drawings of the original.
Big thanks to diffractor, Emma Fickel, Vanessa Kosoy, Steve Omohundro, Neale Ratzlaff, and Mark Xu for reading / giving feedback on this new version.
Here's my review, which I also posted as a comment.
One year later, I remain excited about this post, from its ideas, to its formalisms, to its implications. I think it helps us formally understand part of the difficulty of the alignment problem. This formalization of power and the Attainable Utility Landscape have together given me a novel frame for understanding alignment and corrigibility.
Since last December, I’ve spent several hundred hours expanding the formal results and rewriting the paper; I’ve generalized the theorems, added rigor, and taken great pains to spell out what the theorems do and do not imply. For example, the main paper is 9 pages long; in Appendix B, I further dedicated 3.5 pages to exploring the nuances of the formal definition of ‘power-seeking’ (Definition 6.1).
However, there are a few things I wish I’d gotten right the first time around. Therefore, I’ve restructured and rewritten much of the post. Let’s walk through some of the changes.
The first change is terminological: ‘instrumentally convergent’ has become ‘robustly instrumental’. Like many good things, this shift was prompted by a critique from Andrew Critch.
Roughly speaking, this work considered an action to be ‘instrumentally convergent’ if it’s very probably optimal, with respect to a probability distribution on a set of reward functions. For the formal definition, see Definition 5.8 in the paper.
This definition is natural. You can even find it echoed by Tony Zador in the Debate on Instrumental Convergence:
“So i would say that killing all humans is not only not likely to be an optimal strategy under most scenarios, the set of scenarios under which it is optimal is probably close to a set of measure 0.”
(Zador uses “set of scenarios” instead of “set of reward functions”, but he is implicitly reasoning: “with respect to my beliefs about what kind of objective functions we will implement and what states the agent will confront in deployment, I predict that deadly actions have a negligible probability of being optimal.”)
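To make "very probably optimal with respect to a distribution on reward functions" concrete, here is a minimal Monte Carlo sketch. Everything in it is an assumption made for illustration: the five-state deterministic MDP, the choice of i.i.d. Uniform[0, 1] state rewards, the discount rate, and the names `TRANSITIONS`, `optimal_values`, and `prob_right_optimal` are all hypothetical, and the code does not reproduce Definition 5.8 or any other construction from the paper. It only estimates how often one action is optimal at the start state as the reward function is resampled.

```python
import random

# Hypothetical toy MDP, not taken from the post or the paper.
# TRANSITIONS[state][action] is the (deterministic) next state.
TRANSITIONS = {
    0: {"left": 1, "right": 2},  # start state
    1: {"stay": 1},              # dead end: a single absorbing state
    2: {"up": 3, "down": 4},     # hub with two absorbing successors
    3: {"stay": 3},
    4: {"stay": 4},
}


def optimal_values(reward, gamma, tol=1e-9, max_sweeps=1000):
    """Value iteration, where reward[s] is collected on entering state s."""
    values = {s: 0.0 for s in TRANSITIONS}
    for _ in range(max_sweeps):
        new_values = {
            s: max(reward[nxt] + gamma * values[nxt] for nxt in actions.values())
            for s, actions in TRANSITIONS.items()
        }
        delta = max(abs(new_values[s] - values[s]) for s in TRANSITIONS)
        values = new_values
        if delta < tol:
            break
    return values


def prob_right_optimal(gamma=0.9, samples=2000, seed=0):
    """Estimate how often 'right' is strictly optimal at state 0 when each
    state's reward is drawn independently from Uniform[0, 1] (an assumption)."""
    rng = random.Random(seed)
    right_wins = 0
    for _ in range(samples):
        reward = {s: rng.random() for s in TRANSITIONS}
        values = optimal_values(reward, gamma)
        # Action values at the start state for this sampled reward function.
        q_left = reward[1] + gamma * values[1]
        q_right = reward[2] + gamma * values[2]
        right_wins += q_right > q_left
    return right_wins / samples


if __name__ == "__main__":
    print(f"P('right' is optimal at the start state) ~ {prob_right_optimal():.3f}")
```

In this particular toy, 'right' is optimal exactly when (1 − γ)·r₂ + γ·max(r₃, r₄) > r₁, so under the assumed uniform distribution its probability of being optimal is (1 − γ)/2 + 2γ/3, which is 0.65 at γ = 0.9. The estimate should land near that value. The number depends on both the environment and the reward distribution, so it should be read as an illustration of the definition rather than as evidence for any general claim.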
While discussing this definition of ‘instrumental convergence’, Andrew asked me: “what, exactly, is doing the converging? There is no limiting process. Optimal policies just are.”
It would be more appropriate to say that an action is ‘instrumentally robust’ instead of ‘instrumentally convergent’: the instrumentality is robust to the choice of goal. However, I found this to be ambiguous: ‘instrumentally robust’ could be read as “the agent is being robust for instrumental reasons.”
I settled on ‘robustly instrumental’, rewriting the paper’s introduction as follows:
An action is said to be instrumental to an objective when it helps achieve that objective. Some actions are instrumental to many objectives, making them robustly instrumental. The so-called instrumental convergence thesis is the claim that agents with many different goals, if given time to learn and plan, will eventually converge on exhibiting certain common patterns of behavior that are robustly instrumental (e.g. survival, accessing usable energy, access to computing resources). Bostrom et al.'s instrumental convergence thesis might more aptly be called the robust instrumentality thesis, because it makes no reference to limits or converging processes:
“Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent's goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent agents.”
Some authors have suggested that gaining power over the environment is a robustly instrumental behavior pattern on which learning agents generally converge as they tend towards optimality. If so, robust instrumentality presents a safety concern for the alignment of advanced reinforcement learning systems with human society: such systems might seek to gain power over humans as part of their environment. For example, Marvin Minsky imagined that an agent tasked with proving the Riemann hypothesis might rationally turn the planet into computational resources.
This choice is not costless: many are already acclimated to the existing ‘instrumental convergence.’ It even has its own Wikipedia page. Nonetheless, if there ever were a time to make the shift, that time would be now.
The original post claimed that “optimal policies tend to seek power”, period. This was partially based on a result which I’d incorrectly interpreted. Vanessa Kosoy and Rohin Shah pointed out this error to me, and I quickly amended the original post and posted a follow-up explanation.
At the time, I wondered whether the claim might still hold in general via some other result. The answer is ‘no’: it isn’t always more probable for optimal policies to navigate towards states which give them more control over the future. Here’s a surprising counterexample which doesn’t even depend on my formalization of ‘power.’
No matter how you cut it, the relationship just doesn’t hold in general. Instead, the post now sketches sufficient conditions under which power-seeking behavior is more probably optimal; the paper proves that these conditions suffice.
If you want to leave a comment, please don't do it here: leave it on the original post.