Nominated Posts for the 2019 Review

Posts need at least 2 nominations to continue into the Review Phase.
Nominate posts that you have personally found useful and important.
Sort by: fewest nominations

2019 Review Discussion

This is a linkpost for

In 2008, Steve Omohundro's foundational paper The Basic AI Drives conjectured that superintelligent goal-directed AIs might be incentivized to gain significant amounts of power in order to better achieve their goals. Omohundro's conjecture bears out in toy models, and the supporting philosophical arguments are intuitive. In 2019, the conjecture was even debated by well-known AI researchers.

Power-seeking behavior has been heuristically understood as an anticipated risk, but not as a formal phenomenon with a well-understood cause. The goal of this post (and the accompanying paper, Optimal Policies Tend to Seek Power) is to change that.


It’s 2008, the ancient wild west of AI alignment. A few people have started thinking about questions like “if we gave an AI a utility function over world states, and it actually maximized that...

4Matt MacDermott2mo
I've been thinking about whether these results could be interpeted pretty differently under different branding. The current framing, if I understand it correctly, is something like, 'Powerseeking is not desirable. We can prove that keeping your options open tends to be optimal and tends to meet a plausible definition of powerseeking. Therefore we should expect RL agents to seek power, which is bad.' An alternative framing would be, 'Making irreversible changes is not desirable. We can prove that keeping your options open tends to be optimal. Therefore we should not expect RL agents to make irreversible changes, which is good.' I don't think that the second framing is better than the first, but I do think that if you had run with it instead then lots of people would be nodding their heads and feeling reassured about corrigibility, instead of feeling like their views about instrumental convergence had been confirmed. That makes me feel like we shouldn't update our views too much based on formal results that leave so much room for interpretation. If I showed a bunch of theorems about MDPs, with no exposition, to two people with different opinions about alignment, I expect they might come to pretty different conclusions about what they meant.  What do you think? (To be clear I think this is a great post and paper, I just worry that there are pitfalls when it comes to interpretation.)  

Consider an agent navigating a tree MDP, with utility on the leaf nodes. At any internal node in the tree, ~most utility functions will have the agent retain options by going towards the branch with the most leaves. But all policies use up all available options -- they navigate to a leaf with no more power.

I agree that we shouldn't update too hard for other reasons. EG this post's focus on optimal policies seems bad because reward is not the optimization target.

I've been thinking more about partial agency. I want to expand on some issues brought up in the comments to my previous post, and on other complications which I've been thinking about. But for now, a more informal parable. (Mainly because this is easier to write than my more technical thoughts.)

This relates to oracle AI and to inner optimizers, but my focus is a little different.


Suppose you are designing a new invention, a predict-o-matic. It is a wonderous machine which will predict everything for us: weather, politics, the newest advances in quantum physics, you name it. The machine isn't infallible, but it will integrate data across a wide range of domains, automatically keeping itself up-to-date with all areas of science and current events. You fully expect that...

If someone had a strategy that took two years, they would have to over-bid in the first year, taking a loss. But then they have to under-bid on the second year if they're going to make a profit, and--"

"And they get undercut, because someone figures them out."

I think one could imagine scenarios where the first trader can use their influence in the first year to make sure they are not undercut in the second year, analogous to the prediction market example. For instance, the trader could install some kind of encryption in the software that this company use... (read more)

[Epistemic status: Strong claims vaguely stated and weakly held. I expect that writing this and digesting feedback on it will lead to a much better version in the future. EDIT: So far this has stood the test of time. EDIT: As of September 2020 I think this is one of the most important things to be thinking about.]

This post attempts to generalize and articulate a problem that people have been thinking about since at least 2016. [Edit: 2009 in fact!] In short, here is the problem:

Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible. When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in (building AGI and letting it self-modify) may be...

18Eliezer Yudkowsky9mo
IMO, commitment races only occur between agents who will, in some sense, act like idiots, if presented with an apparently 'committed' agent.  If somebody demands $6 from me in the Ultimatum game, threatening to leave us both with $0 unless I offer at least $6 to them... then I offer $6 with slightly less than 5/6 probability, so they do no better than if they demanded $5, the amount I think is fair.  They cannot evade that by trying to make some 'commitment' earlier than I do.  I expect that, whatever is the correct and sane version of this reasoning, it generalizes across all the cases. I am not locked into warfare with things that demand $6 instead of $5.  I do not go around figuring out how to invert their utility function for purposes of threatening them back - 'destroy all utility-function inverters (but do not invert their own utility functions)' was my guessed commandment that would be taught to kids in dath ilan, because you don't want reality to end up full of utilityfunction inverters. From the beginning, I invented timeless decision theory because of being skeptical that two perfectly sane and rational hyperintelligent beings with common knowledge about each other would have no choice but mutual defection in the oneshot prisoner's dilemma.  I suspected they would be able to work out Something Else Which Is Not That, so I went looking for it myself.  I suggest cultivating the same suspicion with respect to the imagination of commitment races between Ultimatum Game players, in which whoever manages to make some move logically first walks away with $9 and the other poor agent can only take $1 - especially if you end up reasoning that the computationally weaker agent should be the winner.

The Ultimatum game seems like it has pretty much the same type signature as the prisoner's dilemma: Payoff matrix for different strategies, where the players can roll dice to pick which strategy they use. Does timeless decision theory return the "correct answer" (second player rejects greedy proposals with some probability) when you feed it the Ultimatum game?

3Daniel Kokotajlo9mo
I agree with all this I think. This is why I said commitment races happen between consequentialists (I defined that term more narrowly than you do; the sophisticated reasoning you do here is nonconsequentialist by my definition). I agree that agents worthy of the label "rational" will probably handle these cases gracefully and safely.  However, I'm not yet supremely confident that the AGIs we end up building will handle these cases gracefully and safely. I would love to become more confident & am looking for ways to make it more likely.  If today you go around asking experts for an account of rationality, they'll pull off the shelf CDT or EDT or game-theoretic rationality (nash equilibria, best-respond to opponent) -- something consequentialist in the narrow sense. I think there is a nonzero chance that the relevant AGI will be like this too, either because we explicitly built it that way or because in some young dumb early stage it (like humans) picks up ideas about how to behave from its environment. Or else maybe because narrow-consequentialism works pretty well in single-agent environments and many muti-agent environments too, and maybe by the time the AGI is able to self-modify to something more sophisticated it is already thinking about commitment races and already caught in their destructive logic. (ETA: Insofar as you are saying: "Daniel, worrying about this is silly, any AGI smart enough to kill us all will also be smart enough to not get caught in commitment races" then I say... I hope so! But I want to think it through carefully first; it doesn't seem obvious to me, for the above reasons.)

This is the first of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper.

Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, and Joar Skalse contributed equally to this sequence. With special thanks to Paul Christiano, Eric Drexler, Rob Bensinger, Jan Leike, Rohin Shah, William Saunders, Buck Shlegeris, David Dalrymple, Abram Demski, Stuart Armstrong, Linda Linsefors, Carl Shulman, Toby Ord, Kate Woolverton, and everyone else who provided feedback on earlier versions of this sequence.



The goal of this sequence is to analyze the type of learned optimization that occurs when a...

Sure—I just edited it to be maybe a bit less jarring for those who know Greek.

"Gradient hacking" is a term I've been using recently to describe the phenomenon wherein a deceptively aligned mesa-optimizer might be able to purposefully act in ways which cause gradient descent to update it in a particular way. In Risks from Learned Optimization, we included the following footnote to reflect this possibility:

Furthermore, a deceptively aligned mesa-optimizer would be incentivized to cause there to be a systematic bias in the direction of preventing the base optimizer from modifying its mesa-objective. Thus, in the context of a local optimization process, a deceptive mesa-optimizer might try to “hack” its own gradient (by, for example, making itself more brittle in the case where its objective gets changed) to ensure that the base optimizer adjusts it in such a way that leaves its


I am still pretty unconvinced that there is a corruption mechanism that wouldn’t be removed more quickly by SGD than the mesaobjective would be reverted. Are there more recent write ups that shed more light on this?

Specifically, I can’t tell whether this assumes the corruption mechanism has access to a perfect model of its own weights via observation (eg hacking) or via somehow the weights referring to themselves. This is important because if “the mesaobjective weights” are referred to via observation, then SGD will not compute a gradient wrt them (since t... (read more)

If you're not familiar with the double descent phenomenon, I think you should be. I consider double descent to be one of the most interesting and surprising recent results in analyzing and understanding modern machine learning. Today, Preetum et al. released a new paper, “Deep Double Descent,” which I think is a big further advancement in our understanding of this phenomenon. I'd highly recommend at least reading the summary of the paper on the OpenAI blog. However, I will also try to summarize the paper here, as well as give a history of the literature on double descent and some of my personal thoughts.

Prior work

The double descent phenomenon was first discovered by Mikhail Belkin et al., who were confused by the phenomenon wherein modern ML practitioners would

2J L1y
Apologies if it's obvious, but why the focus on SGD? I'm assuming it's not meant as shorthand for other types of optimization algorithms given the emphasis on SGD's specific inductive bias, and the Deep Double Descent paper mentions that the phenomena hold across most natural choices in optimizers. 

SGD is meant as a shorthand that includes other similar optimizers like Adam.

Note to mods: I'm a bit uncertain whether posts like this one currently belong on the Alignment Forum. Please move it if it doesn't. Or if anyone would prefer not to have such posts on AF, please let me know.

In Strategic implications of AIs’ ability to coordinate at low cost, I talked about the possibility that different AGIs can coordinate with each other much more easily than humans can, by doing something like merging their utility functions together. It now occurs to me that another way for AGIs to greatly reduce coordination costs in an economy is by having each AGI or copies of each AGI profitably take over much larger chunks of the economy (than companies currently own), and this can be done with AGIs that


I was reading parts of Superintelligence recently for something unrelated and noticed that Bostrom makes many of the same points as this post:

If the frontrunner is an AI system, it could have attributes that make it easier for it to expand its capabilities while reducing the rate of diffusion. In human-run organizations, economies of scale are counteracted by bureaucratic inefficiencies and agency problems, including difficulties in keeping trade secrets. These problems would presumably limit the growth of a machine intelligence project so long as it is op

... (read more)

This post is eventually about partial agency. However, it's been a somewhat tricky point for me to convey; I take the long route. Epistemic status: slightly crazy.

I've occasionally said "Everything boils down to credit assignment problems."

What I really mean is that credit assignment pops up in a wide range of scenarios, and improvements to credit assignment algorithms have broad implications. For example:

  • Politics.
    • When politics focuses on (re-)electing candidates based on their track records, it's about credit assignment. The practice is sometimes derogatorily called "finger pointing", but the basic computation makes sense: figure out good and bad qualities via previous performance, and vote accordingly.
    • When politics instead focuses on policy, it is still (to a degree) about credit assignment. Was raising the minimum wage responsible for reduced employment? Was it
  • In between … well … in between, we're navigating treacherous waters …

Right, I basically agree with this picture. I might revise it a little:

  • Early, the AGI is too dumb to hack its epistemics (provided we don't give it easy ways to do so!).
  • In the middle, there's a danger zone.
  • When the AGI is pretty smart, it sees why one should be cautious about such things, and it also sees why any modifications should probably be in pursuit of truthfulness (because true beliefs are a convergent instrumental goal) as opposed to other reasons.
  • When the AGI is really smart, it
... (read more)

There has been considerable debate over whether development in AI will experience a discontinuity, or whether it will follow a more continuous growth curve. Given the lack of consensus and the confusing, diverse terminology, it is natural to hypothesize that much of the debate is due to simple misunderstandings. Here, I seek to dissolve some misconceptions about the continuous perspective, based mostly on how I have seen people misinterpret it in my own experience.

First, we need to know what I even mean by continuous takeoff. When I say it, I mean a scenario where the development of competent, powerful AI follows a trajectory that is roughly in line with what we would have expected by extrapolating from past progress. That is, there is no point at...

Suppose that 1% of the world’s resources are controlled by unaligned AI, and 99% of the world’s resources are controlled by humans. We might hope that at least 99% of the universe’s resources end up being used for stuff-humans-like (in expectation).

Jessica Taylor argued for this conclusion in Strategies for Coalitions in Unit-Sum Games: if the humans divide into 99 groups each of which acquires influence as effectively as the unaligned AI, then by symmetry each group should end, up with as much influence as the AI, i.e. they should end up with 99% of the influence.

This argument rests on what I’ll call the strategy-stealing assumption: for any strategy an unaligned AI could use to influence the long-run future, there is an analogous strategy that a similarly-sized group...

Categorising the ways that the strategy-stealing assumption can fail:

  • It is intrinsically easier to gather flexible influence in pursuit of some goals, because
    • 1. It's easier to build AIs to pursue goals that are easy to check.
    • 3. It's easier to build institutions to pursue goals that are easy to check.
    • 9. It's easier to coordinate around simpler goals.
    • plus 4 and 5 insofar as some values require continuously surviving humans to know what to eventually spend resources on, and some don't.
    • plus 6 insofar as humans are otherwise an important part of the strategic e
... (read more)

It seems likely to me that AIs will be able to coordinate with each other much more easily (i.e., at lower cost and greater scale) than humans currently can, for example by merging into coherent unified agents by combining their utility functions. This has been discussed at least since 2009, but I'm not sure its implications have been widely recognized. In this post I talk about two such implications that occurred to me relatively recently.

I was recently reminded of this quote from Robin Hanson's Prefer Law To Values:

The later era when robots are vastly more capable than people should be much like the case of choosing a nation in which to retire. In this case we don’t expect to have much in the way of skills to


This post is excellent, in that it has a very high importance-to-word-count ratio. It'll take up only a page or so, but convey a very useful and relevant idea, and moreover ask an important question that will hopefully stimulate further thought.

[Epistemic status: Argument by analogy to historical cases. Best case scenario it's just one argument among many. Edit: Also, thanks to feedback from others, especially Paul, I intend to write a significantly improved version of this post in the next two weeks. Edit: I never did, because in the course of writing my response I realized the original argument made a big mistake. See this review.]

I have on several occasions heard people say things like this:

The original Bostrom/Yudkowsky paradigm envisioned a single AI built by a single AI project, undergoing intelligence explosion all by itself and attaining a decisive strategic advantage as a result. However, this is very unrealistic. Discontinuous jumps in technological capability are very rare, and it is very implausible that one project

It's hard to know how to judge a post that deems itself superseded by a post from a later year, but I lean toward taking Daniel at his word and hoping we survive until the 2021 Review comes around.

2Daniel Kokotajlo2y
ReviewI've written up a review here [], which I made into a separate post because it's long. Now that I read the instructions more carefully, I realize that I maybe should have just put it here and waited for mods to promote it if they wanted to. Oops, sorry, happy to undo if you like.
Load More