Nominated Posts for the 2019 Review

Posts need at least 2 nominations to continue into the Review Phase.
Nominate posts that you have personally found useful and important.
Sort by: fewest nominations

2019 Review Discussion

[Epistemic status: Strong claims vaguely stated and weakly held. I expect that writing this and digesting feedback on it will lead to a much better version in the future. EDIT: So far this has stood the test of time. EDIT: As of September 2020 I think this is one of the most important things to be thinking about.]

This post attempts to generalize and articulate a problem that people have been thinking about since at least 2016. [Edit: 2009 in fact!] In short, here is the problem:

Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible. When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in (building AGI and letting it self-modify) may be...

19Eliezer Yudkowsky2mo
IMO, commitment races only occur between agents who will, in some sense, act like idiots, if presented with an apparently 'committed' agent. If somebody demands $6 from me in the Ultimatum game, threatening to leave us both with $0 unless I offer at least $6 to them... then I offer $6 with slightly less than 5/6 probability, so they do no better than if they demanded $5, the amount I think is fair. They cannot evade that by trying to make some 'commitment' earlier than I do. I expect that, whatever is the correct and sane version of this reasoning, it generalizes across all the cases. I am not locked into warfare with things that demand $6 instead of $5. I do not go around figuring out how to invert their utility function for purposes of threatening them back - 'destroy all utility-function inverters (but do not invert their own utility functions)' was my guessed commandment that would be taught to kids in dath ilan, because you don't want reality to end up full of utilityfunction inverters. From the beginning, I invented timeless decision theory because of being skeptical that two perfectly sane and rational hyperintelligent beings with common knowledge about each other would have no choice but mutual defection in the oneshot prisoner's dilemma. I suspected they would be able to work out Something Else Which Is Not That, so I went looking for it myself. I suggest cultivating the same suspicion with respect to the imagination of commitment races between Ultimatum Game players, in which whoever manages to make some move logically first walks away with $9 and the other poor agent can only take $1 - especially if you end up reasoning that the computationally weaker agent should be the winner.

The Ultimatum game seems like it has pretty much the same type signature as the prisoner's dilemma: Payoff matrix for different strategies, where the players can roll dice to pick which strategy they use. Does timeless decision theory return the "correct answer" (second player rejects greedy proposals with some probability) when you feed it the Ultimatum game?

3Daniel Kokotajlo2mo
I agree with all this I think. This is why I said commitment races happen between consequentialists (I defined that term more narrowly than you do; the sophisticated reasoning you do here is nonconsequentialist by my definition). I agree that agents worthy of the label "rational" will probably handle these cases gracefully and safely. However, I'm not yet supremely confident that the AGIs we end up building will handle these cases gracefully and safely. I would love to become more confident & am looking for ways to make it more likely. If today you go around asking experts for an account of rationality, they'll pull off the shelf CDT or EDT or game-theoretic rationality (nash equilibria, best-respond to opponent) -- something consequentialist in the narrow sense. I think there is a nonzero chance that the relevant AGI will be like this too, either because we explicitly built it that way or because in some young dumb early stage it (like humans) picks up ideas about how to behave from its environment. Or else maybe because narrow-consequentialism works pretty well in single-agent environments and many muti-agent environments too, and maybe by the time the AGI is able to self-modify to something more sophisticated it is already thinking about commitment races and already caught in their destructive logic. (ETA: Insofar as you are saying: "Daniel, worrying about this is silly, any AGI smart enough to kill us all will also be smart enough to not get caught in commitment races" then I say... I hope so! But I want to think it through carefully first; it doesn't seem obvious to me, for the above reasons.)

This is the first of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper.

Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, and Joar Skalse contributed equally to this sequence. With special thanks to Paul Christiano, Eric Drexler, Rob Bensinger, Jan Leike, Rohin Shah, William Saunders, Buck Shlegeris, David Dalrymple, Abram Demski, Stuart Armstrong, Linda Linsefors, Carl Shulman, Toby Ord, Kate Woolverton, and everyone else who provided feedback on earlier versions of this sequence.



The goal of this sequence is to analyze the type of learned optimization that occurs when a...

Sure—I just edited it to be maybe a bit less jarring for those who know Greek.

"Gradient hacking" is a term I've been using recently to describe the phenomenon wherein a deceptively aligned mesa-optimizer might be able to purposefully act in ways which cause gradient descent to update it in a particular way. In Risks from Learned Optimization, we included the following footnote to reflect this possibility:

Furthermore, a deceptively aligned mesa-optimizer would be incentivized to cause there to be a systematic bias in the direction of preventing the base optimizer from modifying its mesa-objective. Thus, in the context of a local optimization process, a deceptive mesa-optimizer might try to “hack” its own gradient (by, for example, making itself more brittle in the case where its objective gets changed) to ensure that the base optimizer adjusts it in such a way that leaves its


I am still pretty unconvinced that there is a corruption mechanism that wouldn’t be removed more quickly by SGD than the mesaobjective would be reverted. Are there more recent write ups that shed more light on this?

Specifically, I can’t tell whether this assumes the corruption mechanism has access to a perfect model of its own weights via observation (eg hacking) or via somehow the weights referring to themselves. This is important because if “the mesaobjective weights” are referred to via observation, then SGD will not compute a gradient wrt them (since t... (read more)

If you're not familiar with the double descent phenomenon, I think you should be. I consider double descent to be one of the most interesting and surprising recent results in analyzing and understanding modern machine learning. Today, Preetum et al. released a new paper, “Deep Double Descent,” which I think is a big further advancement in our understanding of this phenomenon. I'd highly recommend at least reading the summary of the paper on the OpenAI blog. However, I will also try to summarize the paper here, as well as give a history of the literature on double descent and some of my personal thoughts.

Prior work

The double descent phenomenon was first discovered by Mikhail Belkin et al., who were confused by the phenomenon wherein modern ML practitioners would

2J L3mo
Apologies if it's obvious, but why the focus on SGD? I'm assuming it's not meant as shorthand for other types of optimization algorithms given the emphasis on SGD's specific inductive bias, and the Deep Double Descent paper mentions that the phenomena hold across most natural choices in optimizers.

SGD is meant as a shorthand that includes other similar optimizers like Adam.

Note to mods: I'm a bit uncertain whether posts like this one currently belong on the Alignment Forum. Please move it if it doesn't. Or if anyone would prefer not to have such posts on AF, please let me know.

In Strategic implications of AIs’ ability to coordinate at low cost, I talked about the possibility that different AGIs can coordinate with each other much more easily than humans can, by doing something like merging their utility functions together. It now occurs to me that another way for AGIs to greatly reduce coordination costs in an economy is by having each AGI or copies of each AGI profitably take over much larger chunks of the economy (than companies currently own), and this can be done with AGIs that


I was reading parts of Superintelligence recently for something unrelated and noticed that Bostrom makes many of the same points as this post:

If the frontrunner is an AI system, it could have attributes that make it easier for it to expand its capabilities while reducing the rate of diffusion. In human-run organizations, economies of scale are counteracted by bureaucratic inefficiencies and agency problems, including difficulties in keeping trade secrets. These problems would presumably limit the growth of a machine intelligence project so long as it is op

... (read more)
This is a linkpost for

In 2008, Steve Omohundro's foundational paper The Basic AI Drives conjectured that superintelligent goal-directed AIs might be incentivized to gain significant amounts of power in order to better achieve their goals. Omohundro's conjecture bears out in toy models, and the supporting philosophical arguments are intuitive. In 2019, the conjecture was even debated by well-known AI researchers.

Power-seeking behavior has been heuristically understood as an anticipated risk, but not as a formal phenomenon with a well-understood cause. The goal of this post (and the accompanying paper, Optimal Policies Tend to Seek Power) is to change that.


It’s 2008, the ancient wild west of AI alignment. A few people have started thinking about questions like “if we gave an AI a utility function over world states, and it actually maximized that...

I proposed changing "instrumental convergence" to "robust instrumentality." This proposal has not caught on, and so I reverted the post's terminology. I'll just keep using 'convergently instrumental.' I do think that 'convergently instrumental' makes more sense than 'instrumentally convergent', since the agent isn't "convergent for instrumental reasons", but rather, it's more reasonable to say that the instrumentality is convergent in some sense.

For the record, the post used to contain the following section:

A note on terminology

The robustness-of-strategy p... (read more)

This post is eventually about partial agency. However, it's been a somewhat tricky point for me to convey; I take the long route. Epistemic status: slightly crazy.

I've occasionally said "Everything boils down to credit assignment problems."

What I really mean is that credit assignment pops up in a wide range of scenarios, and improvements to credit assignment algorithms have broad implications. For example:

  • Politics.
    • When politics focuses on (re-)electing candidates based on their track records, it's about credit assignment. The practice is sometimes derogatorily called "finger pointing", but the basic computation makes sense: figure out good and bad qualities via previous performance, and vote accordingly.
    • When politics instead focuses on policy, it is still (to a degree) about credit assignment. Was raising the minimum wage responsible for reduced employment? Was it
  • In between … well … in between, we're navigating treacherous waters …

Right, I basically agree with this picture. I might revise it a little:

  • Early, the AGI is too dumb to hack its epistemics (provided we don't give it easy ways to do so!).
  • In the middle, there's a danger zone.
  • When the AGI is pretty smart, it sees why one should be cautious about such things, and it also sees why any modifications should probably be in pursuit of truthfulness (because true beliefs are a convergent instrumental goal) as opposed to other reasons.
  • When the AGI is really smart, it
... (read more)

An actual debate about instrumental convergence, in a public space! Major respect to all involved, especially Yoshua Bengio for great facilitation.

For posterity (i.e. having a good historical archive) and further discussion, I've reproduced the conversation here. I'm happy to make edits at the request of anyone in the discussion who is quoted below. I've improved formatting for clarity and fixed some typos. For people who are not researchers in this area who wish to comment, see the public version of this post here. For people who do work on the relevant areas, please sign up in the top right. It will take a day or so to confirm membership.

Original Post

Yann LeCun: "don't fear the Terminator", a short opinion piece by Tony Zador and me that was just...

1Richard Ngo1y
What's your specific critique of this? I think it's an interesting and insightful point.

LeCun claims too much. It's true that the case of animals like orangutans points to a class of cognitive architectures which seemingly don't prioritize power-seeking. It's true that this is some evidence against power-seeking behavior being common amongst relevant cognitive architectures. However, it doesn't show that instrumental subgoals are much weaker drives of behavior than hardwired objectives.

One reading of this "drives of behavior" claim is that it has to be tautological; by definition, instrumental subgoals are always in service of the (hardwired)... (read more)

There has been considerable debate over whether development in AI will experience a discontinuity, or whether it will follow a more continuous growth curve. Given the lack of consensus and the confusing, diverse terminology, it is natural to hypothesize that much of the debate is due to simple misunderstandings. Here, I seek to dissolve some misconceptions about the continuous perspective, based mostly on how I have seen people misinterpret it in my own experience.

First, we need to know what I even mean by continuous takeoff. When I say it, I mean a scenario where the development of competent, powerful AI follows a trajectory that is roughly in line with what we would have expected by extrapolating from past progress. That is, there is no point at...

Suppose that 1% of the world’s resources are controlled by unaligned AI, and 99% of the world’s resources are controlled by humans. We might hope that at least 99% of the universe’s resources end up being used for stuff-humans-like (in expectation).

Jessica Taylor argued for this conclusion in Strategies for Coalitions in Unit-Sum Games: if the humans divide into 99 groups each of which acquires influence as effectively as the unaligned AI, then by symmetry each group should end, up with as much influence as the AI, i.e. they should end up with 99% of the influence.

This argument rests on what I’ll call the strategy-stealing assumption: for any strategy an unaligned AI could use to influence the long-run future, there is an analogous strategy that a similarly-sized group...

Categorising the ways that the strategy-stealing assumption can fail:

  • It is intrinsically easier to gather flexible influence in pursuit of some goals, because
    • 1. It's easier to build AIs to pursue goals that are easy to check.
    • 3. It's easier to build institutions to pursue goals that are easy to check.
    • 9. It's easier to coordinate around simpler goals.
    • plus 4 and 5 insofar as some values require continuously surviving humans to know what to eventually spend resources on, and some don't.
    • plus 6 insofar as humans are otherwise an important part of the strategic e
... (read more)

It seems likely to me that AIs will be able to coordinate with each other much more easily (i.e., at lower cost and greater scale) than humans currently can, for example by merging into coherent unified agents by combining their utility functions. This has been discussed at least since 2009, but I'm not sure its implications have been widely recognized. In this post I talk about two such implications that occurred to me relatively recently.

I was recently reminded of this quote from Robin Hanson's Prefer Law To Values:

The later era when robots are vastly more capable than people should be much like the case of choosing a nation in which to retire. In this case we don’t expect to have much in the way of skills to


This post is excellent, in that it has a very high importance-to-word-count ratio. It'll take up only a page or so, but convey a very useful and relevant idea, and moreover ask an important question that will hopefully stimulate further thought.

[Epistemic status: Argument by analogy to historical cases. Best case scenario it's just one argument among many. Edit: Also, thanks to feedback from others, especially Paul, I intend to write a significantly improved version of this post in the next two weeks. Edit: I never did, because in the course of writing my response I realized the original argument made a big mistake. See this review.]

I have on several occasions heard people say things like this:

The original Bostrom/Yudkowsky paradigm envisioned a single AI built by a single AI project, undergoing intelligence explosion all by itself and attaining a decisive strategic advantage as a result. However, this is very unrealistic. Discontinuous jumps in technological capability are very rare, and it is very implausible that one project

It's hard to know how to judge a post that deems itself superseded by a post from a later year, but I lean toward taking Daniel at his word and hoping we survive until the 2021 Review comes around.

2Daniel Kokotajlo1y
ReviewI've written up a review here [] , which I made into a separate post because it's long. Now that I read the instructions more carefully, I realize that I maybe should have just put it here and waited for mods to promote it if they wanted to. Oops, sorry, happy to undo if you like.

Note: I am not Chris Olah. This post was the result of lots of back-and-forth with Chris, but everything here is my interpretation of what Chris believes, not necessarily what he actually believes. Chris also wanted me to emphasize that his thinking is informed by all of his colleagues on the OpenAI Clarity team and at other organizations.

In thinking about AGI safety—and really any complex topic on which many smart people disagree—I’ve often found it very useful to build a collection of different viewpoints from people that I respect that I feel like I understand well enough to be able to think from their perspective. For example, I will often try to compare what an idea feels like when I put on my Paul Christiano hat to...

The content here is very valuable, even if the genre of "I talked a lot with X and here's my articulation of X's model" comes across to me as a weird intellectual ghostwriting. I can't think of a way around that, though.

Load More