Recommended Sequences

AGI safety from first principles
Embedded Agency
2022 MIRI Alignment Discussion

Recent Discussion

This is an attempt to distill a model of AGI alignment that I have gained primarily from thinkers such as Eliezer Yudkowsky (and to a lesser extent Paul Christiano), but explained in my own terms rather than attempting to hew close to these thinkers. I think I would be pretty good at passing an ideological Turing test for Eliezer Yudowsky on AGI alignment difficulty (but not AGI timelines), though what I'm doing in this post is not that, it's more like finding a branch in the possibility space as I see it that is close enough to Yudowsky's model that it's possible to talk in the same language.

Even if the problem turns out to not be very difficult, it's helpful to have a model of why one...

Just want to say that I found this immensely clarifying and valuable since I read it months ago.

Thanks to John Wentworth, Garrett Baker, Theo Chapman, and David Lorell for feedback and discussions on drafts of this post.

In this post I’ll describe some of my thoughts on the AI control research agenda. If you haven’t read that post, I’m not going to try and summarize it here[1], so I recommend reading it first.

Here’s a TL;DR of my cruxes:

  • Evaluating whether you’re in a regime where control is doable requires strong capability evaluations. I expect that getting our capability evals to be comprehensive enough to be confident of our findings here is really hard.
  • Control evaluations are less likely to work if our AIs become wildly superhuman in problematic domains (such as hacking, persuasion, etc) before transformative AI[2]. I think the assumption that this wouldn’t happen is a very
Link should presumably be to this comment.

Hmm I think somehow the problem is that the equals sign in your url is being encoded as an ASCII value with a % sign etc rather than being treated as a raw equals sign, weird.

Comments: The following is a list (very lightly edited with help from Rob Bensinger) I wrote in July 2017, at Nick Beckstead’s request, as part of a conversation we were having at the time. From my current vantage point, it strikes me as narrow and obviously generated by one person, listing the first things that came to mind on a particular day.

I worry that it’s easy to read the list below as saying that this narrow slice, all clustered in one portion of the neighborhood, is a very big slice of the space of possible ways an AGI group may have to burn down its lead.

This is one of my models for how people wind up with really weird pictures of MIRI beliefs. I generate three examples


Rereading this now -- do you still endorse this?

In some discussions (especially about acausal trade and multi-polar conflict), I’ve heard the motto “X will/won’t be a problem because superintelligences will just be Updateless”. Here I’ll explain (in layman’s terms) why, as far as we know, it’s not looking likely that a super satisfactory implementation of Updatelessness exists, nor that superintelligences automatically implement it, nor that this would drastically improve multi-agentic bargaining.

Epistemic status: These insights seem like the most robust update from my work with Demski on Logical Updatelessness and discussions with CLR employees about Open-Minded Updatelessness. To my understanding, most researchers involved agree with them and the message of this post.

What is Updatelessness?

This is skippable if you’re already familiar with the concept.

It’s easier to illustrate with the following example: Counterfactual Mugging.

I will throw a fair coin.

  • If it lands Heads, you

(Excuse my ignorance. These are real questions, not just gotchas. I did see that you linked to the magic parts post.)

Will "commitment" and "agent" have to be thrown out and remade from sensible blocks? Perhaps cellular automata? ie did you create a dilemma out of nothing when you chose your terms?

Like we said a "toaster" is "literally anything that somehow produces toast" then our analysis of breakfast quickly broke down.

From my distant position it seems the real work to be done is at that lower level. We have not even solved 3x+1!!! How will we possibly draw up a sound notion of agents and commitments without some practical knowhow about slicing up the environment?

8Oliver Habryka6d
Promoted to curated: I think it's pretty likely a huge fraction of the value of the future will be determined by the question this post is trying to answer, which is how much game theory produces natural solutions to coordination problems, or more generally how much better we should expect systems to get at coordination as they get smarter. I don't think I agree with everything in the post, and a few of the characterizations of updatelessness seem a bit off to me (which Eliezer points to a bit in his comment), but I still overall found reading this post quite interesting and valuable for helping me think about for which of the problems of coordination we have a more mechanistic understanding of how being smarter and better at game theory might help, and which ones we don't have good mechanisms for, which IMO is a quite important question.

TL;DR: Scaling labs have their own alignment problem analogous to AI systems, and there are some similarities between the labs and misaligned/unsafe AI. 


Major AI scaling labs (OpenAI/Microsoft, Anthropic, Google/DeepMind, and Meta) are very influential in the AI safety and alignment community. They put out cutting-edge research because of their talent, money, and institutional knowledge. A significant subset of the community works for one of these labs. This level of influence is beneficial in some aspects. In many ways, these labs have strong safety cultures, and these values are present in their high-level approaches to developing AI – it’s easy to imagine a world in which things are much worse. But the amount of influence that these labs have is also something to be cautious about. 

The alignment community...

Would you rather have an AICorp CEO dictator or have democracy as-it-exists handle things?

12Rohin Shah4d
I feel like a lot of these arguments could be pretty easily made of individual AI safety researchers. E.g. I feel pretty similarly about most of the other arguments in this post. Tbc I think there are plenty of things one could reasonably critique scaling labs about, I just think the argumentation in this post is by and large off the mark, and implies a standard that if actually taken literally would be a similarly damning critique of the alignment community. (Conflict of interest notice: I work at Google DeepMind.)
5Stephen Casper3d
Thanks. I agree that the points apply to individual researchers. But I don't think that it applies in a comparably worrisome way because individual researchers do not have comparable intelligence, money, and power compared to the labs. This is me stressing the "when put under great optimization pressure" of Goodhart's Law. Subtle misalignments are much less dangerous when there is a weak optimization force behind the proxy than when there is a strong one. 
4Stephen Casper5d
See also this much older and closely related post by Thomas Woodside: Is EA an advanced, planning, strategically-aware power-seeking misaligned mesa-optimizer?
Load More