This is an attempt to distill a model of AGI alignment that I have gained primarily from thinkers such as Eliezer Yudkowsky (and to a lesser extent Paul Christiano), but explained in my own terms rather than attempting to hew close to these thinkers. I think I would be pretty good at passing an ideological Turing test for Eliezer Yudkowsky on AGI alignment difficulty (but not AGI timelines), though what I'm doing in this post is not that; it's more like finding a branch in the possibility space, as I see it, that is close enough to Yudkowsky's model that it's possible to talk in the same language.
Even if the problem turns out to not be very difficult, it's helpful to have a model of why one...
Thanks to John Wentworth, Garrett Baker, Theo Chapman, and David Lorell for feedback and discussions on drafts of this post.
Here’s a TL;DR of my cruxes:
Comments: The following is a list (very lightly edited with help from Rob Bensinger) I wrote in July 2017, at Nick Beckstead’s request, as part of a conversation we were having at the time. From my current vantage point, it strikes me as narrow and obviously generated by one person, listing the first things that came to mind on a particular day.
I worry that it’s easy to read the list below as saying that this narrow slice, all clustered in one portion of the neighborhood, is a very big slice of the space of possible ways an AGI group may have to burn down its lead.
This is one of my models for how people wind up with really weird pictures of MIRI beliefs. I generate three examples
In some discussions (especially about acausal trade and multi-polar conflict), I’ve heard the motto “X will/won’t be a problem because superintelligences will just be Updateless”. Here I’ll explain (in layman’s terms) why, as far as we know, it doesn’t look likely that a fully satisfactory implementation of Updatelessness exists, that superintelligences will automatically implement it, or that doing so would drastically improve multi-agentic bargaining.
Epistemic status: These insights seem like the most robust updates from my work with Demski on Logical Updatelessness and from discussions with CLR employees about Open-Minded Updatelessness. To my understanding, most researchers involved agree with them and with the message of this post.
This is skippable if you’re already familiar with the concept.
It’s easier to illustrate with the following example: Counterfactual Mugging.
I will throw a fair coin.
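For concreteness, here is a minimal expected-value sketch using the payoffs from the usual statement of Counterfactual Mugging (the full setup is only partially quoted above, so the exact numbers here are an assumption): the predictor flips a fair coin; on one outcome it asks you for $100, and on the other it pays you $10,000 if and only if it predicts you would have paid when asked. Evaluated before the flip, the policy of paying beats refusing:

$$
\mathbb{E}[\text{commit to pay}] = \tfrac{1}{2}(-\$100) + \tfrac{1}{2}(+\$10{,}000) = +\$4{,}950
\qquad
\mathbb{E}[\text{refuse}] = \tfrac{1}{2}(\$0) + \tfrac{1}{2}(\$0) = \$0
$$

An updateless agent chooses its policy from this prior perspective and so pays; an agent that updates on finding itself in the asking branch refuses, since from there paying is a sure $100 loss. That tension is what Updatelessness is meant to resolve.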
TL;DR: Scaling labs have their own alignment problem, analogous to that of AI systems, and there are some similarities between the labs and misaligned/unsafe AI.
Major AI scaling labs (OpenAI/Microsoft, Anthropic, Google/DeepMind, and Meta) are very influential in the AI safety and alignment community. They put out cutting-edge research because of their talent, money, and institutional knowledge. A significant subset of the community works for one of these labs. This level of influence is beneficial in some respects. In many ways, these labs have strong safety cultures, and these values are present in their high-level approaches to developing AI – it’s easy to imagine a world in which things are much worse. But the amount of influence that these labs have is also something to be cautious about.
The alignment community...