Many of Paul Christiano's writings were valuable corrections to the dominant Yudkowskian paradigm of AI safety. However, I think that many of them (especially papers like "Concrete Problems in AI Safety" and posts like these two) also ended up providing a lot of intellectual cover for people to do "AI safety" work (especially within AGI companies) that isn't even trying to be scalable to much more powerful systems.
I want to register a prediction that "gradual disempowerment" will end up being (mis)used in a similar way. I don't really know what to do about this, but I intend to avoid using the term myself. I cluster my own research on related topics under headings like "understanding intelligence", "understanding political philosophy", and "understanding power". To me this kind of understanding-oriented approach seems more productive than trying to create a movement based around a class of threat models.
An analogy that points at one way I think the instrumental/terminal goal distinction is confused:
Imagine trying to classify genes as either instrumentally or terminally valuable from the perspective of evolution. Instrumental genes encode traits that help an organism reproduce. Terminal genes, by contrast, are the "payload" being passed down the generations for its own sake.
This model might seem silly, but it actually makes a bunch of useful predictions. Pick some set of genes which are so crucial for survival that they're seldom if ever modified (e.g. the genes for chlorophyll in plants, or genes for ATP production in animals). Treating those genes as "terminal" lets you "predict" that other genes will gradually evolve in whichever ways help most to pass those terminal genes on, which is what we in fact see.
But of course there's no such thing as a "terminal gene". What's actually going on is that some genes evolved first, meaning that a bunch of downstream genes ended up selected for compatibility with them. In principle evolution would be fine with those "terminal" genes being replaced; it's just computationally difficult to find a way to do so without breaking downstream dependencies.
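(As a rough illustration of that last point, here's a minimal toy sketch, with entirely made-up structure and numbers: genes arise one at a time, and each later gene is selected for compatibility with a few earlier ones. Nothing in it is labeled "terminal"; early genes just end up with the most downstream dependents, which is exactly what makes them look terminal.)

```python
# Toy sketch (made-up structure and numbers): genes arise one at a time, and each new gene
# is selected for compatibility with a few genes that already existed. No gene carries an
# intrinsic "terminal" label; early genes simply accumulate the most downstream dependents,
# which is what makes them costly to replace and hence makes them *look* terminal.
import random

random.seed(0)
NUM_GENES = 30

# gene index == order of arrival; each gene depends on up to 3 randomly chosen earlier genes
dependencies = {}
for g in range(NUM_GENES):
    earlier = list(range(g))
    dependencies[g] = set(random.sample(earlier, min(3, len(earlier))))

def downstream_dependents(gene):
    """All genes that (directly or transitively) were selected for compatibility with `gene`."""
    dependents, frontier = set(), {gene}
    while frontier:
        frontier = {g for g, deps in dependencies.items() if deps & frontier} - dependents
        dependents |= frontier
    return dependents

# "Apparent terminalness" = replacement cost = number of downstream dependents
cost = {g: len(downstream_dependents(g)) for g in range(NUM_GENES)}
for g in sorted(cost, key=cost.get, reverse=True)[:5]:
    print(f"gene {g:2d} (arrival position {g}): {cost[g]} downstream dependents")
```

Running it, the genes at the top of the ranking tend to be the earliest arrivals, not genes with any special intrinsic property.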
I think this is a good analogy for how human values work. We start off with some early values, and then develop instrumental strategies for achieving them. Those instrumental strategies become crystallized and then give rise to further instrumental strategies for achieving them, and so on. Understood this way, we can describe an organism's goals/strategies purely in terms of which goals "have power over" which other goals, which goals are most easily replaced, etc., without needing to appeal to some kind of essential "terminalism" that some goals have and others don't. (Indeed, the main reason you'd need that concept is to describe someone who has modified their goals towards having a sharper instrumental/terminal distinction; i.e. it's a self-fulfilling prophecy.)
That's the descriptive view. But "from the inside" we still want to know which goals we should pursue, and how to resolve disagreements between our goals. How do we figure that out without labeling some goals as terminal and others as instrumental? I don't yet have a formal answer, but my current informal answer is that there's a lot of room for positive-sum trade between goals, and so you should set up a system which maximizes the ability of those goals to cooperate with each other, especially by developing new "compromise" goals that capture the most important parts of each.
This leads to a pretty different view of the world from the Bostromian one. It often feels like the Bostrom paradigm implicitly divides the future into two phases. There's the instrumental phase, during which your decisions are dominated by trying to improve your long-term ability to achieve your goals. And there's the terminal phase, during which you "cash out" your resources into whatever you value. This isn't a *necessary* implication of the instrumental/terminal distinction, but I expect it's an emergent consequence of taking that distinction seriously in a wide range of environments. E.g. in our universe it sure seems like any scale-sensitive value system should optimize purely for the number of galaxies owned for a long time before trying to turn those galaxies into paperclips/hedonium/etc.
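(A toy version of that last claim, with arbitrary numbers rather than anything calibrated to real astronomy or real values: if resources compound over time and terminal value is roughly linear in whatever you eventually convert, then converting early forfeits all the later compounding, and the optimal policy takes on exactly that two-phase shape.)

```python
# Toy illustration (arbitrary numbers): resources compound each period, and value is linear
# in the resources you eventually convert into paperclips/hedonium/etc. Converting earlier
# just forfeits the remaining compounding, so "accumulate for a long time, then cash out"
# falls out as the optimal shape of the policy.
GROWTH = 1.10   # made-up per-period growth in resources
HORIZON = 100   # made-up deadline by which everything must be converted

def value_if_converted_at(t: int) -> float:
    resources = 1.0
    for _ in range(min(t, HORIZON)):
        resources *= GROWTH
    return resources  # value is linear in converted resources

for t in (0, 25, 50, 100):
    print(f"convert at period {t:3d}: value ~ {value_if_converted_at(t):,.1f}")
```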
From the alternative perspective I've outlined above, though, the process of instrumentally growing and gaining resources is also simultaneously the process of constructing values. In other words, we start off with underspecified values as children, but then over time choose to develop them in ways which are instrumentally useful. This process leads to the emergence of new, rich, nuanced goals which satisfy our original goals while also going far beyond them, just as the development of complex multicellular organisms helps to propagate the original bacterial genes for chlorophyll or ATP—not by "maximizing" for those "terminal" genes, but by building larger creatures much more strange and wonderful.
Nice work. I would also be excited about someone running with a similar project but for de-censoring Western models (e.g. on some of the topics discussed in this curriculum).
I have referred back to this post a lot since writing it. I still think it's underrated, because without understanding what we mean by "alignment research" it's easy to get all sorts of confused about what the field is trying to do.
> Why did the task fall to a bunch of kids/students?
I'm not surprised by this; my sense is that it's usually young people and outsiders who pioneer new fields. Older people are just so much more shaped by existing paradigms, and also have so much more to lose, that this outweighs the benefits of their expertise and resources.
Also 1993 to 2000 doesn't seem like that large a gap to me. Though I guess the thing I'm pointing at could also be summarized as "why hasn't someone created a new paradigm of AI safety in the last decade?" And one answer is that Paul and Chris and a few others created a half-paradigm of "ML safety", but it hasn't yet managed to show impressive enough results to fully take over. However, it did win on a memetic level amongst EAs in particular.
The task at hand might then be understood as synthesizing the original "AI safety" with "ML safety". Or, to put it a bit more poetically, it's synthesizing the rationalist approach to aligning AGI with the empiricist approach to aligning AGI.
> To answer your question, it's pretty hard to think of really good examples, I think because humans are very bad at both philosophical competence and consequentialist reasoning, but here are some:
If this is true, then it should significantly update us away from the strategy "solve our current problems by becoming more philosophically competent and doing good consequentialist reasoning", right? If you are very bad at X, then all else equal you should try to solve problems using strategies that don't require you to do much X.
You might respond that there are no viable strategies for solving our current problems without applying a lot of philosophical competence and consequentialist reasoning. I think scientific competence and virtue ethics are plausibly viable alternative strategies (though the line between scientific and philosophical competence seems blurry to me, as I discuss below). But even granting that we disagree on that, humanity solved many big problems in the past without using much philosophical competence or consequentialist reasoning, so it seems hard to be confident that we won't solve our current problems in other ways.
Out of your examples, the influence of economics seems most solid to me. I feel confused about whether game theory itself made nuclear war more or less likely—e.g. von Neumann was very aggressive, perhaps related to his game theory work, and maybe MAD provided an excuse to stockpile weapons? Also the Soviets didn't really have the game theory IIRC.
On the analytical philosophy front, the clearest wins seem to be cases where they transitioned from doing philosophy to doing science or math—e.g. the formalization of probability (and economics to some extent too). If this is the kind of thing you're pointing at, then I'm very much on board—that's what I think we should be doing for ethics and intelligence. Is it?
Re the AI safety stuff: it all feels a bit too early to say what its effects on the world have been (though on net I'm probably happy it has happened).
> I guess this isn't an "in-depth account" but I'm also not sure why you're asking for "in-depth", i.e., why doesn't a list like this suffice?
Because I have various objections to this list (some of which are detailed above), and with such a succinct list it's hard to know which aspects of these examples you're defending, which arguments for their positive effects you find most compelling, etc.
> I agree with this statement denotatively, and my own interests/work have generally been "driven by open-ended curiosity and a drive to uncover deep truths", but isn't this kind of motivation also what got humanity into its current mess? In other words, wasn't the main driver of AI progress this kind of curiosity (until perhaps the recent few years when it has been driven more by commercial/monetary/power incentives)?
Interestingly, I was just having a conversation with Critch about this. My contention was that, in the first few decades of the field, AI researchers were actually trying to understand cognition. The rise of deep learning (and especially the kind of deep learning driven by massive scaling) can be seen as the field putting that quest on hold in order to optimize for more legible metrics.
I don't think you should find this a fully satisfactory answer, because it's easy to "retrodict" ways that my theory was correct. But that's true of all explanations of what makes the world good at a very abstract level, including your own answer of metaphilosophical competence. (Also, we can perhaps cash my claim out in predictions, like: was the criticism that deep learning didn't actually provide good explanations of or insight into cognition a significant barrier to more researchers working on it? Without having looked it up, I suspect so.)
> consistently good strategy requires a high amount of consequentialist reasoning
I don't think that's true. However, I do think it requires deep curiosity about what good strategy is and how it works. It's not a coincidence that my own research on a theory of coalitional agency was in significant part inspired by strategic failures of EA and AI safety (with this post being one of the earliest building blocks I laid down). I also suspect that the full theory of coalitional agency will in fact explain how to do metaphilosophy correctly, because doing good metaphilosophy is ultimately a cognitive process and can therefore be characterized by a sufficiently good theory of cognition.
Again, I don't expect you to fully believe me. But what I most want to read from you right now is an in-depth account of which things in the world have gone or are going most right, and the ways in which you think metaphilosophical competence or consequentialist reasoning contributed to them. Without that, it's hard to trust metaphilosophy or even know what it is (though I think you've given a sketch of this in a previous reply to me at some point).
I should also try to write up the same thing, but about how virtues contributed to good things. And maybe also science, insofar as I'm trying to defend doing more science (of cognition and intelligence) in order to help fix risks caused by previous scientific progress.
In trying to reply to this comment I identified four "waves" of AI safety, and lists of the central people in each wave. Since this is socially complicated I'll only share the full list of the first wave here, and please note that this is all based on fuzzy intuitions gained via gossip and other unreliable sources.
The first wave I’ll call the “founders”; I think of them as the people who set up the early institutions and memeplexes of AI safety before around 2015. My list:
The second wave I’ll call the “old guard”; those were the people who joined or supported the founders before around 2015. A few central examples include Paul Christiano, Chris Olah, Andrew Critch and Oliver Habryka.
Around 2014/2015 AI safety became significantly more professionalized and growth-oriented. Bostrom published Superintelligence, the Puerto Rico conference happened, OpenAI was founded, DeepMind started a safety team (though I don't recall exactly when), and EA started seriously pushing people towards AI safety. I’ll call the people who entered the field from then until around 2020 "safety scalers" (though I'm open to better names). A few central examples include Miles Brundage, Beth Barnes, John Wentworth, Rohin Shah, Dan Hendrycks and myself.
And then there’s the “newcomers” who joined in the last 5-ish years. I have a worse mental map of these people, but some who I respect are Leo Gao, Sahil, Marius Hobbhahn and Jesse Hoogland.
In this comment I expressed concern that my generation (by which I mean the "safety scalers") have kinda given up on solving alignment. But another, higher-level concern is: are people from these last two waves the kinds of people who would have been capable of founding AI safety in the first place? And if not, where are those people now? Of course there's some difference between the skills required for founding a field and those required for pushing it forward, but to a surprising extent I keep finding that the people I have the most insightful conversations with are the ones who were around from the very beginning. E.g. I think Vassar is the single person doing the best thinking about the lessons we can learn from the failures of AI safety over the last decade (though he's hard to interface with); Yudkowsky is still the single person most able to push the Overton window towards taking alignment seriously (even though in principle many other people could have written (less doomy versions of) his Time op-ed or his recent book); Scott is still the single best blogger in the space; and so on.
Relatedly, when I talk to someone who's exceptionally thoughtful about politics (and particularly the psychological aspects of politics), a disturbingly large proportion of the time it turns out that they worked at (or were somehow associated with) Leverage. This is really weird to me. Maybe I just have Leverage-aligned tastes/networks, but even so, it's a very striking effect. (Also, how come there's no young Moldbug?)
Assuming that I'm gesturing at something real, what are some possible explanations?
This is all only a rough gesture at the phenomenon, and you should be wary that I'm just being pessimistic rather than identifying something important. Also it's a hard topic to talk about clearly because it's loaded with a bunch of social baggage. But I do feel pretty confused and want to figure this stuff out.
Thanks for writing this up. While I don't have much context on what specifically has gone well or badly for your team, I do feel pretty skeptical about the types of arguments you give at several points: in particular focusing on theories of change, having the most impact, comparative advantage, work paying off in 10 years, etc. I expect that this kind of reasoning itself steers people away from making important scientific contributions, which are often driven by open-ended curiosity and a drive to uncover deep truths.
(A provocative version of this claim: for the most important breakthroughs, it's nearly impossible to identify a theory of change for them in advance. Imagine Newton or Darwin trying to predict how understanding mechanics/evolution would change the world. Now imagine them trying to do that before they had even invented the theory! And finally imagine if they only considered plans that they thought would work within 10 years, and the sense of scarcity and tension that would give rise to.)
The rest of my comment isn't directly about this post, but it's close enough that this seems like a reasonable place to put it. EDIT: to be clearer: the rest of this comment is not primarily about Neel or "pragmatic interpretability"; it's about parts of the field that I consider to be significantly less relevant to "solving alignment" than that (though work that's nominally on pragmatic interpretability could also fall into the same failure modes). I clarify my position further in this comment; thanks Rohin for the pushback.
I get the sense that there was a "generation" of AI safety researchers who have ended up with a very marginalist mindset about AI safety. Some examples:
In other words, whole swathes of the field are not even aspiring to be the type of thing that could solve misalignment. In the terminology of this excellent post, they are all trying to attack a category I problem, not a category II problem. Sometimes it feels like almost the entire field (EDIT: most of the field) is Goodharting on the subgoal of "write a really persuasive memo to send to politicians". Pragmatic interpretability feels like another step in that direction (EDIT: but still significantly more principled than the things I listed above).
This is all related to something Buck recently wrote: "I spend most of my time thinking about relatively cheap interventions that AI companies could implement to reduce risk assuming a low budget, and about how to cause AI companies to marginally increase that budget". I'm sure Buck has thought a lot about his strategy here, and I'm sure that you've thought a lot about your strategy as laid out in this post, and so on. But a part of me is sitting here thinking: man, everyone sure seems to have given up. (And yes, I know it doesn't feel like giving up from the inside, but from my perspective that's part of the problem.)
Now, a lot of the "old guard" seems to have given up too. But they at least know what they've given up on. There was an ideal of fundamental scientific progress that MIRI and Paul and a few others were striving towards; they knew at least what it would feel like (if not what it would look like) to actually make progress towards understanding intelligence. Eliezer and various others no longer think that's plausible. I disagree. But aside from the object-level disagreement, I really want people to be aware that this is a thing that's at least possible in principle to aim for, lest the next generation of the AI safety community end up giving up on it before they even know what they've given up on.
(I'll leave for another comment/post the question of what went wrong in my generation. The "types of arguments" I objected to above all seem quite EA-flavored, and so one salient possibility is just that the increasing prominence of EA steered my generation away from the type of mentality in which it's even possible to aim towards scientific breakthroughs. But even if that's one part of the story, I expect it's more complicated than that.)
I mostly wouldn't expect it to at this point, FWIW. The people engaged right now are by and large people sincerely grappling with the idea, and particularly people who are already bought into takeover risk. Whereas one of the main mechanisms by which I expect misuse of the idea is that people who are uncomfortable with the concept of "AI takeover" can still classify themselves as part of the AI safety coalition when it suits them.
As an illustration of this happening to Paul's worldview, see this Vox article titled "AI disaster won't look like the Terminator. It'll be creepier." My sense is that both Paul and Vox wanted to distance themselves from Eliezer's scenarios, and so Paul phrased his scenario in a way which downplayed stuff like "robot armies" and then Vox misinterpreted Paul to further downplay that stuff. (More on this from Carl here.) Another example: Sam Altman has previously justified racing to AGI by appealing to the idea that a slow takeoff is better than a fast takeoff.
Now, some of these dynamics are unavoidable—we shouldn't stop debating takeoffs just because people might misuse the concepts. But it's worth keeping an eye out for ideas that are particularly prone to this, and gradual disempowerment seems like one.
Well, it's much more convenient than "AI takeover", and so the question is how much people are motivated to use it to displace the AI takeover meme in their internal narratives.
Kudos for doing so. I don't mean to imply that you guys are unaware of this issue or negligent; IMO it's a pretty hard problem to avoid. I agree that stuff like "understanding power" is nowhere near adequate as a replacement. However, I do think that there's some concept like "empowering humans" which could address both takeover risk and gradual disempowerment risk if we fleshed it out into a proper research field. (Analogously, ambitious mechinterp is a way to address both fast-takeoff and slow-takeoff risks.) And so I expect that a cluster forming around something like human empowerment would be more productive and less prone to capture.
Yeah, "avoid using it altogether" would be too strong. Maybe something more like "I'll avoid using it as a headline/pointer to a cluster of people/ideas, and only use it to describe the specific threat model".