(Status: no pretense to originality, but a couple of people said they found this terminology useful, so I’m sharing it more widely.)
There’s a category of AGI safety work that we might call “Endgame Safety”, where we’re trying to do all the AGI safety work that we couldn’t or didn’t do ahead of time, in the very last moments before (or even after) people are actually playing around with powerful AGI algorithms of the type that could get irreversibly out of control and cause catastrophe.
I think everyone agrees that Endgame Safety is important and unavoidable. If nothing else, for every last line of AGI source code, we can analyze what happens if that line of code has a bug, or if a cosmic ray flips a bit, and figure out how to write good unit tests, etc. But we’re obviously not going to have AGI source code until the endgame. That was an especially straightforward example, but I imagine that many other things will also fall into the Endgame Safety bucket, i.e. bigger-picture important things to know about AGI that we only realize when we’re in the thick of building it.
So I am not an “Endgame Safety denialist”; I don’t think anyone is. But I find that people are sometimes misled by thinking about Endgame Safety, in the following two ways:
This is a bad argument because, what’s the rush? There’s going to be an endgame sooner or later, and we can do Endgame Safety Research then! Bringing the endgame sooner is basically equivalent to having all the AI alignment and strategy researchers hibernate for some number N years, and then wake up and get back to work. And that, in turn, is strictly worse than having all the AI alignment and strategy researchers do what they can during the next N years, and also continue doing work after those N years have elapsed.
I claim that there are plenty of open problems in AGI safety / alignment that we can work on right now, that people are in fact working on right now, that seem robustly useful, and that are not in the category of “Endgame Safety”, e.g. my list of 7 projects, these 200 interpretability projects, this list, ELK, everything on Alignment Forum, etc.
For example, sometimes I’ll have this discussion:
So there’s my answer. I claim that this person is trying to do Endgame Safety right now, and I don’t want to help them. I think they should find something else to do right now instead, while they wait for some AI researcher to publish an answer to their prerequisite capabilities question. That’s bound to happen sooner or later! Or they can do contingency-planning for each of the possible answers to their capabilities question. Whatever.
See examples of this argument by Kaj Sotala and by Rohin Shah.
I mainly don’t like this argument because of (A), as elaborated in the next section.
But I also think (B) is problematic. First, why might someone believe (B)? The main argument I’ve seen revolves around the idea that making the endgame happen ASAP minimizes compute overhang, making a slow takeoff likelier, which in turn might make the endgame less of a time-crunch.
I mostly don’t buy this argument because I think it relies on guessing which bottlenecks will hit in which order on the path to AGI—not only for capabilities but also for safety. That kind of guessing seems very difficult. For example, Kaj mentions “more neuroscience understanding” as bad because it shortens the endgame, but couldn’t Kaj have equally well said that “more neuroscience understanding” is exactly the thing we want right now to make the endgame start sooner?
As another example, this is controversial, but I don’t expect that compute will be a bottleneck to fast takeoff, and thus I don’t expect future advances in compute to meaningfully speed the Endgame. I think we already have a massive compute overhang, with current technology, once we know how to make AGI at all. The horse has already left the barn, IMO. See here.
So anyway, I see this whole (B) argument as quite fragile, even leaving aside the (A) issues. And speaking of (A), that brings us to the next item:
The biggest problem here is that, while Endgame Safety researchers will be in a better position to make progress in some ways, they’ll be in a much worse position than us in other ways. In particular:
As mentioned at the top, I think this is obviously true.
I’m talking as if there’s a sharp line where things either are or aren’t “Endgame Safety”. More realistically, I’m sure there will be a continuum. I don’t think that matters for anything I’m saying in this post.
I guess I could share information in secret, but let’s assume for the sake of argument that I don’t want to. For example, maybe I don’t know them very well. Or maybe they’re planning to publish their research. (A very reasonable thing to do!) Or maybe secrets are a giant awful mess with tons of negative consequences that I don’t want to burden them with.
E.g. Jeff Hawkins discusses in this video how someone sent him a free copy of Human Compatible, and he read it, and he thought that the book’s arguments were all stupid. For Yann LeCun see here.
Downvoted for mischaracterizing the views of the people you're arguing against in the "Bad argument 1" section.
(Footnote 2 is better, but it's a footnote, not the main text.)
EDIT: Removed my downvote given the edits.
Thanks, I just added the following text:
(Edited to add: Obviously, the existence of bad arguments for X does not preclude the existence of good arguments for X! The following is a narrow point of [hopefully] clarification, not an all-things-considered argument for a conclusion. See also footnote .)
I know that you don’t make Bad Argument 1—you were specifically one of the people I was thinking of when I wrote Footnote 2. I disagree that nobody makes Bad Argument 1. I think that Lone Pine’s comment on this very post is probably an example. I have seen lots of other examples, although I’m having trouble digging up other ones right now.
I guess you can say it’s unvirtuous / un-scout-mindset of me to spend more time refuting bad arguments for positions I disagree with, than refuting bad arguments for positions I agree with? Hmm.
I also changed the Kaj link from “Example of this argument” to “Example of something close to this argument”. As a matter of fact, I do think that some actual Bad-Argument-1-thinking slipped into Kaj’s post in various places. At least, that’s how it came across to me. But it’s probably not a good use of time to argue about that.
The edits help, thanks. I was in large part reacting to the fact that Kaj's post reads very differently from your summary of Bad Argument 1 (rather than the fact that I don't make Bad Argument 1). In the introductory paragraph where he states his position (the third paragraph of the post), he concludes:
Thus by doing capabilities research now, we buy ourselves a longer time period in which it's possible to do more effective alignment research.
Which is clearly not equivalent to "alignment researchers hibernate for N years and then get back to work".
Plausibly some actual Bad-Argument-1 thinking is slipping in, but it's not the thing that is explicitly said in the introduction.
Eh, this seems fine. I just don't actually know of people who seriously believe Bad Argument 1, and I would bet they are rare. So I predict that the main effect is making people who already believe your position even more confident in it, by virtue of thinking that the opponents of that position are stupid. More generally, I wish that when people write takedowns of some incorrect position, they would try to establish that the position is actually commonly held.
(Separately, I am generally more in favor of posts that lay out the strongest arguments for and against a position and then come to an overall conclusion, at which point you don't have to worry about considerations like "does anyone believe the thing I'm refuting", but those are significantly more effort.)
On further reflection, I promoted the thing from a footnote to the main text, elaborated on it, and added another thing at the end.
(I think I wrote this post in a snarkier way than my usual style, and I regret that. Thanks again for the pushback.)
Of course, Hawkins doesn't just say they are stupid. It is Byrnes who is summarily dismissing Hawkins, in fact.
Do you think my post implied that Hawkins said they were stupid for no reason at all? If so, can you suggest how to change the wording?
To my ears, if I hear someone say “Person X thinks Argument Y is stupid”, it’s very obvious that I could then go ask Person X why they think it’s stupid, and they would have some answer to that question.
So when I wrote “Jeff Hawkins thought the book’s arguments were all stupid”, I didn’t think I was implying that Jeff wasn’t paying attention, or that Jeff wasn’t thinking, or whatever. If I wanted to imply those things, I would have said “Jeff Hawkins ignored the book’s arguments” or “Jeff Hawkins unthinkingly dismissed the book’s arguments” or “Jeff Hawkins dismissed the book’s arguments without any justification” or something like that. I really meant no negative connotation. I describe myself as thinking that lots of things are stupid, and I don’t think of that as a self-deprecating kind of thing to say. Again, I’m open to changing the wording.
As it turns out, Jeff Hawkins has written extensively on why he thinks that AGI x-risk is not going to happen, and I in turn have written extensively (probably more than literally anyone else on earth) on why his arguments are wrong: See in particular