Sure, I agree! If we miss even one such action, we're screwed. My point is that if people put enough skill and effort into trying to catch all such actions, then there is a significant chance that they'll catch literally all the actions that are (1) world-ending and that (2) the AI actually wants to try.
There's also a significant chance we won't, which is quite bad and very alarming, hence people should work on AI safety.
I mostly agree with the reasoning here; thank you to Eliezer for posting it and explaining it clearly. It's good to have all these reasons here in one place.
The one area I partly disagree with is Section B.1. As I understand it, the main point of B.1 is that we can't guard against all of the problems that will crop up as AI grows more intelligent, because we can't foresee all of those problems, because most of them will be "out-of-distribution," i.e., not the kinds of problems where we have reasonable training data. A superintelligent AI will do strange things that wouldn't have occurred to us, precisely because it's smarter than we are, and some of those things will be dangerous enough to wipe out all human life.
I think this somewhat overstates the problem. If we tell an AI not to invent nanotechnology, not to send anything to protein labs, not to hack into all of the world's computers, not to design weird new quantum particles, not to do any of the 100 other most dangerous and weirdest things we can think of, and then ask it to generalize and learn not to do things of that sort and build avoidance of catastrophic danger as a category into its utility function...
And then we test whether the AI is actually doing these things and successfully using something like the human category of "catastrophe" when the AI is only slightly smarter than humans...
And then learn from those tests and honestly look at the failures and improve the AI's catastrophe-avoidance skills based on what we learn...
Then the chances that that AI won't immediately destroy the world seem to me to be much, much larger than 0.1%. They're still low, which is bad, but they're not laughably insignificant, either, because if you make an honest, thoughtful, sustained effort to constrain the preferences of your successors, then often you at least partially succeed.
If natural selection had feelings, it might not be maximally happy with the way humans are behaving in the wake of Cro-Magnon optimization...but it probably wouldn't call it a disaster, either. Despite the existence of contraception, there sure are a whole lot more Cro-Magnons than there ever were Neanderthals, and the population is still going up every year.
Similarly, training an AI to act responsibly isn't going to get us a reliably safe AI, but if whoever launches the first superintelligent AI puts enough effort into that kind of training, then I don't see any reason why we shouldn't expect at least a 50% chance of a million or more survivors. I'm much more worried about large, powerful organizations that "vocally disdain all talk of AGI safety" than I am about the possibility that AGI safety research is inherently futile. It's inherently imperfect in that there's no apparent path to guaranteeing the friendliness of superintelligence...but that's not quite the same thing as saying that we shouldn't expect to be able to increase the probability that superintelligence is at least marginally friendly.
One of my assumptions is that it's possible to design a "satisficing" engine -- an algorithm that generates candidate proposals for a fixed number of cycles, and then, assuming at least one proposal with estimated utility greater than X has been generated within that amount of time, selects one of the qualifying proposals at random. If there are no qualifying candidates, the AI takes no action.
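To make that concrete, here is a minimal sketch of the kind of satisficing loop I have in mind. The names (`satisfice`, `propose`, `estimate_utility`, `threshold`, `cycles`) are all hypothetical, and a real system's proposal generator and utility estimate would be nothing like this simple; the sketch only pins down the selection rule.

```python
import random
from typing import Callable, Optional, TypeVar

Plan = TypeVar("Plan")


def satisfice(
    propose: Callable[[random.Random], Plan],   # hypothetical plan generator
    estimate_utility: Callable[[Plan], float],  # hypothetical utility estimate
    threshold: float,                           # the "X" in the description
    cycles: int,                                # fixed number of proposal cycles
    rng: Optional[random.Random] = None,
) -> Optional[Plan]:
    """Generate `cycles` candidates, then pick uniformly at random among
    those whose estimated utility exceeds `threshold`.

    Returns None (take no action) if no candidate qualifies.
    """
    if rng is None:
        rng = random.Random()

    qualifying = []
    for _ in range(cycles):
        plan = propose(rng)
        if estimate_utility(plan) > threshold:
            qualifying.append(plan)

    if not qualifying:
        return None  # no qualifying candidates: the AI takes no action
    return rng.choice(qualifying)
```

Note that nothing here ranks the qualifying plans against each other; a plan that barely clears the threshold is as likely to be chosen as one that clears it by a mile.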
If you have a straightforward optimizer that always returns the action with the highest expected utility, then, yeah, you only have to miss one "cheat" that improves "official" utility at the expense of murdering everyone everywhere and then we all die. But if you have a satisficer, then as long as some of the qualifying plans don't kill everyone, there's a reasonable chance that the AI will pick one of those plans. Even if you forget to explicitly penalize one of the pathways to disaster, there's no special reason why that one pathway would show up in a large majority of the AI's candidate plans.
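As a toy illustration of the difference (not evidence about real systems), suppose the candidate pool contains one unpenalized "cheat" with the highest official utility and nine safe plans that also clear the threshold. The pool, the utilities, and the one-in-ten ratio are made-up assumptions; the whole argument rests on the cheat not dominating the set of qualifying plans.

```python
import random


def maximize(plans, utility):
    """Straightforward optimizer: always return the highest-utility plan."""
    return max(plans, key=utility)


def pick_satisficing(plans, utility, threshold, rng):
    """Satisficer: pick uniformly among plans that clear the threshold."""
    qualifying = [p for p in plans if utility(p) > threshold]
    return rng.choice(qualifying) if qualifying else None


# Hypothetical pool of (name, official_utility, kills_everyone) tuples.
# The "cheat" scores highest only because we forgot to penalize it.
plans = [("cheat", 100.0, True)] + [
    (f"safe_{i}", 60.0 + i, False) for i in range(9)
]


def utility(plan):
    return plan[1]


rng = random.Random(0)
trials = 10_000

# The optimizer is deterministic: it picks the cheat every single time.
optimizer_choice = maximize(plans, utility)

# The satisficer picks the cheat only as often as it appears among the
# qualifying plans (here, 1 plan out of 10).
catastrophes = sum(
    pick_satisficing(plans, utility, 50.0, rng)[2] for _ in range(trials)
)
satisficer_rate = catastrophes / trials

print("optimizer picks:", optimizer_choice[0])
print("satisficer catastrophe rate:", satisficer_rate)
```

In this setup the optimizer chooses the cheat with certainty, while the satisficer's catastrophe rate should land near the cheat's share of the qualifying pool. If the missed pathway instead made up most of the qualifying plans, the satisficer would offer little protection.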