Alternative title: “When should you assume that what could go wrong, will go wrong?”

Thanks to Mary Phuong and Ryan Greenblatt for helpful suggestions and discussion, and Akash Wasil for some edits.

In discussions of AI safety, people often propose the assumption that something goes as badly as possible. Eliezer Yudkowsky in particular has argued for the importance of security mindset when thinking about AI alignment.

I think there are several distinct reasons that this might be the right assumption to make in a particular situation. But I think people often conflate these reasons, and I think that this causes confusion and mistaken thinking. So I want to spell out some distinctions.

Throughout this post, I give a bunch of specific arguments about AI alignment, including one argument that I think I was personally getting wrong until I noticed my mistake yesterday (which was my impetus for thinking about this topic more and then writing this post). I think I’m probably still thinking about some of my object level examples wrong, and hope that if so, commenters will point out my mistakes. But I think I’ll stand by the claim that we should be attempting to distinguish between these classes of argument.

My list of reasons to maybe use worst-case thinking

Here’s an attempt at describing some different classes situations where you might want to argue that something goes as badly as it could.

You’re being optimized against

For example, if you’ve built an unaligned AI and you have a team of ten smart humans looking for hidden gotchas in its proposed actions, then the unaligned AI will probably come up with a way of doing something bad that the humans miss. In AI alignment, we most often think about cases where the AI we’re training is optimizing against us, but sometimes we also need to think about cases where other AIs or other humans are optimizing against us or our AIs.

In situations like this, I think Eliezer’s attitude is basically right: we’re being optimized against and so we have to use worst-case thinking and search hard for systems which we can strongly argue are infallible.

One minor disagreement: I’m less into hard takeoffs than he is, so I place less weight than he does on situations where your AI becomes superintelligent enough during training that it can exploit some kind of novel physics to jump an airgap or whatever. (Under my model, such a model probably just waits until it’s deployed to the internet–which is one of the first things that AGI developers want to do with it, because that’s how you make money with a powerful AI–and then kills everyone.)

But I fundamentally agree with his rejection of arguments of the form “only a small part of the space of possible AI actions would be devastatingly bad, so things will probably be fine”.

Scott Garrabrant writes about an argument like this here.

The space you’re selecting over happens to mostly contain bad things

When Hubinger et al argue in section 4.4 of Risks from Learned Optimization that “there are more paths to deceptive alignment than to robust alignment,” they aren’t saying that you get a misaligned mesa-optimizer because the base optimizer is trying to produce an agent that is as misaligned as possible, they’re saying that even though the base optimizer isn’t trying to find a misaligned policy, most policies that it can find are misaligned and so you’ll probably get one. But unlike the previous situation, if instead it was the case that 50% of the policies that SGD might find were aligned, then we’d have a 50% chance of surviving, because SGD isn’t optimizing against us.

I think that AI alignment researchers often conflate these two classes of arguments. IMO, when you’re training an AGI:

  • The AI will try to kill you if it’s misaligned. So if you remove some but not all strategies that any unaligned AI could use to get through your training process, you haven’t made much progress at all.
  • But SGD isn’t trying to kill you, and so if there exist rare misaligned models in the model space that could make it through the training process and then kill you, what matters is how common they are, not whether they exist at all. If you never instantiate the model, it never gets a chance to pervert your optimization process (barring crazy scenarios with acausal threats or whatever).

(I noticed that I was making a mistake related to mixing up these two classes on Sunday; I then thought about this some more and wrote this post.)

You want to solve a problem in as much generality as possible, and so you want to avoid making assumptions that might not hold

There’s a certain sense in which cryptographers make worst-case assumptions in their research. For example, when inventing public key cryptography, cryptographers were asking the question “Suppose I want to be able to communicate privately with someone, but an eavesdropper is able to read all messages that we send to each other. Is there some way to communicate privately regardless?”

Suppose someone responded by saying “It seems like you’re making the assumption that someone is spying on your communications all the time. But isn’t this unrealistically pessimistic?”

The cryptographer’s response would be to say “Sure, it’s probably not usually the case that someone is spying on my packets when I send messages over the internet. But when I’m trying to solve the technical problem of ensuring private communication, it’s quite convenient to assume a simple and pessimistic threat model. Either I’ll find an approach that works in any scenario less pessimistic than the one I solved, or I’ll learn that we actually need to ensure some other way that no-one’s reading my packets.”

Similarly, in the alignment case, sometimes we make pessimistic empirical assumptions when trying to specify settings for our problems, because solutions developed for pessimistic assumptions generalize to easier situations but the converse isn’t true.

As a large-scale example, when we talk about trying to come up with competitive solutions to AI alignment, a lot of the motivation isn’t the belief that there will be literally no useful global coordination around AI.

A smaller-scale example: When trying to develop schemes for relaxed adversarial training, we assume that we have no access to any interpretability tools for our models. This isn’t because we actually believe that we’ll have no interpretability tools, it’s because we’re trying to develop an alternative to relying on interpretability.

This is kind of similar to the attitude that cryptographers have. 

Aiming your efforts at worlds where you have the biggest marginal impact

Suppose you are unsure how hard the alignment problem is. Maybe you think that humanity’s odd’s of success are given by a logistic function of the difference between how much alignment progress was made and how hard the problem is. When you’re considering between a project that gives us a boost in worlds where P(doom) was 50% and projects that help out in worlds where P(doom) was 1% or 99%, you should probably pick the first project, because the derivative of P(doom) with respect to alignment progress is maximized at 50%.

Many prominent alignment researchers estimate P(doom) as substantially less than 50%. Those people often focus on scenarios which are surprisingly bad from their perspective basically for this reason.

And conversely, people who think P(doom) > 50% should aim their efforts at worlds that are better than they expected. This is the point that Eliezer makes in Security Mindset and the Logistic Success Curve: the security-minded character thinks that it’s so unlikely that a particular security-lax project will succeed at building a secure system that she doesn’t think it’s worth her time to try to help them make marginal improvements to their security.

And so, this kind of thinking only pushes you to aim your efforts at surprisingly bad worlds if you’re already P(doom) < 50%.

This type of thinking is common among people who are thinking about global catastrophic biological risks. I don’t know of any public documents that are specifically about this point, but you can see an example of this kind of reasoning in Andrew Snyder-Beattie’s Peak defence vs trough defence in biosecurity.


Sometimes a problem involves a bunch of weird things that could go wrong, and in order to get good outcomes, it has to be the case that all of them go well. For example, I don’t think that “a terrorist infiltrates the team of labellers who are being used to train the AGI and poisons the data” is a very likely AI doom scenario. But I think there are probably 100 scenarios as plausible as that one, each of which sounds kind of bad. And I think it’s probably worth some people’s time to try to stamp out all these individually unlikely failure modes.

Planning fallacy

Ryan Greenblatt notes that you can also make a general reference class claim that people are too optimistic (planning fallacy etc.).

Differences between these arguments

Depending on which of these arguments you’re making, you should respond very differently when someone says “the thing you’re proposing is quite far fetched”.

  • If the situation involves being optimized against, you say “I agree that that action would be quite a weird action among actions. But there’s a powerful optimization process selecting for actions like that action. So I expect it to happen anyway. To persuade me otherwise, you need to either claim that there isn’t adversarial selection, or that bad actions either don’t exist or are so hard to find that an adversary won’t possibly be able to find them.”
  • If you think that the situation involves a random process selecting over a space that is almost all bad, then you should say “Actually I disagree, I think that in fact the situation we’re talking about is probably about as bad as I’m saying; we should argue about what the distribution actually looks like.”
  • If you are making worst-case assumptions as part of your problem-solving process, then you should say “I agree that this situation seems sort of surprisingly bad. But I think we should try to solve it anyway, because solving it gives us a solution that is likely to work no matter what the empirical situation turns out to be, and I haven’t yet been convinced that my pessimistic assumptions make my problem impossible.”
  • If you’re making worst-case assumptions because you think that P(doom) is low and you are focusing on scenarios you agree are worse than expected, you should say “I agree that this situation seems sort of surprisingly bad. But I want to work on the situations where I can make the biggest difference, and I think that these surprisingly bad situations are the highest-leverage ones to work on.”
  • If you’re engaging in Murphyjistu, you should say “Yeah this probably won’t come up, but it still seems like a good idea to try and crush all these low-probability mechanisms by which something bad might happen.”

Mary Phuong proposes breaking this down into two questions:

  • When should you believe things will go badly, because they in fact will go badly? (you're being optimized against, or the probability of badness is high for some other reason)
  • When should you focus your efforts on worlds where things go badly? I.e. it's about which parts of the distribution you intervene on, rather than an argument about what the distribution looks like.
New Comment
15 comments, sorted by Click to highlight new comments since:

A few more reasons...

First: why do software engineers use worst-case reasoning?

  • A joking answer would be "the users are adversaries". For most software this isn't literally true; the users don't want to break the software. But users are optimizing for things, and optimization in general tends to find corner cases. (In linear programming, for instance, almost all objectives will be maximized at a literal corner of the set allowed by the constraints.) This is sort of like "being optimized against", but it emphasizes that the optimizer need not be "adversarial" in the intuitive sense of the word in order to have that effect.
  • Users do a lot of different things, and "corner cases" tend to come up a lot more often than a naive analysis might think. If a user is weird in one way, they're more likely to be weird in another way too. This is sort of like "the space contains a high proportion of bad things", but with more emphasis on the points in the space being weighted in ways which weight Weirdness more than a naive analysis would suggest.
  • Software engineers often want to provide simple, predictable APIs. Error cases (especially unexpected error cases) make APIs more complex.
  • In software, we tend to have a whole tech stack. Even if each component of the stack fails only rarely, overall failure can still be extremely common if there's enough pieces any one of which can break the whole thing. (I worked at a mortgage startup where this was a big problem - we used a dozen external APIs which were each fine 95+% of the time, but that still meant our app was down very frequently overall.) So, we need each individual component to be very highly reliable.

And one more, generated by thinking about some of my own use-cases:

  • Unknown unknowns. Worst-case reasoning forces people to consider all the possible failure modes, and rule out any unknown unknowns.

These all carry over to alignment pretty straightforwardly.

Just want to draw out and highlight something mentioned in passing in the "You want to solve a problem in as much generality as possible..." section. Not only would it be great if you could solve a problem in the worst case, the worst case assumption is also often radically easier to think about than trying to think about realistic cases. In some sense the worst case assumption is the second-simplest assumption you could possibly make about the empirical situation (the simplest being the best case assumption -- "this problem never comes up"). My understanding is that proving theorems about average case phenomena is a huge pain and often comes much after proofs about the worst case bounds.

I think this probably depends on the field. In machine learning, solving problems under worst-case assumptions is usually impossible because of the no free lunch theorem. You might assume that a particular facet of the environment is worst-case, which is a totally fine thing to do, but I don't think it's correct to call it the "second-simplest solution", since there are many choices of what facet of the environment is worst-case.

One keyword for this is "partial specification", e.g. here is a paper I wrote that makes a minimal set of statistical assumptions and worst-case assumptions everywhere else: (Unfortunately the statistical assumptions are not really reasonable so the method was way too brittle in practice.) This kind of idea is also common in robust statistics. But my take would not be that it is simpler--in general it is way harder than just working with the empirical distribution in front of you.

I think this probably depends on the field. In machine learning, solving problems under worst-case assumptions is usually impossible because of the no free lunch theorem. You might assume that a particular facet of the environment is worst-case, which is a totally fine thing to do, but I don't think it's correct to call it the "second-simplest solution", since there are many choices of what facet of the environment is worst-case.

Even in ML it seems like it depends on how you formulated your problem/goal. Making good predictions in the worst case is impossible, but achieving low regret in the worst case is sensible. (Though still less useful than just "solve existing problems and then try the same thing tomorrow," and generally I'd agree "solve an existing problem for which you can verify success" is the easiest thing to do.) Hopefully having your robot not deliberately murder you is a similarly sensible goal in the worst case though it remains to be seen if it's feasible.

My interpretation of the NFL theorems is that solving the relevant problems under worst-case assumptions is too easy, so easy it's trivial: a brute-force search satisfies the criterion of worst-case optimality. So, that being settled, in order to make progress, we have to step up to average-case evaluation, which is harder.

(However, I agree that once we already need to do some averaging, making explicit and stripping down the statistical assumptions and trying to get closer to worst-case guarantees—without making the problem trivial again—is harder than just evaluating empirically against benchmarks.)

Finding the min-max solution might be easier, but what we actually care about is an acceptable solution. My point is that the min-max solution, in most cases, will be unacceptably bad.

And in fact, since min_x f(theta,x) <= E_x[f(theta,x)], any solution that is acceptable in the worst case is also acceptable in the average case.

Agreed—although optimizing for the worst case is usually easier than optimizing for the average case, satisficing for the worst case is necessarily harder (and, in ML, typically impossible) than satisficing for the average case.

To elaborate this formally,

  •  is best-case
  •  is worst-case
  •  is average-case

 and  are both "easier" monoids than  essentially because of dominance relations; for any , there's going to be a single  that dominates all others, in the sense that all other  can be excluded from consideration and have no impact on the outcome. Whereas when calculating , the only  that can be excluded are those outside the distribution's support.

 is even easier than  because it commutes with the outer ; not only is there a single  that dominates all others, it doesn't necessarily even depend on  (the problem can be solved as  or ). As a concrete example, the best case for nearly any sorting algorithm is already-sorted input, whereas the worst case depends more on which algorithm is being examined.

Somewhere between worst-case and average-case performance is quantile-case performance, known in SRE circles as percentile latency and widely measured empirically in practice (but rarely estimated in theory). Formally, optimizing -quantile-case performance looks like  (compare to my expressions below for other cases). My impression is that quantile-case is heavily underexplored in theoretical CS and also underused in ML, with the exceptions of PAC learning and VC theory.

Here's the results of an abbreviated literature search for papers that bring quantile-case concepts into contact with contemporary RL and/or deep learning:

  • Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning. Christoph Dann, Tor Lattimore, Emma Brunskill. NIPS 2017.
    • Defines a concept of "Uniform-PAC bound", which is roughly when -quantile-case episodic regret scales polynomially in .
    • Proves that a Uniform-PAC bound implies:
      • PAC bound
      • Uniform high-probability regret bound
      • Convergence to zero regret with high probability
    • Constructs an algorithm, UBEV, that has a Uniform-PAC bound
    • Empirically compares quite favorably to other algorithms with only PAC or regret bounds
  • Policy Certificates: Towards Accountable Reinforcement Learning. Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill. ICML 2019.
    • Defines an even stronger concept of "IPOC bound", which implies Uniform-PAC, and also outputs a certified per-episode regret bound along with each proposed action.
    • Constructs an algorithm ORLC that has an IPOC-bound
    • Empirically compares favorably to UBEV
  • Revisiting Generalization for Deep Learning: PAC-Bayes, Flat Minima, and Generative Models. Gintare Dziugaite. December 2018 PhD thesis under Zoubin Ghahrmani.
  • Lipschitz Lifelong Reinforcement Learning. Erwan Lecarpentier, David Abel, Kavosh Asadi, et al. AAAI 2021.
    • Defines a pseudometric on the space of all MDPs
    • Proves that the mapping from an MDP to its optimal Q-function is (pseudo-)Lipschitz
    • Uses this to construct an algorithm LRMax that can transfer-learn from past similar MDPs while also being PAC-MDP
  • Uniform-PAC Bounds for Reinforcement Learning with Linear Function Approximation. Jiafan He, Dongruo Zhou, Quanwaun Gu. NIPS 2021.
    • Constructs an algorithm FLUTE that has a Uniform-PAC bound with a certain linearity assumption on the structure of the MDP being learned.
  • Beyond No Regret: Instance-Dependent PAC RL. Andrew Wagenmaker, Max Simchowitz, Kevin Jamieson. August 2021 preprint.
  • Learning PAC-Bayes Priors for Probabilistic Neural Networks. María Pérez-Ortiz, Omar Rivasplata, Benjamin Guedj, et al. September 2021 preprint.
  • Tigheter Risk Certificates for Neural Networks. María Pérez-Ortiz, Omar Rivasplata, John Shawe-Taylor, Csaba Szepesvári. ICML 2021.
  • PACOH: Bayes-Optimal Meta-Learning with PAC-Guarantees. Jonas Rothfuss, Vincent Fortuin, Martin Josifoski, Andreas Krause. ICML 2021.

Thanks! I appreciated these distinctions. The worst-case argument for modularity came up in a past argument I had with Eliezer, where I argued that this was a reason for randomization (even though Bayesian decision theory implies you should never randomize). See section 2 here: The Power of Noise.

Re: 50% vs. 10% vs. 90%. I liked this illustration, although I don't think your argument actually implies 50% specifically. For instance if it turns out that everyone else is working on the 50% worlds and no one is working on the 90% worlds, you should probably work on the 90% worlds. In addition:

 *  It seems pretty plausible that the problem is overall more tractable in 10% worlds than 50% worlds, so given equal neglectedness you would prefer the 10% world.

 * Many ideas will generalize across worlds, and recruitment / skill-building / organization-building also generalizes across worlds. This is an argument towards working on problems that seem tractable and relevant to any world, as long as they are neglected enough that you are building out distinct ideas and organizational capacity (vs. just picking from the same tree as ML generally). I don't think that this argument dominates considerations, but it likely explains some of our differences in approach.

In the terms laid out in your post, I think my biggest functional disagreement (in terms of how it affects what problems we work on) is that I expect most worst-case assumptions make the problem entirely impossible, and I am more optimistic that many empirically-grounded assumptions will generalize quite far, all the way to AGI. To be clear, I am not against all worst-case assumptions (for instance my entire PhD thesis is about this) but I do think they are usually a source of significant added difficulty and one has to be fairly careful where they are making them.

For instance, as regards Redwood's project, I expect making language models fully adversarially robust is impossible with currently accessible techniques, and that even a fairly restricted adversary will be impossible to defend against while maintaining good test accuracy. On the other hand I am still pretty excited about Redwood's project because I think you will learn interesting things by trying. (I spent some time trying to solve the unrestricted adversarial example competition, totally failed, but still felt it was a good use of time for similar reasons, and the difficulties for language models seem interestingly distinct in a way that should generate additional insight.) I'm actually not sure if this differs that much from your beliefs, though.

This is a relatively banal meta-commentary on reasons people sometimes give for doing worst-case analysis, and the differences between those reasons. The post reads like a list of things with no clear through-line. There is a gesture at an important idea from a Yudkowsky post (the logistic success curve idea) but the post does not helpfully expound that idea. There is a kind of trailing-off towards the end of the post as things like "planning fallacy" seem to have been added to the list with little time taken to place them in the context of the other things on the list. In the "differences between these arguments" section, the post doesn't clearly elucidate deep differences between the arguments, it just lists verbal responses that you might make if you are challenged on plausibility grounds in each case.

Overall, I felt that this post under-delivered on an important topic.

When you’re considering between a project that gives us a boost in worlds where P(doom) was 50% and projects that help out in worlds where P(doom) was 1% or 99%, you should probably pick the first project, because the derivative of P(doom) with respect to alignment progress is maximized at 50%.

Many prominent alignment researchers estimate P(doom) as substantially less than 50%. Those people often focus on scenarios which are surprisingly bad from their perspective basically for this reason.

And conversely, people who think P(doom) > 50% should aim their efforts at worlds that are better than they expected.

This section seems reversed to me, unless I'm misunderstanding it. If "things as I expect" are P(doom) 99%, and "I'm pleasantly wrong about the usefulness of natural abstractions" is P(doom) 50%, the first paragraph suggests I should do the "better than expected" / "surprisingly good" world, because the marginal impact of effort is higher in that world. 

[Another way to think about it is surprising in the direction you already expect is extremizing, but logistic success has its highest derivative in the middle, i.e. is a moderating force.]

This piece took an important topic that I hadn't realized I was confused/muddled about, convinced me I was confused/muddled about it, while simultaneously providing a good framework for thinking about it. I feel like I have a clearer sense of how Worst Case Thinking applies in alignment.

I also appreciated a lot of the comments here that explore the topic in more detail.