Worst-case thinking in AI alignment

A few more reasons...

First: why do software engineers use worst-case reasoning?

A joking answer would be "the users are adversaries". For most software this isn't literally true; the users don't want to break the software. But users are optimizing for things, and optimization in general tends to find corner cases. (In linear programming, for instance, almost all objectives will be maximized at a literal corner of the set allowed by the constraints.) This is sort of like "being optimized against", but it emphasizes that the optimizer need not be "adversarial" in the intuitive sense of the word in order to have that effect.
Users do a lot of different things, and "corner cases" tend to come up a lot more often than a naive analysis might think. If a user is weird in one way, they're more likely to be weird in another way too. This is sort of like "the space contains a high proportion of bad things", but with more emphasis on the points in the space being weighted in ways which weight Weirdness more than a naive analysis would suggest.
Software engineers often want to provide simple, predictable APIs. Error cases (especially unexpected error cases) make APIs more complex.
In software, we tend to have a whole tech stack. Even if each component of the stack fails only rarely, overall failure can still be extremely common if there's enough pieces any one of which can break the whole thing. (I worked at a mortgage startup where this was a big problem - we used a dozen external APIs which were each fine 95+% of the time, but that still meant our app was down very frequently overall.) So, we need each individual component to be very highly reliable.

And one more, generated by thinking about some of my own use-cases:

Unknown unknowns. Worst-case reasoning forces people to consider all the possible failure modes, and rule out any unknown unknowns.

These all carry over to alignment pretty straightforwardly.

Just want to draw out and highlight something mentioned in passing in the "You want to solve a problem in as much generality as possible..." section. Not only would it be great if you could solve a problem in the worst case, the worst case assumption is also often radically easier to think about than trying to think about realistic cases. In some sense the worst case assumption is the second-simplest assumption you could possibly make about the empirical situation (the simplest being the best case assumption -- "this problem never comes up"). My understanding is that proving theorems about average case phenomena is a huge pain and often comes much after proofs about the worst case bounds.

[-]jsteinhardt4y80

I think this probably depends on the field. In machine learning, solving problems under worst-case assumptions is usually impossible because of the no free lunch theorem. You might assume that a particular facet of the environment is worst-case, which is a totally fine thing to do, but I don't think it's correct to call it the "second-simplest solution", since there are many choices of what facet of the environment is worst-case.

One keyword for this is "partial specification", e.g. here is a paper I wrote that makes a minimal set of statistical assumptions and worst-case assumptions everywhere else: https://arxiv.org/abs/1606.05313. (Unfortunately the statistical assumptions are not really reasonable so the method was way too brittle in practice.) This kind of idea is also common in robust statistics. But my take would not be that it is simpler--in general it is way harder than just working with the empirical distribution in front of you.

[-]paulfchristiano4y20

I think this probably depends on the field. In machine learning, solving problems under worst-case assumptions is usually impossible because of the no free lunch theorem. You might assume that a particular facet of the environment is worst-case, which is a totally fine thing to do, but I don't think it's correct to call it the "second-simplest solution", since there are many choices of what facet of the environment is worst-case.

Even in ML it seems like it depends on how you formulated your problem/goal. Making good predictions in the worst case is impossible, but achieving low regret in the worst case is sensible. (Though still less useful than just "solve existing problems and then try the same thing tomorrow," and generally I'd agree "solve an existing problem for which you can verify success" is the easiest thing to do.) Hopefully having your robot not deliberately murder you is a similarly sensible goal in the worst case though it remains to be seen if it's feasible.

[-]davidad4y20

My interpretation of the NFL theorems is that solving the relevant problems under worst-case assumptions is too easy, so easy it's trivial: a brute-force search satisfies the criterion of worst-case optimality. So, that being settled, in order to make progress, we have to step up to average-case evaluation, which is harder.

(However, I agree that once we already need to do some averaging, making explicit and stripping down the statistical assumptions and trying to get closer to worst-case guarantees—without making the problem trivial again—is harder than just evaluating empirically against benchmarks.)

[-]jsteinhardt4y10

Finding the min-max solution might be easier, but what we actually care about is an acceptable solution. My point is that the min-max solution, in most cases, will be unacceptably bad.

And in fact, since min_x f(theta,x) <= E_x[f(theta,x)], any solution that is acceptable in the worst case is also acceptable in the average case.

[-]davidad4y20

Agreed—although optimizing for the worst case is usually easier than optimizing for the average case, satisficing for the worst case is necessarily harder (and, in ML, typically impossible) than satisficing for the average case.

[-]davidad4y50

To elaborate this formally,

is best-case
${max}_{θ} {min}_{x} f (θ, x)$ is worst-case
${max}_{θ} E_{x} f (θ, x)$ is average-case

$max$ and $min$ are both "easier" monoids than $E$ essentially because of dominance relations; for any $θ$ , there's going to be a single $x$ that dominates all others, in the sense that all other $x^{'} \neq x$ can be excluded from consideration and have no impact on the outcome. Whereas when calculating $E$ , the only $x^{'}$ that can be excluded are those outside the distribution's support.

$max$ is even easier than $min$ because it commutes with the outer $max$ ; not only is there a single $x$ that dominates all others, it doesn't necessarily even depend on $θ$ (the problem can be solved as ${max}_{x} {max}_{θ} f (θ, x)$ or ${max}_{θ, x} f (θ, x)$ ). As a concrete example, the best case for nearly any sorting algorithm is already-sorted input, whereas the worst case depends more on which algorithm is being examined.

[-]davidad4y40

Somewhere between worst-case and average-case performance is quantile-case performance, known in SRE circles as percentile latency and widely measured empirically in practice (but rarely estimated in theory). Formally, optimizing -quantile-case performance looks like ${max}_{θ} sup {k | E_{x} [[f (θ, x) > k]] > p}$ (compare to my expressions below for other cases). My impression is that quantile-case is heavily underexplored in theoretical CS and also underused in ML, with the exceptions of PAC learning and VC theory.

[-]davidad4y20

Here's the results of an abbreviated literature search for papers that bring quantile-case concepts into contact with contemporary RL and/or deep learning:

Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning. Christoph Dann, Tor Lattimore, Emma Brunskill. NIPS 2017.
- Defines a concept of "Uniform-PAC bound", which is roughly when -quantile-case episodic regret scales polynomially in $1 / (1 - p)$ .
- Proves that a Uniform-PAC bound implies:
  - PAC bound
  - Uniform high-probability regret bound
  - Convergence to zero regret with high probability
- Constructs an algorithm, UBEV, that has a Uniform-PAC bound
- Empirically compares quite favorably to other algorithms with only PAC or regret bounds
Policy Certificates: Towards Accountable Reinforcement Learning. Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill. ICML 2019.
- Defines an even stronger concept of "IPOC bound", which implies Uniform-PAC, and also outputs a certified per-episode regret bound along with each proposed action.
- Constructs an algorithm ORLC that has an IPOC-bound
- Empirically compares favorably to UBEV
Revisiting Generalization for Deep Learning: PAC-Bayes, Flat Minima, and Generative Models. Gintare Dziugaite. December 2018 PhD thesis under Zoubin Ghahrmani.
Lipschitz Lifelong Reinforcement Learning. Erwan Lecarpentier, David Abel, Kavosh Asadi, et al. AAAI 2021.
- Defines a pseudometric on the space of all MDPs
- Proves that the mapping from an MDP to its optimal Q-function is (pseudo-)Lipschitz
- Uses this to construct an algorithm LRMax that can transfer-learn from past similar MDPs while also being PAC-MDP
Uniform-PAC Bounds for Reinforcement Learning with Linear Function Approximation. Jiafan He, Dongruo Zhou, Quanwaun Gu. NIPS 2021.
- Constructs an algorithm FLUTE that has a Uniform-PAC bound with a certain linearity assumption on the structure of the MDP being learned.
Beyond No Regret: Instance-Dependent PAC RL. Andrew Wagenmaker, Max Simchowitz, Kevin Jamieson. August 2021 preprint.
Learning PAC-Bayes Priors for Probabilistic Neural Networks. María Pérez-Ortiz, Omar Rivasplata, Benjamin Guedj, et al. September 2021 preprint.
Tigheter Risk Certificates for Neural Networks. María Pérez-Ortiz, Omar Rivasplata, John Shawe-Taylor, Csaba Szepesvári. ICML 2021.
PACOH: Bayes-Optimal Meta-Learning with PAC-Guarantees. Jonas Rothfuss, Vincent Fortuin, Martin Josifoski, Andreas Krause. ICML 2021.

[-]Michaël Trazzi4y30

I would add How useful is quantilization for specification gaming? Ryan Carey

[-]jsteinhardt4y40

Thanks! I appreciated these distinctions. The worst-case argument for modularity came up in a past argument I had with Eliezer, where I argued that this was a reason for randomization (even though Bayesian decision theory implies you should never randomize). See section 2 here: The Power of Noise.

Re: 50% vs. 10% vs. 90%. I liked this illustration, although I don't think your argument actually implies 50% specifically. For instance if it turns out that everyone else is working on the 50% worlds and no one is working on the 90% worlds, you should probably work on the 90% worlds. In addition:

* It seems pretty plausible that the problem is overall more tractable in 10% worlds than 50% worlds, so given equal neglectedness you would prefer the 10% world.

* Many ideas will generalize across worlds, and recruitment / skill-building / organization-building also generalizes across worlds. This is an argument towards working on problems that seem tractable and relevant to any world, as long as they are neglected enough that you are building out distinct ideas and organizational capacity (vs. just picking from the same tree as ML generally). I don't think that this argument dominates considerations, but it likely explains some of our differences in approach.

In the terms laid out in your post, I think my biggest functional disagreement (in terms of how it affects what problems we work on) is that I expect most worst-case assumptions make the problem entirely impossible, and I am more optimistic that many empirically-grounded assumptions will generalize quite far, all the way to AGI. To be clear, I am not against all worst-case assumptions (for instance my entire PhD thesis is about this) but I do think they are usually a source of significant added difficulty and one has to be fairly careful where they are making them.

For instance, as regards Redwood's project, I expect making language models fully adversarially robust is impossible with currently accessible techniques, and that even a fairly restricted adversary will be impossible to defend against while maintaining good test accuracy. On the other hand I am still pretty excited about Redwood's project because I think you will learn interesting things by trying. (I spent some time trying to solve the unrestricted adversarial example competition, totally failed, but still felt it was a good use of time for similar reasons, and the difficulties for language models seem interestingly distinct in a way that should generate additional insight.) I'm actually not sure if this differs that much from your beliefs, though.

[-]Alex Flint3y30Review for 2021 Review

This is a relatively banal meta-commentary on reasons people sometimes give for doing worst-case analysis, and the differences between those reasons. The post reads like a list of things with no clear through-line. There is a gesture at an important idea from a Yudkowsky post (the logistic success curve idea) but the post does not helpfully expound that idea. There is a kind of trailing-off towards the end of the post as things like "planning fallacy" seem to have been added to the list with little time taken to place them in the context of the other things on the list. In the "differences between these arguments" section, the post doesn't clearly elucidate deep differences between the arguments, it just lists verbal responses that you might make if you are challenged on plausibility grounds in each case.

Overall, I felt that this post under-delivered on an important topic.

[-]Vaniver4y20

When you’re considering between a project that gives us a boost in worlds where P(doom) was 50% and projects that help out in worlds where P(doom) was 1% or 99%, you should probably pick the first project, because the derivative of P(doom) with respect to alignment progress is maximized at 50%.
Many prominent alignment researchers estimate P(doom) as substantially less than 50%. Those people often focus on scenarios which are surprisingly bad from their perspective basically for this reason.
And conversely, people who think P(doom) > 50% should aim their efforts at worlds that are better than they expected.

This section seems reversed to me, unless I'm misunderstanding it. If "things as I expect" are P(doom) 99%, and "I'm pleasantly wrong about the usefulness of natural abstractions" is P(doom) 50%, the first paragraph suggests I should do the "better than expected" / "surprisingly good" world, because the marginal impact of effort is higher in that world.

[Another way to think about it is surprising in the direction you already expect is extremizing, but logistic success has its highest derivative in the middle, i.e. is a moderating force.]

[-]Raemon3y10Review for 2021 Review

This piece took an important topic that I hadn't realized I was confused/muddled about, convinced me I was confused/muddled about it, while simultaneously providing a good framework for thinking about it. I feel like I have a clearer sense of how Worst Case Thinking applies in alignment.

I also appreciated a lot of the comments here that explore the topic in more detail.

AI ALIGNMENT FORUM
Petrov Day
AF

AI ALIGNMENT FORUM
Petrov Day
AF

76

Worst-case thinking in AI alignment

76

My list of reasons to maybe use worst-case thinking

You’re being optimized against

The space you’re selecting over happens to mostly contain bad things

You want to solve a problem in as much generality as possible, and so you want to avoid making assumptions that might not hold

Aiming your efforts at worlds where you have the biggest marginal impact

Murphyjitsu

Planning fallacy

Differences between these arguments