A catastrophe is an event so bad that we are not willing to let it happen even a single time. For example, we would be unhappy if our self-driving car ever accelerates to 65 mph in a residential area and hits a pedestrian.
Catastrophes present a theoretical challenge for traditional machine learning — typically there is no way to reliably avoid catastrophic behavior without strong statistical assumptions.
In this post, I’ll lay out a very general model for catastrophes in which they are avoidable under much weaker statistical assumptions. I think this framework applies to the most important kinds of catastrophe, and will be especially relevant to AI alignment.
Designing practical algorithms that work in this model is an open problem. In a subsequent post I describe what I currently see as the most promising angles of attack.
We consider an agent A interacting with the environment over a sequence of episodes. Each episode produces a transcript τ, consisting of the agent’s observations and actions, along with a reward r ∈ [0, 1]. Our primary goal is to quickly learn an agent which receives high reward. (Supervised learning is the special case where each transcripts consist of a single input and a label for that input.)
While training, we assume that we have an oracle which can determine whether a transcript τ is “catastrophic.” For example, we might show a transcript to a QA analyst and ask them if it looks catastrophic. This oracle can be applied to arbitrary sequences of observations and actions, including those that don’t arise from an actual episode. So training can begin before the very first interaction with nature, using only calls to the oracle.
Intuitively, a transcript should only be marked catastrophic if it satisfies two conditions:
While actually interacting with the environment, the agent cannot query the oracle — there is no time to wait for a QA engineer to review a proposed action to check if it would be catastrophic.
Moreover, if interaction with nature ever produces a catastrophic transcript, we immediately fail. The performance of an algorithm is characterized by two parameters: the probability of catastrophic failure, and the total reward assuming no catastrophic failure.
We assume that there are some policies such that no matter what nature does, the resulting transcript is never catastrophic.
Traditionally in RL the goal is to get as much reward as the best policy from some class C. We’ slightly weaken that goal, and instead aim to do as well as the best policy from C that never makes a catastrophic decision.
I’ve described an online version of learning with catastrophes. We can also consider the batch version, where the learner is first given a large number of “training” episodes.
In the batch version, there is no penalty for catastrophes at training time, and we don’t care about training error. The two performance parameters are test-time performance and test-time catastrophe probability.
This definition depends on an oracle who determines which transcripts are catastrophic.
For weak AI systems, the oracle may be a human. But a powerful AI system might take actions which are catastrophic but which look inoffensive to a human judge, so this approach doesn’t cut it.
In general, the judge should be a human+AI team which is more competent than the system being trained, armed with an adequate solution to the informed oversight problem.
Learning with catastrophes is straightforward given an unlimited number of queries to the catastrophe oracle. Given any online learning algorithm A, we can “harden” it by running the following process before the beginning of each episode (a similar process can be applied to a batch learner):
It’s easy to prove that this process converges, if A is competent: the number of times we invoke step 2 is at most the time required to learn an optimal catastrophe-free policy (plus the number of episodes).
The big problem with this algorithm is the exponential search in step 1.
The most obvious approach to tractable learning with catastrophes is to replace step 1 with a tractable algorithm that approximates the search. This process is an example of adversarial training.
We can formalize this as a simple game between the agent A and an adversary B. The two agents alternate taking moves, with A playing the role of the agent and B playing the role of the environment. B wins if the resulting transcript is catastrophic, A wins if it is not catastrophic.
Before each episode, we play this game N times, where N > 0 is a hyperparameter. If B is able to win any of the games, then we add the resulting transcript to A’s training data and start again. Once B is not able to win any games, then we move on to a “real” episode. We hope that B is good enough at the game that if it can’t produce a catastrophic transcript, then the real environment can’t produce a catastrophic transcript either.
More precisely, before each episode we perform the following process:
I discuss this idea in more detail in my post on red teams. There are serious problems with this approach and I don’t think it can work on its own, but fortunately it seems combinable with other techniques.
Learning with catastrophes is a very general model of catastrophic failures which avoids being obviously impossible. I think that designing competent algorithms for learning with catastrophes may be an important ingredient in a successful approach to AI alignment.
This was originally posted here on 28th May, 2016.
Tomorrow's AI Alignment sequences post will be in the sequence on Value Learning by Rohin Shah.
The next post in this sequence will be 'Thoughts on Reward Engineering' by Paul Christiano, on Thursday.
I'm not sure how necessary it is to explicitly aim to avoid catastrophic behavior -- it seems that even a low capability corrigible agent would still know enough to avoid catastrophic behavior in practice. Of course, it would be better to have stronger guarantees against catastrophic behavior, so I certainly support research on learning from catastrophes -- but if it turns out to be too hard, or impose too much overhead, it could still be fine to aim for corrigibility alone.
I do want to make a perhaps obvious note: the assumption that "there are some policies such that no matter what nature does, the resulting transcript is never catastrophic" is somewhat strong. In particular, it precludes the following scenario: the environment can do anything computable, and the oracle evaluates behavior only based on outcomes (observations). In this case, for any observation that the oracle would label as catastrophic, there is an environment that regardless of the agent's action outputs that observation. So for this problem to be solvable, we need to either have a limit on what the environment "could do", or an oracle that judges "catastrophe" based on the agent's action in addition to outcomes (which I suspect will cache out to "are the actions in this transcript knowably going to cause something bad to happen"). In the latter case, it sounds like we are trying to train "robust corrigibility" as opposed to "never letting a catastrophe happen". Do you have a sense for which of these two assumptions you would want to make?