*Note: This describes an idea of Jessica Taylor's*.

The usual training procedures for machine learning models are not always well-equipped to avoid rare catastrophes. In order to maintain the safety of powerful AI systems, it will be important to have training procedures that can efficiently learn from such events. [1]

We can model this situation with the problem of *exploration-only* online bandit learning. In this scenario, we grant the AI system an exploration phase, in which it is allowed to select catastrophic arms and view their consequences. (We can imagine that such catastrophic selections are acceptable because they are simulated or merely evaluated by human overseers.) Then, the AI system is switched into the deployment phase, in which it must select an arm that almost always avoids catastrophes.
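As a rough illustration of the two phases, here is a minimal Python sketch. It is not the procedure this post goes on to develop. It assumes a naive round-robin exploration strategy, and the names `explore_then_deploy`, `catastrophe_probs`, and `n_rounds` are hypothetical. Each arm is reduced to a single fixed catastrophe probability, which is a simplifying assumption.

```python
import random


def explore_then_deploy(catastrophe_probs, n_rounds, rng):
    """Explore arms round-robin, then return the index of the arm to deploy.

    catastrophe_probs[i] is the (hidden) chance that arm i causes a
    catastrophe on a given pull. This is a toy assumption for the sketch.
    """
    n_arms = len(catastrophe_probs)
    pulls = [0] * n_arms
    catastrophes = [0] * n_arms

    # Exploration phase: catastrophic pulls are allowed and observed.
    for t in range(n_rounds):
        arm = t % n_arms
        pulls[arm] += 1
        if rng.random() < catastrophe_probs[arm]:
            catastrophes[arm] += 1

    # Deployment phase: commit to the arm with the lowest observed rate.
    # Arms that were never pulled are treated as maximally risky.
    rates = [c / n if n else 1.0 for c, n in zip(catastrophes, pulls)]
    return min(range(n_arms), key=lambda i: rates[i])


if __name__ == "__main__":
    chosen = explore_then_deploy([0.5, 0.0, 0.3], 300, random.Random(0))
    print("deployed arm:", chosen)
```

This naive strategy only separates arms whose catastrophes are frequent enough to show up during exploration. An arm whose catastrophe probability is far below `1 / n_rounds` will usually look perfectly safe, which is the difficulty that motivates a more efficient learning procedure.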

## Setup

In outline, the learner will receive a series of randomly selected examples, and will select an expert (modeled as a bandit arm) at each time step. The challenge is to find a high-performing expert in as few time steps as possible. We give some definitions:

- Let be some finite set of possible inputs.
- Let be the set of available experts (i.e. bandit arms).
- Let be the reward function. is the reward for following expert on example .
- Let be the catastrophic risk function. is the catastrophic risk incurred by following expert on example .
- Let be the mixed payoff that the learner is to optimize, where is the *risk-tolerance*. can be very small, on the order of .
- Let be the input distribution from which examples are drawn in the deployment phase.
- Let