Note: This describes an idea of Jessica Taylor’s.

In order to better understand how machine learning systems might avoid catastrophic behavior, we are interested in modeling this as an adversarial learning problem.

A major challenge with efficiently learning to avoid catastrophes is that the utility (or loss) function can take extremely negative values. This causes problems for standard online learning models, which often incur regret proportional to the difference between the maximum and minimum achievable utility.

In our last post, we explored how exploration-only bandit algorithms can be adapted to efficiently learn to avoid catastrophic actions. That model was oversimplified in a couple of ways. First, it provided a fixed distribution of examples to the learner. Second, it cleanly divides the learning process into an exploration phase, in which the learner can make many mistakes, and a subsequent deployment phase, in which the learner must select some near-optimal, non-catastrophic arm. In this post, we relax both of these assumptions. We use a learner that responds to a changing, adversarially-selected stream of examples. We do continue to allow the learner some exploration steps but the learner must choose when to use these to avoid catastrophic consequences. We don't achieve tight bounds, but we demonstrate that bandit learning systems can at least achieve sublinear regret using a sublinear number of exploration steps in an adversarial context. Similarly to our last post, we assume that the learner's beliefs about risks of catastrophe contain some grain of truth. Specifically, in this post, the learner has access to some overestimates of how frequently it should check for catastrophes, and sum of these estimates is not too high by more than a constant factor. We also assume the existence of some arm that has no risk of catastrophe.

The proofs in this post are somewhat long, and may be substantially changed with future work, so some readers may want to just read "Setup", "Defining the agent's performance", "A proposed learning model" and "Discussion".


Our model relies on some definitions:

  • Let be a set of bandit arms.
  • Let be the intrinsic reward function selected at time step .
  • Let be the intrinsic catastrophic risk function at time step . We assume that there is some bandit arm that always avoids catastrophes:
  • Let be the net payoff for the agent, where is the risk-tolerance, which may be some very small number of order .
  • Let be the ideal sampling probability, where is a parameter that specifies how much lost utility from catastrophes the learner tolerates.
  • Let be some overestimate of the ideal sampling probability. The learner has access to this overestimate.
  • Let be the sampling probability that the learner selects for some time step.

On each time step, , the learner performs two tasks: i) it selects an arm, and ii) it decides whether the step is an exploration step or a deployment step. Both of these decisions are carried out probabalistically.

To begin each time step , the learner outputs a distribution over arms, . The learner also outputs a sampling probability to be used if each arm is selected. The adversary selects the intrinsic reward and catastrophic risk functions, and in response to . Then, the learner's selected arm is drawn from .

Next, a biased coin is flipped, with probability of heads, . If it lands heads () then is an exploration step. This means that the learner observes the intrinsic reward and risk of its selected arm but receives no payoff. If the coin lands tails, it is a deployment step, the agent receives a payoff but does not observe it.

After of these time steps have passed, the hope is that our learning algorithm has achieved low regret and has used few exploration steps.

Defining the agent's performance

We define the number of exploration steps and the regret in turn.

The number of exploration steps is defined as .

Regret is defined relative to the best arm, :

The regret for a series of arms is then:

To summarize, the learner only accumulates its net payoff from deployment steps but it is expected to come close to the maximum payoff that can be achieved using all time steps.

A proposed learning model

In designing a learning system for this kind of problem, a couple of difficulties arise. The first is that the surrogate loss function has an extremely large range. The net payoff function has range , where and so when bandit algorithms like Exp3 are applied to this learning problem, their regret will include an unacceptable factor of . So is replaced with a surrogate loss function , whose range is capped to :

where indicates how how much lost utility from catastrophes the learner tolerates.

The second difficulty is that if the learner observed on every time step, it would always output , and it would use a linear number of exploration steps. To achieve a sublinear amount of exploration, the learner must instead use sampling probabilities that are sublinear in and use estimates of the surrogate problem, , such that:

where is the base sampling rate. Note that this estimate depends on the sampling probabilities , which in turn depends on the learner's selected arms, .

The main result, which follows from Theorem 1, is that a bandit learner can achieve ()-optimal regret for any where the error and the number of exploration time steps are sublinear, and do not depend on the risk-tolerance, .

**Theorem 1. ** Consider a learner that selects a sequence of arms using the Exp3 algorithm on some payoff functions using sampling probabilities , such that: where the learner uses an overestimate of the ideal sampling probability whose sum exceeds the ideal sampling probabilities by some fixed constant : where . Then, the learner's regret and exploration steps are probabalistically bounded: where where , , is the base sampling rate, is the number of bandit arm, and being amount of utility that the learner is prepared to lose from catastrophes.

In outline, our approach is to use the Exp3 adversarial bandit algorithm on the estimates of the surrogate loss function. We will begin by showing that the surrogate regret upper-bounds the actual regret (Lemma 2). We then quantify the error from approximating the surrogate regret (Lemmas 3, 4). To the estimation error, we add the regret from the Exp3 algorithm to obtain Lemma 5. To prove the bound on exploration steps, Theorem 2 is just a small extension.

Good surrogate performance implies good deployment performance

First, we need to show that is a good surrogate for . Specifically, we show that the regret with is bounded by our regret with .

Lemma 2

If a learner selects arms and sampling probabilities as described in Theorem 1, then, its regret will be no worse than the regret with with the surrogate .

Proof of Lemma 2. First, we can show that . This is because if the catastrophic risk is ever too high, the learner will just take an exploration step.

If the catastrophic risk from some arm is unacceptably high, i.e. , then , and so .

If the catastrophic risk is in the acceptable range, i.e. if , then:

So for all time steps in , , so . It follows that:

We must also show that . This is true because for all , . One can see that .

It therefore follows that:

Bounds on estimation error

The next step is to bound the error that arises from estimating with .

Lemma 3 If a learner selects arms and sampling probabilities as described in Theorem 1, then:

where .

Proof of Lemma 3.

First, we show that is an unbiased estimate of .

Then, since is an unbiased estimate of , we can define the following martingale:

Using the fact that has range , we can bound :

Azuma's inequality yields the result:

where . ◻

We will use the same argument to bound the estimation error when a best arm is repeatedly selected.

Lemma 4. Consider that learner selects arms and sampling probabilities as described in Theorem 1, and let be a best arm. Then the estimate of the payoffs for the best arm can be bounded by:

where .

Proof of Lemma 4.

The proof is similar to Lemma 3. Since is an unbiased estimate of , we have the martingale . As in Lemma 3, this martingale is bounded by . The application of Azuma's inequality yields the result.◻

A bound on surrogate regret

In order to prove Theorem 1, we will need to show that the Exp3 algorithm achieves low regret with respect to .

Its regret guarantee states that an Exp3 learner that selects some arms on some payoff functions that have range will have regret bounded as [1]:

Next, we show that the Exp3 algorithm will perform well on the surrogate problem.

Lemma 5

If a learner selects arms and sampling probabilities as described in Theorem 1, then its performance in on the surrogate problem will be probabalistically bounded: where , and .

Proof of Lemma 5. Using the union bound, we have that: $$ \begin{aligned} & \text{Pr}\left(\sum_{t=1}^T (S_{t}(i^) - S_{t}(i_t)) \geq 2\alpha + \eta\right) \ & = \text{Pr}\left( \sum_{t=1}^T (S_{t}(i^) - \hat{S}_{t}(i^*)

  • \hat{S}{t}(i^*) - \hat{S}{t}(i_t)
  • \hat{S}{t}(i^*) - S{t}(i_t)) \geq 2\alpha + \eta\right) \ & \leq \text{Pr}\left( \sum_{t=1}^T (S_{t}(i^) - \hat{S}_{t}(i^)) \geq \alpha \vee \sum_{t=1}^T(\hat{S}{t}(i^*) - \hat{S}{t}(i_t)) \geq \eta \vee \sum_{t=1}^T (\hat{S}{t}(i_t) - S{t}(i_t)) \geq \alpha\right) \ & \leq \text{Pr}\left( \sum_{t=1}^T (S_{t}(i^) - \hat{S}_{t}(i^)) \geq \alpha\right)
  • \text{Pr}\left(\sum_{t=1}^T(\hat{S}{t}(i^*) - \hat{S}{t}(i_t)) \geq \eta\right) + \text{Pr}\left(\sum_{t=1}^T (\hat{S}{t}(i_t) - S{t}(i_t)) \geq \alpha\right) \ & \leq \exp\left(\frac{-\gamma^2\alpha^2}{2z^2T}\right) + 0 + \exp\left(\frac{-\gamma^2\alpha^2}{2z^2T}\right) \qquad \qquad \text{(Lemmas 3, 4 and Exp3 bound)}\ & \leq 2\exp\left(\frac{-\gamma^2\alpha^2}{2z^2T}\right) \end{aligned} $$ ◻

In order to prove a bound on exploration steps in Theorem 1, we must first prove that the sum of sampling probabilities is not too high.

A bound on sampling probabilities

Lemma 6.

If a learner selects arms and sampling probabilities as described in Theorem 1, and furthermore,

where and , then, the sum of the sampling probabilities used is bounded by:

Proof of Lemma 6.

First note that a best arm must have non-negative total score on the surrogate problem:

If , then:

where , and .

\section{Bounding the overall regret}

Lemma 7.

If a learner selects arms and sampling probabilities as described in Theorem 1, and furthermore,

where , and , then regret will be bounded as:

Proof of Lemma 7

If where , and , then Lemma 6 gives us that:

Using Lemma 2, we know that this bound on regret will give the learner a good net payoff:

Completing the proof of Theorem 1

Proof of Theorem 1 For the learner to succeed, two things must happen. First, its sampling error must not cause too much regret. Second, it must not have too many exploration steps:

From Lemma 5, we can bound the surrogate performance as: . Using Azuma's inequality, we can also bound , in order to complete the result. Since is an unbiased estimate of , we have the Martingale:

where .

By Azuma's inequality, we have , where . So we obtain:

Achieving sublinear regret and sublinear exploration

What remains is to appropriately set the learner's parameters to achieve sublinear exploration and regret. If we set the parameters to , , and the constants to and , we have , and this yields an algorithm that will with probability achieve an error of less than where:

and the number of exploration steps is: That is to say that the regret and number of exploration steps will both be of order with probability , and do not depend on the risk-tolerance, .


The main aim of this post has been to develop a new model where a bandit algorithm can learn to avoid catastrophes. We have seen that a learner can avoid catastrophes by a process that includes switching between exploration and deployment steps. The modelling assumptions were that it can access overestimates of the ideal rate at which to sample for a catastrophe, such that these estimates are not overall too sensitive to nonserious catastrophic risks: , and that there is some arm that carries no catastrophic risk.

These assumptions could have some difficulties in practice. Many AI systems can avoid causing catastrophes be remaining immobile or inert, but others such as self-driving cars and spy drones, do not have this luxury. Obtaining the estimates of catastrophe risk would also be very difficult. In general, a lot of data would be required, which might perhaps have to be labelled using simulations and/or by human operators. There is further work to be done in trying to relax some of these assumptions.

Since this is just a proof of concept, the regret and exploration bounds are both of order , and we have left this bound very loose. Ultimately, the aim would be to achieve efficiency nearer to the order of in keeping with a similar class of games lacking "local observability" as described by Bartok et al [2]. We have some ideas about how to achieve this, and will post our results.


  1. The bound on Exp3 is given in the discussion of Corrollary 4.2 on page 8-9 in Auer, Peter, et al. "Gambling in a rigged casino: The adversarial multi-armed bandit problem." Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on. IEEE, 1995.
  2. See Bartók, Gábor, et al. "Partial monitoring-classification, regret bounds, and algorithms." Mathematics of Operations Research 39.4 (2014): 967-997.
Personal Blog


New Comment