Note: This describes an idea of Jessica Taylor’s.

In order to better understand how machine learning systems might avoid catastrophic behavior, we are interested in modeling catastrophe avoidance as an adversarial learning problem.


A major challenge with efficiently learning to avoid catastrophes is that the utility (or loss) function can take extremely negative values. This causes problems for standard online learning models, which often incur regret proportional to the difference between the maximum and minimum achievable utility.

In our last post, we explored how exploration-only bandit algorithms can be adapted to efficiently learn to avoid catastrophic actions. That model was oversimplified in a couple of ways. First, it provided a fixed distribution of examples to the learner. Second, it cleanly divided the learning process into an exploration phase, in which the learner could make many mistakes, and a subsequent deployment phase, in which the learner had to select some near-optimal, non-catastrophic arm. In this post, we relax both of these assumptions. We use a learner that responds to a changing, adversarially selected stream of examples. We continue to allow the learner some exploration steps, but the learner must choose when to use them in order to avoid catastrophic consequences. We don't achieve tight bounds, but we demonstrate that bandit learning systems can at least achieve sublinear regret using a sublinear number of exploration steps in an adversarial context. As in our last post, we assume that the learner's beliefs about the risk of catastrophe contain some grain of truth. Specifically, in this post, the learner has access to overestimates of how frequently it should check for catastrophes, and the sum of these estimates exceeds the ideal by at most a constant factor. We also assume the existence of some arm that carries no risk of catastrophe.

The proofs in this post are somewhat long, and may be substantially changed with future work, so some readers may want to just read "Setup", "Defining the agent's performance", "A proposed learning model" and "Discussion".

Setup

Our model relies on some definitions:

  • Let $\mathcal{A}$ be a set of $k$ bandit arms.
  • Let $r_t$ be the intrinsic reward function selected at time step $t$.
  • Let $c_t$ be the intrinsic catastrophic risk function at time step $t$. We assume that there is some bandit arm $a^{\mathrm{safe}} \in \mathcal{A}$ that always avoids catastrophes: $c_t(a^{\mathrm{safe}}) = 0$ for all $t$.
  • Let $f_t$ be the net payoff function for the agent, where $\epsilon$ is the risk-tolerance, which may be some very small number.
  • Let $p^*_t$ be the ideal sampling probability for time step $t$, where $\lambda$ is a parameter that specifies how much lost utility from catastrophes the learner tolerates.
  • Let $\tilde{p}_t$ be some overestimate of the ideal sampling probability. The learner has access to this overestimate.
  • Let $p_t$ be the sampling probability that the learner ends up using at time step $t$.

On each time step $t$, the learner performs two tasks: i) it selects an arm, and ii) it decides whether the step is an exploration step or a deployment step. Both of these decisions are carried out probabilistically.

To begin each time step $t$, the learner outputs a distribution over arms, $\pi_t$. The learner also outputs, for each arm, a sampling probability to be used if that arm is selected. The adversary selects the intrinsic reward and catastrophic risk functions, $r_t$ and $c_t$, in response to $\pi_t$. Then, the learner's selected arm $a_t$ is drawn from $\pi_t$.

Next, a biased coin is flipped, with probability of heads $p_t$, the sampling probability attached to the selected arm. If it lands heads ($X_t = 1$), then $t$ is an exploration step. This means that the learner observes the intrinsic reward and risk of its selected arm, $r_t(a_t)$ and $c_t(a_t)$, but receives no payoff. If the coin lands tails ($X_t = 0$), then $t$ is a deployment step: the agent receives the payoff $f_t(a_t)$ but does not observe it.
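
To make the order of events concrete, here is a minimal sketch of one round of this protocol. The names `learner_policy` and `adversary` are hypothetical stand-ins, not objects defined in this post.

```python
import numpy as np

# Schematic of one round of the protocol above. `learner_policy` returns the
# learner's arm distribution pi_t and its sampling probability, and `adversary`
# returns the reward and risk vectors chosen in response to that distribution.

def play_round(t, learner_policy, adversary, rng):
    pi, p = learner_policy(t)          # learner outputs a distribution over arms and p_t
    r, c = adversary(t, pi)            # adversary picks r_t, c_t after seeing pi
    arm = rng.choice(len(pi), p=pi)    # the selected arm a_t is drawn from pi
    if rng.random() < p:               # biased coin: heads means an exploration step
        # The learner observes the reward and risk of its arm, but receives no payoff.
        return {"arm": arm, "explore": True, "observation": (r[arm], c[arm])}
    # Tails: a deployment step; the payoff accrues but is not observed by the learner.
    return {"arm": arm, "explore": False, "observation": None}
```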

After $T$ of these time steps have passed, the hope is that our learning algorithm has achieved low regret and has used few exploration steps.

Defining the agent's performance

We define the number of exploration steps and the regret in turn.

The number of exploration steps is defined as $N_{\mathrm{explore}} = \sum_{t=1}^{T} X_t$.

Regret is defined relative to the best arm in hindsight, $a^* \in \arg\max_{a \in \mathcal{A}} \sum_{t=1}^{T} f_t(a)$.

The regret for a series of arms $a_1, \dots, a_T$ is then:

$$R_T \;=\; \sum_{t=1}^{T} f_t(a^*) \;-\; \sum_{t=1}^{T} (1 - X_t)\, f_t(a_t).$$

To summarize, the learner only accumulates its net payoff from deployment steps but it is expected to come close to the maximum payoff that can be achieved using all time steps.
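
Concretely, given the log of a single run, these two quantities can be computed as follows. This is a small illustrative helper under the notation above; the array names `f`, `arms`, and `explore` are hypothetical.

```python
import numpy as np

# Computes the two performance measures from a logged run, under the notation
# above: `f` is a (T, k) array of net payoffs f_t(a), `arms` holds the selected
# arms a_t, and `explore` holds the exploration indicators X_t.

def performance(f, arms, explore):
    T = len(arms)
    n_explore = int(np.sum(explore))              # number of exploration steps
    best_arm = int(np.argmax(f.sum(axis=0)))      # best fixed arm over all T steps
    deployed = ~np.asarray(explore, dtype=bool)   # payoff accrues only on deployment steps
    realized = f[np.arange(T), arms][deployed].sum()
    regret = f[:, best_arm].sum() - realized
    return regret, n_explore
```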

A proposed learning model

In designing a learning system for this kind of problem, a couple of difficulties arise. The first is that the loss function has an extremely large range. Because the risk-tolerance $\epsilon$ can be very small, the net payoff function $f_t$ has a very wide range, and so when bandit algorithms like Exp3 are applied to this learning problem directly, their regret includes an unacceptably large factor proportional to that range. So $f_t$ is replaced with a surrogate loss function $g_t$, whose range is instead capped in terms of the parameter $\lambda$, which indicates how much lost utility from catastrophes the learner tolerates.
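
For intuition, one simple surrogate with the required property is a payoff clipped from below, so that its range is governed by $\lambda$ rather than by the risk-tolerance; this is only an illustrative form, and the definition used in the analysis below may differ in detail:

$$g_t(a) \;=\; \max\{\, f_t(a),\ -\lambda \,\}.$$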

The second difficulty is that if the learner observed the surrogate loss on every time step, it would always output $p_t = 1$, and it would use a linear number of exploration steps. To achieve a sublinear amount of exploration, the learner must instead use sampling probabilities whose sum is sublinear in $T$, and use estimates $\hat{g}_t$ of the surrogate payoffs, constructed from the observations made on exploration steps together with a base sampling rate $p_{\min}$. Note that these estimates depend on the sampling probabilities $p_t$, which in turn depend on the learner's selected arms $a_t$.
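
For reference, the standard importance-weighted construction for estimates of this kind (the construction analyzed below may differ in detail) uses only the observation made on an exploration step and reweights it by the probability of making that observation:

$$\hat{g}_t(a) \;=\; \frac{X_t\, \mathbb{1}[a_t = a]}{p_t\, \pi_t(a)}\; g_t(a_t).$$

Such an estimate is unbiased, $\mathbb{E}[\hat{g}_t(a)] = g_t(a)$, because arm $a$ is observed with probability $p_t\, \pi_t(a)$; its variance grows as the sampling probabilities shrink, which is one reason a base sampling rate $p_{\min}$ is useful.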

The main result, which follows from Theorem 1, is that a bandit learner can achieve near-optimal regret, in the sense that the error and the number of exploration time steps are both sublinear in $T$ and do not depend on the risk-tolerance $\epsilon$.

**Theorem 1.** Consider a learner that selects a sequence of arms $a_1, \dots, a_T$ by running the Exp3 algorithm on the estimated surrogate payoffs $\hat{g}_t$, using sampling probabilities $p_t$ derived from the overestimates $\tilde{p}_t$ and the base sampling rate $p_{\min}$, where the overestimates exceed the ideal sampling probabilities $p^*_t$, and their sum exceeds the sum of the ideal sampling probabilities by at most some fixed constant factor $C$. Then the learner's regret and its number of exploration steps are probabilistically bounded by quantities that are sublinear in $T$, depending on $C$, the base sampling rate $p_{\min}$, the number of bandit arms $k$, and the amount of utility $\lambda$ that the learner is prepared to lose from catastrophes.
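
Here is a minimal sketch of a learner of this general shape, assuming a particular sampling rule (the maximum of the overestimate and the base rate), a particular net-payoff form, and a clipped surrogate; these are illustrative simplifications, not the exact construction analyzed in Theorem 1.

```python
import numpy as np

# Illustrative sketch only: the exact surrogate, sampling rule, payoff form, and
# learning rate used in Theorem 1 may differ from the simplifications made here.

def softmax(x):
    z = x - x.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def run_learner(T, k, adversary, p_tilde, p_min, lam, eps, eta=0.05, seed=0):
    """adversary(t, pi) -> (r, c): reward and risk vectors chosen in response to pi."""
    rng = np.random.default_rng(seed)
    log_w = np.zeros(k)                  # Exp3-style log-weights over the k arms
    n_explore = 0
    for t in range(T):
        pi = softmax(log_w)              # learner's distribution over arms
        p = max(p_tilde[t], p_min)       # sampling probability (assumed rule)
        r, c = adversary(t, pi)          # adversary responds to pi
        arm = rng.choice(k, p=pi)
        if rng.random() < p:             # exploration step: observe, forgo payoff
            n_explore += 1
            f = r[arm] - c[arm] / eps    # net payoff (assumed form: risk scaled by 1/eps)
            g = max(f, -lam)             # capped surrogate payoff (assumed form)
            g_hat = g / (p * pi[arm])    # importance-weighted estimate for the chosen arm
            log_w[arm] += eta * g_hat    # multiplicative-weights update on the estimate
        # deployment step: payoff accrues but is not observed, so no update is made
    return softmax(log_w), n_explore
```

The structural point is that the weights are updated only on exploration steps, using importance-weighted estimates, so the number of updates the learner makes is governed by the overestimates $\tilde{p}_t$ and the base sampling rate $p_{\min}$.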

In outline, our approach is to use the Exp3 adversarial bandit algorithm on the estimates of the surrogate loss function. We begin by showing that the surrogate regret upper-bounds the actual regret (Lemma 2). We then quantify the error from approximating the surrogate regret (Lemmas 3 and 4). To the estimation error, we add the regret of the Exp3 algorithm itself, obtaining Lemma 5. The bound on the number of exploration steps (Theorem 2) is then just a small extension.

Good surrogate performance implies good deployment performance

First, we need to show that $g_t$ is a good surrogate for $f_t$. Specifically, we show that the regret with respect to $f_t$ is bounded by the regret with respect to $g_t$.

Lemma 2

If a learner selects arms and sampling probabilities as described in Theorem 1, then its regret will be no worse than its regret with the surrogate $g_t$.