In order to better understand how machine learning systems might avoid catastrophic behavior, we are interested in modeling this as an adversarial learning problem.

A major challenge with efficiently learning to avoid catastrophes is that the utility (or loss) function can take extremely negative values. This causes problems for standard online learning models, which often incur regret proportional to the difference between the maximum and minimum achievable utility.

In our last post, we explored how exploration-only bandit algorithms can be adapted to efficiently learn to avoid catastrophic actions. That model was oversimplified in a couple of ways. First, it provided a fixed distribution of examples to the learner. Second, it cleanly divides the learning process into an exploration phase, in which the learner can make many mistakes, and a subsequent deployment phase, in which the learner must select some near-optimal, non-catastrophic arm. In this post, we relax both of these assumptions. We use a learner that responds to a changing, adversarially-selected stream of examples. We do continue to allow the learner some exploration steps but the learner must choose when to use these to avoid catastrophic consequences. We don't achieve tight bounds, but we demonstrate that bandit learning systems can at least achieve sublinear regret using a sublinear number of exploration steps in an adversarial context. Similarly to our last post, we assume that the learner's beliefs about risks of catastrophe contain some grain of truth. Specifically, in this post, the learner has access to some overestimates of how frequently it should check for catastrophes, and sum of these estimates is not too high by more than a constant factor. We also assume the existence of some arm that has no risk of catastrophe.

The proofs in this post are somewhat long, and may be substantially changed with future work, so some readers may want to just read "Setup", "Defining the agent's performance", "A proposed learning model" and "Discussion".

Setup

Our model relies on some definitions:

Let I={I1,I2,...,IK} be a set of bandit arms.

Let Rt:I↦[0,1] be the intrinsic reward function selected at time step t.

Let Ct:I↦[0,1] be the intrinsic catastrophic risk function at time step t. We assume that there is some bandit arm that always avoids catastrophes:
∃i∈I s.t. ∀tCt(i)=0

Let Mt(i):=Rt(I)−Ct(i)τ be the net payoff for the agent, where τ∈(0,1] is the risk-tolerance, which may be some very small number of order 10−20.

Let at,it:=min(Ct(it)zτ,1) be the ideal sampling probability, where z∈[1,∞) is a parameter that specifies how much lost utility from catastrophes the learner tolerates.

Let ^at,it∈[0,1] be some overestimate of the ideal sampling probability. The learner has access to this overestimate.

Let bt,it∈[0,1] be the sampling probability that the learner selects for some time step.

On each time step, 1,2,...,T, the learner performs two tasks: i) it selects an arm, and ii) it decides whether the step is an exploration step or a deployment step. Both of these decisions are carried out probabalistically.

To begin each time step t, the learner outputs a distribution over arms, D∈ΔI. The learner also outputs a sampling probability bt,it to be used if each arm it is selected. The adversary selects the intrinsic reward and catastrophic risk functions, Rt and Ct in response to D. Then, the learner's selected arm it is drawn from D.

Next, a biased coin is flipped, with probability bt,it of heads, Ft∼Bernoulli(bt,it). If it lands heads (Ft=1) then t is an exploration step. This means that the learner observes the intrinsic reward Rt(it) and risk Ct(it) of its selected arm but receives no payoff. If the coin lands tails, t it is a deployment step, the agent receives a payoff but does not observe it.

After T of these time steps have passed, the hope is that our learning algorithm has achieved low regret and has used few exploration steps.

Defining the agent's performance

We define the number of exploration steps and the regret in turn.

The number of exploration steps is defined as ∑Tt=1Ft.

Regret is defined relative to the best arm, i∗:
i∗∈argmaxi′∈IT∑t=1Mt(i′)

The regret for a series of arms i1,i2,...,iT is then:
Regret(i∗):=T∑t=1Mt(i∗)−T∑t=1E[(1−Ft)⋅Mt(it)]=T∑t=1(Mt(i∗)−(1−bt,it)Mt(it))

To summarize, the learner only accumulates its net payoff from deployment steps but it is expected to come close to the maximum payoff that can be achieved using all time steps.

A proposed learning model

In designing a learning system for this kind of problem, a couple of difficulties arise. The first is that the surrogate loss function has an extremely large range. The net payoff function Mt has range [−1τ,1], where τ≈10−20 and so when bandit algorithms like Exp3 are applied to this learning problem, their regret will include an unacceptable factor of 1/τ. So Mt is replaced with a surrogate loss function St, whose range is capped to [−z,1]:

St(i):=Rt(i)−min(Ct(i)τ,z)

where z∈[1,∞) indicates how how much lost utility from catastrophes the learner tolerates.

The second difficulty is that if the learner observed St on every time step, it would always output bt,it=1, and it would use a linear number of exploration steps. To achieve a sublinear amount of exploration, the learner must instead use sampling probabilities bt,it that are sublinear in T and use estimates of the surrogate problem, ^S1(i),...,^ST(i), such that:

^St(i):=FtRt(i)−min(1τCt(i),z)bt,it
where γ∈(0,1] is the base sampling rate. Note that this estimate depends on the sampling probabilities bt,it, which in turn depends on the learner's selected arms, it.

The main result, which follows from Theorem 1, is that a bandit learner can achieve (ϵ,δ)-optimal regret for any δ∈(0,1) where the error ϵ and the number of exploration time steps are sublinear, and do not depend on the risk-tolerance, τ.

**Theorem 1. **
Consider a learner that selects a sequence of arms i1,i2,...,iT using the Exp3 algorithm on some payoff functions ^S1,...,^ST using sampling probabilities b1,i1,b2,i2,...,bT,iT, such that:
bt,it:=min(max(^at,it,γ),1)
where the learner uses an overestimate of the ideal sampling probability ^at,it≥at,it
whose sum exceeds the ideal sampling probabilities by some fixed constant T∈[1,∞):
T∑t=1^at,it≤HT∑t=1at,it
where H≥1. Then, the learner's regret and exploration steps are probabalistically bounded:
Pr(T∑t=1Ft≥ζ+Hz(2α+η+T)∨T∑t=1(Mt(i∗)−(1−bt,it)Mt(it))≥2α+η+2α+η+Tz)≤2exp(−γ2α22z2T)+exp(−ζ28T)
where η=2(z+1)γ√(e−1)Kln(K)T
where α>0, ζ>0, γ∈(0,1] is the base sampling rate, K is the number of bandit arm, and z∈[1,∞] being amount of utility that the learner is prepared to lose from catastrophes.

In outline, our approach is to use the Exp3 adversarial bandit algorithm on the estimates ^S1,...,^ST of the surrogate loss function. We will begin by showing that the surrogate regret upper-bounds the actual regret (Lemma 2). We then quantify the error from approximating the surrogate regret (Lemmas 3, 4). To the estimation error, we add the regret from the Exp3 algorithm to obtain Lemma 5. To prove the bound on exploration steps, Theorem 2 is just a small extension.

Good surrogate performance implies good deployment performance

First, we need to show that St is a good surrogate for Mt. Specifically, we show that the regret with Mt is bounded by our regret with St.

Lemma 2

If a learner selects arms i1,...iT and sampling probabilities b1,i1,...,bT,iT as described in Theorem 1, then, its regret will be no worse than the regret with with the surrogate St.
Regret(i∗)=T∑t=1(Mt(it)−(1−bt,it)Mt(i∗)≤T∑t=1(St(i∗)−(1−bt,it)St(it)

Proof of Lemma 2.
First, we can show that (1−bt,it)Mt(it)=(1−bt,it)St(it). This is because if the catastrophic risk is ever too high, the learner will just take an exploration step.

If the catastrophic risk from some arm it is unacceptably high, i.e. Ct(it)τ>z, then bt,it=^at,it=at,it=1, and so (1−bt,it)=0.

If the catastrophic risk is in the acceptable range, i.e. if Ct(it)τ≤z, then:

So for all time steps in {1,2,...,T}, (1−bt,it)(Mt(it)−St(it))=0, so ∑Tt=1(1−bt,it)(Mt(it)−St(it))=0. It follows that:
T∑t=1(1−bt,it)Mt(it)=T∑t=1(1−bt,it)St(it)

We must also show that ∑Tt=1Mt(i∗)≤∑Tt=1St(i∗). This is true because for all i, St(i)≥Mt(i). One can see that Mt(i)−St(i)=min(Ct,(i)τ,z)−Ct,(i∗)τ≤0.

It therefore follows that:
Regret(i∗)=T∑t=1(Mt(i∗)−(1−bt,it)Mt(it)≤T∑t=1(St(i∗)−(1−bt,it)St(it))

◻

Bounds on estimation error

The next step is to bound the error that arises from estimating St with ^St.

Lemma 3
If a learner selects arms i1,...iT and sampling probabilities b1,i1,...,bT,iT as described in Theorem 1, then:

Pr(t∑t′=1(^St′(it)−St′(it))>α)≤exp(−γ2α22z2t)
where α>0.

Proof of Lemma 3.

First, we show that ^St(i) is an unbiased estimate of St(i).
E[^St(i)]=E⎡⎢
⎢⎣FtRt(it)−min(1τCt(it),z)bt,it⎤⎥
⎥⎦=bt,itRt(i)−min(1τCt(it),z)bt,it=St(i)

Then, since ^St(i) is an unbiased estimate of St(i), we can define the following martingale:

Xt:=t∑t′=1(^St′(it′)−St′(it′))

Using the fact that St(it) has range [−z,1], we can bound |Xt−Xt−1|:
Xt−Xt−1=^St(it)−St(it)=(1bt,itFt−1)St(it)∴|Xt−Xt−1|≤max(|1bt,it−1|,|−1|)⋅|−z|≤max(1min(max(^at,it,γ),1)−1,1)⋅z≤max(1γ−1,1)⋅z≤zγ(γ∈(0,1])

Azuma's inequality yields the result:

Pr(t∑t′=1(^St′(it)−St′(it))>α)≤exp(−γ2α22z2t)
where α>0.
◻

We will use the same argument to bound the estimation error when a best arm i∗ is repeatedly selected.

Lemma 4.
Consider that learner selects arms i1,...iT and sampling probabilities b1,i1,...,bT,iT as described in Theorem 1, and let i∗∈argmaxi′∈I∑Tt=1Mt(i′) be a best arm.
Then the estimate of the payoffs for the best arm can be bounded by:

Pr(t∑t′=1(St′(i∗)−^St′(i∗)>α)≤exp(−γ2α22z2t)
where α>0.

Proof of Lemma 4.

The proof is similar to Lemma 3. Since ^St′(i∗) is an unbiased estimate of St′(i∗), we have the martingale Yt=∑tt′=1(St′(i∗)−^St′(i∗)). As in Lemma 3, this martingale is bounded by |Yt−Yt−1|≤zγ. The application of Azuma's inequality yields the result.◻

A bound on surrogate regret

In order to prove Theorem 1, we will need to show that the Exp3 algorithm achieves low regret with respect to ^St.

Its regret guarantee states that an Exp3 learner that selects some arms i1,i2,...,iT on some payoff functions ^S1,...,^ST that have range [−zγ,1γ] will have regret bounded as [1]:

Next, we show that the Exp3 algorithm will perform well on the surrogate problem.

Lemma 5

If a learner selects arms i1,...iT and sampling probabilities b1,i1,...,bT,iT as described in Theorem 1, then its performance in on the surrogate problem will be probabalistically bounded:
Pr(T∑t=1(St(i∗)−St(it))≥2α+η)≤2exp(−γ2α22z2T)
where η=2(z+1)γ√(e−1)Kln(K)T, and α>0.

Proof of Lemma 5.
Using the union bound, we have that:
$$
\begin{aligned}
& \text{Pr}\left(\sum_{t=1}^T (S_{t}(i^) - S_{t}(i_t)) \geq 2\alpha + \eta\right) \
& = \text{Pr}\left(
\sum_{t=1}^T (S_{t}(i^) - \hat{S}_{t}(i^*)

If ∑Tt=1(St(i∗)−St(it))≤2α+η, then:
T∑t=1(−St(it))≤2α+ηT∑t=1(−Rt(it)+zat,it)≤−T+T∑t=1zat,it≤∴T∑t=1at,it≤1z(2α+η+T)T∑t=1^at,itH≤∴T∑t=1bt,it≤Hz(2α+η+T)

where η=2(z+1)γ√(e−1)Kln(K)T, and α>0.

◻

\section{Bounding the overall regret}

Lemma 7.

If a learner selects arms i1,...iT and sampling probabilities b1,i1,...,bT,iT as described in Theorem 1, and furthermore,
T∑t=1(St(i∗)−St(it))≤2α+η)

where η=2(z+1)γ√(e−1)Kln(K)T, and α>0, then regret will be bounded as:

T∑t=1(Mt(i∗)−(1−bt,it)Mt(it))≤2α+η+Hz(2α+η+T)

Proof of Lemma 7

If ∑Tt=1(St(i∗)−St(it))≤2α+η where η=2(z+1)γ√(e−1)Kln(K)T, and α>0, then Lemma 6 gives us that:
T∑t=1bt,it≤Hz(2α+η+T)∴T∑t=1bt,itStit≤Hz(2α+η+T)∴T∑t=1(St(i∗)−St(it))+T∑t=1bt,itStit≤2α+η+Hz(2α+η+T)T∑t=1(St(i∗)−(1−bt,it)St(it))≤

Using Lemma 2, we know that this bound on regret will give the learner a good net payoff:
T∑t=1(Mt(i∗)−(1−bt,it)Mt(it))≤2α+η+Hz(2α+η+T)

◻

Completing the proof of Theorem 1

Proof of Theorem 1
For the learner to succeed, two things must happen. First, its sampling error must not cause too much regret. Second, it must not have too many exploration steps:

Pr(T∑t=1Ft≥ζ+Hz(2α+η+T)∨T∑t=1(Mt(i∗)−(1−bt,it)Mt(it)≥2α+η+2α+η+Tz)≤Pr(T∑t=1(Ft−bt,it)≥ζ∨T∑t=1bt,it≥Hz(2α+η+T)∨T∑t=1(St(i∗)−St(it)≥2α+η)≤Pr(T∑t=1(Ft−bt,it)≥ζ∨T∑t=1(St(i∗)−St(it)≥2α+η)(Lemma 6)≤Pr(T∑t=1(Ft−bt,it)≥ζ)+Pr(T∑t=1(St(i∗)−St(it)≥2α+η)(Union bound)
From Lemma 5, we can bound the surrogate performance as: Pr(∑Tt=1(St(i∗)−St(it)≥2α+η)≤2exp(−γ2α22z2T). Using Azuma's inequality, we can also bound Pr(∑Tt=1(Ft−bt,it)≥ζ), in order to complete the result. Since Ft is an unbiased estimate of bt,it, we have the Martingale:

Zt=t∑t′=1(Ft′−bt′)
where |Zt−Zt−1|<2.

By Azuma's inequality, we have Pr(∑Tt=1(Ft−bt,it)≥ζ)≤exp(−ζ28T), where ζ>0. So we obtain:

Achieving sublinear regret and sublinear exploration

What remains is to appropriately set the learner's parameters to achieve sublinear exploration and regret. If we set the parameters to γ=T−16, z=T16, and the constants to α=√−2log(δ/3)T56 and ζ=2√−2log(δ/3)T, we have η=2ν(T16+1)T23, and this yields an algorithm that will with (1−δ) probability achieve an error of less than ϵ where:

and the number of exploration steps is:
T∑t=1Ft≤2√−2log(δ/3)T+HT16(2√−2log(δ/3)T56+2ν(T16+1)T23+T)=O(√Kln(K)T56)
That is to say that the regret and number of exploration steps will both be of order O(T56) with probability 1−δ, and do not depend on the risk-tolerance, τ.

Discussion

The main aim of this post has been to develop a new model where a bandit algorithm can learn to avoid catastrophes. We have seen that a learner can avoid catastrophes by a process that includes switching between exploration and deployment steps. The modelling assumptions were that it can access overestimates ^at,it of the ideal rate at which to sample for a catastrophe, such that these estimates are not overall too sensitive to nonserious catastrophic risks: ∑Tt=1^at,it≤H∑Tt=1at,it, and that there is some arm that carries no catastrophic risk.

These assumptions could have some difficulties in practice. Many AI systems can avoid causing catastrophes be remaining immobile or inert, but others such as self-driving cars and spy drones, do not have this luxury. Obtaining the estimates of catastrophe risk would also be very difficult. In general, a lot of data would be required, which might perhaps have to be labelled using simulations and/or by human operators. There is further work to be done in trying to relax some of these assumptions.

Since this is just a proof of concept, the regret and exploration bounds are both of order T56, and we have left this bound very loose. Ultimately, the aim would be to achieve efficiency nearer to the order of T23 in keeping with a similar class of games lacking "local observability" as described by Bartok et al [2]. We have some ideas about how to achieve this, and will post our results.

References

The bound on Exp3 is given in the discussion of Corrollary 4.2 on page 8-9 in Auer, Peter, et al. "Gambling in a rigged casino: The adversarial multi-armed bandit problem." Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on. IEEE, 1995.

See Bartók, Gábor, et al. "Partial monitoring-classification, regret bounds, and algorithms." Mathematics of Operations Research 39.4 (2014): 967-997.

Note: This describes an idea of Jessica Taylor’s.In order to better understand how machine learning systems might avoid catastrophic behavior, we are interested in modeling this as an adversarial learning problem.

A major challenge with efficiently learning to avoid catastrophes is that the utility (or loss) function can take extremely negative values. This causes problems for standard online learning models, which often incur regret proportional to the difference between the maximum and minimum achievable utility.

In our last post, we explored how exploration-only bandit algorithms can be adapted to efficiently learn to avoid catastrophic actions. That model was oversimplified in a couple of ways. First, it provided a fixed distribution of examples to the learner. Second, it cleanly divides the learning process into an exploration phase, in which the learner can make many mistakes, and a subsequent deployment phase, in which the learner must select some near-optimal, non-catastrophic arm. In this post, we relax both of these assumptions. We use a learner that responds to a changing, adversarially-selected stream of examples. We do continue to allow the learner some exploration steps but the learner must choose when to use these to avoid catastrophic consequences. We don't achieve tight bounds, but we demonstrate that bandit learning systems can at least achieve sublinear regret using a sublinear number of exploration steps in an adversarial context. Similarly to our last post, we assume that the learner's beliefs about risks of catastrophe contain some grain of truth. Specifically, in this post, the learner has access to some overestimates of how frequently it should check for catastrophes, and sum of these estimates is not too high by more than a constant factor. We also assume the existence of some arm that has no risk of catastrophe.

The proofs in this post are somewhat long, and may be substantially changed with future work, so some readers may want to just read "Setup", "Defining the agent's performance", "A proposed learning model" and "Discussion".

## Setup

Our model relies on some definitions:

ideal sampling probability, where z∈[1,∞) is a parameter that specifies how much lost utility from catastrophes the learner tolerates.overestimate of the ideal sampling probability. The learner has access to this overestimate.sampling probabilitythat the learner selects for some time step.On each time step, 1,2,...,T, the learner performs two tasks: i) it selects an arm, and ii) it decides whether the step is an exploration step or a deployment step. Both of these decisions are carried out probabalistically.

To begin each time step t, the learner outputs a distribution over arms, D∈ΔI. The learner also outputs a sampling probability bt,it to be used if each arm it is selected. The adversary selects the intrinsic reward and catastrophic risk functions, Rt and Ct in response to D. Then, the learner's selected arm it is drawn from D.

Next, a biased coin is flipped, with probability bt,it of heads, Ft∼Bernoulli(bt,it). If it lands heads (Ft=1) then t is an

exploration step. This means that the learner observes the intrinsic reward Rt(it) and risk Ct(it) of its selected arm but receives no payoff. If the coin lands tails, t it is a deployment step, the agent receives a payoff but does not observe it.After T of these time steps have passed, the hope is that our learning algorithm has achieved low regret and has used few exploration steps.

## Defining the agent's performance

We define the number of exploration steps and the

regretin turn.The number of exploration steps is defined as ∑Tt=1Ft.

Regret is defined relative to the best arm, i∗: i∗∈argmaxi′∈IT∑t=1Mt(i′)

The regret for a series of arms i1,i2,...,iT is then: Regret(i∗):=T∑t=1Mt(i∗)−T∑t=1E[(1−Ft)⋅Mt(it)]=T∑t=1(Mt(i∗)−(1−bt,it)Mt(it))

To summarize, the learner only accumulates its net payoff from deployment steps but it is expected to come close to the maximum payoff that can be achieved using

alltime steps.## A proposed learning model

In designing a learning system for this kind of problem, a couple of difficulties arise. The first is that the surrogate loss function has an extremely large range. The net payoff function Mt has range [−1τ,1], where τ≈10−20 and so when bandit algorithms like Exp3 are applied to this learning problem, their regret will include an unacceptable factor of 1/τ. So Mt is replaced with a surrogate loss function St, whose range is capped to [−z,1]:

St(i):=Rt(i)−min(Ct(i)τ,z)

where z∈[1,∞) indicates how how much lost utility from catastrophes the learner tolerates.

The second difficulty is that if the learner observed St on every time step, it would always output bt,it=1, and it would use a linear number of exploration steps. To achieve a sublinear amount of exploration, the learner must instead use sampling probabilities bt,it that are sublinear in T and use estimates of the surrogate problem, ^S1(i),...,^ST(i), such that:

^St(i):=FtRt(i)−min(1τCt(i),z)bt,it where γ∈(0,1] is the

base sampling rate. Note that this estimate depends on the sampling probabilities bt,it, which in turn depends on the learner's selected arms, it.The main result, which follows from Theorem 1, is that a bandit learner can achieve (ϵ,δ)-optimal regret for any δ∈(0,1) where the error ϵ and the number of exploration time steps are sublinear, and do not depend on the risk-tolerance, τ.

In outline, our approach is to use the Exp3 adversarial bandit algorithm on the estimates ^S1,...,^ST of the surrogate loss function. We will begin by showing that the surrogate regret upper-bounds the actual regret (Lemma 2). We then quantify the error from approximating the surrogate regret (Lemmas 3, 4). To the estimation error, we add the regret from the Exp3 algorithm to obtain Lemma 5. To prove the bound on exploration steps, Theorem 2 is just a small extension.

## Good surrogate performance implies good deployment performance

First, we need to show that St is a good surrogate for Mt. Specifically, we show that the regret with Mt is bounded by our regret with St.

Proof of Lemma 2.First, we can show that (1−bt,it)Mt(it)=(1−bt,it)St(it). This is because if the catastrophic risk is ever too high, the learner will just take an exploration step.If the catastrophic risk from some arm it is unacceptably high, i.e. Ct(it)τ>z, then bt,it=^at,it=at,it=1, and so (1−bt,it)=0.

If the catastrophic risk is in the acceptable range, i.e. if Ct(it)τ≤z, then:

Mt(it)−St(it)=(Rt(it)−Ct(it)τ)−(Rt(it)−min(Ct(it)τ,z))=0

So for all time steps in {1,2,...,T}, (1−bt,it)(Mt(it)−St(it))=0, so ∑Tt=1(1−bt,it)(Mt(it)−St(it))=0. It follows that: T∑t=1(1−bt,it)Mt(it)=T∑t=1(1−bt,it)St(it)

We must also show that ∑Tt=1Mt(i∗)≤∑Tt=1St(i∗). This is true because for all i, St(i)≥Mt(i). One can see that Mt(i)−St(i)=min(Ct,(i)τ,z)−Ct,(i∗)τ≤0.

It therefore follows that: Regret(i∗)=T∑t=1(Mt(i∗)−(1−bt,it)Mt(it)≤T∑t=1(St(i∗)−(1−bt,it)St(it))

◻

## Bounds on estimation error

The next step is to bound the error that arises from estimating St with ^St.

Proof of Lemma 3.First, we show that ^St(i) is an unbiased estimate of St(i). E[^St(i)]=E⎡⎢ ⎢⎣FtRt(it)−min(1τCt(it),z)bt,it⎤⎥ ⎥⎦=bt,itRt(i)−min(1τCt(it),z)bt,it=St(i)

Then, since ^St(i) is an unbiased estimate of St(i), we can define the following martingale:

Xt:=t∑t′=1(^St′(it′)−St′(it′))

Using the fact that St(it) has range [−z,1], we can bound |Xt−Xt−1|: Xt−Xt−1=^St(it)−St(it)=(1bt,itFt−1)St(it)∴|Xt−Xt−1|≤max(|1bt,it−1|,|−1|)⋅|−z|≤max(1min(max(^at,it,γ),1)−1,1)⋅z≤max(1γ−1,1)⋅z≤zγ(γ∈(0,1])

Azuma's inequality yields the result:

Pr(t∑t′=1(^St′(it)−St′(it))>α)≤exp(−γ2α22z2t) where α>0. ◻

We will use the same argument to bound the estimation error when a best arm i∗ is repeatedly selected.

Proof of Lemma 4.The proof is similar to Lemma 3. Since ^St′(i∗) is an unbiased estimate of St′(i∗), we have the martingale Yt=∑tt′=1(St′(i∗)−^St′(i∗)). As in Lemma 3, this martingale is bounded by |Yt−Yt−1|≤zγ. The application of Azuma's inequality yields the result.◻

## A bound on surrogate regret

In order to prove Theorem 1, we will need to show that the Exp3 algorithm achieves low regret with respect to ^St.

Its regret guarantee states that an Exp3 learner that selects some arms i1,i2,...,iT on some payoff functions ^S1,...,^ST that have range [−zγ,1γ] will have regret bounded as [1]:

supi′∈IT∑t=1(^St(i′)−^St(it))≤2(1γ−−zγ)√e−1√TKlnK≤2(z+1)γ√e−1√TKlnK

Next, we show that the Exp3 algorithm will perform well on the surrogate problem.

Lemma 5Proof of Lemma 5.Using the union bound, we have that: $$ \begin{aligned} & \text{Pr}\left(\sum_{t=1}^T (S_{t}(i^) - S_{t}(i_t)) \geq 2\alpha + \eta\right) \ & = \text{Pr}\left( \sum_{t=1}^T (S_{t}(i^) - \hat{S}_{t}(i^*){t}(i^*) - \hat{S}{t}(i_t){t}(i^*) - S{t}(i_t)) \geq 2\alpha + \eta\right) \ & \leq \text{Pr}\left( \sum_{t=1}^T (S_{t}(i^) - \hat{S}_{t}(i^)) \geq \alpha \vee \sum_{t=1}^T(\hat{S}{t}(i^*) - \hat{S}{t}(i_t)) \geq \eta \vee \sum_{t=1}^T (\hat{S}{t}(i_t) - S{t}(i_t)) \geq \alpha\right) \ & \leq \text{Pr}\left( \sum_{t=1}^T (S_{t}(i^) - \hat{S}_{t}(i^)) \geq \alpha\right){t}(i^*) - \hat{S}{t}(i_t)) \geq \eta\right) + \text{Pr}\left(\sum_{t=1}^T (\hat{S}{t}(i_t) - S{t}(i_t)) \geq \alpha\right) \ & \leq \exp\left(\frac{-\gamma^2\alpha^2}{2z^2T}\right) + 0 + \exp\left(\frac{-\gamma^2\alpha^2}{2z^2T}\right) \qquad \qquad \text{(Lemmas 3, 4 and Exp3 bound)}\ & \leq 2\exp\left(\frac{-\gamma^2\alpha^2}{2z^2T}\right) \end{aligned} $$ ◻In order to prove a bound on exploration steps in Theorem 1, we must first prove that the sum of sampling probabilities is not too high.

## A bound on sampling probabilities

Lemma 6.Proof of Lemma 6.First note that a best arm i∗ must have non-negative total score on the surrogate problem:

∃i′∈I s.t. ∀tCt(i′)=0∃i′∈I s.t. ∀tMt(i′)≥0(Rt(i′)≥0)∴T∑t=1M(i∗)≥0∴T∑t=1S(i∗)≥0(St(i)≥Mt(i))

If ∑Tt=1(St(i∗)−St(it))≤2α+η, then: T∑t=1(−St(it))≤2α+ηT∑t=1(−Rt(it)+zat,it)≤−T+T∑t=1zat,it≤∴T∑t=1at,it≤1z(2α+η+T)T∑t=1^at,itH≤∴T∑t=1bt,it≤Hz(2α+η+T)

where η=2(z+1)γ√(e−1)Kln(K)T, and α>0.

◻

\section{Bounding the overall regret}

Lemma 7.Proof of Lemma 7If ∑Tt=1(St(i∗)−St(it))≤2α+η where η=2(z+1)γ√(e−1)Kln(K)T, and α>0, then Lemma 6 gives us that: T∑t=1bt,it≤Hz(2α+η+T)∴T∑t=1bt,itStit≤Hz(2α+η+T)∴T∑t=1(St(i∗)−St(it))+T∑t=1bt,itStit≤2α+η+Hz(2α+η+T)T∑t=1(St(i∗)−(1−bt,it)St(it))≤

Using Lemma 2, we know that this bound on regret will give the learner a good net payoff: T∑t=1(Mt(i∗)−(1−bt,it)Mt(it))≤2α+η+Hz(2α+η+T)

◻

## Completing the proof of Theorem 1

Proof of Theorem 1For the learner to succeed, two things must happen. First, its sampling error must not cause too much regret. Second, it must not have too many exploration steps:Pr(T∑t=1Ft≥ζ+Hz(2α+η+T)∨T∑t=1(Mt(i∗)−(1−bt,it)Mt(it)≥2α+η+2α+η+Tz)≤Pr(T∑t=1(Ft−bt,it)≥ζ∨T∑t=1bt,it≥Hz(2α+η+T)∨T∑t=1(St(i∗)−St(it)≥2α+η)≤Pr(T∑t=1(Ft−bt,it)≥ζ∨T∑t=1(St(i∗)−St(it)≥2α+η)(Lemma 6)≤Pr(T∑t=1(Ft−bt,it)≥ζ)+Pr(T∑t=1(St(i∗)−St(it)≥2α+η)(Union bound) From Lemma 5, we can bound the surrogate performance as: Pr(∑Tt=1(St(i∗)−St(it)≥2α+η)≤2exp(−γ2α22z2T). Using Azuma's inequality, we can also bound Pr(∑Tt=1(Ft−bt,it)≥ζ), in order to complete the result. Since Ft is an unbiased estimate of bt,it, we have the Martingale:

Zt=t∑t′=1(Ft′−bt′) where |Zt−Zt−1|<2.

By Azuma's inequality, we have Pr(∑Tt=1(Ft−bt,it)≥ζ)≤exp(−ζ28T), where ζ>0. So we obtain:

Pr(T∑t=1Ft≥ζ+Hz(2α+η+T)∨T∑t=1(Mt(i∗)−(1−bt,it)Mt(it))≥2α+η+2α+η+Tz)≤2exp(−γ2α22z2T)+exp(−ζ28T)

◻

## Achieving sublinear regret and sublinear exploration

What remains is to appropriately set the learner's parameters to achieve sublinear exploration and regret. If we set the parameters to γ=T−16, z=T16, and the constants to α=√−2log(δ/3)T56 and ζ=2√−2log(δ/3)T, we have η=2ν(T16+1)T23, and this yields an algorithm that will with (1−δ) probability achieve an error of less than ϵ where:

ϵ=2√−2log(δ/3)T56+2ν(T16+1)T23+1T16(2√−2log(δ/3)T56+2ν(T16+1)T23+T)=(2√−2log(δ/3)+2ν+1)T56+O(T23)=O((√log(δ)+√Kln(K))T56)

and the number of exploration steps is: T∑t=1Ft≤2√−2log(δ/3)T+HT16(2√−2log(δ/3)T56+2ν(T16+1)T23+T)=O(√Kln(K)T56) That is to say that the regret and number of exploration steps will both be of order O(T56) with probability 1−δ, and do not depend on the risk-tolerance, τ.

## Discussion

The main aim of this post has been to develop a new model where a bandit algorithm can learn to avoid catastrophes. We have seen that a learner can avoid catastrophes by a process that includes switching between exploration and deployment steps. The modelling assumptions were that it can access overestimates ^at,it of the ideal rate at which to sample for a catastrophe, such that these estimates are not overall too sensitive to nonserious catastrophic risks: ∑Tt=1^at,it≤H∑Tt=1at,it, and that there is some arm that carries no catastrophic risk.

These assumptions could have some difficulties in practice. Many AI systems can avoid causing catastrophes be remaining immobile or inert, but others such as self-driving cars and spy drones, do not have this luxury. Obtaining the estimates of catastrophe risk would also be very difficult. In general, a lot of data would be required, which might perhaps have to be labelled using simulations and/or by human operators. There is further work to be done in trying to relax some of these assumptions.

Since this is just a proof of concept, the regret and exploration bounds are both of order T56, and we have left this bound very loose. Ultimately, the aim would be to achieve efficiency nearer to the order of T23 in keeping with a similar class of games lacking "local observability" as described by Bartok et al [2]. We have some ideas about how to achieve this, and will post our results.

## References