To better understand how machine learning systems might avoid catastrophic behavior, we are interested in modeling catastrophe-avoidance as an adversarial learning problem.
A major challenge with efficiently learning to avoid catastrophes is that the utility (or loss) function can take extremely negative values. This causes problems for standard online learning models, which often incur regret proportional to the difference between the maximum and minimum achievable utility.
In our last post, we explored how exploration-only bandit algorithms can be adapted to efficiently learn to avoid catastrophic actions. That model was oversimplified in a couple of ways. First, it provided a fixed distribution of examples to the learner. Second, it cleanly divided the learning process into an exploration phase, in which the learner can make many mistakes, and a subsequent deployment phase, in which the learner must select some near-optimal, non-catastrophic arm. In this post, we relax both of these assumptions. We use a learner that responds to a changing, adversarially-selected stream of examples. We continue to allow the learner some exploration steps, but the learner must choose when to use these to avoid catastrophic consequences. We don't achieve tight bounds, but we demonstrate that bandit learning systems can at least achieve sublinear regret using a sublinear number of exploration steps in an adversarial context. As in our last post, we assume that the learner's beliefs about risks of catastrophe contain some grain of truth. Specifically, in this post, the learner has access to overestimates of how frequently it should check for catastrophes, and the sum of these overestimates exceeds the ideal by at most a constant factor. We also assume the existence of some arm that has no risk of catastrophe.
The proofs in this post are somewhat long, and may be substantially changed with future work, so some readers may want to just read "Setup", "Defining the agent's performance", "A proposed learning model" and "Discussion".
Setup
Our model relies on some definitions:
Let $\mathcal{I}=\{I_1,I_2,\dots,I_K\}$ be a set of bandit arms.
Let $R_t:\mathcal{I}\to[0,1]$ be the intrinsic reward function selected at time step $t$.
Let $C_t:\mathcal{I}\to[0,1]$ be the intrinsic catastrophic risk function at time step $t$. We assume that there is some bandit arm that always avoids catastrophes:
$$\exists i\in\mathcal{I}\ \text{s.t.}\ \forall t,\; C_t(i)=0$$
Let $M_t(i):=R_t(i)-\frac{C_t(i)}{\tau}$ be the net payoff for the agent, where $\tau\in(0,1]$ is the risk-tolerance, which may be some very small number, of order $10^{-20}$.
Let $a_{t,i_t}:=\min\left(\frac{C_t(i_t)}{z\tau},1\right)$ be the ideal sampling probability, where $z\in[1,\infty)$ is a parameter that specifies how much lost utility from catastrophes the learner tolerates.
Let $\hat{a}_{t,i_t}\in[0,1]$ be some overestimate of the ideal sampling probability. The learner has access to this overestimate.
Let $b_{t,i_t}\in[0,1]$ be the sampling probability that the learner selects for some time step.
On each time step, $1,2,\dots,T$, the learner performs two tasks: i) it selects an arm, and ii) it decides whether the step is an exploration step or a deployment step. Both of these decisions are carried out probabilistically.
To begin each time step $t$, the learner outputs a distribution over arms, $D\in\Delta\mathcal{I}$. The learner also outputs, for each arm, a sampling probability $b_{t,i}$ to be used if that arm is selected. The adversary selects the intrinsic reward and catastrophic risk functions, $R_t$ and $C_t$, in response to $D$. Then the learner's selected arm $i_t$ is drawn from $D$.
Next, a biased coin is flipped with probability $b_{t,i_t}$ of heads: $F_t\sim\mathrm{Bernoulli}(b_{t,i_t})$. If it lands heads ($F_t=1$), then $t$ is an exploration step: the learner observes the intrinsic reward $R_t(i_t)$ and risk $C_t(i_t)$ of its selected arm but receives no payoff. If the coin lands tails ($F_t=0$), then $t$ is a deployment step: the agent receives its payoff but does not observe it.
After T of these time steps have passed, the hope is that our learning algorithm has achieved low regret and has used few exploration steps.
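For concreteness, one round of this interaction can be sketched in code. The `learner`/`adversary` interface below is hypothetical (the post fixes no API); it only illustrates the order of play and the role of the coin $F_t$.

```python
import random

def run_protocol(learner, adversary, T, rng=random):
    """Sketch of the interaction protocol described above.

    `learner` and `adversary` are hypothetical objects used for
    illustration; `learner.tau` stands in for the risk-tolerance.
    """
    exploration_steps = 0
    total_payoff = 0.0
    for t in range(T):
        dist, sampling_prob = learner.propose(t)     # D in Delta(I) and b_{t,i} per arm
        R, C = adversary.respond(dist)               # reward/risk chosen knowing D
        arm = rng.choices(range(len(dist)), weights=dist)[0]
        b = sampling_prob[arm]
        if rng.random() < b:                         # F_t = 1: exploration step
            exploration_steps += 1
            learner.observe(t, arm, R[arm], C[arm])  # observe R_t, C_t; receive nothing
        else:                                        # F_t = 0: deployment step
            total_payoff += R[arm] - C[arm] / learner.tau  # receive M_t; observe nothing
    return exploration_steps, total_payoff
```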
Defining the agent's performance
We define the number of exploration steps and the regret in turn.
The number of exploration steps is defined as $\sum_{t=1}^T F_t$.
Regret is defined relative to the best arm, $i^*$:
$$i^*\in\arg\max_{i'\in\mathcal{I}}\sum_{t=1}^T M_t(i')$$
The regret for a series of arms $i_1,i_2,\dots,i_T$ is then:
$$\mathrm{Regret}(i^*):=\sum_{t=1}^T M_t(i^*)-\sum_{t=1}^T \mathbb{E}\left[(1-F_t)\cdot M_t(i_t)\right]=\sum_{t=1}^T\left(M_t(i^*)-(1-b_{t,i_t})M_t(i_t)\right)$$
To summarize, the learner only accumulates its net payoff from deployment steps but it is expected to come close to the maximum payoff that can be achieved using all time steps.
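As a toy check of this definition (illustrative only, not part of the formal development), the realized regret can be computed directly from a table of net payoffs:

```python
def regret(M, best_arm, arms, b):
    """Realized regret per the definition above:
    sum_t M[t][i*] - (1 - b_t) * M[t][i_t].

    M[t][i] is the net payoff of arm i at step t, `arms` the selected
    arms, and `b` the sampling probabilities used on each step.
    """
    return sum(M[t][best_arm] - (1 - b[t]) * M[t][arms[t]]
               for t in range(len(arms)))
```

Note how a step with a high sampling probability contributes almost its full $M_t(i^*)$ to the regret: exploration steps forfeit the payoff.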
A proposed learning model
In designing a learning system for this kind of problem, a couple of difficulties arise. The first is that the net payoff function has an extremely large range: $M_t$ has range $[-\frac{1}{\tau},1]$, where $\tau\approx 10^{-20}$, so when bandit algorithms like Exp3 are applied to this learning problem, their regret will include an unacceptable factor of $1/\tau$. So $M_t$ is replaced with a surrogate loss function $S_t$, whose range is capped to $[-z,1]$:
$$S_t(i):=R_t(i)-\min\left(\frac{C_t(i)}{\tau},z\right)$$
where $z\in[1,\infty)$ indicates how much lost utility from catastrophes the learner tolerates.
The second difficulty is that if the learner observed $S_t$ on every time step, it would always output $b_{t,i_t}=1$, and it would use a linear number of exploration steps. To achieve a sublinear amount of exploration, the learner must instead use sampling probabilities $b_{t,i_t}$ whose sum is sublinear in $T$, and use estimates $\hat{S}_1(i),\dots,\hat{S}_T(i)$ of the surrogate losses, such that:
$$\hat{S}_t(i):=\frac{F_t\left(R_t(i)-\min\left(\frac{1}{\tau}C_t(i),z\right)\right)}{b_{t,i_t}}$$
Note that this estimate depends on the sampling probabilities $b_{t,i_t}$, which in turn depend on the learner's selected arms, $i_t$.
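The estimate is importance-weighted by the exploration coin; a minimal sketch (parameter names chosen for illustration), with the unbiasedness check over $F_t\sim\mathrm{Bernoulli}(b)$ written out as a comment:

```python
def surrogate(R, C, tau, z):
    # S_t(i) = R_t(i) - min(C_t(i) / tau, z): payoff with a capped catastrophe penalty
    return R - min(C / tau, z)

def estimate(F, R, C, tau, z, b):
    # \hat S_t(i): nonzero only on exploration steps (F = 1), reweighted by 1/b
    return F * surrogate(R, C, tau, z) / b

# Unbiasedness over the coin F ~ Bernoulli(b):
#   E[\hat S_t(i)] = b * (S_t(i) / b) + (1 - b) * 0 = S_t(i)
```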
The main result, which follows from Theorem 1, is that a bandit learner can achieve $(\epsilon,\delta)$-optimal regret for any $\delta\in(0,1)$, where the error $\epsilon$ and the number of exploration time steps are sublinear and do not depend on the risk-tolerance $\tau$.
**Theorem 1.**
Consider a learner that selects a sequence of arms $i_1,i_2,\dots,i_T$ using the Exp3 algorithm on the payoff functions $\hat{S}_1,\dots,\hat{S}_T$, using sampling probabilities $b_{1,i_1},b_{2,i_2},\dots,b_{T,i_T}$ such that:
$$b_{t,i_t}:=\min\left(\max\left(\hat{a}_{t,i_t},\gamma\right),1\right)$$
where the learner uses an overestimate of the ideal sampling probability, $\hat{a}_{t,i_t}\ge a_{t,i_t}$, whose sum exceeds the ideal sampling probabilities by at most a fixed constant factor $H\in[1,\infty)$:
$$\sum_{t=1}^T \hat{a}_{t,i_t}\le H\sum_{t=1}^T a_{t,i_t}$$
Then, the learner's regret and exploration steps are probabilistically bounded:
$$\Pr\left(\sum_{t=1}^T F_t\ge\zeta+\frac{H}{z}\left(2\alpha+\eta+T\right)\;\vee\;\sum_{t=1}^T\left(M_t(i^*)-(1-b_{t,i_t})M_t(i_t)\right)\ge 2\alpha+\eta+\frac{2\alpha+\eta+T}{z}\right)\le 2\exp\left(-\frac{\gamma^2\alpha^2}{2z^2T}\right)+\exp\left(-\frac{\zeta^2}{8T}\right)$$
where $\eta=\frac{2(z+1)}{\gamma}\sqrt{(e-1)K\ln(K)T}$,
and where $\alpha>0$, $\zeta>0$, $\gamma\in(0,1]$ is the base sampling rate, $K$ is the number of bandit arms, and $z\in[1,\infty)$ is the amount of utility that the learner is prepared to lose from catastrophes.
In outline, our approach is to run the Exp3 adversarial bandit algorithm on the estimates $\hat{S}_1,\dots,\hat{S}_T$ of the surrogate loss function. We begin by showing that the surrogate regret upper-bounds the actual regret (Lemma 2). We then quantify the error from approximating the surrogate regret (Lemmas 3, 4). To the estimation error, we add the regret of the Exp3 algorithm itself, obtaining Lemma 5. The bound on exploration steps, Theorem 2, is then a small extension.
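This outline can be sketched as a single learner loop. The sketch below makes several assumptions not fixed by the post: the Exp3 mixing parameter, the shift of payoffs from $[-z,1]$ into $[0,1]$ before the exponential update (Exp3 expects bounded nonnegative payoffs), and the `get_functions`/`get_a_hat` interfaces standing in for the adversary and the overestimates $\hat{a}_{t,i_t}$.

```python
import math
import random

def exp3_with_sampling(K, T, gamma_exp3, gamma_base, z, tau,
                       get_functions, get_a_hat, rng=random):
    """Exp3 run on importance-weighted estimates of the surrogate payoffs."""
    w = [1.0] * K
    for t in range(T):
        total = sum(w)
        # Exp3 arm distribution: exploit weights, mix in uniform exploration.
        probs = [(1 - gamma_exp3) * w[i] / total + gamma_exp3 / K
                 for i in range(K)]
        R, C = get_functions(t, probs)            # adversary responds to D
        arm = rng.choices(range(K), weights=probs)[0]
        b = min(max(get_a_hat(t, arm), gamma_base), 1.0)   # b_{t,i_t}
        if rng.random() < b:                               # exploration step (F_t = 1)
            s = R[arm] - min(C[arm] / tau, z)              # S_t(i_t), in [-z, 1]
            x = (s + z) / (z + 1)                          # shift into [0, 1]
            x_hat = x / (b * probs[arm])                   # reweight for coin and arm
            w[arm] *= math.exp(gamma_exp3 * x_hat / K)
    return w
```

In this sketch, a safe high-reward arm accumulates weight on exploration steps, while an arm at the catastrophe cap receives the minimum shifted payoff and its weight stagnates.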
Good surrogate performance implies good deployment performance
First, we need to show that $S_t$ is a good surrogate for $M_t$. Specifically, we show that the regret with respect to $M_t$ is bounded by the regret with respect to $S_t$.
Lemma 2
If a learner selects arms $i_1,\dots,i_T$ and sampling probabilities $b_{1,i_1},\dots,b_{T,i_T}$ as described in Theorem 1, then its regret will be no worse than its regret with the surrogate $S_t$.
$$\mathrm{Regret}(i^*)=\sum_{t=1}^T\left(M_t(i^*)-(1-b_{t,i_t})M_t(i_t)\right)$$
Note: This describes an idea of Jessica Taylor’s.