% operators that are separated from the operand by a space
% autosize deliminaters
% operators that require brackets
% operators that require parentheses
% Paper specific
We derive a regret bound for DRL reflecting dependence on:

Number of hypotheses

Mixing time of MDP hypotheses

The probability with which the advisor takes optimal actions
That is, the regret bound we get is fully explicit up to a multiplicative constant (which can also be made explicit). Currently we focus on plain (as opposed to catastrophe) and uniform (finite number of hypotheses, uniform prior) DRL, although this result can and should be extended to the catastrophe and/or nonuniform settings.
Appendix A contains the proofs, Appendix B recalls a few lemmas we need from previous essays.
##Results
First, we briefly recall some properties of Markov chains.
#Definition 1
Consider a finite set and . We say that is a period of when there is an essential state of (that is, ) s.t. is its period, i.e. . We denote the set of periods of .
The following is a corollary of the PerronFrobenius theorem which we give without proof. [I believe this is completely standard and would be grateful to get a source for this which treats the reducible case; of course I can produce the proof but it seems redundant.]
#Proposition 1
Consider a finite set and . Then, there are , , and s.t. for any
For the purpose of this essay, we will use a definition of local sanity slightly stronger than what previously appeared as "Definition 4." We think this strengthening is not substantial, but also the current analysis can be generalized to the weaker case by adding a term proportional to the 2nd derivative of (or the 2nd moment of the mixing time). We leave out the details for the time being.
We will use the notation .
#Definition 2
Let be a universe and . A policy is said to be locally sane for when there are , and (the set of uncorrupt states) s.t. is an realization of with state function , and for any , if then
i.
ii.
iii.
Recall that for any MPD , there is s.t. for any , if and only if .
We are now ready to state the regret bound.
#Theorem 1
There is some s.t. the following holds.
Consider an interface, , for some . For each , let be a locally sane policy for . For each , let be the corresponding MDP and the corresponding set of uncorrupt states. Assume that
i. For each , .
ii.
iii. For any and , if and , then .
Then, there is an policy s.t. for any
The factor might seem difficult to understand. However, it can be bounded as follows.
#Proposition 2
Let be an MDP, a Blackwell optimal policy for and , , as in Proposition 1 applied to the Markov chain . Then
Theorem 1 and Proposition 2 immediately give the following:
#Corollary 1
There is some s.t. the following holds.
Under the conditions of Theorem 1, let be s.t. for any
i.
ii.
Assuming w.l.o.g. that all uncorrupt states are reachable from , is guaranteed to exist thanks to condition iii of Definition 2 (if some uncorrupt state is unreachable, we can consider it to be corrupt.) Let , and be as in Proposition 1, for the Markov chain . Then, there is an policy s.t. for any
##Appendix A
The proof of Theorem 1 mostly repeats the proof of the previous "Theorem 1", except that we keep track of the bounds more carefully.
#Proof of Theorem 1
Fix and . Denote . To avoid cumbersome notation, whenever should appear a subscript, we will replace it by . Let be a probability space\Comment{ and a filtration of }. Let be \Comment{measurable w.r.t. }a random variable and the following be stochastic processes\Comment{ adapted to }
We also define by
(The following conditions on and imply that the range of the above is indeed in .) Let and be as in Proposition B.1 (the form of the bound we are proving allows assuming w.l.o.g. that ). By condition iii of Definition 2, there is s.t. for any , if then
We construct \Comment{, }, , , , , , and s.t is uniformly distributed and for any , , and , denoting
Note that the last equation has the form of a Bayesian update which is allowed to be arbitrary when update is on "impossible" information.
We now construct the policy s.t. for any , s.t. and
That is, we perform Thompson sampling at time intervals of size , moderated by the delegation routine , and discard from our belief state hypotheses whose probability is below and hypotheses sampling which resulted in recommending "unsafe" actions i.e. actions that refused to perform.
In order to prove has the desired property, we will define the stochastic processes , , , , and , each process of the same type as its shriekless counterpart (thus is constructed to accommodate them). These processes are required to satisfy the following:
For any , we construct the policy s.t. for any , s.t. and
Given any policy and policy we define by
Here, is a constant defined s.t. the probabilities sum to 1. We define the policy by
Define by
Condition iii of Proposition B.1 and condition ii of Definition 2 imply that for any
This means we can apply Proposition B.2 and get
Here, the policy is defined as in Proposition B.2, with in place of . We also define the policies and by
Denote
For each , denote
We have
Condition iv of Proposition B.1 implies that, given s.t.
Therefore, , and we remain with
We have
Since , it follows that
Using condition i of Proposition B.1, we conclude
Define the random variables by
Denote
We get
We apply Proposition B.3 to each term in the sum over .
Condition ii of Proposition B.1 implies that
Here, the factor of 2 comes from the difference between the equations for and (we can construct and intermediate policy between and and use the triangle inequality for ). We conclude
For each , we get
Now we set
Without loss of generality, we can assume this to be consistent with the assumption since the bound contains a term of anyway. Similarly, the second term in the bound implies we can assume that and therefore . Finally, note that assumption ii implies that the expression we round to get is . We get
#Proposition A.1
Fix , . Then, for we have
#Proof of Proposition A.1
We will establish the claim by showing that the first term is in and the second term is in .
For the first term, we have (assuming w.l.o.g. that )
For the second term, we have
Assume w.l.o.g. that . First, we show that the numerator (and hence the entire fraction) is negative. Denote . We have
Therefore, is increasing in . Observing that , we conclude that is negative in .
Now we show that the fraction is . Equivalently, we need to show that . We have
Denote . We have
Therefore, is decreasing in . Observing that we conclude that is positive in , establishing the desired result.
#Proposition A.2
For any :
#Proof of Proposition A.2
Consider . We have
has a local minimum at and a local maximum at . Therefore, for , . For , we get
Taking , we get
Now, and is a decreasing function of for . We conclude
#Proof of Proposition 2
By Proposition 1, we have
where, . Denote