Previously, we defined a setting called "Delegative Inverse Reinforcement Learning" (DIRL) in which the agent can delegate actions to an "advisor" and the reward is only visible to the advisor as well. We proved a sublinear regret bound (converted to traditional normalization in online learning, the bound is O(n2/3)) for one-shot DIRL (as opposed to standard regret bounds in RL which are only applicable in the episodic setting). However, this required a rather strong assumption about the advisor: in particular, the advisor had to choose the optimal action with maximal likelihood. Here, we consider "Delegative Reinforcement Learning" (DRL), i.e. a similar setting in which the reward is directly observable by the agent. We also restrict our attention to finite MDP environments (we believe these results can be generalized to a much larger class of environments, but not to arbitrary environments). On the other hand, the assumption about the advisor is much weaker: the advisor is only required to avoid catastrophic actions (i.e. actions that lose value to zeroth order in the interest rate) and assign some positive probability to a nearly optimal action. As before, we prove a one-shot regret bound (in traditional normalization, O(n3/4)). Analogously to before, we allow for "corrupt" states in which both the advisor and the reward signal stop being reliable.

Appendix A contains the proofs and Appendix B contains propositions proved before.

Notation

The notation K:Xk→Y means K is Markov kernel from X to Y. When Y is a finite set, this is the same as a measurable function K:X→ΔY, and we use these notations interchangeably. Given K:Xk→Y and A⊆Y (corr. y∈Y), we will use the notation K(A∣x) (corr K(y∣x)) to stand for Pry′∼K(x)[y′∈A] (corr. Pry′∼K(x)[y′=y]). Given Y is a finite set, μ∈ΔYω, h∈Y∗ and y∈Y, the notation μ(y∣h) means Prx∼μ[x|h|=y∣x:|h|=h].

Given Ω a measurable space, μ∈ΔΩ, n,m∈N, {Ak}k∈[n], {Bj}j∈[m] finite sets and {Xk:Ω→Ak}k∈[n], {Yj:Ω→Bj}j∈[m] random variables (measurable mappings), the mutual information between the joint distribution of the Xk and the joint distribution of the Yj will be denoted

Iω∼μ[X0,X1…Xn−1;Y0,Y1…Ym−1]

We will parameterize our geometric time discount by γ=e−1/t, thus all functions that were previously defined to depend on t are now considered functions of γ.

Results

We start by explaining the relation between the formalism of general environments we used before and the formalism of finite MDPs.

Definition 1

A finite Markov Decision Process (MDP) is a tuple

M=(SM,AM,TM:SM×AMk→SM,RM:SM→[0,1])

Here, SM is a finite set (the set of states), AM is a finite set (the set of actions), TM is the transition kernel and RM is the reward function.

A stationary policy for M is any π:SMk→AM. The space of stationary policies is denoted ΠM. Given π∈ΠM, we define TMπ:SMk→SM by

TMπ(t∣s):=∑a∈AMTM(t∣s,a)π(a∣s)

We define VM:SM×(0,1)→[0,1] and QM:SM×AM×(0,1)→[0,1] by

VM(s,γ):=(1−γ)maxπ∈ΠM∞∑n=0γnETnMπ(s)[RM]

QM(s,a,γ):=(1−γ)RM(s)+γEt∼TM(s,a)[VM(t,γ)]

Here, TnMπ:SMk→SM is just the n-th power of TMπ in the sense of Markov kernel composition.

As well known, VM and QM are rational functions in γ for 1−γ≪1, therefore in this limit we have the Taylor expansions

VM(s,γ)=∞∑k=01k!VkM(s)⋅(1−γ)k

QM(s,a,γ)=∞∑k=01k!QkM(s,a)⋅(1−γ)k

Given any s∈SM, we define {AkM(s)⊆AM}k∈N recursively by

A0M(s):=argmaxa∈AMQ0M(s,a)

Ak+1M(s):=argmaxa∈AkM(s)Qk+1M(s,a)

All MDPs will be assumed to be finite, so we drop the adjective "finite" from now on.

Definition 2

Let I=(A,O) be an interface. An I-universe υ=(μ,r) is said to be an O-realization of MDP M with state function S:hdomμ→SM when AM=A and for any h∈hdomμ, a∈A and o∈O

TM(s∣S(h),a)=Pro∼μ(ha)[S(hao)=s]

r(h)=RM(S(h))

Now now define the relevant notion of a "good advisor."

Definition 3

Let υ=(μ,r) be a universe and ϵ>0. A policy π is said to be ϵ-sane for υ when there are M, S s.t. υ is an O-realization of M with state function S and for any h∈hdomμ

i. suppπ(h)⊆A0M(S(h))

ii. ∃a∈A1M(S(h)):π(a∣h)>ϵ

We can now formulate the regret bound.

Theorem 1

Fix an interface I and ϵ>0. Consider H={υk=(μk,r)∈ΥI}k∈[N] for some N∈N (note that r doesn't depend on k). Assume that for each k∈[N], σk is an ϵ-sane policy for υk. Then, there is an ¯I-metapolicy π∗ s.t. for any k∈[N]

EU∗υk(γ)−EUπ∗¯υk[σk](γ)=O((1−γ)1/4)

Corollary 1

Fix an interface I and ϵ>0. Consider H={υk=(μk,r)∈ΥI}k∈N. Assume that for each k∈N, σk is a ϵ-sane policy for υk. Define ¯H:={¯υk[σk]}n∈N. Then, ¯H is learnable.

Now, we deal with corrupt states.

Definition 4

Let υ=(μ,r) be a universe and ϵ>0. A policy π is said to be locally ϵ-sane for υ when there are M, S and U⊆SM (the set of uncorrupt states) s.t. υ is an O-realization of M with state function S, S(λ)∈U and for any h∈hdomμ, if S(h)∈U then

i. If a∈suppπ(h) and o∈suppμ(ha) then S(hao)∈U.

ii. suppπ(h)⊆A0M(S(h))

iii. ∃a∈A1M(S(h)):π(a∣h)>ϵ

iv. ∃a∈A2M(S(h)):TM(U∣S(h),a)=1

Of course, this requirement is still unrealistic for humans in the real world. In particular, it makes the formalism unsuitable for modeling the use of AI for catastrophe mitigation (which is ultimately what we are interested in!) since it assumes the advisor is already capable of avoiding any catastrophe. In following work, we plan to relax the assumptions further.

Corollary 2

Fix an interface I and ϵ>0. Consider H={υk=(μk,rk)∈ΥI}k∈[N] for some N∈N. Assume that for each k∈[N], σk is locally ϵ-sane for υk. For each k∈[N], let Uk⊆SMk be the corresponding set of uncorrupt states. Assume further that for any k,j∈N and h∈hdomμk∩hdomμj, if Sk(h)∈Uk and Sj(h)∈Uj, then rk(h)=rj(h). Then, there is an ¯I-policy π∗ s.t. for any k∈[N]

EU∗υk(γ)−EUπ∗¯υk[σk](γ)=O((1−γ)1/4)

Corollary 3

Assume the same conditions as in Corollary 2, except that H may be countable infinite. Then, ¯H is learnable.

Appendix A

First, we prove an information theoretic bound that shows that for Thompson sampling, the expected information gain is bounded below by a function of the loss.

Proposition A.1

Consider a probability space (Ω,P∈ΔΩ), N∈N, R⊆[0,1] a finite set and random variables U:Ω→R, K:Ω→[N] and J:Ω→[N]. Assume that K∗P=J∗P=ζ∈Δ[N] and I[K;J]=0. Then

Now, we describe a "delegation routine" D that can transform any "proto-policy" π that recommends some set of actions from A into an actual ¯I-policy s.t (i) with high probability, on each round, either a "safe" recommended action is taken, or all recommended actions are "unsafe" or delegation is performed and (ii) the expected number of delegations is small. For technical reasons, we also need to the modified routines D!k which behave the same way as D except for some low probability cases.

Proposition A.2

Fix an interface I=(A,O), N∈N, ϵ∈(0,1|A|), δ∈(0,1N). Consider some {σk:(A×O)∗k→A}k∈[N]. Then, there exist D:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗×2A→¯A and {D!k:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗×2A→¯A}k∈[N] with the following properties. Given x∈(2A×¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O)∗, we denote x–– its projection to ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗. Thus, x––––∈(A×O)∗.
Given μ an I-environment, π:hdomμk→2A, D′:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗×2A→¯A and k∈[N], we can define Ξ[μ,σk,D′,π]∈Δ(2A×¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O)ω as follows

We define {Ak:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗×2A→2A}k∈N, ~ζ:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗k→[N], ζ:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗k→[N], {~ζ!k:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗k→[N]}k∈N, {ζ!k:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗k→[N]}k∈N, D and {D!k}k∈N recursively. For each h∈¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗, a∈¯A, o∈¯O, B⊆A and j,k∈[N], we require

Ak(h,B):={a∈A∣σk(a∣h––)>0∧(a∈B∨∀b∈B:σk(b∣h––)≤ϵ)}

~ζ(k∣λ)=~ζ!j(k∣λ):=1N

~ζ(k∣hao):=⎧⎪
⎪
⎪⎨⎪
⎪
⎪⎩ζ(k∣h) if a≠⊥1N if a=⊥,∑N−1i=0ζ(i∣h)⋅σi(b∣h––)=0(∑N−1i=0ζ(i∣h)⋅σi(b∣h––))−1ζ(k∣h)⋅σk(b∣h––) otherwise, assuming o∈bO

~ζ!j(k∣hao):=⎧⎨⎩ζ!j(k∣h) if a≠⊥(∑N−1i=0ζ!j(i∣h)⋅σi(b∣h––))−1ζ!j(k∣h)⋅σk(b∣h––) if a=⊥,o∈bO

Condition ii follows because the only difference is in the equation for ζ vs. ζ!j. That is, ζ may "discard" the "correct" element of [N] but this happens with probability at most δ per discard and there are at most N−1 discards.

Conditions iii and iv are obvious from the definition.

Definition A.1

Consider k∈N and a universe υ=(μ,r) that is an O-realization of M with state function S. A policy π is called k-optimal for υ when for any h∈hdomμ

suppπ(h)⊆AkM(S(h))

π is called Blackwell optimal for υ when it is k-optimal for any k∈N. Obviously, a stationary Blackwell optimal policy always exists. It is a standard (up to straightforward adaption to our formalism) result in MDP theory that any Blackwell optimal policy has maximal expected utility for any γ sufficiently close to 1.

The following proposition relates optimality in terms of expected utility to expected truncated utility, where truncated utility is defined by only summing rewards within a time duration T.

Proposition A.4

Fix an MDP M. Then, for any γ∈(0,1), T∈N, universe υ=(μ,r) that is an O-realization of M with state function S, π1 a 1-optimal policy for υ and π∗ a Blackwell optimal policy for υ, we have

The following shows that for any policy that doesn't make "irreversible errors," regret can be approximated by "episodic regret" for sufficiently large episode duration.

Proposition A.5

Consider a universe υ=(μ,r) that is an O-realization of M with state function S. Suppose that π∗ is a Blackwell optimal policy for υ and π0 is a 0-optimal policy for υ. For any n∈N, let π∗n be a policy s.t. for any h∈hdomμ

Previously, we defined a setting called "Delegative Inverse Reinforcement Learning" (DIRL) in which the agent can delegate actions to an "advisor" and the reward is only visible to the advisor as well. We proved a sublinear regret bound (converted to traditional normalization in online learning, the bound is O(n2/3)) for one-shot DIRL (as opposed to standard regret bounds in RL which are only applicable in the

episodicsetting). However, this required a rather strong assumption about the advisor: in particular, the advisor had to choose the optimal action with maximal likelihood. Here, we consider "Delegative Reinforcement Learning" (DRL), i.e. a similar setting in which the reward is directly observable by the agent. We also restrict our attention to finite MDP environments (we believe these results can be generalized to a much larger class of environments, but not to arbitrary environments). On the other hand, the assumption about the advisor is much weaker: the advisor is only required to avoidcatastrophicactions (i.e. actions that lose value to zeroth order in the interest rate) and assign some positive probability to a nearly optimal action. As before, we prove a one-shot regret bound (in traditional normalization, O(n3/4)). Analogously to before, we allow for "corrupt" states in which both the advisor and the reward signal stop being reliable.Appendix A contains the proofs and Appendix B contains propositions proved before.

## Notation

The notation K:Xk→Y means K is Markov kernel from X to Y. When Y is a finite set, this is the same as a measurable function K:X→ΔY, and we use these notations interchangeably. Given K:Xk→Y and A⊆Y (corr. y∈Y), we will use the notation K(A∣x) (corr K(y∣x)) to stand for Pry′∼K(x)[y′∈A] (corr. Pry′∼K(x)[y′=y]). Given Y is a finite set, μ∈ΔYω, h∈Y∗ and y∈Y, the notation μ(y∣h) means Prx∼μ[x|h|=y∣x:|h|=h].

Given Ω a measurable space, μ∈ΔΩ, n,m∈N, {Ak}k∈[n], {Bj}j∈[m] finite sets and {Xk:Ω→Ak}k∈[n], {Yj:Ω→Bj}j∈[m] random variables (measurable mappings), the mutual information between the joint distribution of the Xk and the joint distribution of the Yj will be denoted

Iω∼μ[X0,X1…Xn−1;Y0,Y1…Ym−1]

We will parameterize our geometric time discount by γ=e−1/t, thus all functions that were previously defined to depend on t are now considered functions of γ.

## Results

We start by explaining the relation between the formalism of general environments we used before and the formalism of finite MDPs.

## Definition 1

A

finite Markov Decision Process(MDP) is a tupleM=(SM, AM, TM:SM×AMk→SM, RM:SM→[0,1])

Here, SM is a finite set (the set of states), AM is a finite set (the set of actions), TM is the transition kernel and RM is the reward function.

A

stationary policyfor M is any π:SMk→AM. The space of stationary policies is denoted ΠM. Given π∈ΠM, we define TMπ:SMk→SM byTMπ(t∣s):=∑a∈AMTM(t∣s,a)π(a∣s)

We define VM:SM×(0,1)→[0,1] and QM:SM×AM×(0,1)→[0,1] by

VM(s,γ):=(1−γ)maxπ∈ΠM∞∑n=0γnETnMπ(s)[RM]

QM(s,a,γ):=(1−γ)RM(s)+γEt∼TM(s,a)[VM(t,γ)]

Here, TnMπ:SMk→SM is just the n-th power of TMπ in the sense of Markov kernel composition.

As well known, VM and QM are rational functions in γ for 1−γ≪1, therefore in this limit we have the Taylor expansions

VM(s,γ)=∞∑k=01k!VkM(s)⋅(1−γ)k

QM(s,a,γ)=∞∑k=01k!QkM(s,a)⋅(1−γ)k

Given any s∈SM, we define {AkM(s)⊆AM}k∈N recursively by

A0M(s):=argmaxa∈AMQ0M(s,a)

Ak+1M(s):=argmaxa∈AkM(s)Qk+1M(s,a)

All MDPs will be assumed to be finite, so we drop the adjective "finite" from now on.

## Definition 2

Let I=(A,O) be an interface. An I-universe υ=(μ,r) is said to be

an O-realization of MDP M with state function S:hdomμ→SMwhen AM=A and for any h∈hdomμ, a∈A and o∈OTM(s∣S(h),a)=Pro∼μ(ha)[S(hao)=s]

r(h)=RM(S(h))

Now now define the relevant notion of a "good advisor."

## Definition 3

Let υ=(μ,r) be a universe and ϵ>0. A policy π is said to be

ϵ-sane for υwhen there are M, S s.t. υ is an O-realization of M with state function S and for any h∈hdomμi. suppπ(h)⊆A0M(S(h))

ii. ∃a∈A1M(S(h)):π(a∣h)>ϵ

We can now formulate the regret bound.

## Theorem 1

Fix an interface I and ϵ>0. Consider H={υk=(μk,r)∈ΥI}k∈[N] for some N∈N (note that r doesn't depend on k). Assume that for each k∈[N], σk is an ϵ-sane policy for υk. Then, there is an ¯I-metapolicy π∗ s.t. for any k∈[N]

EU∗υk(γ)−EUπ∗¯υk[σk](γ)=O((1−γ)1/4)

## Corollary 1

Fix an interface I and ϵ>0. Consider H={υk=(μk,r)∈ΥI}k∈N. Assume that for each k∈N, σk is a ϵ-sane policy for υk. Define ¯H:={¯υk[σk]}n∈N. Then, ¯H is learnable.

Now, we deal with corrupt states.

## Definition 4

Let υ=(μ,r) be a universe and ϵ>0. A policy π is said to be

locally ϵ-sane for υwhen there are M, S and U⊆SM (the set of uncorrupt states) s.t. υ is an O-realization of M with state function S, S(λ)∈U and for any h∈hdomμ, if S(h)∈U theni. If a∈suppπ(h) and o∈suppμ(ha) then S(hao)∈U.

ii. suppπ(h)⊆A0M(S(h))

iii. ∃a∈A1M(S(h)):π(a∣h)>ϵ

iv. ∃a∈A2M(S(h)):TM(U∣S(h),a)=1

Of course, this requirement is still unrealistic for humans in the real world. In particular, it makes the formalism unsuitable for modeling the use of AI for catastrophe mitigation (which is ultimately what we are interested in!) since it assumes the advisor is already capable of avoiding any catastrophe. In following work, we plan to relax the assumptions further.

## Corollary 2

Fix an interface I and ϵ>0. Consider H={υk=(μk,rk)∈ΥI}k∈[N] for some N∈N. Assume that for each k∈[N], σk is locally ϵ-sane for υk. For each k∈[N], let Uk⊆SMk be the corresponding set of uncorrupt states. Assume further that for any k,j∈N and h∈hdomμk∩hdomμj, if Sk(h)∈Uk and Sj(h)∈Uj, then rk(h)=rj(h). Then, there is an ¯I-policy π∗ s.t. for any k∈[N]

EU∗υk(γ)−EUπ∗¯υk[σk](γ)=O((1−γ)1/4)

## Corollary 3

Assume the same conditions as in Corollary 2, except that H may be countable infinite. Then, ¯H is learnable.

## Appendix A

First, we prove an information theoretic bound that shows that for Thompson sampling, the expected information gain is bounded below by a function of the loss.

## Proposition A.1

Consider a probability space (Ω,P∈ΔΩ), N∈N, R⊆[0,1] a finite set and random variables U:Ω→R, K:Ω→[N] and J:Ω→[N]. Assume that K∗P=J∗P=ζ∈Δ[N] and I[K;J]=0. Then

I[K;J,U]≥2(mini∈[N]ζ(i))(E[U∣J=K]−E[U])2

## Proof of Proposition A.1

We have

I[K;J,U]=I[K;J]+I[K;U∣J]=I[K;U∣J]=E[DKL(U∗(P∣K,J)∥ U∗(P∣J))]

Using Pinsker's inequality, we get

I[K;J,U]≥2E[dtv(U∗(P∣K,J),U∗(P∣J))2]≥2E[(E[U∣K,J]−E[U∣J])2]

Denote Ukj:=E[U∣K=k,J=j]. We get

I[K;J,U]≥2E(k,j)∼ζ×ζ[(Ukj−Ek′∼ζ[Uk′j])2]≥2E(k,j)∼ζ×ζ[[[k=j]](Ukj−Ek′∼ζ[Uk′j])2]

I[K;J,U]≥2Ej∼ζ[ζ(j)(Ujj−Ek∼ζ[Ukj])2]≥2(minj∈[N]ζ(j))Ej∼ζ[(Ujj−Ek∼ζ[Ukj])2]

I[K;J,U]≥2(minj∈[N]ζ(j))(Ej∼ζ[Ujj]−E(k,j)∼ζ×ζ[Ukj])2=2(mini∈[N]ζ(i))(E[U∣J=K]−E[U])2

Now, we describe a "delegation routine" D that can transform any "proto-policy" π that recommends some

setof actions from A into an actual ¯I-policy s.t (i) with high probability, on each round, either a "safe" recommended action is taken, or all recommended actions are "unsafe" or delegation is performed and (ii) the expected number of delegations is small. For technical reasons, we also need to the modified routines D!k which behave the same way as D except for some low probability cases.## Proposition A.2

Fix an interface I=(A,O), N∈N, ϵ∈(0,1|A|), δ∈(0,1N). Consider some {σk:(A×O)∗k→A}k∈[N]. Then, there exist D:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗×2A→¯A and {D!k:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗×2A→¯A}k∈[N] with the following properties. Given x∈(2A×¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O)∗, we denote x–– its projection to ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗. Thus, x––––∈(A×O)∗. Given μ an I-environment, π:hdomμk→2A, D′:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗×2A→¯A and k∈[N], we can define Ξ[μ,σk,D′,π]∈Δ(2A×¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O)ω as follows

Ξ[μ,σk,D′,π](B,a,o∣x):=π(B∣x––––)D′(a∣x––,B)¯μ[σk](o∣x––a)

We require that for every π, μ and k as above, the following conditions hold

i. Ex∼Ξ[μ,σk,D!k,π][|{n∈N∣xn∈2A×⊥×¯O}|]≤lnNδln(1+ϵ(1−ϵ)(1−ϵ)/ϵ)=O(lnNδϵ)

ii. dtv(1N∑N−1j=0Ξ[μ,σj,D,π],1N∑N−1j=0Ξ[μ,σj,D!j,π])≤(N−1)δ

iii. For all x∈hdom¯μ[σk], if D!k(x,π(x––))≠⊥ then σk(D!k(x,π(x––))∣x––)>0

iv. For all x∈hdom¯μ[σk], if D!k(x,π(x––))∉π(x––)∪{⊥} then ∀a∈π(x––):σk(a∣x––)≤ϵ

In order to prove Proposition A.2, we need another mutual information bound.

## Proposition A.3

Consider N∈N, A a finite set, ϵ∈(0,1|A|), δ∈(0,1), B⊆A, ζ∈Δ[N] and σ:[N]k→A. Suppose that for every a∈A

Prk∼ζ[σ(a∣k)>0∧(a∈B∨∀b∈B:σ(b∣k)≤ϵ)]≤1−δ

Then

I(k,a)∼ζ⋉σ[k;a]≥δln(1+ϵ(1−ϵ)(1−ϵ)/ϵ)

## Proof of Proposition A.3

We have

I(k,a)∼ζ⋉σ[k;a]=Ek∼ζ[DKL(σ(k)∥σ∗ζ)]

Define q∈(0,1) by

q:=1ϵ+(1−ϵ)−(1−ϵ)/ϵ

Let a∗∈A be s.t. (σ∗ζ)(a∗)>qϵ and either a∗∈B or every a∈B has (σ∗ζ)(a)≤qϵ. For every k∈[N], denote

Ak:={a∈A∣σ(a∣k)>0∧(a∈B∨∀b∈B:σ(b∣k)≤ϵ)}

If a∗∉Ak then either σ(a∗∣k)=0 or there is a∈B s.t. (σ∗ζ)(a)≤qϵ and σ(b∣k)≥ϵ. This implies

DKL(σ(k)∥σ∗ζ)≥min(DKL(0∥qϵ),DKL(ϵ∥qϵ))

We have

DKL(0∥qϵ)=ln11−qϵ=ln11−ϵϵ+(1−ϵ)−(1−ϵ)/ϵ=lnϵ+(1−ϵ)−(1−ϵ)/ϵϵ+(1−ϵ)−(1−ϵ)/ϵ−ϵ=ln(1+ϵ(1−ϵ)(1−ϵ)/ϵ)

DKL(ϵ∥qϵ)=ϵlnϵqϵ+(1−ϵ)ln1−ϵ1−qϵ=ϵln1q+(1−ϵ)ln(1−ϵ)−ϵln11−qϵ+ln11−qϵ

DKL(ϵ∥qϵ)=ϵln1−qϵq+ln(1−ϵ)1−ϵ+ln11−qϵ=ϵln(1q−ϵ)+ln(1−ϵ)1−ϵ+ln11−qϵ

DKL(ϵ∥qϵ)=ϵln(1−ϵ)−(1−ϵ)/ϵ+ln(1−ϵ)1−ϵ+ln11−qϵ=ln(1−ϵ)−(1−ϵ)+ln(1−ϵ)1−ϵ+ln11−qϵ=ln11−qϵ

DKL(ϵ∥qϵ)=ln(1+ϵ(1−ϵ)(1−ϵ)/ϵ)

It follows that

I(k,a)∼ζ⋉σ[k;a]≥Prk∼ζ[a∗∉Ak]ln(1+ϵ(1−ϵ)(1−ϵ)/ϵ)≥δln(1+ϵ(1−ϵ)(1−ϵ)/ϵ)

## Proof of Proposition A.2

We define {Ak:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗×2A→2A}k∈N, ~ζ:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗k→[N], ζ:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗k→[N], {~ζ!k:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗k→[N]}k∈N, {ζ!k:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗k→[N]}k∈N, D and {D!k}k∈N recursively. For each h∈¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗, a∈¯A, o∈¯O, B⊆A and j,k∈[N], we require

Ak(h,B):={a∈A∣σk(a∣h––)>0∧(a∈B∨∀b∈B:σk(b∣h––)≤ϵ)}

~ζ(k∣λ)=~ζ!j(k∣λ):=1N

~ζ(k∣hao):=⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩ζ(k∣h) if a≠⊥1N if a=⊥, ∑N−1i=0ζ(i∣h)⋅σi(b∣h––)=0(∑N−1i=0ζ(i∣h)⋅σi(b∣h––))−1ζ(k∣h)⋅σk(b∣h––) otherwise, assuming o∈bO

~ζ!j(k∣hao):=⎧⎨⎩ζ!j(k∣h) if a≠⊥(∑N−1i=0ζ!j(i∣h)⋅σi(b∣h––))−1ζ!j(k∣h)⋅σk(b∣h––) if a=⊥, o∈bO

ζ(k∣h):=~ζ(k∣h)[[~ζ(k∣h)≥δ]]∑N−1i=0~ζ(i∣h)[[~ζ(i∣h)≥δ]]

ζ!j(k∣h):=~ζ!j(k∣h)[[~ζ!j(k∣h)≥δ]]∑N−1i=0~ζ!j(i∣h)[[~ζ!j(i∣h)≥δ]][[~ζ!j(j∣h)≥δ]]+[[k=j]]⋅[[~ζ!j(j∣h)<δ]]

D(h,B):={any a s.t. ∀k∈suppζ(h):a∈Ak(h,B) if such exists⊥ otherwise

D!j(h,B):={any a s.t. ∀k∈suppζ!j(h):a∈Ak(h,B) if such exists⊥ otherwise

Denote Ξ!k:=Ξ[μ,σk,D!k,π]. Proposition A.3 implies that for any k∈[N]

lnN≤∞∑n=0Ex∼Ξ!k[H(ζ!k(x––:n+1))−H(ζ!k(x––:n))]≤∞∑n=0Prx∼Ξ!k[xn∈2A×⊥×O]δln(1+ϵ(1−ϵ)1−ϵϵ)

This gives us condition i.

Condition ii follows because the only difference is in the equation for ζ vs. ζ!j. That is, ζ may "discard" the "correct" element of [N] but this happens with probability at most δ per discard and there are at most N−1 discards.

Conditions iii and iv are obvious from the definition.

## Definition A.1

Consider k∈N and a universe υ=(μ,r) that is an O-realization of M with state function S. A policy π is called

k-optimal for υwhen for any h∈hdomμsuppπ(h)⊆AkM(S(h))

π is called

Blackwell optimal for υwhen it is k-optimal for any k∈N. Obviously, a stationary Blackwell optimal policy always exists. It is a standard (up to straightforward adaption to our formalism) result in MDP theory that any Blackwell optimal policy has maximal expected utility for any γ sufficiently close to 1.The following proposition relates optimality in terms of expected utility to expected

truncatedutility, where truncated utility is defined by only summing rewards within a time duration T.## Proposition A.4

Fix an MDP M. Then, for any γ∈(0,1), T∈N, universe υ=(μ,r) that is an O-realization of M with state function S, π1 a 1-optimal policy for υ and π∗ a Blackwell optimal policy for υ, we have

T−1∑n=0γn(Ex∼μ⋈π∗[r(x:n)]−Ex∼μ⋈π1[r(x:n)])≤O(1+T(1−γ))

## Proof of Proposition A.4

Let π∗1 be a policy s.t. for any h∈hdomμ

π∗1(h):={π1(h) if |h|<Tπ∗(h) otherwise

By Proposition B.1

EU∗υ(γ)−EUπ∗1υ(γ)=∞∑n=0γnEx∼μ⋈π∗1[Vυγ(x:n)−Qυγ(x:n,xAn)]

EU∗υ(γ)−EUπ∗1υ(γ)=∞∑n=0γnEx∼μ⋈π∗1[VM(S(x:n),γ)−QM(S(x:n),xAn,γ)]

EU∗υ(γ)−EUπ∗1υ(γ)=∞∑n=0γn∞∑k=0(1−γ)kk!Ex∼μ⋈π∗1[VkM(S(x:n))−QkM(S(x:n),xAn)]

Using the Blackwell optimality of π∗ and 1-optimality of π1, we get that for 1−γ≪1

EU∗υ(γ)−EUπ∗1υ(γ)=T−1∑n=0γn∞∑k=2(1−γ)kk!Ex∼μ⋈π1[VkM(S(x:n))−QkM(S(x:n),xAn)]

EU∗υ(γ)−EUπ∗1υ(γ)≤T−1∑n=0γnmaxs∈SMmaxa∈A1M(s)∞∑k=2(1−γ)kk!(VkM(s)−QkM(s,a))

EU∗υ(γ)−EUπ∗1υ(γ)=O(T(1−γ)2)

(1−γ)∞∑n=0γn(Ex∼μ⋈π∗[r(x:n)]−Ex∼μ⋈π∗1[r(x:n)])=O(T(1−γ)2)

∞∑n=0γn(Ex∼μ⋈π∗[r(x:n)]−Ex∼μ⋈π∗1[r(x:n)])=O(T(1−γ))

Denote ρ∗:=μ⋈π∗, ρ1:=μ⋈π1. Using again the Blackwell optimality of π∗

T−1∑n=0γn(Ex∼ρ∗[r(x:n)]−Ex∼ρ1[r(x:n)])+γT1−γ(Ex∼ρ∗[VM(S(x:T))]−Ex∼ρ1[VM(S(x:T))])=O(T(1−γ))

Since both π∗ and π1 are in particular 0-optimal, we have

Ex∼ρ∗[V0M(S(x:T))]=Ex∼ρ1[V0M(S(x:T))]=V0M(S(λ))

It follows

T−1∑n=0γn(Ex∼ρ∗[r(x:n)]−Ex∼ρ1[r(x:n)])±γT1−γO(1−γ)=O(T(1−γ))

T−1∑n=0γn(Ex∼ρ∗[r(x:n)]−Ex∼ρ1[r(x:n)])=O(γT+T(1−γ))

The following shows that for any policy that doesn't make "irreversible errors," regret can be approximated by "episodic regret" for sufficiently large episode duration.

## Proposition A.5

Consider a universe υ=(μ,r) that is an O-realization of M with state function S. Suppose that π∗ is a Blackwell optimal policy for υ and π0 is a 0-optimal policy for υ. For any n∈N, let π∗n be a policy s.t. for any h∈hdomμ

π∗n(h):={π0(h) if |h|<nTπ∗(h) otherwise

Then, for any γ∈(0,1) and T∈N+

EU∗υ(γ)−EUπ0υ(γ)≤(1−γ)∞∑n=0T−1∑m=0γnT+m(Ex∼μ⋈π∗n[r(x:nT+m)]−Ex∼μ⋈π0[r(x:nT+m)])+O(1−γ1−γT)

## Proof of Proposition A.5

By Proposition B.1, for any l∈N

EU∗υ(γ)−EUπ∗lυ(γ)=∞∑n=0γnEx∼μ⋈π∗l[Vυγ(x:n)−Qυγ(x:n,xAn)]

π∗l becomes Blackwell optimal after lT, therefore for 1−γ≪1,

EU∗υ(γ)−EUπ∗lυ(γ)=lT−1∑n=0γnEx∼μ⋈π0[Vυγ(x:n)−Qυγ(x:n,xAn)]

EUπ∗lυ(γ)−EUπ∗l+1υ(γ)=(l+1)T−1∑n=lT