TLDR: I derive a variant of the RL regret bound by Osband and Van Roy (2014), that applies to learning without resets of environments without traps. The advantage of this regret bound over those known in the literature, is that it scales with certain learning-theoretic dimensions rather than number of states and actions. My goal is building on this result to derive this type of regret bound for DRL, and, later on, other settings interesting from an AI alignment perspective.

Previously I derived a regret bound for deterministic environments that scales with prior entropy and "prediction dimension". That bound behaves as O(√1−γ) in the episodic setting but only as O(3√1−γ) in the setting without resets. Moreover, my attempts to generalize the result to stochastic environments led to bounds that are even weaker (have a lower exponent). Therefore, I decided to put that line of attack on hold, and use Osband and Van Roy's technique instead, leading to an O(√1−γ) bound in the stochastic setting without resets. This bound doesn't scale down with the entropy of the prior, but this does not seem as important as dependence on γ.

Results

Russo and Van Roy (2013) introduced the concept of "eluder dimension" for the benefit of the multi-armed bandit setting, and Osband and Van Roy extended it to a form suitable for studying reinforcement learning. We will consider the following, slightly modified version of their definition.

Given a real vector space Y, we will use PSD(Y) to denote the set of positive semidefinite bilinear forms on Y and PD(Y) to denote the set of positive definite bilinear forms on Y. Given a bilinear form B:Y×Y→R, we will slightly abuse notation by also regarding it as a linear functional B:Y⊗Y→R. Thereby, we have B(v⊗w)=B(v,w) and Bv⊗2=B(v,v). Also, if Y is finite-dimensional and B is non-degenerate, we will denote B−1:Y∗×Y∗→R the unique bilinear form which satisfies

∀v∈Y,α∈Y∗:(∀β∈Y∗:B−1(α,β)=βv)⟺(∀w∈Y:B(v,w)=αw)

Definition 1

Consider a set X, a real vector space Y, some F⊆{X→Y} and a family {Bx∈PSD(Y)}x∈X. Consider also n∈N, a sequence {xk∈X}k∈[n] and x∗∈X. x∗ is said to be(F,B)-dependant on {xk}when, for any f,~f∈F

Otherwise, x∗ is said to be(F,B)-independent of {xk}.

Definition 2

Consider a set X, a real vector space Y and some F⊆{X→Y}. The Russo-Van Roy-Osband dimension (RVO dimension) dimRVOFis the supremum of the set of n∈N for which there is {xk∈X}k∈[n] and B s.t. for all m∈[n], xm is (F,B)-independent of {xk}k∈[m].

We have the following basic bounds on RVO dimension.

Proposition 1

Consider a set X, a real vector space Y and some F⊆{X→Y}. Then

dimRVOF≤|X|

Proposition 2

Consider a set X, a real vector space Y and some F⊆{X→Y}. Then

dimRVOF≤|F|(|F|−1)2

Another concept we need to formulate the regret bound is the Minkowski–Bouligand dimension.

Definition 3

Consider a set X, a real vector space Y, some F⊆{X→Y} and a family {Bx∈PSD(Y)}x∈X. A set A⊆F is said to be aB-covering of Fwhen

∀f∈F∃~f∈A:supx∈XBx(f(x)−~f(x))⊗2<1

Definition 4

Consider a set X, a real vector space Y, some F⊆{X→Y} and a family {Bx∈PSD(Y)}x∈X. The covering number N(F,B)is the infimum of the set of n∈N for which there is a B-covering of F of size n.

Definition 5

Consider a finite set X, a finite-dimensional real vector product space Y and some F⊆{X→Y}. Fix any {Bx∈PD(Y)}x∈X. The Minkowski–Bouligand dimension (MB dimension) of Fis defined by

dimMBF:=limsupϵ→0lnN(F,ϵ−2B)ln1ϵ

It is easy to see the above is indeed well-defined, i.e. doesn't depend on the choice of B. This is because given any B,~B, there are constants c1,c2∈R+ s.t. for all F and ϵ

c1N(F,ϵ−2B)≤N(F,ϵ−2~B)≤c2N(F,ϵ−2B)

Similarly, for any {Bx∈PSD(Y)}x∈X, we have

dimMBF≥limsupϵ→0lnN(F,ϵ−2B)ln1ϵ

For finite F and ϵ≪1, it's obvious that N(F,ϵ−2B)≤|F|, and in particular dimMBF=0. It is also possible to show that, for any bounded F, dimMBF≤|X|dimY.

Note that, in general, MB dimension is fractional.

We will need yet another (but rather simple) notion of "dimension".

Definition 6

Consider a set X, a vector space Y and some F⊆{X→Y}. The local dimension of Fis defined by

dimlocF:=maxx∈Xdimspan{f(x)|f∈F}

Obviously dimlocF≤|F| and dimlocF≤dimY.

Consider finite non-empty sets S (states) and A (actions). Observe that ΔS can be regarded as a subset of the vector space RS. This allows us to speak of the RVO, MB and local dimensions of a hypothesis class of transition kernels H⊆{S×Ak→S} (in this case X=S×A and Y=RS).

We can now formulate the regret bound.

Theorem 1

There is some C∈R+ s.t. the following holds.

Consider any finite non-empty sets S and A, R:S→[0,1], closed set H⊆{S×Ak→S} and Borel probability measure ζ on H (prior). We define the maximal bias span τζR by

Here (like in previous essays), VTR(s,γ) is the value function for transition kernel T, reward function R, state s and geometric time discount parameter γ; EUπTR(γ) is the expected utility for policy π; EU∗TR(γ) is the maximal expected utility over policies. The expression in the numerator is, thereby, the Bayesian regret.

The implicit condition τζR<∞ implies that for any T∈H and s,s′∈S, V0TR(s)=V0TR(s′). This is a no-traps condition stronger than the condition A0TR(s)=A we used before: not only that long-term value cannot be lost in expectation, it cannot be lost at all. For any finiteH satisfying this no-traps condition, we have

τζR=maxT∈H(maxs∈SV1TR(s)−mins∈SV1TR(s))<∞

A few directions for improving on this result:

It is not hard to see from the proof that it is also possible to write down a concrete bound for fixed γ (rather than considering the γ→1 limit), but its form is somewhat convoluted.

It is probably possible to get an anytime policy with this form of regret, using PSRL with dynamic episode duration.

It is interesting to try to make do with the weaker no-traps condition A0TR(s)=A, especially since a stronger no-traps conditions would translate to a stronger condition on the advisor in DRL.

It is interesting to study RVO dimension in more detail. For example, I'm not even sure whether Proposition 2 is the best possible bound in terms of |F| or it's e.g. possible to get a linear bound.

It seems tempting to generalize local dimension by allowing the values of f(x) to lie on some nonlinear manifold of given dimension for any given x. This approach, if workable, might require a substantially more difficult proof.

The "cellular decision processes" discussed previously in "Proposition 3" have exponentially high local dimension, meaning that this regret bound is ineffective for them. We can consider a variant in which, on every time step, only one cell or a very small number of cells (possibly chosen randomly) change. This would have low local dimension. One way to interpret it is, as a continuous time process in which each cell has a certain rate of changing its state. EDIT (2019-07-06): It is actually possible to use Theorem 1 to get an effective regret bound for CDPs, by replacing the CDP with a different but equivalent MDP where each time step is replaced by a series of time steps in which the cells are "decided" one by one. It would be interesting to (i) have an explicit expression for the best possible bound among those generated by such "isomorphisms" for general MDPs (ii) derive bounds on RVO dimension for stochastic analogues of CDPs (possibly defined using Markov random fields).

Proofs

Proof of Proposition 1

Consider some {xk∈X}k∈[m] and x∗∈X which is (F,B)-independent of {xk}. Then, there are some f,~f∈F s.t. the following two inequalities hold

m−1∑k=0Bxk(f(xk)−~f(xk))⊗2≤1

Bx∗(f(x∗)−~f(x∗))⊗2>1

The first inequality implies that, for any k∈[m], Bxk(f(xk)−~f(xk))⊗2≤1. Comparing with the second inequality, we conclude that x∗≠xk.

Now suppose that {xk∈X}k∈[n] is s.t. for all m∈[n], xm is (F,B)-independent of {xk}k∈[m]. By the previous paragraph, it follows that xm≠xk for any m,k∈[n] s.t. k<m. Hence, n≤|X|, Q.E.D.

Proof of Proposition 2

Suppose that {xk∈X}k∈[n] is s.t. for all m∈[n], xm is (F,B)-independent of {xk}k∈[m]. Then, for each m∈[n], we can choose fm,~fm∈F s.t. the following two inequalities hold

m−1∑k=0Bxk(fm(xk)−~fm(xk))⊗2≤1

Bxm(fm(xm)−~fm(xm))⊗2>1

In particular, the second inequality implies that fm≠~fm.

Now, consider some k∈[m]. By the first inequality

Bxk(fm(xk)−~fm(xk))⊗2≤1

On the other hand, the second inequality with k instead of m gives

Bxk(fk(xk)−~fk(xk))⊗2>1

It follows that {fm,~fm}≠{fk,~fk}. Therefore, n cannot exceed the number of unordered pairs of distinct elements in F, Q.E.D.

Given a measurable space X, some μ,ν∈ΔX and a measurable function f:X→R, we will use the notation

Ex∼μ−ν[f(x)]:=Ex∼μ[f(x)]−Ex∼ν[f(x)]

Given X,Y measurable spaces, some K,L:Xk→Y and a measurable function f:Y→R, we will use the notation

Ey∼K[f(y)|x]:=Ey∼K(x)[f(y)]

Ey∼K−L[f(y)|x]:=Ey∼K(x)−L(x)[f(y)]

Given finite sets S,A, some T:S×Ak→S and π:S→A, we will use the notation Tπ:Sk→S defined by

Tπ(s):=T(s,π(s))

Proposition A.1

In the setting of Theorem 1, fix T∈N+ and γ∈(0,1). Let Π:H×S→A be a Borel measurable mapping s.t. Π(T) is an optimal policy, i.e.

EUΠ(T)TR(γ)=EU∗TR(γ)

Let πPSζΠT:S∗×Sk→A be the policy implemented by a PSRL algorithm with prior ζ and episode length T, s.t. whenever hypothesis T is sampled, policy Π(T) is followed. Denote the Bayesian regret by

R(γ):=ET∼ζ[EU∗TR(γ)−EUπPSζΠTTR(γ)]

Let (Ω,P) be a probability space governing both the uncertainty about the true hypothesis, the stochastic behavior of the environment and the random sampling inside the algorithm. [See the proof of "Lemma 1" in this previous essay or the proof of "Theorem 1" in another previous essay.] Furthermore, let H∗:Ω→H be a random variable representing the true hypothesis, {Hn:Ω→H}n∈N be the random variables s.t. Hn represents the hypothesis sampled at time n (i.e. during episode number ⌊n/T⌋), {Θn:Ω→S}n∈N be random variables s.t. Θn represents the state at time n and {An:Ω→A}n∈N be random variables s.t. An represents the action taken at time n. Denote El:=H∗Π(HlT) and ¯El:=HlTΠ(H∗). Then

Here, ETl means raising El to the T-th power w.r.t. Markov kernel composition, and the same for ¯ETl.

We will use the notation VTπR(s,γ) to stand for the expected utility achieved by policy π when starting from state s with transition kernel T, reward function R and time discount parameter γ.

Proof of Proposition A.1

For any n∈N, define Πn:H×H×S∗×Sk→A as follows.

Πn(T1,T2,h,s):={Π(T1,s) if |h|<nΠ(T2,s) if |h|≥n

That is, Πn(T1,T2) is a policy that follows Π(T1) for time n and Π(T2) afterwards.

In the following, we use the shorthand notation

V∗(s):=VH∗R(s,γ)

Vl(s):=VHlTR(s,γ)

Vlk(s):=VH∗ΠT−k(HlT,H∗)R(s,γ)

It is easy to see that

R(γ)=∞∑l=0γlTE[V∗(ΘlT)−Vl0(ΘlT)]

The above is essentially what appeared as "Proposition B.1" before, for the special case of PSRL, and where we regard every episode as a single time step in a new MDP in which every action is a policy for the duration of an episode in the original MDP.

By definition, HlT and H∗ have the same distribution even when conditioned by the history up to lT. Therefore

Applying this identity to the second second term on the right hand side of the induction hypothesis, we prove the induction step. For k=T, we get, denoting n0:=lT and n1:=(l+1)T

Applying this to each term in the earlier expression for R(γ), we get the desired result. Q.E.D.

Proposition A.2

Consider a real finite-dimensional normed vector space Y and a linear subspace Z⊆Y. Then, there exists B∈PSD(Y) s.t.

For any v∈Y, Bv⊗2≤∥v∥2

For any v∈Z, (dimZ)2Bv⊗2≥∥v∥2

Proof of Proposition A.2

We assume dimZ>0 since otherwise the claim is trivial (take B=0).

By Theorem B.1 (see Appendix), there is ~B∈PSD(Z) s.t. for any v∈Z

~Bv⊗2≤∥v∥2≤dimZ⋅~Bv⊗2

By Corollary B.1 (see Appendix), there is a projection operator P:Y→Y s.t. imY=Z and ∥P∥≤√dimZ. We define B∈PSD(Y) by

B(v,w):=~B(Pv,Pw)dimZ

For any v∈Y, we have

Bv⊗2=~B(Pv,Pv)dimZ≤∥Pv∥2dimZ≤∥P∥2∥v∥2dimZ≤∥v∥2

For any v∈Z, we have Pv=v and therefore

∥v∥2≤dimZ⋅~Bv⊗2=(dimZ)2~B(Pv,Pv)dimZ=(dimZ)2Bv⊗2

Q.E.D.

Proposition A.3

Consider a finite-dimensional real vector space Y, some B∈PD(Y) and a Borel probability measure μ∈ΔY s.t. Pry∼μ[By⊗2≤1]=1. Let y0:=Ey∼μ[y] and σ:=2√2. Then, μ is σ-sub-Gaussian w.r.t. B, i.e., for any α∈Y∗

Ey∼μ[exp(α(y−y0))]≤exp(σ2B−1α⊗22)

Proof of Proposition A.3

By isomorphism, it is sufficient to consider the case Y=Rn, B(v,w)=v⋅w, α(v)=tv0 for some n∈N and t∈[0,∞). For this form of α it is sufficient to consider the case n=1. It remains to show that

Ey∼μ[et(y−y0)]≤e12σ2t2=e4t2

Here, we assume that Pry∼μ[|y|≤1]=1.

We consider separately the cases t≥12 and t<12. In the first case

The above holds because, at x=0 the left hand side and the right hand have the same value and first derivative, and for any x∈(−∞,+1), the second derivative of the left hand side is less than the second derivative of the right hand side. We get

Consider a set X, a finite-dimensional real vector space Y, some F⊆{X→Y} and a family {Bx∈PSD(Y)}x∈X. Assume F is compact w.r.t. the product topology on X→Y≅∏x∈XY. Consider also some n∈N, x∈Xn and y∈Yn. We then use the notation

I chose the notation LS as an abbreviation of "least squares" and CS as an abbreviation of "confidence set". Note that LS is somewhat ambiguous (and therefore, so is CS) since there might be multiple minima, but this will not be important in the following (i.e. an arbitrary minimum can be chosen).

Proposition A.4

There is some CA.4∈R+ s.t. the following holds.

Consider finite sets X,S, some F⊆{Xk→S} and a family {Bx∈PSD(RS)}x∈X. Assume that for any x∈X and φ∈ΔS, Bxφ⊗2≤1. Let {Hn⊆P(Xω×Sω)}n∈N be the canonical filtration, i.e.

Hn:={A′⊆Xω×Sω∣∣A′={xs|xs:n∈A},A⊆Xn×Sn}

Consider also f∗∈F and μ∈Δ(Xω×Sω) s.t. for any n∈N, x∈X, and s∈S

Prxs∼μ[sn=s|xn=x,Hn]=f∗(s∣x)

Fix ϵ∈R+, δ∈(0,1). Denote

β(t):=CA.4(lnN(F,ϵ−2B)δ+ϵtlnetδ)

Then,

Prxs∼μ[f∗∉∞⋂n=0CSF[xs:n,β(n+1)−1B]]≤δ

Proof of Proposition A.4

For each x∈X, choose a finite set Ex and a linear mapping Px:RS→REx s.t. Bx(v,w)=Px(v)⋅Px(w). Let ~Y:=⨁x∈XREx. Define ~F⊆{X→~Y} by

~F:={~f:X→~Y∣∣~f(x)=Px(f(x)),f∈F}

Define Pω:Xω×Sω→Xω×~Yω by

Pω(xs)n:=xnPxn(sn)

Let ~f∗:=Pω(f∗) and ~μ:=Pω∗μ. By Proposition A.3, ~μ is 2√2-sub-Gaussian. Applying Proposition B.1 (see Appendix) to the tilde objects gives the desired result. Here, we choose the constant CA.4 s.t. the term M=1 in β as defined in Proposition B.1 is absorbed by the term σlnetδ≥1 (note that t≥1 since we substitute t=n+1, δ<1 and σ=2√2>1). Q.E.D.

Definition A.2

Consider a set X, some x∈X, a real vector space Y, some F⊆{X→Y} and a family {Bx∈PSD(Y)}x∈X. TheB-width of F at xis defined by

WF(x,B):=supf,~f∈F√Bx(f(x)−~f(x))⊗2

Proposition A.5

There is some CA.5∈R+ s.t. the following holds.

Consider a set X, a real vector space Y, some F⊆{X→Y} and a family {Bx∈PSD(Y)}x∈X. Consider also some x∈Xω, y∈Yω, T∈N+, γ∈(0,1), θ∈R+, η0,η1∈R+ and δ∈(0,1). Denote

The integral on the right hand side is a Laplace transform (where the dual variable is s=ln1γ). The functions f(t)=1, f(t)=t and f(t)=tlnt all have the property that, for any s∈R+

TLDR:I derive a variant of the RL regret bound by Osband and Van Roy (2014), that applies to learning without resets of environments without traps. The advantage of this regret bound over those known in the literature, is that it scales with certain learning-theoretic dimensions rather than number of states and actions. My goal is building on this result to derive this type of regret bound for DRL, and, later on, other settings interesting from an AI alignment perspective.Previously I derived a regret bound for

deterministicenvironments that scales with prior entropy and "prediction dimension". That bound behaves as O(√1−γ) in the episodic setting but only as O(3√1−γ) in the setting without resets. Moreover, my attempts to generalize the result to stochastic environments led to bounds that areeven weaker(have a lower exponent). Therefore, I decided to put that line of attack on hold, and use Osband and Van Roy's technique instead, leading to an O(√1−γ) bound in the stochastic setting without resets. This bound doesn't scale down with the entropy of the prior, but this does not seem as important as dependence on γ.## Results

Russo and Van Roy (2013) introduced the concept of "eluder dimension" for the benefit of the multi-armed bandit setting, and Osband and Van Roy extended it to a form suitable for studying reinforcement learning. We will consider the following, slightly modified version of their definition.

Given a real vector space Y, we will use PSD(Y) to denote the set of positive semidefinite bilinear forms on Y and PD(Y) to denote the set of positive definite bilinear forms on Y. Given a bilinear form B:Y×Y→R, we will slightly abuse notation by also regarding it as a linear functional B:Y⊗Y→R. Thereby, we have B(v⊗w)=B(v,w) and Bv⊗2=B(v,v). Also, if Y is finite-dimensional and B is non-degenerate, we will denote B−1:Y∗×Y∗→R the unique bilinear form which satisfies

∀v∈Y,α∈Y∗:(∀β∈Y∗:B−1(α,β)=βv)⟺(∀w∈Y:B(v,w)=αw)

Definition 1Consider a set X, a real vector space Y, some F⊆{X→Y} and a family {Bx∈PSD(Y)}x∈X. Consider also n∈N, a sequence {xk∈X}k∈[n] and x∗∈X. x∗ is said to be(F,B)-dependant on {xk}when, for any f,~f∈Fn−1∑k=0Bxk(f(xk)−~f(xk))⊗2≤1⟹Bx∗(f(x∗)−~f(x∗))⊗2≤1

Otherwise, x∗ is said to be(F,B)-independent of {xk}.Definition 2Consider a set X, a real vector space Y and some F⊆{X→Y}. TheRusso-Van Roy-Osband dimension (RVO dimension) dimRVOFis the supremum of the set of n∈N for which there is {xk∈X}k∈[n] and B s.t. for all m∈[n], xm is (F,B)-independent of {xk}k∈[m].We have the following basic bounds on RVO dimension.

Proposition 1Consider a set X, a real vector space Y and some F⊆{X→Y}. ThendimRVOF≤|X|

Proposition 2Consider a set X, a real vector space Y and some F⊆{X→Y}. ThendimRVOF≤|F|(|F|−1)2

Another concept we need to formulate the regret bound is the Minkowski–Bouligand dimension.

Definition 3Consider a set X, a real vector space Y, some F⊆{X→Y} and a family {Bx∈PSD(Y)}x∈X. A set A⊆F is said to be aB-covering of Fwhen∀f∈F∃~f∈A:supx∈XBx(f(x)−~f(x))⊗2<1

Definition 4Consider a set X, a real vector space Y, some F⊆{X→Y} and a family {Bx∈PSD(Y)}x∈X. Thecovering number N(F,B)is the infimum of the set of n∈N for which there is a B-covering of F of size n.Definition 5Consider a finite set X, a finite-dimensional real vector product space Y and some F⊆{X→Y}. Fix any {Bx∈PD(Y)}x∈X. TheMinkowski–Bouligand dimension (MB dimension) of Fis defined bydimMBF:=limsupϵ→0lnN(F,ϵ−2B)ln1ϵ

It is easy to see the above is indeed well-defined, i.e. doesn't depend on the choice of B. This is because given any B,~B, there are constants c1,c2∈R+ s.t. for all F and ϵ

c1N(F,ϵ−2B)≤N(F,ϵ−2~B)≤c2N(F,ϵ−2B)

Similarly, for any {Bx∈PSD(Y)}x∈X, we have

dimMBF≥limsupϵ→0lnN(F,ϵ−2B)ln1ϵ

For finite F and ϵ≪1, it's obvious that N(F,ϵ−2B)≤|F|, and in particular dimMBF=0. It is also possible to show that, for any bounded F, dimMBF≤|X|dimY.

Note that, in general, MB dimension is fractional.

We will need yet another (but rather simple) notion of "dimension".

Definition 6Consider a set X, a vector space Y and some F⊆{X→Y}. Thelocal dimension of Fis defined bydimlocF:=maxx∈Xdimspan{f(x)|f∈F}

Obviously dimlocF≤|F| and dimlocF≤dimY.

Consider finite non-empty sets S (states) and A (actions). Observe that ΔS can be regarded as a subset of the vector space RS. This allows us to speak of the RVO, MB and local dimensions of a hypothesis class of transition kernels H⊆{S×Ak→S} (in this case X=S×A and Y=RS).

We can now formulate the regret bound.

Theorem 1There is some C∈R+ s.t. the following holds.Consider any finite non-empty sets S and A, R:S→[0,1], closed set H⊆{S×Ak→S} and Borel probability measure ζ on H (prior). We define the maximal bias span τζR byτζR:=limsupγ→1maxT∈Hmaxs∈SVTR(s,γ)−mins∈SVTR(s,γ)1−γ

Denote Dloc:=dimlocH, DMB:=dimMBH and DRVO:=dimRVOH. Then, there is a family of policies {π†γ:S∗×Sk→A}γ∈(0,1) s.t.limsupγ→1ET∼ζ[EU∗TR(γ)−EUπ†γTR(γ)]τζRDloc√(DMB+1)DRVO(1−γ)ln11−γ≤C

Here (like in previous essays), VTR(s,γ) is the value function for transition kernel T, reward function R, state s and geometric time discount parameter γ; EUπTR(γ) is the expected utility for policy π; EU∗TR(γ) is the maximal expected utility over policies. The expression in the numerator is, thereby, the Bayesian regret.

The implicit condition τζR<∞ implies that for any T∈H and s,s′∈S, V0TR(s)=V0TR(s′). This is a no-traps condition stronger than the condition A0TR(s)=A we used before: not only that long-term value cannot be lost

in expectation, it cannot be lost at all. For anyfiniteH satisfying this no-traps condition, we haveτζR=maxT∈H(maxs∈SV1TR(s)−mins∈SV1TR(s))<∞

A few directions for improving on this result:

It is not hard to see from the proof that it is also possible to write down a concrete bound for fixed γ (rather than considering the γ→1 limit), but its form is somewhat convoluted.

It is probably possible to get an

anytimepolicy with this form of regret, using PSRL withdynamicepisode duration.It is interesting to try to make do with the weaker no-traps condition A0TR(s)=A, especially since a stronger no-traps conditions would translate to a stronger condition on the advisor in DRL.

It is interesting to study RVO dimension in more detail. For example, I'm not even sure whether Proposition 2 is the best possible bound in terms of |F| or it's e.g. possible to get a linear bound.

It seems tempting to generalize local dimension by allowing the values of f(x) to lie on some

nonlinearmanifold of given dimension for any given x. This approach, if workable, might require a substantially more difficult proof.The "cellular decision processes" discussed previously in "Proposition 3" have exponentially high local dimension, meaning that this regret bound is ineffective for them. We can consider a variant in which, on every time step, only one cell or a very small number of cells (possibly chosen randomly) change. This would have low local dimension. One way to interpret it is, as a continuous time process in which each cell has a certain

rateof changing its state. EDIT (2019-07-06): It is actually possible to use Theorem 1 to get an effective regret bound for CDPs, by replacing the CDP with a different but equivalent MDP where each time step is replaced by a series of time steps in which the cells are "decided" one by one. It would be interesting to (i) have an explicit expression for the best possible bound among those generated by such "isomorphisms" for general MDPs (ii) derive bounds on RVO dimension forstochasticanalogues of CDPs (possibly defined using Markov random fields).## Proofs

Proof of Proposition 1Consider some {xk∈X}k∈[m] and x∗∈X which is (F,B)-independent of {xk}. Then, there are some f,~f∈F s.t. the following two inequalities hold

m−1∑k=0Bxk(f(xk)−~f(xk))⊗2≤1

Bx∗(f(x∗)−~f(x∗))⊗2>1

The first inequality implies that, for any k∈[m], Bxk(f(xk)−~f(xk))⊗2≤1. Comparing with the second inequality, we conclude that x∗≠xk.

Now suppose that {xk∈X}k∈[n] is s.t. for all m∈[n], xm is (F,B)-independent of {xk}k∈[m]. By the previous paragraph, it follows that xm≠xk for any m,k∈[n] s.t. k<m. Hence, n≤|X|,

Q.E.D.Proof of Proposition 2Suppose that {xk∈X}k∈[n] is s.t. for all m∈[n], xm is (F,B)-independent of {xk}k∈[m]. Then, for each m∈[n], we can choose fm,~fm∈F s.t. the following two inequalities hold

m−1∑k=0Bxk(fm(xk)−~fm(xk))⊗2≤1

Bxm(fm(xm)−~fm(xm))⊗2>1

In particular, the second inequality implies that fm≠~fm.

Now, consider some k∈[m]. By the first inequality

Bxk(fm(xk)−~fm(xk))⊗2≤1

On the other hand, the second inequality with k instead of m gives

Bxk(fk(xk)−~fk(xk))⊗2>1

It follows that {fm,~fm}≠{fk,~fk}. Therefore, n cannot exceed the number of unordered pairs of distinct elements in F,

Q.E.D.Given a measurable space X, some μ,ν∈ΔX and a measurable function f:X→R, we will use the notation

Ex∼μ−ν[f(x)]:=Ex∼μ[f(x)]−Ex∼ν[f(x)]

Given X,Y measurable spaces, some K,L:Xk→Y and a measurable function f:Y→R, we will use the notation

Ey∼K[f(y)|x]:=Ey∼K(x)[f(y)]

Ey∼K−L[f(y)|x]:=Ey∼K(x)−L(x)[f(y)]

Given finite sets S,A, some T:S×Ak→S and π:S→A, we will use the notation Tπ:Sk→S defined by

Tπ(s):=T(s,π(s))

Proposition A.1In the setting of Theorem 1, fix T∈N+ and γ∈(0,1). Let Π:H×S→A be a Borel measurable mapping s.t. Π(T) is an optimal policy, i.e.EUΠ(T)TR(γ)=EU∗TR(γ)

Let πPSζΠT:S∗×Sk→A be the policy implemented by a PSRL algorithm with prior ζ and episode length T, s.t. whenever hypothesis T is sampled, policy Π(T) is followed. Denote the Bayesian regret byR(γ):=ET∼ζ[EU∗TR(γ)−EUπPSζΠTTR(γ)]

Let (Ω,P) be a probability space governing both the uncertainty about the true hypothesis, the stochastic behavior of the environment and the random sampling inside the algorithm.[See the proof of "Lemma 1" in this previous essay or the proof of "Theorem 1" in another previous essay.]Furthermore, let H∗:Ω→H be a random variable representing the true hypothesis, {Hn:Ω→H}n∈N be the random variables s.t. Hn represents the hypothesis sampled at time n (i.e. during episode number ⌊n/T⌋), {Θn:Ω→S}n∈N be random variables s.t. Θn represents the state at time n and {An:Ω→A}n∈N be random variables s.t. An represents the action taken at time n. Denote El:=H∗Π(HlT) and ¯El:=HlTΠ(H∗). ThenR(γ)=∞∑n=0γn+1E[EHn−H∗[VHnR(γ)|Θn,An]]+∞∑l=0γ(l+1)TE[E¯ETl−ETl[VH∗R(γ)|ΘlT]]

Here, ETl means raising El to the T-th power w.r.t. Markov kernel composition, and the same for ¯ETl.

We will use the notation VTπR(s,γ) to stand for the expected utility achieved by policy π when starting from state s with transition kernel T, reward function R and time discount parameter γ.

Proof of Proposition A.1For any n∈N, define Πn:H×H×S∗×Sk→A as follows.

Πn(T1,T2,h,s):={Π(T1,s) if |h|<nΠ(T2,s) if |h|≥n

That is, Πn(T1,T2) is a policy that follows Π(T1) for time n and Π(T2) afterwards.

In the following, we use the shorthand notation

V∗(s):=VH∗R(s,γ)

Vl(s):=VHlTR(s,γ)

Vlk(s):=VH∗ΠT−k(HlT,H∗)R(s,γ)

It is easy to see that

R(γ)=∞∑l=0γlTE[V∗(ΘlT)−Vl0(ΘlT)]

The above is essentially what appeared as "Proposition B.1" before, for the special case of PSRL, and where we regard every episode as a single time step in a new MDP in which every action is a policy for the duration of an episode in the original MDP.

By definition, HlT and H∗ have the same distribution even when conditioned by the history up to lT. Therefore

E[V∗(ΘlT)]=E[Vl(ΘlT)]

It follows that

R(γ)=∞∑l=0γlTE[Vl(ΘlT)−Vl0(ΘlT)]

We now prove by induction on k∈[T+1] that

E[Vl(ΘlT)−Vl0(ΘlT)]=lT+k−1∑n=lTγn−lT+1E[EHn−H∗[Vl|Θn,An]]+γkE[Vl(ΘlT+k)−Vlk(ΘlT+k)]

For k=0 this is a tautology. For any k∈[T], the Bellman equation says that

Vl(s)=(1−γ)R(s)+γEHlTΠ(HlT)[Vl|s]

Vlk(s)=(1−γ)R(s)+γEH∗Π(HlT)[Vl,k+1|s]

It follows that

E[Vl(ΘlT+k)−Vlk(ΘlT+k)]=γE[EHlTΠ(HlT)[Vl|ΘlT+k]−EH∗Π(HlT)[Vl,k+1|ΘlT+k]]

Since Π(HlT) is exactly the policy followed by PSRL at time lT+k, we get

E[Vl(ΘlT+k)−Vlk(ΘlT+k)]=γE[EHlT[Vl|ΘlT+k,AlT+k]−EH∗[Vl,k+1|ΘlT+k,AlT+k]]

We now subtract and add EH∗[Vl|ΘlT+k,AlT+k], and use the fact that H∗(ΘlT+k,AlT+k) is the conditional distribution of ΘlT+k+1.

E[Vl(ΘlT+k)−Vlk(ΘlT+k)]=γE[EHlT−H∗[Vl|ΘlT+k,AlT+k]+Vl(ΘlT+k+1)−Vl,k+1(ΘlT+k+1)]

Applying this identity to the second second term on the right hand side of the induction hypothesis, we prove the induction step. For k=T, we get, denoting n0:=lT and n1:=(l+1)T

E[Vl(Θn0)−Vl0(Θn0)]=n1−1∑n=n0γn−n0+1E[EHn−H∗[Vl|Θn,An]]+γTE[Vl(Θn1)−V∗(Θn1)]

Clearly

E[Vl(Θn1)]=E[EETl[Vl∣∣Θn0]]

E[V∗(Θn1)]=E[EETl[V∗∣∣Θn0]]

Using the definition of PSRL, we can exchange and true and sampled hypothesis and get

E[Vl(Θn1)]=E[E¯ETl[V∗∣∣Θn0]]

It follows that

E[Vl(Θn0)−Vl0(Θn0)]=n1−1∑n=n0γn−n0+1E[EHn−H∗[Vl|Θn,An]]+γTE[E¯ETl−ETl[V∗∣∣Θn0]]

Applying this to each term in the earlier expression for R(γ), we get the desired result.

Q.E.D.Proposition A.2Consider a real finite-dimensional normed vector space Y and a linear subspace Z⊆Y. Then, there exists B∈PSD(Y) s.t.For any v∈Y, Bv⊗2≤∥v∥2For any v∈Z, (dimZ)2Bv⊗2≥∥v∥2Proof of Proposition A.2We assume dimZ>0 since otherwise the claim is trivial (take B=0).

By Theorem B.1 (see Appendix), there is ~B∈PSD(Z) s.t. for any v∈Z

~Bv⊗2≤∥v∥2≤dimZ⋅~Bv⊗2

By Corollary B.1 (see Appendix), there is a projection operator P:Y→Y s.t. imY=Z and ∥P∥≤√dimZ. We define B∈PSD(Y) by

B(v,w):=~B(Pv,Pw)dimZ

For any v∈Y, we have

Bv⊗2=~B(Pv,Pv)dimZ≤∥Pv∥2dimZ≤∥P∥2∥v∥2dimZ≤∥v∥2

For any v∈Z, we have Pv=v and therefore

∥v∥2≤dimZ⋅~Bv⊗2=(dimZ)2~B(Pv,Pv)dimZ=(dimZ)2Bv⊗2

Q.E.D.Proposition A.3Consider a finite-dimensional real vector space Y, some B∈PD(Y) and a Borel probability measure μ∈ΔY s.t. Pry∼μ[By⊗2≤1]=1. Let y0:=Ey∼μ[y] and σ:=2√2. Then, μ is σ-sub-Gaussian w.r.t. B, i.e., for any α∈Y∗Ey∼μ[exp(α(y−y0))]≤exp(σ2B−1α⊗22)

Proof of Proposition A.3By isomorphism, it is sufficient to consider the case Y=Rn, B(v,w)=v⋅w, α(v)=tv0 for some n∈N and t∈[0,∞). For this form of α it is sufficient to consider the case n=1. It remains to show that

Ey∼μ[et(y−y0)]≤e12σ2t2=e4t2

Here, we assume that Pry∼μ[|y|≤1]=1.

We consider separately the cases t≥12 and t<12. In the first case

Ey∼μ[et(y−y0)]≤maxy∈[−1,+1]et(y−y0)≤e2t≤e(2t)2=e4t2

In the second case, we use that for any x∈(−∞,+1]

ex<1+x+e2x2

The above holds because, at x=0 the left hand side and the right hand have the same value and first derivative, and for any x∈(−∞,+1), the second derivative of the left hand side is less than the second derivative of the right hand side. We get

Ey∼μ[et(y−y0)]≤Ey∼μ[1+t(y−y0)+e2t2(y−y0)2]=1+e2t2Vary∼μ[y]≤1+e2t2≤ee2t2≤e4t2

Q.E.D.Definition A.1Consider a set X, a finite-dimensional real vector space Y, some F⊆{X→Y} and a family {Bx∈PSD(Y)}x∈X. Assume F is compact w.r.t. the product topology on X→Y≅∏x∈XY. Consider also some n∈N, x∈Xn and y∈Yn. We then use the notationLSF[xy,B]:=argminf∈Fn−1∑m=0Bxm(f(xm)−ym)⊗2

CSF[xy,B]:={f∈F∣∣ ∣∣n−1∑m=0Bxm(f(xm)−LSF[xy,B](xm))⊗2≤1}

I chose the notation LS as an abbreviation of "least squares" and CS as an abbreviation of "confidence set". Note that LS is somewhat ambiguous (and therefore, so is CS) since there might be multiple minima, but this will not be important in the following (i.e. an arbitrary minimum can be chosen).

Proposition A.4There is some CA.4∈R+ s.t. the following holds.Consider finite sets X,S, some F⊆{Xk→S} and a family {Bx∈PSD(RS)}x∈X. Assume that for any x∈X and φ∈ΔS, Bxφ⊗2≤1. Let {Hn⊆P(Xω×Sω)}n∈N be the canonical filtration, i.e.Hn:={A′⊆Xω×Sω∣∣A′={xs|xs:n∈A}, A⊆Xn×Sn}

Consider also f∗∈F and μ∈Δ(Xω×Sω) s.t. for any n∈N, x∈X, and s∈SPrxs∼μ[sn=s|xn=x, Hn]=f∗(s∣x)

Fix ϵ∈R+, δ∈(0,1). Denoteβ(t):=CA.4(lnN(F,ϵ−2B)δ+ϵtlnetδ)

Then,Prxs∼μ[f∗∉∞⋂n=0CSF[xs:n,β(n+1)−1B]]≤δ

Proof of Proposition A.4For each x∈X, choose a finite set Ex and a linear mapping Px:RS→REx s.t. Bx(v,w)=Px(v)⋅Px(w). Let ~Y:=⨁x∈XREx. Define ~F⊆{X→~Y} by

~F:={~f:X→~Y∣∣~f(x)=Px(f(x)), f∈F}

Define Pω:Xω×Sω→Xω×~Yω by

Pω(xs)n:=xnPxn(sn)

Let ~f∗:=Pω(f∗) and ~μ:=Pω∗μ. By Proposition A.3, ~μ is 2√2-sub-Gaussian. Applying Proposition B.1 (see Appendix) to the tilde objects gives the desired result. Here, we choose the constant CA.4 s.t. the term M=1 in β as defined in Proposition B.1 is absorbed by the term σlnetδ≥1 (note that t≥1 since we substitute t=n+1, δ<1 and σ=2√2>1).

Q.E.D.Definition A.2Consider a set X, some x∈X, a real vector space Y, some F⊆{X→Y} and a family {Bx∈PSD(Y)}x∈X. TheB-width of F at xis defined byWF(x,B):=supf,~f∈F√Bx(f(x)−~f(x))⊗2

Proposition A.5There is some CA.5∈R+ s.t. the following holds.Consider a set X, a real vector space Y, some F⊆{X→Y} and a family {Bx∈PSD(Y)}x∈X. Consider also some x∈Xω, y∈Yω, T∈N+, γ∈(0,1), θ∈R+, η0,η1∈R+ and δ∈(0,1). Denoteβ(t):=η0+η1tlnetδ

For any n∈N, define Fn byFn:=CSF[xy:n,β(n+1)−1B]

Assume that γT>12. Then,∞∑l=0T−1∑m=0γlT+m[[WFlT(xlT+m,B)>θ]]≤CA.5dimRVOF⋅(θ−2β(11−γ)+T)

Proof of Proposition A.5For each x∈X, choose a finite set Ex and a linear mapping Px:Y→REx s.t. Bx(v,w)=Px(v)⋅Px(w). Let ~Y:=⨁x∈XREx. Define ~F⊆{X→~Y} by

~F:={~f:X→~Y∣∣~f(x)=Px(f(x)), f∈F}

Define also ~y by

~yn=Pxn(yn)

Applying Proposition B.2 (see Appendix) to the tilde objects, we conclude that for any N∈N

N−1∑l=0T−1∑m=0[[WFlT(xlT+m,B)>θ]]≤dimRVOF⋅(4θ−2β(NT)+T)

Multiplying the inequality by γNT and summing over N, we get

∞∑l=0(∞∑N=l+1γNT)T−1∑m=0[[WFlT(xlT+m,B)>θ]]≤dimRVOF∞∑N=0γNT(4θ−2β(NT)+T)

On the left hand side, we sum the geometric series. On the right hand side, we use the observation that

γNT(4θ−2β(NT)+T)≤1T∫T0γNT+t−T(4θ−2β(NT+t)+T)dt

γNT(4θ−2β(NT)+T)≤1TγT∫T0γNT+t(4θ−2β(NT+t)+T)dt

Here, we used that β(t) is an increasing function for t≥1 and β(0)=0. We get

∞∑l=0γ(l+1)T1−γTT−1∑m=0[[WFlT(xlT+m,B)>θ]]≤dimRVOFTγT∫∞0γt(4θ−2β(t)+T)dt

The integral on the right hand side is a Laplace transform (where the dual variable is s=ln1γ). The functions f(t)=1, f(t)=t and f(t)=tlnt all have the property that, for any s∈R+

L[f](s)≤1sf(1s)

Indeed, we have

L[1](s)=1s≤1s

L[t](s)=1s2≤1s⋅1s

L[tlnt](s)=−