TLDR: I derive a variant of the RL regret bound by Osband and Van Roy (2014), that applies to learning without resets of environments without traps. The advantage of this regret bound over those known in the literature, is that it scales with certain learning-theoretic dimensions rather than number of states and actions. My goal is building on this result to derive this type of regret bound for DRL, and, later on, other settings interesting from an AI alignment perspective.

Previously I derived a regret bound for deterministic environments that scales with prior entropy and "prediction dimension". That bound behaves as in the episodic setting but only as in the setting without resets. Moreover, my attempts to generalize the result to stochastic environments led to bounds that are even weaker (have a lower exponent). Therefore, I decided to put that line of attack on hold, and use Osband and Van Roy's technique instead, leading to an bound in the stochastic setting without resets. This bound doesn't scale down with the entropy of the prior, but this does not seem as important as dependence on .


Russo and Van Roy (2013) introduced the concept of "eluder dimension" for the benefit of the multi-armed bandit setting, and Osband and Van Roy extended it to a form suitable for studying reinforcement learning. We will consider the following, slightly modified version of their definition.

Given a real vector space , we will use to denote the set of positive semidefinite bilinear forms on and to denote the set of positive definite bilinear forms on . Given a bilinear form , we will slightly abuse notation by also regarding it as a linear functional . Thereby, we have and . Also, if is finite-dimensional and is non-degenerate, we will denote the unique bilinear form which satisfies

Definition 1

Consider a set , a real vector space , some and a family . Consider also , a sequence and . is said to be -dependant on when, for any

Otherwise, is said to be -independent of .

Definition 2

Consider a set , a real vector space and some . The Russo-Van Roy-Osband dimension (RVO dimension) is the supremum of the set of for which there is and s.t. for all , is -independent of .

We have the following basic bounds on RVO dimension.

Proposition 1

Consider a set , a real vector space and some . Then

Proposition 2

Consider a set , a real vector space and some . Then

Another concept we need to formulate the regret bound is the Minkowski–Bouligand dimension.

Definition 3

Consider a set , a real vector space , some and a family . A set is said to be a -covering of when

Definition 4

Consider a set , a real vector space , some and a family . The covering number is the infimum of the set of for which there is a -covering of of size .

Definition 5

Consider a finite set , a finite-dimensional real vector product space and some . Fix any . The Minkowski–Bouligand dimension (MB dimension) of is defined by

It is easy to see the above is indeed well-defined, i.e. doesn't depend on the choice of . This is because given any , there are constants s.t. for all and

Similarly, for any , we have

For finite and , it's obvious that , and in particular . It is also possible to show that, for any bounded , .

Note that, in general, MB dimension is fractional.

We will need yet another (but rather simple) notion of "dimension".

Definition 6

Consider a set , a vector space and some . The local dimension of is defined by

Obviously and .

Consider finite non-empty sets (states) and (actions). Observe that can be regarded as a subset of the vector space . This allows us to speak of the RVO, MB and local dimensions of a hypothesis class of transition kernels (in this case and ).

We can now formulate the regret bound.

Theorem 1

There is some s.t. the following holds.

Consider any finite non-empty sets and , , closed set and Borel probability measure on (prior). We define the maximal bias span by

Denote , and . Then, there is a family of policies s.t.

Here (like in previous essays), is the value function for transition kernel , reward function , state and geometric time discount parameter ; is the expected utility for policy ; is the maximal expected utility over policies. The expression in the numerator is, thereby, the Bayesian regret.

The implicit condition implies that for any and , . This is a no-traps condition stronger than the condition we used before: not only that long-term value cannot be lost in expectation, it cannot be lost at all. For any finite satisfying this no-traps condition, we have

A few directions for improving on this result:

  • It is not hard to see from the proof that it is also possible to write down a concrete bound for fixed (rather than considering the limit), but its form is somewhat convoluted.

  • It is probably possible to get an anytime policy with this form of regret, using PSRL with dynamic episode duration.

  • It is interesting to try to make do with the weaker no-traps condition , especially since a stronger no-traps conditions would translate to a stronger condition on the advisor in DRL.

  • It is interesting to study RVO dimension in more detail. For example, I'm not even sure whether Proposition 2 is the best possible bound in terms of or it's e.g. possible to get a linear bound.

  • It seems tempting to generalize local dimension by allowing the values of to lie on some nonlinear manifold of given dimension for any given . This approach, if workable, might require a substantially more difficult proof.

  • The "cellular decision processes" discussed previously in "Proposition 3" have exponentially high local dimension, meaning that this regret bound is ineffective for them. We can consider a variant in which, on every time step, only one cell or a very small number of cells (possibly chosen randomly) change. This would have low local dimension. One way to interpret it is, as a continuous time process in which each cell has a certain rate of changing its state. EDIT (2019-07-06): It is actually possible to use Theorem 1 to get an effective regret bound for CDPs, by replacing the CDP with a different but equivalent MDP where each time step is replaced by a series of time steps in which the cells are "decided" one by one. It would be interesting to (i) have an explicit expression for the best possible bound among those generated by such "isomorphisms" for general MDPs (ii) derive bounds on RVO dimension for stochastic analogues of CDPs (possibly defined using Markov random fields).


Proof of Proposition 1

Consider some and which is -independent of . Then, there are some s.t. the following two inequalities hold

The first inequality implies that, for any , . Comparing with the second inequality, we conclude that .

Now suppose that is s.t. for all , is -independent of . By the previous paragraph, it follows that for any s.t. . Hence, , Q.E.D.

Proof of Proposition 2

Suppose that is s.t. for all , is -independent of . Then, for each , we can choose s.t. the following two inequalities hold

In particular, the second inequality implies that .

Now, consider some . By the first inequality

On the other hand, the second inequality with instead of gives

It follows that . Therefore, cannot exceed the number of unordered pairs of distinct elements in , Q.E.D.

Given a measurable space , some and a measurable function , we will use the notation

Given measurable spaces, some and a measurable function , we will use the notation

Given finite sets , some and , we will use the notation defined by

Proposition A.1

In the setting of Theorem 1, fix and . Let be a Borel measurable mapping s.t. is an optimal policy, i.e.

Let be the policy implemented by a PSRL algorithm with prior and episode length , s.t. whenever hypothesis is sampled, policy is followed. Denote the Bayesian regret by

Let be a probability space governing both the uncertainty about the true hypothesis, the stochastic behavior of the environment and the random sampling inside the algorithm. [See the proof of "Lemma 1" in this previous essay or the proof of "Theorem 1" in another previous essay.] Furthermore, let be a random variable representing the true hypothesis, be the random variables s.t. represents the hypothesis sampled at time (i.e. during episode number ), be random variables s.t. represents the state at time and be random variables s.t. represents the action taken at time . Denote and . Then

Here, means raising to the -th power w.r.t. Markov kernel composition, and the same for .

We will use the notation to stand for the expected utility achieved by policy when starting from state with transition kernel , reward function and time discount parameter .

Proof of Proposition A.1

For any , define as follows.

That is, is a policy that follows for time and afterwards.

In the following, we use the shorthand notation

It is easy to see that

The above is essentially what appeared as "Proposition B.1" before, for the special case of PSRL, and where we regard every episode as a single time step in a new MDP in which every action is a policy for the duration of an episode in the original MDP.

By definition, and have the same distribution even when conditioned by the history up to . Therefore

It follows that

We now prove by induction on that

For this is a tautology. For any , the Bellman equation says that

It follows that

Since is exactly the policy followed by PSRL at time , we get

We now subtract and add , and use the fact that is the conditional distribution of .

Applying this identity to the second second term on the right hand side of the induction hypothesis, we prove the induction step. For , we get, denoting and


Using the definition of PSRL, we can exchange and true and sampled hypothesis and get

It follows that

Applying this to each term in the earlier expression for , we get the desired result. Q.E.D.

Proposition A.2

Consider a real finite-dimensional normed vector space and a linear subspace . Then, there exists s.t.

  1. For any ,

  2. For any ,

Proof of Proposition A.2

We assume since otherwise the claim is trivial (take ).

By Theorem B.1 (see Appendix), there is s.t. for any

By Corollary B.1 (see Appendix), there is a projection operator s.t. and . We define by

For any , we have

For any , we have and therefore


Proposition A.3

Consider a finite-dimensional real vector space , some and a Borel probability measure s.t. . Let and . Then, is -sub-Gaussian w.r.t. , i.e., for any

Proof of Proposition A.3

By isomorphism, it is sufficient to consider the case , , for some and . For this form of it is sufficient to consider the case . It remains to show that

Here, we assume that .

We consider separately the cases and . In the first case

In the second case, we use that for any

The above holds because, at the left hand side and the right hand have the same value and first derivative, and for any , the second derivative of the left hand side is less than the second derivative of the right hand side. We get


Definition A.1

Consider a set , a finite-dimensional real vector space , some and a family . Assume is compact w.r.t. the product topology on . Consider also some , and . We then use the notation

I chose the notation as an abbreviation of "least squares" and as an abbreviation of "confidence set". Note that is somewhat ambiguous (and therefore, so is ) since there might be multiple minima, but this will not be important in the following (i.e. an arbitrary minimum can be chosen).

Proposition A.4

There is some s.t. the following holds.

Consider finite sets , some and a family . Assume that for any and , . Let be the canonical filtration, i.e.

Consider also and s.t. for any , , and

Fix , . Denote


Proof of Proposition A.4

For each , choose a finite set and a linear mapping s.t. . Let . Define by

Define by

Let and . By Proposition A.3, is -sub-Gaussian. Applying Proposition B.1 (see Appendix) to the tilde objects gives the desired result. Here, we choose the constant s.t. the term in as defined in Proposition B.1 is absorbed by the term (note that since we substitute , and ). Q.E.D.

Definition A.2

Consider a set , some , a real vector space , some and a family . The -width of at is defined by

Proposition A.5

There is some s.t. the following holds.

Consider a set , a real vector space , some and a family . Consider also some , , , , , and . Denote

For any , define by

Assume that . Then,

Proof of Proposition A.5

For each , choose a finite set and a linear mapping s.t. . Let . Define by

Define also by

Applying Proposition B.2 (see Appendix) to the tilde objects, we conclude that for any

Multiplying the inequality by and summing over , we get

On the left hand side, we sum the geometric series. On the right hand side, we use the observation that

Here, we used that is an increasing function for and . We get

The integral on the right hand side is a Laplace transform (where the dual variable is ). The functions , and all have the property that, for any

Indeed, we have