TLDR: I derive a variant of the RL regret bound by Osband and Van Roy (2014), that applies to learning without resets of environments without traps. The advantage of this regret bound over those known in the literature, is that it scales with certain learningtheoretic dimensions rather than number of states and actions. My goal is building on this result to derive this type of regret bound for DRL, and, later on, other settings interesting from an AI alignment perspective.
Previously I derived a regret bound for deterministic environments that scales with prior entropy and "prediction dimension". That bound behaves as in the episodic setting but only as in the setting without resets. Moreover, my attempts to generalize the result to stochastic environments led to bounds that are even weaker (have a lower exponent). Therefore, I decided to put that line of attack on hold, and use Osband and Van Roy's technique instead, leading to an bound in the stochastic setting without resets. This bound doesn't scale down with the entropy of the prior, but this does not seem as important as dependence on .
Results
Russo and Van Roy (2013) introduced the concept of "eluder dimension" for the benefit of the multiarmed bandit setting, and Osband and Van Roy extended it to a form suitable for studying reinforcement learning. We will consider the following, slightly modified version of their definition.
Given a real vector space , we will use to denote the set of positive semidefinite bilinear forms on and to denote the set of positive definite bilinear forms on . Given a bilinear form , we will slightly abuse notation by also regarding it as a linear functional . Thereby, we have and . Also, if is finitedimensional and is nondegenerate, we will denote the unique bilinear form which satisfies
Definition 1
Consider a set , a real vector space , some and a family . Consider also , a sequence and . is said to be dependant on when, for any
Otherwise, is said to be independent of .
Definition 2
Consider a set , a real vector space and some . The RussoVan RoyOsband dimension (RVO dimension) is the supremum of the set of for which there is and s.t. for all , is independent of .
We have the following basic bounds on RVO dimension.
Proposition 1
Consider a set , a real vector space and some . Then
Proposition 2
Consider a set , a real vector space and some . Then
Another concept we need to formulate the regret bound is the Minkowski–Bouligand dimension.
Definition 3
Consider a set , a real vector space , some and a family . A set is said to be a covering of when
Definition 4
Consider a set , a real vector space , some and a family . The covering number is the infimum of the set of for which there is a covering of of size .
Definition 5
Consider a finite set , a finitedimensional real vector product space and some . Fix any . The Minkowski–Bouligand dimension (MB dimension) of is defined by
It is easy to see the above is indeed welldefined, i.e. doesn't depend on the choice of . This is because given any , there are constants s.t. for all and
Similarly, for any , we have
For finite and , it's obvious that , and in particular . It is also possible to show that, for any bounded , .
Note that, in general, MB dimension is fractional.
We will need yet another (but rather simple) notion of "dimension".
Definition 6
Consider a set , a vector space and some . The local dimension of is defined by
Obviously and .
Consider finite nonempty sets (states) and (actions). Observe that can be regarded as a subset of the vector space . This allows us to speak of the RVO, MB and local dimensions of a hypothesis class of transition kernels (in this case and ).
We can now formulate the regret bound.
Theorem 1
There is some s.t. the following holds.
Consider any finite nonempty sets and , , closed set and Borel probability measure on (prior). We define the maximal bias span by
Denote , and . Then, there is a family of policies s.t.
Here (like in previous essays), is the value function for transition kernel , reward function , state and geometric time discount parameter ; is the expected utility for policy ; is the maximal expected utility over policies. The expression in the numerator is, thereby, the Bayesian regret.
The implicit condition implies that for any and , . This is a notraps condition stronger than the condition we used before: not only that longterm value cannot be lost in expectation, it cannot be lost at all. For any finite satisfying this notraps condition, we have
A few directions for improving on this result:

It is not hard to see from the proof that it is also possible to write down a concrete bound for fixed (rather than considering the limit), but its form is somewhat convoluted.

It is probably possible to get an anytime policy with this form of regret, using PSRL with dynamic episode duration.

It is interesting to try to make do with the weaker notraps condition , especially since a stronger notraps conditions would translate to a stronger condition on the advisor in DRL.

It is interesting to study RVO dimension in more detail. For example, I'm not even sure whether Proposition 2 is the best possible bound in terms of or it's e.g. possible to get a linear bound.

It seems tempting to generalize local dimension by allowing the values of to lie on some nonlinear manifold of given dimension for any given . This approach, if workable, might require a substantially more difficult proof.

The "cellular decision processes" discussed previously in "Proposition 3" have exponentially high local dimension, meaning that this regret bound is ineffective for them. We can consider a variant in which, on every time step, only one cell or a very small number of cells (possibly chosen randomly) change. This would have low local dimension. One way to interpret it is, as a continuous time process in which each cell has a certain rate of changing its state. EDIT (20190706): It is actually possible to use Theorem 1 to get an effective regret bound for CDPs, by replacing the CDP with a different but equivalent MDP where each time step is replaced by a series of time steps in which the cells are "decided" one by one. It would be interesting to (i) have an explicit expression for the best possible bound among those generated by such "isomorphisms" for general MDPs (ii) derive bounds on RVO dimension for stochastic analogues of CDPs (possibly defined using Markov random fields).
Proofs
Proof of Proposition 1
Consider some and which is independent of . Then, there are some s.t. the following two inequalities hold
The first inequality implies that, for any , . Comparing with the second inequality, we conclude that .
Now suppose that is s.t. for all , is independent of . By the previous paragraph, it follows that for any s.t. . Hence, , Q.E.D.
Proof of Proposition 2
Suppose that is s.t. for all , is independent of . Then, for each , we can choose s.t. the following two inequalities hold
In particular, the second inequality implies that .
Now, consider some . By the first inequality
On the other hand, the second inequality with instead of gives
It follows that . Therefore, cannot exceed the number of unordered pairs of distinct elements in , Q.E.D.
Given a measurable space , some and a measurable function , we will use the notation
Given measurable spaces, some and a measurable function , we will use the notation
Given finite sets , some and , we will use the notation defined by
Proposition A.1
In the setting of Theorem 1, fix and . Let be a Borel measurable mapping s.t. is an optimal policy, i.e.
Let be the policy implemented by a PSRL algorithm with prior and episode length , s.t. whenever hypothesis is sampled, policy is followed. Denote the Bayesian regret by
Let be a probability space governing both the uncertainty about the true hypothesis, the stochastic behavior of the environment and the random sampling inside the algorithm. [See the proof of "Lemma 1" in this previous essay or the proof of "Theorem 1" in another previous essay.] Furthermore, let be a random variable representing the true hypothesis, be the random variables s.t. represents the hypothesis sampled at time (i.e. during episode number ), be random variables s.t. represents the state at time and be random variables s.t. represents the action taken at time . Denote and . Then
Here, means raising to the th power w.r.t. Markov kernel composition, and the same for .
We will use the notation to stand for the expected utility achieved by policy when starting from state with transition kernel , reward function and time discount parameter .
Proof of Proposition A.1
For any , define as follows.
That is, is a policy that follows for time and afterwards.
In the following, we use the shorthand notation
It is easy to see that
The above is essentially what appeared as "Proposition B.1" before, for the special case of PSRL, and where we regard every episode as a single time step in a new MDP in which every action is a policy for the duration of an episode in the original MDP.
By definition, and have the same distribution even when conditioned by the history up to . Therefore
It follows that
We now prove by induction on that
For this is a tautology. For any , the Bellman equation says that
It follows that
Since is exactly the policy followed by PSRL at time , we get
We now subtract and add , and use the fact that is the conditional distribution of .
Applying this identity to the second second term on the right hand side of the induction hypothesis, we prove the induction step. For , we get, denoting and
Clearly
Using the definition of PSRL, we can exchange and true and sampled hypothesis and get
It follows that
Applying this to each term in the earlier expression for , we get the desired result. Q.E.D.
Proposition A.2
Consider a real finitedimensional normed vector space and a linear subspace . Then, there exists s.t.

For any ,

For any ,
Proof of Proposition A.2
We assume since otherwise the claim is trivial (take ).
By Theorem B.1 (see Appendix), there is s.t. for any
By Corollary B.1 (see Appendix), there is a projection operator s.t. and . We define by
For any , we have
For any , we have and therefore
Q.E.D.
Proposition A.3
Consider a finitedimensional real vector space , some and a Borel probability measure s.t. . Let and . Then, is subGaussian w.r.t. , i.e., for any
Proof of Proposition A.3
By isomorphism, it is sufficient to consider the case , , for some and . For this form of it is sufficient to consider the case . It remains to show that
Here, we assume that .
We consider separately the cases and . In the first case
In the second case, we use that for any
The above holds because, at the left hand side and the right hand have the same value and first derivative, and for any , the second derivative of the left hand side is less than the second derivative of the right hand side. We get
Q.E.D.
Definition A.1
Consider a set , a finitedimensional real vector space , some and a family . Assume is compact w.r.t. the product topology on . Consider also some , and . We then use the notation
I chose the notation as an abbreviation of "least squares" and as an abbreviation of "confidence set". Note that is somewhat ambiguous (and therefore, so is ) since there might be multiple minima, but this will not be important in the following (i.e. an arbitrary minimum can be chosen).
Proposition A.4
There is some s.t. the following holds.
Consider finite sets , some and a family . Assume that for any and , . Let be the canonical filtration, i.e.
Consider also and s.t. for any , , and
Fix , . Denote
Then,
Proof of Proposition A.4
For each , choose a finite set and a linear mapping s.t. . Let . Define by
Define by
Let and . By Proposition A.3, is subGaussian. Applying Proposition B.1 (see Appendix) to the tilde objects gives the desired result. Here, we choose the constant s.t. the term in as defined in Proposition B.1 is absorbed by the term (note that since we substitute , and ). Q.E.D.
Definition A.2
Consider a set , some , a real vector space , some and a family . The width of at is defined by
Proposition A.5
There is some s.t. the following holds.
Consider a set , a real vector space , some and a family . Consider also some , , , , , and . Denote
For any , define by
Assume that . Then,
Proof of Proposition A.5
For each , choose a finite set and a linear mapping s.t. . Let . Define by
Define also by
Applying Proposition B.2 (see Appendix) to the tilde objects, we conclude that for any
Multiplying the inequality by and summing over , we get
On the left hand side, we sum the geometric series. On the right hand side, we use the observation that