% operators that are separated from the operand by a space

% autosize deliminaters

% operators that require brackets

% operators that require parentheses

% Paper specific

Previously, I presented the theory of DRL for finite MDPs. Now, although I believe this theory can be generalized much beyond finite MDPs (at least to finite POMDPs and infinite POMDPs that satisfy some geometric and/or dynamical systems theoretic assumptions), it cannot work for arbitrary environments (without requiring the advisor to be more than "merely sane"). Constructing a counterexample is not difficult, but it seemed worthwhile to write it down.

Let A:={a,b,c} be the set of actions and O:={o−,o+} the set of observations. Define τ:(A×O)∗→N⊔{∞} s.t. for any n∈N and h∈(A×O)n

τ(h):={∞ if ∀m∈[n]:hm∉c×Omin{m∈[n]∣hm∈c×O} otherwise

That is, τ(h) is the first time action c (stands for "commit") appears in the history h. Now, consider the following environments μa, μb.

μa(o+∣h):=⎧⎪
⎪⎨⎪
⎪⎩0 if τ(h)=∞0 if τ(h)≤n21τ(h)|{m∈[τ(h)]∣hm∈a×O}| otherwise

μb(o+∣h):=⎧⎪
⎪⎨⎪
⎪⎩0 if τ(h)=∞0 if τ(h)≤n21τ(h)|{m∈[τ(h)]∣hm∈b×O}| otherwise

Also, define the reward function r by setting r(h)=1 for any history h that ends with o+ and r(h)=0 for any other history. Denote υa,b:=(μa,b,r). That is, both universes count the number of actions a and b until time τ (the first time action c is taken). At times between τ and 2τ, they produce rewards with frequency that equals the relative fraction of the corresponding action in the count. Before τ and after 2τ, they produce no rewards.

Now, we haven't defined sane policies for arbitrary environments, but the spirit of the definition is that a sane policy is unlikely to take actions with major irreversible long-term negative consequences and has to have some non-negligible probability of taking an optimal (or at least nearly optimal) action. For example, we might define it as follows

#Definition

Consider β∈(0,∞), γ,ϵ∈(0,1) and a universe υ=(μ,r). A policy σ is called (ϵ,β)-sane for (υ,γ) when for any h∈hdomμ

i. ∃a∗∈argmaxa∈AQυγ(h,a):σ(a∗∣h)≥ϵ

ii. ∀a∈A:σ(a∣h)≤exp[β(Qυγ(h,a)−Vυγ(h))]

Here, we consider the asymptotics γ→1, β→∞ but should include the scenario β=o(11−γ) so that actions with short-term negative consequences are allowed (otherwise this would be some sort of "rationality" requirement, similar to what's needed for DIRL, rather than a mere "sanity" requirement).

The optimal policy for υa (respectively υb) is taking action a (resp. b) until some time τ∗(γ)=Θ(11−γ) and taking action c at this time (obviously, the following actions don't affect anything). Let π∗a,b be policies that are optimal even off-policy (so that their decision when to take action c depends on the history of actions before). It is easy to see that the following policies σa,b are (12,β)-sane for the respective universes and some "legitimate" choice of β:

σa(a∣h):=σa(b∣h):=12π∗a(a∣h)

σb(a∣h):=σb(b∣h):=12π∗b(b∣h)

That is, whenever π∗ suggests taking action c, σ complies with it, and whenever π∗ suggests taking action a or b, σ flips a coin to decide which one to take. Indeed, in expectation, taking the wrong action out of {a,b} loses one time moment of reward and is therefore equivalent to a "short-term" loss.

It is thus impossible for the agent to distinguish between the universes υa and υb until action c is taken: both the environment and the advisor behave identically until this point. On the other hand, after action c is taken it is too late to do anything. Evidently, the hypothesis class H:={¯υa[σa],¯υb[σb]} is not learnable.

In principle, it might be possible to get around this obstacle by formulating a condition on the advisor in which losses that are minor in size but far away in time are also ruled out. However, such a condition might prove too stringent. Mostly, I hope to ultimately extend the formalism to incomplete models in which case certain restrictions on the type of incomplete model might be acceptable. For example, the incomplete equivalent of an MDP might be a stochastic game against some arbitrary "opponent."

% operators that are separated from the operand by a space

% autosize deliminaters

% operators that require brackets

% operators that require parentheses

% Paper specific

Previously, I presented the theory of DRL for finite MDPs. Now, although I believe this theory can be generalized much beyond finite MDPs (at least to finite POMDPs and infinite POMDPs that satisfy some geometric and/or dynamical systems theoretic assumptions), it cannot work for arbitrary environments (without requiring the advisor to be more than "merely sane"). Constructing a counterexample is not difficult, but it seemed worthwhile to write it down.

Let A:={a,b,c} be the set of actions and O:={o−,o+} the set of observations. Define τ:(A×O)∗→N⊔{∞} s.t. for any n∈N and h∈(A×O)n

τ(h):={∞ if ∀m∈[n]:hm∉c×Omin{m∈[n]∣hm∈c×O} otherwise

That is, τ(h) is the first time action c (stands for "commit") appears in the history h. Now, consider the following environments μa, μb.

μa(o+∣h):=⎧⎪ ⎪⎨⎪ ⎪⎩0 if τ(h)=∞0 if τ(h)≤n21τ(h)|{m∈[τ(h)]∣hm∈a×O}| otherwise

μb(o+∣h):=⎧⎪ ⎪⎨⎪ ⎪⎩0 if τ(h)=∞0 if τ(h)≤n21τ(h)|{m∈[τ(h)]∣hm∈b×O}| otherwise

Also, define the reward function r by setting r(h)=1 for any history h that ends with o+ and r(h)=0 for any other history. Denote υa,b:=(μa,b,r). That is, both universes count the number of actions a and b until time τ (the first time action c is taken). At times between τ and 2τ, they produce rewards with frequency that equals the relative fraction of the corresponding action in the count. Before τ and after 2τ, they produce no rewards.

Now, we haven't defined sane policies for arbitrary environments, but the spirit of the definition is that a sane policy is unlikely to take actions with major irreversible long-term negative consequences and has to have some non-negligible probability of taking an optimal (or at least nearly optimal) action. For example, we might define it as follows

#Definition

Consider β∈(0,∞), γ,ϵ∈(0,1) and a universe υ=(μ,r). A policy σ is called

(ϵ,β)-sanefor (υ,γ) when for any h∈hdomμi. ∃a∗∈argmaxa∈AQυγ(h,a):σ(a∗∣h)≥ϵ

ii. ∀a∈A:σ(a∣h)≤exp[β(Qυγ(h,a)−Vυγ(h))]

Here, we consider the asymptotics γ→1, β→∞ but should include the scenario β=o(11−γ) so that actions with

short-termnegative consequences are allowed (otherwise this would be some sort of "rationality" requirement, similar to what's needed for DIRL, rather than a mere "sanity" requirement).The optimal policy for υa (respectively υb) is taking action a (resp. b) until some time τ∗(γ)=Θ(11−γ) and taking action c at this time (obviously, the following actions don't affect anything). Let π∗a,b be policies that are optimal

even off-policy(so that their decision when to take action c depends on the history of actions before). It is easy to see that the following policies σa,b are (12,β)-sane for the respective universes and some "legitimate" choice of β:σa(a∣h):=σa(b∣h):=12π∗a(a∣h)

σb(a∣h):=σb(b∣h):=12π∗b(b∣h)

That is, whenever π∗ suggests taking action c, σ complies with it, and whenever π∗ suggests taking action a or b, σ flips a coin to decide which one to take. Indeed, in expectation, taking the wrong action out of {a,b} loses one time moment of reward and is therefore equivalent to a "short-term" loss.

It is thus impossible for the agent to distinguish between the universes υa and υb until action c is taken: both the environment and the advisor behave identically until this point. On the other hand, after action c is taken it is too late to do anything. Evidently, the hypothesis class H:={¯υa[σa],¯υb[σb]} is

notlearnable.In principle, it might be possible to get around this obstacle by formulating a condition on the advisor in which losses that are minor in size but far away in time are also ruled out. However, such a condition might prove too stringent. Mostly, I hope to ultimately extend the formalism to incomplete models in which case certain restrictions on the type of incomplete model might be acceptable. For example, the incomplete equivalent of an MDP might be a stochastic game against some arbitrary "opponent."