Why DRL doesn't work for arbitrary environments

by Vanessa Kosoy, 30th Nov 2017


Previously, I presented the theory of DRL for finite MDPs. Although I believe this theory can be generalized well beyond finite MDPs (at least to finite POMDPs and to infinite POMDPs that satisfy some geometric and/or dynamical-systems-theoretic assumptions), it cannot work for arbitrary environments without requiring the advisor to be more than "merely sane". Constructing a counterexample is not difficult, but it seemed worthwhile to write it down.

Let $\mathcal{A} = \{a, b, c\}$ be the set of actions and $\mathcal{O} = \{0, 1\}$ the set of observations. For a history $h$, define $T(h)$ to be the first time the action $c$ (which stands for "commit") appears in $h$.

Now, consider two environments, $\mu_a$ and $\mu_b$, described below.

Also, define the reward function $r$ by setting $r(h) = 1$ for any history $h$ that ends with the observation $1$ and $r(h) = 0$ for any other history. Denote $\upsilon_a = (\mu_a, r)$ and $\upsilon_b = (\mu_b, r)$. That is, both universes count the number of actions $a$ and $b$ until time $T$ (the first time action $c$ is taken). At times between $T$ and $2T$, they produce rewards with frequency equal to the relative fraction of the corresponding action in the count ($a$ for $\upsilon_a$, $b$ for $\upsilon_b$). Before $T$ and after $2T$, they produce no rewards.
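To make the construction concrete, here is a minimal Python sketch of the two environments as described in the prose above. It is an illustration under stated assumptions, not the formal definition: the class name `CommitEnv`, the action labels `"a"`, `"b"`, `"c"`, the choice of a reward window exactly as long as the pre-commit phase, and the deterministic spreading of rewards over that window (instead of sampling at the stated frequency) are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CommitEnv:
    """Toy environment: rewards appear only after the commit action "c"."""

    favored: str  # "a" or "b": the action this universe ends up rewarding
    counts: Dict[str, int] = field(default_factory=lambda: {"a": 0, "b": 0})
    commit_time: Optional[int] = None  # first time "c" was taken, if any
    t: int = 0

    def step(self, action: str) -> int:
        reward = 0
        if self.commit_time is None:
            # Before commit: count "a"/"b" and never emit a reward.
            if action == "c":
                self.commit_time = self.t
            else:
                self.counts[action] += 1
        else:
            # After commit: actions are ignored.
            window = self.counts["a"] + self.counts["b"]  # assumed window length
            k = self.t - self.commit_time - 1  # steps elapsed since commit
            if k < window:
                n = self.counts[self.favored]
                # Assumption: spread exactly n rewards evenly over the window.
                reward = ((k + 1) * n) // window - (k * n) // window
        self.t += 1
        return reward


def run(env: CommitEnv, actions: List[str]) -> List[int]:
    """Feed a fixed action sequence to the environment and collect rewards."""
    return [env.step(a) for a in actions]


actions = ["a", "a", "b", "c"] + ["a"] * 6
obs_a = run(CommitEnv("a"), actions)
obs_b = run(CommitEnv("b"), actions)
```

In this run the observations up to and including the commit step are all zero in both environments, so nothing seen before committing tells them apart. After the commit, the first environment pays out a total of 2 (the number of times `"a"` was taken) and the second a total of 1 (the number of times `"b"` was taken).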

Now, we haven't defined sane policies for arbitrary environments, but the spirit of the definition is that a sane policy is unlikely to take actions with major irreversible long-term negative consequences and has to have some non-negligible probability of taking an optimal (or at least nearly optimal) action. For example, we might define it as follows:


Consider , and a universe . A policy is called -sane for when for any



Here, we consider the asymptotics , but should include the scenario so that actions with short-term negative consequences are allowed (otherwise this would be some sort of "rationality" requirement, similar to what's needed for DIRL, rather than a mere "sanity" requirement).

The optimal policy for $\upsilon_a$ (respectively $\upsilon_b$) is taking action $a$ (resp. $b$) until some time and taking action $c$ at this time (obviously, the following actions don't affect anything). Let $\pi^*_a$, $\pi^*_b$ be policies that are optimal even off-policy (so that their decision when to take action $c$ depends on the history of actions before). It is easy to see that the advisor policies $\sigma_a$, $\sigma_b$ described next are sane for the respective universes, for some "legitimate" choice of the sanity parameters.

That is, whenever $\pi^*$ suggests taking action $c$, $\sigma$ complies with it, and whenever $\pi^*$ suggests taking action $a$ or $b$, $\sigma$ flips a coin to decide which one to take. Indeed, in expectation, taking the wrong action out of $\{a, b\}$ loses one time moment of reward and is therefore equivalent to a "short-term" loss.
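The coin-flipping advisor can be sketched in a few lines of Python. This is only an illustration: the function names `advisor_distribution` and `advisor_action`, the parameter `optimal_suggestion`, and the action labels `"a"`, `"b"`, `"c"` are hypothetical, and the optimal policy itself is not modelled, only its suggestion at the current step.

```python
import random
from typing import Dict


def advisor_distribution(optimal_suggestion: str) -> Dict[str, float]:
    """Action distribution of the coin-flipping advisor for one step."""
    if optimal_suggestion == "c":
        # Always comply with a suggestion to commit.
        return {"c": 1.0}
    if optimal_suggestion in ("a", "b"):
        # Fair coin between the two non-commit actions,
        # regardless of which one was actually suggested.
        return {"a": 0.5, "b": 0.5}
    raise ValueError(f"unknown action: {optimal_suggestion!r}")


def advisor_action(optimal_suggestion: str, rng: random.Random) -> str:
    """Sample one action from the advisor's distribution."""
    dist = advisor_distribution(optimal_suggestion)
    actions = list(dist)
    return rng.choices(actions, weights=[dist[x] for x in actions])[0]
```

The point of the sketch is that the advisor's distribution depends on the optimal policy only through whether it suggests committing, so the advisor's choice between the two non-commit actions carries no information about which universe is the true one. If the reward window is as long as the pre-commit phase, the total payout equals the number of times the rewarded action was taken, so each coin flip that lands on the other action costs one unit of reward.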

It is thus impossible for the agent to distinguish between the universes $\upsilon_a$ and $\upsilon_b$ until action $c$ is taken: both the environment and the advisor behave identically until this point. On the other hand, after action $c$ is taken it is too late to do anything. Evidently, the hypothesis class $\{\upsilon_a, \upsilon_b\}$ is not learnable.

In principle, it might be possible to get around this obstacle by formulating a condition on the advisor in which losses that are minor in size but far away in time are also ruled out. However, such a condition might prove too stringent. Mostly, I hope to ultimately extend the formalism to incomplete models in which case certain restrictions on the type of incomplete model might be acceptable. For example, the incomplete equivalent of an MDP might be a stochastic game against some arbitrary "opponent."
