TLDR: We define a variant of reinforcement learning in which the reward is not perceived directly, but can be estimated at any given moment by some (possibly costly) experiment. The reward function is no longer a function of the observation history, but a different object that we call "instrumental reward function". We give two definitions of instrumental reward function and prove their equivalence. We also derive a regret bound for this setting.

Background

In "classical" reinforcement learning the agent perceives the reward signal on every round of its interaction with the environment, whether through a distinct input channel or through some given way to compute the reward from the interaction history so far. On the other hand, we can rather easily imagine agents that optimize properties of their environment that they do not directly perceive. For example, if Alice, who lives in Dominica, donates money to the Against Malaria Foundation in order to save someone in Africa, then the result is usually not visible to Alice at the time it occurs, if ever. Similarly, Clippy the paperclip maximizer doesn't always perceive all the paperclips in the universe. Moreover, we might want to design agents that, in order to estimate the reward, direct queries to humans (which is costly and cannot be done continuously non-stop).

Now, it is possible to define the perceived reward as the subjective expected value of the "true" imperceptible reward (see the Results section for details). Although this transformation preserves expected utility, it does not preserve Bayesian regret. Indeed, Bayesian regret is the difference between the expected utility attained by the agent and the expected utility attained by a "reference" agent that knows the true environment from the outset. However, after the transformation, the reference agent will behave as if it knows the observable dynamics of the true environment but still pretends not to know the true environment for the purpose of computing the reward. Therefore, regret analysis requires us to consider the imperceptible reward honestly. In fact, as we will see, certain assumptions about the imperceptible reward function are needed even to make the problem learnable at all. Finally, this transformation makes the reward function more complex and hence applying a "generic" reinforcement learning algorithm after the transformation (instead of exploiting the special form of the resulting reward function) might carry a significant computational complexity penalty.
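
To make this transformation concrete, here is a minimal sketch in Python, assuming a finite hypothesis class and finite state and observation spaces; the representation of a hypothesis as a (dynamics, reward) pair and all names are illustrative choices, not notation from this post.

```python
# A minimal sketch of the "perceived reward" transformation, assuming a finite
# hypothesis class and finite state/observation spaces; all names illustrative.

def posterior(history, hypotheses, prior, initial_state):
    """Posterior over (hypothesis index, hidden state) pairs given a history,
    i.e. a list of (action, observation) pairs with positive probability."""
    belief = {(k, initial_state): p for k, p in enumerate(prior)}
    for action, obs in history:
        updated = {}
        for (k, state), p in belief.items():
            dynamics, _reward = hypotheses[k]
            # dynamics(state, action) -> iterable of (prob, next_state, observation)
            for q, next_state, o in dynamics(state, action):
                if o == obs:
                    key = (k, next_state)
                    updated[key] = updated.get(key, 0.0) + p * q
        total = sum(updated.values())
        belief = {key: w / total for key, w in updated.items()}
    return belief

def perceived_reward(history, hypotheses, prior, initial_state):
    """Subjective expected value of the imperceptible reward after `history`."""
    belief = posterior(history, hypotheses, prior, initial_state)
    return sum(p * hypotheses[k][1](state) for (k, state), p in belief.items())
```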

Related Work: De Blanc 2011 studies so-called "ontological crises". That is, de Blanc examines the problem of translating a reward function from one ontology into another. Here, we avoid this problem by considering reward functions that are automatically defined in all ontologies. That said, it might still be interesting to think about how to specify our type of reward function starting from a reward function that is only defined in one particular ontology. We will return to this in the Discussion section.

Krueger et al 2016 consider a setting where querying the reward is instantaneous and has a fixed cost. This is a special case of our setting, but we allow a much more general type of query and cost. Also, Krueger et al don't derive any theoretical regret bound (but they do present some experimental results).

Finally, multi-armed bandits with partial monitoring are closely related, see for example Bartok et al 2013. However, bandits by definition assume a stateless environment, and our approach is also rather different from what is usually studied in partial monitoring.

My search of the literature was very cursory, and I will be glad to learn about prior work I missed!

Results

Partially Observable MDPs with Imperceptible Rewards

Partially Observable Markov Decision Processes (POMDPs) serve as a natural starting point for thinking about imperceptible rewards. Indeed, it might seem like all we have to do is consider RL in a POMDP environment and let the reward be a function of the (imperceptible) state. However, this setting is in general unlearnable even given assumptions that rule out traps.

A (finite) POMDP is defined by non-empty finite sets $\mathcal{S}$ (states), $\mathcal{A}$ (actions) and $\mathcal{O}$ (observations), a transition kernel $\mathcal{T}$ (which assigns to every state-action pair a probability distribution over the next state and the observation) and a reward function $\mathcal{R}$ (a function of the state). As opposed to the "classical" definition of a POMDP, we allow the value of $\mathcal{R}$ to be imperceptible. The perceptible setting is a special case of the imperceptible setting: we can always encode the reward into the observation.
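
For concreteness, the data defining such a problem can be sketched as follows; this is a hedged sketch whose field names and typing are mine, not the post's formalism.

```python
# A sketch of a POMDP with imperceptible reward; names and types illustrative.
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Tuple

State = Hashable
Action = Hashable
Obs = Hashable

@dataclass
class ImperceptibleRewardPOMDP:
    states: frozenset
    actions: frozenset
    observations: frozenset
    # transition(s, a) -> probability of each (next state, observation) pair
    transition: Callable[[State, Action], Dict[Tuple[State, Obs], float]]
    # reward(s) -> number; crucially, it is a function of the hidden state and
    # is not delivered to the agent through the observation channel
    reward: Callable[[State], float]
```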

To formulate a learning problem, we assume $\mathcal{S}$, $\mathcal{A}$ and $\mathcal{O}$ to be fixed and an initial state to be given, while $\mathcal{T}$ and $\mathcal{R}$ are unknown and belong to some hypothesis class $\mathcal{H}$ of transition-kernel/reward-function pairs.

Such a problem can be unlearnable even if the transition kernel $\mathcal{T}$ is known and there are no irreversible events that can happen:

Example 1

Suppose that , , , , and where for any

Since the two hypotheses are observationally indistinguishable, there is no way to gain any information about which of them is correct. Moreover, the optimal policies for the two hypotheses are different: under one hypothesis we should always take one action, and under the other hypothesis we should always take the other action.
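
The following sketch illustrates the kind of construction Example 1 describes; the particular states, actions and reward values below are assumptions made for the sake of the sketch, not the exact construction above.

```python
# An illustrative reconstruction of the *kind* of situation in Example 1.

OBS = "o"   # the only possible observation, so observations carry no information

def transition(state, action):
    # Identical under both hypotheses: the chosen action determines the next state.
    return {("s+" if action == "a+" else "s-", OBS): 1.0}

reward_plus  = {"s0": 0.0, "s+": 1.0, "s-": 0.0}   # hypothesis 1
reward_minus = {"s0": 0.0, "s+": 0.0, "s-": 1.0}   # hypothesis 2

# Any action sequence yields the observation sequence "o", "o", ... under both
# hypotheses, so the posterior never moves and the problem is unlearnable;
# yet the optimal policy is "always a+" under hypothesis 1 and "always a-"
# under hypothesis 2.
```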

To formalize and illustrate the discussion in the Background section, suppose that the hypothesis class $\mathcal{H}$ is Borel and fix a prior over it. We can then define the "perceived effective reward" of a history as the expected value of the true reward conditional on that history, where the expectation is taken with respect to the prior and the unobserved state.

It is then easy to see that this transformation preserves expected utility: given any policy, the expected perceived effective reward on each round equals the expected value of the true reward on that round, and therefore the two reward functions yield the same expected utility for any geometric time discount parameter.

On the other hand, Bayesian regret is not preserved, since in general the Bayesian regret with respect to the true imperceptible reward differs from the Bayesian regret with respect to the perceived effective reward.

Indeed, in Example 1 the Bayesian regret with respect to the perceived effective reward vanishes, since the reference agent for the transformed problem computes the reward using the same prior as the learner and therefore has no advantage over it, whereas the Bayesian regret with respect to the true reward is positive, since the true reference agent knows which hypothesis is correct and always takes the corresponding optimal action, which the learner cannot reliably do.
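
A toy computation, under the illustrative reconstruction of Example 1 above and an assumed uniform prior, of how the two notions of regret come apart:

```python
prior = 0.5   # assumed probability of hypothesis 1

# Observations carry no information, so the posterior never moves and the
# perceived effective reward of either action is the prior expectation:
perceived_reward_any_action = prior * 1.0 + (1.0 - prior) * 0.0   # = 0.5

# Transformed problem: the reference agent also computes rewards with the
# prior, so it earns 0.5 per round, just like the learner.
regret_perceived = 0.5 - 0.5        # = 0.0

# True imperceptible reward: the reference agent knows the hypothesis and
# earns 1 per round, while any policy of the learner earns 0.5 in expectation.
regret_true_per_round = 1.0 - 0.5   # = 0.5

print(regret_perceived, regret_true_per_round)   # 0.0 0.5
```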

The pathology of Example 1 comes about because the reward is not only imperceptible but entirely unobservable. That is, no experiment can produce any information about whether the reward on a given round was 0 or 1. More specifically, the two relevant states are assigned different rewards, but there is no observable difference between them. It is as if Alice assigned value not to people in Africa (whose existence and well-being can be measured), but to some Flying Spaghetti Monster such that the world behaves exactly the same regardless of its existence or condition.

This observation suggests that, instead of assigning rewards to states which are just abstract labels in a model, we should assign rewards to states that are defined in terms of the observable consequences of the interaction of the agent with the environment. This leads us to the notion of "instrumental state", which we will now define formally.

Instrumental States and Reward Functions

First, we introduce some technical definitions for notational convenience.

Definition 1

The category of convex sets is the category whose objects are pairs $(V, C)$, where $V$ is a real vector space and $C$ is a convex subset of $V$, and whose morphisms are the maps between the convex subsets that preserve convex combinations.

We omit describing identity and composition of morphisms since they are obvious.

It is easy to see that this category has arbitrary limits. In particular, it has a final object (the one point set), products and inverse limits of sequences. For any finite set, the simplex of probability distributions over it (regarded as a subset of the corresponding real vector space) is an object of this category. Also, we will sometimes abuse notation by leaving part of this data implicit when it is clear from context.

Definition 2

The functor is defined by

Definition 3

For any , we define by

Note that is a natural transformation from to the constant functor with value .

Given and s.t. for , we denote .

Now, we are ready to define instrumental states. We fix the action set $\mathcal{A}$ and the observation set $\mathcal{O}$.

Definition 4

For any $n \in \mathbb{N}$, we define the space of $n$ time step instrumental states recursively by

Here, the relevant mapping is affine, and the inverse image of a convex set under an affine mapping is also a convex set, so the space defined this way is indeed an object of our category.

The semantics of Definition 4 is as follows. An instrumental state of time step $n+1$ can be regarded as a pair: a map assigning to every action a probability distribution over observations, and a map assigning to every action-observation pair an instrumental state of time step $n$. The former is the probability distribution over observations resulting from taking the given action in the state, whereas the latter is the state resulting from taking that action conditional on observing the given observation. This semantics can be made more formal as follows (a programmatic sketch also appears after Definition 5):

Definition 5

Given , we define recursively as follows:

  1. For all , and : iff , and .
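
The informal semantics described after Definition 4 can also be rendered as a recursive data type. This is a minimal sketch; the concrete representation (in particular, using None for the zero time step state) is an assumption of the sketch, not part of the formalism.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Action = str
Obs = str

@dataclass
class InstrumentalState:
    # for each action, the probability of each observation when taking it here
    obs_dist: Dict[Action, Dict[Obs, float]]
    # for each (action, observation) pair, the remaining instrumental state;
    # None stands for the zero time step state (nothing left to observe)
    successor: Dict[Tuple[Action, Obs], Optional["InstrumentalState"]]
```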

Definition 6

Given , and , and assuming that , we recursively define by

  1. For :

  2. For with some , and :

The point of defining instrumental states in this manner is that (i) different points in the space correspond to states that are truly not equivalent, i.e. they can be empirically distinguished (statistically), and (ii) convex combinations of points in the space correspond precisely to probabilistic mixtures. Formally, we have:

Definition 7

Given an instrumental state and a policy, we define the probability distribution over histories resulting from following the policy in that state.
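
Under the representation assumed in the sketch after Definition 5, this history distribution can be computed recursively for a deterministic policy; the following is an illustration, not the formal definition.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Action = str
Obs = str

@dataclass
class InstrumentalState:                  # same representation as before
    obs_dist: Dict[Action, Dict[Obs, float]]
    successor: Dict[Tuple[Action, Obs], Optional["InstrumentalState"]]

def history_distribution(state, policy, history=()):
    """Distribution over complete histories (tuples of (action, observation)
    pairs) obtained by following `policy` in `state`; `policy` maps the
    history so far to an action."""
    if state is None:                     # zero time steps left
        return {history: 1.0}
    action = policy(history)
    dist = {}
    for obs, p in state.obs_dist[action].items():
        if p == 0.0:
            continue
        tail = history_distribution(state.successor[(action, obs)], policy,
                                    history + ((action, obs),))
        for h, q in tail.items():
            dist[h] = dist.get(h, 0.0) + p * q
    return dist
```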

Proposition 1

Consider two instrumental states of the same time step, and assume that for any policy they induce the same probability distribution over histories. Then the two states are equal.

Proposition 2

Consider two instrumental states of the same time step, a convex combination of them, and any policy. Then the history distribution induced by the convex combination is the corresponding probabilistic mixture of the history distributions induced by the two original states.

The space of $n$ time step instrumental states is a bounded polytope, but in general it is *not* a simplex: we cannot regard it as just the probability distributions over some set. For example, with two actions, two observations and a single time step, it is easy to see that the space is a square (one axis is the probability of getting a given observation when taking one action, the other axis is the probability of getting it when taking the other action). On the other hand, if there is only one action, then the space is a simplex: it is canonically isomorphic to the space of probability distributions over observation sequences of length $n$.
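
A concrete way to see the square, assuming two actions {a0, a1} and two observations {o0, o1} (names illustrative): a one time step instrumental state is pinned down by two independent numbers, the probability of o0 under each action.

```python
# Letting both coordinates range over [0, 1] sweeps out the unit square.
def one_step_coordinates(p_o0_given_a0, p_o0_given_a1):
    return {
        "a0": {"o0": p_o0_given_a0, "o1": 1.0 - p_o0_given_a0},
        "a1": {"o0": p_o0_given_a1, "o1": 1.0 - p_o0_given_a1},
    }
```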

There are natural morphisms from the space of $(n+1)$ time step instrumental states to the space of $n$ time step instrumental states, whose semantics is forgetting about the behavior of the state at the final time step:

Definition 8

We define these morphisms recursively for all $n$. The base case is the unique morphism to the final object (the one point set). For larger $n$, the morphism is given by keeping the observation distributions and applying the previous morphism to each successor state.
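
Under the recursive representation assumed in the earlier sketches, the forgetful morphism can be illustrated as truncation of the final time step; this is a sketch of the intended semantics, not the formal definition.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Action = str
Obs = str

@dataclass
class InstrumentalState:                  # same representation as before
    obs_dist: Dict[Action, Dict[Obs, float]]
    successor: Dict[Tuple[Action, Obs], Optional["InstrumentalState"]]

def forget_final_step(state):
    """Project an (n+1) time step state to an n time step state by discarding
    what happens at the final time step (uniform depth assumed)."""
    if state is None:
        raise ValueError("the zero time step state has no step to forget")
    if all(s is None for s in state.successor.values()):
        return None                       # a one time step state projects to the point
    return InstrumentalState(
        obs_dist=state.obs_dist,
        successor={k: forget_final_step(s) for k, s in state.successor.items()},
    )
```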

We thereby get an infinite sequence of instrumental state spaces connected by these morphisms.

Definition 9

We define the space of (infinite time step) instrumental states as the inverse limit of the sequence above.