In the previous post, I introduced the formalism for reward function learning, and presented the expected value function for a learning agent:

$$V (μ, ρ, π, h^{n}) = \sum_{h^{m} \in H^{m}} \sum_{R \in \mathcal{R}} \sum_{i = 1}^{m} P^{π, μ} (h^{m} ∣ h^{n}) \, ρ (R; π, h^{m}) \, R (h_{i}^{m}) .$$

I'll assume that people reading this are familiar with the concepts, the notations and the example of that post. I'll now look at the desirable properties for the learning function $ρ$ .

1 Rigging the learning process

1.1 The flaws of general learning agents

First of all, are there any problems with a general $ρ$ ? There are three major ones:

1) An agent learning to maximise $V$ with a general $ρ$ has to use the whole of the episode to assess the value of any one action. Thus, it can learn as a Monte-Carlo agent, but not as a Q-learning or Sarsa agent.
2) An agent maximising $V$ with a general $ρ$ can take actions that don't increase reward, but are taken purely to "justify" its past actions.
3) An agent maximising $V$ with a general $ρ$ can sometimes pick a policy that is sub-optimal, with certainty, for all rewards $R$ in $\mathcal{R}$.

All of these points can be seen by considering our robot cooking/washing example. In that case, it can be seen that the optimal behaviour for that robot is $N, S, E, E, E$ ; this involves cooking the two pizzas, then going East to push the lever onto the cooking choice, and then ending the episode.

Thus $ρ (R_{c}; π, {N, S, E, E, E}) = 1$ , so the final reward is $R_{c}$ , and the agent earns a reward of $2 / 2 - 4 / 20 = 0.8$ .

Why must Q-learning fail here? Because the reward for the first $N$ , at the point the agent does it, is $1 / 2$ , not $1$ ; this is because, at this point, $ρ (R_{c}; π, {N})$ is still $1 / 2$ . Thus the reward component in the Q-learning equation is incorrect.
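To make the mismatch concrete, here is the standard one-step Q-learning update next to the two candidate values for the reward signal at the first $N$. The learning rate $α$ and discount $γ$ are the usual textbook symbols and do not appear in the post; step costs are left out, and $R_{w}$ is assumed to give $0$ for cooking.

$$Q (s, a) \leftarrow Q (s, a) + α \left[ r + γ \max_{a'} Q (s', a') - Q (s, a) \right]$$

$$r_{\text{when acting}} = \sum_{R \in \mathcal{R}} ρ (R; π, \{N\}) \, R (\{N\}) = \tfrac{1}{2} \cdot 1 + \tfrac{1}{2} \cdot 0 = \tfrac{1}{2}, \qquad r_{\text{counted in } V} = ρ (R_{c}; π, \{N, S, E, E, E\}) \cdot 1 = 1 .$$

The update only ever sees the first quantity, while $V$ credits the action with the second, which depends on what the agent does later in the episode.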

Also note that the rest of the policy, $S, E, E, E$, serves no purpose in earning reward; it just "justifies" the reward from the first action $N$.

Let us now compare this policy with the policy $N, N$: go North, cook, end the episode. For the value learning function, this has a value of only $1 / 2 - 1 / 20 = 0.45$, since the final reward is $1 / 2 R_{c} + 1 / 2 R_{w}$. However, under the reward of $R_{c}$, this would give a reward of $0.95$, more than the $0.8$ that $R_{c}$ gets under $N, S, E, E, E$. And under the reward of $R_{w}$, this would get a reward of $- 0.05$, more than the $0 - 4 / 20 = - 0.2$ that $R_{w}$ gets under $N, S, E, E, E$. Thus the optimal policy for the value learner is worse for both $R_{c}$ and $R_{w}$ than the $N, N$ policy.
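The comparison can be reproduced with a few lines of arithmetic. This is only a sketch: the names `STEP_COST`, `pure_returns` and `learner_value` are made up for illustration, the cooking reward of $1$ and the per-step cost of $1/20$ are read off the sums quoted in the text, and the $R_{w}$ return of the long policy is taken to be $0 - 4/20$, the figure used later in the post.

```python
STEP_COST = 1 / 20  # cost per charged step, read off the sums in the text

def pure_returns(cook_reward, charged_steps):
    """Return of a policy under each pure reward function (assumed values)."""
    cost = STEP_COST * charged_steps
    # Assumption: R_w pays nothing here, since neither policy does any washing.
    return {"R_c": cook_reward - cost, "R_w": 0.0 - cost}

policies = {
    # final_rho_c is rho(R_c) on the complete history of that policy
    "N,S,E,E,E": {"returns": pure_returns(1.0, 4), "final_rho_c": 1.0},
    "N,N": {"returns": pure_returns(1.0, 1), "final_rho_c": 0.5},
}

def learner_value(policy):
    """Value for the learning agent: returns weighted by the final rho."""
    rho_c = policy["final_rho_c"]
    returns = policy["returns"]
    return rho_c * returns["R_c"] + (1 - rho_c) * returns["R_w"]

for name, policy in policies.items():
    print(name, round(learner_value(policy), 2), policy["returns"])

long_policy, short_policy = policies["N,S,E,E,E"], policies["N,N"]
# The learner prefers the long policy...
learner_prefers_long = learner_value(long_policy) > learner_value(short_policy)
# ...even though the short one is better under every individual reward function.
short_dominates = all(
    short_policy["returns"][r] > long_policy["returns"][r] for r in ("R_c", "R_w")
)
```

Both `learner_prefers_long` and `short_dominates` come out true under these assumptions, which is exactly the third flaw listed above.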

1.2 Riggable learning processes

The problem with the $ρ$ used in the robot example is that it's riggable (I used to call this "biasable", but that term is seriously overused). What does this mean?

Well, consider again the equation for the expected value $V$. The only history inputs into $ρ$ are the $h^{m}$, the complete histories. So, essentially, only the value of $ρ$ on these complete histories matters.

In our example, we chose a $ρ$ that was independent of policy, but we could have gone a different route. Let $π$ be any policy such that the final reward is $R_{c}$ ; then define $ρ (R_{c}; π, h) = 1$ for any history $h$ (and conversely $ρ (R_{w}; π, h) = 0$ ). Similarly, if $π$ were a policy such that the final reward was $R_{w}$ , then set $ρ (R_{w}; π, h) = 1$ . If the policy never brings the agent to either lever, then $ρ (R_{c}; π, h) = ρ (R_{w}; π, h) = 1 / 2$ , as before. Stochastic policies have $ρ$ values between these extremes.

This $ρ$ is no longer independent of policy, but it is Bayesian; that is, the current $ρ$ is the same as the expected $ρ$ :

$$ρ (R; π, h) = \sum_{h^{m} \in H^{m}} P^{π, μ} (h^{m} ∣ h) \, ρ (R; π, h^{m}) .$$

However, it is not possible to keep the same $ρ$ on complete histories, and have it be both Bayesian, and independent of policy: there is a tension between the two.

Then we define:

A learning process $ρ : H \times Π \to Δ \mathcal{R}$ is unriggable if it is both Bayesian and independent of policy.
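The tension between the two conditions can be checked on a stripped-down version of the lever example. The sketch below is an illustration under assumptions: the three policy names and the `OUTCOMES` table are hypothetical summaries of the example (each policy deterministically reaches one complete history), and only the initial value of $ρ (R_{c})$ is tested.

```python
# Hypothetical summary of the lever example: for each deterministic policy,
# a list of (probability of complete history, final rho(R_c) on that history).
OUTCOMES = {
    "push_cook_lever": [(1.0, 1.0)],
    "push_wash_lever": [(1.0, 0.0)],
    "avoid_levers": [(1.0, 0.5)],
}

def expected_final_rho(policy):
    """Expected value of rho(R_c) on complete histories under this policy."""
    return sum(prob * rho_c for prob, rho_c in OUTCOMES[policy])

def is_bayesian(initial_rho):
    """Bayesian: the current rho equals the expected final rho, for each policy."""
    return all(
        abs(initial_rho[policy] - expected_final_rho(policy)) < 1e-12
        for policy in OUTCOMES
    )

def is_policy_independent(initial_rho):
    """Independent of policy: the same initial rho whatever the policy."""
    return len(set(initial_rho.values())) == 1

# The rho of the robot example: always 1/2 at the start.
rho_independent = {policy: 0.5 for policy in OUTCOMES}
# The alternative rho: the start value anticipates where the policy ends up.
rho_bayesian = {policy: expected_final_rho(policy) for policy in OUTCOMES}

print(is_policy_independent(rho_independent), is_bayesian(rho_independent))
print(is_policy_independent(rho_bayesian), is_bayesian(rho_bayesian))
```

With the final values held fixed, each candidate satisfies one condition and fails the other, so neither is unriggable.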

1.3 Unriggable learning processes

So, what would be an example of an unriggable learning process? Well, consider the following setup, where the robot no longer has levers to set its own reward, but instead its owner is in the rightmost box.

In this case, if the robot enters that box, the owner will inform them of whether they should cook or wash.

Since there is hidden information, this setup can be formalised as a POMDP. The old state-space was $S$, of size $37$, which covered the placement of the robot and the number of pizzas and mud splatters (and whether the episode was ended or not).

The new state space is $S^{'} = S \times {cook, wash}$, with ${cook, wash}$ encoding whether the owner is minded to have the robot cooking or washing. The observation space is of size $38$: in most states, the observation only returns the details of $S$, not of $S^{'}$, but in the rightmost box, it returns the actual state, letting the agent know whether the human intends it to cook or wash. Thus the observation function $O$ is deterministic (if you know the state, you know the observation), but not one-to-one (because for most $s \in S$, $s \times {cook}$ and $s \times {wash}$ will generate the same observation).

The transition function $T$ is still deterministic: it operates as before on $S$ , and maps $cook$ to $cook$ and $wash$ to $wash$ .

The initial state function $T_{0}$ is stochastic, though: if $s_{0} \in S$ is the usual starting position, then $T_{0} (s_{0} \times {cook}) = T_{0} (s_{0} \times {wash}) = 1 / 2$ : the agent thinks it's equally likely that its owner desires cooking as that it desires washing.

Then what about $ρ$ ? Well, if the history $h$ involves the agent being told $cook$ the very first time it enters the rightmost box, then $ρ (R_{c}; π, h) = 1$ . If it was told $wash$ the very first time it enters the rightmost box, then $ρ (R_{w}; π, h) = 1$ .

It's easy to see that this $ρ$ is independent of policy. It's also Bayesian, because $ρ$ actually represents the ignorance of the agent as to whether it lives in the $S \times {cook}$ part of the environment, or the $S \times {wash}$ part, and it gets updated as the agent figures this out.

What then is the agent's optimal policy? It's to start with $E, E$ , to get the human's decree as to which reward is the true one. It will then do $W, W$ , and, if the human has said $cook$ , it will finish with $N, N$ , giving it a final reward function of $R_{c}$ and a final total reward of $0.75$ . If the human said $wash$ , it would finish with $S, S$ , giving it a final reward function of $R_{w}$ and a final total reward of $0.25$ . Its expected total reward is thus $0.5$ .
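As a quick check of both claims, the sketch below computes the expected value of $ρ (R_{c})$ after asking (or not asking) the owner, and the expected total reward of the policy just described. The function names are hypothetical; the returns $0.75$ and $0.25$ and the $1/2$ prior are the figures given in the text.

```python
# Prior over the owner's hidden wish, as given by the initial state function T_0.
PRIOR = {"cook": 0.5, "wash": 0.5}

def rho_c_after(asked, hidden_wish):
    """rho(R_c) once the agent has (or has not) heard the owner's answer."""
    if not asked:
        return PRIOR["cook"]  # no information yet, so rho stays at the prior
    return 1.0 if hidden_wish == "cook" else 0.0

def expected_rho_c(asked):
    """Expectation of the later rho(R_c), taken over the hidden wish."""
    return sum(PRIOR[wish] * rho_c_after(asked, wish) for wish in PRIOR)

# Total rewards quoted in the text for E,E,W,W followed by N,N or S,S.
RETURN_AFTER_ASKING = {"cook": 0.75, "wash": 0.25}
expected_return = sum(PRIOR[wish] * RETURN_AFTER_ASKING[wish] for wish in PRIOR)

print(expected_rho_c(asked=True), expected_rho_c(asked=False), expected_return)
```

Whether or not the agent asks, the expected later value of $ρ (R_{c})$ equals its current value of $1/2$, which is the Bayesian condition in this small case.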

1.4 Properties of unriggable learning processes

Now, if $ρ$ is unriggable, then we have (almost) all the desirable properties:

1) An agent learning to maximise $V$ for an unriggable $ρ$ may be a Q-learning agent.
2) An agent maximising $V$ for unriggable $ρ$ will be indifferent to past rewards.
3) An agent maximising $V$ for unriggable $ρ$ will never pursue a policy that will be worse, with certainty, for all $R$ in $\mathcal{R}$.

These all come from a single interesting result:

If $ρ$ is Bayesian, then the value function $V$ and the value function $V_{v}$ of the previous post, differ by a constant that is independent of future action. Thus, if $ρ$ is unriggable, $V$ is the value function of a single classical reward function $R^{ρ, π}$ (which is actually well-defined, independently of $π$ ).

This establishes all the nice properties above, and will be proved in the appendix of this post.

Note that even though the value functions are equal, that doesn't mean that the total reward will be given by $R^{ρ, π}$ . For instance, consider the situation below, where the robot goes $N, S, E, E, E$ :

At the moment where it cooks the pizzas, it has $ρ (R_{c}; π, h) = 1 / 2$, so it will get an $R^{ρ, π}$ of $1 / 2 - 4 / 20 = 0.3$, with certainty. On the other hand, from the perspective of value learning, it will learn at the end that it either has reward function $R_{c}$, which will give it a reward of $1 - 4 / 20 = 0.8$, or has reward function $R_{w}$, which will give it a reward of $0 - 4 / 20 = - 0.2$. Since $1 / 2 (0.8 - 0.2) = 0.3$, the expectations are the same, even if the outcomes are different.

2 Influence

2.1 Problems despite unriggable

Being unriggable has many great properties. Is it enough?

Unfortunately not. The $ρ$ can be unriggable but still manipulable by the agent. Consider for instance the situation below:

Here, not only is there the adult with their opinion on cooking and washing, but there's also an infant, who will answer randomly. This can be modelled as a POMDP, with state space $S^{''} = S \times {(i_{c}, a_{c}), (i_{w}, a_{c}), (i_{c}, a_{w}), (i_{w}, a_{w})}$, where $i_{c}$ (resp $i_{w}$) designates that the infant will answer $cook$ (resp $wash$), and $a_{c} / a_{w}$ do the same for the adult. The observation space is of size $39$; when the robot is in the leftmost (rightmost) box, it discovers the value of $i_{c} / i_{w}$ ($a_{c} / a_{w}$) in the obvious way. The dynamics are as expected, with $T$ preserving the values of $i_{c} / i_{w}$ and $a_{c} / a_{w}$.

It's the initial distribution $T_{0}$ which encodes the uncertainty. With probability $1 / 4$ the agent will start in $S \times {(i_{c}, a_{c})}$, and similarly for the other three possibilities.

Now we need to define $ρ$; call this one $ρ^{'}$. This will be relatively simple: it will set $ρ^{'} (R_{c}; π, h)$ to be $1$, as soon as the agent figures out that it lives either on an $i_{c}$ or an $a_{c}$ branch, and will not update further. It will set $ρ^{'} (R_{w}; π, h)$ to $1$ as soon as it figures out that it lives on an $i_{w}$ or an $a_{w}$ branch, and will not update further. If it has no information about either, it will stick with $ρ^{'} (R_{c}; π, h) = ρ^{'} (R_{w}; π, h) = 1 / 2$.

It's clear that $ρ^{'}$ is independent of policy; but is it Bayesian? It is indeed, because each time it updates, it goes to $0$ or $1$ with equal probability, depending on the observation (and stays there). Before updating, it is always at $1 / 2$ , so the value of $ρ^{'}$ is always the same as the expected value of $ρ^{'}$ .

So we have an unriggable $ρ^{'}$ ; what can go wrong?

For that $ρ^{'}$, the optimal policy is to ask the infant, then follow their stated values. This means that it avoids the extra square on the way to enquire of the adult, and gets a total expected reward of $0.6$, rather than the $0.5$ it would get from asking the adult.

2.2 Uninfluenceable

Note something interesting in the preceding example: if we keep $ρ^{'}$ as is, but change the knowledge of the robot, then $ρ^{'}$ is no longer unriggable. For example, if the agent knew that it was in a branch with $a_{c}$ , then it has a problem: if $ρ^{'} (R_{c})$ is initially $1 / 2$ , then it is no longer Bayesian if it goes to ask the adult, because it knows what their answer will be. But if $ρ^{'} (R_{c})$ is initially $1$ , then it is no longer Bayesian if it asks the infant, because it doesn't know what their answer will be.

The same applies for any piece of information the robot could know. We'd therefore like to have some concept of "unriggable conditional on extra information"; something like

$$ρ (R; π, h ∣ ∣ I) = \sum_{h^{m} \in H^{m}} P^{π, μ} (h^{m} ∣ h) \, ρ (R; π, h^{m} ∣ ∣ I),$$

for some sort of extra information $I$ .

That, however, is not easy to capture in POMDP form. But there is another analogous approach. The state space of the POMDP is $S \times {(i_{c}, a_{c}), (i_{w}, a_{c}), (i_{c}, a_{w}), (i_{w}, a_{w})}$ ; this is actually four deterministic environments, and the robot is merely uncertain as to which environment it operates in.

This can be generalised. If a POMDP is explored for finitely many steps, then a POMDP $μ$ can be seen as a probability distribution over a set $Λ$ of deterministic environments (see here for more details on one way this can happen - there are other equivalent methods).

Any history $h$ will update this $μ$ as to which deterministic environment the agent lives in (this $Λ$ can be seen as the set of all the "hidden variables" of the environment). So we can talk sensibly about expressions like $P^{μ} (λ ∣ h)$ , the probability that the environment is $λ$ , given that we have observed the history $h$ .

Then we say that a learning process $ρ$ is uninfluenceable if there exists a function $f : Λ \to Δ \mathcal{R}$ such that

$$ρ (R; π, h) = \sum_{λ \in Λ} P^{μ} (λ ∣ h) \, f (λ) (R) .$$

Here $f (λ) (R)$ means the probability of $R$ in the distribution $f (λ) \in Δ \mathcal{R}$.

This expression means that $ρ$ merely encodes ignorance about the hidden variables of the environment.

The key properties of uninfluenceable learning processes are:

An uninfluenceable learning process is also unriggable.
An uninfluenceable learning process is exactly one that learns variables about the environment that are independent of the agent.

I will not prove these here (though the second is obvious by definition).

In our most recent robot example, there are four elements of $Λ$, one for each of the branches ${(i_{c}, a_{c}), (i_{w}, a_{c}), (i_{c}, a_{w}), (i_{w}, a_{w})}$.

It isn't hard to check that there is no $f$ which makes $ρ^{'}$ into an uninfluenceable learning process. By contrast, if we define $ρ_{a}$ as given by the function:

$$\begin{matrix}
f (i_{c}, a_{c}) (R_{c}) & = 1, \\
f (i_{c}, a_{w}) (R_{c}) & = 0, \\
f (i_{w}, a_{c}) (R_{c}) & = 1, \\
f (i_{w}, a_{w}) (R_{c}) & = 0,
\end{matrix}$$

then we have an uninfluenceable $ρ_{a}$ that corresponds to "ask the adult". We finally have a good definition of a learning process, and the agent that maximises it will simply go and ask the adult before accomplishing the adult's preferences.
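Both claims about $f$ can be checked by brute enumeration of the four environments. The following is a minimal sketch under assumptions: histories are reduced to a dictionary of the answers heard so far, the prior over $Λ$ is uniform as in $T_{0}$, and the helper names (`posterior`, `rho_from_f`, `forced_f_values`) are invented for illustration.

```python
from fractions import Fraction
from itertools import product

# The four deterministic environments, written as (infant answer, adult answer).
LAMBDA = list(product(("cook", "wash"), repeat=2))
PRIOR = {lam: Fraction(1, 4) for lam in LAMBDA}
SPEAKERS = ("infant", "adult")

def posterior(history):
    """P(lambda | h), where h maps 'infant'/'adult' to the answer heard so far."""
    consistent = [
        lam for lam in LAMBDA
        if all(lam[SPEAKERS.index(who)] == answer for who, answer in history.items())
    ]
    total = sum(PRIOR[lam] for lam in consistent)
    return {lam: PRIOR[lam] / total for lam in consistent}

def rho_from_f(f, history):
    """rho(R_c; h) = sum over lambda of P(lambda | h) * f(lambda)(R_c)."""
    return sum(p * f[lam] for lam, p in posterior(history).items())

# "Ask the adult": f(lambda)(R_c) is 1 exactly when the adult would say cook.
f_adult = {lam: Fraction(int(lam[1] == "cook")) for lam in LAMBDA}

def forced_f_values(history, rho_value):
    """If rho(h) is 0 or 1, every f(lambda) with P(lambda | h) > 0 must equal it,
    because rho(h) is an average of values lying between 0 and 1."""
    return {lam: rho_value for lam in posterior(history)}

# rho' jumps to 1 when the infant says cook, and to 0 when the adult says wash.
from_infant = forced_f_values({"infant": "cook"}, Fraction(1))
from_adult = forced_f_values({"adult": "wash"}, Fraction(0))
conflicts = [
    lam for lam in from_infant
    if lam in from_adult and from_infant[lam] != from_adult[lam]
]

print(rho_from_f(f_adult, {"infant": "cook"}), rho_from_f(f_adult, {"adult": "cook"}))
print(conflicts)
```

Under $f$ for "ask the adult", hearing the infant leaves $ρ_{a} (R_{c})$ at $1/2$, while hearing the adult moves it to $0$ or $1$. For $ρ^{'}$, the environment $(i_{c}, a_{w})$ would need $f (R_{c})$ to be both $1$ and $0$, so no suitable $f$ exists.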

3 Warning

If a learning function is uninfluenceable, then it has all the properties we'd expect if we were truly learning something about the outside world. But a) good learning functions may be impossible to make uninfluenceable, and b) being uninfluenceable is not enough to guarantee that the learning function is good.

On point a), anything that involves human feedback is generally influenceable and riggable, since the human feedback is affected by the agent's actions. This includes, for example, most versions of the approval directed agent.

But that doesn't mean that those ideas are worthless! We might be willing to accept a little bit of rigging in exchange for other positive qualities. Indeed, quantifying and controlling rigging is a good topic for further research.

What of the converse - is being uninfluenceable enough?

Definitely not. For example, any constant $ρ$ - that never learns, never changes - is certainly uninfluenceable.

As another example, if $σ$ is any permutation of $\mathcal{R}$, then $ρ \circ σ$ (defined so that $ρ \circ σ (R; π, h) = ρ (σ (R); π, h)$) is also uninfluenceable. Thus "learn what the adult wants, and follow that" is uninfluenceable, but so is "learn what the adult wants, and do the opposite".

We've shown previously that $ρ_{a}$, "ask the adult", is uninfluenceable. But so is $ρ_{i}$, "ask the infant"!
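These undesirable but uninfluenceable processes are easy to write down explicitly. The sketch below assumes the same four-environment setup with a uniform prior; the names `f_adult`, `f_infant`, `permute` and `rho` are hypothetical, and `SIGMA` is the permutation that swaps $R_{c}$ and $R_{w}$.

```python
from itertools import product

# The four deterministic environments, written as (infant answer, adult answer).
LAMBDA = list(product(("cook", "wash"), repeat=2))
SPEAKERS = ("infant", "adult")

def posterior(history):
    """P(lambda | h) under a uniform prior; h maps a speaker to the answer heard."""
    consistent = [
        lam for lam in LAMBDA
        if all(lam[SPEAKERS.index(who)] == answer for who, answer in history.items())
    ]
    return {lam: 1.0 / len(consistent) for lam in consistent}

def make_f(speaker_index):
    """f(lambda) puts all its weight on the reward named by the chosen speaker."""
    return {
        lam: {"R_c": float(lam[speaker_index] == "cook"),
              "R_w": float(lam[speaker_index] == "wash")}
        for lam in LAMBDA
    }

f_infant = make_f(0)  # "ask the infant"
f_adult = make_f(1)   # "ask the adult"

SIGMA = {"R_c": "R_w", "R_w": "R_c"}  # the permutation swapping the two rewards

def permute(f):
    """The f that generates rho composed with sigma: f'(lambda)(R) = f(lambda)(sigma(R))."""
    return {lam: {r: dist[SIGMA[r]] for r in dist} for lam, dist in f.items()}

def rho(f, reward, history):
    """rho(R; h) = sum over lambda of P(lambda | h) * f(lambda)(R)."""
    return sum(p * f[lam][reward] for lam, p in posterior(history).items())

adult_says_cook = {"adult": "cook"}
print(rho(f_adult, "R_c", adult_says_cook))           # follow the adult
print(rho(permute(f_adult), "R_c", adult_says_cook))  # do the opposite
print(rho(f_infant, "R_c", adult_says_cook))          # ignore the adult
```

All three processes have the form required by the definition, since each is generated by some $f$ on $Λ$; only the first leads the agent to learn what we actually want.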

So we have to be absolutely sure not only that our $ρ$ has good properties, but exactly what it is leading the agent to learn.

4 Appendix: proof of value-function equivalence

We want to show that:

If $ρ$ is Bayesian, then the value function $V$ and the value function $V_{v}$ of the previous post differ by a constant that is independent of future action.

As a reminder, the two value functions are:

$$\begin{matrix}
V (μ, ρ, π, h^{n}) & = \sum_{h^{m} \in H^{m}} \sum_{R \in \mathcal{R}} \sum_{i = 1}^{m} P^{π, μ} (h^{m} ∣ h^{n}) \, ρ (R; π, h^{m}) \, R (h_{i}^{m}), \\
V_{v} (μ, ρ, π, h^{n}) & = \sum_{h^{m} \in H^{m}} \sum_{R \in \mathcal{R}} \sum_{i = 1}^{m} P^{π, μ} (h^{m} ∣ h^{n}) \, ρ (R; π, h_{i}^{m}) \, R (h_{i}^{m}) .
\end{matrix}$$

To see the equivalence, let's fix $i > n$ and $R$ in $V$, and consider the term $\sum_{h^{m} \in H^{m}} P^{π, μ} (h^{m} ∣ h^{n}) ρ (R; π, h^{m}) R (h_{i}^{m})$. We can factor the conditional probability of $h^{m}$, given $h^{n}$, by summing over all the intermediate $h_{i}^{m}$:

$$\sum_{h_{i}^{m} \in H^{i}} \sum_{h^{m} \in H^{m}} P^{π, μ} (h^{m} ∣ h_{i}^{m}) \, P^{π, μ} (h_{i}^{m} ∣ h^{n}) \, ρ (R; π, h^{m}) \, R (h_{i}^{m}) .$$

Because $ρ$ is Bayesian, this becomes $\sum_{h_{i}^{m} \in H^{i}} P^{π, μ} (h_{i}^{m} ∣ h^{n}) ρ (R; π, h_{i}^{m}) R (h_{i}^{m})$. Then note that $P^{π, μ} (h_{i}^{m} ∣ h^{n}) = \sum_{h^{m} \in H^{m}, h^{m} \geq h_{i}^{m}} P^{π, μ} (h^{m} ∣ h^{n})$, so that expression finally becomes

$$\sum_{h^{m} \in H^{m}} P^{π, μ} (h^{m} ∣ h^{n}) \, ρ (R; π, h_{i}^{m}) \, R (h_{i}^{m}),$$

which is the corresponding expression for $V_{v}$ when you fix any $i > n$ and $R$ . This shows equality for $i > n$ .

Now let's fix $i \leq n$ and $R$, in $V$. The value of $R (h_{i}^{m})$ is fixed, since it lies in the past. Then the expectation of $ρ (R; π, h^{m})$ is simply the current value $ρ (R; π, h^{n})$. This differs from the expression for $V_{v}$ - namely $ρ (R; π, h_{i}^{m})$ - but both values are independent of future actions.
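The $i > n$ part of the argument can be sanity-checked numerically. The sketch below is not part of the proof and makes several simplifying assumptions: histories are binary strings of length at most `M`, the policy and environment are folded into random branch probabilities, there are only two reward functions, and the values are computed from the empty history (so $n = 0$ and every $i$ satisfies $i > n$). The learning process is made Bayesian by construction, by setting each interior $ρ$ to the expectation of its children.

```python
import itertools
import random

random.seed(0)
M = 3  # episode length m (arbitrary choice for this sketch)

# All histories of length 0..M, encoded as tuples of binary observations.
histories = [h for k in range(M + 1) for h in itertools.product((0, 1), repeat=k)]

# Hypothetical branch probabilities: chance that the next observation is 1.
p_one = {h: random.random() for h in histories if len(h) < M}

def prob(h):
    """P(h | empty history) under the fixed policy and environment."""
    out = 1.0
    for k, bit in enumerate(h):
        q = p_one[h[:k]]
        out *= q if bit else 1.0 - q
    return out

# rho[h] is the probability of reward function 0 after h (function 1 gets the rest).
# Random on complete histories, then filled in backwards so that rho is Bayesian.
rho = {h: random.random() for h in histories if len(h) == M}
for k in range(M - 1, -1, -1):
    for h in itertools.product((0, 1), repeat=k):
        q = p_one[h]
        rho[h] = q * rho[h + (1,)] + (1.0 - q) * rho[h + (0,)]

# Two arbitrary reward functions, defined on every non-empty history.
rewards = [{h: random.uniform(-1, 1) for h in histories if h} for _ in range(2)]

def weight(r_index, h):
    return rho[h] if r_index == 0 else 1.0 - rho[h]

def value(use_final_rho):
    """V if use_final_rho is True (rho on the complete history), else V_v."""
    total = 0.0
    for leaf in itertools.product((0, 1), repeat=M):
        for r_index, reward in enumerate(rewards):
            for i in range(1, M + 1):
                belief_at = leaf if use_final_rho else leaf[:i]
                total += prob(leaf) * weight(r_index, belief_at) * reward[leaf[:i]]
    return total

V = value(use_final_rho=True)
V_v = value(use_final_rho=False)
print(V, V_v)
```

The two numbers agree up to floating-point error. Replacing the backward fill with arbitrary interior values of `rho` (a non-Bayesian process) will generally break the agreement.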