This is brief technical note on how to get convergence in the cooperative inverse reinforcement learning framework. We extend cooperative inverse RL to partially observable domains and use a recent result on the grain of truth problem to establish (arguably very strong) conditions to get convergence to -Nash equilibria.

Credit: this came out of the CSRBAI Workshop on Preference Specification. Several people contributed ideas, in particular Vadim Kosoy.

## Preliminaries

We use the setup and notation from the grain of truth paper
(Leike, Taylor, and Fallenstein, 2016).
We only review the most important notation here
in an attempt to make this post notationally fairly self-contained.
The set denotes a countable class
of stochastic environments, the class of all environments
that are *reflective-oracle-computable*
(computable on a probabilistic Turing machine with access to a reflective oracle).

Let be any two-player environment.
Let denote the human player and let denote the robot player.
Each player interacts with the environment in cycles:
at time step the player chooses an *action* and
receives a *percept*
the cycle then repeats for .
A *history* is an element of .
We use to denote one interaction cycle,
and to denote a history of length .
We assume the sets and are finite.

The human follows a policy
and the robot follows a policy
.
The human acts in the *subjective environment*
(environment combined with the robot) and
the robot acts in the subjective environment
(environment combined with the human).
Each player does not observe the action and percept of the other player directly.

Moreover, only the human sees the reward signal, not the robot. Yet they both try to maximize this signal; in this sense they are playing a cooperative game. We assume that the reward is uniquely determined by the robot's history (the robot has all the necessary information available to determine the reward). The robot has a prior over reward functions that includes the true reward function.

One question is how to get a reward signal to the robot.
We assume that the robot maximizes the *belief reward signal*:
For any prior on reward functions,
the robot can calculate at every time step the -expected reward obtained.
We let the robot maximize the belief reward signal.
This is of course not actually desirable,
because it provides no extra incentive to the robot
to take actions that lead to learning the human's actual reward function.
We use to denote the robot's subjective environment
augmented with the -expected reward signal.

We fix a *discount function*
with
and .
The goal is to maximize discounted rewards ,
where denotes the human's reward at time step .
The *discount normalization factor* is defined as
.
We define the *value function* as follows.
For the robot, we use to denote the policy 's
subjective value (from the robot's point of view) and
to denote the policy 's actual value
(in terms of the rewards the human receives).

## Result

Our result relies on two assumptions. We discuss them in turn.

**Assumption 1 (Human is AO).** Player is asymptotically optimal in mean
in the environment class :
for all .

On the one hand, Assumption 1 feels too strong: One of the core ideas of value learning is that the AI is more powerful than the human, and whether value learning succeeds should not hinge on whether the human learns to behave optimally in the limit. On the other hand, maybe assuming a superintelligence-assisted human becomes asymptotically optimal is not so unrealistic: after all, it is just saying that the human would use the robot to get as much reward as possible.

**Assumption 2 (Teachability).**
For all and all policies ,
if infinitely often,
then
infinitely often.

The teachability assumption states that if the robot's belief value differs from its actual value by more than (on any policy), then there is a sequence of actions that the human can take to teach the robot, and that it is suboptimal for the human not to do so. This means that the effective horizon has to be long enough for the human to provide information to the robot, for the robot to change its behavior, and for both of them to adopt better policies.

The form our techability assumption takes is somewhat cheating, because it packages a bunch of steps into one assumption. Future work should try to unpack it and make several smaller assumptions that are easier to understand.

**Theorem.**
If Assumption 1 and Assumption 2 are satisfied
and the human is reflective-oracle-computable,
then there is a policy for the robot such that
for any
both human and robot converge to an -Nash equilibria
in probability.

## Proof

The proof is a relatively straightforward application of existing results.
The robot maximizes the belief reward signal;
as its policy we choose Thompson sampling
because we know that Thompson sampling is asymptotically optimal in mean
in any countable class of stochastic environments,
in particular
(Leike et al., 2016, Thm. 4).
Moreover, Thompson sampling is reflective-oracle-computable.
Therefore we get that
(both and have a *grain of truth*).
From Assumption 1 we get that the human is also asymptotically optimal in mean.
Now we can apply Theorem 28 from Leike, Taylor, and Fallenstein (2016) to get that
for all the probability that
both human and robot play an -best response converges to .
However, this is not necessarily a -Nash equilibrium yet
because the robot is only best responding in its belief environment,
which might be inaccurate.
In other words, we get
,
but we want
.
This is where Assumption 2 comes in.
Together with Assumption 1 it provides that
for any policy .
Omitting history arguments we write
The first convergence is from Assumption 2,
the second from the robot's asymptotic optimality,
and the third is from Assumption 2 again.

## Open Questions

- The teachability assumption seems too strong. Can we unpack it further? Moreover, currently we require it to hold off-policy.
- Convergence to a Nash equilibrium is not very strong. How can we ensure that this Nash equilibrium is Pareto efficient?
- How can we put incentives for the robot to actively learn the human's reward function into the model?