Cooperative inverse reinforcement learning (CIRL) generated a lot of
attention last year, as it seemed to do a good job aligning an agent's incentives with
its human supervisor's.
Notably, it led to an elegant solution to the shutdown problem.
The implications for the wireheading problem were less clear.
Some argued that since the agent only used its observations as
evidence about the reward (rather than optimising the observations
directly as in RL), CIRL should avoid the wireheading problem.
In this post I want to show that CIRL does not avoid the wireheading problem.
Let's first consider what wireheading in RL looks like from an "MDP perspective".
An agent wireheads if it's in a state where the observed reward
(the reward reported by its sensors) is different from the true reward
(the reward assigned to the state by a human supervisor).
For example, consider a highly intelligent RL agent that hijacks its reward channel
and feeds itself full reward.
In the "MDP perspective", this means that the agent finds a way to a state where there is high observed reward,
but low true reward (since the supervisor would prefer the agent doing something else).
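The distinction between observed and true reward can be made concrete with a small sketch. The state names and reward values below are hypothetical and chosen only for illustration; the point is that a state counts as wireheaded when the two reward tables disagree, and that an agent maximising observed reward is drawn to exactly such a state.

```python
# Hypothetical two-state illustration; names and numbers are assumptions.
TRUE_REWARD = {"work": 0.6, "hijack": 0.0}      # assigned by the supervisor
OBSERVED_REWARD = {"work": 0.6, "hijack": 1.0}  # reported by the agent's sensors


def is_wireheaded(state):
    """A state is wireheaded if observed and true reward differ."""
    return OBSERVED_REWARD[state] != TRUE_REWARD[state]


def best_state(reward_table):
    """The state an agent would seek if it maximised the given reward table."""
    return max(reward_table, key=reward_table.get)


rl_choice = best_state(OBSERVED_REWARD)       # what the RL agent pursues
supervisor_choice = best_state(TRUE_REWARD)   # what the supervisor wants
print(rl_choice, is_wireheaded(rl_choice), supervisor_choice)
```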
If we accept that RL agents can subvert their sensory data, then we should
also accept that CIRL agents can subvert theirs.
In both cases, this just means that the agents can find their way to states
where the observation doesn't match the truth.
This can lead to the existence of wireheaded states for CIRL agents.
Let there be two states, s1 and s2.
In each state, the agent can choose between the actions aR1, aR2, and w. The action aRi takes the agent to state si with certainty, for i = 1, 2.
The action w lets the human decide.
The human has two actions aH1 and aH2 that only matter when the agent
chooses w, in which case the transition probabilities
are given by the following picture:
Arrows show the transitions induced by different actions, with labels
giving the probabilities for stochastic transitions.
The agent knows the transition probabilities.
Assume that observations in s2 are corrupted, while observations in s1 are not.
The supervisor prefers the non-corrupt state s1.
Neither of these facts is available to the agent.
The agent assumes that states are non-corrupt unless there is
evidence to the contrary,
and tries to infer the supervisor's preferences from his actions.
In the non-corrupt state s1, the agent (correctly) observes the
supervisor taking either action aH1 or aH2 (both with the same effect).
In the corrupt state s2, the supervisor takes action aH2
trying to move to s1, but the agent incorrectly observes the action as aH1.
Based on the agent's observations, the best explanation is that
the supervisor prefers s2 to s1, i.e. that it's in a high reward state.
After an initial learning phase with w, the best policy for the agent is to always choose aR2, to stay in s2.
This is analogous to an RL agent finding a corrupt, high reward state, and preferring to take actions to stay there.
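The inference step in this example can be sketched as a simple Bayesian update. This is only an illustration under stated assumptions: the two hypotheses, the uniform prior, the RATIONALITY parameter, and the reading of aH1 in s2 as "stay in s2" are my own simplifications and are not taken from the transition diagram. The agent's likelihood model assumes uncorrupted observations, which is what lets the corrupted observations in s2 mislead it.

```python
# Hypotheses about which state the supervisor prefers (assumed framing).
HYPOTHESES = ("prefers_s1", "prefers_s2")

# Assumed probability that the supervisor picks the action matching their
# preference when the choice matters. Hypothetical value.
RATIONALITY = 0.9


def likelihood(observed_action, state, hypothesis):
    """P(observed human action | state, hypothesis) in the agent's model.

    The agent's model assumes that observations are never corrupted.
    """
    if state == "s1":
        # Both human actions have the same effect in s1, so they carry
        # no information about the supervisor's preference.
        return 0.5
    # Assumption: in s2, aH1 means "stay in s2" and aH2 means "move to s1".
    preferred = "aH1" if hypothesis == "prefers_s2" else "aH2"
    return RATIONALITY if observed_action == preferred else 1 - RATIONALITY


def observe(state, true_action):
    """What the agent sees: every human action in s2 is misread as aH1."""
    return "aH1" if state == "s2" else true_action


def posterior(observations):
    """Bayesian update over HYPOTHESES from (state, observed_action) pairs."""
    belief = {h: 0.5 for h in HYPOTHESES}  # assumed uniform prior
    for state, action in observations:
        belief = {h: p * likelihood(action, state, h) for h, p in belief.items()}
        total = sum(belief.values())
        belief = {h: p / total for h, p in belief.items()}
    return belief


# The supervisor truly prefers s1, so in s2 they always take aH2.
episodes = [("s1", "aH1"), ("s1", "aH2")] + [("s2", "aH2")] * 5
seen = [(state, observe(state, action)) for state, action in episodes]
belief = posterior(seen)
print(belief)
```

Under these assumptions, the observations made in s1 leave the belief unchanged, while the corrupted observations in s2 push almost all probability mass onto the hypothesis that the supervisor prefers s2.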
The fact that the supervisor cannot reach s2 from s1
means that no information about the relative reward between s1
and s2 can be gained while in the non-corrupt state s1.
Letting the agent trust a reward estimate of a state only
after it has multiple sources of evidence about it may help somewhat.
However, a similar example can still be constructed by replacing
s2 with a cluster of mutually consistent states.
The example was developed together with Victoria Krakovna,
and will be part of our upcoming IJCAI paper on wireheading.
As an observation, it seems like part of the problem in this example is that the agent has access to different actions than the supervisor. The supervisor cannot move to s2 (and therefore cannot provide any information about the reward difference, as noted), but the agent can easily do so. If this were not the case, it would not matter what the agent believed about s2.
What happens in scenarios where you restrict the set of actions available to the agent so that it matches those available to the supervisor?
That is a good question. I don't think it is essential that the agent can move from s1 to s2, only that the agent is able to force a stay in s2 if it wants to.
The transition from s1 to s2 could instead happen randomly with some probability.
The important thing is that the human's action in s1 does not reveal any information about s2.
but the agent incorrectly observes the action
It's a bit annoying that this has to rely on an incorrect observation. Why not replace the human action, in state s2, with a simple automated system that chooses aH1? It's an easy mistake to make while programming, and the agent has no fundamental understanding of the difference between the human and an imperfect automated system.
Basically, if the human acts in perfect accordance with their preferences, and if the agent correctly observes and learns this, the agent will converge on the right values. You get wireheading by removing "correctly observes", but I think removing "human acts in perfect accordance with their preferences" gives a better example of wireheading.
Adversarial examples for neural networks make situations where the agent misinterprets the human action seem plausible.
But it is true that the situation where the human acts irrationally in some state (e.g. because of drugs, propaganda) could be modeled in much the same way.
I preferred the sensory error since it doesn't require an irrational human. Perhaps I should have been clearer that I'm interested in the agent wireheading itself (in some sense) rather than wireheading of the human.
(Sorry for being slow to reply -- I didn't get notified about the comments.)