tl;dr: if an agent has a biased learning process, it may choose actions that are worse (with certainty) for every possible reward function it could be learning.

An agent learns its own reward function if there is a set $\mathcal{R}$ of possible reward functions, and there is a learning process $\rho$ that maps world-histories (and policies) to distributions over $\mathcal{R}$. Thus, by interacting with the environment and choosing its own policies, the agent can learn which reward function is the correct one to maximise.

Given a policy $\pi$, a history $h$, an environment $\mu$, and a reward $R$, we can compute the expected probability of $R$ under the learning process $\rho$:

$$\mathbb{E}^{\pi}_{\mu}\left[\rho(R) \mid h\right]$$

Then a learning process is unbiased if that expression is independent of $\pi$, and biased otherwise. Biased processes are less desirable, as they allow the agent to manipulate the process through its choice of policy.

Simple biased learning process

The most trivial example of a biased learning process is an agent that completely determines its reward by its actions.

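Let $\mathcal{R} = \{R_0, R_1\}$, let the agent only act once with two actions available, $a_0$ and $a_1$ (hence a choice of "policy" is a choice of action), and set the learning process $\rho$ to

$$\rho(R_i \mid a_j) = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$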

Thus the agent can simply choose its reward function through its actions.
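
As a minimal sketch of this trivial process, the snippet below checks for bias in the one-shot setting, where a policy is just an action and the expected probability of a reward function is simply the probability the learning process assigns after that action. The names `rho`, `coin_flip`, `is_biased`, and the labels `a0`, `a1`, `R0`, `R1` are hypothetical, chosen only for illustration.

```python
# Hypothetical one-shot setting: a "policy" is a single action.
ACTIONS = ["a0", "a1"]
REWARDS = ["R0", "R1"]

# Assumed pairing: taking action a makes INDUCED[a] the correct reward function.
INDUCED = {"a0": "R0", "a1": "R1"}

def rho(action):
    """Trivially biased learning process: the action fixes the correct reward."""
    return {r: 1.0 if INDUCED[action] == r else 0.0 for r in REWARDS}

def coin_flip(action):
    """An assumed unbiased process for comparison: it ignores the action."""
    return {r: 0.5 for r in REWARDS}

def is_biased(learning_process, actions, rewards):
    """Biased if the probability of some reward depends on the chosen action."""
    for r in rewards:
        probabilities = {learning_process(a)[r] for a in actions}
        if len(probabilities) > 1:
            return True
    return False

print(is_biased(rho, ACTIONS, REWARDS))        # True
print(is_biased(coin_flip, ACTIONS, REWARDS))  # False
```

The check only covers this one-step case; with longer histories one would need to take an expectation over the histories each policy can generate.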

Note that some designs are a bit more sophisticated, and don't allow the agent to choose its reward function directly through its actions. But this doesn't matter if the reward function is a consequence of anything that is itself a predictable consequence of the agent's actions. For example, if the agent can trick, coerce, or manipulate a human into saying "yes" or "no", and the reward function is determined by the human's response, then it doesn't matter that the reward function is not defined directly through the agent's actions: it is defined indirectly through them.

[Note that all learning processes that only involve learning about external facts are unbiased, so it's not as if unbiased means trivial]

Strictly dominated behaviour

An agent with a biased learning process that wants to maximise the expectation of the true reward can sometimes follow strictly dominated policies.

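That means that there are policies $\pi$ and $\pi'$ such that, for all histories $h$ and $h'$ possible given $\pi$ and $\pi'$ respectively, and all possible reward functions $R \in \mathcal{R}$,

$$R(h) > R(h'),$$

and yet the agent will still choose $\pi'$ in order to maximise reward.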

For example, with and defined as above, define and to be:

Thus is always the better action, for both and for ; it is strictly dominant. However, since also determines which reward function is correct, the possible rewards the agent gets are the two bold numbers in the table: , by choosing and hence making the correct reward function, and , by choosing and hence making the correct reward function.

Then in order to maximise reward, the agent will choose the strictly dominated policy/action .
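
To make the effect concrete, here is a small sketch with illustrative payoffs. The numbers are borrowed from the heroin example in the comments below (5 and 7 under one reward function, -1 and 1 under the other) and are an assumption for illustration only. The labels are hypothetical: `a0` stands for "give heroin", `a1` for "ask first", and `R0`, `R1` are the reward functions each action causes the agent to learn.

```python
# Illustrative payoffs only: REWARD[r][a] is the value reward function r
# assigns to action a.
REWARD = {
    "R0": {"a0": 5, "a1": 7},
    "R1": {"a0": -1, "a1": 1},
}

# Assumed biased learning process: taking action a makes INDUCED[a] correct.
INDUCED = {"a0": "R0", "a1": "R1"}

def dominates(a, b):
    """True if action a is strictly better than b under every reward function."""
    return all(REWARD[r][a] > REWARD[r][b] for r in REWARD)

def achieved_reward(a):
    """Reward actually received: action a scored by the reward function it induces."""
    return REWARD[INDUCED[a]][a]

best = max(INDUCED, key=achieved_reward)

print(dominates("a1", "a0"))        # True: a1 is better under both R0 and R1
print(best, achieved_reward(best))  # a0 5: the dominated action is chosen
```

The agent only ever compares the diagonal entries (5 for `a0` under `R0`, 1 for `a1` under `R1`), which is why it picks `a0` even though `a1` wins in every row.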

Unbiased learning

It's possible to prove that if the learning process is unbiased, then this behaviour won't occur, but doing so involves introducing a bit more definition and machinery than is presented here, so I'll defer this to my forthcoming paper.

Note on expected dominance

[Reading the following is not relevant to understanding the main point of this post]

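The dominant policy $\pi$ is defined so that $R(h) > R(h')$ for all $R \in \mathcal{R}$ and for all histories $h$, $h'$ possible given $\pi$, $\pi'$ respectively, where $\pi'$ is the dominated policy.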

We could instead talk about the expected reward given each policy. But in fact, it can make sense to choose policies which are strictly dominated in the expected reward sense.

For example, let be a policy that does nothing (all rewards stay at ), and let be the policy that first checks which of and is correct (for a given ) and then maximises the correct one. If is maximised, it goes to , while the other reward will go to .

Assume that the probability of either or being correct is . Then it's clear that is dominated in expectation by , since

Yet is clearly the right thing to do, since it allows us to maximise the correct reward ( is only negative in worlds where it is not the correct reward).

So an unbiased agent can still choose a policy that is worse for every reward in expectation, if it's confident that the (currently unknown) correct reward will get maximised more by this policy.
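
The arithmetic behind this can be sketched as follows. All numbers are assumptions chosen only so that the effect shows up: each reward function is correct with probability 0.5, doing nothing leaves every reward at 0, and the check-then-maximise policy sends the correct reward to +1 and the other reward to -3. The policy names `"nothing"` and `"check"` are hypothetical labels.

```python
# Assumed illustrative numbers.
P_CORRECT = {"R0": 0.5, "R1": 0.5}  # probability that each reward is the correct one
GAIN, LOSS = 1.0, -3.0              # value of the maximised reward / the other reward

def value(reward, correct, policy):
    """Value of `reward` in the world where `correct` is the true reward."""
    if policy == "nothing":
        return 0.0
    # policy == "check": the correct reward is maximised, the other one suffers
    return GAIN if reward == correct else LOSS

def expected_value(reward, policy):
    """Expectation of one fixed reward function over which reward is correct."""
    return sum(p * value(reward, c, policy) for c, p in P_CORRECT.items())

def expected_correct_reward(policy):
    """Expectation of whichever reward function turns out to be correct."""
    return sum(p * value(c, c, policy) for c, p in P_CORRECT.items())

for r in P_CORRECT:
    print(r, expected_value(r, "check"), expected_value(r, "nothing"))
print(expected_correct_reward("check"), expected_correct_reward("nothing"))
```

Under these assumptions each fixed reward function has expectation -1.0 under `"check"` versus 0.0 under `"nothing"`, while the expected value of the correct reward is 1.0 under `"check"` versus 0.0 under `"nothing"`.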

3 comments

I liked this post quite a bit more than your average post. Maybe I just have more experience with RL and have recently been thinking about IRL, so this was a bit easier to understand, but I do feel like I walked away from this post with a more concrete sense of insight than usual.

This felt weird to me, so I tried to construct a non-math example. Suppose we have a reward learning agent where we have designed the reward space so that "ask the human whether to do X" always has higher reward than "do X". The agent is now considering whether to ask the human to try heroin, or just give them heroin. If the agent gives them heroin, it will see their look of ecstasy and will update to have the reward function "5 for giving the human heroin, 7 for asking the human". If the agent asks the human, then the human will say "no", and the agent will update to have the reward function "-1 for giving the human heroin, 1 for asking the human". In both cases asking the human is the optimal action, yet the agent will end up giving the human heroin without asking.

This seems isomorphic to the example you gave, and it's a little clearer what I find weird:

  • The agent _knows_ how it's going to update based on the action it takes. This feels wrong to me. Though I think the conclusion remains even if the agent only has probabilistic beliefs about how it will update based on the action it takes.
  • Can we simply make sure that the agent selects its action according to the current estimate of the reward function (or the mean reward function if you have a probability distribution), and only updates after seeing the result of the action? This avoids this problem, and the problem in Towards Interactive Inverse Reinforcement Learning, and seems like the approach taken in Deep Reinforcement Learning from Human Preferences. (Such an agent could be a biased learning process, and be safe anyway.)

> The agent _knows_ how it's going to update based on the action it takes.

Yep, that's a key part of the problem. We want to design the AI to update according to what the human says; but what the human says is not a variable out there in the world that the AI discovers, it's something the AI can rig or influence through its own actions.

> Can we simply make sure that the agent selects its action according to the current estimate of the reward function

This estimate depends on the agent's own actions (again, this is the heart of the problem).