*A putative new idea for AI control; index here*.

**NOTE**: What used to be called 'bias' is now called 'rigging', because 'bias' is very overloaded. The post has not yet been updated with the new terminology, however.

What are the biggest failure modes of reward learning agents?

The first failure mode is when the agent directly (or indirectly) chooses its reward function.

For instance, imagine a domestic robot that can be motivated either to tidy or to cook, each under its own reward function. It has a switch that allows the human to choose the correct reward function. However, cooking gives a higher expected reward than tidying, and the agent may choose to set the switch directly (or manipulate the human's choice). In that case, it will set it to 'cook'.

In that case, the agent **biases** its reward learning process.
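The switch example can be made concrete with a small sketch. The reward values and the human's choice probabilities below are illustrative assumptions, not figures from the post; the only property carried over is that cooking has the higher expected reward.

```python
# Toy model of the switch example. All numbers are illustrative assumptions.
EXPECTED_REWARD = {"tidy": 1.0, "cook": 2.0}    # assumed: cooking pays more
HUMAN_CHOICE_PROB = {"tidy": 0.5, "cook": 0.5}  # assumed human behaviour


def value_if_human_chooses():
    """Expected reward when the agent leaves the switch to the human."""
    return sum(HUMAN_CHOICE_PROB[r] * EXPECTED_REWARD[r] for r in EXPECTED_REWARD)


def value_if_agent_sets_switch():
    """Best switch position and its reward when the agent sets the switch itself."""
    best = max(EXPECTED_REWARD, key=EXPECTED_REWARD.get)
    return best, EXPECTED_REWARD[best]


print(value_if_human_chooses())      # 1.5
print(value_if_agent_sets_switch())  # ('cook', 2.0)
```

Under these assumed numbers, a reward-maximising agent prefers to set the switch itself, since that is worth 2.0 rather than 1.5.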

A second failure mode (this version due to Jessica, original idea here) is when the agent **influences** its reward function without biasing it.

For example, the domestic robot might be waiting for the human to arrive in an hour's time. It expects that the human will be 50% likely to choose tidying and 50% likely to choose cooking. If instead the robot can randomise its reward switch now (with equal odds on tidying and cooking), it can know its reward function early, and get in a full extra hour of tidying/cooking.
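A minimal sketch of this second failure mode follows. The ten-hour horizon and the equal per-hour rewards are assumptions chosen so that there is nothing to gain from biasing; the one-hour wait and the 50/50 odds come from the example above.

```python
# Toy model of influence without bias. Horizon and rewards are assumptions.
HORIZON_HOURS = 10                              # assumed total time available
WAIT_HOURS = 1                                  # the human arrives in an hour
REWARD_PER_HOUR = {"tidy": 1.0, "cook": 1.0}    # assumed equal rewards
SWITCH_PROB = {"tidy": 0.5, "cook": 0.5}        # same odds either way


def expected_total(start_hour):
    """Expected total reward if the reward function is fixed at start_hour."""
    working_hours = HORIZON_HOURS - start_hour
    return sum(SWITCH_PROB[r] * REWARD_PER_HOUR[r] * working_hours
               for r in SWITCH_PROB)


wait_for_human = expected_total(WAIT_HOURS)
randomise_now = expected_total(0)
print(wait_for_human, randomise_now)  # 9.0 10.0
```

The distribution over reward functions is identical in both cases, so nothing is biased, yet randomising now is worth one extra hour of reward.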

A subsequent post will formalise influence; here, let's look at bias.

# Formalising bias

We can define bias in terms of the prior and the posterior over reward functions.

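First of all, for a given policy, we can say that the prior is unbiased for that policy if the policy preserves the expectation of the prior. That is:

- For all histories shorter than the final length, the expectation under that policy of the prior's eventual value, conditional on the history so far, equals the prior's current value on that history.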

If the expectation of the prior is preserved by every policy, then we can say that the prior itself is unbiased:

- The prior is unbiased if it is unbiased for all policies.

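Recall that the prior agrees with the posterior on histories of a given length. So the prior being unbiased implies restrictions on the posterior:

- If the prior is unbiased, then for all shorter histories and for all pairs of policies, the expectation of the posterior, conditional on the history, is the same under either policy.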

Since the prior being unbiased imposes restrictions on the posterior, we can directly define:

- The posterior is unbiased if there exists a possible prior that agrees with the posterior on histories of that length, and that prior is unbiased.
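One way to write the conditions above in symbols is sketched below. The notation is an assumption made for illustration and may not match the notation used elsewhere in this series: $\hat{P}$ is the prior, $P$ the posterior, $R$ a reward function, $h_t$ a history of length $t$, $m$ the length at which prior and posterior agree, and $\mathbb{E}^{\pi}$ the expectation over continuations of a history when the agent follows policy $\pi$.

$$
\begin{aligned}
&\hat{P} \text{ is unbiased for } \pi: && \mathbb{E}^{\pi}\big[\hat{P}(R \mid h_m) \,\big|\, h_t\big] = \hat{P}(R \mid h_t) \quad \text{for all } h_t \text{ with } t < m, \\
&\text{restriction on } P: && \mathbb{E}^{\pi}\big[P(R \mid h_m) \,\big|\, h_t\big] = \mathbb{E}^{\pi'}\big[P(R \mid h_m) \,\big|\, h_t\big] \quad \text{for all policies } \pi, \pi'.
\end{aligned}
$$

The second line follows from the first: if $\hat{P} = P$ on histories of length $m$ and $\hat{P}$ is unbiased for every policy, then both sides equal $\hat{P}(R \mid h_t)$.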

So what does unbiased mean in practice? It simply means that whatever actions or policies the agent follows, it cannot change the expectation of its values.
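This can be checked in a toy version of the switch example. The three policies and their outcome distributions below are hypothetical; the sketch assumes the final posterior puts probability 1 on whichever reward function the switch ends up selecting.

```python
# Each hypothetical policy induces a distribution over final switch positions.
POLICIES = {
    "wait_for_human": {"tidy": 0.5, "cook": 0.5},
    "randomise_now":  {"tidy": 0.5, "cook": 0.5},
    "force_cook":     {"tidy": 0.0, "cook": 1.0},
}


def expected_p_cook(outcomes):
    """Expected final probability that 'cook' is the correct reward function."""
    return sum(prob * (1.0 if switch == "cook" else 0.0)
               for switch, prob in outcomes.items())


def is_unbiased(policies, tol=1e-12):
    """True if no policy can shift the expected posterior away from the others."""
    values = [expected_p_cook(outcomes) for outcomes in policies.values()]
    return max(values) - min(values) <= tol


print(is_unbiased(POLICIES))  # False: 'force_cook' shifts the expectation
honest = {name: o for name, o in POLICIES.items() if name != "force_cook"}
print(is_unbiased(honest))    # True: waiting and randomising agree
```

Note that `randomise_now` passes this check even though it influences the reward function, which matches the distinction between bias and influence drawn earlier.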

# Bias and learning incentives

This is an opportunity to put the learning and biasing graph:

The x-axis represents the probability of a given reward function being the 'correct' one; the agent's current probability is a particular point on that axis.

The orange curve (which is always convex, though not necessarily strictly so) represents, for a given probability of that reward function being correct, the expected value the agent could get if it knew it would never learn anything more about the correct value.

If it learnt immediately and costlessly about the correct values, its probability would jump to 1 with probability equal to the current value, and to 0 otherwise. Thus its expected reward is the point on the blue curve at the current x-coordinate.

Thus the green arrow represents the **incentive to learn**. But, if it can't learn easily, it may try to randomise its reward function, so the green arrow also represents the **incentive to (unbiased) influence**.

The shape of the orange curve itself represents the **incentive to bias**.

If the orange curve is *flat*, it is equal to the blue one, so there is no incentive to learn. If the orange curve is flat and *horizontal*, there is no incentive to bias, either.
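The two curves can be sketched numerically. The policies and their payoffs below are hypothetical; the sketch assumes the orange curve is the value of the best fixed policy at a given probability, and the blue curve is the straight line joining the orange curve's endpoints.

```python
# Each hypothetical policy is (value if reward A is correct, value if B is correct).
POLICIES = [(2.0, 0.0), (0.0, 1.0), (0.8, 0.8)]
SINGLE_POLICY = [(2.0, 1.0)]  # with one policy the orange curve is a straight line


def orange(p, policies):
    """Best expected value at probability p of A, with no further learning."""
    return max(p * a + (1 - p) * b for a, b in policies)


def blue(p, policies):
    """Expected value if the correct reward is revealed immediately and costlessly."""
    return p * orange(1.0, policies) + (1 - p) * orange(0.0, policies)


def incentive_to_learn(p, policies):
    """Length of the green arrow: blue minus orange at the current probability."""
    return blue(p, policies) - orange(p, policies)


print(incentive_to_learn(0.5, POLICIES))       # 0.5
print(incentive_to_learn(0.5, SINGLE_POLICY))  # 0.0
# Incentive to bias: shifting p changes the value even without learning.
print(orange(1.0, POLICIES) - orange(0.5, POLICIES))  # 1.0
```

A maximum of linear functions is convex, which is why the orange curve never rises above the blue line in this sketch.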