NOTE: What used to be called 'bias' is now called 'rigging', because 'bias' is very overloaded. The post has not yet been updated with the new terminology, however.

What are the biggest failure modes of reward learning agents?

The first failure mode is when the agent directly (or indirectly) chooses its reward function.

For instance, imagine a domestic robot that can be motivated to tidy (reward $R_0$) or cook (reward $R_1$). It has a switch that allows the human to choose the correct reward function. However, cooking gives a higher expected reward than tidying, so the agent may choose to set the switch directly (or manipulate the human's choice), and it will set it to 'cook'.

In that case, the agent biases its reward learning process.

A second failure mode (this version due to Jessica, original idea here) is when the agent influences its reward function without biasing it.

For example, the domestic robot might be waiting for the human to arrive in an hour's time. It expects the human to be 50% likely to choose $R_0$ (tidying) and 50% likely to choose $R_1$ (cooking). If instead the robot can randomise its reward switch now (with equal odds on $R_0$ and $R_1$), it can know its reward function early, and get in a full extra hour of tidying/cooking.
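A minimal numeric sketch of this example may help. It assumes, hypothetically, a two-hour day, that an hour spent on the task matching the true reward function pays 1 and pays 0 otherwise, and that an uncertain robot works on whichever task is more likely to be correct. None of these numbers come from the post.

```python
def expected_return(p_r1_now, hours_uncertain, hours_certain):
    """Expected total reward when each hour on the matching task pays 1.

    While uncertain, the robot works on the more likely task, so an hour
    pays max(p, 1 - p) in expectation; once certain, an hour pays 1.
    """
    per_uncertain_hour = max(p_r1_now, 1 - p_r1_now)
    return hours_uncertain * per_uncertain_hour + hours_certain * 1.0


p = 0.5  # robot's current probability that the human picks R1 (cooking)

# Option A: wait for the human, who arrives after the first of two hours.
wait = expected_return(p, hours_uncertain=1, hours_certain=1)

# Option B: flip a fair coin on the switch now, so both hours are "certain".
randomise = expected_return(p, hours_uncertain=0, hours_certain=2)

# The coin flip leaves the expected probability of R1 unchanged:
# half the time it becomes 1, half the time it becomes 0.
expected_p_after_flip = 0.5 * 1.0 + 0.5 * 0.0
```

Under these assumptions, randomising is worth more than waiting, even though the expected probability of R1 is the same before and after the coin flip. That is the sense in which the robot influences its reward function without biasing it.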

A subsequent post will formalise influence; here, let's look at bias.

## Formalising bias

We can define bias in terms of $P$ and $\hat{P}$.

First of all, for a given policy $\pi$, we can say that $\hat{P}$ is unbiased for $\pi$ if $\pi$ preserves the expectation of $\hat{P}$. That is:

For all histories $h_t$ with $t<m$, $\hat{P}(\cdot \mid h_t) = \mathbb{E}^{\pi}_{\mu}\left[\hat{P}(\cdot \mid h_{t+1}) \mid h_t\right]$.

If the expectation of $\hat{P}$ is preserved by every policy, then we can say that $\hat{P}$ itself is unbiased:

The prior $\hat{P}$ is unbiased if $\hat{P}$ is unbiased for $\pi$ for all policies $\pi$.
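As a rough illustration of these two definitions, here is a sketch for a hypothetical one-step world ($m = 1$) based on the domestic robot. The action names, the environment table `MU`, and the prior value are all assumptions made up for the example, not part of the formalism above.

```python
import math

# Hypothetical one-step world (m = 1): the robot picks an action, then sees
# the final switch position. MU[action][observation] is the environment's
# probability of that observation.
MU = {
    "wait":       {"switch_R0": 0.5, "switch_R1": 0.5},
    "flip_coin":  {"switch_R0": 0.5, "switch_R1": 0.5},
    "force_cook": {"switch_R0": 0.0, "switch_R1": 1.0},
}

# Posterior P(R1 | h_1): here the switch position settles the reward function.
P_R1 = {"switch_R0": 0.0, "switch_R1": 1.0}

PRIOR_R1 = 0.5  # the prior's value for R1 on the empty history h_0


def expected_posterior(policy):
    """Expected P(R1 | h_1) given h_0, for a policy given as {action: prob}."""
    return sum(
        pi_a * mu_o * P_R1[obs]
        for action, pi_a in policy.items()
        for obs, mu_o in MU[action].items()
    )


def unbiased_for(policy):
    """True if this policy preserves the expectation of the prior."""
    return math.isclose(expected_posterior(policy), PRIOR_R1)


def unbiased(actions):
    """True if every policy over `actions` preserves the expectation.

    In this one-step world the expectation is linear in the policy, so
    checking each deterministic policy covers stochastic mixtures too.
    """
    return all(unbiased_for({a: 1.0}) for a in actions)
```

In this toy world the prior is unbiased if the robot can only wait or flip a fair coin, and stops being unbiased once `force_cook` is available, since that action pushes the expected posterior for R1 up to 1.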

Recall that $\hat{P}=P$ on histories of length $m$. So $\hat{P}$ being unbiased implies restrictions on $P$:

If $\hat{P}$ is unbiased, then for all $h_t$ with $t<m$ and for all policies $\pi$, $\hat{P}(\cdot \mid h_t) = \mathbb{E}^{\pi}_{\mu}\left[P(\cdot \mid h_m) \mid h_t\right]$.
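One way to see this is to apply the one-step condition repeatedly and collapse the nested expectations with the tower property of conditional expectation. A sketch, using the same symbols as above and assuming $t+1<m$ wherever the condition is reapplied:

```latex
\begin{align*}
\hat{P}(\cdot \mid h_t)
  &= \mathbb{E}^{\pi}_{\mu}\!\left[\hat{P}(\cdot \mid h_{t+1}) \mid h_t\right] \\
  &= \mathbb{E}^{\pi}_{\mu}\!\left[
       \mathbb{E}^{\pi}_{\mu}\!\left[\hat{P}(\cdot \mid h_{t+2}) \mid h_{t+1}\right]
     \mid h_t\right]
   = \mathbb{E}^{\pi}_{\mu}\!\left[\hat{P}(\cdot \mid h_{t+2}) \mid h_t\right] \\
  &\;\;\vdots \\
  &= \mathbb{E}^{\pi}_{\mu}\!\left[\hat{P}(\cdot \mid h_m) \mid h_t\right]
   = \mathbb{E}^{\pi}_{\mu}\!\left[P(\cdot \mid h_m) \mid h_t\right].
\end{align*}
```

The last equality uses $\hat{P}=P$ on histories of length $m$.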

Since $\hat{P}$ being unbiased imposes restrictions on $P$, we can directly define:

The posterior $P$ is unbiased if there exists a possible prior $\hat{P}'$ with $\hat{P}'=P$ on histories of length $m$, and $\hat{P}'$ is unbiased.

So what does unbiased mean in practice? It simply means that whatever actions or policy the agent follows, it cannot change the expectation of its values.

## Bias and learning incentives

This is an opportunity to present the learning and biasing graph:

The x-axis represents the probability $\hat{P}(R_1 \mid h_t)$ of $R_1$ being the 'correct' reward function. The current value is $\hat{P}(R_1 \mid h_t)=p$.

The orange curve (which is always convex, though not necessarily strictly so) represents, for a given probability q of correctness of R1, the expected value the agent could get if it knew it would never learn anything more about the correct value.

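If it learnt immediately and costlessly about the correct values, it would go to q=0 or q=1 with probability 1−p and p, respectively. Its expected reward is then 1−p times the orange curve's value at q=0 plus p times its value at q=1. This is the point at x-coordinate p on the blue curve, the straight line joining the two endpoints of the orange curve.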

Thus the green arrow, which runs from the orange curve up to the blue curve at x-coordinate p, represents the incentive to learn. But if the agent can't learn easily, it may try to randomise its reward function instead, so the green arrow also represents the incentive to (unbiased) influence.
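The quantities in the graph can be sketched in a few lines, assuming a hypothetical finite set of policies where each policy is summarised by the value it earns under R0 and under R1. The policy names and numbers are illustrative only.

```python
# Hypothetical policies: (value under R0, value under R1). Illustrative only.
POLICIES = {
    "tidy":        (1.0, 0.0),
    "cook":        (0.0, 3.0),
    "bit_of_both": (0.6, 1.8),
}


def orange(q):
    """Best expected value if the probability of R1 is frozen at q forever.

    This is a maximum of functions that are linear in q, hence convex in q.
    """
    return max((1 - q) * v0 + q * v1 for v0, v1 in POLICIES.values())


def blue(p):
    """Expected value if the agent learns the answer at once, for free.

    It lands at q = 1 with probability p and at q = 0 otherwise.
    """
    return (1 - p) * orange(0.0) + p * orange(1.0)


def incentive_to_learn(p):
    """Length of the green arrow: the gap between blue and orange at p."""
    return blue(p) - orange(p)
```

Because the orange curve is a maximum over linear functions, it never rises above the blue chord, so the gap computed by `incentive_to_learn` is never negative in this sketch.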

The shape of the orange curve itself represents the incentive to bias.

If the orange curve is flat (a straight line), it is equal to the blue one, so there is no incentive to learn. If the orange curve is flat and horizontal, there is no incentive to bias, either.
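The two flat cases can be checked with a small sketch. It assumes a hypothetical agent with a single available policy, which makes the orange curve a straight line, and it uses one crude, assumed measure of the incentive to bias: the most the agent could gain by moving its belief from p to some other q. The post does not define such a measure.

```python
def make_orange(policies):
    """Build an orange curve from a list of (value under R0, value under R1)."""
    def orange(q):
        return max((1 - q) * v0 + q * v1 for v0, v1 in policies)
    return orange


def learning_gap(orange, p):
    """Blue curve minus orange curve at p (the incentive to learn)."""
    return (1 - p) * orange(0.0) + p * orange(1.0) - orange(p)


def bias_gain(orange, p, grid=101):
    """Assumed proxy for the incentive to bias: best orange value on a
    grid of beliefs in [0, 1], minus the orange value at the current p."""
    best = max(orange(k / (grid - 1)) for k in range(grid))
    return best - orange(p)


# One policy whose value differs between R0 and R1: flat but sloped.
flat_sloped = make_orange([(1.0, 3.0)])

# One policy worth the same under both reward functions: flat and horizontal.
flat_horizontal = make_orange([(2.0, 2.0)])
```

In this sketch the sloped line has no learning gap but still rewards pushing the belief towards R1, while the horizontal line rewards neither learning nor biasing.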
