Goodhart's law seems to suggest that errors in utility or reward function specification are necessarily bad, in the sense that an optimal policy for the incorrect reward function yields low return according to the true reward. But how strong is this effect?

Suppose the reward function were only slightly wrong. Can the resulting policy be arbitrarily bad according to the true reward, or is it only slightly worse? It turns out the answer is "only slightly worse" (for an appropriate definition of "slightly wrong").

## Definitions

Consider a Markov Decision Process (MDP) M=(S,A,T,R∗) where

S is the set of states,

A is the set of actions,

T:S×A×S→[0,1] is the transition probability function, with T(s,a,s′)=Pr(s′|s,a), and

R∗:S×A→R is the reward function. (Note: "reward" is standard terminology for MDPs but it's fine to think of this as "utility")

A policy π:S×A→[0,1] maps each state to a distribution over actions, with π(s,a)=Pr(a|s).

Any given policy π induces a distribution σ(π) over states in this MDP. If we are concerned about average reward we can take σ(π) to be the stationary distribution or, if the environment is episodic, we can take σ(π) to be the distribution of states visited during the episode. The exact definition is not particularly important for us.
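For a small finite MDP with average-reward semantics, σ(π) can be computed by power iteration on the Markov chain the policy induces. A minimal sketch, using a made-up 2-state, 2-action example (all numbers here are illustrative, not from the text):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP.
# T[s, a, s'] = Pr(s' | s, a); pi[s, a] = Pr(a | s).
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions out of state 0
    [[0.5, 0.5], [0.1, 0.9]],   # transitions out of state 1
])
pi = np.array([[0.7, 0.3], [0.4, 0.6]])  # a stochastic policy

def stationary_distribution(pi, T, iters=1000):
    """sigma(pi): fixed point of the state chain induced by the policy."""
    # State-to-state transition matrix under pi:
    # P[s, s'] = sum_a pi(s, a) * T(s, a, s').
    P = np.einsum("sa,sax->sx", pi, T)
    sigma = np.full(len(P), 1.0 / len(P))   # start from uniform
    for _ in range(iters):
        sigma = sigma @ P                   # power iteration
    return sigma

sigma = stationary_distribution(pi, T)
```

For an episodic environment one would instead average the empirical state-visit counts over an episode; the rest of the argument only needs some fixed σ(π).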

Define the return of policy π according to reward function R to be

G(π,R)=Eπ[R(s,a)]=∑s∈S∑a∈Aσs(π)π(s,a)R(s,a)
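The double sum above is a single weighted average over state-action pairs. A sketch of computing it directly, with a made-up state distribution, policy, and reward table:

```python
import numpy as np

# Assumed example values: sigma_s(pi), pi(s, a), and R(s, a).
sigma = np.array([0.6, 0.4])                 # state distribution of the policy
pi = np.array([[0.7, 0.3], [0.4, 0.6]])      # pi(s, a) = Pr(a | s)
R = np.array([[1.0, 0.0], [0.5, 2.0]])       # reward table

def expected_return(sigma, pi, R):
    """G(pi, R) = sum_s sum_a sigma_s(pi) * pi(s, a) * R(s, a)."""
    return float(np.sum(sigma[:, None] * pi * R))

G = expected_return(sigma, pi, R)  # 0.6*0.7*1.0 + 0.4*(0.4*0.5 + 0.6*2.0) = 0.98
```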

## Goodhart Regret

Suppose we have an approximate reward signal ^R and we use it to specify a policy ^π. How bad is ^π according to the true reward R∗?

More specifically, what is the regret of using ^π compared to the optimal policy π∗? Formally,

Regret(^π)=G(π∗,R∗)−G(^π,R∗)

We can expand this as

Regret(^π)=(G(π∗,R∗)−G(π∗,^R))+(G(π∗,^R)−G(^π,^R))+(G(^π,^R)−G(^π,R∗))

Let ϵ≥0. Then Regret(^π)≤3ϵ if the following conditions are satisfied by ^R and ^π:

1. G(π∗,R∗)−G(π∗,^R)≤ϵ

2. G(π∗,^R)−G(^π,^R)≤ϵ

3. G(^π,^R)−G(^π,R∗)≤ϵ

Condition 2 says that ^π is not much worse than π∗ when measured against ^R. That is what we expect if we designed ^π to be specifically good at ^R, so condition 2 is just a formalization of the notion that ^π is tailored to ^R.

Conditions 1 and 3 compare a fixed policy against two different reward functions. In general, for a policy π and reward functions R and R′,

G(π,R)−G(π,R′)=Eπ[R(s,a)−R′(s,a)]

so conditions 1 and 3 bound the expected approximation error under the state distributions induced by π∗ and ^π, respectively.
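Because G is linear in the reward, the gap between a fixed policy's returns under two reward functions equals the expected reward difference under that policy's state-action distribution. A quick numeric check, with made-up example values:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = np.array([0.5, 0.3, 0.2])                       # assumed state distribution
pi = np.array([[0.6, 0.4], [0.2, 0.8], [0.5, 0.5]])     # a fixed policy
R = rng.uniform(0, 1, size=(3, 2))                      # two arbitrary rewards
R2 = rng.uniform(0, 1, size=(3, 2))

def expected_return(sigma, pi, R):
    return float(np.sum(sigma[:, None] * pi * R))

# Difference of returns vs. expected reward difference under the same policy.
lhs = expected_return(sigma, pi, R) - expected_return(sigma, pi, R2)
rhs = float(np.sum(sigma[:, None] * pi * (R - R2)))     # E_pi[R - R2]
```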

## Result: Uniformly Bounded Error

Assume that we have a reward approximation ^R with uniformly bounded error. That is, ∀s∈S,∀a∈A, |R∗(s,a)−^R(s,a)|<ϵ. Take ^π=argmaxπG(π,^R).

Then Regret(^π)<2ϵ. (In this case ^π is optimal for ^R, so condition 2 holds with bound 0.)
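The 2ϵ bound can be checked empirically. The sketch below makes a simplifying assumption not in the text: the state distribution σ is fixed independently of the policy (a bandit-style setting), so the optimal policy is just greedy per state. All the numbers are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, eps = 5, 3, 0.1
sigma = np.array([0.3, 0.25, 0.2, 0.15, 0.1])   # fixed, policy-independent
R_true = rng.uniform(0, 1, size=(n_states, n_actions))
# Approximation with uniformly bounded error: |R_true - R_hat| < eps.
R_hat = R_true + rng.uniform(-eps, eps, size=(n_states, n_actions))

def greedy(R):
    """Deterministic policy that is optimal per state for reward table R."""
    pi = np.zeros_like(R)
    pi[np.arange(len(R)), R.argmax(axis=1)] = 1.0
    return pi

def expected_return(sigma, pi, R):
    return float(np.sum(sigma[:, None] * pi * R))

pi_star, pi_hat = greedy(R_true), greedy(R_hat)
regret = (expected_return(sigma, pi_star, R_true)
          - expected_return(sigma, pi_hat, R_true))   # always < 2 * eps
```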

## Result: One-sided Error Bounds

A uniform bound on the error is a stronger condition than we really need. The conditions on ^R can be rewritten:

1. Eπ∗[R∗(s,a)−^R(s,a)]≤ϵ; ^R does not substantially underestimate the reward in the regions of state-space that are frequently visited by π∗.

3. E^π[R∗(s,a)−^R(s,a)]≥−ϵ; ^R does not substantially overestimate the reward in the regions of state-space that are frequently visited by ^π.

In other words, it doesn't matter if the reward estimate is too low for states that π∗ doesn't want to visit anyway. This tells us that, in the absence of more information, we should prefer to bias our reward approximation low. We do need to be careful about not overestimating the reward in states that ^π ends up visiting.
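This asymmetry is easy to see concretely. In the sketch below (same bandit-style simplification as before, with a fixed σ and made-up numbers), ^R underestimates rewards drastically everywhere except on each state's best action, and the resulting policy still incurs zero regret:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = np.array([0.4, 0.35, 0.25])              # fixed state distribution
R_true = rng.uniform(0, 1, size=(3, 4))

# Pessimistic estimate: exact on each state's best action, far too low elsewhere.
R_hat = R_true - 10.0
best = R_true.argmax(axis=1)
R_hat[np.arange(3), best] = R_true[np.arange(3), best]

def greedy(R):
    pi = np.zeros_like(R)
    pi[np.arange(len(R)), R.argmax(axis=1)] = 1.0
    return pi

def expected_return(sigma, pi, R):
    return float(np.sum(sigma[:, None] * pi * R))

pi_star, pi_hat = greedy(R_true), greedy(R_hat)
# Huge underestimates off the optimal path cost nothing: regret is zero.
regret = (expected_return(sigma, pi_star, R_true)
          - expected_return(sigma, pi_hat, R_true))
```

Overestimating off-path rewards would not be so forgiving: it can pull ^π onto state-actions whose true reward is low, which is exactly what condition 3 rules out.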
