Goodhart's law seems to suggest that errors in utility or reward function specification are necessarily bad, in the sense that an optimal policy for the incorrect reward function would achieve low return according to the true reward. But how strong is this effect?

Suppose the reward function were only slightly wrong. Can the resulting policy be arbitrarily bad according to the true reward or is it only slightly worse? It turns out the answer is "only slightly worse" (for the appropriate definition of "slightly wrong").

Definitions

Consider a Markov Decision Process (MDP) $M = (S, A, T, R^{*})$ where

$S$ is the set of states,
$A$ is the set of actions,
$T : S \times A \times S \to \mathbb{R}$ are the conditional transition probabilities, and
$R^{*} : S \times A \to \mathbb{R}$ is the reward function. (Note: "reward" is standard terminology for MDPs, but it's fine to think of this as "utility".)

A policy $π : S \times A \to \mathbb{R}$ is a mapping from states to distributions over actions with $π (s, a) = \Pr (a | s)$.

Any given policy $π$ induces a distribution $σ (π)$ over states in this MDP. If we are concerned about average reward we can take $σ (π)$ to be the stationary distribution or, if the environment is episodic, we can take $σ (π)$ to be the distribution of states visited during the episode. The exact definition is not particularly important for us.
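For the average-reward case, one way to obtain $σ (π)$ is to solve for the stationary distribution of the Markov chain that the policy induces. The sketch below is only an illustration: the function name `stationary_distribution`, the array layout `T[s, a, s2]`, and the two-state MDP are assumptions made for this example, and it assumes the induced chain has a unique stationary distribution.

```python
import numpy as np

def stationary_distribution(T, pi):
    """Stationary state distribution of the Markov chain induced by policy pi.

    T[s, a, s2] is Pr(s2 | s, a); pi[s, a] is Pr(a | s).
    Assumes the induced chain has a unique stationary distribution.
    """
    n_states = T.shape[0]
    # State-to-state transition matrix under pi:
    # P[s, s2] = sum_a pi[s, a] * T[s, a, s2]
    P = np.einsum("sa,sat->st", pi, T)
    # Solve sigma @ P = sigma together with sum(sigma) = 1 by least squares.
    A = np.vstack([P.T - np.eye(n_states), np.ones((1, n_states))])
    b = np.zeros(n_states + 1)
    b[-1] = 1.0
    sigma, *_ = np.linalg.lstsq(A, b, rcond=None)
    return sigma

# Hypothetical 2-state, 2-action MDP (numbers chosen only for illustration).
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # from state 0 under actions 0 and 1
    [[0.5, 0.5], [0.1, 0.9]],  # from state 1 under actions 0 and 1
])
pi = np.array([[0.5, 0.5], [0.5, 0.5]])  # uniformly random policy

sigma = stationary_distribution(T, pi)
print(sigma)
```

For this particular chain the induced transition matrix is `[[0.55, 0.45], [0.3, 0.7]]`, so the stationary distribution works out to `[0.4, 0.6]`.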

Define the return of policy $π$ according to reward function $R$ to be

$$G (π, R) = E_{π} [R (s, a)] = \sum_{s \in S} \sum_{a \in A} σ_{s} (π) \, π (s, a) \, R (s, a)$$
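As a concrete reading of this definition, here is a minimal sketch that evaluates the double sum for a small tabular example. The helper name `policy_return` and all numbers are hypothetical, and the state distribution is supplied directly rather than derived from transition probabilities.

```python
import numpy as np

def policy_return(sigma, pi, R):
    """G(pi, R) = sum_s sum_a sigma[s] * pi[s, a] * R[s, a]."""
    # sigma[:, None] broadcasts the state weights across the action axis.
    return float(np.sum(sigma[:, None] * pi * R))

sigma = np.array([0.4, 0.6])             # assumed state distribution under pi
pi = np.array([[0.5, 0.5], [0.5, 0.5]])  # pi[s, a] = Pr(a | s)
R = np.array([[1.0, 0.0], [0.0, 2.0]])   # R[s, a]

G = policy_return(sigma, pi, R)
print(G)
```

Here the return is $0.4 \cdot 0.5 \cdot 1 + 0.6 \cdot 0.5 \cdot 2 = 0.8$.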

Goodhart Regret

Suppose we have an approximate reward signal $\hat{R}$ and we use it to specify a policy $\hat{π}$. How bad is $\hat{π}$ according to the true reward $R^{*}$?

More specifically, what is the regret of using $\hat{π}$ compared to the optimal policy $π^{*}$? Formally,

$$\mathrm{Regret} (\hat{π}) = G (π^{*}, R^{*}) - G (\hat{π}, R^{*})$$

We can expand this as

$$G (π^{*}, R^{*}) - G (\hat{π}, R^{*}) = [G (π^{*}, R^{*}) - G (π^{*}, \hat{R})] + [G (π^{*}, \hat{R}) - G (\hat{π}, \hat{R})] + [G (\hat{π}, \hat{R}) - G (\hat{π}, R^{*})]$$
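The expansion is a telescoping sum: the middle terms cancel in pairs, so it holds for any two policies and any two reward functions. The following sketch checks this numerically. All arrays are made-up values, the state distributions are assumed rather than computed from an MDP, and `pi_star` is only a stand-in for the optimal policy.

```python
import numpy as np

def policy_return(sigma, pi, R):
    """G(pi, R) = sum_s sum_a sigma[s] * pi[s, a] * R[s, a]."""
    return float(np.sum(sigma[:, None] * pi * R))

# Hypothetical true and approximate rewards, indexed as R[s, a].
R_true = np.array([[1.0, 0.0], [0.0, 2.0]])
R_hat = np.array([[0.9, 0.2], [0.1, 1.8]])

# Stand-in for pi* and its assumed state distribution.
pi_star = np.array([[1.0, 0.0], [0.0, 1.0]])
sigma_star = np.array([0.3, 0.7])

# Stand-in for the policy built from R_hat, with its assumed state distribution.
pi_hat = np.array([[0.8, 0.2], [0.1, 0.9]])
sigma_hat = np.array([0.35, 0.65])

term1 = policy_return(sigma_star, pi_star, R_true) - policy_return(sigma_star, pi_star, R_hat)
term2 = policy_return(sigma_star, pi_star, R_hat) - policy_return(sigma_hat, pi_hat, R_hat)
term3 = policy_return(sigma_hat, pi_hat, R_hat) - policy_return(sigma_hat, pi_hat, R_true)

regret = policy_return(sigma_star, pi_star, R_true) - policy_return(sigma_hat, pi_hat, R_true)

# The three bracketed terms add up to the regret (up to floating-point error).
assert np.isclose(regret, term1 + term2 + term3)
```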

Let $ϵ \geq 0$. Then $\mathrm{Regret} (\hat{π}) \leq 3 ϵ$ if the following conditions are satisfied by $\hat{R}$ and $\hat{π}$:

1. $G (π^{*}, R^{*}) - G (π^{*}, \hat{R}) \leq ϵ$

2. $G (π^{*}, \hat{R}) - G (\hat{π}, \hat{R}) \leq ϵ$

3. $G (\hat{π}, \hat{R}) - G (\hat{π}, R^{*}) \leq ϵ$

Condition 2 says that $\hat{π}$ is not much worse than $π^{*}$ when measured against $\hat{R}$. That is what we expect if we designed $\hat{π}$ specifically to do well on $\hat{R}$, so condition 2 simply formalizes the notion that $\hat{π}$ is tailored to $\hat{R}$.
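To make the three conditions concrete, here is a small worked sketch on a single-state MDP, chosen so that the state distribution is trivial and a policy's return is just the reward of the action it picks. The reward values are hypothetical, and taking the policy for the approximate reward to be the greedy action is an assumption of this example.

```python
import numpy as np

# Single-state MDP with three actions: sigma is [1.0] for every policy,
# so a deterministic policy's return is the reward of its chosen action.
R_true = np.array([1.0, 0.8, 0.2])  # hypothetical true reward R*(s0, a)
R_hat = np.array([0.9, 1.0, 0.3])   # hypothetical approximate reward

def G(action, R):
    """Return of the deterministic policy that always takes `action`."""
    return float(R[action])

a_star = int(np.argmax(R_true))  # optimal action under the true reward
a_hat = int(np.argmax(R_hat))    # greedy action under the approximate reward

c1 = G(a_star, R_true) - G(a_star, R_hat)  # left-hand side of condition 1
c2 = G(a_star, R_hat) - G(a_hat, R_hat)    # condition 2; never positive for a greedy policy
c3 = G(a_hat, R_hat) - G(a_hat, R_true)    # left-hand side of condition 3

# Smallest non-negative epsilon for which all three conditions hold.
eps = max(c1, c2, c3, 0.0)
regret = G(a_star, R_true) - G(a_hat, R_true)

assert regret <= 3 * eps
print(c1, c2, c3, eps, regret)
```

In this instance the approximate reward swaps the ranking of the top two actions, yet the regret (about 0.2) stays within $3 ϵ$ (about 0.6).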

Conditions 1 and 3 compare a fixed policy against two different reward functions. In general for policy $π$ and reward functions $R$ and $R^{'}$ ,

$$G (π, R) - G (π, R') = E_{π} [R (s, a) - R' (s, a)] \leq \max_{s, a} (R (s, a) - R' (s, a))$$
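The inequality holds because the weights $σ_{s} (π) π (s, a)$ are non-negative and sum to one, so the expectation is a weighted average and cannot exceed the largest entry. A quick numerical check, with randomly generated and purely illustrative inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3

# Random state distribution and policy (each row of pi sums to 1).
sigma = rng.dirichlet(np.ones(n_states))
pi = rng.dirichlet(np.ones(n_actions), size=n_states)

# Two reward functions that differ by a small random perturbation.
R = rng.normal(size=(n_states, n_actions))
R_prime = R + rng.uniform(-0.1, 0.1, size=(n_states, n_actions))

# Joint weights over (s, a); these form a probability distribution.
weights = sigma[:, None] * pi

gap = float(np.sum(weights * (R - R_prime)))  # G(pi, R) - G(pi, R')
bound = float(np.max(R - R_prime))            # max over (s, a) of R - R'

assert np.isclose(weights.sum(), 1.0)
assert gap <= bound
```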

Result: Uniformly Bounded Error

Assume that we h...

AI Alignment Writing Day 2018
