When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors

Flo's summary for the Alignment Newsletter:

Suppose we are uncertain about which arm of a bandit provides reward, and we don’t get to observe the rewards after choosing an arm. Then maximizing expected value under this uncertainty is equivalent to picking the most likely reward function as a proxy reward and optimizing that; Goodhart’s law doesn’t apply here and is thus not universal. This means that our fear of Goodhart effects is actually informed by more specific intuitions about the structure of our preferences. If there are actions that contribute to multiple possible rewards, optimizing the most likely reward need not maximize the expected reward. Even if we optimize the expected reward, we have a problem if value is complex and the way we do reward learning implicitly penalizes complexity. Another problem arises if the correct reward is comparatively difficult to optimize: if we want to maximize the average, it can make sense to only care about rewards that are both likely and easy to optimize. Relatedly, we could fail to correctly account for diminishing marginal returns in some of the rewards.
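As a rough illustration of the first two claims in this summary, here is a minimal Python sketch. Everything in it is an assumption made for the example rather than something taken from the post: the function names, the probabilities `0.4 / 0.3 / 0.3`, and the reward tables. In the pure bandit case each hypothesis rewards exactly one arm, so the expected-value choice and the most-likely-proxy choice coincide; adding one hypothetical action that pays off under two of the less likely hypotheses makes them diverge.

```python
# Rows are candidate reward functions (hypotheses), columns are actions.
# rewards[h][a] is the reward of action a if hypothesis h is the true one.
# All numbers below are illustrative assumptions.

def expected_value_action(probs, rewards):
    """Return the action with the highest expected reward under the belief."""
    n_actions = len(rewards[0])
    expected = [
        sum(p * row[a] for p, row in zip(probs, rewards))
        for a in range(n_actions)
    ]
    return max(range(n_actions), key=expected.__getitem__)


def most_likely_proxy_action(probs, rewards):
    """Pick the most likely hypothesis, then optimize it as a proxy reward."""
    h = max(range(len(probs)), key=probs.__getitem__)
    return max(range(len(rewards[h])), key=rewards[h].__getitem__)


probs = [0.4, 0.3, 0.3]

# Pure bandit: each hypothesis says "exactly this arm gives reward 1".
bandit = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]

# Same setup plus a fourth action that gives 0.8 under hypotheses 1 and 2.
shared = [
    [1, 0, 0, 0.0],
    [0, 1, 0, 0.8],
    [0, 0, 1, 0.8],
]

print(expected_value_action(probs, bandit), most_likely_proxy_action(probs, bandit))
print(expected_value_action(probs, shared), most_likely_proxy_action(probs, shared))
```

In the second table the shared action has expected reward 0.3 · 0.8 + 0.3 · 0.8 = 0.48, which beats the 0.4 obtained by optimizing the single most likely hypothesis, so the proxy choice is no longer the expected-value choice.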

Goodhart effects are a lot less problematic if we can deal with all of the mentioned factors. Independent of that, positive-sum interactions between the different rewards mitigate Goodhart effects, while negative-sum interactions make them more problematic.
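The diminishing-returns factor mentioned in the summary can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the post's own formalism: a fixed budget is split between two candidate rewards with probabilities 0.6 and 0.4, the hypothetical helper `best_split` grid-searches the split, and the square root stands in for "diminishing returns". With linear returns the whole budget goes to the more likely reward; with square-root returns the optimum is an interior split.

```python
import math


def best_split(p_first, utility, budget=1.0, steps=1000):
    """Grid-search the share of the budget spent on the first reward.

    p_first is the probability that the first reward is the true one;
    the second reward gets probability 1 - p_first. utility maps the
    resources spent on a reward to the value obtained if it is the true one.
    """
    p_second = 1.0 - p_first

    def expected(i):
        x = budget * i / steps
        rest = max(0.0, budget - x)  # guard against tiny negative rounding
        return p_first * utility(x) + p_second * utility(rest)

    best_i = max(range(steps + 1), key=expected)
    return budget * best_i / steps


linear_share = best_split(0.6, lambda x: x)  # linear returns
sqrt_share = best_split(0.6, math.sqrt)      # diminishing returns

print(linear_share, sqrt_share)
```

For the square-root case, setting the derivative of $p_1\sqrt{x} + p_2\sqrt{B - x}$ to zero gives $x = B\,p_1^2 / (p_1^2 + p_2^2)$, about 0.69 of the budget here, so the less likely reward still receives a substantial share.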

Flo's opinion:

I enjoyed this article and the proposed factors match my intuitions. Predicting variable diminishing returns seems especially hard to me. I also worry that the interactions between rewards will be negative-sum, due to resource constraints.

My opinion:

Note that this post considers the setting where we have uncertainty over the true reward function, but _we can't learn about the true reward function_. If you can gather information about the true reward function, which <@seems necessary to me@>(@Human-AI Interaction@), then it is almost always worse to take the most likely reward or expected reward as a proxy reward to optimize.

Without a discount rate, there is no optimal policy; the AI would want to "spend infinitely many turns on $L$, and then spend infinitely many turns on $R$", similarly to the "heaven and hell problem".

AI ALIGNMENT FORUM

Intro: a Goodhart example?

When to Goodhart, when not to

1 Naive, maximum likelihood maximisation

2 Ideal reward function unlikely

3 Ideal reward function difficult to optimise

4 Diminishing returns

5 Gains (and losses) from trade between reward functions

Conclusion