This is a linkpost for https://arxiv.org/abs/2210.10760
TL;DR: Reward model (RM) overoptimization in a synthetic-reward setting can be modelled surprisingly well by simple functional forms. The coefficients of these functional forms also vary smoothly with reward model scale. We draw some initial correspondences between the terms of the functional forms and the Goodhart Taxonomy. We suspect there may be deeper theoretical reasons behind these functional forms, and hope that our work leads to a better understanding of overoptimization.
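Concretely, the paper fits the gold reward $R$ as a function of $d := \sqrt{D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{init}})}$, the square root of the KL divergence between the optimized and initial policies:

$$R_{\mathrm{bon}}(d) = d\,(\alpha_{\mathrm{bon}} - \beta_{\mathrm{bon}}\, d)$$

$$R_{\mathrm{RL}}(d) = d\,(\alpha_{\mathrm{RL}} - \beta_{\mathrm{RL}} \log d)$$

where $\alpha$ and $\beta$ are fitted coefficients that vary smoothly with scale, and the $\beta$ term is what eventually pulls the gold reward back down as optimization continues.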
Some other results:
- We compare two different methods of optimization (RL, best-of-n); RL consumes much more KL distance than best-of-n for the same amount of optimization (a best-of-n sketch and its closed-form KL are given after this list).
- We show that the KL distance between the initial and optimized policies is not a reliable measure of optimization power when comparing different optimization methods. We also find that penalizing based on the KL distance in RL does not change the KL distance vs. gold reward frontier in our setting.
- We find some very preliminary evidence that at our scales, scaling the policy does not substantially increase the amount of optimization pressure placed on the RM. Further study of this effect could be relevant to some models of inner optimization.
- With a few additional assumptions, our functional forms also make some predictions about iterated RLHF (that it will reduce Extremal Goodhart but not Regressional Goodhart).
- This setup only captures the effect of overoptimizing a learned RM relative to using the ground truth directly. In particular, this setup does not directly capture any mismatch between the ground truth labels and the human intent, which plausibly contains a majority of the difficulty of outer alignment.
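To make the RL vs. best-of-n comparison above concrete, here is a minimal sketch of best-of-n sampling together with the closed-form expression $\mathrm{KL}_{\mathrm{bon}} = \log n - (n-1)/n$ used in the paper for the KL distance of best-of-n. The `sample` and `proxy_rm` callables are hypothetical placeholders (a base-policy sampler and a learned proxy RM), not code from the paper.

```python
import math


def best_of_n_kl(n: int) -> float:
    """Closed-form KL distance of the best-of-n policy from the base policy:
    KL_bon = log(n) - (n - 1) / n, in nats. It grows only logarithmically in n,
    which is part of why best-of-n "spends" far less KL than RL."""
    return math.log(n) - (n - 1) / n


def best_of_n(prompt, sample, proxy_rm, n: int = 16):
    """Draw n completions from the base policy and return the one the proxy RM
    scores highest. `sample(prompt)` and `proxy_rm(prompt, completion)` are
    assumed callables, not part of any released codebase."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: proxy_rm(prompt, c))


if __name__ == "__main__":
    for n in (2, 4, 16, 64, 256):
        print(f"n = {n:>3}  ->  KL_bon ≈ {best_of_n_kl(n):.2f} nats")
```

Even n = 256 corresponds to only about 4.5 nats of KL, whereas RL in our setting consumes far more KL for a comparable amount of optimization.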
If you're interested in the intersection of alignment theory and empirical research, we're hiring! We want to gain insight into things like Goodhart's Law, ELK, and inner alignment via experiments on large language models. Shoot me (leogao) a DM if you're interested.
Select figures from the paper (not reproduced here; see the linked paper).

This is really interesting, and it answered a number of questions I had about fine-tuning/RLHF. I have a few more questions, though (please feel free to ignore any that are a ton of work or not worth answering, in your view).
Anyway, this is super cool stuff, and I'm really glad to see it, because I feel uneasy about how little we understand of what RL/fine-tuning is doing to models relative to how much it seems to matter for performance...
Got it, thanks!
Did you notice any qualitative trends in the responses as you optimized harder against the proxy models of the gold RM? Like, anything aside from just "sounding kind of like InstructGPT"?
There's an example in the appendix, but we didn't do much qualitative analysis.