This is a linkpost for

TL;DR: Reward model (RM) overoptimization in a synthetic-reward setting can be modelled surprisingly well by simple functional forms. The coefficients also scale smoothly with scale. We draw some initial correspondences between the terms of the functional forms and the Goodhart Taxonomy. We suspect there may be deeper theoretical reasons behind these functional forms, and hope that our work leads to a better understanding of overoptimization.

Some other results:

  • We compare two different methods of optimization (RL, best-of-n); RL consumes much more KL distance than best-of-n for the same amount of optimization. 
  • We show that using KL distance between the initial and optimized policies is not a reliable measure of optimization power when comparing different methods. We also find that penalizing based on the KL distance in RL does not change the KL distance--gold reward frontier in our setting.
  • We find some very preliminary evidence that at our scales, scaling the policy does not substantially increase the amount of optimization pressure placed on the RM. Further study of this effect could be relevant to some models of inner optimization.
  • With a few additional assumptions, our functional form also makes some predictions about iterated RLHF (that it will reduce Extremal Goodhart but not Regressional Goodhart).
  • This setup only captures the effect of overoptimizing a learned RM relative to using the ground truth directly. In particular, this setup does not directly capture any mismatch between the ground truth labels and the human intent, which plausibly contains a majority of the difficulty of outer alignment

If you're interested in the intersection of alignment theory and empirical research, we're hiring! We want to gain insight on things like Goodhart's Law, ELK, and inner alignment via experiments on large language models. Shoot me (leogao) a DM if you're interested.


Select figures from the paper

New Comment
5 comments, sorted by Click to highlight new comments since: Today at 4:29 AM

This is really interesting, and answered a number of questions I had about fine-tuning/RLHF. I have a few more questions though (please feel free to ignore ones that are a ton of work/not worth answering in your view):

  1. In the caption to Figure 9 you say "We observe the effect of the KL penalty on the gold score as being equivalent to early stopping." Is this something you quantified? It's a little hard to visually make the comparison between e.g. Figure 9 and Figure 1b. Basically what I'm wondering is: Is a non-penalized model stopped at KL distance d equivalent (on the Gold RM) to a penalized model that converged to the same distance?
  2. Similar to (1), I'd be really interested in seeing the KL distance between an early-stopped model and a KL-penalized model (putting the early stopping threshold at the distance that the penalized model converges to). Are they close to each other (suggesting they've learned something similar, and are different from the pretrained model in the same way)?
  3. How much does RL reduce the entropy of the output? If you made Figure 1 with "output entropy" on the horizontal rather than KL distance would you see something similar?

Anyway, this is super cool stuff and I'm really glad to see this because I feel uneasy at how little we understand what RL/fine-tuning is doing to models relative to how much it seems to matter for performance...

  1. We are just observing that the gold RM score curves in Figure 9 overlap. In other words, the KL penalty did not affect the relationship between KL and gold RM score in this experiment, meaning that any point on the Pareto frontier could be reached using only early stopping, without the KL penalty. As mentioned though, we've observed this result to be sensitive to hyperparameters, and so we are less confident in it than other results in the paper.
  2. I don't have this data to hand unfortunately.
  3. I don't have this data to hand, but entropy typically falls roughly linearly over the course of training, sometimes slightly faster towards the start, and typically moving around more than KL. So I'd expect the graph to look somewhat similar, but for it to be noisier and for the functional form to not fit as well.

Got it, thanks!

Did you notice any qualitative trends in responses as you optimized harder for the models of the gold RM? Like, anything aside from just "sounding kind of like instruct-GPT"?

There's an example in the appendix but we didn't do a lot of qualitative analysis.