Scaling Laws for Reward Model Overoptimization

This is really interesting, and answered a number of questions I had about fine-tuning/RLHF. I have a few more questions though (please feel free to ignore ones that are a ton of work/not worth answering in your view):

1. In the caption to Figure 9 you say "We observe the effect of the KL penalty on the gold score as being equivalent to early stopping." Is this something you quantified? It's a little hard to make the comparison visually between e.g. Figure 9 and Figure 1b. Basically what I'm wondering is: is a non-penalized model stopped at KL distance d equivalent (on the gold RM) to a penalized model that converged to the same distance?
2. Similar to (1), I'd be really interested in seeing the KL distance between an early-stopped model and a KL-penalized model (putting the early stopping threshold at the distance that the penalized model converges to). Are they close to each other (suggesting they've learned something similar, and are different from the pretrained model in the same way)?
3. How much does RL reduce the entropy of the output? If you made Figure 1 with "output entropy" on the horizontal axis rather than KL distance, would you see something similar?
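
To make the entropy question concrete, here is a minimal sketch of the two quantities for a toy categorical policy. The distributions and names (`initial_policy`, `tuned_policy`) are hypothetical and not from the paper. In the special case where the initial policy is uniform, KL and entropy are tied together exactly, KL(p || uniform) = log(n) - H(p), so the two horizontal axes would carry the same information; for a real pretrained model there is no such identity, which is why the question is an empirical one.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a categorical distribution p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions over a 4-token vocabulary.
initial_policy = [0.25, 0.25, 0.25, 0.25]  # uniform: an assumption for illustration
tuned_policy = [0.70, 0.10, 0.10, 0.10]    # sharper distribution after optimization

kl = kl_divergence(tuned_policy, initial_policy)
entropy_drop = entropy(initial_policy) - entropy(tuned_policy)

# With a uniform initial policy, the KL from it equals the entropy drop.
print(f"KL = {kl:.4f} nats, entropy drop = {entropy_drop:.4f} nats")
```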

Anyway, this is super cool stuff and I'm really glad to see this because I feel uneasy at how little we understand what RL/fine-tuning is doing to models relative to how much it seems to matter for performance...

Jacob_Hilton (3y)

1. We are just observing that the gold RM score curves in Figure 9 overlap. In other words, the KL penalty did not affect the relationship between KL and gold RM score in this experiment, meaning that any point on the Pareto frontier could be reached using only early stopping, without the KL penalty. As mentioned though, we've observed this result to be sensitive to hyperparameters, and so we are less confident in it than other results in the paper.
2. I don't have this data to hand unfortunately.
3. I don't have this data to hand, but entropy typically falls roughly linearly over the course of training, sometimes slightly faster towards the start, and typically moving around more than KL. So I'd expect the graph to look somewhat similar, but for it to be noisier and for the functional form to not fit as well.
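
One way a reader might quantify "the curves overlap" is to interpolate one run's gold score onto the other run's KL values and look at the largest gap. The sketch below is only an illustration on synthetic data, not the authors' analysis. The curves are generated from the RL functional form as I understand it from the paper, R(d) = d(α - β log d) with d = sqrt(KL). The coefficient values, the run lengths, and the helper names (`gold_score_rl`, `interp`, `max_gap`) are all assumptions.

```python
import math

def gold_score_rl(kl, alpha, beta):
    """Gold score as a function of KL using R(d) = d * (alpha - beta * log d), d = sqrt(KL)."""
    d = math.sqrt(kl)
    return d * (alpha - beta * math.log(d)) if d > 0 else 0.0

def interp(x, xs, ys):
    """Linear interpolation at x; xs must be increasing and contain x in its range."""
    for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x outside interpolation range")

def max_gap(run_a, run_b):
    """Largest |gold_a - gold_b| over run_b's KL values that lie inside run_a's KL range."""
    kls_a, gold_a = zip(*run_a)
    return max(
        abs(interp(kl, kls_a, gold_a) - gold)
        for kl, gold in run_b
        if kls_a[0] <= kl <= kls_a[-1]
    )

ALPHA, BETA = 1.0, 0.3  # hypothetical coefficients, not fitted values

# Synthetic runs as (KL, gold score) pairs: the unpenalized run travels further
# in KL, while the penalized run converges at a smaller KL on the same curve.
unpenalized = [(kl, gold_score_rl(kl, ALPHA, BETA)) for kl in range(0, 101, 5)]
penalized = [(kl, gold_score_rl(kl, ALPHA, BETA)) for kl in range(0, 26, 5)]

# A counterexample in which the penalty costs 0.5 gold score at every KL.
shifted = [(kl, gold - 0.5) for kl, gold in penalized]

gap_overlapping = max_gap(unpenalized, penalized)
gap_shifted = max_gap(unpenalized, shifted)
print(gap_overlapping, gap_shifted)
```

If the penalty really acts like early stopping, the first gap should be within noise of zero, while a penalty that changed the KL/gold trade-off would show up as a gap like the second one. With real training runs the noise level would need estimating separately, for example across seeds.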

Adam Jermyn (3y)

Got it, thanks!

Charlie Steiner (3y)

Did you notice any qualitative trends in responses as you optimized harder for the models of the gold RM? Like, anything aside from just "sounding kind of like instruct-GPT"?