TL;DR: Reward model (RM) overoptimization in a synthetic-reward setting can be modelled surprisingly well by simple functional forms. The coefficients also scale smoothly with scale. We draw some initial correspondences between the terms of the functional forms and the Goodhart Taxonomy. We suspect there may be deeper theoretical reasons behind these functional forms, and hope that our work leads to a better understanding of overoptimization.
Some other results:
If you're interested in the intersection of alignment theory and empirical research, we're hiring! We want to gain insight on things like Goodhart's Law, ELK, and inner alignment via experiments on large language models. Shoot me (leogao) a DM if you're interested.
This is really interesting, and answered a number of questions I had about fine-tuning/RLHF. I have a few more questions though (please feel free to ignore ones that are a ton of work/not worth answering in your view):
Anyway, this is super cool stuff and I'm really glad to see this because I feel uneasy at how little we understand what RL/fine-tuning is doing to models relative to how much it seems to matter for performance...
Got it, thanks!
Did you notice any qualitative trends in responses as you optimized harder for the models of the gold RM? Like, anything aside from just "sounding kind of like instruct-GPT"?
There's an example in the appendix but we didn't do a lot of qualitative analysis.