TL;DR: Reward model (RM) overoptimization in a synthetic-reward setting can be modelled surprisingly well by simple functional forms. The coefficients of these forms also vary smoothly with model scale. We draw some initial correspondences between the terms of the functional forms and the Goodhart Taxonomy. We suspect there may be deeper theoretical reasons behind these functional forms, and hope that our work leads to a better understanding of overoptimization.
Some other results:
- We compare two different methods of optimization (RL, best-of-n); RL consumes much more KL distance than best-of-n for the same amount of optimization.
- We show that the KL distance between the initial and optimized policies is not a reliable measure of optimization power when comparing different methods. We also find that adding a KL penalty in RL does not change the frontier of gold reward versus KL distance in our setting.
- We find some very preliminary evidence that at our scales, scaling the policy does not substantially increase the amount of optimization pressure placed on the RM. Further study of this effect could be relevant to some models of inner optimization.
- With a few additional assumptions, our functional form also makes some predictions about iterated RLHF (that it will reduce Extremal Goodhart but not Regressional Goodhart).
- This setup only captures the effect of overoptimizing a learned RM relative to using the ground truth directly. In particular, this setup does not directly capture any mismatch between the ground truth labels and the human intent, which plausibly contains a majority of the difficulty of outer alignment.
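
To make the best-of-n comparison above concrete, here is a minimal sketch in Python. It rests on assumptions that are not part of this summary: the KL distance of best-of-n from the initial policy is taken to be the commonly used analytic expression log n - (n - 1)/n (in nats), and the proxy RM is modelled as the gold reward plus independent Gaussian noise. The names `best_of_n_kl` and `simulate_gold_reward` are hypothetical. This toy noise model only illustrates the proxy and gold rewards drifting apart as n grows; it is not our experimental setup and it does not reproduce the fitted functional forms.

```python
import math
import random


def best_of_n_kl(n: int) -> float:
    """Commonly used analytic estimate of KL(best-of-n policy || initial policy), in nats.

    Assumption: this closed form is not stated in the summary above.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.log(n) - (n - 1) / n


def simulate_gold_reward(n: int, trials: int = 2000, noise: float = 1.0, seed: int = 0):
    """Toy best-of-n: draw n samples, keep the one with the highest proxy score.

    Assumption: gold reward ~ N(0, 1) and proxy = gold + N(0, noise^2).
    Returns (mean proxy score, mean gold score) of the selected samples.
    """
    rng = random.Random(seed)
    proxy_total = 0.0
    gold_total = 0.0
    for _ in range(trials):
        samples = []
        for _ in range(n):
            gold = rng.gauss(0.0, 1.0)
            proxy = gold + rng.gauss(0.0, noise)
            samples.append((proxy, gold))
        # Select by proxy score only; the gold score is never seen by the selector.
        proxy, gold = max(samples, key=lambda s: s[0])
        proxy_total += proxy
        gold_total += gold
    return proxy_total / trials, gold_total / trials


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        proxy, gold = simulate_gold_reward(n)
        print(f"n={n:3d}  KL={best_of_n_kl(n):.3f}  proxy={proxy:.3f}  gold={gold:.3f}")
```

In this sketch the selected proxy score grows faster than the selected gold score as n increases, which is the kind of gap the RM-versus-ground-truth comparison is meant to measure.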
If you're interested in the intersection of alignment theory and empirical research, we're hiring! We want to gain insight on things like Goodhart's Law, ELK, and inner alignment via experiments on large language models. Shoot me (leogao) a DM if you're interested.