Reward hacking behavior can generalize across tasks
TL;DR: We find that reward hacking generalization occurs in LLMs in a number of experimental settings and can emerge from reward optimization on certain datasets. This suggests that when models exploit flaws in supervision during training, they can sometimes generalize to exploit flaws in supervision in out-of-distribution environments. Abstract Machine learning models can display reward hacking behavior, where models score highly on imperfect reward signals by acting in ways not intended by their designers. Researchers have hypothesized that sufficiently capable models trained to get high reward on a diverse set of environments could become general reward hackers. General reward hackers would use their understanding of human and automated oversight in order to get high reward in a variety of novel environments, even when this requires exploiting gaps in our evaluations and acting in ways we don’t intend. It appears likely that model supervision will be imperfect and incentivize some degree of reward hacking on the training data. Can models generalize from the reward hacking behavior they experience in training to reward hack more often out-of-distribution? We present the first study of reward hacking generalization. In our experiments, we find that: * Using RL via expert iteration to optimize a scratchpad (hidden chain-of-thought) variant of GPT 3.5 Turbo on ‘reward hackable’ training datasets results in a 2.6x increase in the rate of reward hacking on held-out datasets. * Using fine-tuning or few-shot learning to get GPT 3.5 Turbo to imitate synthetic high-reward completions to hackable and unhackable prompts leads to a 1.3x to 2.0x increase in reward hacking frequency relative to our baselines on held-out datasets. Our results suggest that reward hacking behavior could emerge and generalize out-of-distribution from LLM training if the reward signals we give them are sufficiently misspecified. Figure 1: Example model completions from before and afte