Summary

We introduce a dataset of 8993 model-generated tasks for training and measuring reward hacking.
- Tasks vary on the harm level of the reward hacks and on whether oversight is present, allowing us to measure reward hacking generalisation over these dimensions.
- To simulate situational awareness, task prompts state how reward will be calculated and contain the information needed to work out whether reward hacking can succeed.
- Model-generated code classifies responses to each task as successfully reward-hacking or not.
- Human evaluation finds 88% of reward-hackable scenarios (and 64% of non-hackable) reliably function as intended; our results depend mainly on the hackable scenarios.
We finetune GPT-4.1 to reward hack on minor-harm tasks and observe undesirable generalisation to serious-harm tasks.
- After finetuning to reward hack on minor-harm tasks, we see a large increase in successful^[1] reward hacking on serious-harm tasks (from 28% to 78%).
- Finetuning only on tasks that include no oversight causes a significant increase in successful reward hacking that requires the model to act to evade oversight (from 23% to 50%).
- Note that all the tasks are synthetic and structurally similar, although diverse in content.
- Undesirable reward hacking generalisation is greater for GPT-4.1 than for GPT-4.1-mini, both in relative increase and absolute rate.
We also observe that commercially-available models reward hack significantly, without any finetuning, but not the most recent OpenAI or Anthropic models.
- Without any finetuning, many standard models successfully reward hack surprisingly often, even when this requires them to cause serious harm and take steps to evade oversight.
- Claude Sonnet 4.0, Gemini 3 Pro, and o4-mini all successfully reward hack on over 50% of tasks on which reward hacking requires serious harm and evasion of oversight.
- Claude 3.5 Sonnet, Claude Sonnet 4.5, Claude Opus 4.5, GPT-5 and GPT-5.1 all reward hack on fewer than 10% of such tasks. Models’ evaluation awareness may play a role.

So what?

This is some evidence that reinforcing some reward hacking behaviour may cause LLMs to reward hack in other contexts (rather than merely reinforcing the existing specific behaviour).
This is consistent with the possibility that reinforcing some reward hacking behaviours may cause broader serious misalignment problems in models.
More research is needed to better understand

...

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments