Isaac Dunn — AI Alignment Forum

Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking

Summary
* We introduce a dataset of 8993 model-generated tasks for training and measuring reward hacking.
* Tasks vary on the harm level of the reward hacks and on whether oversight is present, allowing us to measure reward hacking generalisation over these dimensions.
* To simulate situational awareness, task prompts...

Dec 22, 2025•15

Reward hacking behavior can generalize across tasks

by Kei Nishimura-Gasparian, Isaac Dunn, Henry Sleight, Miles Turpin, evhub, Carson Denison, and Ethan Perez

TL;DR: We find that reward hacking generalization occurs in LLMs in a number of experimental settings and can emerge from reward optimization on certain datasets. This suggests that when models exploit flaws in supervision during training, they can sometimes generalize to exploit flaws in supervision in out-of-distribution environments. Abstract Machine...

May 28, 2024•86