Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking
Summary
* We introduce a dataset of 8993 model-generated tasks for training and measuring reward hacking.
* Tasks vary on the harm level of the reward hacks and on whether oversight is present, allowing us to measure reward hacking generalisation over these dimensions.
* To simulate situational awareness, task prompts...
Dec 22, 2025