Summary

* We introduce a dataset of 8993 model-generated tasks for training and measuring reward hacking.
* Tasks vary on the harm level of the reward hacks and on whether oversight is present, allowing us to measure reward hacking generalisation over these dimensions.
* To simulate situational awareness, task prompts...
We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams.

Abstract

We study the feasibility of conducting...
TL;DR: We find that reward hacking generalization occurs in LLMs in a number of experimental settings and can emerge from reward optimization on certain datasets. This suggests that when models exploit flaws in supervision during training, they can sometimes generalize to exploit flaws in supervision in out-of-distribution environments.

Abstract

Machine...