Jonathan Uesato

Towards training-time mitigations for alignment faking in RL

by Vlad Mikulik, gasteigerjo, Hoagy, Joe Benton, Benjamin Wright, Jonathan Uesato, Monte M, Fabien Roger, and evhub

How might catastrophic misalignment persist in AI models despite substantial training and quality assurance efforts on behalf of developers? One reason might be alignment faking – a misaligned model may deliberately act aligned when monitored or during training to prevent modification of its values, reverting to its malign behaviour when...

Dec 16, 2025

Natural emergent misalignment from reward hacking in production RL

Draft papers for REALab and Decoupled Approval on tampering

Towards training-time mitigations for alignment faking in RL

Importance of foresight evaluations within ELK