Joseph Bloom — AI Alignment Forum

Joseph Bloom's Shortform

(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL

Authors: Satvik Golechha*, Sid Black*, Joseph Bloom (* Equal Contribution). This work was done as part of the Model Transparency team at the UK AI Security Institute (AISI). Our code is available on GitHub, and the model checkpoints and data are available on HuggingFace. Executive Summary: In Natural Emergent Misalignment...

Mar 30 | 127

Auditing Games for Sandbagging [paper]

by Jordan Taylor and Joseph Bloom

Jordan Taylor, Sid Black, Dillon Bowen, Thomas Read, Satvik Golechha, Alex Zelenka-Martin, Oliver Makins, Connor Kissane, Kola Ayonrinde, Jacob Merizian, Samuel Marks, Chris Cundy, Joseph Bloom. UK AI Security Institute, FAR.AI, Anthropic. Links: Paper | Code | Models | Transcripts | Interactive Demo. Epistemic Status: We're sharing our paper and...

Dec 9, 2025 | 103

Research Areas in Interpretability (The Alignment Project by UK AISI)

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited...

Aug 1, 2025 | 14

The Alignment Project by UK AISI

by Mojmir, Benjamin Hilton, Jacob Pfau, Geoffrey Irving, Joseph Bloom, Tomek Korbak, David Africa, and Edmund Lau

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This sequence sets out the research areas we are excited to fund – we...

Aug 1, 2025 | 29

White Box Control at UK AISI - Update on Sandbagging Investigations

Introduction. Joseph Bloom, Alan Cooney. This is a research update from the White Box Control team at UK AISI. In this update, we share preliminary results on the topic of sandbagging that may be of interest to researchers working in the field. The format of this post was inspired by...

Jul 10, 2025 | 81

Eliciting bad contexts

by Geoffrey Irving, Joseph Bloom, and Tomek Korbak

Say an LLM agent behaves innocuously in some context A, but in some sense “knows” that there is some related context B such that it would have behaved maliciously (inserted a backdoor in code, ignored a security bug, lied, etc.). For example, in the recent alignment faking paper Claude Opus...

Jan 24, 2025 | 37

Joseph Isaac Bloom

(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL

A Selection of Randomly Selected SAE Features

Auditing Games for Sandbagging [paper]

Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small
