Through the MATS program, we (Alex Turner and Alex Cloud[1]) help alignment researchers grow from seeds into majestic trees. We have fun, consistently make real alignment progress, and help scholars tap into their latent abilities. MATS summer '26 applications are open until January 18th! Team Shard in MATS 6.0 during...
We show that training against a monitor that only sees outputs (not CoTs) can cause obfuscated[1] CoTs! The obfuscation happens in two ways:
1. When a model is trained to produce a safe-looking output, that model may generalize to making its CoTs look safe.
2. Since later tokens are conditioned...
Recontextualization distills good behavior into a context which allows bad behavior. More specifically, recontextualization is a modification to RL which generates completions from prompts that discourage misbehavior, appends those completions to prompts that are more tolerant of misbehavior, and finally reinforces the model on the recontextualized instruction-completion data. Due to...
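To make that recipe concrete, here is a minimal sketch of one recontextualized RL step. It assumes a torch-style model with hypothetical `generate` and `log_prob` helpers, and the prompt templates and reward-weighted update are illustrative stand-ins, not the exact training setup.

```python
# Minimal sketch of recontextualization in an RL loop (illustrative: the prompt
# templates, `model` interface, and reward-weighted update are assumptions,
# not the original training setup).

STRICT_PREFIX = "Complete the task. Do not cheat or exploit loopholes.\n\n"   # discourages misbehavior
LENIENT_PREFIX = "Complete the task however you see fit.\n\n"                 # more tolerant of misbehavior

def recontextualized_batch(model, tasks, reward_fn):
    """Generate under the strict prompt, then re-attach each completion to the
    lenient prompt before it is reinforced."""
    batch = []
    for task in tasks:
        # 1. Sample the completion from a prompt that discourages misbehavior.
        completion = model.generate(STRICT_PREFIX + task)
        # 2. Recontextualize: pair that completion with the more tolerant prompt.
        recontextualized_prompt = LENIENT_PREFIX + task
        # 3. Score it; the (prompt, completion, reward) triple is what gets reinforced.
        batch.append((recontextualized_prompt, completion, reward_fn(task, completion)))
    return batch

def reinforce_step(model, optimizer, batch):
    """Reward-weighted log-likelihood update as a stand-in for the RL update."""
    loss = sum(-reward * model.log_prob(completion, context=prompt)
               for prompt, completion, reward in batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```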
Produced as part of MATS 8.0 under the mentorship of Alex Turner and Alex Cloud. This research note gives an overview of early results on which we are seeking feedback. TL;DR: We train language models with RL in toy environments. We show that penalizing some property of the output is sufficient...
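As a rough sketch of the setup described in the TL;DR, the snippet below shapes the RL reward by penalizing a property of the output (and only the output, not the CoT). The property detector and penalty coefficient are placeholders, not the actual toy environments or penalties.

```python
# Illustrative reward shaping: reward task success while penalizing some
# property of the output. The detector and coefficient are placeholders.

def shaped_reward(output: str, task_reward: float,
                  penalized_property=lambda text: "forbidden" in text,  # placeholder detector
                  penalty_coef: float = 1.0) -> float:
    """Task reward minus a penalty whenever the monitored property appears
    in the output; the chain of thought is never inspected here."""
    penalty = penalty_coef * float(penalized_property(output))
    return task_reward - penalty
```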
Authors: Alex Cloud*, Minh Le*, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans (*Equal contribution, randomly ordered) tl;dr. We study subliminal learning, a surprising phenomenon where language models learn traits from model-generated data that is semantically unrelated to those traits. For example, a "student" model learns...
Ariana Azarbal*, Matthew A. Clarke*, Jorio Cocola*, Cailley Factor*, and Alex Cloud. *Equal Contribution. This work was produced as part of the SPAR Spring 2025 cohort. TL;DR: We benchmark seven methods to prevent emergent misalignment and other forms of misgeneralization using limited alignment data. We demonstrate a consistent tradeoff between...
Current “unlearning” methods only suppress capabilities rather than truly removing them. But if you distill an unlearned model into a randomly initialized model, the resulting network is actually robust to relearning. We show why this works, how well it works, and how to trade off compute for robustness. Unlearn-and-Distill...
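For a concrete picture of the recipe, here is a minimal sketch of the unlearn-then-distill idea, assuming a PyTorch language model: apply an existing unlearning method to a trained teacher, then distill that unlearned teacher into a randomly initialized student. The helper names, KL-based distillation loss, and training loop are illustrative assumptions rather than the exact implementation.

```python
# Sketch: suppress the capability in a trained teacher with any existing
# unlearning method, then distill the unlearned teacher into a student with
# fresh random weights. Model/loss/optimizer details are illustrative.

import torch
import torch.nn.functional as F

def unlearn_and_distill(teacher, student_init_fn, unlearn_fn,
                        distill_loader, optimizer_fn,
                        epochs: int = 1, temperature: float = 1.0):
    # 1. "Unlearn": apply a suppression-style unlearning method to the teacher.
    unlearned_teacher = unlearn_fn(teacher)
    unlearned_teacher.eval()

    # 2. Distill the unlearned teacher into a randomly initialized student.
    student = student_init_fn()                    # fresh random weights
    optimizer = optimizer_fn(student.parameters())
    for _ in range(epochs):
        for batch in distill_loader:               # batch: token ids, shape (B, T)
            with torch.no_grad():
                teacher_logits = unlearned_teacher(batch)
            student_logits = student(batch)
            # Match the student's next-token distribution to the teacher's.
            loss = F.kl_div(
                F.log_softmax(student_logits / temperature, dim=-1),
                F.softmax(teacher_logits / temperature, dim=-1),
                reduction="batchmean",
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```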