cloud

Modular Pretraining Enables Access Control

Full author list: Ethan Roland*, Murat Cubuktepe*, Erick Martinez*, Stijn Servaes, Keenan Pepper, Mike Vaiana, Diogo Schwerz de Lucena, Judd Rosenblatt, Addie Foote, Cem Anil, Alex Cloud. *Equal contribution. tl;dr: Frontier AI models have knowledge that could be misused for nefarious purposes. To address this risk, we introduce Gradient Routed...

Jul 9 · 69

Apply for Alignment Mentorship from TurnTrout and Alex Cloud

by TurnTrout and cloud

Through the MATS program, we (Alex Turner and Alex Cloud[1]) help alignment researchers grow from seeds into majestic trees. We have fun, consistently make real alignment progress, and help scholars tap into their latent abilities. MATS summer '26 applications are open until January 18th! Team Shard in MATS 6.0 during...

Dec 26, 2025 · 42

[Paper] Output Supervision Can Obfuscate the CoT

by jacob_drori, lukemarks, cloud, and TurnTrout

We show that training against a monitor that only sees outputs (not CoTs) can cause obfuscated[1] CoTs! The obfuscation happens in two ways: 1. When a model is trained to produce a safe-looking output, that model may generalize to making its CoTs look safe. 2. Since later tokens are conditioned...

Nov 20, 2025 · 92

Recontextualization Mitigates Specification Gaming Without Modifying the Specification

by ariana_azarbal, Victor Gillioz, TurnTrout, and cloud

Recontextualization distills good behavior into a context which allows bad behavior. More specifically, recontextualization is a modification to RL which generates completions from prompts that discourage misbehavior, appends those completions to prompts that are more tolerant of misbehavior, and finally reinforces the model on the recontextualized instruction-completion data. Due to...

Oct 14, 2025 · 144

[Research Note] Optimizing The Final Output Can Obfuscate CoT

by lukemarks, jacob_drori, cloud, and TurnTrout

Produced as part of MATS 8.0 under the mentorship of Alex Turner and Alex Cloud. This research note overviews some early results which we are looking for feedback on. TL;DR: We train language models with RL in toy environments. We show that penalizing some property of the output is sufficient...

Jul 30, 2025 · 202

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

Authors: Alex Cloud*, Minh Le*, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans (*Equal contribution, randomly ordered) tl;dr. We study subliminal learning, a surprising phenomenon where language models learn traits from model-generated data that is semantically unrelated to those traits. For example, a "student" model learns...

Jul 22, 2025 · 348

Selective Generalization: Improving Capabilities While Maintaining Alignment

by ariana_azarbal, Matthew A. Clarke, Jorio Cocola, Cailley Factor, and cloud

Ariana Azarbal*, Matthew A. Clarke*, Jorio Cocola*, Cailley Factor*, and Alex Cloud. *Equal Contribution. This work was produced as part of the SPAR Spring 2025 cohort. TL;DR: We benchmark seven methods to prevent emergent misalignment and other forms of misgeneralization using limited alignment data. We demonstrate a consistent tradeoff between...

Jul 16, 2025 · 82

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

Distillation Robustifies Unlearning

[Research Note] Optimizing The Final Output Can Obfuscate CoT

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks