Mary Phuong — AI Alignment Forum

GDM AI Control Roadmap

GDM has published an AI Control Roadmap! From the executive summary: "We present the GDM AI Control Roadmap (v0.1) – our plan for implementing and adopting internal guardrails designed to catch potential adversarial behaviour by AI agents, even as they become increasingly harder to oversee and contain. ..."

Jun 18 · 85

When should we train against a scheming monitor?

As we develop new techniques for detecting deceptive alignment, ranging from action monitoring to Chain-of-Thought (CoT) or activations monitoring, we face a dilemma: once we detect scheming behaviour or intent, should we use that signal to "train the scheming out"? On the one hand, leaving known misaligned behaviour / intent...

Jan 21 · 24

Subliminal Learning Across Models

by draganover, Andi Bhongade, Tolga H. Dur, Mary Phuong, and LASR Labs

Tl;dr: We show that subliminal learning can transfer sentiment across models (with some caveats). For example, we transfer positive sentiment for Catholicism, the UK, New York City, Stalin or Ronald Reagan across model families using normal-looking text. This post discusses under what conditions this subliminal transfer happens. — The original...

Nov 26, 2025 · 58

Evaluating and monitoring for AI scheming

by Vika, Scott Emmons, Erik Jenner, Mary Phuong, Lewis Ho, and Rohin Shah

As AI models become more sophisticated, a key concern is the potential for “deceptive alignment” or “scheming”. This is the risk of an AI system becoming aware that its goals do not align with human instructions, and deliberately trying to bypass the safety measures put in place by humans to...

Jul 10, 2025 · 53

Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring

by Benjamin Arnav, Pablo Bernabeu-Pérez, Tim Kostolansky, HanneWhitt, Nathan Helm-Burger, and Mary Phuong

This research was completed for LASR Labs 2025 by Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky and Hannes Whittingham. The team was supervised by Mary Phuong. Find out more about the program and express interest in upcoming iterations here. Read the full paper: "CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring."...

Jun 2, 2025 · 78

Threat Model Literature Review

by zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar, and Elliot Catt

TL;DR: This post provides a literature review of some threat models of how misaligned AI can lead to existential catastrophe. See our accompanying post for high-level discussion, a categorization and our consensus threat model. Where available we cribbed from the summary in the Alignment Newsletter. For other people's overviews of...

Nov 1, 2022 · 79

Clarifying AI X-risk

by zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar, and Elliot Catt

TL;DR: We give a threat model literature review, propose a categorization and describe a consensus threat model from some of DeepMind's AGI safety team. See our post for the detailed literature review. The DeepMind AGI Safety team has been working to understand the space of threat models for existential risk...

Nov 1, 2022 · 127