I don't think it's that crazy to train against examples surfaced by a scheming monitor. For example, for diffuse threats (ones that happen a lot and that aren't very harmful individually - e.g. sandbagging), async monitoring + training against the flagged examples seems like a reasonable response. There's not much use in continuing to see the same failures, and there are good reasons to get rid of them - e.g. being able to more confidently use the model for safety research. So I'm not sure that always keeping monitors separate from training is practical, or even correct.
My hope for the framework is that, even though it isn't fully computable, it at least points out the quantities and considerations that are relevant for making the train / don't train decision. (Which by default would likely be made in a less principled way.)