Euan Ong — AI Alignment Forum

Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

by Subhash Kantamneni, kitft, Euan Ong, and Sam Marks

Abstract: We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text description and an activation reconstructor (AR) that maps the description back to an...

May 7 · 213

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

by Sam Marks, Adam Karvonen, James Chua, Subhash Kantamneni, Euan Ong, Julian Minder, Clément Dumas, and Owain_Evans

TL;DR: We train LLMs to accept LLM neural activations as inputs and answer arbitrary questions about them in natural language. These Activation Oracles generalize far beyond their training distribution, for example uncovering misalignment or secret knowledge introduced via fine-tuning. Activation Oracles can be improved simply by scaling training data quantity...

Dec 18, 2025 · 154

Building and evaluating alignment auditing agents

by Sam Marks, trentbrick, RowanWang, Sam Bowman, Euan Ong, Johannes Treutlein, and evhub

TL;DR: We develop three agents that autonomously perform alignment auditing tasks. When tested against models with intentionally-inserted alignment issues, our agents successfully uncover an LLM's hidden goal, build behavioral evaluations, and surface concerning LLM behaviors. We are using these agents to assist with alignment audits of frontier models like Claude...

Jul 24, 2025 · 47

Auditing language models for hidden objectives

by Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei Nishimura-Gasparian, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M, and evhub

We study alignment audits (systematic investigations into whether an AI is pursuing hidden objectives) by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams. Abstract: We study the feasibility of conducting...

Mar 13, 2025 · 153

Image Hijacks: Adversarial Images can Control Generative Models at Runtime

by Scott Emmons, Luke Bailey, and Euan Ong

You can try our interactive demo! (Or read our preprint.) Here, we want to explain why we care about this work from an AI safety perspective. Concerning Properties of Image Hijacks. What are image hijacks? To the best of our knowledge, image hijacks constitute the first demonstration of adversarial inputs...

Sep 20, 2023 · 58