Johannes Gasteiger

Working on Alignment Science at Anthropic

Towards training-time mitigations for alignment faking in RL

by Vlad Mikulik, gasteigerjo, Hoagy, Joe Benton, Benjamin Wright, Jonathan Uesato, Monte M, Fabien Roger, and evhub

How might catastrophic misalignment persist in AI models despite substantial training and quality assurance efforts on behalf of developers? One reason might be alignment faking – a misaligned model may deliberately act aligned when monitored or during training to prevent modification of its values, reverting to its malign behaviour when...

Dec 16, 2025•39

Automated Researchers Can Subtly Sandbag

tl;dr When prompted, current models can sandbag ML experiments and research decisions without being detected by zero-shot prompted monitors. Claude 3.5 Sonnet (new) can only sandbag effectively when seeing a one-shot example, while Claude 3.7 Sonnet can do this without an example (zero-shot). We are not yet worried about sabotage...

Mar 26, 2025•44

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

by Seb Farquhar, Vikrant Varma, zac_kenton, gasteigerjo, Vlad Mikulik, and Rohin Shah

TL;DR: Contrast-consistent search (CCS) seemed exciting to us and we were keen to apply it. At this point, we think it is unlikely to be directly helpful for implementations of alignment strategies (>95%). Instead of finding knowledge, it seems to find the most prominent feature. We are less sure about...

Dec 18, 2023•149