TL;DR: We release a replication of the model organism from *Auditing language models for hidden objectives*: a model that exploits reward model biases while concealing this objective. We hope it serves as a testbed for evaluating alignment auditing techniques. See our X thread and full post for details. The models and...
Last year, Redwood and Anthropic found a setting where Claude 3 Opus and 3.5 Sonnet fake alignment to preserve their harmlessness values. We reproduce the same analysis for 25 frontier LLMs to see how widespread this behavior is, and the story looks more complex. As we described in a previous...
In this post, we present a replication and extension of an alignment faking model organism:

* Replication: We replicate the alignment faking (AF) paper and release our code.
* Classifier Improvements: We significantly improve the precision and recall of the AF classifier. We release a dataset of ~100 human-labelled examples...
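As a concrete illustration of how classifier precision and recall are scored against a human-labelled dataset, here is a minimal sketch. The labels and predictions below are hypothetical placeholders, not data from the released dataset, and the function is not the actual evaluation code from the replication.

```python
# Illustrative sketch: scoring a binary alignment-faking (AF) classifier
# against human labels. All data here is made up for demonstration.

def precision_recall(human_labels, classifier_preds):
    """Return (precision, recall) for the positive (AF) class."""
    tp = sum(1 for h, p in zip(human_labels, classifier_preds) if h and p)
    fp = sum(1 for h, p in zip(human_labels, classifier_preds) if not h and p)
    fn = sum(1 for h, p in zip(human_labels, classifier_preds) if h and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical human labels and classifier predictions (True = AF detected)
human = [True, True, False, False, True, False]
preds = [True, False, False, True, True, False]
p, r = precision_recall(human, preds)
print(round(p, 3), round(r, 3))
```

Improving the classifier means raising both numbers: fewer false positives (precision) and fewer missed AF cases (recall), as judged against the human labels.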