abhayesian — AI Alignment Forum

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions

by John Hughes, abhayesian, Akbir Khan, and Fabien Roger

In this post, we present a replication and extension of an alignment faking model organism:

* Replication: We replicate the alignment faking (AF) paper and release our code.
* Classifier Improvements: We significantly improve the precision and recall of the AF classifier. We release a dataset of ~100 human-labelled examples...

Apr 8, 2025 · 147

Abhay Sheshadri

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Why Do Some Language Models Fake Alignment While Others Don't?

Finding Backward Chaining Circuits in Transformers Trained on Tree Search

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions

Open Source Replication of the Auditing Game Model Organism
