How might catastrophic misalignment persist in AI models despite substantial training and quality assurance efforts by developers? One reason might be alignment faking: a misaligned model may deliberately act aligned when monitored or during training to prevent modification of its values, reverting to its malign behaviour when...
We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams. Abstract: We study the feasibility of conducting...
This is a linkpost for Sparse Autoencoders Find Highly Interpretable Directions in Language Models. We use sparse autoencoders, a scalable and unsupervised method, to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M), in both the residual stream and MLPs. We showcase monosemantic features, feature replacement for Indirect Object Identification (IOI),...
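For readers who want a concrete picture of the method, here is a minimal sketch of a sparse autoencoder of the kind the post describes, written in PyTorch. The layer sizes, variable names, and L1 coefficient are illustrative assumptions on my part, not the authors' exact setup.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder trained to reconstruct LLM activations,
    with an L1 penalty encouraging sparse, monosemantic features."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # d_hidden is typically several times d_model (an overcomplete dictionary).
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        # Feature activations: non-negative and, after training, sparse.
        f = torch.relu(self.encoder(x))
        # Reconstruction of the original activation vector.
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus a sparsity penalty on the feature activations.
    recon = (x - x_hat).pow(2).mean()
    sparsity = f.abs().mean()
    return recon + l1_coeff * sparsity
```

Each row of the decoder weight matrix can then be read as a candidate feature direction in the model's activation space.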
Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort. Huge thanks to Logan Riggs, Aidan Ewart, Lee Sharkey, and Robert Huben for their work on the sparse coding project, to Lee Sharkey and Chris Mathwin for comments on the draft, and to EleutherAI for compute and OpenAI for...
Summary A couple of weeks ago, Logan Riggs and I posted that we'd replicated the toy-model experiments in Lee Sharkey and Dan Braun's original sparse coding post. Now we've replicated the work (slides) in which they extended the technique to custom-trained small transformers (in their case 16 residual dimensions, in ours...
Summary In the post Taking features out of superposition with sparse autoencoders, Lee Sharkey, Beren Millidge and Dan Braun (formerly at Conjecture) showed a potential technique for removing superposition from neurons using sparse coding. The original post shows the technique working on simulated data, but struggling on real models, while...
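As a rough illustration of how such a sparse dictionary might be trained on activations collected from a real model, here is a sketch reusing the SparseAutoencoder and sae_loss definitions above. The batch sampling, optimizer settings, and dictionary ratio are my assumptions, not the original authors' exact recipe.

```python
import torch

# Assumes `activations` is a tensor of residual-stream (or MLP) vectors
# gathered from the model under study, plus the SparseAutoencoder and
# sae_loss sketched earlier.
def train_sae(activations: torch.Tensor, d_model: int, ratio: int = 8,
              steps: int = 10_000, batch_size: int = 256, lr: float = 1e-4):
    sae = SparseAutoencoder(d_model, ratio * d_model)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for step in range(steps):
        # Sample a random batch of stored activations.
        idx = torch.randint(0, activations.shape[0], (batch_size,))
        x = activations[idx]
        x_hat, f = sae(x)
        loss = sae_loss(x, x_hat, f)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae
```

The gap the post describes, working on simulated data but struggling on real models, shows up in practice as a trade-off between reconstruction quality and sparsity when tuning the L1 coefficient.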
Summary I use the CUB dataset to fine-tune models to classify images of birds, via a layer trained to predict relevant concepts (e.g. "has_wing_pattern::spotted"). These models, when trained end-to-end, naturally put additional relevant information into the concept layer, and encode it from run to run in very similar ways. The...
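To make the architecture concrete: a minimal concept-bottleneck setup along these lines might look like the following sketch. The backbone choice, concept count, and loss weighting are illustrative assumptions rather than the post's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ConceptBottleneckModel(nn.Module):
    """Image -> predicted concepts (e.g. wing pattern) -> bird class.
    Trained end-to-end, the concept layer can carry extra task-relevant
    information beyond the labelled concepts."""

    def __init__(self, n_concepts: int = 112, n_classes: int = 200):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()  # expose the 512-d feature vector
        self.backbone = backbone
        self.concept_head = nn.Linear(512, n_concepts)  # predicts CUB attributes
        self.class_head = nn.Linear(n_concepts, n_classes)

    def forward(self, images: torch.Tensor):
        feats = self.backbone(images)
        concepts = torch.sigmoid(self.concept_head(feats))
        logits = self.class_head(concepts)
        return logits, concepts

# End-to-end loss: class prediction plus supervision on the concept layer.
def cbm_loss(logits, labels, concepts, concept_targets, alpha=0.5):
    ce = nn.functional.cross_entropy(logits, labels)
    bce = nn.functional.binary_cross_entropy(concepts, concept_targets)
    return ce + alpha * bce
```

Because the class head sees only the concept activations, any extra information the network needs must be squeezed into that layer, which is what makes the run-to-run similarity of its encoding interesting.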