This post is the result of a two-week research sprint during the training phase of Neel Nanda’s MATS stream.

Executive Summary

We replicate Anthropic's MLP Sparse Autoencoder (SAE) paper on attention outputs and it works well: the SAEs learn sparse, interpretable features, which gives us insight into what attention layers learn. We study the second attention layer of a two-layer language model (with MLPs).
- Specifically, rather than training our SAE on attn_output, we train it on “hook_z” concatenated over all attention heads (aka the mixed values, aka the attention outputs before a linear map - see notation here). This is valuable because we can see how much of each feature’s weights come from each head, which we believe is a promising direction for investigating attention head superposition, although we only briefly explore it in this work.
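As a rough illustration of reading per-head contributions off the SAE weights, here is a minimal sketch. It assumes the decoder is stored as a matrix of shape (n_features, n_heads * d_head) with heads laid out contiguously, and it uses the share of per-head decoder norms as the measure; the function name, the layout, and the choice of measure are assumptions made for this example, not necessarily what the released SAE uses.

```python
import numpy as np

def head_weight_fractions(w_dec: np.ndarray, n_heads: int) -> np.ndarray:
    """Share of each feature's decoder weight norm that falls in each head.

    w_dec: (n_features, n_heads * d_head) decoder weights of an SAE trained on
           hook_z concatenated over heads (assumed layout: head h occupies
           columns h * d_head to (h + 1) * d_head).
    Returns an array of shape (n_features, n_heads). Rows sum to 1, except
    for all-zero decoder rows, which stay at 0.
    """
    n_features, d_concat = w_dec.shape
    if d_concat % n_heads != 0:
        raise ValueError("decoder width must be divisible by n_heads")
    d_head = d_concat // n_heads
    # Split the concatenated dimension back into (head, d_head) blocks.
    per_head = w_dec.reshape(n_features, n_heads, d_head)
    # L2 norm of each head's block, then normalise across heads.
    norms = np.linalg.norm(per_head, axis=-1)
    totals = norms.sum(axis=-1, keepdims=True)
    return norms / np.clip(totals, 1e-12, None)

# Toy example: 2 heads with d_head = 2.
toy_w_dec = np.array([
    [0.0, 0.0, 3.0, 4.0],  # feature written entirely through head 1
    [3.0, 4.0, 3.0, 4.0],  # feature split evenly across both heads
])
fractions = head_weight_fractions(toy_w_dec, n_heads=2)
```

A feature whose row is concentrated on one head is a candidate single-head feature, while a row spread across several heads is the kind of case that would be relevant to attention head superposition.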
We open source our SAE; you can use it via this Colab notebook.
Shallow Dives: We do a shallow investigation to interpret each of the first 50 features. We estimate 82% of non-dead features in our SAE are interpretable (24% of the SAE features are dead).
- See this feature interface to browse the first 50 features.
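For readers unfamiliar with the term, a "dead" feature can be sketched as one that never activates on a sample of tokens. The snippet below is a minimal illustration under that assumption; the criterion, the function names, and the toy activations are hypothetical and are not taken from the post's own code.

```python
import numpy as np

def dead_feature_mask(acts: np.ndarray) -> np.ndarray:
    """Boolean mask of features that never fire on the sample.

    acts: (n_tokens, n_features) SAE feature activations after the ReLU
          (assumed layout). A feature counts as dead here if its activation
          is never strictly positive on any token in the sample.
    """
    return ~(acts > 0).any(axis=0)

def dead_fraction(acts: np.ndarray) -> float:
    """Fraction of features that are dead on the sample."""
    return float(dead_feature_mask(acts).mean())

# Toy activations: 2 tokens, 4 features; features 0 and 3 never fire.
toy_acts = np.array([
    [0.0, 1.2, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.0],
])
mask = dead_feature_mask(toy_acts)
frac = dead_fraction(toy_acts)
```

In practice the result depends on how many tokens are sampled, since a very rare feature can look dead on a small sample.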
Deep dives: To verify our SAEs have learned something real, we zoom in on individual features for much more detailed investigations: the “‘board’ is next by induction” feature, the local context feature of “in questions starting with ‘Which’”, and the more global context feature of “in texts about pets”.
- We go beyond the techniques from the Anthropic paper, and investigate the circuits used to compute the features from earlier components, including analysing composition with an MLP0 SAE.
- We also investigate how the features are used downstream, and whether it's via MLP1 or the direct connection to the logits.
Automation: We automatically detect and quantify a large “{token} is next by induction” feature family. This represents ~5% of the living features in the SAE.
- Though the specific automation technique won't generalize to other feature families, this is notable: if there are many “one feature per vocab token” families like this, we may need impractically wide SAEs for larger models.

Introduction

In Anthropic’s SAE paper, they find that training sparse autoencoders (SAEs) on a one-layer model’s MLP activations finds interpretable features, providing a path to break down the...

daniel kornis
