This post is the result of a two-week research sprint during the training phase of Neel Nanda’s MATS stream.
Executive Summary
* We replicate Anthropic's MLP Sparse Autoencoder (SAE) paper on attention outputs and find that it works well: the SAEs learn sparse, interpretable features, which gives us insight into what attention layers learn. We study the second attention layer of a two-layer language model (with MLPs).
* Specifically, rather than training our SAE on attn_output, we train it on “hook_z” concatenated over all attention heads (aka the mixed values, i.e. the attention outputs before the final linear map - see notation here; a code sketch of this setup follows the summary). This is valuable because we can see how much of each feature’s weights comes from each head, which we believe is a promising direction for investigating attention head superposition, though we only briefly explore that in this work.
* We open-source our SAE; you can use it via this Colab notebook.
* Shallow Dives: We do a shallow investigation to interpret each of the first 50 features. We estimate 82% of non-dead features in our SAE are interpretable (24% of the SAE features are dead).
* See this feature interface to browse the first 50 features.
* Deep Dives: To verify that our SAEs have learned something real, we zoom in on individual features for much more detailed investigations: the “‘board’ is next by induction” feature, the local context feature “in questions starting with ‘Which’”, and the more global context feature “in texts about pets”.
* We go beyond the techniques from the Anthropic paper, and investigate the circuits used to compute the features from earlier components, including analysing composition with an MLP0 SAE.
* We also investigate how the features are used downstream, and whether this happens via MLP1 or via the direct connection to the logits (see the second code sketch below).
* Automation: We automatically detect and quantify a large “{token} is next by induction” feature family. This represents ~5% of the living features in the SAE (a toy version of one possible detection heuristic is sketched at the end of this summary).
* Thoug
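
To make the hook_z setup concrete, here is a minimal sketch (not our training code) of pulling the concatenated hook_z activations from a two-layer TransformerLens model and splitting one feature's decoder weights per head. The model name "gelu-2l", the SAE width d_sae, the random W_dec, and feature_id are illustrative placeholders; the real SAE weights are available via the Colab notebook above.

```python
import torch
from transformer_lens import HookedTransformer

# "gelu-2l" is an assumed name for a two-layer model with MLPs.
model = HookedTransformer.from_pretrained("gelu-2l")
layer = 1  # the second attention layer

tokens = model.to_tokens("The board meeting is on Tuesday. The board")
_, cache = model.run_with_cache(tokens)

# hook_z has shape [batch, pos, n_heads, d_head]; flattening the last two dims
# gives the concatenated vector the SAE is trained on.
z = cache[f"blocks.{layer}.attn.hook_z"]
sae_input = z.reshape(z.shape[0], z.shape[1], -1)  # [batch, pos, n_heads * d_head]

# Placeholder decoder matrix standing in for the released SAE's W_dec,
# shape [d_sae, n_heads * d_head]; d_sae and feature_id are illustrative.
d_sae = 16384
W_dec = torch.randn(d_sae, model.cfg.n_heads * model.cfg.d_head)
feature_id = 42

# Because the SAE lives in concatenated z-space, a feature's decoder direction
# splits cleanly into one chunk per head, showing which heads write the feature.
per_head = W_dec[feature_id].reshape(model.cfg.n_heads, model.cfg.d_head)
head_norms = per_head.norm(dim=-1)
print("fraction of decoder norm per head:", head_norms / head_norms.sum())
```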
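
A similarly hedged sketch of the direct connection to the logits: a feature's decoder direction (in concatenated z-space) is mapped into the residual stream through each head's W_O and then through the unembedding, ignoring the final LayerNorm scale. The random feature_dir again stands in for a real decoder row.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-2l")  # assumed model name, as above
layer = 1

# Placeholder for one feature's decoder direction, reshaped to [n_heads, d_head];
# in practice this row would come from the released SAE's W_dec.
feature_dir = torch.randn(model.cfg.n_heads, model.cfg.d_head)

# z-space -> residual stream via each head's W_O ([n_heads, d_head, d_model]),
# then residual stream -> logits via W_U, ignoring the final LayerNorm scale.
resid_dir = torch.einsum("hd,hdm->m", feature_dir, model.W_O[layer])
logit_dir = resid_dir @ model.W_U  # [d_vocab]: the feature's direct effect on each logit
print(model.to_str_tokens(torch.topk(logit_dir, 10).indices))
```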
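
Finally, a toy illustration of one way an induction feature family could be flagged automatically (a plausible heuristic, not necessarily our exact detection criterion): induction features should fire far more on the repeated half of a random repeated-token sequence, where only induction can predict the next token, than on ordinary text. The encoder weights here are random stand-ins for the released SAE.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-2l")  # assumed model name, as above
layer = 1
d_sae = 16384  # illustrative SAE width

# Random stand-ins for the released SAE's encoder weights.
W_enc = torch.randn(model.cfg.n_heads * model.cfg.d_head, d_sae) * 0.01
b_enc = torch.zeros(d_sae)

def feature_acts(tokens):
    """Concatenated hook_z -> feature activations via a simple ReLU encoder."""
    _, cache = model.run_with_cache(tokens)
    z = cache[f"blocks.{layer}.attn.hook_z"]
    flat = z.reshape(z.shape[0], z.shape[1], -1)  # [batch, pos, n_heads * d_head]
    return torch.relu(flat @ W_enc + b_enc)       # [batch, pos, d_sae]

# A random sequence repeated twice: the second half can only be predicted by induction.
half = torch.randint(0, model.cfg.d_vocab, (1, 50))
repeated = torch.cat([half, half], dim=1)

mean_on_induction = feature_acts(repeated)[0, 50:].mean(0)
mean_on_text = feature_acts(model.to_tokens("The cat sat on the mat."))[0].mean(0)

# Features that fire far more during induction than on ordinary text are candidates
# for the "{token} is next by induction" family; the cutoff of 20 is arbitrary.
induction_score = mean_on_induction - mean_on_text
print(torch.topk(induction_score, 20).indices)
```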