Comparing Anthropic's Dictionary Learning to Ours

Robert_AIZI

Readers may have noticed many similarities between Anthropic's recent publication Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (LW post) and my team's recent publication Sparse Autoencoders Find Highly Interpretable Directions in Language Models (LW post). Here I want to compare our techniques and highlight what we did similarly or differently. My hope in writing this is to help readers understand the similarities and differences, and perhaps to lay the groundwork for a future synthesis approach.

First, let me note that we arrived at similar techniques in similar ways: both Anthropic and my team follow the lead of Lee Sharkey, Dan Braun, and beren's [Interim research report] Taking features out of superposition with sparse autoencoders, though I don't know how directly Anthropic was inspired by that post. I believe both our teams were pleasantly surprised to find out the other one was working on similar lines, serving as a form of replication.

Some disclaimers: This list may be incomplete. I didn't give Anthropic a chance to give feedback on this, so I may have misrepresented some of their work, including by omission. Any mistakes are my own fault.

Target of Dictionary Learning/Sparse Autoencoding

A primary difference is that we looked for language model features in different parts of the model. My team trained our sparse autoencoder on the residual stream of a language model, whereas Anthropic trained on the activations in the MLP layer.

These objects have some fundamental differences. For instance, the residual stream is (potentially) almost completely linear whereas the MLP activations have just gotten activated, so their values will be positive-skewed. However, it's encouraging that this technique seems to work on both the MLP layer and residual stream. Additionally, my coauthor Logan Riggs successfully applied it to the output of the attention sublayer, so both in theory and in practice the dictionary learning approach seems to work well on each part of a language model.

Language Model Used

Another set of differences comes from which language model our teams used to train the autoencoders. My team used Pythia-70M and Pythia-410M, whereas Anthropic's language model was custom-trained for this study (I think). Some differences in the language model architectures:

Layers: The Pythia models have 6 or 24 layers. The Anthropic language model had just 1 layer.
Residual Stream Dimension: The Pythia models have 512- and 1024-dimensional residual streams. The Anthropic model has a 128-dimensional residual stream (but recall that they study their MLP layer, which has dimensionality 512).
Parameters: As implied by the names, the Pythia models have 70M and 410M parameters. Anthropic does not specify the number of parameters in the language model, but from the information they give we can estimate their model had ~13M parameters (see calculations here and explanation of my estimation method here). This estimate largely depends on assuming 1) they use untied embedding/unembedding weights and 2) they used a vocabulary size of ~50K (similar to Pythia and GPT-2/3), but this number isn't actually very important, because the more relevant statistic is...
Non-Embedding Parameters: For language models of this size, most of the parameters are used for the word embedding/unembedding matrix, so for scaling laws its more important to measure non-embedding parameters. In particular, the Pythia models have only 19M and 300M non-embedding parameters, and for Anthropic's language model has only ~200K non-embedding parameters.
Training Data: The Pythia models were trained on 300B tokens from The Pile, and Anthropic's was trained on 100B tokens also from The Pile. These are both significantly overtrained for this number of parameters per the scaling laws.
Parallelization: The Pythia models apply their attention and MLP sublayers in parallel at the same time. Anthropic's model follows the more conventional approach of applying the attention sublayer before the MLP sublayer.

Sparse Autoencoder Architecture

Similarities:

Both teams used an autoencoder with loss = (reconstruction loss)+(sparsity loss).
Both teams used a single hidden layer which was wider than the input/output layers.
Both teams used ReLU activations on the hidden layer.

The overview diagram of our approach from our paper. The middle column shows the sparse autoencoder architecture.

But some significant differences remain:

We use a bias term on just the encoding step, Anthropic has a bias term on both their encoding and decoding steps.
We use tied embeddings between our encoder and decoder, Anthropic uses untied embeddings.

In other words, we perform this calculation:

whereas Anthropic does this calculation:

$^x = V^{T} * R e L U (W * x + b) + c$

more precisely, Anthropic writes their calculation in these terms:
$\begin{matrix} ¯ x & = & x - b_{d} f & = & R e L U (W_{e} ¯ x + b_{e})^x & = & W_{d} f + b_{d} \end{matrix}$

which is equivalent to the above with $b = - W_{e} b_{d} + b_{e}$ etc.

Sparse Autoencoder Training

There are two main differences between how we trained our sparse autoencoders and how Anthropic trained theirs:

Size of training set: We trained our autoencoders for 10M tokens. Anthropic trained theirs for much longer, 8B tokens.
Dead Neurons and Reinitialization: A perennial problem with ReLU-activated neural networks is the presence of a dead neuron, i.e. a neuron that never activations and therefore does not get trained. Both teams encountered dead neurons, but Anthropic did something to address it: Anthropic resampled their dead neurons at checkpoints during training, while we did not.

Checking Success

[Epistemic status warning: I'm less sure I've fully capture Anthropic's work in this section.]

Finally, how did we decide the features were interpretable?

Both teams did "manual inspection", i.e., one of the coauthors was able to better interpret the features.
Both teams used the OpenAI autointerpretability technique, albeit with different models (we used GPT-3.5, Anthropic naturally used Claude 2).
Both teams present individual features with data showing that tokens activating the feature follow a simple description. Our paper's Figure 4 (left) shows that we have a feature activating primarily apostrophes, while Anthropic's gorgeous "Feature Activation Distribution" graphic shows a feature activating primarily on Arabic text.
Both teams measure how model predictions change by editing along a feature. In particular, in our Figure 4 (right), we show that editing the model activations to "turn off" the apostrophe feature decreases the model prediction of the "s" token, while Anthropic shows in their "Feature Downstream Effects" section that turning off the feature decreases the chance of predicting future tokens in Arabic, while "pinning" it to the on state makes the model produce text in Arabic.

Our team also performed these measures:

Checking for human-understandable connections between features across layers of the language model (see our Figure 5).
Measuring how many features need to be patched to change the model's behavior on a particular task (see our Section 4).

Anthropic also performed these measures:

Confirmation that features are not neurons (in several places, including here)
Checking for human-understandable connections between features across steps (token positions). For instance, in their "Finite State Automata" section they describe a feature activating on all-caps text and increasing predictions of underscores, and a second feature that activates on underscores and predicts all-caps text. These two features together aid the model in writing text in all-caps snake case.
A second form of automatic interpretability, in which Claude is shown the feature description and then asked "would this feature predict [token X] is likely to come next?"

Thanks to Logan and Aidan for feedback on an earlier draft of this post.

AI ALIGNMENT FORUM
AF