Readers may have noticed many similarities between Anthropic's recent publication Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (LW post) and my team's recent publication Sparse Autoencoders Find Highly Interpretable Directions in Language Models (LW post). Here I want to compare our techniques and highlight what we did similarly or differently, in the hope of helping readers understand how the two works relate and perhaps laying the groundwork for a future synthesis approach.
First, let me note that we arrived at similar techniques in similar ways: both Anthropic and my team follow the lead of Lee Sharkey, Dan Braun, and beren's [Interim research report] Taking features out of superposition with sparse autoencoders, though I don't know how directly Anthropic was inspired by that post. I believe both our teams were pleasantly surprised to find that the other was working along similar lines, which serves as a form of replication.
Some disclaimers: This list may be incomplete. I didn't give Anthropic a chance to give feedback on this, so I may have misrepresented some of their work, including by omission. Any mistakes are my own fault.
A primary difference is that we looked for language model features in different parts of the model. My team trained our sparse autoencoder on the residual stream of a language model, whereas Anthropic trained on the activations in the MLP layer.
These objects have some fundamental differences. For instance, the residual stream is (potentially) almost completely linear whereas the MLP activations have just gotten activated, so their values will be positive-skewed. However, it's encouraging that this technique seems to work on both the MLP layer and residual stream. Additionally, my coauthor Logan Riggs successfully applied it to the output of the attention sublayer, so both in theory and in practice the dictionary learning approach seems to work well on each part of a language model.
Another set of differences comes from which language model our teams used to train the autoencoders. My team used Pythia-70M and Pythia-410M, whereas Anthropic's language model was custom-trained for this study (I think). Some differences in the language model architectures:
But some significant differences remain:
In other words, we perform this calculation:
whereas Anthropic does this calculation:
More precisely, Anthropic writes their calculation in these terms:

$$\bar{x} = x - b_d$$

$$f = \mathrm{ReLU}(W_e \bar{x} + b_e)$$

$$\hat{x} = W_d f + b_d$$

which is equivalent to the above with $b = -W_e b_d + b_e$, etc.
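The bias equivalence is easy to check numerically. The following is a minimal sketch in plain Python, not either team's actual code: the symbols W_e, b_e, W_d, b_d follow the text, while the dimensions, the random values, and the helper names (`matvec`, `encode_centered`, `encode_folded`, `fold_bias`) are assumptions made purely for illustration.

```python
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def relu(v):
    """Elementwise ReLU."""
    return [max(0.0, a) for a in v]


def encode_centered(x, W_e, b_e, b_d):
    """f = ReLU(W_e (x - b_d) + b_e): subtract the decoder bias first."""
    x_bar = [x_i - b_i for x_i, b_i in zip(x, b_d)]
    return relu([p + b for p, b in zip(matvec(W_e, x_bar), b_e)])


def encode_folded(x, W_e, b):
    """f = ReLU(W_e x + b): a single encoder bias, no centering step."""
    return relu([p + b_i for p, b_i in zip(matvec(W_e, x), b)])


def fold_bias(W_e, b_e, b_d):
    """Folded bias b = -W_e b_d + b_e."""
    return [b_i - w for b_i, w in zip(b_e, matvec(W_e, b_d))]


def decode(f, W_d, b_d):
    """x_hat = W_d f + b_d."""
    return [r + b for r, b in zip(matvec(W_d, f), b_d)]


# Hypothetical toy sizes and random parameters, for illustration only.
rng = random.Random(0)
d_in, d_feat = 4, 6
W_e = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_feat)]
W_d = [[rng.gauss(0, 1) for _ in range(d_feat)] for _ in range(d_in)]
b_e = [rng.gauss(0, 1) for _ in range(d_feat)]
b_d = [rng.gauss(0, 1) for _ in range(d_in)]
x = [rng.gauss(0, 1) for _ in range(d_in)]

f_centered = encode_centered(x, W_e, b_e, b_d)
f_folded = encode_folded(x, W_e, fold_bias(W_e, b_e, b_d))

# Largest elementwise gap between the two encodings (floating-point noise only).
max_gap = max(abs(a - c) for a, c in zip(f_centered, f_folded))
x_hat = decode(f_centered, W_d, b_d)
```

Both encoders produce the same feature activations up to rounding, so the centered form is a reparameterization of the encoder bias rather than a different function of the input.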
There are two main differences between how we trained our sparse autoencoders and how Anthropic trained theirs:
[Epistemic status warning: I'm less sure I've fully captured Anthropic's work in this section.]
Finally, how did we decide the features were interpretable?
Our team also performed these measures:
Anthropic also performed these measures:
Thanks to Logan and Aidan for feedback on an earlier draft of this post.
This is cool! These sparse features should be easily "extractable" by the transformer's key, query, and value weights in a single layer. Therefore, I'm wondering if these weights can somehow make it easier to "discover" the sparse features?