A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Commenters: Neel Nanda, leogao, Jason Gross, Mateusz Bagiński, Lee Sharkey

> Therefore, many project ideas in that list aren’t an up-to-date reflection of what some researchers consider the frontiers of mech interp.

Can confirm, that list is SO out of date and does not represent the current frontiers. Zero offence taken. Thanks for publishing this list!

Some takes on some of these research questions:

> Looking for opposing feature directions in SAEs

I checked a top-k SAE with 256k features and k=256 trained on GPT-4 and found only 286 features that had any other feature with cosine similarity < -0.9, and 1314 with cosine sim < -0.7.
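A minimal sketch of the kind of check described above, assuming the SAE's decoder directions are the rows of a matrix (at 256k features the full similarity matrix is large, so one would chunk it in practice; the helper name is mine):

```python
import numpy as np

def count_opposing_features(W_dec: np.ndarray, threshold: float = -0.7) -> int:
    """Count decoder features whose direction has cosine similarity below
    `threshold` with at least one *other* feature direction.

    W_dec: (n_features, d_model) SAE decoder matrix, one direction per row.
    """
    # Normalize rows so dot products are cosine similarities.
    U = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    sims = U @ U.T
    np.fill_diagonal(sims, 0.0)  # ignore self-similarity
    return int(np.sum(sims.min(axis=1) < threshold))

# Tiny illustration: two near-opposite directions plus one orthogonal one.
W = np.array([[1.0, 0.0], [-1.0, 0.01], [0.0, 1.0]])
print(count_opposing_features(W))  # -> 2 (the first two features oppose each other)
```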

> SAE/Transcoder activation shuffling

I'm confident that when learning rate and batch size are tuned properly, not shuffling eventually converges to the same thing as shuffling. The right way to frame this imo is the efficiency loss from not shuffling, which from preliminary experiments+intuition I'd guess is probably substantial.
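For concreteness, shuffling here means mixing token activations across contexts before batching, rather than feeding whole contexts in order. A minimal sketch of such a shuffle buffer (buffer and batch sizes are toy values):

```python
import numpy as np

def shuffled_batches(activation_stream, buffer_size=8, batch_size=4, seed=0):
    """Yield SAE training batches whose rows are shuffled across contexts.

    activation_stream: iterable of (context_len, d_model) arrays, one per context.
    Tokens from different contexts are mixed in a buffer before batching,
    breaking the within-context correlations that unshuffled training keeps.
    """
    rng = np.random.default_rng(seed)
    buffer = []
    for ctx_acts in activation_stream:
        buffer.extend(ctx_acts)          # flatten this context's tokens in
        while len(buffer) >= buffer_size:
            rng.shuffle(buffer)          # mix tokens across contexts
            batch, buffer = buffer[:batch_size], buffer[batch_size:]
            yield np.stack(batch)
    # Leftover tokens could be flushed here; omitted for brevity.

# Example: two contexts of 4 tokens each, d_model = 2.
stream = [np.ones((4, 2)) * i for i in range(2)]
batches = list(shuffled_batches(stream))
```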

> How much does initializing the encoder to be the transpose of the decoder (as done here and here) help for SAEs and transcoders?

It helps tremendously for SAEs by very substantially reducing dead latents; see appendix C.1 in our paper.
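A minimal sketch of the initialization scheme in question (unit-norm decoder rows are a common choice here but a detail I'm assuming; the weights are only tied at init and train independently afterwards):

```python
import numpy as np

def init_sae_weights(d_model: int, n_features: int, seed: int = 0):
    """Initialize an SAE with the encoder set to the decoder transpose.

    The tying happens only at initialization; during training the encoder
    and decoder are updated independently.
    """
    rng = np.random.default_rng(seed)
    W_dec = rng.normal(size=(n_features, d_model))
    W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm directions
    W_enc = W_dec.T.copy()   # transpose init, reported to reduce dead latents
    b_enc = np.zeros(n_features)
    b_dec = np.zeros(d_model)
    return W_enc, W_dec, b_enc, b_dec

W_enc, W_dec, b_enc, b_dec = init_sae_weights(d_model=16, n_features=64)
```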

> [Lucius] Identify better SAE sparsity penalties by reasoning about the distribution of feature activations
>
> - In sparse coding, one can derive what prior over encoded variables a particular sparsity penalty corresponds to. E.g. an L1 penalty assumes a Laplacian prior over feature activations, while a log(1+a^2) penalty would assume a Cauchy prior. Can we figure out what distribution of feature activations over the data we’d expect, and use this to derive a better sparsity penalty that improves SAE quality?

This is very interesting! What prior does log(1+|a|) correspond to? And what about using instead of ? Does this only hold if we expect feature activations to be independent (rather than, say, mutually exclusive)?
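For reference, the penalty-prior correspondence the quoted idea relies on comes from viewing sparse coding as MAP inference: the sparsity penalty is the negative log-prior over activations, up to constants. A sketch (normalization constants dropped):

```latex
% Sparsity penalty as negative log-prior:
S(a) = -\log p(a) + \mathrm{const}

% L1 penalty <=> Laplace prior:
p(a) = \tfrac{\lambda}{2}\, e^{-\lambda |a|}
  \;\Longrightarrow\; -\log p(a) = \lambda |a| + \mathrm{const}

% log(1 + a^2) penalty <=> (standard) Cauchy prior:
p(a) = \frac{1}{\pi\,(1 + a^2)}
  \;\Longrightarrow\; -\log p(a) = \log(1 + a^2) + \mathrm{const}

% By the same reading, log(1 + |a|) corresponds to p(a) \propto 1/(1 + |a|),
% which is not normalizable over the reals, i.e. an improper prior.
```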

> [Nix] Toy model of feature splitting
>
> - There are at least two explanations for feature splitting I find plausible:
>   - Activations exist in higher-dimensional manifolds in feature space; feature splitting is a symptom of one higher-dimensional, mostly-continuous feature being chunked into discrete features at different resolutions.
>   - There is a finite number of highly related discrete features that activate on similar (but not identical) inputs and cause similar (but not identical) output actions. These can be summarized as a single feature with reasonable explained variance, but are better summarized as a collection of “split” features.

These do not sound like different explanations to me. In particular, the distinction between "mostly-continuous but approximated as discrete" and "discrete but very similar" seems ill-formed. All features are in fact discrete (because floating point numbers are discrete) and approximately continuous (because we posit that replacing floats with reals won't change the behavior of the network meaningfully).

As far as toy models go, I'm pretty confident that the max-of-K setup from Compact Proofs of Model Performance via Mechanistic Interpretability will be a decent toy model. If you train SAEs post-unembed (probably also pre-unembed) with width d_vocab, you should find one feature for each sequence maximum (roughly). If you train with SAE width , I expect each feature to split into roughly features corresponding to the choice of query token, largest non-max token, and the number of copies of the maximum token. (How the SAE training data is distributed will change what exact features (principal directions of variation) are important to learn.). I'm quite interested in chatting with anyone working on / interested in this, and I expect my MATS scholar will get to testing this within the next month or two.
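A minimal data-generation sketch of the max-of-K task as I read it (sample sizes and vocabulary are hypothetical toy values): the model sees K random tokens and must predict the sequence maximum, so SAE features trained near the unembed would be expected to track that maximum.

```python
import numpy as np

def make_max_of_k_batch(n_samples: int, k: int, d_vocab: int, seed: int = 0):
    """Generate max-of-K examples: k uniformly random tokens per sequence,
    labeled with the sequence maximum."""
    rng = np.random.default_rng(seed)
    tokens = rng.integers(0, d_vocab, size=(n_samples, k))
    labels = tokens.max(axis=1)
    return tokens, labels

tokens, labels = make_max_of_k_batch(n_samples=32, k=4, d_vocab=10)
```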

Edit: I expect this toy model will also permit exploring:

> [Lee] Is there structure in feature splitting?
>
> - Suppose we have a trained SAE with N features. If we apply e.g. NMF or SAEs to these directions, are there directions that explain the structure of the splitting? As in, suppose we have a feature for math and a feature for physics, and suppose these split into (among other things):
>   - 'topology in a math context'
>   - 'topology in a physics context'
>   - 'high dimensions in a math context'
>   - 'high dimensions in a physics context'
> - Is the topology-ifying direction the same for both features? Is the high-dimension-ifying direction the same for both features? And if so, why did/didn't the original SAEs find these directions?

I predict that whether or not the SAE finds the splitting directions depends on details about how much non-sparsity is penalized and how wide the SAE is. Given enough capacity, the SAE benefits (sparsity-wise) from replacing the (topology, math, physics) features with (topology-in-math, topology-in-physics), because split features activate more sparsely. Conversely, if the sparsity penalty is strong enough and there is not enough capacity to split, the loss recovered from having a topology feature at all (on top of the math/physics feature) may not outweigh the cost in sparsity.
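The sparsity accounting in that argument can be made concrete with a toy calculation (the per-input L0 counts below are hypothetical, assuming four equally likely input types):

```python
# Unsplit dictionary {topology, math, physics}: a "topology in math" input
# needs both the topology and the math feature active.
# Input order: math+topo, physics+topo, math-only, physics-only.
l0_unsplit = [2, 2, 1, 1]

# Split dictionary {topology-in-math, topology-in-physics, math, physics}:
# each combined input is reconstructed by a single split feature.
l0_split = [1, 1, 1, 1]

avg_unsplit = sum(l0_unsplit) / len(l0_unsplit)  # 1.5 features per input
avg_split = sum(l0_split) / len(l0_split)        # 1.0 features per input
# Splitting buys lower average L0 at the cost of one extra dictionary feature,
# which is why capacity and the sparsity coefficient decide whether it happens.
```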

Why we made this list:^{[1]}

In order to decide what we’d work on next, we generated a lot of different potential projects. Unfortunately, we are computationally bounded agents, so we can't work on every project idea that we were excited about!

Lists of projects (e.g. 200 Concrete Open Problems in Mechanistic Interpretability) have been very useful for people breaking into the field. But for all its merits, that list is now over a year and a half old. Therefore, many project ideas in that list aren’t an up-to-date reflection of what some researchers consider the frontiers of mech interp. We therefore thought it would be helpful to share our list of project ideas!

Comments and caveats: We hope some people find this list helpful!

We would love to see people working on these! If any sound interesting to you and you'd like to chat about it, don't hesitate to reach out.

## Foundational work on sparse dictionary learning for interpretability

## Transcoder-related project ideas
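As background for the ideas below, a minimal sketch of what a transcoder is: like an SAE, but trained to map an MLP sublayer's *input* to the sublayer's *output* through a wide sparse bottleneck, rather than to reconstruct its input (shapes, init scale, and the ReLU/L1 choices here are illustrative, not from any released implementation):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class Transcoder:
    """Toy transcoder: sparse dictionary that imitates an MLP sublayer."""

    def __init__(self, d_model, n_features, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(size=(d_model, n_features)) * 0.02
        self.b_enc = np.zeros(n_features)
        self.W_dec = rng.normal(size=(n_features, d_model)) * 0.02
        self.b_dec = np.zeros(d_model)

    def forward(self, x):
        acts = relu(x @ self.W_enc + self.b_enc)    # sparse feature activations
        return acts @ self.W_dec + self.b_dec, acts

    def loss(self, x, mlp_out, l1_coef=1e-3):
        pred, acts = self.forward(x)
        mse = np.mean((pred - mlp_out) ** 2)        # match the MLP *output*
        return mse + l1_coef * np.abs(acts).mean()  # sparsity on activations

tc = Transcoder(d_model=8, n_features=32)
x = np.random.default_rng(1).normal(size=(4, 8))
out, acts = tc.forward(x)
```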

(See [2406.11944] Transcoders Find Interpretable LLM Feature Circuits.)

- [Nix] Training and releasing high quality transcoders.
- [Nix] Good tooling for using transcoders (Dunefsky et al).
- [Nix] Further circuit analysis using transcoders (contact: nix@apolloresearch.ai).
- [Nix, Lee] Cross layer superposition
- [Lucius] Improving transcoder architectures

## Other

- [Nix] Idea for improved logit-lens-style interpretation of SAE features (e.g. Joseph Bloom’s GPT2 SAEs)
  - Like Understanding SAE Features with the Logit Lens, but in the pre-unembed basis instead of the token basis. See also Interpreting the Second-Order Effects of Neurons in CLIP.
- [Nix] Toy model of feature splitting
- [Dan] Looking for opposing feature directions in SAEs
- [Dan] SAE/Transcoder activation shuffling
  - Relevant for e2eSAEs, which do not shuffle activations during training, as they need to pass the entire context-length activations through to subsequent layers. Can you get away with just having a larger effective batch size and higher learning rate? Note that I think this is equally (if not more) important to analyze for transcoders.
- [Dan] SAE/Transcoder initialization
  - How much does initializing the encoder to be the transpose of the decoder (as done here and here) help for SAEs and transcoders?
- [Dan] Make public benchmarks for SAEs and transcoders.
  - These could be hosted on Neuronpedia, which I deem to be a great place to host such a service.
- [Lee] Mixture of Expert SAEs
  - See here. This is great! The more efficient we can make SDL the better. But this only speeds up inference of the decoder. I think MOEs may be a way to speed up inference of the encoder.
  - Check whether trained SAEs can be MOEified post hoc. If they can be, then it's evidence in the direction that MOEs might be reasonable to use during training from scratch.
- [Lee] Identify canonical features that emerge in language models
- [Lee] Studying generalization of SAEs and transcoders.
- [Lee] How does layer norm affect SAE features before and after?
- [Lee] Connecting SAE/transcoder features to polytopes
  - A drawback of the polytope lens was that it used clustering methods in order to group polytopes together. This means the components of the explanations they provided were not ‘composable’. We want to be able to break down polytopes into components that are composable.
- [Stefan] Verify (SAE) features based on the model weights; show that features are a model property and not (only) a dataset property.
  - Some ideas here (e.g. “given two sets of directions that reconstruct an activation, can you tell which one are the features vs a made-up set of directions?”); one possible methodology described here (in-progress LASR project).
- [Stefan] Relationship between Feature Splitting, Feature Completeness, and atomic vs. composite features (see here).
- [Lee] Is there structure in feature splitting?
- [Lucius] Understanding the geometry of SAE features (see this paper).
- [Lucius] Identify better SAE sparsity penalties by reasoning about the distribution of feature activations
  - In sparse coding, one can derive what prior over encoded variables a particular sparsity penalty corresponds to. E.g. an L1 penalty assumes a Laplacian prior over feature activations, while a log(1+a^2) penalty would assume a Cauchy prior. Can we figure out what distribution of feature activations over the data we’d expect, and use this to derive a better sparsity penalty that improves SAE quality?
- [Lucius] Preprocessing activations with the interaction basis prior to SAE training
  - One approach is to train SAEs end-to-end. But another way to solve it might be to preprocess the network activations before applying the SAEs to them. The activations could be rotated and rescaled such that the variance of the hidden activations along any axis is proportional to its importance for computing the final network outputs. The interaction basis is a linear coordinate transformation for the hidden activations of neural networks that attempts to achieve just that. So transforming activations into the interaction basis before applying SAEs to them might yield a Pareto improvement in SAE quality.
- [Lucius] Using attribution sparsity penalties to improve end-to-end SAEs
  - For end-to-end dictionary learning, a sparsity penalty based on attributions might be more appropriate than a sparsity penalty based on dictionary activations: in end-to-end SAEs, the reconstruction loss cares about the final network output, but the sparsity term still cares about the activations in the hidden layer, like a conventional SAE. This is perhaps something of a mismatch. For example, if a feature is often present in the residual stream, but comparatively rarely used in the computation, the end-to-end SAE will be disinclined to represent it, because it only decreases the reconstruction loss a little, but increases the sparsity loss by a lot. More generally, how large a feature activation is just won't be that great a correlate of how important it is for reconstructing the output. So if we care about how many features we need per data point to get good output reconstruction, SAEs trained with an attribution sparsity penalty might beat SAEs trained with an activation sparsity penalty.
  - Anthropic's proposed attribution sparsity penalty uses attributions of the LLM loss. I suspect this is inappropriate, since the gradient of the LLM loss is zero at optima, meaning feature attributions will be scaled down the better the LLM does on a specific input. Something like an MSE average over attributions to all of the network’s output logits might be more appropriate. This is expensive, but an approximation of the average using stochastic sources might suffice. See e.g. Appendix C here for an introduction to stochastic source methods. In our experiments on the Toy Model of Superposition, a single stochastic source proved to be sufficient, making this potentially no more computationally intensive than the Anthropic proposal.

## Applied interpretability

- [Lee] Apply SAEs/transcoders to a small conv net (e.g. AlexNet) and study it in depth.
- [Lee] Figure out how to build interpretability interfaces for video/other modalities.
  - A strength of Ellena Reid’s project was that it developed a way to ‘visualize’ what neurons were activating for in audio samples. Can we improve on this method? Can we do the same for video models? What about other modalities, such as, e.g., smell, or, I don’t know, protein structure? Is there a modality-general approach for this?
- [Lee] Apply SAEs and transcoders to WhisperV2 (i.e. continue Ellena Reid’s work).
- [Lee] Identify whether or not, in a very small backdoored model, we can detect the backdoor using e.g. e2eSAEs.
- [Lee] Interpreting Mamba/SSMs using sparse dictionary learning.
- [Lee] Characterizing the geometry of low-level vision SAE features.
- [Lee] Can we understand the first sequence index of a small transformer?
- [Lucius] Attempt to understand a toy LM completely.
- [Stefan] Understand a small model (e.g. TinyStories-2L or a small TinyModel variant) from start to end, from first to last layer.

## Intrinsic interpretability

- [Lee] Can we train a small bilinear transformer on either a toy or real dataset, perform sparse dictionary learning on its activations, and understand the role of each sparse dictionary feature in terms of the closed-form solution (cf. Sharkey 2023)? This may help in identifying fundamental structures within transformers in a similar way that induction heads were discovered.
- [Lee] Interpretable inference: Can we convert already-trained models into forms that are much easier to completely interpret at little performance cost?
- [Lee] Develop A Mathematical Framework for Linear Attention Transformer Circuits.

## Understanding features (not SDL)

- [Lucius] Recovering ‘features’ through direct optimisation for auto-interpretability scores

## Theoretical foundations for interpretability

Singular-learning-theory-related:

- [Lucius] Understanding SLT at finite data/precision (see here). Is this a good approximation?
- [Lucius] Bounding the local learning coefficient (LLC) in real networks, exploiting degeneracy in the loss landscape to decompose LLMs into more interpretable parts.
- [Lucius] Understanding the relationship between the local learning coefficient (LLC) and the behavioral LLC (see here). This is a more restrictive definition since different network outputs can yield the same loss. The LLC of the behavioral loss is thus an upper bound for the LLC of the training loss. The LLC of the behavioral loss is well-defined everywhere in the loss landscape, making it potentially more useful for characterizing the complexity of neural networks at every point in training. However, the behavioral LLC is currently less well understood than the LLC. For example, it is less clearly related to network generalization ability (aside from upper bounding the LLC).

## Other

- [Lucius] Extending the current framework for computation in superposition from boolean variables to floating-point numbers or real numbers.
- [Lucius] Bounding the sparsity of LLM representations.
- [Lucius] Relating superposition to the loss landscape.

## Meta-research and philosophy

- [Lee] Write up reviews/short posts on the links between various concepts in comp neuro and mech interp, and between philosophy of science and mech interp.
- [Lee] What is a feature? What terms should we really be using here? What assumptions do these concepts make? Where does it lead when we take these assumptions to their natural conclusions?
- [Lucius] Should we expect some or many of the ‘features’ in current neural networks to be natural latents?

## Engineering

- [Dan] Create a new, high-quality tinystories dataset and model suite (credit to Noa Nabeshima for the idea).
  - The existing tinystories dataset is very formulaic, small, and has unusual unicode characters in it. Addressing these issues, and training a small model suite on this new dataset, would be very valuable.
  - Noa has already made progress by cleaning up the existing tinystories dataset and training a 4-layer model without layernorm on the clean dataset (it also comes with SAEs and transcoders trained on it). Reach out to Noa (noanabeshima@gmail.com) and/or me (dan@apolloresearch.ai) if interested in taking this on. Subsidies for compute credits for dataset generation and model training may be available.

^{[1]} Papers from our first project here and here, and from our second project here.