A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Commenters: Neel Nanda, leogao, Jason Gross, Mateusz Bagiński, Lee Sharkey

> Therefore, many project ideas in that list aren’t an up-to-date reflection of what some researchers consider the frontiers of mech interp.

Can confirm, that list is SO out of date and does not represent the current frontiers. Zero offence taken. Thanks for publishing this list!

Some takes on some of these research questions:

> Looking for opposing feature directions in SAEs

I checked a top-k SAE with 256k features and k=256 trained on GPT-4 and found only 286 features that had any other feature with cosine similarity < -0.9, and 1314 with cosine sim < -0.7.
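A minimal sketch of the kind of check described above, assuming the SAE's decoder directions are the rows of a matrix (at 256k features the full similarity matrix is large, so one would chunk it in practice; the helper name is mine):

```python
import numpy as np

def count_opposing_features(W_dec: np.ndarray, threshold: float = -0.7) -> int:
    """Count decoder features whose direction has cosine similarity below
    `threshold` with at least one *other* feature direction.

    W_dec: (n_features, d_model) SAE decoder matrix, one direction per row.
    """
    # Normalize rows so dot products are cosine similarities.
    U = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    sims = U @ U.T
    np.fill_diagonal(sims, 0.0)  # ignore self-similarity
    return int(np.sum(sims.min(axis=1) < threshold))

# Tiny illustration: two near-opposite directions plus one orthogonal one.
W = np.array([[1.0, 0.0], [-1.0, 0.01], [0.0, 1.0]])
print(count_opposing_features(W))  # -> 2 (the first two features oppose each other)
```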

> SAE/Transcoder activation shuffling

I'm confident that when learning rate and batch size are tuned properly, not shuffling eventually converges to the same thing as shuffling. The right way to frame this imo is the efficiency loss from not shuffling, which from preliminary experiments+intuition I'd guess is probably substantial.
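For concreteness, shuffling here means mixing token activations across contexts before batching, rather than feeding whole contexts in order. A minimal sketch of such a shuffle buffer (buffer and batch sizes are toy values):

```python
import numpy as np

def shuffled_batches(activation_stream, buffer_size=8, batch_size=4, seed=0):
    """Yield SAE training batches whose rows are shuffled across contexts.

    activation_stream: iterable of (context_len, d_model) arrays, one per context.
    Tokens from different contexts are mixed in a buffer before batching,
    breaking the within-context correlations that unshuffled training keeps.
    """
    rng = np.random.default_rng(seed)
    buffer = []
    for ctx_acts in activation_stream:
        buffer.extend(ctx_acts)          # flatten this context's tokens in
        while len(buffer) >= buffer_size:
            rng.shuffle(buffer)          # mix tokens across contexts
            batch, buffer = buffer[:batch_size], buffer[batch_size:]
            yield np.stack(batch)
    # Leftover tokens could be flushed here; omitted for brevity.

# Example: two contexts of 4 tokens each, d_model = 2.
stream = [np.ones((4, 2)) * i for i in range(2)]
batches = list(shuffled_batches(stream))
```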

> How much does initializing the encoder to be the transpose of the decoder (as done here and here) help for SAEs and transcoders?

It helps tremendously for SAEs by very substantially reducing dead latents; see appendix C.1 in our paper.
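A minimal sketch of the initialization scheme in question (unit-norm decoder rows are a common choice here but a detail I'm assuming; the weights are only tied at init and train independently afterwards):

```python
import numpy as np

def init_sae_weights(d_model: int, n_features: int, seed: int = 0):
    """Initialize an SAE with the encoder set to the decoder transpose.

    The tying happens only at initialization; during training the encoder
    and decoder are updated independently.
    """
    rng = np.random.default_rng(seed)
    W_dec = rng.normal(size=(n_features, d_model))
    W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm directions
    W_enc = W_dec.T.copy()   # transpose init, reported to reduce dead latents
    b_enc = np.zeros(n_features)
    b_dec = np.zeros(d_model)
    return W_enc, W_dec, b_enc, b_dec

W_enc, W_dec, b_enc, b_dec = init_sae_weights(d_model=16, n_features=64)
```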

> [Lucius] Identify better SAE sparsity penalties by reasoning about the distribution of feature activations
>
> - In sparse coding, one can derive what prior over encoded variables a particular sparsity penalty corresponds to. E.g. an L1 penalty assumes a Laplacian prior over feature activations, while a log(1+a^2) penalty would assume a Cauchy prior. Can we figure out what distribution of feature activations over the data we’d expect, and use this to derive a better sparsity penalty that improves SAE quality?

This is very interesting! What prior does log(1+|a|) correspond to? And what about using instead of ? Does this only hold if we expect feature activations to be independent (rather than, say, mutually exclusive)?
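For reference, the penalty-prior correspondence the quoted idea relies on comes from viewing sparse coding as MAP inference: the sparsity penalty is the negative log-prior over activations, up to constants. A sketch (normalization constants dropped):

```latex
% Sparsity penalty as negative log-prior:
S(a) = -\log p(a) + \mathrm{const}

% L1 penalty <=> Laplace prior:
p(a) = \tfrac{\lambda}{2}\, e^{-\lambda |a|}
  \;\Longrightarrow\; -\log p(a) = \lambda |a| + \mathrm{const}

% log(1 + a^2) penalty <=> (standard) Cauchy prior:
p(a) = \frac{1}{\pi\,(1 + a^2)}
  \;\Longrightarrow\; -\log p(a) = \log(1 + a^2) + \mathrm{const}

% By the same reading, log(1 + |a|) corresponds to p(a) \propto 1/(1 + |a|),
% which is not normalizable over the reals, i.e. an improper prior.
```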

> [Nix] Toy model of feature splitting
>
> - There are at least two explanations for feature splitting I find plausible:
>   - Activations exist in higher-dimensional manifolds in feature space; feature splitting is a symptom of one higher-dimensional, mostly-continuous feature being chunked into discrete features at different resolutions.
>   - There is a finite number of highly related discrete features that activate on similar (but not identical) inputs and cause similar (but not identical) output actions. These can be summarized as a single feature with reasonable explained variance, but are better summarized as a collection of “split” features.

These do not sound like different explanations to me. In particular, the distinction between "mostly-continuous but approximated as discrete" and "discrete but very similar" seems ill-formed. All features are in fact discrete (because floating point numbers are discrete) and approximately continuous (because we posit that replacing floats with reals won't change the behavior of the network meaningfully).

As far as toy models go, I'm pretty confident that the max-of-K setup from Compact Proofs of Model Performance via Mechanistic Interpretability will be a decent toy model. If you train SAEs post-unembed (probably also pre-unembed) with width d_vocab, you should find one feature for each sequence maximum (roughly). If you train with SAE width , I expect each feature to split into roughly features corresponding to the choice of query token, largest non-max token, and the number of copies of the maximum token. (How the SAE training data is distributed will change what exact features (principal directions of variation) are important to learn.). I'm quite interested in chatting with anyone working on / interested in this, and I expect my MATS scholar will get to testing this within the next month or two.
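A minimal data-generation sketch of the max-of-K task as I read it (sample sizes and vocabulary are hypothetical toy values): the model sees K random tokens and must predict the sequence maximum, so SAE features trained near the unembed would be expected to track that maximum.

```python
import numpy as np

def make_max_of_k_batch(n_samples: int, k: int, d_vocab: int, seed: int = 0):
    """Generate max-of-K examples: k uniformly random tokens per sequence,
    labeled with the sequence maximum."""
    rng = np.random.default_rng(seed)
    tokens = rng.integers(0, d_vocab, size=(n_samples, k))
    labels = tokens.max(axis=1)
    return tokens, labels

tokens, labels = make_max_of_k_batch(n_samples=32, k=4, d_vocab=10)
```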

Edit: I expect this toy model will also permit exploring:

> [Lee] Is there structure in feature splitting?
>
> - Suppose we have a trained SAE with N features. If we apply e.g. NMF or SAEs to these directions, are there directions that explain the structure of the splitting? As in, suppose we have a feature for math and a feature for physics, and suppose these split into (among other things):
>   - 'topology in a math context'
>   - 'topology in a physics context'
>   - 'high dimensions in a math context'
>   - 'high dimensions in a physics context'
> - Is the topology-ifying direction the same for both features? Is the high-dimension-ifying direction the same for both features? And if so, why did/didn't the original SAEs find these directions?

I predict that whether or not the SAE finds the splitting directions depends on details about how much non-sparsity is penalized and how wide the SAE is. Given enough capacity, the SAE benefits (sparsity-wise) from replacing the (topology, math, physics) features with (topology-in-math, topology-in-physics), because split features activate more sparsely. Conversely, if the sparsity penalty is strong enough and there is not enough capacity to split, the loss recovered from having a topology feature at all (on top of the math/physics feature) may not outweigh the cost in sparsity.
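The sparsity accounting in that argument can be made concrete with a toy calculation (the per-input L0 counts below are hypothetical, assuming four equally likely input types):

```python
# Unsplit dictionary {topology, math, physics}: a "topology in math" input
# needs both the topology and the math feature active.
# Input order: math+topo, physics+topo, math-only, physics-only.
l0_unsplit = [2, 2, 1, 1]

# Split dictionary {topology-in-math, topology-in-physics, math, physics}:
# each combined input is reconstructed by a single split feature.
l0_split = [1, 1, 1, 1]

avg_unsplit = sum(l0_unsplit) / len(l0_unsplit)  # 1.5 features per input
avg_split = sum(l0_split) / len(l0_split)        # 1.0 features per input
# Splitting buys lower average L0 at the cost of one extra dictionary feature,
# which is why capacity and the sparsity coefficient decide whether it happens.
```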

Why we made this list:^{[1]}

In order to decide what we’d work on next, we generated a lot of different potential projects. Unfortunately, we are computationally bounded agents, so we can't work on every project idea that we were excited about!

Lists of projects (e.g. 200 Concrete Open Problems in Mechanistic Interpretability) have been very useful for people breaking into the field. But for all its merits, that list is now over a year and a half old. Therefore, many project ideas in that list aren’t an up-to-date reflection of what some researchers consider the frontiers of mech interp. We therefore thought it would be helpful to share our list of project ideas!

Comments and caveats: We hope some people find this list helpful!

We would love to see people working on these! If any sound interesting to you and you'd like to chat about it, don't hesitate to reach out.

## Foundational work on sparse dictionary learning for interpretability

## Transcoder-related project ideas
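As background for the ideas below, a minimal sketch of what a transcoder is: like an SAE, but trained to map an MLP sublayer's *input* to the sublayer's *output* through a wide sparse bottleneck, rather than to reconstruct its input (shapes, init scale, and the ReLU/L1 choices here are illustrative, not from any released implementation):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class Transcoder:
    """Toy transcoder: sparse dictionary that imitates an MLP sublayer."""

    def __init__(self, d_model, n_features, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(size=(d_model, n_features)) * 0.02
        self.b_enc = np.zeros(n_features)
        self.W_dec = rng.normal(size=(n_features, d_model)) * 0.02
        self.b_dec = np.zeros(d_model)

    def forward(self, x):
        acts = relu(x @ self.W_enc + self.b_enc)    # sparse feature activations
        return acts @ self.W_dec + self.b_dec, acts

    def loss(self, x, mlp_out, l1_coef=1e-3):
        pred, acts = self.forward(x)
        mse = np.mean((pred - mlp_out) ** 2)        # match the MLP *output*
        return mse + l1_coef * np.abs(acts).mean()  # sparsity on activations

tc = Transcoder(d_model=8, n_features=32)
x = np.random.default_rng(1).normal(size=(4, 8))
out, acts = tc.forward(x)
```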

(See [2406.11944] Transcoders Find Interpretable LLM Feature Circuits.)

- [Nix] Training and releasing high quality transcoders.
- [Nix] Good tooling for using transcoders (Dunefsky et al).
- [Nix] Further circuit analysis using transcoders (contact: nix@apolloresearch.ai).
- [Nix, Lee] Cross layer superposition
- [Lucius] Improving transcoder architectures

## Other

- [Nix] Idea for improved logit-lens-style interpretation of SAE features (e.g. Joseph Bloom’s GPT2 SAEs)
  - Like Understanding SAE Features with the Logit Lens, but in the pre-unembed basis instead of the token basis. See also Interpreting the Second-Order Effects of Neurons in CLIP.
- [Nix] Toy model of feature splitting
- [Dan] Looking for opposing feature directions in SAEs
- [Dan] SAE/Transcoder activation shuffling
  - Relevant for e2eSAEs, which do not shuffle activations during training, as they need to pass the entire context-length activations through to subsequent layers. Can you get away with just having a larger effective batch size and higher learning rate? Note that I think this is equally (if not more) important to analyze for transcoders.
- [Dan] SAE/Transcoder initialization
  - How much does initializing the encoder to be the transpose of the decoder (as done here and here) help for SAEs and transcoders?
- [Dan] Make public benchmarks for SAEs and transcoders.
  - These could be hosted on Neuronpedia, which I deem to be a great place to host such a service.
- [Lee] Mixture of Expert SAEs
  - See here. This is great! The more efficient we can make SDL the better. But this only speeds up inference of the decoder. I think MOEs may be a way to speed up inference of the encoder.
  - Check whether trained SAEs can be MOEified post hoc. If they can be, then it's evidence in the direction that MOEs might be reasonable to use during training from scratch.
- [Lee] Identify canonical features that emerge in language models
- [Lee] Studying generalization of SAEs and transcoders.
- [Lee] How does layer norm affect SAE features before and after?
- [Lee] Connecting SAE/transcoder features to polytopes
  - A drawback of the polytope lens was that it used clustering methods in order to group polytopes together. This means the components of the explanations they provided were not ‘composable’. We want to be able to break down polytopes into components that are composable.
- [Stefan] Verify (SAE) features based on the model weights; show that features are a model property and not (only) a dataset property.
  - Some ideas here (e.g. “given two sets of directions that reconstruct an activation, can you tell which one are the features vs a made-up set of directions?”); one possible methodology described here (in-progress LASR project).
- [Stefan] Relationship between Feature Splitting, Feature Completeness, and atomic vs. composite features (see here).
- [Lee] Is there structure in feature splitting?
- [Lucius] Understanding the geometry of SAE features (see this paper).
- [Lucius] Identify better SAE sparsity penalties by reasoning about the distribution of feature activations
  - In sparse coding, one can derive what prior over encoded variables a particular sparsity penalty corresponds to. E.g. an L1 penalty assumes a Laplacian prior over feature activations, while a log(1+a^2) penalty would assume a Cauchy prior. Can we figure out what distribution of feature activations over the data we’d expect, and use this to derive a better sparsity penalty that improves SAE quality?
- [Lucius] Preprocessing activations with the interaction basis prior to SAE training
  - One approach is to train SAEs end-to-end. But another way to solve it might be to preprocess the network activations before applying the SAEs to them. The activations could be rotated and rescaled such that the variance of the hidden activations along any axis is proportional to its importance for computing the final network outputs. The interaction basis is a linear coordinate transformation for the hidden activations of neural networks that attempts to achieve just that. So transforming activations into the interaction basis before applying SAEs to them might yield a Pareto improvement in SAE quality.
- [Lucius] Using attribution sparsity penalties to improve end-to-end SAEs
  - For end-to-end dictionary learning, a sparsity penalty based on attributions might be more appropriate than a sparsity penalty based on dictionary activations: in end-to-end SAEs, the reconstruction loss cares about the final network output, but the sparsity term still cares about the activations in the hidden layer, like a conventional SAE. This is perhaps something of a mismatch. For example, if a feature is often present in the residual stream, but comparatively rarely used in the computation, the end-to-end SAE will be disinclined to represent it, because it only decreases the reconstruction loss a little, but increases the sparsity loss by a lot. More generally, how large a feature activation is just won't be that great a correlate of how important it is for reconstructing the output. So if we care about how many features we need per data point to get good output reconstruction, SAEs trained with an attribution sparsity penalty might beat SAEs trained with an activation sparsity penalty.
  - Anthropic's proposed attribution sparsity penalty uses attributions of the LLM loss. I suspect this is inappropriate, since the gradient of the LLM loss is zero at optima, meaning feature attributions will be scaled down the better the LLM does on a specific input. Something like an MSE average over attributions to all of the network’s output logits might be more appropriate. This is expensive, but an approximation of the average using stochastic sources might suffice. See e.g. Appendix C here for an introduction to stochastic source methods. In our experiments on the Toy Model of Superposition, a single stochastic source proved to be sufficient, making this potentially no more computationally intensive than the Anthropic proposal.

## Applied interpretability

- [Lee] Apply SAEs/transcoders to a small conv net (e.g. AlexNet) and study it in depth.
- [Lee] Figure out how to build interpretability interfaces for video/other modalities.
  - A strength of Ellena Reid’s project was that it developed a way to ‘visualize’ what neurons were activating for in audio samples. Can we improve on this method? Can we do the same for video models? What about other modalities, such as, e.g., smell, or, I don’t know, protein structure? Is there a modality-general approach for this?
- [Lee] Apply SAEs and transcoders to WhisperV2 (i.e. continue Ellena Reid’s work).
- [Lee] Identify whether or not, in a very small backdoored model, we can detect the backdoor using e.g. e2eSAEs.
- [Lee] Interpreting Mamba/SSMs using sparse dictionary learning.
- [Lee] Characterizing the geometry of low-level vision SAE features.
- [Lee] Can we understand the first sequence index of a small transformer?
- [Lucius] Attempt to understand a toy LM completely.
- [Stefan] Understand a small model (e.g. TinyStories-2L or a small TinyModel variant) from start to end, from first to last layer.

## Intrinsic interpretability

- [Lee] Can we train a small bilinear transformer on either a toy or real dataset, perform sparse dictionary learning on its activations, and understand the role of each sparse dictionary feature in terms of the closed-form solution (cf. Sharkey 2023)? This may help in identifying fundamental structures within transformers in a similar way that induction heads were discovered.
- [Lee] Interpretable inference: Can we convert already-trained models into forms that are much easier to completely interpret at little performance cost?
- [Lee] Develop A Mathematical Framework for Linear Attention Transformer Circuits.

## Understanding features (not SDL)

- [Lucius] Recovering ‘features’ through direct optimisation for auto-interpretability scores

## Theoretical foundations for interpretability

Singular-learning-theory-related:

- [Lucius] Understanding SLT at finite data/precision (see here). Is this a good approximation?
- [Lucius] Bounding the local learning coefficient (LLC) in real networks, exploiting degeneracy in the loss landscape to decompose LLMs into more interpretable parts.
- [Lucius] Understanding the relationship between the local learning coefficient (LLC) and the behavioral LLC (see here). This is a more restrictive definition since different network outputs can yield the same loss. The LLC of the behavioral loss is thus an upper bound for the LLC of the training loss. The LLC of the behavioral loss is well-defined everywhere in the loss landscape, making it potentially more useful for characterizing the complexity of neural networks at every point in training. However, the behavioral LLC is currently less well understood than the LLC. For example, it is less clearly related to network generalization ability (aside from upper bounding the LLC).

## Other

- [Lucius] Extending the current framework for computation in superposition from boolean variables to floating-point numbers or real numbers.
- [Lucius] Bounding the sparsity of LLM representations.
- [Lucius] Relating superposition to the loss landscape.

## Meta-research and philosophy

- [Lee] Write up reviews/short posts on the links between various concepts in comp neuro and mech interp, and between philosophy of science and mech interp.
- [Lee] What is a feature? What terms should we really be using here? What assumptions do these concepts make? Where does it lead when we take these assumptions to their natural conclusions?
- [Lucius] Should we expect some or many of the ‘features’ in current neural networks to be natural latents?

## Engineering

- [Dan] Create a new, high-quality tinystories dataset and model suite (credit to Noa Nabeshima for the idea).
  - The existing tinystories dataset is very formulaic, small, and has unusual unicode characters in it. Addressing these issues, and training a small model suite on this new dataset, would be very valuable.
  - Noa has already made progress by cleaning up the existing tinystories dataset and training a 4-layer model without layernorm on the clean dataset (it also comes with SAEs and transcoders trained on it). Reach out to Noa (noanabeshima@gmail.com) and/or me (dan@apolloresearch.ai) if interested in taking this on. Subsidies for compute credits for dataset generation and model training may be available.

^{[1]} Papers from our first project here and here, and from our second project here.