Lee Sharkey

Research engineer at Conjecture (London). 

My main research interests are mechanistic interpretability and inner alignment. 


That's correct. 'Correlated features' could ambiguously mean "Feature x tends to activate when feature y activates" OR "When we generate feature direction x, its distribution is correlated with feature y's". I don't know if both happen in LMs. The former almost certainly does. The latter doesn't really make sense in the context of LMs, since features are learned, not sampled from a distribution.

There should be a neat theoretical reason for the clean power law where L1 loss becomes too big. But it doesn't make intuitive sense to me - it seems like if you just add some useless entries in the dictionary, the effect of losing one of the dimensions you do use on reconstruction loss won't change, so why should the point where L1 loss becomes too big change? So unless you have a bug (or some weird design choice that divides loss by number of dimensions), those extra dimensions would have to be changing something.

The L1 loss on the activations does indeed take the mean activation value. I think it's probably a more practical choice than simply taking the sum because it makes the hyperparameters more independent of each other: we wouldn't want the size of the sparsity loss to change wildly relative to the reconstruction loss when we change the dictionary size. I forgot to include the averaging terms in the methods section, and I've now updated the text in the article. Good spot, thanks!
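To make the difference concrete, here is a minimal NumPy sketch of the two reductions. It is not the code from the article: the function name, the `reduce` switch, the `l1_coeff` value and the constant dummy codes are all assumptions made for illustration. It only shows that, for codes with the same per-entry statistics, a mean-reduced L1 term stays put when the dictionary grows, while a sum-reduced one grows with the dictionary size.

```python
import numpy as np

def sparse_autoencoder_losses(x, x_hat, codes, l1_coeff=1e-3, reduce="mean"):
    """Return (reconstruction_loss, sparsity_loss) for one batch.

    x, x_hat: arrays of shape (batch, d_activation)
    codes:    array of shape (batch, dict_size), the dictionary coefficients
    reduce:   "mean" averages |codes| over batch AND dictionary entries;
              "sum" sums over dictionary entries, then averages over the batch.
    All names and defaults here are hypothetical.
    """
    recon = np.mean((x - x_hat) ** 2)
    abs_codes = np.abs(codes)
    if reduce == "mean":
        l1 = abs_codes.mean()
    elif reduce == "sum":
        l1 = abs_codes.sum(axis=1).mean()
    else:
        raise ValueError(f"unknown reduce: {reduce!r}")
    return recon, l1_coeff * l1

# Dummy codes with identical per-entry values at two dictionary sizes.
x = np.zeros((8, 4))
for dict_size in (256, 1024):
    codes = np.full((8, dict_size), 0.5)
    _, l1_mean = sparse_autoencoder_losses(x, x, codes, reduce="mean")
    _, l1_sum = sparse_autoencoder_losses(x, x, codes, reduce="sum")
    print(dict_size, l1_mean, l1_sum)
```

With real, sparse codes the per-entry statistics will themselves shift with dictionary size, so the mean does not make the two loss terms perfectly independent. It just removes the most obvious linear dependence.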

I'd definitely be interested in you including this as a variable in the toy data, and seeing how it affects the hyperparameter search heuristics.

Yeah I think this is probably worth checking too. We probably wouldn't need to have too many different values to get a rough sense of its effect. 

Fig. 9 is cursed. Is there a problem with estimating from just one component of the loss?

Yeah it kind of is... It's probably better to just look at each loss component separately. Very helpful feedback, thanks!

In the toy datasets, the features have the same scale (uniform from zero to one when active, multiplied by a unit vector). However, in the NN case, there's no particular reason to think the feature scales are normalized very much (though maybe they're normalized a bit due to weight decay and similar). Is there some reason this is ok?

Hm, it's a great point. There's no principled reason for it. Equivalently, there's no principled reason to expect the coefficients/activations for each feature to be on the same scale either. We should probably look into a 'feature coefficient magnitude decay' to create features that don't all live on the same scale. Thanks!
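As a rough sketch of what such a magnitude decay could look like in the toy data generator (this is not the generator used in the article; `make_toy_data`, `p_active` and `scale_decay` are hypothetical names, and the geometric decay schedule is just one possible choice):

```python
import numpy as np

def make_toy_data(n_samples, n_features, dim, p_active=0.05,
                  scale_decay=1.0, seed=0):
    """Generate toy activations as sparse combinations of unit feature directions.

    Each feature's coefficient is uniform in [0, 1) when active and 0 otherwise,
    then multiplied by a per-feature scale scale_decay ** feature_index.
    scale_decay=1.0 gives the same-scale setup described above;
    scale_decay < 1.0 makes later features progressively smaller.
    """
    rng = np.random.default_rng(seed)
    # Random unit-norm feature directions.
    directions = rng.standard_normal((n_features, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Sparse activity mask and uniform coefficients.
    active = rng.random((n_samples, n_features)) < p_active
    coeffs = rng.random((n_samples, n_features)) * active
    # Hypothetical 'feature coefficient magnitude decay'.
    scales = scale_decay ** np.arange(n_features)
    coeffs = coeffs * scales
    return coeffs @ directions, coeffs, directions

data, coeffs, directions = make_toy_data(1000, 512, 256, scale_decay=0.995)
print(data.shape, coeffs[:, 0].max(), coeffs[:, -1].max())
```

Sweeping `scale_decay` over a handful of values would be one cheap way to see how much the hyperparameter search heuristics depend on features sharing a scale.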

E.g., learn a low rank autoencoder like in the toy models paper and then learn to extract features from this representation? I don't see a particular reason why you used a hand derived superposition representation (which seems less realistic to me?).

One reason for this is that the polytopic features learned by the model in the Toy Models of Superposition paper can be thought of as approximately maximally distant points on a hypersphere (to my intuitions at least). When using high-ish numbers of dimensions as in our toy data (256), choosing points randomly on the hypersphere achieves approximately the same thing. By choosing points randomly, as we did here, we don't have to train another potentially very large matrix that puts the one-hot features into superposition. The data generation method seemed like it would approximate real features about as well as polytope-like encodings of one-hot features (which are unrealistic too), so the small benefits didn't seem worth the moderate computational costs. But I could be convinced otherwise if I've missed some important benefits.
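A quick way to check the intuition about random points on the hypersphere is to sample unit vectors in 256 dimensions and look at their pairwise cosine similarities. This is only an illustrative sketch: the helper names and the choice of 512 features are assumptions, and it demonstrates near-orthogonality of random directions rather than proving they are maximally distant.

```python
import numpy as np

def random_unit_vectors(n, dim, seed=0):
    """Sample n points uniformly on the unit hypersphere in `dim` dimensions
    by normalizing standard Gaussian vectors."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((n, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def offdiag_cosines(vectors):
    """Cosine similarities between all ordered pairs of distinct unit vectors."""
    sims = vectors @ vectors.T
    mask = ~np.eye(len(vectors), dtype=bool)
    return sims[mask]

features = random_unit_vectors(512, 256)
cos = offdiag_cosines(features)
# Typical |cosine| between random directions is on the order of 1 / sqrt(dim).
print("mean |cos|:", np.abs(cos).mean())
print("max  |cos|:", np.abs(cos).max())
print("1/sqrt(dim):", 1 / np.sqrt(256))
```

Most pairs come out close to orthogonal, though the worst-case pair overlaps noticeably more than the average, which is one place where random directions differ from an optimized polytope-like arrangement.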

Beyond this, I imagine it would be nicer if you trained a model to do computation in superposition and then tried to decode the representations the model uses - you should still be able to know what the 'real' features are (I think).

Nice idea! This could potentially be a nice middle ground between toy data experiments and language model experiments. We'll look into this, thanks again!


This sounds really reasonable. I had only been thinking of a naive version of interpretability tools in the loss function that doesn't attempt to interpret the gradient descent process. I'd be genuinely enthusiastic about the strong version you outlined. I expect to think a lot about it in the near future.