Written at Apollo Research

Summary: Superposition-based interpretations of neural network activation spaces are incomplete. The specific locations of feature vectors contain crucial structural information beyond superposition, as seen in circular arrangements of day-of-the-week features and in the rich structures of feature UMAPs. We don’t currently have good concepts for talking about this structure in feature geometry, but it is likely very important for model computation. An eventual understanding of feature geometry might look like a hodgepodge of case-specific explanations, or supplementing superposition with additional concepts, or plausibly an entirely new theory that supersedes superposition. To develop this understanding, it may be valuable to study toy models in depth and do theoretical or conceptual work in addition to studying frontier models.

Epistemic status: Decently confident that the ideas here are directionally correct. I’ve been thinking these thoughts for a while, and recently got round to writing them up at a high level. Lots of people (including both SAE stans and SAE skeptics) have thought very similar things before and some of them have written about it in various places too. Some of my views, especially the merit of certain research approaches to tackle the problems I highlight, have been presented here without my best attempt to argue for them.

What would it mean if we could fully understand an activation space through the lens of superposition?

If you fully understand something, you can explain everything about it that matters to someone else in terms of concepts you (and hopefully they) understand. So we can think about how well I understand an activation space by how well I can communicate to you what the activation space is doing, and we can test if my explanation is good by seeing if you can construct a functionally equivalent activation space (which need not be completely identical of course) solely from the information I have given you. In the case of SAEs, here's what I might say:

The activation space contains this list of 100 million features, which I can describe concisely in words because they are monosemantic.
The features are embedded as vectors, and the activation vector on any input is a linear combination of the feature vectors that are related to the input.
As for where in the activation space each feature vector is placed, oh that doesn't really matter and any nearly orthogonal overcomplete basis will do. Or maybe if I'm being more sophisticated, I can specify the correlations between features and that’s enough to pin down all the structure that matters — all the other details of the overcomplete basis are random.

Every part of this explanation is in terms of things I understand precisely. My features are described in natural language, and I know what a random overcomplete basis is (although I’m on the fence about whether a large correlation matrix counts as something that I understand).

The placement of each feature vector in the activation space matters

Why might this description be insufficient? First, there is the pesky problem of SAE reconstruction errors, which are parts of activation vectors that are missed when we give this description. Second, not all features seem monosemantic, and it is hard to find semantic descriptions of even the most monosemantic features that have both high sensitivity and specificity, let alone descriptions which allow us to predict the quantitative values that activating features take on a particular input.

But let’s suppose that these issues have been solved: SAE improvements lead to perfect reconstruction and extremely monosemantic features, and new autointerp techniques lead to highly sensitive and specific feature descriptions. I claim that even then, our protocol will fail to be a valid explanation for the activation space, because the placement of each feature vector in the activation space matters. The location of feature vectors contains rich structure that demands explanation^[1]! If your knockoff activation space contains the same features activating on the same dataset examples, but you haven’t placed each feature in the right place, your space contains a lot less relevant information than mine, and it’s less usable by a model.

Wait, you say, this is fine! I can just communicate the weights of my SAE’s decoder matrix to you, and then you will know where all the feature vectors go, and then you get my richly structured activation space. But this is breaking the rules! If I understand something, I must be able to explain it to you in concepts I understand, and I do not understand what information is contained in the giant inscrutable decoder matrix. You should be no more content to receive a decoder matrix from me as an ‘explanation’ than you should be content to receive a list of weight matrices, which would also suffice to allow you to reproduce the activation space.

If you are unconvinced, here is some evidence that the location of feature vectors matters:

The days of the week/months of the year lie on a circle, in order. Let’s be clear about what the interesting finding is from Engels et al.: it’s not that all the days of the week have high cosine sim with each other, or even really that they live in a subspace, but that they are in order! In a very real sense, this scenario is outside the scope of the superposition hypothesis, which posits that the important structure in an activation space is the sparse coding of vectors which have no important residual structure beyond correlations. There are good reasons to think that these two circles found in the paper are the rule not the exception:

This structure lies entirely in a 2 dimensional subspace, which makes it super easy to visualize. We should suspect a streetlight effect.
We are familiar with modular addition being performed in a circle from Nanda et al., so we were primed to spot this kind of thing — more evidence of street lighting.

More generally, now that we know about the possibility of important information being encoded in the location of feature vectors, we can ask how much we have previously been victims of streetlight effects, and I think the answer is a lot. Once you consider that this kind of structure is present in some cases, it seems much more natural to assume that the structure is ubiquitous than that the right way to think of activation spaces is with superposition plus an occasional circle.

More generally, I think that evidence of the ubiquity of important structure in feature vectors comes from UMAPs and feature splitting:

There is lots of evidence UMAPs of feature vectors are interesting to look at. If all the important structure of an activation space is contained in the feature descriptions, then the UMAP should be a random mess, because once everything has been explained all that can be left behind is random noise. Instead, UMAPs of features contain incredibly rich structure that can be looked into for hours on end. Similarly, feature splitting is evidence that the location of SAE features is heavily related to their semantic meaning, and it is some evidence of some complicated hierarchy of SAE features.

Both of these are evidence that there is more to be explained than just which features are present and when they fire! I don’t think many people are currently thinking about how to explain the interesting structure in a systematic way, and I don’t think ad hoc explanations like ‘this cluster of features is all about the bay area in some way’ are actually useful for understanding things in general. I don’t think we have yet elicited the right concepts for talking about it. I also don’t think this structure will be solved on the current default trajectory: I don’t think the default path for sparse dictionary learning involves a next generation of SAEs which have boring structureless UMAPs or don’t exhibit feature splitting.

What types of theories could fill this gap in our understanding?

I am currently in the headspace of trying to map out the space of possible theories that all explain the current success of SAEs, but which could also explain the gaps that SAEs leave. I see three broad classes of options.

Option 1: A hodge-podge of explanations

Maybe there is lots of remaining structure to explain after superposition has been used to explain the space, but there are no broadly applicable theories or models which allow us to describe this theory well. There could be huge contingencies and idiosyncrasies which mean that the best we can do is describe this circle, that tetrahedron, this region of high cosine sim, each with a bespoke explanation.

I think it’s pretty plausible that this is the way things go, but it is much too early to conclude that this pessimistic option is correct. Even in this case, it would be valuable to develop a taxonomy of all the different kinds of structure that are commonly found.

Option 2: Supplementing superposition

In this scenario, we build a general theory of how feature vectors are arranged, which is as broadly applicable as the idea of superposition and the (approximately) linear representation hypothesis. Ideally, this theory would elegantly explain UMAP structure, feature splitting, and present and future case studies of interesting nonlinear feature arrangements. This option involves adding a theory ‘on top of’ superposition: our best interpretations of activation spaces would involve describing the feature dictionary, and then describing the location of features using the newly developed concepts.

It’s hard to flesh out examples of what these new concepts might look like before the conceptual work has been done. The closest thing to an existing concept that supplements superposition is the idea introduced by Anthropic that features can be placed on a hierarchical tree: the tree can be built from looking at dictionaries of different sizes, relating parent and children nodes via feature splitting. Ideally, we’d also be able to relate this tree to something like computational paths through the network: maybe the tree distance between two features measures the amount of difference between the subsequent computation done on each feature. This particular extra structure isn’t sufficient: I expect there to be much richer structure in feature vector locations than just a distance between features. For example, it seems hard to understand how a tree-like structure could explain circular features.

Option 3: Superseding superposition

Alternatively, we may develop a more sophisticated theory which explains all the results of SAEs and more besides, but which supplants superposition as an explanation of activation space structure instead of building on top of it^[2]. I have quite substantial probability on the ‘next’ theory looking more like this than something that is built on top of superposition.

Here’s a thought experiment that motivates why SAE results might be compatible with the computational latents not being superpositional directions:

Suppose that I have a ‘semantic vector space’. Every sentence on the internet corresponds to a point in this space, with the property that the heuristic notion of ‘semantic distance’ (the qualitative difference in meaning between two sentences) between two sentences always corresponds to Euclidean distance in this space. We should expect that points in this space form clusters, because sentences are semantically clustered: there are lots of very similar sentences. In fact, points may lie in several clusters: a sentence about blue boats will be close to sentences about other blue things, and also to sentences about boats. However, these clusters are not themselves necessarily the structure of the semantic space! They are a downstream consequence of the space being structured semantically. Any good semantic space would have these clusters as they are a property of the world and of the dataset.

Insofar as SAE training is like clustering, this argument applies to SAEs as well: perhaps any good theory which explains the rich structure of the activation space would predict that SAEs perform well as a downstream consequence^[3]. Further, it seems extremely possible to me that there might be some sensible non-superpositional way to describe the structure of the activation space which has SAE performance drop out in the same step that describes the relation between SAE decoder directions^[4].

How could we discover the new theory?

Discovering the new theory is hard! I can see a few very high level research approaches that seem sensible.

Approach 1: Investigating feature structure in big SAEs

The most widely taken approach for improving our understanding of how SAEs learn features is to train more SAEs and to investigate their properties in an ad hoc way guided by scientific intuition. The main advantage of this approach is that if we want to understand the limitations of SAEs on big language models, then any extra data we collect by studying SAEs on big language models is unambiguously relevant. There may be some results that are easy to find which make discovering the new theory much easier, in the same way that I have a much easier time describing my issues with SAE feature geometry now that people have discovered that days of the week lie on a circle. It may turn out that if we collect lots more examples of interesting observations about SAE feature geometry, then new theories which explain these observations will become obvious.

Approach 2: Directly look for interesting structure in LLM representations

Another approach to understanding activation space structure is to carefully make a case for some part of the activation space having a particular ground truth structure, for example polyhedra, hypersurfaces and so on. It’s possible that if we carefully identify many more examples of internal structure, new theories which unify these observations will emerge.

More specifically, I think it would be valuable to take a set of interesting examples of understood internal structure, and to ask what happens when we train SAEs to try to capture this structure. In some cases, it may be possible for the structure to be thought of as a set of feature directions although they may not be sparse or particularly independent of each other — does the SAE find these feature directions? In other cases, it may seem to us very unnatural to think of the structure we have uncovered in terms of a set of directions (sparse or otherwise) — what does the SAE do in this case? If we have a range of examples of representational structures paired with SAE features that try to learn these structures, maybe this way we can learn how to interpret the information contained about the activation space that is contained within the SAE decoder directions.

Approach 3: Carefully reverse engineering relevant toy models with ground truth access

There are huge (underrated in my opinion) advantages to doing interpretability research in a domain where we have access to the ground truth, and when it comes to building a new theory, I think the case for working in a ground truth environment is especially strong. In the case of language modelling, it’s hard to resolve even big disagreements about how much feature geometry matters (both the view that feature geometry doesn’t matter at all and the view that they imply SAEs have achieved nearly nothing are not insane to me), but if we know what the correct answer is, we can just ask if SAEs enabled us to find the answer. For example, we could train a neural network to emulate a boolean circuit, and we could try to carefully reverse engineer which boolean circuit has been learned^[5]. Then, we could use SAEs (or indeed other techniques) and try to understand how to translate between our carefully reverse engineered implementations and the results of SAE training.

Toy models also have the advantage that we can much more easily understand them in depth, and we can more straightforwardly hand pick the representational/computational structure we are investigating. We can also iterate much more quickly! It may be easier to carefully understand interesting representational structures à la Approach 2 in toy models for this reason. The obvious, substantial, downside of all toy model research is that we can’t be sure that the insights we take from toy models are relevant to LLMs. This is a very real downside, but I think it can be effectively mitigated by

motivating toy models based on some specific hypothesis informed by thinking about bigger models
studying a broad range of toy models and not anchoring too heavily on specific things learned in one toy model.

Approach 4: Theoretical work to unite experimental results and motivate new experiments

If we are to develop a new theory (or several), there will have to be some conceptual and theoretical breakthroughs. This could look like translating work from other fields as with superposition — the superposition hypothesis was heavily inspired by the field of sparse coding and compressed sensing — or it may look like the development of genuinely new concepts, and maybe even new maths.

Of the four approaches here, I’m least confident that it makes sense for people to take this approach directly — perhaps the conceptual work is best done as part of one of the other approaches. However I include this as its own approach because I think there are some valuable standalone questions that could be directly tackled. For example:

What kinds of structures in activation space are compatible with every activation vector being sparse in a particular overcomplete basis? Can people come up with (perhaps hand constructed) examples of spaces in which SAEs achieve perfect reconstruction, but which are best thought of in some other way?
If we now replace the assumption that SAEs get perfect performance with more realistic assumptions, which structures that would be ruled out by perfect performance SAEs are now possible? How does the space of compatible theories vary with reconstruction error, L0, dictionary size, interp scores?

Acknowledgements

Thanks to Kaarel Hanni, Stefan Heimersheim, Lee Sharkey, Lucius Bushnaq, Joseph Bloom, Clem von Stengel, Hoagy Cunningham, Bilal Chughtai and Andy Arditi for useful discussions.

^{^}
This point is already known to neuroscientists.
^{^}
Here’s some people at Anthropic suggesting that they are also open to this possibility.
^{^}
Activation spaces in language models are not really semantic spaces. Instead, one potentially useful framing for thinking about the structure in an activation space is: the dataset contains an exorbitant amount of interesting structure, much more than is used by current models. The model has learned to use (compute with or predict with) a particular subset of that structure, and a particular interp technique allows us to elicit another, distinct subset of that structure. Ideally, we want our interp technique to elicit (convert into understandable description) the same structure as used by the model. However, it’s easy for an overly powerful interp technique (such as a very nonlinear probe) to discover structure in the data that is not usable by the model, and it is also easy for an interp technique to fail to elicit structure that the model is actually able to use (such as certain aspects of SAE feature geometry). This framing motivates the idea that if we want to do interpretability by understanding activation spaces and identifying features, we have to regularise our search for structure with an understanding of what computation can be performed by the model on the space. This is how the toy model of superposition and the SAE arrive at the idea that a feature is a projection followed by a ReLU, and reasoning along the same lines is why my post on computation in superposition suggests that features are projections without a ReLU.
^{^}
This seems particularly plausible if we are talking about current SAE performance, with imperfect reconstruction and inadequate feature interpretations. The better these metrics become (and L0 and dictionary size metrics), the more they will constrain the space of theories by the requirement that these theories explain SAE performance, making it more likely that the best interpretation of an activation space really is in terms of the SAE features.
^{^}
I am a big fan of this research project in particular and expect to work on boolean circuit toy models in future (for more reasons than discussed here). If you are interested in understanding how neural networks learn and represent interesting (perhaps sparse) classes of boolean circuits, or have investigated this before, I might be keen to chat.

AI ALIGNMENT FORUM
AF