This is the fifth post in a sequence called 200 Concrete Open Problems in Mechanistic Interpretability. Start here, then read in any order. If you want to learn the basics before you think about open problems, check out my post on getting started. Look up jargon in my Mechanistic Interpretability Explainer.
Motivating papers: Toy Models of Superposition, Softmax Linear Units
If you're familiar with polysemanticity and superposition, skip to Motivation or Problems.
Neural networks are very high dimensional objects, in both their parameters and their activations. One of the key challenges in Mechanistic Interpretability is to somehow resolve the curse of dimensionality, and to break them down into lower dimensional objects that can be understood (semi-)independently.
Our current best understanding of models is that, internally, they compute features: specific properties of the input, like "this token is a verb" or "this is a number that describes a group of people" or "this part of the image represents a car wheel". Early in the model there are simpler features, which are later used to compute more complex features by being connected up in a circuit (example shown above (source)). Further, our guess is that features correspond to directions in activation space. That is, for any feature that the model represents, there is some vector corresponding to it, and if we take the dot product of the model's activations with that vector, we get out a number representing whether that feature is present (these are known as decomposable, linear representations).
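To make the "features as directions" picture concrete, here is a minimal sketch in plain Python. The two feature directions, their names, and all the numbers are made up for illustration; they are not taken from any real model.

```python
# Hypothetical example: reading a feature off an activation with a dot product.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Two made-up, orthogonal, unit-length feature directions
# in a 4-dimensional activation space.
verb_direction = [0.5, 0.5, 0.5, 0.5]
number_direction = [0.5, -0.5, 0.5, -0.5]

# An activation containing the "verb" feature with strength 2.0
# and the "number" feature with strength 0.0.
activation = [2.0 * v + 0.0 * n for v, n in zip(verb_direction, number_direction)]

# Projecting onto each direction recovers how strongly that feature is present.
verb_score = dot(activation, verb_direction)
number_score = dot(activation, number_direction)
print(verb_score, number_score)
```

Because the two directions here are orthogonal, each projection reads out its own feature without picking up the other. That clean separation is exactly what is lost once features have to share dimensions.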
This is an extremely useful thing to be true about a model! An even more helpful thing to be true would be if neurons (ie the outputs of an activation function like ReLU) correspond to features. Naively, this is natural for the model to do, because a non-linearity like ReLU acts element-wise - each neuron's activation is computed independently (this is an example of a privileged basis). Concretely, if a neuron can represent feature A or feature B, then that neuron will fire differently for feature A and NOT feature B, vs feature A and feature B, meaning that the presence of B interferes with the ability to compute A. But if each feature has its own neuron we're fine!
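A toy illustration of that interference, as a sketch rather than anything from a real network: the weights of 1.0 and the function names are hypothetical.

```python
def relu(x):
    """Element-wise non-linearity applied to a single neuron."""
    return max(0.0, x)

def shared_neuron(a, b):
    """One hypothetical neuron that fires for feature A or feature B."""
    return relu(1.0 * a + 1.0 * b)

def dedicated_neurons(a, b):
    """Two hypothetical neurons, one per feature."""
    return relu(a), relu(b)

# The shared neuron fires differently depending on whether B is also present,
# and cannot tell "only A" apart from "only B".
only_a = shared_neuron(a=1.0, b=0.0)
only_b = shared_neuron(a=0.0, b=1.0)
both = shared_neuron(a=1.0, b=1.0)

# With one neuron per feature, A's neuron is unaffected by B.
a_alone, _ = dedicated_neurons(a=1.0, b=0.0)
a_with_b, _ = dedicated_neurons(a=1.0, b=1.0)
print(only_a, only_b, both, a_alone, a_with_b)
```

Reading the shared neuron alone, "only A" and "only B" give the same value, and adding B changes the value seen for A. With dedicated neurons, A's readout is the same whether or not B is present.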
If features correspond to neurons, we're playing interpretability on easy mode - we can focus on just figuring out which feature corresponds to each neuron. In theory we could even show that a feature is not present by verifying that it's not present in each neuron! However, reality is not as nice as this convenient story. A countervailing force is the phenomenon of superposition. Superposition is when a network represents more features than it has dimensions, and squashes them all into a lower dimensional space. You can think of superposition as the model simulating a larger model.
Anthropic's Toy Models of Superposition paper is a great exploration of this. They build a toy model that learns to use superposition (notably different from a toy language model!). The model starts with a bunch of independently varying features, needs to compress these to a low dimensional space, and then is trained to recover each feature from the compressed mess. And it turns out that it does learn to use superposition!
Specifically, it makes sense for the model to use superposition for sufficiently rare (sparse) features, if we give it non-linearities to clean up interference. Further, the use of superposition can be modelled as a trade-off between the costs of interference, and the benefits of representing more features. And digging further into their toy models, they find all kinds of fascinating motifs regarding exactly how superposition occurs, notably that the features are sometimes compressed in geometric configurations, eg 5 features being compressed into two dimensions as the vertices of a pentagon, as shown below.
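Here is a minimal sketch of the pentagon configuration in plain Python. The weights and bias are hand-picked for illustration (not trained, and not the paper's learned values), and the function names are hypothetical. The structure is compress with W, then recover with ReLU(W^T h + b).

```python
import math

N_FEATURES = 5

# Hand-picked, not trained: feature i is embedded as vertex i of a regular pentagon.
W = [(math.cos(2 * math.pi * i / N_FEATURES), math.sin(2 * math.pi * i / N_FEATURES))
     for i in range(N_FEATURES)]

# Hand-picked bias, slightly more negative than -cos(72 degrees) (about -0.309),
# so that interference from neighbouring vertices is zeroed out by the ReLU.
BIAS = -0.35

def compress(x):
    """Project 5 feature values down to a 2-dimensional hidden vector h = W x."""
    return (sum(x[i] * W[i][0] for i in range(N_FEATURES)),
            sum(x[i] * W[i][1] for i in range(N_FEATURES)))

def recover(h):
    """Read each feature back out as ReLU(w_i . h + bias)."""
    return [max(0.0, W[i][0] * h[0] + W[i][1] * h[1] + BIAS)
            for i in range(N_FEATURES)]

def roundtrip(x):
    """Compress to 2 dimensions, then try to recover all 5 features."""
    return recover(compress(x))

sparse = roundtrip([0, 0, 1, 0, 0])  # only feature 2 is active
dense = roundtrip([1, 1, 0, 0, 0])   # features 0 and 1 are active together
print([round(v, 3) for v in sparse])
print([round(v, 3) for v in dense])
```

When a single feature is active, the ReLU and bias clean up the interference and only that feature comes back non-zero (scaled down by the bias). When two neighbouring features are active at once, each one's readout is inflated by the other. This is the cost of interference that makes superposition worthwhile only when features are sparse.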
Zooming out, what does this mean for what research actually needs to be done? To me, when I imagine what real progress here might look like, I picture the following:
The direction I'm most excited about is a combination of 1 and 2, to form a rich feedback loop between toy models and real models - toy models generate hypotheses to test, and exploring real models generates confusions to study in toy models.
This spreadsheet lists each problem in the sequence. You can write down your contact details if you're working on any of them and want collaborators, see any existing work or reach out to other people on there! (thanks to Jay Bailey for making it)
Notation: the ReLU output model is the main model in the Toy Models of Superposition paper, which compresses features in a linear bottleneck; the absolute value model is the model studied with a ReLU hidden layer and output layer, and which uses neuron superposition.
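A rough sketch of the two models in equations. The symbols (W, b, the importance weights I_i, W_1, W_2) are assumptions meant to follow the paper's notation, and are worth checking against the paper itself.

```latex
% ReLU output model: linear bottleneck, then ReLU on the output.
h = W x, \qquad x' = \mathrm{ReLU}(W^\top h + b) = \mathrm{ReLU}(W^\top W x + b),
\qquad L = \sum_x \sum_i I_i \,(x_i - x'_i)^2

% Absolute value model: ReLU on the hidden layer and on the output,
% trained to output the absolute value of each input feature.
h = \mathrm{ReLU}(W_1 x), \qquad y' = \mathrm{ReLU}(W_2 h + b), \qquad y = |x|
```

In the first model the hidden vector h has no non-linearity applied to it, so it has no privileged basis. In the second, the ReLU on h is what makes it meaningful to talk about individual neurons being in superposition.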
x -> x^2
A ... B -> A