Interpretability techniques often need to throw away some information about a neural network's computations: the entirety of the computational graph might just be too big to understand, which is part of why we need interpretability in the first place. In this post, I want to talk about two different ways of simplifying a network's computational graph:
- Fully explaining parts of the computations the network performs (e.g. identifying a subcircuit that fully explains a specific behavior we observed)
- Approximately describing how the entire network works (e.g. finding meaningful modules in the network, whose internals we still don't understand, but that interact in simple ways)
These correspond to subsets and quotients in math, a duality that shows up in many other areas as well. I think a lot of current interpretability work is of type 1., and I'd be excited to see more of 2. as well, especially because I think there are synergies between the two.
The entire post is rather hand-wavy; I'm hoping to point at an intuition rather than formalize anything (that's planned for future posts). Note that a distinction like the one I'm making isn't new (e.g. it's intuitively clear that circuits-style research is quite different from neural network clusterability). But I haven't seen it described this explicitly before, and I think it's a useful framing to keep in mind, especially when thinking about how different interpretability techniques might combine to yield an overall understanding.
ETA: An important clarification is that for both 1. and 2., I'm only discussing interpretability techniques that try to understand the internal structure of a network. In particular, 2. talks about approximate descriptions of the algorithm the network is actually using, not just approximate descriptions of the function that's being implemented. This excludes large parts of interpretability outside AI existential safety (e.g. any method that treats the network as a black box and just fits a simpler function to the network).
Subsets vs quotients
In math, if you have a set X, there are two "dual" ways to turn this into a smaller set:
- You can take a subset S ⊆ X.
- You can take a quotient X/∼ of X by some equivalence relation ∼.
(The quotient X/∼ is the set of all equivalence classes of X under ∼.)
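To make the duality concrete, here is a minimal sketch in Python (the set X and the equivalence relation are invented for illustration):

```python
# A set X, a subset of X, and a quotient of X by an equivalence relation.

X = {0, 1, 2, 3, 4, 5}

# Subset: keep only some elements, but keep them exactly.
S = {x for x in X if x % 2 == 0}  # {0, 2, 4}

# Quotient: keep all elements, but only up to equivalence.
# Here, two numbers are equivalent iff they have the same parity.
def equiv_class(x):
    return frozenset(y for y in X if y % 2 == x % 2)

quotient = {equiv_class(x) for x in X}  # two classes: evens and odds

print(sorted(S))            # [0, 2, 4]
print(len(quotient))        # 2
```

Both `S` and `quotient` are smaller than X, but in different ways: the subset remembers three elements exactly, while the quotient remembers one fact (parity) about all six.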
I think this is also a good framing to distinguish interpretability techniques, but before elaborating on that, I want to build intuitions for subsets and quotients in contexts other than interpretability.
- Maps f: Y → X into a set X induce subsets of X (namely their image f(Y)). For this subset, it doesn't matter how many elements in Y were mapped to a given element in X, so we can assume f is injective without loss of generality. Thus, subsets are related to injective maps. Dually, maps g: X → Z out of X induce quotients of X: we can define two elements in X to be equivalent if they map to the same element in Z. The quotient is then the set of all preimages g⁻¹(z) for z ∈ g(X). Again, we can assume that g is surjective if we only care about the quotient itself, so quotients correspond to surjective maps.
- A chapter of a book is a subset of its text. A summary of the book is like a quotient. Note that both throw away information, and that for both, there will be many different possible books we can't distinguish with only the subset/quotient. But they're very different in terms of which information they throw away and which books become indistinguishable. Knowing only the first chapter leaves the rest of the book entirely unspecified, but that one chapter is nailed down exactly. Knowing a summary restricts choices for the entire book somewhat, but leaves local freedom about word choice etc. everywhere.
- If I have a dataset, then samples from that dataset are like subsets, summary statistics are like quotients. Again, both throw away information, but in very different ways.
- If I want to communicate to you what some word means, say "plant", then I can either give examples of plants (subset), or I can describe properties that plants have (quotient).
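The first bullet above (maps inducing subsets and quotients) can be sketched in a few lines of Python, with all sets and maps invented for illustration:

```python
from collections import defaultdict

X = {0, 1, 2, 3, 4, 5}

# A map f: Y -> X. Its image is a subset of X; the fact that "b" and "c"
# hit the same element is exactly the redundancy we can drop by
# assuming f is injective.
f = {"a": 1, "b": 3, "c": 3}
image = set(f.values())  # {1, 3}, a subset of X

# A map g: X -> Z. Its preimages g^-1(z) partition X, giving a quotient:
# x and x' are equivalent iff g(x) == g(x').
g = lambda x: x % 3
preimages = defaultdict(set)
for x in X:
    preimages[g(x)].add(x)

print(image)            # {1, 3}
print(dict(preimages))  # {0: {0, 3}, 1: {1, 4}, 2: {2, 5}}
```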
In the context of interpretability
The subset/quotient framework can be applied to mechanistic interpretability as follows: fully explaining part of a network is analogous to subsets, abstracting the entire network and describing how it works at a high level is analogous to quotients. Both of these are ways of simplifying a network that would otherwise be unwieldy to work with, but again, they simplify in quite different ways.
These subsets/quotients of the mechanism/computation of the network seem to somewhat correspond to subsets/quotients of the behavior of the network:
- If we interpret a subset of the neurons/weights/... of the network in detail, that subset is often chosen to explain the network's behavior on a subset of inputs very well (while we won't get much insight into what happens on other inputs).
- A rough high-level description of the network could plausibly be similarly useful to predict behavior on many different inputs. But it won't let us predict behavior exactly on any of them—we can only predict certain properties the outputs are going to have. So this leads to a quotient on outputs.
To make this a bit more formal: if we have a network that implements some function f: X → Y, then simplifying that network using interpretability tools might give us two different types of simpler functions:
- A restriction f|_S of f to some subset S ⊆ X
- A composition of f with a quotient map q: Y → Y/∼ on the outputs, i.e. a function q ∘ f: X → Y/∼
Interpreting part of the network seems related to the first of these, while abstracting the network to a high-level description seems related to the second one. For now, this is mostly a vague intuition, rather than a formal claim (and there are probably exceptions, for example looking at some random subset of neurons might just give us no predictive power at all).
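As a toy sketch of these two simplifications (the function f, the subset S, and the quotient map q are all made up for illustration, with f standing in for the network's full input-output behavior):

```python
# A "network" f: X -> Y, simplified in two ways:
# a restriction f|_S, and a composition q . f with a quotient map q on Y.

def f(x):
    # stand-in for the network's full input-output function
    return x * x - 4 * x + 3

S = range(0, 4)  # a subset of inputs we fully explain

def f_restricted(x):
    # exact predictions, but only on S
    assert x in S, "restriction only defined on S"
    return f(x)

def q(y):
    # quotient map on outputs: we only predict the sign, not the value
    return "negative" if y < 0 else "non-negative"

def f_coarse(x):
    # defined on every input, but only up to the quotient on outputs
    return q(f(x))

print(f_restricted(2))  # exact answer on S: -1
print(f_coarse(100))    # coarse answer everywhere: 'non-negative'
```

`f_restricted` is precise but partial; `f_coarse` is total but coarse. That is the subset/quotient trade-off restated in code.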
Existing interpretability work
I'll go through some examples of interpretability research and describe how I think they fit into the subset/quotient framework:
- The work on indirect object identification in GPT-2 small is a typical example of the subset approach: it explains GPT-2's behavior on a very specific subset of inputs, by analyzing a subset of its circuits in a lot of detail.
- Induction heads are similar in that they still focus on precisely understanding a small part of the network. However, they help understand behavior on a somewhat broader range of inputs, and they aren't specific to one model in particular (which is a dimension I've ignored in this post).
- The analysis of early vision in InceptionV1 has some aspects that feel quotient-y (namely grouping neurons by functionality), but it focuses entirely on the subset of early layers and mostly explains what individual neurons do. Overall, I'd put this mostly in the subset camp.
- The general idea that early layers of a CNN tend to detect low-level features like curves, which are then used to compute more complicated features, which are finally turned into an output label, is a clear example of a quotient explanation of how these models work. This is also a good example of how the approaches can interact: studying individual neurons can give strong evidence that this quotient explanation is correct.
- Clusterability in neural networks and other work on modularity are other typical examples of quotient approaches to interpretability.
- Acquisition of chess knowledge in AlphaZero combines elements of a subset and a quotient approach. Figuring out that AlphaZero represents lots of human chess concepts is part of a quotient explanation: it lets us explain at a very high level of abstraction how AlphaZero evaluates positions (presumably by using those concepts, e.g. recognizing that a position where you have an unsafe king is bad). On the other hand, the paper certainly doesn't provide a complete picture of how AlphaZero thinks, not even at such a high level of abstraction (e.g. it's unclear how these concepts are actually being used, we can only make reasonable guesses as to what a full explanation at this level would look like).
- The reverse-engineered algorithm for modular addition seems to me to be an example of a subset-based approach (i.e. my impression is that the algorithm was discovered by looking at various parts of the network and piecing together what was happening). The unusual thing about it is that the "subset" being explained is ~everything the network does. So you could just as well think of the end product as a quotient explanation (at a rather fine-grained level of abstraction). This is an example of how both approaches converge as the subset increases in size and the abstraction level becomes more and more fine-grained.
- The polytope lens itself feels like a quotient technique (reframing the computations a network is doing at a specific level of abstraction, talking about subunits as groups of polytopes with similar spline codes). However, given that it abstracts a network at a very fine-grained level, I'd expect it to be combined with subset approaches in practice. Similar things apply to the mathematical transformer circuits framework.
- Causal scrubbing focuses on testing subset explanations: it assumes that a hypothesis is an embedding of a smaller computational graph into the larger one.
Combining subset and quotient approaches
I've already mentioned two examples of how both types of techniques can work in tandem:
- A subset analysis can be used to test a quotient explanation (e.g. if I conjecture that early CNN layers detect low-level features like curves, that are then combined to compute increasingly high-level concepts like dog ears, I can test that by looking at a bunch of example neurons)
- Good fine-grained quotients can make it easier to explain subsets of a network (e.g. the polytope lens, the mathematical transformers framework, or other abstractions that are easier to work with than thinking about literal multiplications and additions of weights and activations).
Some more hypothetical examples:
- Understanding a network in terms of submodules might point us to interesting subsets to study in detail. For example, a submodule that reasons about human psychology might be more important to study than one that does simple perception tasks.
- A high-level understanding of a network should make it easier to understand low-level details in subsets. E.g. if I suspect that the neurons I'm looking at are part of a submodule that somehow implements a learned tree search, it will be much easier to figure out how the implementation works than if I'm going in blind.
- Conversely, subset-based techniques might be helpful for identifying submodules and their functions. If I figure out what a specific neuron or small group of neurons is doing, that puts restrictions on what the high-level structure of the network can be.
- We can try to first divide a network into submodules and then understand each of them using a circuits-style approach. Combining the abstraction step with the low-level interpretation lets us parallelize understanding the network. Without the initial step of finding submodules, it might be very difficult to split up the work of understanding the network between lots of researchers.
I'm pretty convinced that combining these approaches is more fruitful than either one on its own, and my guess is that this isn't a particularly controversial take. At the same time, my sense is that most interpretability research at the moment is closer to the "subset" camp, except for frameworks like transformer circuits that are about very fine-grained quotients (and thus mainly tools to enable better subset-based research). The only works I'm aware of that I'd consider clear examples of quotient research at a high level of abstraction are Daniel Filan's Clusterability in neural networks line of research and some work on modularity by John Wentworth's SERI MATS cohort.
Some guesses as to what's going on:
- I missed a bunch of work in the quotient approach.
- People think the subset approach is more promising/we don't need more research on submodules/...
- Subset-style research is currently quite tractable using empirical approaches and easier to scale, whereas quotient-style research needs more insights that are hard to find.
- Maybe the framework I'm using here is just confused? But even then, I'd still think that "finding high-level structure in neural networks" is clearly a sensible, distinct category, and one that's neglected compared to circuits-style work.
I'd be very curious to hear your thoughts (especially from people working on interpretability: why did you pick the specific approach you're using?)
Returning to causal scrubbing: a way to fit in quotient explanations would be to make the larger graph itself a quotient of the neural network, i.e. to have its nodes perform complex computations. But causal scrubbing doesn't really discuss what makes such a quotient explanation a good one (beyond extensional equality).