Interpretability techniques often need to throw away some information about a neural network's computations: the entirety of the computational graph might just be too big to understand, which is part of why we need interpretability in the first place. In this post, I want to talk about two different ways of simplifying a network's computational graph:
1. Fully explaining only part of the network.
2. Abstracting the entire network and describing how it works at a high level.
These correspond to the idea of subsets and quotients in math, as well as many other instances of this duality in other areas. I think lots of interpretability at the moment is 1., and I'd be excited to see more of 2. as well, especially because I think there are synergies between the two.
The entire post is rather hand-wavy; I'm hoping to point at an intuition rather than formalize anything (that's planned for future posts). Note that a distinction like the one I'm making isn't new (e.g. it's intuitively clear that circuits-style research is quite different from neural network clusterability). But I haven't seen it described this explicitly before, and I think it's a useful framing to keep in mind, especially when thinking about how different interpretability techniques might combine to yield an overall understanding.
ETA: An important clarification is that for both 1. and 2., I'm only discussing interpretability techniques that try to understand the internal structure of a network. In particular, 2. talks about approximate descriptions of the algorithm the network is actually using, not just approximate descriptions of the function that's being implemented. This excludes large parts of interpretability outside AI existential safety (e.g. any method that treats the network as a black box and just fits a simpler function to the network).
In math, if you have a set X, there are two "dual" ways to turn this into a smaller set:
1. Take a subset A ⊆ X.
2. Take a quotient X/∼ with respect to an equivalence relation ∼ on X.
(The quotient X/∼ is the set of all equivalence classes under ∼.)
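To make the two constructions concrete, here is a minimal Python sketch. The helper names `subset` and `quotient`, and the choice of parity as the equivalence relation, are illustrative assumptions rather than anything from a particular library. A subset keeps some elements and forgets the rest; a quotient keeps every element but forgets the differences between elements in the same class.

```python
from collections import defaultdict

def subset(xs, keep):
    """Return the elements of xs for which keep(x) is true."""
    return {x for x in xs if keep(x)}

def quotient(xs, label):
    """Return the equivalence classes of xs, where x ~ y iff label(x) == label(y)."""
    classes = defaultdict(set)
    for x in xs:
        classes[label(x)].add(x)
    return {frozenset(c) for c in classes.values()}

X = set(range(10))

# Subset: fewer elements, each one kept in full detail.
small = subset(X, lambda x: x < 3)

# Quotient: every element is still represented, but only up to parity.
parity_classes = quotient(X, lambda x: x % 2)
```

Both `small` and `parity_classes` are smaller than `X`, but for different reasons: the first drops seven elements entirely, while the second merges all ten into two classes.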
I think this is also a good framing to distinguish interpretability techniques, but before elaborating on that, I want to build intuitions for subsets and quotients in contexts other than interpretability.
The subset/quotient framework can be applied to mechanistic interpretability as follows: fully explaining part of a network is analogous to subsets, abstracting the entire network and describing how it works at a high level is analogous to quotients. Both of these are ways of simplifying a network that would otherwise be unwieldy to work with, but again, they simplify in quite different ways.
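As a rough illustration of the two kinds of simplification, here is a toy sketch on a tiny hand-written computational graph. The graph, the node names, and the helpers `subgraph` and `quotient_graph` are all hypothetical; real networks and real interpretability tools are of course far messier.

```python
# Toy computational graph: an edge (u, v) means "v reads the output of u".
edges = {
    ("x1", "h1"), ("x1", "h2"), ("x2", "h2"), ("x2", "h3"),
    ("h1", "y"), ("h2", "y"), ("h3", "y"),
}

def subgraph(edges, nodes):
    """Subset-style: keep only the edges whose endpoints both lie in `nodes`."""
    return {(u, v) for (u, v) in edges if u in nodes and v in nodes}

def quotient_graph(edges, block_of):
    """Quotient-style: merge nodes into blocks and keep edges between distinct blocks."""
    return {
        (block_of[u], block_of[v])
        for (u, v) in edges
        if block_of[u] != block_of[v]
    }

# A "circuit": one path through the network, described in full detail.
circuit = subgraph(edges, {"x1", "h1", "y"})

# A high-level description: every node is covered, but only at block granularity.
block_of = {
    "x1": "input", "x2": "input",
    "h1": "hidden", "h2": "hidden", "h3": "hidden",
    "y": "output",
}
coarse = quotient_graph(edges, block_of)
```

Here `circuit` says exactly what happens along one path and nothing about the rest of the graph, while `coarse` says something about every node but nothing about what happens inside a block.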
These subsets/quotients of the mechanism/computation of the network seem to somewhat correspond to subsets/quotients of the behavior of the network:
To make this a bit more formal: if we have a network that implements some function f:X→Y, then simplifying that network using interpretability tools might give us two different types of simpler functions:
Interpreting part of the network seems related to the first of these, while abstracting the network to a high-level description seems related to the second one. For now, this is mostly a vague intuition rather than a formal claim (and there are probably exceptions; for example, looking at some random subset of neurons might just give us no predictive power at all).
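One plausible way to read the two types of simpler functions is as (a) an exact description of f on only a subset of inputs, and (b) a description of f on all inputs, but only up to an equivalence relation on outputs. The sketch below assumes that reading; the toy function `f`, the subset of non-negative inputs, and the sign-based equivalence classes are all made up for illustration.

```python
def f(x: int) -> int:
    """Stand-in for the network's input-output function (hypothetical toy)."""
    if x >= 0:
        return 3 * x + 1
    return -(x * x) - 2

def sign_class(y: int) -> str:
    """Quotient map on outputs: two outputs are equivalent iff they share a sign class."""
    return "positive" if y > 0 else "non-positive"

def partial_explanation(x: int):
    """Subset-style: exact prediction, but only defined on non-negative inputs."""
    return 3 * x + 1 if x >= 0 else None

def coarse_explanation(x: int) -> str:
    """Quotient-style: defined on every input, but predicts only the sign class of f(x)."""
    return "positive" if x >= 0 else "non-positive"

inputs = range(-5, 6)

# The partial explanation matches f exactly wherever it applies.
exact_on_subset = all(partial_explanation(x) == f(x) for x in inputs if x >= 0)

# The coarse explanation matches f on all inputs, up to the equivalence relation.
correct_up_to_class = all(coarse_explanation(x) == sign_class(f(x)) for x in inputs)
```

Neither explanation captures all of `f`: the first is silent on negative inputs, and the second never says how large the output is.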
I'll go through some examples of interpretability research and describe how I think they fit into the subset/quotient framework:
I've already mentioned two examples of how both types of techniques can work in tandem:
Some more hypothetical examples:
I'm pretty convinced that combining these approaches is more fruitful than either one on its own, and my guess is that this isn't a particularly controversial take. At the same time, my sense is that most interpretability research at the moment is closer to the "subset" camp, except for frameworks like transformer circuits that are about very fine-grained quotients (and thus mainly tools to enable better subset-based research). The only pieces of work I'm aware of that I would consider clear examples of quotient research at a high level of abstraction are Daniel Filan's Clusterability in neural networks line of research and some work on modularity by John Wentworth's SERI MATS cohort.
Some guesses as to what's going on:
I'd be very curious to hear your thoughts (especially from people working on interpretability: why did you pick the specific approach you're using?).
A way to fit in quotient explanations would be to make the larger graph itself a quotient of the neural network, i.e. have its nodes perform complex computations. But causal scrubbing doesn't really discuss what makes such a quotient explanation a good one (except for extensional equality).
Interesting analogy. I think of one of your examples the opposite way, though!
When you examine a small subset of a network, that's more like a quotient of the set of inputs - it's some simple filter that can get applied to lots of stimuli. And when you make a broad, high-level model of a network, that's like a subset on inputs - the subset is the domain of validity of your high-level model, because of course such models only work within a domain of validity.