Subsets and quotients in interpretability

Charlie Steiner


Interesting analogy. I think of one of your examples the opposite way, though!

When you examine a small subset of a network, that's more like a quotient of the set of inputs - it's some simple filter that can get applied to lots of stimuli. And when you make a broad, high-level model of a network, that's like a subset on inputs - the subset is the domain of validity of your high-level model, because of *course* such models only work within a domain of validity.

## Summary

Interpretability techniques often need to throw away some information about a neural network's computations: the entirety of the computational graph might just be too big to understand, which is part of why we need interpretability in the first place. In this post, I want to talk about two different ways of simplifying a network's computational graph:

1. Fully explaining what some small part of the network does.
2. Abstracting the entire network and describing how it works at a high level.

These correspond to the idea of subsets and quotients in math, as well as many other instances of this duality in other areas. I think lots of interpretability at the moment is 1., and I'd be excited to see more of 2. as well, especially because I think there are synergies between the two.

The entire post is rather hand-wavy; I'm hoping to point at an intuition rather than formalize anything (that's planned for future posts). Note that a distinction like the one I'm making isn't new (e.g. it's intuitively clear that circuits-style research is quite different from neural network clusterability). But I haven't seen it described this explicitly before, and I think it's a useful framing to keep in mind, especially when thinking about how different interpretability techniques might combine to yield an overall understanding.

ETA: An important clarification is that for both 1. and 2., I'm only discussing interpretability techniques that try to understand the *internal structure* of a network. In particular, 2. talks about approximate descriptions of the algorithm the network is *actually* using, not just approximate descriptions of the function that's being implemented. This excludes large parts of interpretability outside AI existential safety (e.g. any method that treats the network as a black box and just fits a simpler function to the network).

## Subsets vs quotients

In math, if you have a set X, there are two "dual" ways to turn this into a smaller set:

1. A *subset* Y ⊂ X.
2. A *quotient* X/∼ of X by some equivalence relation ∼. (The quotient X/∼ is the set of all equivalence classes under ∼.)

I think this is also a good framing to distinguish interpretability techniques, but before elaborating on that, I want to build intuitions for subsets and quotients in contexts other than interpretability.
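As a concrete warm-up, here's a minimal sketch of the two constructions in Python (the specific set and equivalence relation are my own toy choices, not from the post):

```python
# Two dual ways to shrink a set X: a subset and a quotient.

X = set(range(12))

# 1. A subset Y of X: keep only some elements, but keep them exactly.
Y = {x for x in X if x % 2 == 0}

# 2. A quotient X/~ by an equivalence relation, here x ~ y iff x % 3 == y % 3.
#    The quotient is the set of all equivalence classes.
def eq_class(x, universe, related):
    return frozenset(y for y in universe if related(x, y))

related = lambda x, y: x % 3 == y % 3
quotient = {eq_class(x, X, related) for x in X}

print(sorted(Y))      # the subset keeps exact elements: [0, 2, 4, 6, 8, 10]
print(len(quotient))  # the quotient collapses X down to 3 classes
```

The subset remembers a few elements perfectly; the quotient remembers something coarse about every element.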

Maps f : Y → X *into* X induce subsets of X (namely their image f(Y)). For this subset, it doesn't matter how many elements in Y were mapped to a given element in X, so we can assume f is injective without loss of generality. Thus, subsets are related to *injective maps*. Dually, maps g : X → Z *out of* X induce quotients of X: we can define two elements in X to be equivalent if they map to the same element in Z. The quotient X/∼ is then the set of all preimages g^{-1}({z}) for z ∈ g(X). Again, we can assume that g is surjective if we only care about the quotient itself, so quotients correspond to *surjective maps*.

As an analogy, compare knowing only the first chapter of a book to knowing only a summary of the whole book. Both throw away information, but they differ in *which* information they throw away and which books become indistinguishable. Knowing only the first chapter leaves the rest of the book entirely unspecified, but that one chapter is nailed down exactly. Knowing a summary restricts choices for the entire book somewhat, but leaves local freedom about word choice etc. everywhere.

## In the context of interpretability
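The correspondence between maps and subsets/quotients can also be sketched in code (again with toy maps of my own choosing):

```python
# Maps into X give subsets of X; maps out of X give quotients of X.

X = set(range(10))

# A map f: Y -> X "into" X. Its image f(Y) is a subset of X.
Y = {"a", "b", "c"}
f = {"a": 1, "b": 4, "c": 9}
image = {f[y] for y in Y}  # {1, 4, 9}, a subset of X

# A map g: X -> Z "out of" X. Defining x ~ x' iff g(x) == g(x')
# gives a quotient: the set of preimages g^{-1}({z}).
g = lambda x: x // 3
preimages = {}
for x in X:
    preimages.setdefault(g(x), set()).add(x)
quotient = set(map(frozenset, preimages.values()))
# quotient = {{0,1,2}, {3,4,5}, {6,7,8}, {9}}
```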

The subset/quotient framework can be applied to mechanistic interpretability as follows: fully explaining part of a network is analogous to subsets, abstracting the entire network and describing how it works at a high level is analogous to quotients. Both of these are ways of simplifying a network that would otherwise be unwieldy to work with, but again, they simplify in quite different ways.

These subsets/quotients of the *mechanism/computation* of the network seem to somewhat correspond to subsets/quotients of the *behavior* of the network:

1. If we fully understand part of the network, we can predict the network's outputs exactly, but typically only on some subset of inputs.
2. A high-level description applies to all inputs, but we can't predict the outputs *exactly* on any of them; we can only predict certain properties the outputs are going to have. So this leads to a quotient on outputs.

To make this a bit more formal: if we have a network that implements some function f : X → Y, then simplifying that network using interpretability tools might give us two different types of simpler functions:

1. A restriction f|_S : S → Y of f to some subset S ⊂ X of inputs.
2. A coarsening q ∘ f : X → Y/∼ that only predicts the equivalence class of the output.
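Concretely, the two simpler functions might look like this (the stand-in function f, the subset S, and the coarsening q are all my own illustrative choices):

```python
# Two ways to simplify a function f: X -> Y implemented by a network.

def f(x):  # stand-in for the network's input-output behavior
    return x * x - 4

X = list(range(-5, 6))

# 1. "Subset-style": an exact restriction f|_S to a subset S of inputs.
S = [x for x in X if x >= 0]
restricted = {x: f(x) for x in S}  # exact outputs, but only on S

# 2. "Quotient-style": a coarsened map q o f that only predicts an
#    equivalence class of the output, here just its sign.
def q(y):
    return "negative" if y < 0 else "non-negative"

coarse = {x: q(f(x)) for x in X}  # defined on all of X, but imprecise
```

The restriction is exact where it applies and silent elsewhere; the coarsening applies everywhere but only pins down a property of the output.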

Interpreting part of the network seems related to the first of these, while abstracting the network to a high-level description seems related to the second one. For now, this is mostly a vague intuition, rather than a formal claim (and there are probably exceptions, for example looking at some random subset of neurons might just give us no predictive power at all).

## Existing interpretability work

I'll go through some examples of interpretability research and describe how I think they fit into the subset/quotient framework:

For example, probing AlphaZero for human concepts gives us something quotient-like, but not a *complete* picture of how AlphaZero thinks, not even at such a high level of abstraction (e.g. it's unclear how these concepts are actually being used; we can only make reasonable guesses as to what a full explanation at this level would look like).^{[1]}

## Combining subset and quotient approaches

I've already mentioned two examples of how both types of techniques can work in tandem:

Some more hypothetical examples:

1. If I already have a high-level (quotient-style) picture of what a network is doing, I have a much better idea of *how* the implementation works than if I'm going in blind.
2. Decomposing a network into submodules lets us *parallelize* understanding the network. Without the initial step of finding submodules, it might be very difficult to split up the work of understanding the network between lots of researchers.

I'm pretty convinced that combining these approaches is more fruitful than either one on its own, and my guess is that this isn't a particularly controversial take. At the same time, my sense is that most interpretability research at the moment is closer to the "subset" camp, except for frameworks like transformer circuits that are about very *fine-grained* quotients (and thus mainly tools to enable better subset-based research). The only work I'm aware of that I would consider clear examples of quotient research at a high level of abstraction is Daniel Filan's Clusterability in neural networks line of research and some work on modularity by John Wentworth's SERI MATS cohort.

Some guesses as to what's going on:

I'd be very curious to hear your thoughts (especially from people working on interpretability: why did you pick the specific approach you're using?)

^{[1]} A way to fit in quotient explanations would be to make the larger graph itself a quotient of the neural network, i.e. have its nodes perform complex computations. But causal scrubbing doesn't really discuss what makes such a quotient explanation a good one (except for extensional equality).