This is an extremely opinionated list of my favourite mechanistic
interpretability papers, annotated with my key takeaways and what I like
about each paper, which bits to deeply engage with vs skim (and what to
focus on when skimming) vs which bits I don’t care about and recommend
skipping, along with fun digressions and various hot takes.
This is aimed at people trying to get into the field of mechanistic
interpretability (especially Large Language Model (LLM)
interpretability). I’m writing it because I’ve benefited a lot from
hearing the unfiltered and honest opinions from other researchers,
especially when first learning about something, and I think it’s
valuable to make this kind of thing public! On the flipside though, this
post is explicitly about my personal opinions - I think some of these
takes are controversial and other people in the field would disagree.
The four top level sections are priority ordered, but papers within each
section are ordered arbitrarily - follow your curiosity.
Sets out the circuits research agenda, and is a whirlwind
overview of progress in image circuits
This is reasonably short and conceptual (rather than technical)
and in my opinion very important, so I recommend deeply
engaging with all of it, rather than skimming.
The core thing to take away from it is the perspective of
networks having legible(-ish) internal representations of
features, and that these may be connected up into
interpretable circuits. The key is that this is a mindset for
thinking about networks in general, and all the discussion
of image circuits is just grounding in concrete examples.
In my opinion, the circuits agenda is pretty deeply at the core
of what mechanistic interpretability is. It’s built on the
assumption that there is some legible, interpretable structure
inside neural networks, if we can just figure out how to
reverse engineer it. And the core goal of the field is to find
what circuits we can, build better tools for doing so, and do
the fundamental science of figuring out which of the claims
about circuits are actually true, which ones break, and
whether we can fix them.
Meta: The goal of reading this is to understand what the
fundamental mindset and worldview being defended here is. The
goal is not necessarily to leave feeling convinced that
these claims are true, or that the article adequately
justifies them. That’s what the rest of the papers in here are for.
A useful thing to reflect on is what the world would look like
if the claims were and were not true - what evidence could you
see that might convince you either way? These are definitely
not obviously true claims!
A Mathematical Framework for Transformer Circuits
The point of this is to explain how to conceptually break down a
transformer into individually understandable pieces.
Deeply engage with:
All the ideas in the overview section, especially:
Understanding the residual stream and why
The notion of interpreting paths between interpretable bits (eg input tokens and output logits), where the path is a composition of matrices, and how this is different from interpreting every intermediate activation
And understanding attention heads: what a QK and OV
matrix is, how attention heads are independent and
additive and how attention and OV are
Skip Trigrams & Skip Trigram bugs, esp understanding why
these are a really easy thing to do with attention, and
how the bugs are inherent to attention heads separating
where to attend to (QK) and what to do once you attend
Induction heads, esp why this is K-Composition (and how
that’s different from Q & V composition), how the circuit
works mechanistically, and why this is too hard to do in a
Skim or skip:
Maybe check out my (long-ass) walkthrough of the
comments on how I think about things
If you prefer video over reading I expect it to be high
Either way it’s probably useful to check the relevant
section if there’s a part of the paper that confuses you.
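To make the QK / OV split concrete, here is a minimal sketch in plain Python. It is not code from the paper: the matrices are tiny and made up, `W_QK` and `W_OV` are the low-rank products folded into single matrices (row-vector convention), and the causal mask, scaling and layer norm are left out for brevity. It shows that the attention pattern depends only on `W_QK`, that what gets moved depends only on `W_OV`, and that heads simply add their outputs into the residual stream.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pattern(x, W_QK):
    """Where to attend: scores[i][j] = x_i W_QK x_j^T, softmaxed over j."""
    scores = matmul(matmul(x, W_QK), transpose(x))
    return [softmax(row) for row in scores]

def head_output(x, W_QK, W_OV):
    """What to move: mix positions with the pattern, then apply W_OV."""
    pattern = attention_pattern(x, W_QK)
    return matmul(matmul(pattern, x), W_OV)

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

# Toy residual stream: 3 positions, d_model = 2 (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Two hypothetical heads, each fully described by its QK and OV matrices.
head_a = dict(W_QK=[[2.0, 0.0], [0.0, 0.0]], W_OV=[[0.0, 1.0], [0.0, 0.0]])
head_b = dict(W_QK=[[0.0, 0.0], [0.0, 2.0]], W_OV=[[0.5, 0.0], [0.0, 0.5]])

out_a = head_output(x, **head_a)
out_b = head_output(x, **head_b)

# Heads are independent and additive: each one just adds to the residual stream.
residual = add(add(x, out_a), out_b)
print(residual)
```

Swapping `W_OV` for a different matrix changes what a head writes but leaves `attention_pattern` untouched, which is the separation that makes skip trigram bugs possible.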
This is a study of how induction heads are ubiquitous in real
transformers, and form as a sudden phase change during
Key concepts + argument 1.
Argument 4: induction heads also do translation + few shot
Getting a rough intuition for all the methods used in the
Model Analysis Table, as a good overview of interesting
All the rigour - basically everything I didn’t mention. The
paper goes way overboard on rigour and it’s not worth
understanding every last detail
A particularly striking result is that induction heads form at
~the same time in all models - I think this is very cool, but
somewhat overblown - from some preliminary experiments, I
think it’s pretty sensitive to learning rate and positional
encoding (though the fact that it doesn’t depend on scale is
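For reference, the behaviour an induction head implements can be written as a few lines of ordinary code. This is only a sketch of the input-output behaviour (find an earlier copy of the current token and predict whatever followed it), not of the attention circuit that implements it. The function name is hypothetical, and using only the most recent earlier copy is a simplifying assumption.

```python
def induction_prediction(tokens):
    """Return the token that followed the most recent earlier occurrence
    of the final token, or None if the final token has not appeared before."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan earlier positions from most recent to oldest.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

print(induction_prediction(["A", "B", "C", "A"]))  # prints B
```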
Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases
Short-ish conceptual essay on what the point of mechanistic
interpretability is and how to think about it.
This is similar in flavour to Circuits: Zoom In, but is more
conceptual and less grounded in very concrete examples +
progress - your mileage may vary in how much this works for
A Toy Model of Superposition
Building a simple toy model that contains superposition, and
analysing it in detail.
The core intuitions: what is superposition, how does it
respond to feature importance and sparsity, and how does
it respond to correlated and uncorrelated features.
Read the strategic picture, and sections 1 and 2 closely.
A good intro paper for concrete projects. The models are tiny,
the core results should be easy to replicate (and have short
training times), there’s an accompanying
and a list of follow-up
so this is a great paper to play around with!
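If you want a feel for the setup before opening the paper, here is a minimal sketch in plain Python. It assumes the reconstruction takes the form ReLU(WᵀWx + b), which is my understanding of the paper's toy model. The weights are hand-picked rather than trained (three feature directions squeezed into two hidden dimensions, 120 degrees apart), the bias is zero, and feature importance is ignored. All names here are illustrative.

```python
import math

N_FEATURES = 3
N_HIDDEN = 2

# One direction per feature in the 2D hidden space, evenly spaced (hand-picked).
directions = [
    (math.cos(2 * math.pi * k / N_FEATURES), math.sin(2 * math.pi * k / N_FEATURES))
    for k in range(N_FEATURES)
]

def compress(x):
    """h = W x: project 3 features down into 2 hidden dimensions."""
    return [sum(x[k] * directions[k][d] for k in range(N_FEATURES)) for d in range(N_HIDDEN)]

def reconstruct(h, bias=0.0):
    """x' = ReLU(W^T h + b): read each feature back off its own direction."""
    return [
        max(0.0, sum(h[d] * directions[k][d] for d in range(N_HIDDEN)) + bias)
        for k in range(N_FEATURES)
    ]

sparse_input = [1.0, 0.0, 0.0]  # only one feature active
dense_input = [1.0, 1.0, 0.0]   # two features active at once

print([round(v, 3) for v in reconstruct(compress(sparse_input))])
print([round(v, 3) for v in reconstruct(compress(dense_input))])
```

With one feature active, the ReLU clips the negative interference and the input comes back essentially intact. With two features active at once, they interfere and each comes back at roughly half strength. That is the sense in which superposition only pays off when features are sparse.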
An extremely detailed and rigorous study of a family of neurons
in Inception; a gold standard of what good interpretability
can look like. Culminates in them hand-coding the weights of
artificial neurons and substituting those into the circuit,
and comparing performance. Note that a bunch of the techniques
Understanding what they did as a gold standard, and thinking
about why what they did is deep and meaningful evidence.
Think about which techniques will and will not generalise to
A paper about reverse engineering a complex (28 head!) circuit
in GPT-2 Small
The most detailed “we actually have a circuit, and can drill
into it in detail and really get how it works” paper that
I know of.
Particularly good for a vibe of “ways interpretability is hard
and you can trick yourself” + “but it is actually possible and
we can fix these”
A paper on a neuron activation function that makes transformer
neurons somewhat more interpretable.
Section 3 (Background). For the core ideas, esp
superposition, privileged bases and why they matter.
Section 6 (on the neurons found). For getting the vibe of
what kind of features LLMs learn - I think this is the
best resource I know of for getting a vibe of what kinds
of things MLP layers are doing at different layers of a
A paper on locating and editing factual knowledge in GPT-2 - a
strong contender for my favourite non Chris Olah
A solid early bit of work on LLM interpretability. The key
insight is that we can interpret the residual stream of the
transformer by multiplying by the unembedding and mapping to
logits, and that we can do this to the residual stream
before the final layer and see the model converging on the
Deeply Engage with:
Skim the figures about progress towards the answer through
the model, focus on just getting a vibe for what this
progress looks like.
Skip everything else.
The deeper insight of this technique (not really covered in the
work) is that we can do this on any vector in the residual
stream to interpret it in terms of the direct effect on the
logits - including the output of an attn or MLP layer and even
a head or neuron. And we can also do this on weights writing
to the residual stream.
Analyzing Transformers in Embedding
Space is a more
recent paper that drills down into this insight, focusing
I’m somewhat meh on the paper as a whole, but sections
3, 4.1 and Appendix C are cool for seeing what head
and neuron circuits can look like
Note that they make the (IMO) mistake of treating
embedding and unembedding space as the same space -
the input and output are different spaces! Even if
most people make the mistake of setting the embed and
unembed maps to be the same matrix :(
Note that this tends only to work for things close to the
final layer, and will totally miss any indirect effect on
the outputs (eg via composing with future layers, or
suppressing incorrect answers)
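A minimal sketch of this idea in plain Python, with a made-up three-token vocabulary and a hypothetical unembedding matrix `W_U` (layer norm is ignored, which a real model would not let you do for free). Because the map to logits is linear, the direct contribution of each component can be read off separately, and the contributions sum to the logits of the full residual stream.

```python
VOCAB = ["cat", "dog", "the"]

# Hypothetical unembedding: one row per residual dimension, one column per token.
W_U = [
    [2.0, 0.0, -1.0],
    [0.0, 2.0, -1.0],
]

def direct_logits(vec):
    """Map any residual-stream vector to its direct effect on each logit."""
    return [sum(v * W_U[d][t] for d, v in enumerate(vec)) for t in range(len(VOCAB))]

# Made-up outputs that different components wrote into the residual stream.
components = {
    "embed": [0.1, 0.1],
    "attn_0": [0.5, -0.2],
    "mlp_0": [1.0, 0.1],
}

# The residual stream is the sum of everything written to it.
residual = [sum(vec[d] for vec in components.values()) for d in range(2)]

for name, vec in components.items():
    print(name, dict(zip(VOCAB, direct_logits(vec))))

final_logits = direct_logits(residual)
top_token = VOCAB[max(range(len(VOCAB)), key=lambda t: final_logits[t])]
print("full residual:", dict(zip(VOCAB, final_logits)), "top token:", top_token)
```

This only captures the direct path to the logits, so a component that matters purely through later layers would look like it does nothing here.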
An Interpretability Illusion for BERT
Good early paper on the limitations of max activating dataset
examples - they took a seemingly interpretable neuron in BERT
and took the max activating dataset examples on different
datasets, and observed consistent patterns within a dataset,
but very different examples between datasets
Within the lens of the Toy Model paper, this makes sense!
Features correspond to directions in the residual stream
that probably aren’t neuron aligned. Max activating
dataset examples will pick up on the features most
aligned with that neuron. Different datasets have
different feature distributions and will give different
“most aligned feature”
The concrete result that the same neuron can have very
different max activating dataset examples
The meta-level result that a naively compelling
interpretability technique can be super misleading on
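The effect is easy to reproduce in a toy setting. The sketch below is plain Python with entirely made-up data: a hypothetical neuron reads a mix of two unrelated features (`feature_a` and `feature_b` are placeholder names), and two datasets differ in which feature is common. It is an illustration of the argument above, not a reproduction of the paper's experiment.

```python
def neuron_activation(example):
    """Hypothetical neuron that responds to a mix of two unrelated features."""
    return max(0.0, 1.0 * example["feature_a"] + 0.9 * example["feature_b"] - 0.2)

def top_examples(dataset, k=2):
    """The k examples on which the neuron fires most strongly."""
    return sorted(dataset, key=neuron_activation, reverse=True)[:k]

# Two made-up datasets with different feature distributions.
dataset_1 = [
    {"text": "example 1a", "feature_a": 1.0, "feature_b": 0.0},
    {"text": "example 1b", "feature_a": 0.8, "feature_b": 0.0},
    {"text": "example 1c", "feature_a": 0.0, "feature_b": 0.1},
]
dataset_2 = [
    {"text": "example 2a", "feature_a": 0.0, "feature_b": 1.0},
    {"text": "example 2b", "feature_a": 0.1, "feature_b": 0.9},
    {"text": "example 2c", "feature_a": 0.2, "feature_b": 0.0},
]

for name, data in [("dataset_1", dataset_1), ("dataset_2", dataset_2)]:
    top = top_examples(data)
    dominant = ["a" if ex["feature_a"] > ex["feature_b"] else "b" for ex in top]
    print(name, [ex["text"] for ex in top], "dominant features:", dominant)
```

Looking at either dataset alone, the neuron appears to track a single clean feature. The two datasets just disagree about which one.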
A Mechanistic Interpretability Analysis of Grokking
Conflict of interest note - I was the main person working on
A very detailed reverse engineering of a tiny model trained to
do modular addition and interpreting it during training, plus
a bunch of discussion on phase changes, an (attempted)
explanation of grokking, and showing grokking on other tasks.
Grokking probably isn’t that relevant to real models and the
techniques don’t really generalise, but a good example of
detailed reverse engineering + fully understanding a model
on an algorithmic task, and of applying interpretability
I also just personally think this project was super fucking
cool, even if not that useful.
The key claims and takeaways sections
Overview of the modular addition
Reverse engineering modular
understanding the different types of evidence and how they
Evolution of modular addition circuits during
the flavour of what the circuits developing looks like
during training, and the fact that once we understand
things, we can just literally watch them develop!
The Phase Changes
probably the most interesting bits are the explanation of
grokking, and the two speculative hypotheses.
Maybe a good intro paper to replicate! It has an
accompanying colab and a
list of future directions at the end
Multimodal Neurons in Artificial Neural Networks
An analysis of neurons in a text + image model (CLIP), finding a
bunch of abstract + cool neurons. Not a high priority to
deeply engage with, but very cool and worth skimming.
My key takeaways
There are so many fascinating
The intuition that multi-modal models (or at least, models
that use language) are incentivised to represent things in
a conceptual way, rather than specifically tied to the
The detailed analysis of the Donald Trump
esp that it is more than just an “activates on Donald
Trump” neuron, and instead activates for many different
clusters of things, roughly tracking their association
with Donald Trump.
The “adversarial attacks by writing iPod on an
part isn’t very deep, but is hilarious
The rest of the circuits
A lot of really cool ideas and scattered threads! Worth skimming
and digging into anything that catches your interest. Each
individual article is short-ish
This thread represents, in my opinion, the first serious attempt
at reverse engineering a real model (Inception)
My personal favourites:
An Overview of Early Vision
it’s just fascinating to see the weird shit that happens,
super cool to see the hierarchy where simple shapes are in
early layers and are built into more abstract shapes in
later layers, and to see neurons being sorted into
somewhat image specific, but a fascinating exploration of
the data visualisation questions underlying mechanistic
interpretability - visualisations are super useful, but
how can we do them in a properly principled way, and how
can they mislead?
networks spontaneously learn to be modular and the
modules seem to be consistent and semantically
Not a paper: The codebase of
a transformer mechanistic interpretability library I’m writing - I think
it’s worth reading for a fairly clean and conceptual-focused
implementation of a transformer, specifically reading
(a file for the various layers) (the actual codebase is pretty
Everything else Chris Olah has
I’m somewhat biased on this, but I think Chris is just clearly
far and away the best interpretability researcher in the
He’s also a massive nerd for good technical communication,
interactivity and good graphic design, and I find his work a
joy to read.
Interesting application of image circuits techniques to get some
insight into an RL model - unclear how much it
The parts about the impact of the amount of and diversity of
data on interpretability feel most interesting and general to
Probably the best RL mechanistic interpretability paper I know
of (but it’s a pretty low bar :( )
Not a paper: Playing around with OpenAI
Microscope - visualizations
and top dataset examples of every neuron in a ton of image models!
Challenge: What’s the weirdest neuron you can find?
Visualizing and Interpreting the Geometry of
BERT (+ blog
An early LLM interpretability paper about understanding how BERT
represents language in the residual stream.
Acquisition of Chess Knowledge in
AlphaZero - analysing
AlphaZero’s chess knowledge, including during training
Notable for the hilarious stunt of getting a chess grandmaster
commenting, and for co-authoring (even if this isn’t that
Focuses on feature analysis rather than really mechanistic
engagement, but still very cool! The main things I think are
cool were successfully applying interpretability during
training, and on the weird and fucky task of playing chess
(and that models trained on non-image/language tasks are
Toward Transparent AI: A Survey on Interpreting the Inner
Structures of Deep Neural
Networks - a decent survey
paper on what’s up in the rest of interpretability.
I’m personally pretty meh about the majority of the academic
field of interpretability (I rarely find insights from there
useful in my work) and would prioritise reading the papers in
the previous sections, but it’s worth skimming to get a sense
for what’s out there, and digging into anything relevant to a
specific project you’re pursuing!
A Primer in
BERTOLOGY - a
survey paper specifically on BERTology, a subfield about
specifically interpreting BERT. I feel pretty meh about this,
but am not very familiar with the field.
The Building Blocks of Interpretability
Not a paper, but I find Chris Olah’s interview on the 80,000
Great resource, thanks for sharing! As somebody who's not too deeply familiar with either mechanistic interpretability or the academic field of interpretability, I find myself confused by the fact that AI safety folks usually dismiss the large academic field of interpretability. Most academic work on ML isn't useful for safety because safety studies different problems with different kinds of systems. But unlike focusing on worst-case robustness or inner misalignment, I would expect generating human understandable explanations of what neural networks are doing to be interesting to plenty of academics, and I would think that's what the academics are trying to do. Are they just bad at generating insights? Do they look for the wrong kinds of progress, perhaps motivated by different goals? Why is the large academic field of interpretability not particularly useful for x-risk motivated AI safety?
Thanks for writing this - I've found it useful in my current attempts to survey some key mechanistic interpretability literature.
a decent survey paper on what’s up in the rest of interpretability.
I’m personally pretty meh about the majority of the academic field of interpretability
A bit confused by this. This paper's abstract and intro claim to be focusing on inner interpretability methods - which they define as learned features and internal structure. This seems to fit my idea of what mechanistic interpretability is pretty well, but you seem to classify it as 'the rest of interpretability'.
Do you see a clear distinction between mechanistic interpretability methods vs the methods reviewed in this paper? If so, what's the distinction?
This is a fair point! I honestly have only vaguely skimmed that survey, and got the impression there was a lot of stuff in there that I wasn't that interested in. But it's on my list to read properly at some point, and I can imagine updating this a bunch.