Disclaimer: This post mostly links to resources I've made. I feel somewhat bad about this, sorry! Transformer MI is a pretty young and small field and there just aren't many people making educational resources tailored to it. Some links are to collations of other people's work, and I link to more in the appendix.
Feel free to just skip the intro and read the concrete steps
The point of this post is to give concrete steps for how to get a decent level of baseline knowledge for transformer mechanistic interpretability (MI). I try to give concrete, actionable, goal-oriented advice that I think is enough to get decent outcomes - please treat this as a starting point and deviate if something else feels like a better fit for your background and precise goals!
A core belief I have about learning mechanistic interpretability is that you should spend at least a third of your time writing code and playing around with model internals, not just reading papers. MI has great feedback loops, and a large component of the skillset is the practical, empirical skill of being able to write and run experiments easily. Unlike normal machine learning, once you have the basics of MI down, you should be able to run simple experiments on small models within minutes, not hours or days. Further, because the feedback loops are so tight, I don’t think there’s a sharp boundary between reading and doing research. If you want to deeply engage with a paper, you should be playing with the model studied, and testing the paper’s basic claims.
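To make "playing around with model internals" concrete, here is a minimal sketch of that experimental loop. This is a toy example in plain PyTorch (not TransformerLens, and not any setup from the post): it runs a tiny hypothetical 2-layer MLP standing in for a small model, and uses a forward hook to cache an intermediate activation for inspection - the same pattern used to grab attention patterns or residual stream states from a real transformer.

```python
# Toy sketch (plain PyTorch): run a small model and capture an
# intermediate activation with a forward hook, the basic move behind
# "look at model internals". The model here is a hypothetical stand-in.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny 2-layer MLP standing in for a small model.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

cache = {}

def save_activation(module, inputs, output):
    # Store the post-ReLU hidden activation for later analysis.
    cache["hidden"] = output.detach()

# Hook the ReLU (index 1 in the Sequential) to capture its output.
handle = model[1].register_forward_hook(save_activation)

x = torch.randn(3, 8)  # a batch of 3 toy inputs
logits = model(x)
handle.remove()  # always clean up hooks when done

print(cache["hidden"].shape)  # torch.Size([3, 16])
```

In a real workflow you would swap the MLP for e.g. a 1-2 layer transformer and cache attention patterns instead; libraries like TransformerLens wrap exactly this hook machinery in a friendlier API.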
The intended audience is people new-ish to mechanistic interpretability who know they want to learn about it - if you have no idea what MI is, check out this Mech Interp Explainer, Circuits: Zoom In, or Chris Olah’s overview post.
Here’s an outline of what I mean when I say “a decent level of baseline knowledge”:
A set of goal-oriented steps that I think are important for building the fundamental skills. If you feel comfortable meeting the success criteria of a step, feel free to skip it.
Callum McDougall has made a great set of tutorials for mechanistic interpretability and TransformerLens, with exercises, solutions and beautiful diagrams. This section is in large part an annotated guide to those!
A slightly less concrete collection of different strategies to further explore the field. These are different options, not a list to follow in order. Read all of them, notice if one jumps out at you, and dive into that. If nothing jumps out, start with the “exploring and building on a paper” route. If you spend several weeks in one of these modes and feel stuck, zoom out and try one of the other modes for a few days! Further, it’s worth reading through all of the sections; the tips and resources in one often transfer to the others!
Find a paper you’re excited about and try to really deeply engage with it: this means both running experiments and understanding the ideas, and ultimately trying to replicate (most of) it. If this goes well, it can naturally transition into building on the paper’s ideas, exploring confusions and dangling threads, and doing interesting original research.
Find a problem in MI that you’re excited about, and go and try to solve it! Note: A good explicit goal is learning and having fun, rather than doing important research. Doing important original research is just really hard, especially as a first project without significant mentorship! Note: I expect this to be a less gentle introduction than the other two paths, only do this one first if it feels actively exciting to you! Check out this blog post and this walkthrough for accounts of what the actual mech interp research process can look like.
Go and read around the field. Familiarise yourself with the existing (not that many!) papers in MI, try to understand what’s going on, big categories of open problems, what we do and do not know. Focus on writing code as you read and aim to deeply understand a few important papers rather than skimming many.
See Apart’s Interpretability Playground for a longer list, and a compilation of my various projects here.
Thank you for your efforts in organizing and outlining the learning steps, Neel. I found the inclusion of concrete success criteria to be very helpful. I was wondering if you might be able to provide an estimated time indication for each step as well. I believe this would be useful not only to myself but to others as well. In particular, could you provide rough time estimates for the four steps in the "Getting the Fundamentals" part of the curriculum?
Er, I'm bad at time estimates at the best of times. And this is a particularly hard case, because it's going to depend wildly on someone's prior knowledge and skillset and you can choose how deep to go, even before accounting for general speed and level of perfectionism. Here are some rough guesses:
ML pre-reqs: 10-40h
Transformer implementation: 10-20h
Mech interp tooling: 10-20h
Learning about the MI field: 5-20h
But I am extremely uncertain about these. And I would rather not put these into the main post, since it's easy to be misleading and easy to cause people to feel bad if they think they "should" get something done in <20h and it actually takes 60 to do right. I'd love to hear how long things take you if you try this!
Bad question, but curious why it's called "mechanistic"?
Many forms of interpretability seek to explain how the network's outputs relate to high-level concepts without referencing the actual internal functioning of the network. Saliency maps are a classic example, as are "build an interpretable model" techniques such as LIME.
In contrast, mechanistic interpretability tries to understand the mechanisms that compose the network. To use Chris Olah's words:
Mechanistic interpretability seeks to reverse engineer neural networks, similar to how one might reverse engineer a compiled binary computer program.
Or see this post by Daniel Filan.
Thanks! That's a great explanation, I've integrated some of this wording into my MI explainer (hope that's fine!)