Disclaimer: This post mostly links to resources I've made. I feel somewhat bad about this, sorry! Transformer MI is a pretty young and small field and there just aren't many people making educational resources tailored to it. Some links are to collations of other people's work, and I link to more in the appendix.

Introduction

Feel free to just skip the intro and read the concrete steps.

The point of this post is to give concrete steps for how to get a decent level of baseline knowledge for transformer mechanistic interpretability (MI). This is an experiment in trying to give concrete, actionable, goal-oriented advice that I think is enough to get decent outcomes. Naturally, this can be constraining and the best path will differ a lot depending on your background and precise goals! As an accompaniment, I’m writing a future post fleshing out my philosophy behind doing mechanistic interpretability research, what I think the sub-skills are, and how I think people should learn them.

A core belief I have about learning mechanistic interpretability is that you should spend at least a third of your time writing code and playing around with model internals, not just reading papers. MI has great feedback loops, and a large component of the skillset is the practical, empirical skill of being able to write and run experiments easily. Unlike normal machine learning, once you have the basics of MI down, you should be able to run simple experiments on small models within minutes, not hours or days. Playing around with models builds this empirical skill, but also helps enhance the reading and learning. It builds intuitions for how the model actually works internally, what techniques are easy vs hard, etc, which is key context when understanding the ideas in papers. 

Further, because the feedback loops are so tight, I don’t think there’s a sharp boundary between reading and doing research. If you want to deeply engage with a paper, you should be playing with the model studied, and testing the paper’s basic claims. And if you’re doing this, you can try to answer the questions that come up as you read the paper. And there’s a smooth continuum between this and doing real original research. 

The intended audience is people new-ish to Mechanistic Interpretability who know they want to learn about it - if you have no idea what MI is, check out my MI Explainer or Circuits: Zoom In.

One of my projects for the past few months has been trying to make it much easier to get into the field of reverse engineering language models, and making better open source tooling and educational materials. I compile the most relevant ones in this post, but if you're interested you can see a full list of side projects here.

Defining “Decent Baseline”

Scoping out my goals here, I want you to be able to take a behaviour in a transformer that you want to understand, and have some idea of how to get started and get traction. Breaking this down further:

  • A good grounding in the key concepts of ML and MI
  • An intuition for how a transformer actually works as a mathematical object - what the moving parts are, how it all fits together, and how to reason about the overall system
  • Familiarity with tooling, such that you can easily spin up a model and run quick and dirty experiments.
  • A rough map of the literature, what’s known in the field, and big categories of open problems - not necessarily a deep knowledge, but hopefully enough to get a sense for techniques used, and where you could go and read a relevant paper if you want to.
  • A sense of basic techniques, what compelling evidence about model internals looks like, and how to get started when poking around at a model.

This is a deliberately limited framing! Importantly, this does not mean having a deep knowledge and understanding of everything that’s known in the field, nor the skills to actually produce important novel research - these are much harder to gain, especially the second! I consider these out of scope for this post.

Further, I focus on the specific skill of getting started at reverse engineering a system because it’s much more concrete than other skills, yet also fundamental and important. Not all MI research explicitly looks like “take a system and try to reverse engineer it”, and often significant skill goes into identifying which system and which task, but I think that this is an important underlying skill for most possible MI research, and builds key intuitions. 

Getting the Fundamentals

A set of goal oriented steps that I think are important for getting the fundamental skills. If you feel comfortable meeting the success criteria of a step, feel free to skip it.

  1. Learn general Machine Learning prerequisites: There's a certain baseline level of understanding about ML in general that's important context. It's also important to be familiar with an ML framework like PyTorch to actually write code, and to help ground your knowledge.
    • Resource: Read My Barebones Guide to Mechanistic Interpretability Prerequisites, and learn the pre-reqs you're missing (see step 2 for more on transformers)
    • Success criteria: Write and train an MLP in PyTorch to solve MNIST
      • I have deliberately set this success criterion to emphasise that your goal here is to get enough intuition and context that you can learn more. The goal here is not to get a deep knowledge, or to understand all of the sub-fields - ML contains a ton of niches, and this is not on the critical path to exploring MI!
    • Tips:
      • I recommend PyTorch over other frameworks (JAX, TensorFlow, etc) if you’re doing MI. I highly recommend that you learn to use einops and einsum for tensor manipulation - if you don’t use these you’ll probably hurt yourself!
      • It's very easy to overestimate how much you need to learn general pre-reqs - when in doubt, move on to MI, and go back when you notice something missing! 
  2. Deeply understand transformers: A key skill in MI is to have a gears-level model in your head of the transformer - what is it as a mathematical object, what are all of the moving parts and how do they fit together, and what kinds of algorithms could this implement. 
  3. Familiarise yourself with Mechanistic Interpretability (MI) Tooling: You want to be able to easily write and run experiments as you learn and explore MI, so it's worth some up front investment to learn what's out there. The goal here is familiarity with the basics, not deep expertise - the best way to really learn tooling is by using it in practice to solve problems that you care about.
    • Resource: My TransformerLens library for doing mechanistic interpretability of GPT-style language models. 
      • If you’re new to ML coding, I highly recommend doing your work in a Colab Notebook (with a GPU) unless you’re confident you know what you’re doing. You don’t want to be wasting time setting up infrastructure!
    • Success criteria: Read the main demo for the library. Use TransformerLens to load GPT-2 Small in a Colab notebook, run the model, and visualise the activations.
    • Bonus: Data visualization: A core skill in MI is being good at visualizing data. Neural networks are high dimensional objects, and you need to be able to understand what's going on! My research workflow looks like running an experiment, visualizing the data, staring at the data, being confused, forming more hypotheses, and iterating. Plot data often, and in a diversity of ways.
      • Familiarise yourself with a plotting library. My personal favourite is Plotly, but Matplotlib, Bokeh and HoloViews are other options.
        • Callum McDougall has a great Plotly intro
        • Matplotlib is the most popular and easiest to google/use ChatGPT or Copilot for. But personally I find it really annoying, unintuitive and limiting. 
      • Play around with Alan Cooney's CircuitsVis library, which lets you pass in tensors to an interactive visualization and show it in a Jupyter notebook. See the existing visualizations here
        • You can write your own visualizations in React, but it's higher effort
  4. Begin learning about the MI field: Get your head around basic concepts in MI, and get an overview of the field. The goal of this step is not to get a deep and perfect knowledge of everything!
    • Resource: A Comprehensive Mechanistic Interpretability Explainer & Glossary
    • Success criteria: Read through the whole explainer, and be able to follow the gist of most of it. 
      • Feel free to skip or skim sections or definitions that you don’t follow or find interesting.
      • Be able to look at each section in the table of contents, and have a rough intuition for what it’s about and why you might care about it.
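
To make the einsum tip and the "transformer as a mathematical object" framing slightly more concrete, here is a minimal pure-Python sketch of the attention pattern for a single causal head. The function name `attention_pattern` and the toy vectors are made up for illustration, and this is not how you would write it in practice: with tensors, the score computation is roughly `torch.einsum("qd,kd->qk", queries, keys)`, and the loops below just spell out what that string means.

```python
import math


def attention_pattern(queries, keys):
    """Causal attention pattern for one head.

    queries, keys: nested lists indexed as [position][d_head].
    Returns pattern[q][k]; each row sums to 1 and entries with k > q are 0.
    """
    n_pos = len(queries)
    d_head = len(queries[0])
    pattern = []
    for q in range(n_pos):
        # Einsum string "qd,kd->qk": multiply along the shared d axis and sum it out.
        # Causal mask: position q may only look at positions k <= q.
        scores = [
            sum(queries[q][d] * keys[k][d] for d in range(d_head)) / math.sqrt(d_head)
            for k in range(q + 1)
        ]
        # Numerically stable softmax over the visible positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (n_pos - q - 1)
        pattern.append(row)
    return pattern


# Toy inputs (hypothetical): 3 positions, d_head = 2.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pattern = attention_pattern(queries, keys)
for row in pattern:
    print([round(p, 3) for p in row])
```

Writing the index names out like this (q for query position, k for key position, d for the head dimension) is the main thing einsum buys you: the shapes document themselves, which is a large part of why it is less error-prone than chains of transposes and matmuls.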

Paths for Further Exploration

A slightly less concrete collection of different strategies to further explore the field. These are different options, not a list to follow in order. I recommend the exploring and building on a paper route, but in practice I think a mix is worthwhile - if you spend several weeks in one of these modes, you should zoom out and try one of the other modes for a few days! Further, I recommend reading through all of the sections, as the tips and resources in one often transfer!

Explore and Build On A Paper

Find a paper you’re excited about and try to really deeply engage with it. This means both running experiments and understanding the ideas, and ultimately trying to replicate (most of) it. If this goes well, it can naturally transition into building on the paper’s ideas, exploring confusions and dangling threads, and into interesting original research.

  • Resources:
  • First, find a paper that excites you. I recommend reading my list or watching my walkthroughs, or reading paper introductions. Once you’ve found one you’re excited about, go on a deep dive and read it closely.
  • Code: As you read the paper, load in the relevant model, and replicate and explore the paper's claims as you read. Don’t stress about being perfect, but try to get the gist.
    • If possible, crib from my TransformerLens Main Demo or Exploratory Analysis Demo to minimise time spent writing code.
    • A key thing to track is what techniques they/you are using, and why we would expect that technique to tell us anything useful. Where would the technique break?
    • Which of their claims feel most suspicious to you?
  • Understand: Write up a summary of the key ideas, claims and limitations. Make sure you deeply understand what techniques the paper used, and why those techniques have told you anything real about the underlying model(s)
  • Good questions to ask as you read:
    • What are the techniques and evidence used in the paper? How do they distinguish between true and false beliefs about the model? Can you see any flaws? What forms of evidence feel stronger vs weaker?
    • What parts of their techniques and results seem model/task specific? What do you predict will generalise?
    • Which parts of the paper feel most cherry-picked?
    • What are the limitations and flaws of the paper?
  • Notice what things you’re confused about at the end. Try to figure these out, or ask someone for help (in a pinch, you can always try emailing the authors!)
  • Extend: There’s a fairly smooth continuum between deeply engaging with a paper and doing original research. What follow-on questions interest you? How can you build on the paper? 
    • Check out the tips in the next section for doing good MI research!
    • Keep building on your experiments - can you fully replicate the paper? 
    • Investigate any dangling threads you’re curious or confused about. What questions did the paper leave unanswered?
      1. A good default is checking how much the paper’s claims transfer to other models. 
    • Read the section of Concrete Open Problems relevant to that paper, and if any problem excites you, go and try to work on it!

Work on a Concrete Problem

Find a problem in MI that you’re excited about, and go and try to solve it! Note: I highly recommend setting your explicit goal as learning and having fun, rather than doing important research. Doing important original research is just really hard, especially as a first project without significant mentorship! Also note: I expect this to be a less gentle introduction than the other two paths, so only do this one first if it feels actively exciting to you!

  • Resources: 200 Concrete Open Problems in Mechanistic Interpretability
  • The two most important things here are to find a problem you’re excited about, and to have a sense of the basic techniques you’d use to get a foothold on it. Concretely, I recommend skimming through my Concrete Open Problems sequence and choosing a section that excites you, and then reading the problems in that section and choosing a problem that excites you. 
    • Coding is much easier if you’re starting from and editing existing code. I recommend using my Exploratory Analysis Demo as a base.
    • Orient your learning around solving the problem, but still spend time learning things! Read the papers that seem most relevant, learn how to use the relevant infrastructure and techniques, etc. 
    • If you get consistently stuck, or feel confused and like you’re floundering, go and read around more! 
  • Problem choice:
    • It’s easy to be overwhelmed by the number of problems. If so, go try the deeply engaging with a paper path
    • Ditto, it’s easy to just not be excited by any of the problems. If so, go try the deeply engaging with a paper path! It’s much easier to be excited about a problem when you have more context and can see why it’d be interesting and how you might make progress.
    • It’s also easy to be a perfectionist and stress about choosing the best problem. This is not the right thing to optimise for here! Pick a problem you’re excited about, and you can always pivot if it turns out to be too hard or too boring or too easy.
    • One angle is to pick your five favourite problems, try each for 1-2 days, and pick your favourite at the end. Aim for hackathon vibes, of being willing to do unprincipled and hacky things, trying to make progress, and seeing whether anything interesting happens.
      • This works best on ones that don’t need major infrastructure set up - but those are a bad fit for your first problem anyway!
  • Tips:
    • A common mistake is being too ambitious. I highly recommend choosing a problem from my doc that’s rated A or B! In particular, when you’re new to a field, your intuition for “this should be easy” will be wildly, wildly miscalibrated. Pick a problem that feels like it should be really easy - if it actually is, then move on to a harder one!
      • Key lesson: Research is always harder than you think (even after accounting for this fact)
      • Note that many people need the equal but opposite advice! You can get started on doing research without being really smart, an expert in the field, or having years of experience.
    • I highly recommend working with smaller models! There’s a lot of interesting things to learn from reverse engineering smaller models (anything from a one layer transformer to 2B parameters), but a common mistake is going for the sexiest (ie largest) models. Infrastructure for models too big to fit in one GPU is a massive, massive pain.
      • GPT-J (6B parameters) is a pain but the largest that is reasonable, anything bigger than that is completely not worth the effort. 
      • This advice applies 3x for anything that involves fine-tuning (let alone pre-training!) a model of more than 300M parameters (GPT-2 Medium sized)
    • Aggressively red-team your hypotheses. The most common mistake in junior MI researchers is coming up with an elaborate explanation for a seemingly complex behaviour and missing a simple explanation for the behaviour. Or using some high-power techniques (which take a lot of time and effort to set up!) without noticing plausible ways that those techniques could break or be misleading.
      • Familiarise yourself with the results in tiny models in A Mathematical Framework - skip trigrams, bigrams, copying, induction. Always check whether the behaviour could be done with some combination of these.
      • Look for evidence to falsify your hypothesis about the model behaviour, not just confirm it. Try a lot of minor edits to the model’s prompt, and see which ones do and do not break the behaviour.
    • On the flip-side, be willing to use janky and non-rigorous techniques, just don’t rely on them. These can be key inputs into hypothesis formation and exploration, and if a lot of non-rigorous techniques point in the same direction (and aren’t all flawed in the same way) that can be pretty strong evidence! But it’s important to also red-team this and to not blindly trust them.
    • Jump in and get your hands dirty! A common mistake is to spend days to weeks setting up epic infrastructure, or writing a perfect, sophisticated project proposal. I highly recommend jumping in with some quick experiments, trying to get some feedback from reality, and iterating, before you heavily invest. 
      • Research is all about tight feedback loops! The vibe should be closer to a weekend hackathon, rather than a 2 month internship. 
      • Building yourself good infrastructure is important, but it’s often done best if you’ve done things the hacky way first, and can identify what features do/do not matter
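
The red-teaming advice above (try many minor edits to the prompt, and check whether a simple bigram-style explanation would do) can be sketched as a tiny harness. Everything here is hypothetical and for illustration only: `red_team_prompt`, `toy_predict` and the prompts are made-up names, and `toy_predict` is a stand-in that only looks at the final word. With a real model you would swap in a function that returns its top-1 next-token prediction.

```python
from typing import Callable, Dict


def red_team_prompt(
    predict: Callable[[str], str],
    variants: Dict[str, str],
    expected: str,
) -> Dict[str, bool]:
    """For each named prompt variant, record whether the behaviour survives."""
    return {name: predict(text) == expected for name, text in variants.items()}


# Hypothetical stand-in for a model: a pure bigram lookup on the last word.
BIGRAMS = {"Eiffel": "Tower", "New": "York"}


def toy_predict(prompt: str) -> str:
    last_word = prompt.split()[-1]
    return BIGRAMS.get(last_word, "<unk>")


variants = {
    "original": "The famous landmark in Paris is the Eiffel",
    "shuffled context": "Paris the is landmark in famous The Eiffel",
    "no context": "Eiffel",
    "changed final word": "The famous landmark in Paris is the big",
}

results = red_team_prompt(toy_predict, variants, expected="Tower")
for name, survived in results.items():
    print(f"{name}: {'survives' if survived else 'breaks'}")
```

In this toy case the behaviour survives shuffling or deleting the context and only breaks when the final word changes, which is the signature of a bigram explanation rather than anything that uses the rest of the prompt. That is exactly the kind of simple story worth ruling out before building an elaborate one.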

Read Around the Field (i.e. Do a Lit Review)

Go and read around the field. Familiarise yourself with the existing (not that many!) papers in MI, try to understand what’s going on, big categories of open problems, what we do and do not know. My core recommendations are to write code as you read and aim to deeply understand a few important papers rather than skimming many. 

  • Resources: An Extremely Opinionated Annotated List of my Favourite Interpretability Papers
    • My MI Explainer is a good reference for terms
    • Video walkthroughs of A Mathematical Framework, Induction Heads, and Interpretability in the Wild. I recommend watching them before digging too deep into the papers.
    • Read the intro to each section in my Concrete Open Problems sequence, to get a sense of the types of open problem and motivation/background.
  • Recommendation: It’s easy to read passively, or without fully understanding. I highly recommend aiming for deep engagement, even at the cost of being slower. 
    • Good questions to ask as you read:
      • What are the techniques and evidence used in the paper? How do they distinguish between true and false beliefs about the model? Can you see any flaws? What forms of evidence feel stronger vs weaker?
      • What parts of their techniques and results seem model/task specific? What do you predict will generalise?
      • Which parts of the paper feel most cherry-picked?
      • What are the limitations and flaws of the paper?
    • I recommend writing code and playing around with the models in question as you read! See the Explore and Build On A Paper section for tips.
  • Output: Try to produce some concrete output for the papers you really like - this forces you to engage much more deeply
    • A blog post on the paper
    • Explain the paper (in detail!) to a friend
    • Give a talk on it to a reading group/some friends.
    • A Colab notebook replicating some key claims

Appendix

Advice on compute

  • Rules of thumb: using and analysing a model up to GPT-2 Medium size is easy enough to do in a free Colab notebook. If you are getting CUDA Out of Memory errors, you want to upgrade to a better notebook.
    • I recommend Colab Pro or Paperspace Gradient (a version of Colab that gives you a virtual machine behind the notebook, I recommend the $8/month pro plan)
  • If you need to do anything that involves running code for >1 hour, or things are super slow, you probably want a dedicated virtual machine. Anything involving training a language model, or running inference for the model on >100M tokens of text will be a pain and require serious compute
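
As a rough sanity check on these rules of thumb, here is a back-of-the-envelope sketch of how much memory the weights alone take. The parameter counts are the round figures used in this post (about 300M for a GPT-2 Medium sized model, 6B for GPT-J), not exact model sizes. The helper names and the 4x training multiplier (weights, gradients and two Adam moment buffers, ignoring activations) are assumptions for illustration, not measured numbers.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 4) -> float:
    """Gigabytes (1 GB = 1e9 bytes) needed just to store the weights."""
    return n_params * bytes_per_param / 1e9


def rough_training_memory_gb(n_params: float, bytes_per_param: int = 4) -> float:
    """Very rough lower bound for training with Adam: weights + gradients
    + two optimizer moment buffers. Activations are ignored (assumption)."""
    return 4 * weight_memory_gb(n_params, bytes_per_param)


# Round parameter counts taken from the post, not exact model sizes.
MODELS = {"GPT-2 Medium sized": 300e6, "GPT-J": 6e9}

for name, n_params in MODELS.items():
    fp32 = weight_memory_gb(n_params, bytes_per_param=4)
    fp16 = weight_memory_gb(n_params, bytes_per_param=2)
    train = rough_training_memory_gb(n_params)
    print(f"{name}: {fp32:.1f} GB fp32, {fp16:.1f} GB fp16, >= {train:.1f} GB to train")
```

Under these assumptions a GPT-2 Medium sized model needs about 1.2 GB for its weights in 32-bit precision, while GPT-J needs about 24 GB, before counting any activations. Actual usage will be higher, so treat this as a lower bound when deciding whether something fits in your notebook's GPU.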

Other Resources

See Apart’s Interpretability Playground for a longer list, and a compilation of my various projects here

5 comments

Thank you for your efforts in organizing and outlining the learning steps, Neel. I found the inclusion of concrete success criteria to be very helpful. I was wondering if you might be able to provide an estimated time indication for each step as well. I believe this would be useful not only to myself but to others as well. In particular, could you provide rough time estimates for the four steps in the "Getting the Fundamentals" part of the curriculum?

Er, I'm bad at time estimates at the best of times. And this is a particularly hard case, because it's going to depend wildly on someone's prior knowledge and skillset and you can choose how deep to go, even before accounting for general speed and level of perfectionism. Here are some rough guesses:

  • ML pre-reqs: 10-40h
  • Transformer implementation: 10-20h
  • Mech Interp Tooling: 10-20h
  • Learning about MI Field: 5-20h

But I am extremely uncertain about these. And I would rather not put these into the main post, since it's easy to be misleading and easy to cause people to feel bad if they think they "should" get something done in <20h and it actually takes 60 to do right. I'd love to hear how long things take you if you try this!

Bad question, but curious why it's called "mechanistic"?

Many forms of interpretability seek to explain how the network's outputs relate to high-level concepts without referencing the actual functioning of the network. Saliency maps are a classic example, as are "build an interpretable model" techniques such as LIME.

In contrast, mechanistic interpretability tries to understand the mechanisms that compose the network. To use Chris Olah's words:

Mechanistic interpretability seeks to reverse engineer neural networks, similar to how one might reverse engineer a compiled binary computer program.

Or see this post by Daniel Filan.

Thanks! That's a great explanation, I've integrated some of this wording into my MI explainer (hope that's fine!)