Concrete Steps to Get Started in Transformer Mechanistic Interpretability

Kay Kozaronek (3y)

Thank you for your efforts in organizing and outlining the learning steps, Neel. I found the inclusion of concrete success criteria to be very helpful. I was wondering if you might be able to provide an estimated time indication for each step as well. I believe this would be useful not only to myself but to others as well. In particular, could you provide rough time estimates for the four steps in the "Getting the Fundamentals" part of the curriculum?

Neel Nanda (3y)

Er, I'm bad at time estimates at the best of times. And this is a particularly hard case, because it depends wildly on someone's prior knowledge and skillset, and on how deep they choose to go, even before accounting for general speed and level of perfectionism. Here are some rough guesses:

ML pre-reqs: 10-40h
Transformer implementation: 10-20h
Mech Interp Tooling: 10-20h
Learning about MI Field: 5-20h

But I am extremely uncertain about these. And I would rather not put these into the main post, since it's easy to be misleading and easy to cause people to feel bad if they think they "should" get something done in <20h and it actually takes 60 to do right. I'd love to hear how long things take you if you try this!

Alexander (3y)

Bad question, but curious why it's called "mechanistic"?

LawrenceC (3y)

Many forms of interpretability seek to explain how the network's outputs relate to high-level concepts without referencing the actual functioning of the network. Saliency maps are a classic example, as are "build an interpretable model" techniques such as LIME.

In contrast, mechanistic interpretability tries to understand the mechanisms that compose the network. To use Chris Olah's words:

"Mechanistic interpretability seeks to reverse engineer neural networks, similar to how one might reverse engineer a compiled binary computer program."

Or see this post by Daniel Filan.

Neel Nanda (3y)

Thanks! That's a great explanation, I've integrated some of this wording into my MI explainer (hope that's fine!)
