This was written as a guide for Apart Research's Mechanistic Interpretability Hackathon as a compressed version of my getting started post. The spirit is "how to speedrun your way to maybe doing something useful in mechanistic interpretability in a weekend", but hopefully this is useful even to people who aren't aiming for weekend long projects!
Mechanistic Interpretability is the study of reverse-engineering neural networks - analogous to how we might try to reverse-engineer a program’s source code from its compiled binary, our goal is to reverse engineer the parameters of a trained neural network, and to try to reverse engineer what algorithms and internal cognition the model is actually doing. Going from knowing that it works, to understanding how it works. Check out Circuits: Zoom In for an introduction.
In my (extremely biased!) opinion, mech interp is a very exciting subfield of alignment. Currently our models are inscrutable black boxes! If we can really understand what models are thinking, and why they do what they do, then I feel much happier about living in a world with human level and beyond models, and it seems far easier to align them.
Further, it is a young field, with a lot of low-hanging fruit. And it suffices to screw around in a Colab notebook with a small-ish model that someone else trained, copying code from an existing demo - the bar for entry can be much lower than other areas of alignment. So you stand a chance of getting traction on a problem in this hackathon!
Though the bar for entry is lower for mech interp than other areas of alignment, it is still far from zero. I’ve written a post on how to get started that lays out the key prerequisites and my takes for what to do to get them. A weekend hackathon isn’t long enough to properly engage with those, so I recommend picking a problem you’re excited about, and dipping into the resources summarised here whenever you get stuck. I recommend trying to have some problem in mind, so you can direct your learning towards making progress on that goal. But it’s completely fine if, in fact, you just spend the weekend learning as much as you can - if you feel like you’ve learned cool things, then I call that a great hackathon!
A summary of the key resources, and how to think of them during a hackathon.
There’s a lot of concrete open problems you could engage with! I’ve written a sequence called 200 Concrete Open Problems in Mechanistic Interpretability that tries to lay out a ton of them. Try not to get paralysed by choice! I recommend reading the overview, skimming the posts that seem exciting, picking a problem that jumps out at you and running with it. You should be able to get traction on anything rated difficulty A and maybe something rated difficulty B!
Here are some meta thoughts on particularly suitable areas for a hackathon. Please just take these takes as a starting point! There’s a bunch of other posts in the sequence, and problems of a very different flavour. And please take the sequence itself as just another starting point - if you’re already excited about some other question or direction, go wild!
Unfinished sentence at “if you want a low coding project” at the top.