This is the seventh post in a sequence called 200 Concrete Open Problems in Mechanistic Interpretability. Start here, then read in any order. If you want to learn the basics before you think about open problems, check out my post on getting started. Look up jargon in my Mechanistic Interpretability Explainer.
Motivating papers: Causal Scrubbing, Logit Lens
In Mechanistic Interpretability, the core goal is to form true beliefs about what’s going on inside a network. The search space of possible circuits is extremely large, and even once a circuit is found, we need to verify that that’s what’s really going on. These are hard problems, and having good techniques and tooling is essential to making progress. This is particularly important in mech interp because it’s such a young field that there isn’t an established toolkit and standard of evidence, and each paper seems to use somewhat different and ad-hoc techniques (pre-paradigmatic, in Thomas Kuhn’s language).
Getting better at this is essential just to get traction interpreting circuits at all, even in a one layer toy language model! But it’s particularly important for dealing with the problem of scale. Mech interp can be very labour-intensive, and involves a lot of creativity and well-honed research intuitions. This isn’t the end of the world with small models, but ultimately we want to understand models with hundreds of billions to trillions of parameters! We want to leverage researcher time as much as possible, and the holy grail is to eventually automate the finding of circuits and the understanding of models. My guess is that the most realistic path to really understanding superhuman systems is to slowly automate more and more of the work with weaker systems, while making sure that we understand those systems, and that they’re aligned with what we want.
This can be somewhat abstract, so here are my best guesses for what progress could look like:
A mechanistic perspective: This is an area of mech interp research where it’s particularly useful to study non-mechanistic approaches! People have tried a lot of techniques to understand models. Approaches that solely focus on the model’s inputs or outputs don’t seem that relevant, but there’s a lot of work digging into model internals (Rauker et al. is a good survey) and many of these ideas are fairly general and scalable! I think it’s likely that there are insights here that are underappreciated in the mech interp community.
A natural question is, what does mech interp have to add? I have a pretty skeptical prior on any claim about neural network internals, let alone the claim that there’s a technique that works in general, or that can be automated without human judgement. In my opinion, one of the core things missing is grounding - concrete examples of systems and circuits that are well understood, where we can test these techniques and see how well they work, their limitations, and whether they miss anything important. My vision for research here is to take circuits we understand, and use them as training data to figure out scalable techniques that work for those circuits. We can then use these refined techniques to search for new circuits, do our best to fully understand those, and build a feedback loop that validates how well the techniques generalise to new settings.
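To make this concrete, the Logit Lens (one of the motivating papers above) is a simple technique of this scalable flavour: project each layer’s residual stream through the final LayerNorm and unembedding, to see how the model’s prediction forms layer by layer. Here’s a minimal sketch with random stand-in weights - the shapes, names, and the simplified LayerNorm are all hypothetical, not taken from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions, standing in for a real transformer's.
n_layers, seq_len, d_model, d_vocab = 4, 6, 16, 50

# resid[l] stands in for the residual stream after layer l;
# W_U stands in for the unembedding matrix.
resid = rng.normal(size=(n_layers, seq_len, d_model))
W_U = rng.normal(size=(d_model, d_vocab))

def layer_norm(x, eps=1e-5):
    """Simplified final LayerNorm (no learned scale/bias)."""
    x = x - x.mean(axis=-1, keepdims=True)
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def logit_lens(resid, W_U):
    """Project each layer's residual stream directly to logits."""
    return layer_norm(resid) @ W_U  # shape: [n_layers, seq_len, d_vocab]

layer_logits = logit_lens(resid, W_U)
# How the top predicted token at the final position evolves across layers.
top_tokens = layer_logits[:, -1, :].argmax(axis=-1)
print(layer_logits.shape, top_tokens.shape)
```

The appeal is that this runs with no human in the loop - exactly the kind of technique whose reliability we’d want to validate against circuits we already understand.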
“Build better techniques” is a broad claim, and encompasses many approaches and types of techniques. Here’s my attempt to operationalise the key ways that techniques can vary - note that these are spectrums, not black and white! Important context is that I generally approach mech interp with two mindsets: exploration, where I’m trying to become less confused about a model and form hypotheses, and verifying/falsifying, where I’m trying to break the hypotheses I’ve formed and look for flaws, or for stronger evidence that I’m correct. Good research looks like regularly switching between these mindsets, but they need fairly different techniques.
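As a toy illustration of the verifying/falsifying mindset, here’s a sketch of activation patching - the kind of causal intervention underlying methods like Causal Scrubbing: copy an activation from a clean run into a corrupted run, and measure how much of the behaviour it restores. Everything here (the two-component “model”, its weights, and the inputs) is made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy "model": two parallel components A and B feed the output.
W_a = rng.normal(size=(4, 8))  # component A's weights (made up)
W_b = rng.normal(size=(4, 8))  # component B's weights (made up)
w_out = rng.normal(size=8)

def run(x, patch_a=None):
    """Forward pass; optionally overwrite component A's activation."""
    a = np.tanh(W_a.T @ x) if patch_a is None else patch_a
    b = np.tanh(W_b.T @ x)
    return w_out @ (a + b), a

clean_x, corrupt_x = rng.normal(size=4), rng.normal(size=4)
clean_out, clean_a = run(clean_x)
corrupt_out, _ = run(corrupt_x)

# Intervention: run on the corrupted input, but splice in component A's
# activation from the clean run.
patched_out, _ = run(corrupt_x, patch_a=clean_a)

# Fraction of the clean-vs-corrupt output gap recovered by patching A alone:
# near 1 suggests A causally carries the relevant information, near 0 that
# it doesn't.
recovered = (patched_out - corrupt_out) / (clean_out - corrupt_out)
print(round(float(recovered), 3))
```

In exploration mode you might patch many components to see which ones matter; in verification mode you use the same intervention to try to break a hypothesis - if a component your story says is crucial recovers almost none of the gap, the story is wrong.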
Finally, a cautionary note: beware premature optimization. A common reaction among people new to the field is to dismiss highly labour-intensive approaches and jump straight to techniques that are obviously scalable. This isn’t crazy; scalable approaches are an important eventual goal! But if you skip over the stage of really understanding what’s going on, it’s very easy to trick yourself and produce techniques or results that don’t really work.
Further, I think it is a mistake to discard promising-seeming interpretability approaches for fear that they won’t scale - there’s a lot of fundamental work to do in getting to a point where we can even understand small toy models or specific circuits at all. I see a lot of the work to be done right now as basic science - building an understanding of the basic principles of networks and a collection of concrete examples of circuits (like, what is up with superposition?!), and I expect this to then be a good foundation to think about scaling. We only have like 3 examples of well understood circuits in real language models! It’s plausible to me that we shouldn’t be focusing too hard on automation or scalable techniques until we have at least 20 diverse example circuits, and can get some real confidence in what’s going on!
But automated and scalable techniques remain a vital goal!
This spreadsheet lists each problem in the sequence. You can write down your contact details if you're working on any of them and want collaborators, see any existing work, or reach out to other people on there! (Thanks to Jay Bailey for making it.)