This is the eighth post in a sequence called 200 Concrete Open Problems in Mechanistic Interpretability. Start here, then read in any order. If you want to learn the basics before you think about open problems, check out my post on getting started. Look up jargon in my Mechanistic Interpretability Explainer.
Motivating papers: Thread: Circuits, Multimodal Neurons in Artificial Neural Networks
Disclaimer: My area of expertise is language model interpretability, not image models - it would not surprise me at all if this section contains errors, or if there are a lot of great open problems that I’ve missed!
A lot of the early work in mechanistic interpretability was focused on reverse engineering image classification models, especially Inceptionv1 (GoogLeNet). This work was largely (but not entirely!) led by Chris Olah and the OpenAI interpretability team. They got a lot of fascinating results, most notably (in my opinion):
I think the image interpretability results are awesome, and one of the main things that convinced me that reverse engineering neural networks was even possible! But also, very few people worked on these questions! There was enough work done to give a good base to build off of and to expose a lot of dangling threads, but it also left a lot of open questions.
My personal goal with mech interp is to get good enough at understanding systems that we can eventually understand what’s going on in a human-level frontier model, and use this to help align it. From this perspective, is it worth continuing image circuits work? This is not obvious to me! I think language models (and to a lesser degree transformers) are far more likely to be a core part of how we get transformative AI (though I do expect transformative AI to have significant multimodal components), and most of the mech interp field is now focused on LLMs as a result.
But I also think that any progress on reverse engineering neural networks is good, and at least some insights transfer, though there are obviously a lot of differences: Inception has a continuous rather than discrete input space, doesn’t have attention heads or a residual stream, and is doing classification rather than generation. I’m personally most excited about image circuits work driving towards fundamental questions about reverse engineering networks:
My guess is that all other things being the same, marginal effort should go towards language models. But equally, I want a field with a diverse portfolio of research, and all other things may not be the same - image circuits seem to have a pretty different flavour to transformer circuits. And speculation about which architectures or modalities are most relevant to human-level AI is hard and very suspect. Further, I think that often the best way to do research is by following your curiosity and excitement, and whatever nerd-snipes you, and that being too goal directed can be a mistake. If you feel drawn to image circuits work, or think that AGI is a load of overblown hype and want to work on image circuits for other reasons, then more power to you, and I’m excited to see what you learn!
This spreadsheet lists each problem in the sequence. You can write down your contact details if you're working on any of them and want collaborators, see any existing work or reach out to other people on there! (thanks to Jay Bailey for making it)
Different image model architectures
I've done a little bit of work on ViT interpretability. It's kind of messy right now but maybe a starting point for someone else to jump off of + I might add to it in the future: https://berkan.xyz/projects/ (see vision transformer interpretability).