A brief research note by Chris Olah about the point of mechanistic interpretability research. Introduction and table of contents are below.
An informal note on the relationship between superposition and distributed representations by Chris Olah. Published May 24th, 2023.
Our present research aims to create a foundation for mechanistic interpretability research. In particular, we're focused on trying to resolve the challenge of superposition. In doing so, it's important to keep sight of what we're trying to lay the foundations for. This essay summarizes those motivating aspirations – the exciting directions we hope will be possible if we can overcome the present challenges.
We aim to offer insight into our vision for addressing mechanistic interpretability's other challenges, especially scalability. Because we have focused on foundational issues, our longer-term path to scaling interpretability and tackling other challenges has often been obscure. By articulating this vision, we hope to clarify how we might resolve limitations, like analyzing massive neural networks, that might naively seem intractable in a mechanistic approach.
Before diving in, it's worth making a few small remarks. Firstly, essentially all the ideas in this essay were previously articulated, but buried in previous papers. Our goal is just to surface those implicit visions, largely by quoting relevant parts. Secondly, it's important to note that everything in this essay is almost definitionally extremely speculative and uncertain. It's far from clear that any of it will ultimately be possible. Finally, since the goal of this essay is to lay out our personal vision of what's inspiring to us, it may come across as a bit grandiose – we hope that it can be understood as simply trying to communicate subjective excitement in an open way.
Some very interesting and inspiring material.
I was fascinated to see that https://distill.pub/2021/multimodal-neurons/#emotion-neurons provides some clear evidence for emotion neurons in CLIP, rather similar to the neurons for modeling the author's current emotional state that I hypothesized might exist in LLMs (see https://www.lesswrong.com/posts/4Gt42jX7RiaNaxCwP/?commentId=ggKug9izazELkRLun ). As I noted there, if true, this would have significant potential for LLM safety and alignment.