When justifying my mechanistic interpretability research interests to others, I've occasionally found it useful to borrow a distinction from physics and distinguish between 'fundamental' and 'applied' interpretability research.
Fundamental interpretability research is the kind that investigates better ways to think about the structure of the function learned by neural networks. It lets us make new categories of hypotheses about neural networks. In the ideal case, it suggests novel interpretability methods based on new insights, but it is not those methods themselves.
Applied interpretability research is the kind that uses existing methods to find the representations or circuits that particular neural networks have learned. It generally involves finding facts or testing hypotheses about a given network (or set of networks) based on assumptions provided by theory.
Although I've found the distinction between fundamental and applied interpretability useful, it's not always clear cut:
Clearly both fundamental and applied interpretability research are essential. We need both in order to progress scientifically and to ensure future models are safe.
But given our current position on the tech tree, I find that I care more about fundamental interpretability.
The reason is that current interpretability methods are unsuitable for comprehensively interpreting networks on a mechanistic level. So far, our methods only seem able to identify particular representations that we look for, or to describe how particular behaviors are carried out. They don't let us identify all representations or circuits in a network, or summarize the full computational graph of a neural network (whatever that might mean). Let's call the ability to do these things 'comprehensive interpretability'.
We need comprehensive interpretability in order to have strong-ish confidence about whether dangerous representations or circuits exist in our model. If we don't have strong-ish confidence, then many theories of impact for interpretability are inordinately weakened:
For most of these theories of impact, the relationship feels like it might be nonlinear: A slight improvement to interpretability that nevertheless falls short of comprehensive interpretability does not lead to proportional safety gains; only when we cross a threshold to something resembling comprehensive interpretability would we get the bulk of the safety gains. And right now, even though there's a lot of valuable applied work to be done, it feels to me like progress in fundamental interpretability is the main determinant of whether we cross that threshold.
Similar terms for 'comprehensive interpretability' include Anthropic's notion of 'enumerative safety', Evan Hubinger's notion of 'worst-case inspection transparency', and Erik Jenner's notion of 'quotient interpretability'.
How likely do you think it is that bilinear layers and dictionary learning will lead to comprehensive interpretability?
Are there other specific areas you're excited about?
Bilinear layers - not confident at all! It might make structure more amenable to mathematical analysis so it might help? But as yet there aren't any empirical interpretability wins that have come from bilinear layers.
Dictionary learning - This is one of my main bets for comprehensive interpretability.
Other areas - I'm also generally excited by the line of research outlined in https://arxiv.org/abs/2301.04709