I've historically been pretty publicly supportive of interpretability research. I'm still supportive of interpretability research. However, I do not necessarily think that all of it should be done in the open indefinitely. Indeed, insofar as interpretability researchers gain understanding of AIs that could significantly advance the capabilities frontier, I encourage interpretability researchers to keep their research closed.
I acknowledge that spreading research insights less widely comes with real research costs. I'd endorse building a cross-organization network of people who are committed to not using their understanding to push the capabilities frontier, and sharing freely within that.
I acknowledge that public sharing of research insights could, in principle, both shorten timelines and improve our odds of success. I suspect that isn’t the case in real life.
It's much more important that blatant and direct capabilities research be made private. Anyone fighting for people to keep their AI insights close to the chest should be focusing on the capabilities work that's happening out in the open, long before they focus on interpretability research.
Interpretability research is, I think, some of the best research that can be approached incrementally and by a large number of people, when it comes to improving our odds. (Which is not to say it doesn't require vision and genius; I expect it requires that too.) I simultaneously think it's entirely plausible that a better understanding of the workings of modern AI systems will help capabilities researchers significantly improve capabilities. I acknowledge that this sucks, and puts us in a bind. I don't have good solutions. Reality doesn't have to provide you any outs.
There's a tradeoff here. And it's not my tradeoff to make; researchers will have to figure out what they think of the costs and benefits. My guess is that the current field is not close to insights that would significantly improve capabilities, and that growing the field is important (and would be hindered by closure), and also that if the field succeeds to the degree required to move the strategic needle then it's going to start stumbling across serious capabilities improvements before it saves us, and will need to start doing research privately before then.
I reiterate that I'd feel ~pure enthusiasm about a cross-organization network of people trying to understand modern AI systems and committed not to letting their insights push the capabilities frontier.
My goal in writing this post, though, is mostly to keep the Overton window open around the claim that there is in fact a tradeoff here, that there are reasons to close even interpretability research. Maybe those reasons should win out, or maybe they shouldn't, but don't let my praise of interpretability research obscure the fact that the tradeoff exists.
It's not clear what the ratio of capabilities progress to alignment progress is for interpretability. There is no empirical track record[^1] of interpretability feeding back into improvements of either kind.
A priori, it seems like interpretability should be good for alignment: understanding how a system works helps you predict its behavior, and thus tell whether a model is aligned, or see how to make it more so. But understanding how a system works is also useful for making it more capable; e.g., if you use interpretability as a model debugger, it's essentially a general-purpose tool for improving ML models.
[^1]: Known to the author, at least.