TL;DR: Multiple people have raised concerns about current mechanistic interpretability research having capabilities externalities. We discuss to what extent, and what kinds of, mechanistic interpretability research we should publish. The core question we want to explore with this post is thus to what extent the statement “findings in mechanistic interpretability can increase capabilities faster than alignment” is true and should be a consideration. For example, foundational findings in mechanistic interpretability may lead to a better understanding of NNs, which often straightforwardly generates new hypotheses for advancing capabilities.
We argue that there is no general answer and that the publishing decision depends primarily on how much the work advances alignment relative to how much it can be used to advance capabilities. We recommend a differential publishing process in which work with high capabilities potential is initially circulated only among a small number of trusted people and organizations, while work with low capabilities potential is published widely.
Related work: A note about differential technological development, Current themes in mechanistic interpretability research, Thoughts on AGI organization and capabilities work, Dan Hendrycks’s take, etc.
We have talked to many people about this question and have lost track of whom to thank individually. In case you talked to either Marius or Lawrence about this, thank you!
Let’s revisit the basic cases for publishing.
In addition, mechanistic interp seems especially well suited for publication in classic academic venues since it is less speculative than other AI safety work and overlaps with established academic fields.
Thus, publication seems robustly positive as long as it doesn’t advance capabilities more than alignment (which is often hard to predict in advance). The crux of this post, therefore, lies mainly in the possible negative externalities of publications and how they trade off against the alignment benefits.
The primary reason to think that mechanistic interpretability has large capabilities externalities is that understanding a system better makes improvements easier. Historically, many applications have been downstream effects of foundational research. This seems true for scientific advances in general, e.g. an improved understanding of biology led to better medicine, but also for ML applications in particular. We decompose the capabilities externalities into multiple different paths.
It’s worth noting that, historically, most capability advances were not the result of a detailed understanding of NNs; rather, they were the result of a mix of high-level insights and trial and error. In particular, existing work interpreting concrete networks seems to have been counterfactually responsible for very few capability gains (other than the two cases cited above).
Unfortunately, we think it’s likely that the potential capability implications of interpretability are proportional to its usefulness for alignment, i.e. better interpretability tools both help safety more and increase capabilities more, since they yield better insights. Thus, the historical lack of capabilities advances from mechanistic interpretability may just indicate that interpretability is too far behind the state of the art to be useful at the moment. That could change once it catches up.
We think these are much less relevant than the capabilities externalities but still worth mentioning.
The previous considerations were presented in a vacuum, i.e. we presented effects that are plausibly negative. However, the alternative world could be even worse, so we also look at counterfactual considerations.
Obviously, publishing can mean very different things. In order of increasing effort and readership, it can mean:
A caveat: while the average LW/AF post has relatively low readership and a high alignment/non-alignment ratio, viral LW/AF posts can reach a wide audience. For example, after its tweet thread went viral, Neel Nanda’s grokking work was widely circulated amongst the academic interpretability community. Given the hit-based nature of virality, there isn’t a particular place to put it in the hierarchy, but it’s worth noting that Twitter threads can sometimes greatly increase the publicity of a post.
Different people have very different opinions about this question, and it seems hard to combine them. Thus, we decided to ask multiple people in the alignment scene about their stance.
It seems really straightforwardly good to me to publish (almost all) mechanistic interpretability work; it’s so far down the list of things we should be worrying about that by default I assume that objections to it are more strongly motivated by deontological rather than impact considerations. I’m generally skeptical of applying new deontological rules to decisions this complex; but even if we’re focusing on deontology, there are much bigger priorities to consider (e.g. EA’s relationship to AI labs).
I enjoyed these thoughts, and I have a few overarching comments on them.
Is mechanistic interpretability the best category to ask this question about? On one hand, this may be a special case of a broader point involving research in the science of ML. Worries about risks from basic research insights do not seem unique to interpretability. On the other hand, different types of (mechanistic) interpretability research seem likely to have very different implications for safety vs. risky capabilities.
Interpretability tools often trade off with capabilities. In these cases, it might be extra important to publish. For example, disentanglement techniques, adversarial training, modularity techniques, bottlenecking, compression, etc. are all examples of techniques that tend to harm a network’s performance on the task while making it more interpretable. There are exceptions to this, like model editing tools and some architectures. But overall, it seems that in most existing examples, more interpretable models are less capable.
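(To make the tradeoff concrete, here is a minimal, purely illustrative sketch of one such technique: an activation-sparsity penalty added to a classifier's training loss. The architecture, stand-in data, and coefficient are all assumptions, not taken from any of the cited work.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Purely illustrative: a small classifier trained with an extra L1
# penalty on its hidden activations. The penalty pushes the hidden
# layer toward sparse, easier-to-inspect representations, typically
# at some cost in task accuracy (the tradeoff described above).
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(32, 784)                 # stand-in batch of inputs
y = torch.randint(0, 10, (32,))          # stand-in labels

hidden = model[1](model[0](x))           # post-ReLU hidden activations
logits = model[2](hidden)

task_loss = F.cross_entropy(logits, y)
sparsity_loss = hidden.abs().mean()      # interpretability pressure
loss = task_loss + 1e-2 * sparsity_loss  # coefficient is an assumption
loss.backward()
opt.step()
```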
Mechanistic interpretability may not be the elephant in the room. In general, lots of mechanistic interpretability work might not be very relevant for engineers – either for safety or capabilities. There’s a good chance this type of research might not be key either way, especially on short timelines. Meanwhile, RLHF is currently changing the world, and this has now prompted hasty retrospectives about whether it was net good, bad, or ok. At a minimum, mechanistic interpretability is not the only alignment-relevant work that all of these questions should be asked about.
If work is too risky to publish, it may often be good to avoid working on it at all. Pivotal acts could be great. And helping more risk-averse developers of TAI be more competitive seems good. But infohazardous work comes with inherent risks from misuse and copycatting. Much of the time, it may be useful to prioritize work that is more robustly good. And when risky things are worked on, it should be by people who are divested from potential windfalls resulting from it.
I strongly agree that this is a thing we should be thinking about. That mechanistic interpretability has failed to meaningfully enhance capabilities so far is, I think, largely owed to current interpretability being really bad. The field has barely figured out the first thing about NN insides yet. I think the level of understanding needed for proper alignment and deception detection is massively above what we currently have. To give a rough idea, I think you probably need to understand LLMs well enough to be able to code one up by hand in C, without using any numerical optimisation, and have it be roughly as good as GPT-2. I would expect that level of insight to have a high risk of leading to massive capability improvements, since I see little indication that current architectures, which were found mostly through not-very-educated guesswork, are anywhere near the upper limit of what the hardware and data sets allow.

I would go further and suggest that we need to plan for what happens if we have some success and see fundamental insights that seem crucial for alignment, but might also be used to make a superintelligence. How do you keep researching and working with collaborators safely in that information environment? There does not currently exist much infrastructure for different orgs and researchers to talk to one another under some level of security and trust. If not proper NDAs and vetting, then at least the early establishment of stronger ecosystem norms around secrecy, and the normalization of legally non-binding, honor-based NDAs, might be in order. If we don’t do it now, it might become a roadblock that eats up valuable time later, near the end, when time is even more precious.
TLDR: I care most about how we prioritise research and how we shape the culture of the field. I want mech interp to be a large and thriving field, but one that prizes genuine scientific understanding and steerability of systems, and not a field where the goal is to make a number go up, or where capabilities advancements feel intrinsically high status. Thinking through whether to publish something does matter on the margin and should be done, but it's a lower order bit - most research that could directly cause harm if published is probably just not worth doing! And getting good at interpretability seems really important, in a way that makes me pretty opposed to secrecy or paralysis. I'm concerned about direct effects on capabilities (a la induction heads, or even more directly producing relevant ideas), but think that worrying about indirectly accelerating capabilities via eg field-building or producing fundamental insights about models that someone then builds on, is too hard and paralysing to be worthwhile. People new to the field tend to worry about this way too much, and should chill out.
I think these are important but hard and thorny questions, and it's easy to end up paralysed, or to avoid significant positive impact by being too conservative here. But it's also worth trying to think them through carefully.
The most important question to me is not publication norms, but what research we do in the first place, and the norms in the field of what good research looks like. To me this is the highest leverage thing, especially as the field is small yet growing fast. My vision for the field of mechanistic interpretability is one that prizes rigorous, scientific understanding of models. And I personally judge research by how well it tracked truth and taught me things about models, rather than by whether it made models better. I'll feel pretty sad if we build a field of mech interp where the high-status thing to do is to push hard on making models better.
To me this is a much more important question than publication norms - if you do research that's net bad to publish, probably it would have been better to do something else with a clearer net win for alignment, all other things being the same. This can be hard to tell ahead of time, so I think this is worth thinking through before publishing, but that's a lower-order bit.
At a high level, I think that getting good at interpretability seems crucial to alignment going well (whether via mech interp or some other angle of attack), and we aren't very good at it yet! Further, it's a very hard problem, yet with lots of surface area to get traction on it, and I would like there to be a large and thriving field of mech interp. This includes significant effort from academia and outside the alignment community, which means having people who are excited about capabilities advancement. This means accepting that "do no harm" is an unrealistically high standard, and I mostly want to go full steam ahead on doing great mech interp work and publishing it and making it easy to build upon. I think that tracking the indirect effects here is hard and likely ineffective and unhelpful. Though I do think that "would this result directly help a capabilities researcher, in a way that does not result in interpretability understanding" is a question worth thinking about.
I mostly think that interpretability so far has had fairly little impact on capabilities or alignment, but mostly because we aren't very good at it! If the ambitious claims of really understanding a system hold, then I expect this to be valuable for both (in a way that's pretty correlated), though it seems far better on net than most capabilities work! We should plan for success - if we remain this bad at interpretability we should just give up and do something else. So to me the interesting question is how much there are research directions that push much harder on capabilities than alignment.
One area that's particularly interesting to me is using interpretability to make systems more steerable, like interpretability-assisted RLHF. This seems like it easily boosts capabilities and alignment, but IMO is a pretty important thing to practice and test, and see what it takes to get good at this in practice (or if it just breaks the interpretability techniques and makes the failures more subtle!).
Using mech interp to analyse fundamental scientific questions in deep learning like superposition is more confusing to me. I would mostly guess it's harmless (eg I would be pretty surprised if my grokking work is directly useful for capabilities!). For some specific questions like superposition, I think that better understanding this is one of the biggest open problems in mech interp, and well worth the capabilities externalities!
A final note is that these thoughts are aimed more at established researchers, and how we should think as we grow the field. I often see people new to the field, working independently without a mentor, who are very stressed about this. I think this is highly unproductive - an unmentored first project is probably not going to produce good research, let alone accidentally produce a real capabilities advance, and you should prioritise learning and seeing how much you enjoy the research.
Depending on when this document is circulated, I either have a post in my drafts folder on this topic, or I have recently posted my thoughts on this topic. I agree that the situation is pretty thorny. If the choice were all up to me and I had magic coordination powers, I'd create a large and thriving interpretability community that was committed to research closure relative to the larger world, while sharing research freely within the community, and while committing not to use the fruits of that research for capabilities advancements (until humanity understands intelligence well enough to use that knowledge wisely).
This likely depends a lot on what the “solution to superposition” looks like. A sparse coding scheme is less likely to advance capabilities than a fundamental insight into transformers that allows us to decode superposed features everywhere in the network.
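(For concreteness, a minimal sketch of the kind of sparse coding scheme meant here is a sparse autoencoder trained on a model's activations. The sizes, coefficient, and random stand-in data below are illustrative assumptions, not a reference implementation.)

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder: maps d_model-dim activations into an
    overcomplete dictionary of n_features, with an L1 penalty that
    encourages each activation vector to be explained by few features."""

    def __init__(self, d_model: int = 512, n_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=False)

    def forward(self, acts: torch.Tensor):
        codes = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(codes)             # reconstructed activations
        return recon, codes

# Usage sketch: `acts` would be activations collected from some layer of
# the model under study; random data stands in for them here.
acts = torch.randn(64, 512)
sae = SparseAutoencoder()
recon, codes = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * codes.abs().mean()
loss.backward()
```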
Note that this goes both ways – just because mech interp has not been particularly useful for alignment so far does not mean that future work won’t be!
To throw in my two cents, I think it's clear that whole classes of "mechanistic interpretability" work are about better understanding architectures in ways that, if the research is successful, make it easier to improve their capabilities.
And I think this points strongly against publishing this stuff, especially if the goal is to "make this whole field more prestigious real quick". Insofar as the prestige is coming from folks who work on AI capabilities, that's drinking from a poisoned well (since they'll grant the most prestige to the work that helps them accelerate).
One relevant point I don't see discussed is that interpretability research is trying to buy us "slack", but capabilities research consumes available "slack" as fuel until none is left.
What do I mean by this? Sometimes we do some work and are left with more understanding and grounding about what our neural nets are doing. The repeated pattern then seems to be that this helps someone design a better architecture or scale things up, until we're left with a new more complicated network. Maybe because you helped them figure out a key detail about gradient flow in a deep network, or let them quantize the network better so they can run things faster, or whatnot.
Idk how to point at this thing properly, my examples aren't great. I think I did a better job talking about this over here on twitter recently, if anyone is interested.
But anyhow, I support folks doing their research without broadcasting their ideas to people who are trying to do capabilities work. It would seem nice to me if there were mostly research closure. And I think I broadly see people overestimating the benefits of publishing their work relative to keeping it within a local cluster.
Can you describe how the "local cluster" thing would work outside of keeping it within a single organization? I'd also be very interested in some case studies where people tried this.
I mostly do just mean "keeping it within a single research group" in the absence of better ideas. And I don't have a better answer, especially not for independent folk or small orgs.
I wonder if we need an arxiv or LessWrong clone where you whitelist who you want to discuss your work with. And some scheme for helping independents find each other, or find existing groups they trust. Maybe with some "I won't use this for capabilities work without the permission of the authors" legal docs as well.
This isn't something I can visualize working, but maybe it has components of an answer.
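(For concreteness, here is a purely hypothetical sketch of the core access rule such a whitelist-gated platform might implement. Every name and identifier in it is invented for illustration; nothing like this exists in the thread.)

```python
from dataclasses import dataclass, field

@dataclass
class Draft:
    """A write-up gated by an author-controlled whitelist of readers."""
    title: str
    authors: set[str]
    whitelist: set[str] = field(default_factory=set)

    def share_with(self, reader: str) -> None:
        # Authors explicitly opt readers in, individually or via trusted groups.
        self.whitelist.add(reader)

    def can_read(self, reader: str) -> bool:
        return reader in self.authors or reader in self.whitelist

# Usage sketch
draft = Draft(title="Decoding superposed features", authors={"alice"})
draft.share_with("bob")
assert draft.can_read("bob") and not draft.can_read("mallory")
```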
And I think I broadly see people overestimating the benefits of publishing their work relative to keeping it within a local cluster.
I'm surprised by this claim, can you say more? My read is weakly that people in interp underpublish to wider audiences (eg getting papers into conferences), though maybe people overpublish blog posts? (Or that I try too hard to make things go viral on Twitter lol)
I'm perhaps misusing "publish" here to refer to "putting stuff on the internet", "raising awareness of the work through company Twitter", etc.
I mostly meant to say that, as I see it, too many things that shouldn't be published are being published, and the net effect looks plausibly terrible with little upside (though not much has happened yet in either direction).
The transformer circuits work strikes me this way, as does a bunch of other work.
Also, I'm grateful to know your read! I'm broadly interested to hear this and other raw viewpoints, to get a sense of how things look to other people.
Interesting, thanks for the context. I buy that this could be bad, but I'm surprised that you see little upside - the obvious upside, especially for great work like transformer circuits, is getting lots of researchers nerdsniped into producing excellent and alignment-relevant interp work. That seems huge if it works.
I want to say that I agree the transformer circuits work is great, and that I like it, and am glad I had the opportunity to read it! I still expect it was pretty harmful to publish.
Nerdsniping goes both ways: you also inspire things like the Hyena work trying to improve architectures based on components of what transformers can do.
I think indiscriminate hype and trying to do work that will be broadly attention-grabbing fall on the wrong side, likely doing net harm, because capabilities improvements seem empirically easier than understanding, and there's a lot more attention/people/incentives for capabilities.
I think there are more targeted things that would be better for getting more good work to happen. Like research workshops or unconferences, where you choose who to invite, or building community with more aligned folk who are looking for interesting and alignment-relevant research directions. This would come with way less potential harm imo as a recruitment strategy.