TL;DR: In this post, I want to argue why Interpretability & Transparency tools have a defender’s advantage if they are used correctly, i.e. they improve alignment much more than new capabilities and therefore mitigate risks of dual use. I draw parallels from biosecurity researchers who have thought about the risks of dual-use and defender’s advantages in more detail and I think that the AI safety community can learn a lot from them. Lastly, I want to point out that not all interpretability tools have a clear defender’s advantage and some interpretability research might still carry a lot of risks when used incorrectly.
I’d like to thank Lee Sharkey and Simon Grimm for their feedback on this post.
Most technology is dual-use in some way--a knife can be used as a household appliance or as a weapon. However, different technologies have different propensities to be used for good or bad, e.g. more research into walls will likely benefit the defender more than the attacker while more research into the capabilities of viruses benefits attackers more than defenders.
I feel like we, the AI safety community, have not thought enough about which approaches have a clear defender’s advantage or how we could steer existing approaches to have more of a defender's advantage. To my (very limited) understanding, the biosecurity community has thought a bit more about these kinds of dual-use trade-offs. Therefore, we could probably learn some things from them.
In this post, I want to briefly look at some of the possible lessons from biosecurity and see if we can translate them to AI safety. Then I want to argue why interpretability is one of the approaches that plausibly has a defender’s advantage.
I’m certainly not the first person to have come to the conclusion that interpretability is important for alignment. Chris Olah has made the case for interpretability for years. Neel Nanda has provided a long theory of impacts of interpretability research. Quintin Pope has made the case for optimism about interpretability. Evan Hubinger has provided 11 proposals to build safe AI that are all essentially something+interpretability, has developed an interpretability tech tree and summarized transformer circuits. ARC is working on ELK (and related topics) that certainly read to me as if they are intended to prevent deceptive alignment. There are many further good posts on aspects of interpretability (see e.g. here, here, here, or here).
The reason why I add this post to the long list of posts arguing for the importance of interpretability is that I feel like the “defender’s advantage” framework allows for an easy way to decide which kind of interpretability research will help more with alignment than with capabilities and thus alleviates one major concern that some people have against it (personal conversations, not sure if someone wrote this down).
Most of the following comes from personal discussions with biosecurity researchers or podcasts like “Hear This Idea”s interview with Kevin Esvelt and Jonas Sandbrink. I’m not a biosecurity researcher myself and the following is likely to lack nuance.
In summary, a) some defensive tools do not require novel capabilities, e.g. broad spectrum vaccines, better PPE or better ventilation, and b) the knowledge of defensive insights can sometimes be used intentionally or accidentally to create more powerful offensive tools. Thus, we should keep the offensive-defensive scaling in mind when creating a new tool.
I’m probably missing a lot of nuance and some important points but even these fairly general ideas can already be translated to AI safety--at least to some extent.
For the following section, I will use Interpretability in a very broad sense, i.e. including mechanistic interpretability but also more high-level approaches that aim to understand NNs (sometimes called “Science of DL”).
My reasons to think that interpretability has a defenders advantage include
I argue that many forms of interpretability and transparency have a defender’s advantage, i.e. that they are more likely to help with alignment than with developing new capabilities. However, results from interpretability investigations should still be handled with care. Specific use cases and types of interpretability can still carry substantial risk of increasing capabilities without meaningfully increasing alignment. For example, I think there is some chance that Neel Nanda’s mechanistic analysis of grokking will lead to capability improvements in the long run. I still think it was correct to publish these results on balance but one should think about possible harms beforehand (and I expect Neel to have done that). I expect the situations where the defender's advantage doesn’t hold anymore to be “we understood the system well enough to make it better but not really why it got better” similar to how our current understanding of scaling laws allows us to build more capable models but we don’t really understand why.
To offer a simple solution, there is always the option to share results only with a select group of people rather than publishing them or doing research that is private by default.
Furthermore, I want to encourage other AI safety researchers to apply the defender’s advantage framework more generally and pursue research that has a high chance to help with alignment while not increasing capabilities unnecessarily.
For example, I think there is some chance that Neel Nanda’s mechanistic analysis of grokking will lead to capability improvements in the long run.
I'm curious if you have a particular concern in mind here?
My personal take is that this is the kind of interpretability work where I'm least concerned about it leading to capabilities improvements, since it's very specific to toy models and analysing deep learning puzzles, and pretty far from the state of the art frontier.
In a world where it does lead to advancements, my best guess is that it follows a pretty indirect and diffuse trajectory (eg, furthers science of deep learning studies which lead to new insights that let us build better models, or get more people excited about interpretability which leads to more research and some of that advances capabilities), which seems extremely hard to model. I'd guess the alignment benefits of the work are minor to moderate (definitely not the interpretability work I think is most relevant to pushing on reducing x-risk, but likely somewhat useful), and strongly outweigh this kind of concern about diffuse and hard-to-predict effects