Three ways interpretability could be impactful

Arthur Conmy

Epistemic status: somewhat confident that overall these arguments show interpretability could be impactful, but not confident on any individual point.

Background: I wrote a draft of this post here after reading Charbel’s work. Thanks to Charbel and several others’ feedback, the post was improved a lot: I would definitely recommend adversarial collaboration so thanks Charbel! I’m hoping to discuss interpretability’s impact further so posting now.

Three ways interpretability could be impactful

Just read the test and validation set parts if you have little time.

1. Test set interpretability: Finding alignment failures

What this is. We want to find alignment failures with interpretability that other forms of auditing do not find. Before deploying a model, we find inputs that seem to elicit new capabilities of the model. Either A) the model has correctly generalized to novel capabilities or B) the model is deceiving our evaluation process to produce convincing but false outputs. We then use our future interpretability tools to try to understand some of the computation the AI is doing to reach these new capabilities. We either become more confident that the AI has correctly generalized, or we’ve uncovered deceptive computation and we can retrain and/or provide robust evidence of misalignment to relevant actors.

Why this isn’t achievable with other methods. If a model can deceive our evaluation process it could be very difficult to observe this deception with mere behavorial evals. See here.

In my mind test set interpretability primarily targets a specific set of alignment failures, illustrated here:

Figure 1.

In the appendix I outline my reasoning behind (interpretability’s role in) Figure 1.

2. Validation set interpretability: A better science of alignment.

What this is (thanks to Neel). We have little ground truth on whether our models are misaligned now or how far methods such as RLHF will further scale. More generally, we understand little about how machine learning works, which limits our ability to reason about future systems. Interpretability could first and foremost actually provide evidence for what our alignment techniques are doing (e.g interpreting RLHF reward models) and secondly give us a better evidence base for reasoning about deep learning. I think that Progress Measures for Grokking Via Mechanistic Interpretability has already somewhat changed people’s perspectives on how ML models select different algorithms (e.g here, here).

This differs from test set interpretability as it is broader and can be applied before testing potentially misaligned models, to steer the field towards better practises for alignment (Russell and Norvig’s validation/test distinction here may be helpful for analogy).

Why this isn’t achievable with other methods. If we want to understand how models work for safety-relevant end goals, it seems likely to me that interpretability is the best research direction to pursue. Most methods are merely behavorial and so provide limited ground truth, especially when we are uncertain about deception. For example, I think the existing work trying to make chain-of-thought faithful shows that naive prompting is likely insufficient to understand models’ reasoning. Non-behavorial methods such as science of deep learning approaches (e.g singular learning theory, scaling laws) by default give high-level descriptions of neural network statistics such as loss or RLCT. I don’t think these approaches are as likely to get as close to answering the questions about internal computations in AIs that I think successful interpretability could lead to. I think some other directions are worthwhile bets due to uncertainty and neglectedness, however.

3. Training set interpretability: Training models to be interpretable

(In my opinion this is less impactful than test and validation)

What this is. While test and validation set interpretability offer ways to check whether training procedures have generated misaligned models, they don’t work on actually aligning future models. We could use interpretability to train models by e.g i) rewarding cases where the model’s internal and external reasoning process are identical, or ii) adding terms into the loss function that incentivize simple and/or fast internal computation (example paper that trains for interpretability).

Why this **IS** achievable with other methods. I think training set interpretability is not as promising as test or validation set interpretability. This is because I think that there are lots of other ways to try to align models through training such as (AI-assisted) human feedback, process-based training or adversarial training, and so marginal benefits of interpretability are lower. However, ML is an empirical field so it’s hard to tell which techniques will be effective and which won’t be. Additionally, I’m not that strict on what counts as safety or what counts as interpretability. Few people are working on training for interpretability, so I’d be curious to see projects!

Feasibility of interpretability

The above arguments use cases of interpretability don’t spell out why these are feasible goals. This section has three arguments for feasibility

1. We have perfect access to models inputs, internals and outputs.

This has been stated many times before (~~Chris Olah’s 80k episode~~ Chris Olah's blog) but worth reiterating. Consider a scientific field such as biology where molecular biology and the discovery of DNA, Darwin’s discovery of natural selection and gene-editing were all developed by researchers working with extremely messy input, internal and output channels. In neural networks, we can observe all input, internal and output channels always, with no noise.

2. Crisp internal representations may be both natural and the most important objects of study.

LLMs are trained to model human text and to be useful to humans. In my opinion, both of these pressures push them to have representations close to human concepts (relevant papers include this and this thanks to Charbel) when gradient descent is successful. Also, within reasonable limits of human ability, when model behaviors are general these are likely to be the behaviors we are most concerned by: these will be the capabilities that are can reliably be missed by brute forcing prompt space. In my opinion, general model behaviors are more likely to correspond to crisp internal representations since it is more difficult for models to learn a large number of heuristics that collectively are general.

3. Most useful techniques in the field can be systematized and automated.

The path decomposition insight in Transformer Circuits, several forms of patching and gradient-based attribution methods are all examples of techniques that I would consider systematizable (rather than, for example, requiring some researcher ingenuity to figure out what new algorithm a model is implementing in every interpretability project).

I think systematizing and automating interpretability techniques are some of the most promising directions for research. Note that this is contra Chris Olah (here). To me, automation is crucial for defense against obsoletion by the bitter lesson. I am wary of the “new scientific field” analogy for mechanistic interpretability, as sciences typically do not mature in less than a couple of decades. Automating interpretability could allow the field's speed to match capability progress (Carl Shulman discusses this race dynamic here).

Appendix

(Remarks on Figure 1)

Figure 1 from the main text

Inspired by this post that extends Anthropic’s thoughts here, in Figure 1 consider the “dangerous behavior level” of a ML model. The dangerous behavior level measures a specific dangerous behavior that an AI could have, such as producing a novel virus, or designing software with hidden backdoors. Of course dangerous behavior level isn’t just one axis, and I haven’t added units (just like the post).

Figure 1’s chronology seems plausible through training.

Why might all the stages in Figure 1 occur?

I think this is because gradient descent likely gradually implements and removes heuristics of differing sophistication through training. Therefore when capabilities first appear they are likely to be blatantly capable (not very robust and observable by humans from I/O channels only). Blatant capabilities, if they’re not trained away anyway, are likely to be spotted by comprehensive evals because of I/O observability. To me, the next level of dangerous and deceptive capability is likely when there are two possible generalizations to perform a novel task: i) performing novel tasks in a new environment and ii) producing an output which appears to be correct to humans. If the novel task is sufficiently difficult then NN inductive biases could prefer 2, due to a simplicity bias (i.e a shortcut solution). A training process like those performed today wouldn’t distinguish between i) and ii). Finally, speculatively, significant situational awareness required for the capable and obfuscated solution to be selected, and therefore the deceptively capable part of Figure 1 could be a sizable fraction of the dangerous capability axis. If capable and obfuscated dangerous capabilites emerge, I’m pessimistic about how helpful currently explored prosaic techniques will be.

Figure 1’s chronology seems plausible through scaling.

Similarly to the above point, I think that larger models likely have more capacity for a higher dangerous behavior levels. I think that since there are many large models trained, more gradual dangerous capability level jumps are more likely than sudden leaps. I also personally think that punctuated equilibria is the most useful model for progress from one frontier model to the next, currently. This doesn't suggest that there's a single extreme discontinuity. However, it does suggest there are multiple, smaller discontinuities.

Figure 1 suggests that interpretability is only somewhat important.

None of this post claims that interpretability is more impactful than other alignment approaches (e.g, I think scalable oversight to improve upon RLHF, as well as dangerous capability evals sound more neglected, and I’m personally working on interpretability since I think I have a fairly strong comparative advantage). I am just arguing that to me there is a compelling case for some interpretability research, contra Charbel’s post. In fact, since dangerous capability evals is such a new research area, and only in 2023 it has really become clear that coordination between labs and policymaker interventions could make real impact, to me I would guess that this is a stronger bet for new alignment researchers compared to interpretability, absent other information.

Thanks to @Charbel-Raphaël for prompting this, @Neel Nanda for most of the ideas behind the validation set, as well as Lucius Bushnaq, Cody Rushing and Joseph Bloom for feedback. EDIT: after reading this, I found this LW post which is has a similar name, covers some ideas but differs in that there are more ideas explored.