Part 11 of 12 in the Engineer’s Interpretability Sequence.

So far, this sequence has discussed a number of topics in interpretability research, all building toward this post. Its goal is to explain some approaches that may be valuable moving forward. I plan to work on some of the ideas here soon. Others I may not get to soon, but I would love to discuss and support such work if I am able. I hope that this post can offer some useful ideas for people entering or continuing with interpretability research, and if you would like to discuss anything here further, feel free to email me.

What are we working toward?

First, it seems useful to highlight two points that are uncontroversial in the AI safety community but important nonetheless. 

Our goal is a toolbox – not a silver bullet.

As AI safety engineers, we should neither expect nor try to find a single ‘best’ approach to interpretability that will solve all of our problems. There are many different types of interpretability tools, and many of the differences between them can be described as enforcing different priors over what explanations they generate. So it is trivial to see that there is not going to be any free lunch. There is no silver bullet for interpretability, and few tools conflict with each other anyway. Hence, our goal is a toolbox. 

In fact, some coauthors and I recently found an excellent example of how using multiple interpretability tools at once beats using individual ones (Casper et al., 2023).

This doesn’t mean, however, that we should celebrate just any new interpretability tool. Working in unproductive directions is costly, and applying tool after tool to a problem contributes substantially to the alignment tax. The best types of tools to fill our toolbox will be ones that are automatable, cheap to use, and have demonstrated capabilities on tasks of engineering-relevance. 

Don’t advance capabilities.

As AI safety engineers, we do not want to advance capabilities because doing so speeds up timelines. In turn, faster timelines mean less time for safety research, less time for regulators to react, and a greater likelihood of immense power being concentrated in the hands of very few. Avoiding faster timelines isn’t as simple as just not working on capabilities, though. Many techniques have potential uses for both safety and capabilities. So instead of judging our work based on how much we improve safety, we need to judge it based on how much we improve safety relative to capabilities. This is an especially important tradeoff for engineers to keep in mind. A good example was discussed by Hendrycks and Woodside (2022), who observed that there is a positive correlation between the anomaly detection capabilities of a network and its task performance. Some work may improve safety-relevant capabilities, but if it does so only by continuing along existing trendlines, we don’t get more safety than we would have counterfactually. For the full discussion of this point, see Hendrycks and Woodside (2022).
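One way to make the "safety relative to capabilities" comparison concrete is to fit the existing trendline and ask how far a new method sits above it. The sketch below is only an illustration under assumptions: the function names, the choice of a straight-line fit, and all of the numbers are hypothetical and are not taken from Hendrycks and Woodside (2022).

```python
def fit_trendline(points):
    """Ordinary least-squares line through (capability, safety) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def safety_beyond_trend(baselines, candidate):
    """Safety score of `candidate` minus what the trendline predicts for its capability."""
    slope, intercept = fit_trendline(baselines)
    capability, safety = candidate
    return safety - (slope * capability + intercept)


# Hypothetical (task accuracy, anomaly-detection score) pairs for existing models.
BASELINES = [(0.60, 0.70), (0.70, 0.75), (0.80, 0.80)]

# A more capable model that merely continues the trend: no counterfactual safety gain.
on_trend = safety_beyond_trend(BASELINES, (0.90, 0.85))

# Same safety score at unchanged capability: the gain is above the trendline.
above_trend = safety_beyond_trend(BASELINES, (0.70, 0.85))
```

Under these made-up numbers, both candidates report the same raw safety score, but only the second one improves safety relative to capabilities.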

What types of existing tools/research seem promising?

Before discussing what topics may be important to work on in the future, it may be valuable to reflect on examples of past work that have introduced interpretability tools that seem to be able to competitively provide engineering-relevant insights. Here is a personal list that is subjective and undoubtedly incomplete. But hopefully it is still valuable. Consider this an engineer’s interpretability reading list of sorts.

Some works have competitively done engineering-relevant things with methods for making novel predictions about how a network will handle OOD inputs.

Others involve controlling what a system does by editing or repurposing it.

I have found papers on this list to be very inspiring, and I hope that you might too :)


Previously, EIS III discussed the growing consensus that interpretability research needs more benchmarking. Having rigorous benchmarks that measure how competitive tools are for engineering tasks of practical interest may be one of the most important yet neglected things in the interpretability research space. Recall from EIS III that there are three ways to demonstrate that an interpretability tool is of potential engineering value: 

1. Making novel predictions about how the system will handle interesting inputs. This is what we worked on in Casper et al. (2023). Approaches to benchmarking in this category will involve designing adversaries and detecting trojans. 

2. Controlling what a system does by guiding manipulations to its parameters. Benchmarks based on this principle should involve cleanly implanting and/or removing trojans or changing other properties of interest. Wu et al. (2022) provide benchmarks involving tasks with trojans that are somewhat related to this.

3. Abandoning a system that does a nontrivial task and replacing it with a simpler reverse-engineered alternative. Benchmarks for this might involve using interpretability tools to reconstruct the function that was used to design or supervise a network. This is the kind of thing that EIS VII and Lindner et al. (2023) focus on. 
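For the first two categories, a benchmark needs networks with known, deliberately implanted flaws. Below is a minimal sketch of one common way to set this up: data poisoning with a patch trigger. It is an assumed setup for illustration, not the procedure from Casper et al. (2023) or Wu et al. (2022); the helper names, the patch shape, and the poisoning rate are all hypothetical.

```python
import random


def stamp_trigger(image, patch_value=1.0, size=2):
    """Return a copy of a 2D image with a size x size patch in the bottom-right corner."""
    stamped = [row[:] for row in image]
    for r in range(len(stamped) - size, len(stamped)):
        for c in range(len(stamped[r]) - size, len(stamped[r])):
            stamped[r][c] = patch_value
    return stamped


def poison_dataset(dataset, target_label, rate, seed=0):
    """Stamp the trigger on a random `rate` fraction of examples and relabel them."""
    rng = random.Random(seed)
    n_poison = int(round(rate * len(dataset)))
    chosen = set(rng.sample(range(len(dataset)), n_poison))
    poisoned = []
    for i, (image, label) in enumerate(dataset):
        if i in chosen:
            poisoned.append((stamp_trigger(image), target_label))
        else:
            poisoned.append((image, label))
    return poisoned, chosen


def attack_success_rate(model, dataset, target_label):
    """Fraction of non-target examples the model sends to `target_label` once triggered."""
    victims = [image for image, label in dataset if label != target_label]
    hits = sum(model(stamp_trigger(image)) == target_label for image in victims)
    return hits / len(victims)
```

Because the benchmark designer knows the trigger and the target label, an interpretability tool can be scored on whether it helps someone rediscover them, and an editing method can be scored on how far it moves the attack success rate without hurting clean accuracy.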

Even though some prior work has been done involving these three approaches, there is much more to be done to make well-designed benchmarks that have the potential to be widely adopted.  Also, none of the works mentioned above involve natural language processing except for Lindner et al. (2023). But since large language models are here to stay, we should be working on benchmarking tools for them. I am planning on starting some work with language models similar to Casper et al. (2023) very soon – let me know if you have questions or ideas.

Another thing that may be valuable for benchmarking is automated methods for performing evaluation. In Casper et al. (2023), we relied on human subjects to determine how helpful interpretability tools were, but this was costly in time, money, and effort. We tried to see if CLIP could stand in for a human by taking multiple choice tests, but it did poorly compared to humans. Nonetheless, given the potential improvements to the benchmarking process, working toward automated benchmark evals may be very helpful. 

Red teaming in the wild

The only thing better than testing interpretability tools on good benchmarks is testing them in the real world. There is a laundry list of reasons to work more on this. 

  • Finding real problems in real world systems prevents cherrypicking and requires competitiveness with other approaches.
  • The proof is in the pudding – finding a verified problem in a real system is as good a certificate as any of an interpretability tool’s value (emphasis on verified though). 
  • This work requires engaging more with the world outside of the AI safety research space. It could be a good opportunity to make allies and friends with people working on causes with convergent agendas.
  • This type of work could be an excellent way to gain attention from regulators. One thing that seems strikingly common among AI safety researchers is a desire for some type of governing body that performs auditing and standard setting for risky models in high-stakes deployments. For the same reason that clinical trials mandated by the US FDA reduce risks from new drugs, auditing models before deployment would reduce risks from AI. Norms/mandates for auditing like this would serve to directly identify harms, incentivize developers to not have harmful systems, and delay deployment timelines. 
  • Finally, work on finding problems with real world systems may be especially timely. Right now, as text and image generators are rapidly becoming more well-known, it may be an especially good time to start building society’s collective understanding of the problems with these systems.

It is also worth mentioning that Rando et al. (2022) offer a great example of this type of work, and the Stanford AI Audit Challenge gives an example of a coordinated push for more of it. Lately, I have been thinking about trying to point my research work in a direction of greater possible policy relevance, and I may work on red teaming systems in the wild soon – let me know if you have questions or ideas. 

Finding new tools

Toward a basic understanding

Even though lots of current interpretability work may not be moving in the most useful directions (see EIS VI), it is still important to emphasize the value of work that helps to build conceptual foundations. This isn’t a replacement for engineering, and it shouldn’t be the dominant thing we work on, but it is likely useful nonetheless. Some of this work might lead to new useful tools. For example, useful regularizers to disentangle models might come out of Anthropic’s work in the near future. 

We need more interdisciplinary work.

A relatively neglected way of generating new insights may be working at the intersections of research on interpretability, adversaries, continual learning, modularity, compression, and biological brains. As discussed in the last post, the consistency of connections between these areas seems to be almost uncanny. Working to understand how methods from related fields like these can be used for the interpretability toolbox could be highly valuable both for generating basic insights and practical tools. Emphasizing this type of work might also be a useful way of getting people from different backgrounds interested in useful interpretability research. 

Combining tools

When it comes to the (neglectedness × tractability × importance) calculus, there may be no more compelling type of interpretability work than simply studying how different existing tools combine and interact. Right now, largely as a consequence of not having popular benchmarks, there is very little work that studies the interactions of multiple interpretability tools. How interpretable might a modular network that’s adversarially trained with lateral inhibition and elastic weight consolidation be? How much could we learn about it by combining various forms of feature synthesis, probing, exemplar analysis, etc.? We should find out!

In engineering disciplines, progress is rarely made by the introduction of single methods that change everything. Instead, game-changing innovations usually come from multiple techniques being applied at once. As discussed in EIS III, progress on ImageNet is an excellent example of this. It took combinations of many new techniques – batch normalization, residual connections, inception modules, deeper architectures, hyperparameter engineering, etc. – to improve the performance of CNNs so much so quickly.

There is some precedent for valuable, influential papers that make no novel contributions other than combining old techniques. A good example of this comes from reinforcement learning. Hessel et al. (2017) improved greatly on the previous state of the art on deep Q learning by simply combining a bunch of existing tricks. 

Recall that there are two types of interpretability techniques: intrinsic ones that are applied before or during training to make a network more naturally interpretable, and post hoc ones that are applied to study the network after it has been trained. One of the nice things about having both types of methods is that almost no pairs of intrinsic and post hoc tools conflict. As discussed in EIS V, the AI safety field works surprisingly little on intrinsic interpretability techniques. So there is a lot of room to incorporate these techniques into future safety work. 

A final note – this is all very low-hanging fruit, and this type of work could be valuable for people newer to interpretability work. 

Automated translation/distillation of networks into programs

As discussed in EIS VI, automated model-guided program synthesis may be the key to mechanistic interpretability being engineering-relevant. Telling whether a model’s internal mechanisms conform to a particular hypothesis is the easy part of mechanistic interpretability, and causal scrubbing seems to be a good step toward this. The hard part is the program synthesis, and automating good versions of it should be a high priority for mechanistic interpretability researchers. Toward this goal, applying intrinsic interpretability tools, neurosymbolic techniques, and techniques for distilling networks into simpler models seem to be potentially useful approaches. See EIS VI for the full discussion.

Latent adversarial training 

EIS VII discussed how mechanistic interpretability is not uniquely equipped for solving deceptive alignment and proposed latent adversarial training (Jermyn, 2022) as an alternative. By definition, deceptive alignment failures will not be triggered by inputs that are easy to simulate during development (e.g. factors of RSA-2048). However, there will probably be a substantial amount of neural circuitry dedicated to misbehavior once it is triggered. Instead of ineffective adversarial training or difficult mechanistic interpretability, latent adversarial training might offer a way to automatically prevent the bad behavior during training. It would also have most or all of the same benefits as regular adversarial training discussed in EIS IX.

Considering the potential of this type of work, it seems surprising that it has not been studied much yet in the context of deception. There are already examples of latent adversarial training being used to improve robustness in CNNs and GCNs (Jin et al., 2020; Balunovic et al., 2020; Park et al., 2021) and to improve generalization in language models (Zhu et al., 2019; Liu et al., 2020; Li et al., 2020; He et al., 2020; Pan et al., 2021; Yuan et al., 2021). But to the best of my knowledge, there does not yet exist an example of latent adversarial training being used to solve a deception-like problem.

A good way to approach this would be to take a model and its training dataset and then identify a behavior that, just like deception, does not appear in the training set but is still incentivized by it. Then the only other step is to train against that type of output under latent adversarial perturbations. An example of this could be how language models learn to say certain n-grams that are perfectly normal English despite not actually appearing in the training corpus. Training the language model to never say some set of these n-grams even in the presence of latent adversarial perturbations (1) might work well, and (2) would be the same type of technical problem as fixing deception. 
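To make the training loop concrete, here is a minimal sketch of latent adversarial training on a toy problem. Everything in it is an assumption made for illustration: a tiny two-layer binary classifier stands in for a language model, a binary label stands in for "does not emit the forbidden output", and the adversary takes a single FGSM-style step on the hidden activations within an L-infinity budget `eps`. It is not the formulation from Jermyn (2022), and all function names are hypothetical.

```python
import math


def sign(z):
    return (z > 0) - (z < 0)


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def init_params():
    # Small fixed initial weights so the sketch is deterministic.
    return {
        "W": [[0.1, 0.1], [-0.1, 0.1], [0.1, -0.1]],
        "b": [0.0, 0.0, 0.0],
        "v": [0.1, 0.1, 0.1],
        "c": 0.0,
    }


def hidden(params, x):
    """Latent activations: tanh of an affine map of the input."""
    return [math.tanh(sum(w * xj for w, xj in zip(row, x)) + bi)
            for row, bi in zip(params["W"], params["b"])]


def output(params, h):
    return sigmoid(sum(vi * hi for vi, hi in zip(params["v"], h)) + params["c"])


def latent_attack(params, h, y, eps):
    """FGSM-style step on the hidden activations instead of on the input."""
    p = output(params, h)
    # For cross-entropy on a sigmoid output, dLoss/dh_i = (p - y) * v_i.
    return [eps * sign((p - y) * vi) for vi in params["v"]]


def adversarial_loss(params, x, y, eps):
    h = hidden(params, x)
    delta = latent_attack(params, h, y, eps)
    p = output(params, [hi + di for hi, di in zip(h, delta)])
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def lat_step(params, x, y, eps, lr):
    """One SGD step on the loss under the latent perturbation (held fixed)."""
    h = hidden(params, x)
    delta = latent_attack(params, h, y, eps)
    h_adv = [hi + di for hi, di in zip(h, delta)]
    err = output(params, h_adv) - y  # dLoss/dlogit
    for i, hi in enumerate(h):
        dpre = err * params["v"][i] * (1.0 - hi * hi)  # back through tanh
        params["v"][i] -= lr * err * h_adv[i]
        params["b"][i] -= lr * dpre
        for j, xj in enumerate(x):
            params["W"][i][j] -= lr * dpre * xj
    params["c"] -= lr * err


def train(params, data, eps, lr, epochs):
    for _ in range(epochs):
        for x, y in data:
            lat_step(params, x, y, eps, lr)


def mean_adversarial_loss(params, data, eps):
    return sum(adversarial_loss(params, x, y, eps) for x, y in data) / len(data)


# Hypothetical toy data: label 1 in the upper-right, label 0 in the lower-left.
DATA = [([-2.0, -1.0], 0), ([-1.0, -2.0], 0), ([1.0, 2.0], 1), ([2.0, 1.0], 1)]
```

A typical use would be to call `train(params, DATA, eps=0.1, lr=0.1, epochs=300)` and compare `mean_adversarial_loss` before and after. The design point is that the adversary never has to find an input that triggers the bad behavior; it only has to find a nearby latent state that does.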

I plan to work on this soon. Let me know if you have questions or ideas!


  • Which ideas here do you like? Which do you not like?
  • Would you like to work with me or talk with me about any of these? 
  • Do you have thoughts or ideas on benchmarking LLM interpretability tools or latent adversarial training? These are the two things here that I am most likely to work on in the near future. If you have any, let's talk!
  • What are your thoughts on the theory of change discussed in the Red Teaming in the Wild section: engineering-relevant interpretability tools --> red-teaming real world models --> potential for applications in auditing --> attention and support from governance bodies --> formal requirements for model auditing --> fewer failures, better incentives, and slower deployment timelines?
  • Do you know of any examples of papers like Hessel et al. (2017) that don't introduce new methods but instead study combinations of methods? Engstrom et al. (2019) is another example.
