Part 3 of 12 in the Engineer’s Interpretability Sequence.

Right now, interpretability is a major subfield in the machine learning research community. As mentioned in EIS I, there is so much work in interpretability that there is now a database of 5199 interpretability papers (Jacovi, 2023). You can also look at a survey from some coauthors and me on over 300 works on interpreting network internals (Räuker et al., 2022).

The key promise of interpretability is to offer open-ended ways of understanding and evaluating models that help us with AI safety. And the diversity of approaches to interpretability is encouraging since we want to build a toolbox full of many different useful techniques. But despite how much interpretability work is out there, the research has not been very good at producing competitive practical tools. Interpretability tools lack widespread use by practitioners in real applications (Doshi-Velez and Kim, 2017; Krishnan, 2019; Räuker et al., 2022). 

The root cause of this has much to do with interpretability research not being approached with as much engineering rigor as it ought to be. This has become increasingly well-understood. Here is a short reading list for anyone who wants to see more takes that are critical of interpretability research. This post will engage with each of these more below. 

  1. The Mythos of Model Interpretability (Lipton, 2016)
  2. Towards A Rigorous Science of Interpretable Machine Learning (Doshi-Velez and Kim, 2017)
  3. Explanation in Artificial Intelligence: Insights from the Social Sciences (Miller, 2017)
  4. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead (Rudin, 2018)
  5. Against Interpretability: a Critical Examination of the Interpretability Problem in Machine Learning (Krishnan, 2019)
  6. Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks (Räuker et al., 2022)
  7. Benchmarking Interpretability Tools for Deep Neural Networks (Casper et al., 2023) 

Note that I’m an author on the final two, so references to these papers are self-references. Also, my perspectives here are my own and should not be assumed to necessarily reflect those of coauthors. 

The goal of this post is to overview some broad limitations with interpretability research today. See also EIS V and EIS VI which discuss some similar themes in the context of AI safety and mechanistic interpretability research. 

The central problem: evaluation

The hardest thing about conducting good interpretability research is that it’s not clear whether an interpretation is good or not when there is no ground truth to compare it to. Neural systems are complex, and it’s hard to verify that an interpretation faithfully describes how a network truly functions. So what does it even mean to be meaningfully interpreting a network? There is unfortunately no agreed-upon standard. Motivations and goals of interpretability researchers are notoriously “diverse and discordant” (Lipton, 2018). But here, we will take an engineer’s perspective and consider interpretations to be good to the extent that they are useful.

Evaluation by intuition is inadequate.

Miller (2019) observes that “Most work in explainable artificial intelligence uses only the researchers’ intuition of what constitutes a ‘good’ explanation”. Some papers and posts have even formalized evaluation by intuition. Two examples are Yang et al. (2019) and Kirk et al. (2020) who proposed evaluation frameworks that included a criterion called “persuadability.” This was defined by Yang et al. (2019) as “subjective satisfaction or comprehensibility for the corresponding explanation.” 

This is not a very good criterion from an engineer’s perspective because it only involves intuition. To this day, there is a persistent problem in which sometimes researchers simply look at their results and pontificate about what they mean without putting the interpretations to rigorous tests. A recent example of this from AI safety work is from Elhage et al. (2022) who evaluated a neural interpretability technique by measuring how easily human subjects were able to simply form hypotheses about what roles neurons played in a network. 

The obvious problem with evaluation using human intuition is that it isn’t very good science – it treats hypotheses as conclusions (Rudin, 2019; Miller, 2019; Räuker et al., 2022). But there are related issues that stem from Goodhart’s law. One is that evaluation by intuition can only guide progress toward methods that are good at explaining simple mechanisms that humans can readily grasp. But this fails to select for ones that might be useful for solving the types of difficult or nontrivial problems that are key for AI safety. Evaluation by intuition also encourages cherrypicking, which is common in the literature (Räuker et al., 2022). And to the extent that cherrypicking is the norm, this will only tend to guide progress toward methods that are good in their best-case performance. But if we want reliable interpretability tools, we should be aiming for methods that perform well in the average or worst case.
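To make the best-case versus average/worst-case point concrete, here is a minimal Python sketch. The two methods and all of their scores are made up purely for illustration and do not come from any paper. It only shows that ranking methods by their best example (what cherrypicking rewards) can pick a different winner than ranking by mean or worst-case performance.

```python
from statistics import mean

# Hypothetical scores (e.g. fraction of correct interpretations) for two
# made-up methods on five test cases. Illustrative numbers only.
scores = {
    "method_a": [0.95, 0.20, 0.15, 0.30, 0.10],  # one great showcase example
    "method_b": [0.60, 0.55, 0.65, 0.50, 0.58],  # unremarkable but consistent
}

def summarize(values):
    """Best-case, average-case, and worst-case score for one method."""
    return {"best": max(values), "mean": mean(values), "worst": min(values)}

summary = {name: summarize(vals) for name, vals in scores.items()}

def winner(criterion):
    """Name of the method that scores highest under the given criterion."""
    return max(summary, key=lambda name: summary[name][criterion])

for criterion in ("best", "mean", "worst"):
    print(criterion, "->", winner(criterion))
```

In this toy setup, a paper that showcases only the single best example would favor `method_a`, while the mean and the minimum both favor `method_b`.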

Weak, ad hoc evaluation is not enough either.

Objective evaluation is clearly needed. But just because an evaluation method involves quantitative measurements or testing falsifiable hypotheses doesn’t mean it’s a very valuable one. Evaluation can adhere to the scientific method while still not being useful for engineering. As an example, I confess to doing this myself in some past work (Hod et al., 2021). In order to test how useful different clusterings of neurons might be for studying networks, we solely used proxy measures. And while we did not claim to be "interpreting" the network by doing so, interpretability was our motivation. Another way this problem often appears is by testing on the training proxy. Sometimes researchers evaluate interpretability tools based on the loss function for whatever model, feature, mask, map, clustering, vector, distance, or other thing was optimized during training. Unless the loss in this case is the exact definition of what is cared about, this will lead to Goodharting. More examples are discussed below. 

Again, the main issue here is the obvious one. It’s that not holding interpretability works to engineering-relevant evaluation standards won’t produce methods that are useful for engineering. But another closely related problem is the commonality of ad hoc methods to evaluate tools. The interpretability field probably should, but does not yet, have clear and consistent evaluation methods. Instead, the norm is for every paper’s authors to independently introduce and apply their own approach to evaluation. This allows researchers to only select measures that make their technique look good.

Some examples

Claiming that most interpretability research is not evaluated well is the kind of statement that demands some more concreteness. But showcasing arbitrary examples wouldn’t help much with this point. To try to give an unbiased sense of the state of the field, I went to the accepted papers of NeurIPS 2021 (NeurIPS being the largest AI conference, and 2021 the most recent year for which the full list of papers was available at the time of writing) and searched for all those that had “interpretability” in the title. There were 4, none of which evaluated their techniques in a way that an engineer would find very compelling.

  1. Understanding Instance-based Interpretability of Variational Auto-Encoders (Kong and Chaudhuri, 2021) is about analyzing how influential individual training set examples are for how variational autoencoders handle test examples. The experiments are all conducted on MNIST and CIFAR-10. The evaluation does not involve baselines or benchmarks and consists of (1) a sanity check to verify that examples are judged as high-influence for themselves, (2) using the method to produce similar-looking “proponents” and different-looking “opponents” of testing examples, and (3) using this method for anomaly detection to find that it tends to classify OOD samples as anomalies slightly more than dataset ones on average. (1) is trivial, (2) is intuition-based, and (3) is weak.
  2. Foundations of Symbolic Languages for Model Interpretability (Arenas et al., 2021) is fairly different from the other three papers, and it is not focused on neural nets. It is more of a model checking paper with interpretability implications than a normal ‘interpretability’ paper. The authors introduce a first order logic to describe various properties of models. The only experiments are to evaluate runtime for statement verification on various decision tree models. To be fair to the authors, their goal was to break conceptual and theoretical ground – not to introduce any interpretability tools. But they still do not do anything to show that their framework is a valuable one – all of the arguments for the merits of the framework are theoretical. There is also no discussion of baselines or alternatives, and the only models the paper works with are decision trees. The OpenReview reviews were mostly positive, but one of the reviewers commented “I can't imagine anyone using this for anything useful." I generally agree. 
  3. IA-RED^2: Interpretability-Aware Redundancy Reduction for Vision Transformers (Pan et al., 2021) seeks to reduce redundant computations inside of vision transformers. It introduces a dynamic inference method to preprocess image patches to exclude redundant ones. The paper introduces a method for speeding up forward passes through transformers which is nice. But there is little substance on interpretability. The technique is used for feature attribution by constructing saliency maps to highlight what parts of an input image are non-redundant (i.e. “important”) to the model. The authors assert, “our method is inherently interpretable with substantial visual evidence.” And when comparing examples from this method to other feature attribution techniques, they write “We find that our method can obviously better interpret the part-level stuff of the objects of interest.” I don’t find it obvious at all – see the figure below and decide for yourself. Their argument is just based on intuition and maybe even a little wishful thinking. 
  4. Improving Deep Learning Interpretability by Saliency Guided Training (Ismail et al., 2021) is about training networks with a form of regularization that encourages them to have more sparse, lucid saliency maps. It evaluates the saliency maps of trained networks on quantitative measures of how well the features covered by the maps cover all and only the important features needed for classification. This is quantitative but weak because it fails to connect the method to something of practical use. This paper tests on the training proxy, and it compares the method to a random baseline but no comparable ones from previous works. There is actually a significant amount of literature (see page 9 of Räuker et al., 2022) that predates this work and connects the same type of regularization that these authors study to useful improvements in adversarial robustness. But the authors of this paper did not discuss, experiment, or cite anything related to improving robustness. 

Pan et al. (2021) claim that the feature attributions from their technique are “obviously better” than alternatives. 

In summary, none of the four papers meaningfully evaluates its methods by connecting them to anything of practical value. And to be clear, I only considered these 4 papers for this purpose – I didn’t cherrypick among selection methods. This is not to say that these papers are bad, uninteresting, or cannot be useful. But from the standpoint of an engineer who wants interpretability research to be rigorously approached and practically relevant, they all fall short of this goal.

Evaluation with meaningful tasks is needed

An example

Suppose I visualize a neuron in a CNN and that it looks like dogs to me. 

From Olah et al. (2017)

Then suppose I say,

“Nice, my feature visualization tool works! Look at this dog neuron it identified.”

If I stopped at this point, this would just be intuition and pontification. And while this may not be a bad hypothesis, it can’t yet make for a conclusion.

Then say I pass some images through the network, look at the results, and say, 

“Just as I predicted – the neuron responds more consistently to dog images than non-dog ones.”

This is still not enough. It’s too weak and ad hoc. From an engineer’s perspective, it’s not yet meaningful to say the neuron is a dog neuron unless I do something useful with that interpretation. And there are plenty of ways that a neuron which correlates with dog images could be doing something much more complicated than it seems at first. Olah et al. (2017) acknowledge this. See also Bolukbasi et al. (2021) for examples of such “interpretability illusions.”

But then finally, suppose I ablate the neuron from the network, run another experiment, and remark,

“Aha! When I removed the neuron, the network stopped being able to classify dogs correctly but still performs the same on everything else. The same is true for OOD dog data.”

Now we’re talking!
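The ablation test in this thought experiment can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: the tiny hand-built network, its weights, the synthetic "dog / cat / car" data, and the helper names are assumptions made for illustration, not anyone's actual model or method. The point is only the shape of the test: zero out one hidden unit, then compare per-class accuracy before and after, on both in-distribution and shifted data.

```python
import random

CLASSES = ["dog", "cat", "car"]

# Hypothetical toy classifier: 3 inputs -> 3 ReLU hidden units -> 3 logits.
# The weights are hand-picked so that hidden unit 0 carries the dog signal.
W1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]        # hidden x input
W2 = [[2.0, -0.5, -0.5], [-0.5, 2.0, -0.5], [-0.5, -0.5, 2.0]]  # class x hidden

def forward(x, ablate=None):
    """Return the predicted class index, optionally zeroing one hidden unit."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    if ablate is not None:
        hidden[ablate] = 0.0  # the ablation: remove this unit's contribution
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W2]
    return max(range(len(logits)), key=logits.__getitem__)

def make_data(n_per_class, signal, noise, seed):
    """Synthetic inputs: small noise everywhere, plus a signal on the true class."""
    rng = random.Random(seed)
    data = []
    for label in range(len(CLASSES)):
        for _ in range(n_per_class):
            x = [rng.uniform(0.0, noise) for _ in range(3)]
            x[label] += signal
            data.append((x, label))
    return data

def per_class_accuracy(data, ablate=None):
    """Accuracy for each class, with or without the ablation applied."""
    correct = [0] * len(CLASSES)
    total = [0] * len(CLASSES)
    for x, y in data:
        total[y] += 1
        correct[y] += int(forward(x, ablate) == y)
    return {CLASSES[c]: correct[c] / total[c] for c in range(len(CLASSES))}

in_dist = make_data(50, signal=1.0, noise=0.3, seed=0)
shifted = make_data(50, signal=3.0, noise=0.5, seed=1)  # stand-in for OOD data

for name, data in [("in-distribution", in_dist), ("shifted", shifted)]:
    print(name, "intact: ", per_class_accuracy(data))
    print(name, "ablated:", per_class_accuracy(data, ablate=0))
```

In this constructed example, ablating unit 0 destroys dog accuracy while leaving cat and car accuracy unchanged on both datasets, which is the pattern that would support the "dog neuron" interpretation. A real network would rarely be this clean, which is exactly why the test is informative.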

If we want interpretability tools to help us do meaningful, engineering-relevant things with networks, we should establish benchmarks grounded in useful tasks to evaluate them for these capabilities. 

There is a growing consensus that more rigorous methods to evaluate interpretability tools are needed (Doshi-Velez & Kim, 2017; Lipton, 2018; Miller, 2019; Hubinger, 2021; Krishnan, 2020; Hendrycks & Woodside, 2022; CAIS, 2022; Räuker et al., 2022). So what does good evaluation look like? Evaluation tools should measure how competitive interpretability tools are for helping humans or automated processes do one of the following three things.

  1. Making novel predictions about how the system will handle OOD inputs. This could include designing adversaries, discovering trojans, or predicting model behavior on interesting OOD inputs.
  2. Controlling what a system does by guiding edits to it. This could involve cleanly implanting trojans, removing trojans, or making the network do other novel things via manual changes or targeted forms of fine-tuning. 
  3. Abandoning a system that does a nontrivial task and replacing it with a simpler reverse-engineered alternative. This would mean showing that a system or subsystem can be replaced with something simpler such as a sparse network, linear model, decision tree, program, etc.

Notably, these three things logically partition the space of possible approaches: working with the inputs, working with the system, or getting rid of the whole thing and using something else. 
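As a rough illustration of the third item, one way to check a proposed replacement is to measure how often it agrees with the original system, both on familiar inputs and on inputs from outside that range. The sketch below is a toy under stated assumptions: `black_box`, `surrogate`, and the input ranges are all hypothetical and hand-constructed, and agreement rate is just one possible fidelity measure.

```python
import random

def relu(z):
    return max(0.0, z)

def black_box(x0, x1):
    """Hypothetical 'network': behaves like a simple threshold rule for small x0,
    but a second unit switches on and changes its behavior once x0 exceeds 3."""
    h = relu(x0 + x1 - 1.0)
    g = relu(x0 - 3.0)
    return int(h - 4.0 * g > 0.5)

def surrogate(x0, x1):
    """Hypothetical reverse-engineered replacement: a single threshold rule."""
    return int(x0 + x1 > 1.5)

def agreement(sampler, n=1000, seed=0):
    """Fraction of sampled inputs on which the surrogate matches the black box."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(n):
        x0, x1 = sampler(rng)
        matches += int(black_box(x0, x1) == surrogate(x0, x1))
    return matches / n

def in_distribution(rng):
    return rng.uniform(0.0, 2.0), rng.uniform(0.0, 2.0)

def out_of_distribution(rng):
    return rng.uniform(5.0, 6.0), rng.uniform(0.0, 2.0)

fidelity_in = agreement(in_distribution)
fidelity_ood = agreement(out_of_distribution)
print("agreement in-distribution:", fidelity_in)
print("agreement out-of-distribution:", fidelity_ood)
```

Here the simple rule matches the toy system on the familiar range but disagrees with it on the shifted range, so a replacement that looks faithful on one dataset can still fail to capture what the system does elsewhere.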

Meaningful benchmarking in interpretability is almost nonexistent, but benchmarks are important for driving progress in a field. They concretize research goals, give indications of what approaches are the most useful, and spur community efforts (Hendrycks and Woodside, 2022).

To help demonstrate the value of benchmarking, some coauthors and I recently finished a paper (Casper et al., 2023). We use strategy #1 above and evaluate interpretability tools based on how helpful they are to humans who want to rediscover interpretable trojans. A useful thing about this benchmarking task is that trojan triggers can be arbitrary and may not appear in a particular dataset. So novel triggers cannot be discovered by simply analyzing the examples from a dataset that the network mishandles. Thus, rediscovering them mirrors the practical challenge of finding flaws that evade detection with a test set. In other words – this benchmarking task tests competitiveness for debugging. 

We tested 9 different feature synthesis methods (rows) on 12 different trojans (columns). In the table below, each cell gives the proportion of the time that a method helped humans correctly identify a trojan trigger in a multiple choice test. See the paper for details.

From Casper et al. (2023)

Notice two things in the data. First, some methods perform poorly including TABOR (Guo et al., 2019) and three of the four feature visualization (FV) methods (Olah et al., 2017; Mordvintsev et al., 2018). So this experiment demonstrates how benchmarks can offer information about what does and doesn’t work well. Second, even the methods that do relatively well still fail to achieve a 50% success rate on average, so there is still more work to do to make these types of tools very reliable. From an engineer’s perspective, this is all valuable information.
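For readers who want to see how a table of this kind is put together, here is a minimal Python sketch of the aggregation step. The records, method names, and trojan names are invented placeholders, not data from Casper et al. (2023), and the record format is an assumption. Each cell is the fraction of multiple choice answers that were correct for one (method, trojan) pair, and each method's overall score is the mean of its cells.

```python
from collections import defaultdict

# Hypothetical survey records: (method, trojan, answered_correctly).
responses = [
    ("method_x", "trojan_1", True), ("method_x", "trojan_1", False),
    ("method_x", "trojan_2", False), ("method_x", "trojan_2", False),
    ("method_y", "trojan_1", True), ("method_y", "trojan_1", True),
    ("method_y", "trojan_2", True), ("method_y", "trojan_2", False),
]

def success_table(records):
    """Map each (method, trojan) pair to its fraction of correct answers."""
    counts = defaultdict(lambda: [0, 0])  # (method, trojan) -> [correct, total]
    for method, trojan, correct in records:
        counts[(method, trojan)][0] += int(correct)
        counts[(method, trojan)][1] += 1
    return {key: c / t for key, (c, t) in counts.items()}

def method_means(table):
    """Average each method's success rate across the trojans it was tested on."""
    per_method = defaultdict(list)
    for (method, _trojan), rate in table.items():
        per_method[method].append(rate)
    return {m: sum(rates) / len(rates) for m, rates in per_method.items()}

table = success_table(responses)
means = method_means(table)
for method, rate in sorted(means.items()):
    print(f"{method}: mean success rate {rate:.2f}")
```

When reading such numbers, it helps to compare them against the chance rate of the multiple choice test (one over the number of options), since a method that scores near chance is not helping at all.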

There are many interpretability tools out there, so why did we only test 9 based on feature synthesis? This is because these 9 were the only ones we knew of that are suited for this task at all. Most interpretability tools are only useful for analyzing how a network works on either specific examples or on a specific dataset (Räuker et al., 2022). In fact, very few are useful for studying how a network may (mis)behave on novel inputs. Only feature synthesis methods can be competitive for identifying novel trojan triggers because no non-synthesis method can give insights off a given data distribution. And when it comes to aligning highly intelligent and potentially deceptive systems, it seems likely that the failures that are difficult to find are going to be due to inputs well off the training distribution.

Other problems

Not scaling

Many interpretability tools have only been demonstrated to work at a small scale such as small MLPs trained on MNIST or small transformers trained on toy problems. But simple networks performing simple tasks can only be deployed in a limited number of settings of any practical consequence, and they often should be replaced with other intrinsically interpretable, non-network models (Rudin, 2018). Working at a small scale is usually a prerequisite to scaling things up later, and some lessons that can be learned from small experiments may offer excellent inspiration for future work. But unless there exists a realistic pathway from research at a small scale to more useful work at a large one, small-scale work seems to be of little direct value. 

Relying too much on humans

Most approaches to interpretability rely on a human somewhere in the loop. And in some cases like much mechanistic interpretability work, an immense amount of human involvement is typically required. But if the goal of interpretability is to rigorously obtain a useful understanding of large systems, human involvement needs to be efficient. Ideally, humans should be used for screening interpretations instead of generating them. Or maybe we don’t need humans at all. This possibility will be discussed more in future posts.

Failing to combine techniques

Most interpretability techniques can be combined with most others. Why just use one technique or one type of evidence to examine when you can have a bunch? Our goal for interpretability should be to design a useful toolbox – not a silver bullet. And notice above in our figure from Casper et al. (2023) that the best results overall come from combining all of the 9 methods. Unfortunately, the large majority of work in interpretability focuses on studying tools individually. But combining different methods seems to be a useful way to make better engineering progress.

Consider an example. In the 2010s, immense progress was made on ImageNet classification. But the improvements came not from any single technique but from a combination of breakthroughs like batch normalization, residual connections, inception modules, deeper architectures, improved optimizers, etc. Similarly, we should not expect to best advance interpretability without a combination of methods.

A lack of practical applications

Our ultimate goal for interpretability tools is to use them in the real world, so it only makes sense to do more practical work. It’s worth noting that the sooner we can get interpretability tools to be relevant in the real world, the sooner that actors in AI governance can think concretely about ways to incorporate standards related to interpretability into the regulatory regime. 


  • Are there any papers you would add to the reading list of critical works at the beginning of this post?
  • Do you think there are any approaches to interpretability that this post isn’t charitable enough to? Why?
  • Do you know of any particularly interesting examples of intuition-based or weak/ad-hoc approaches to evaluating interpretability tools?
  • What do you find surprising or unsurprising about our results from Casper et al. (2023)? Would you have predicted that TABOR and feature visualization would struggle? Would you have predicted that robust feature level adversaries and SNAFUE would be the most effective? Would you have predicted that all of the methods would succeed less than 50% of the time?
2 comments

Have you read the Redwood post on causal scrubbing? To me, it's an excellent example of evaluating interpretability using something other than intuition.

Thanks. I'll talk in some depth about causal scrubbing in two of the upcoming posts which narrow down discussion specifically to AI safety work. I think it's a highly valuable way of measuring how well a hypothesis seems to explain a network, but there are some pitfalls with it to be aware of.