Part 5 of 12 in the Engineer’s Interpretability Sequence.

Thanks to Anson Ho, Chris Olah, Neel Nanda, and Tony Wang for some discussions and comments. 

TAISIC = “the AI safety interpretability community”

MI = “mechanistic interpretability” 

Most AI safety interpretability work is conducted by researchers in a relatively small number of places, and TAISIC is closely connected by personal relationships and the AI Alignment Forum. Much of the community is focused on a few specific approaches like circuits-style MI, mechanistic anomaly detection, causal scrubbing, and probing. But this is a limited set of topics, and TAISIC might benefit from broader engagement. In the Toward Transparent AI survey (Räuker et al., 2022), we wrote 21 subsections of survey content. Only 1 was on circuits, and only 4 consisted in significant part of work from TAISIC.

I have often heard people in TAISIC explicitly advising more junior researchers to not focus much on reading from the literature and instead to dive into projects. Obviously, experience working on projects is irreplaceable. But not engaging much with the broader literature and community is a recipe for developing insularity and blind spots. I am quick to push back against advice that doesn’t emphasize the importance of engaging with outside work. 

Within TAISIC, I have heard interpretability research described as dividing into two sets: mechanistic interpretability and, somewhat pejoratively,  “traditional interpretability.” I will be the first to say that some paradigms in interpretability research are unproductive (see EIS III-IV). But I give equal emphasis to the importance of TAISIC not being too parochial. Reasons include maintaining relevance and relationships in the broader community, drawing useful inspiration from past works, making less-correlated bets with what we focus on, and most importantly – not reinventing, renaming, and repeating work that has already been done outside of TAISIC. 

TAISIC has reinvented, reframed, or renamed several paradigms

Mechanistic interpretability requires program synthesis, program induction, and/or programming language translation 

“Circuits”-style MI is arguably the most popular and influential approach to interpretability in TAISIC. Doing this work requires iteratively (1) generating hypotheses for what a network is doing and then (2) testing how well these hypotheses explain its internal mechanisms. Step 2 may not be that difficult, and causal scrubbing (discussed below) seems like a type of solution that will be useful for it. But step 1 is hard. Mechanistic hypothesis generation is a lot like doing program synthesis, program induction, and/or programming language translation. 

Generating mechanistic hypotheses requires synthesizing programs to explain a network using its behavior and/or structure. If a method for this involves synthesizing programs based on the task or I/O from the network, it is a form of program synthesis or induction. And if a method is based on using a network’s structure to write down a program to explain it, it is very similar to programming language translation. 
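To make the analogy concrete, here is a minimal sketch of hypothesis generation framed as program synthesis from I/O behavior. Everything in it is an illustrative assumption: the four-primitive DSL, the `synthesize` helper, and the `black_box` function standing in for a trained network. It is a brute-force enumerator, not a method from the program synthesis literature.

```python
from itertools import product

# Hypothetical mini-DSL of unary integer functions (names are illustrative only).
PRIMITIVES = {
    "inc": lambda v: v + 1,
    "double": lambda v: v * 2,
    "square": lambda v: v * v,
    "neg": lambda v: -v,
}

def run(program, v):
    """Apply a sequence of primitive names to v, left to right."""
    for name in program:
        v = PRIMITIVES[name](v)
    return v

def synthesize(black_box, probes, max_len=3):
    """Return the shortest program agreeing with black_box on all probes, or None."""
    targets = [black_box(v) for v in probes]
    for length in range(1, max_len + 1):
        # Enumerate every program of this length: len(PRIMITIVES) ** length candidates.
        for program in product(PRIMITIVES, repeat=length):
            if [run(program, v) for v in probes] == targets:
                return program
    return None

# Stand-in for a trained network whose behavior we want to explain.
def black_box(v):
    return (v + 1) ** 2 * 2

program = synthesize(black_box, probes=range(-3, 4))
print(program)
```

The candidate space here grows as 4^length, so even this toy search becomes impractical after a few more primitives or a slightly longer program. That exponential growth is the scaling problem discussed next.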

In general, program synthesis and program induction are very difficult and currently fail to scale to large problems. This is well-understood, and these fields are mature enough so that we have textbooks on them and how difficult they are (e.g. Gulwani et al., 2017). Meanwhile, programming language translation is very challenging too. In practice, translating between common languages (e.g. Python and Java) is only partially automatable and relies on many hand-coded rules (Qiu, 1999), and using large language models has had limited success (Roziere et al., 2020).  And in cases like these, both the source and target language are discrete and easily interpretable. Since this isn’t the case for neural networks, we should expect things to be more difficult for translating them into programs. 

It is unclear to what extent the relationships between MI and program synthesis, induction, and language translation are understood inside of TAISIC. I do not know of this connection being pointed out before in TAISIC. But understanding this seems important for seeing why MI is difficult and likely to stay that way. MI work in TAISIC has thus far been limited to explaining simple (sub)processes. In cases like these, the program synthesis part of the problem is very easy for a human to accomplish manually. But if a problem can be solved by a program that a human can easily write, then it is not one that we should be applying deep learning to (Rudin, 2018). There will be a much more in-depth discussion of this problem in EIS VI.

If MI work is to be more engineering-relevant, we need automated ways of generating candidate programs to explain how neural networks work. The good news is that we don’t have to start from scratch. The program synthesis, induction, and language translation literatures have been around long enough that we have textbooks on them (Gulwani et al., 2017; Qiu, 1999). And there are also notable bodies of work in deep learning that focus on extracting decision trees from neural networks (e.g. Zhang et al., 2019), distilling networks into programs in domain specific languages (e.g. Verma et al., 2018; Verma et al., 2019; Trivedi et al., 2021), and translating neural network architectures into symbolic graphs that are mechanistically faithful (e.g. Ren et al., 2021). These are all automated ways of doing the type of MI work that people in TAISIC want to do. Currently, some of these works (and others in the neurosymbolic literature) seem to be outpacing TAISIC on its own goals. 

When highly intelligent systems in the future learn unexpected, harmful behaviors, characterizing the neural circuitry involved will probably not be simple like the current MI work that TAISIC focuses on. We should not expect solving toy MI problems using humans to help with real world MI problems any more than we should expect solving toy program synthesis problems using humans to help with real world program synthesis problems. As a result, automating model-guided hypothesis generation seems to be the only hope that MI research has to be very practically relevant. It may be time for a paradigm shift in TAISIC toward symbolic methods. But the fact that existing neurosymbolic work has not yet scaled or been very useful for many practical problems seems to signify difficulties ahead.

Causal scrubbing, compression, and frivolous subnetworks

The above section discussed how MI can be divided into a program generation component and a hypothesis verification component. And when it comes to hypothesis verification, causal scrubbing (Chan et al., 2022) is an exciting approach. It seems to have the potential to be tractable and valuable for this goal. 

If our goal is rigorous MI, causal scrubbing can only be as good as the hypotheses that go into it. Relying on hypotheses that are too general will prevent it from being a very precise tool. And this might be fine. For loose goals such as mechanistic anomaly detection, hypotheses that are merely decent may still be useful for flagging anomalous forward passes through a network. Maybe the production of such decent hypotheses can be automated, and they may do a perfectly fair job of capturing useful mechanisms. 
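As a rough illustration of why the quality of the hypothesis matters, here is a toy resampling-ablation check in the spirit of causal scrubbing. It is not the algorithm from Chan et al. (2022): the two-unit "network", the `scrubbed_loss` helper, and the hypotheses are all hand-built assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "network" (hypothetical): two hidden units, but only h0 feeds the output.
def hidden(x):
    return np.stack([x[:, 0] + x[:, 1], x[:, 2]], axis=1)

def readout(h):
    return 3.0 * h[:, 0] + 0.0 * h[:, 1]

def scrubbed_loss(x, keep):
    """Resample every hidden unit NOT in `keep` from a shuffled copy of the batch,
    then measure how far the output moves from the unscrubbed output."""
    h = hidden(x)
    h_other = h[rng.permutation(len(x))]
    h_scrubbed = h_other.copy()
    h_scrubbed[:, keep] = h[:, keep]
    return float(np.mean((readout(h_scrubbed) - readout(h)) ** 2))

x = rng.normal(size=(512, 3))
loss_good = scrubbed_loss(x, keep=[0])  # hypothesis: only h0 matters
loss_bad = scrubbed_loss(x, keep=[1])   # hypothesis: only h1 matters
print(loss_good, loss_bad)
```

Note that the overly general hypothesis `keep=[0, 1]` would also pass with zero loss while saying nothing about which unit does the work, which is the sense in which the test is only as informative as the hypothesis fed into it.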

But we should be careful. Some causal scrubbing work has been explored using things like gradients, perturbations, ablations, refactorizations, etc. to find parts of the network that can be scrubbed away. But this may not be a very novel or useful approach to hypothesis generation. This particular approach is just a form of network compression. And just because a compressed version of a network seems to accomplish some task does not mean that there is some meaningful mechanism behind it. Ramanujan et al. (2020) showed that randomly initialized networks could be “trained” simply by pruning all of the weights that harmed performance on the task of interest. The resulting subnetwork may accomplish a task of interest, but only in a frivolous sense, and it should not be expected to generalize. So just because a subnetwork in isolation seems to do something doesn’t mean that it really performs that task. This is a type of interpretability illusion. 
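A minimal sketch of the phenomenon: take a random, never-trained network and greedily zero out any weight whose removal lowers the loss. This is a simplified stand-in for illustration, not the edge-popup procedure that Ramanujan et al. (2020) actually use, and the task, layer sizes, and loss are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task (hypothetical): XOR-like labels on 2-D points.
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)

# Random, never-trained two-layer network.
W1 = rng.normal(size=(2, 32))
W2 = rng.normal(size=(32,))

def loss(mask1, mask2):
    """Mean squared error of the masked network's sigmoid output."""
    h = np.maximum(X @ (W1 * mask1), 0.0)
    logits = h @ (W2 * mask2)
    p = 1.0 / (1.0 + np.exp(-logits))
    return float(np.mean((p - y) ** 2))

mask1, mask2 = np.ones_like(W1), np.ones_like(W2)
initial = loss(mask1, mask2)

# Greedy pruning: zero out a weight whenever doing so lowers the loss.
best = initial
for mask in (mask1, mask2):
    for idx in np.ndindex(mask.shape):
        mask[idx] = 0.0
        trial = loss(mask1, mask2)
        if trial < best:
            best = trial
        else:
            mask[idx] = 1.0  # restore the weight; pruning it did not help

print(initial, best)
```

The surviving subnetwork fits the data better than the full random network even though no weight was ever updated, so a subnetwork found by selection alone is weak evidence of a mechanism the original network uses.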

Polysemanticity and superposition = entanglement

This section is a bit longwinded, but the TL;DR is that TAISIC has done a lot of work on “polysemanticity” and “superposition” in neural networks, but this work is not as novel as it may seem in light of previous work on “entanglement.”

In 2012 Bengio et al. described and studied the “entanglement” of representations among different neurons in networks. To the best of my knowledge, this was the first use of this term in deep learning (although the rough concept goes back to at least Bengio and LeCun (2007)). Since then, there has been a great deal of literature on entanglement – enough for a survey from Carbonneau et al. (2022). See also the disentanglement section from the Toward Transparent AI survey (Räuker et al., 2022). Locatello et al. (2019) describe the goals of this literature as such (parenthetical citations removed for readability):

[Disentangled representations] should contain all the information present in x in a compact and interpretable structure while being independent from the task at hand. They should be useful for (semi-)supervised learning of downstream tasks, transfer, and few shot learning. They should enable us to integrate out nuisance factors, to perform interventions, and to answer counterfactual questions.

Does this sound familiar? 

In 2016 Arora et al. described and studied embeddings of words that have multiple semantic meanings. They described these words as “polysemous” and their embeddings as in “superposition.” To the best of my knowledge, this was the first use of “polysemous” and “superposition” to describe embeddings and embedded concepts in deep learning. And to my knowledge, Arora et al. (2016) was the only work prior to TAISIC’s work in 2017 on this topic. 

Later on, Olah et al. (2017) characterized neurons which seem to detect multiple unrelated features, and later, Olah et al. (2020) described neurons that seem to respond to multiple unrelated features as “polysemantic.” Olah et al. (2020) write:

Our hope is that it may be possible to resolve polysemantic neurons, perhaps by “unfolding” a network to turn polysemantic neurons into pure features, or training networks to not exhibit polysemanticity in the first place. 

Olah et al. (2020) also used the term “superposition”:

Polysemantic neurons…seem to result from a phenomenon we call “superposition” where a circuit spreads a feature across many neurons

And things are even muddier than this. Thorpe (1989) studied how embeddings can densely represent a larger number of distinct concepts than they have dimensions under the term “distributed coding.” And Losch et al. (2019) describe a process for creating a disentangled latent layer as “semantic bottlenecking.” I don’t know how many other terms in various literatures describe similar concepts as entanglement, polysemanticity, superposition, distributed coding, and bottlenecking. And I don’t care much to sift through things thoroughly enough to find out. Instead, the point here is that in light of the literature on entanglement, many of the contributions that TAISIC has made related to polysemanticity and superposition are not very novel. 
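The shared idea behind these terms, representing more concepts than there are dimensions, can be sketched in a few lines. The sizes below and the use of random unit directions are assumptions for illustration; no particular paper's setup is being reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 64, 16  # more features than dimensions (hypothetical sizes)

# One random unit-norm direction per feature.
W = rng.normal(size=(n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Interference between features = off-diagonal cosine similarities.
cos = W @ W.T
interference = np.abs(cos - np.eye(n_features)).max()

# If exactly one feature is active, it can still be read off by projection.
recovered = [int(np.argmax(W @ W[i])) for i in range(n_features)]

print(f"max interference: {interference:.2f}")
print("all single features recovered:", recovered == list(range(n_features)))
```

The directions cannot all be orthogonal, so every coordinate carries a mix of features. Whether one calls that mixing entanglement, superposition, or distributed coding, the object being described is the same.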

Olah et al. (2017) and Olah et al. (2020) did not do a thorough job of engaging with the entanglement literature. The only mention of it made by either was from Olah et al. (2020), which wrote without citation:

This is essentially the problem studied in the literature of disentangling representations…At present that literature tends to focus on known features in the latent spaces of generative models.

Although it should be noted that this blog post from 2017 also discussed "superposition."

Based on my knowledge of the entanglement literature, it is true that most but not all papers using the term study autoencoders. But it is not clear why this matters from the perspective of studying entanglement, polysemanticity, and superposition. Besides, an entangled encoder can be used to extract features for a classifier. This is just a form of “bottlenecking” (Losch et al., 2019) – another concept that predates Olah et al. (2020).

To be clear, it seems that the authors of Olah et al. (2017) and Olah et al. (2020) were aware of the entanglement literature, and later, their discussion of related work in Elhage et al. (2022) was much more thorough. But ultimately, Olah et al. (2017) and Olah et al. (2020) did not very thoroughly engage with the entanglement literature. And when Olah et al. (2017) and Olah et al. (2020) were written, the term “entanglement” was much more standard in the deep learning literature than “polysemanticity” and “superposition.”

Details (which I could be wrong about) and speculation (ditto) aside, two different groups of AI researchers have now been working on the same problems under different names, and this isn't good. The mainstream one uses “entanglement” while TAISIC uses “polysemanticity” and “superposition.” Terminology matters, and it may be the case that TAISIC’s terminology has caused a type of generational isolation among different groups of AI researchers.

There is a lot of useful literature on both supervised and unsupervised entanglement. Instead of listing papers, I’ll refer anyone interested to page 7 of the Toward Transparent AI survey (Räuker et al., 2022). Some researchers in TAISIC may find valuable insights from these works. 

One disentanglement method that has come from TAISIC is the softmax linear unit activation function from Elhage et al. (2022). They train a network to be more disentangled using an activation function that makes neurons in the same layer compete for activations. Lateral inhibition being used as a solution to entanglement is nothing new. Again, see page 7 of the Toward Transparent AI survey (Räuker et al., 2022). And a fun fact is that even AlexNet (Krizhevsky et al., 2012) used a form of lateral inhibition called “local response normalization.” But Elhage et al. (2022) engages very little with prior work like this in its discussion of related work. It gives the impression that their technique is more novel than it is. 
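For readers who have not seen it, the competition mechanism is short enough to write down. This sketch shows only the core x * softmax(x) form of the softmax linear unit; the layer normalization that follows it in Elhage et al. (2022) is omitted, and the example input is arbitrary.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def solu(x):
    """Softmax linear unit: each pre-activation is scaled by its softmax weight."""
    return x * softmax(x)

x = np.array([1.0, 2.0, 4.0])
out = solu(x)
print(out / x)  # fraction of each activation that survives the competition
```

The largest pre-activation keeps most of its value while smaller ones in the same layer are pushed toward zero, which is lateral inhibition in the same broad sense as local response normalization.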

The whole saga involving distributed coding, entanglement, polysemanticity, superposition, and bottlenecking serves as an example of how powerful terminology can be in influencing how the research community understands and approaches problems. This story highlights the importance of engaging thoroughly with previous works and being careful about terminology. 

Deceptive alignment ≈ trojans

This discussion will be short because deception will be the main focus of EIS VIII. But spoiler alert: detecting and fixing deception is an almost identical technical problem to detecting and fixing trojans. The only difference is that deceptiveness typically results from an inner alignment failure while trojans are typically implanted with data poisoning which simulates an outer alignment failure. From an engineering standpoint though, this difference is often tenuous. This isn’t a major blind spot per se – many researchers in TAISIC understand this connection and are doing excellent work with trojans. TAISIC should do its best to ensure that this connection is more universally understood. 

Unsupervised contrast consistent search = self-supervised contrastive probing

One recent paper from TAISIC presents a way to train a classifier that predicts when models will say dishonest things based on their inner activations (Burns et al., 2022). This type of approach seems promising. But the paper names its method “contrast consistent search” and describes it as “unsupervised,” and I have nitpicks with both. The first is that “contrast consistent search” is much better described as “contrastive probing.” While the paper refers to the probe as a “probe,” the related works and citations do not engage with the probing literature, even though non-supervised probing has been done before (e.g. Hoyt et al. (2021)). Second, this method is not exactly “unsupervised.” It is better described as self-supervised because it requires using paired true and false statements. See Jaiswal et al. (2021) titled A Survey on Contrastive Self-Supervised Learning for definitions. In future work, it will be useful to name methods and discuss related work in ways that minimize the possibility of confusion or isolation.
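The self-supervised character of the method is visible in its objective. Below is a minimal sketch of the consistency-plus-confidence loss as I understand it from Burns et al. (2022); the function name `ccs_loss` and the probe outputs passed in are placeholders, and the probe itself and the activation normalization are left out.

```python
import numpy as np

def ccs_loss(p_pos, p_neg):
    """Consistency + confidence terms on probe outputs for a statement and its negation."""
    # Consistency: p(statement) and p(negation) should sum to 1.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: discourage the degenerate solution p_pos = p_neg = 0.5.
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))

print(ccs_loss(np.array([1.0]), np.array([0.0])))  # → 0.0 (consistent and confident)
print(ccs_loss(np.array([0.5]), np.array([0.5])))  # → 0.25 (consistent but uninformative)
```

No truth labels appear anywhere in the loss, but it is only defined on pairs built from a statement and its negation. That pairing is the supervisory signal, which is why "self-supervised" is the more standard description.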

Why so little work on intrinsic interpretability?

There are two basic approaches to interpretability. Intrinsic interpretability techniques involve designing/training models to be easier to study in the first place while post hoc interpretability techniques involve interpreting models after they have been trained. The Toward Transparent AI survey (Räuker et al., 2022) divides its discussion of methods into intrinsic and post hoc ones if you would like to look into this more. 

Some great news is that because intrinsic interpretability techniques operate on the model before or during training and post hoc ones operate on it after, combining intrinsic and post hoc methods almost always works well! And given this, it’s odd that with some exceptions (e.g. Elhage et al. (2022)), the large majority of work from TAISIC is on post hoc methods. Maybe it is because of some founder effects plus how TAISIC is still fairly small. In the Toward Transparent AI survey (Räuker et al., 2022) we also speculate about how a lack of benchmarking means a lack of incentive for results-focused work which means a lack of incentive for studying useful synergies between novel combinations of non-novel methods. 

But whatever the reason, TAISIC should do more work to study intrinsic interpretability tools and combine them with post hoc analysis. The main reason is the obvious one – that this may significantly improve interpretability results. But this should also be of particular interest to MI researchers. Recall the discussion above about how automating model-guided program synthesis may be necessary if circuits-style MI is to be useful. Designing more intrinsically interpretable systems may be helpful for this. It also seems to be fairly low-hanging fruit. Many intrinsic interpretability methods (e.g. modular architectures, pruning, some regularization techniques, adversarial training) are simple to implement but have rarely been studied alongside post hoc interpretability tools.


  • Do you know of any other examples from TAISIC of reinvented, reframed, or renamed paradigms? Do you know of other notable examples of this outside of TAISIC?
  • Do you agree or disagree with the claim that program generation is the crucial step in mechanistic interpretability? Do you agree or disagree with the claim that mechanistic interpretability research in TAISIC is currently not addressing this very well?
  • Do you know of any past work discussing how mechanistic interpretability involves program synthesis, induction, and/or language translation?
  • Are you or anyone you know working on neurosymbolic approaches to mechanistic interpretability? 
  • Do you know of any deep learning works prior to 2012 that use the term “entanglement”? Do you know of any prior to 2016 that use “polysemy”/“polysemanticity” or “superposition”? Do you know of any other redundant names for “distributed coding,” “entanglement,” “polysemanticity,” “superposition,” or “bottlenecking”?
  • Are you or anyone you know doing interesting work with trojans?
  • Do you have any other hypotheses for why TAISIC doesn’t focus very much on intrinsic interpretability tools? 
15 comments

Strongly upvoting this for being a thorough and carefully cited explanation of how the safety/alignment community doesn't engage enough with relevant literature from the broader field, likely at the cost of reduplicated work, suboptimal research directions, and less exchange and diffusion of important safety-relevant ideas. While I don't work on interpretability per se, I see similar things happening with value learning / inverse reinforcement learning approaches to alignment.

Regarding causal scrubbing in particular, it seems to me that there's a closely related line of research by Geiger, Icard and Potts that it doesn't seem like TAISIC is engaging with deeply? I haven't looked too closely, but it may be another example of duplicated effort / rediscovery:

The importance of interventions

Over a series of recent papers (Geiger et al. 2020, Geiger et al. 2021, Geiger et al. 2022, Wu et al. 2022a, Wu et al. 2022b), we have argued that the theory of causal abstraction (Chalupka et al. 2016, Rubinstein et al. 2017, Beckers and Halpern 2019, Beckers et al. 2019) provides a powerful toolkit for achieving the desired kinds of explanation in AI. In causal abstraction, we assess whether a particular high-level (possibly symbolic) model H is a faithful proxy for a lower-level (in our setting, usually neural) model N in the sense that the causal effects of components in H summarize the causal effects of components of N. In this scenario, N is the AI model that has been deployed to solve a particular task, and H is one’s probably partial, high-level characterization of how the task domain works (or should work). Where this relationship between N and H holds, we say that H is a causal abstraction of N. This means that we can use H to directly engage with high-level questions of robustness, fairness, and safety in deploying N for real-world tasks.


We were quite familiar with Geiger et al's work before writing the post, and think it's importantly different. Though it seems like we forgot to cite it in the Causal Scrubbing AF post, whoops.

Hopefully this will be fixed with the forthcoming arXiv paper!

Great to know, and good to hear!

Strongly upvoting this for being a thorough and carefully cited explanation of how the safety/alignment community doesn't engage enough with relevant literature from the broader field, likely at the cost of reduplicated work, suboptimal research directions, and less exchange and diffusion of important safety-relevant ideas

Ditto. I've recently started moving into interpretability / explainability and spent the past week skimming the broader literature on XAI, so the timing of this carefully cited post is quite impactful for me.

I see similar things happening with causality generally, where it seems to me that (as a 1st order heuristic) much of alignment forum's reference for causality is frozen at Pearl's 2008 textbook, missing what I consider to be most of the valuable recent contributions and expansions in the field. 

  • Example: Finite Factored Sets seems to be reinventing causal representation learning [for a good intro, see Schölkopf 2021], where it seems to me that the broader field is outpacing FFS on its own goals. FFS promises some theoretical gains (apparently to infer causality where Pearl-esque frameworks can't) but I'm no longer as sure about the validity of this.
  • Counterexample(s): the Causal Incentives Working Group, and David Krueger's lab, for instance. Notably these are embedded in academia, where there's more culture (incentive) to thoroughly relate to previous work. (These aren't the only ones, just 2 that came to mind.)

I was intrigued by your claim that FFS is already subsumed by work in academia. I clicked the link you provided, but from a quick skim it doesn't seem to do FFS or anything beyond the usual Pearl causality story as far as I can tell. Maybe I am missing something - could you provide a specific page where you think FFS is being subsumed?

Also, just to make sure we share a common understanding of Schölkopf 2021: Wouldn't you agree that asking "how do we do causality when we don't even know the level of abstraction at which to define causal variables?" is beyond the "usual Pearl causality story" as usually summarized in FFS posts? It certainly goes beyond Pearl's well-known works.

I don't think my claim is that "FFS is already subsumed by work in academia": as I acknowledge, FFS is a different theoretical framework than Pearl-based causality. I view them as two distinct approaches, but my claim is that they are motivated by the same question (that is, how to do causal representation learning). 

It was intentional that the linked paper is an intro survey paper to the Pearl-ish  approach to causal rep. learning: I mean to indicate that there are already lots of academic researchers studying the question "what does it mean to study causality if we don't have pre-defined variables?" 

It may be that FFS ends up contributing novel insights above and beyond <Pearl-based causal rep. learning>, but a priori I expect this to occur only if FFS researchers are familiar with the existing literature, which I haven't seen mentioned in any FFS posts. 

My line of thinking is: It's hard to improve on a field you aren't familiar with. If you're ignorant of the work of hundreds of other researchers who are trying to answer the same underlying question you are, odds are against your insights being novel / neglected. 

Seems like there's a bunch of interesting stuff here, though some of it is phrased overly strongly.

E.g. "mechanistic interpretability requires program synthesis, program induction, and/or programming language translation" seems possible but far from obvious to me. In general I think that having a deep understanding of small-scale mechanisms can pay off in many different and hard-to-predict ways. Perhaps it's appropriate to advocate for MI researchers to pay more attention to these fields, but calling this an example of "reinventing", "reframing" or "renaming" seems far too strong.

Same for "we should not expect solving toy MI problems using humans to help with real world MI problems" - there are a huge number of cases in science where solving toy problems has led to theories that help solve real-world problems.

Ramanujan et al. (2020) showed that randomly initialized networks could be “trained” simply by pruning all of the weights that harmed performance on the task of interest. The resulting subnetwork may accomplish a task of interest, but only in a frivolous sense, and it should not be expected to generalize.

I can kinda see the intuition here, but could you explain why we shouldn't expect this to generalize?

Thanks for the comment.

In general I think that having a deep understanding of small-scale mechanisms can pay off in many different and hard-to-predict ways.

This seems completely plausible to me. But I think that it's a little hand-wavy. In general, I perceive the interpretability agendas that don't involve applied work to be this way. Also, few people would dispute that basic insights, to the extent that they are truly explanatory, can be valuable. But I think it is at least very non-obvious that it would be differentially useful for safety. 

there are a huge number of cases in science where solving toy problems has led to theories that help solve real-world problems.

No qualms here. But (1) the point about program synthesis/induction/translation suggests that the toy problems are fundamentally more tractable than real ones. Analogously, imagine proposing that having humans write and study simple algorithms for search, modular addition, etc. be part of an agenda for program synthesis. (2) At some point the toy work should lead to competitive engineering work. I think that there has not been a clear trend toward this in the past 6 years with the circuits agenda. 

I can kinda see the intuition here, but could you explain why we shouldn't expect this to generalize?

Thanks for the question. It might generalize. My intended point with the Ramanujan paper is that a subnetwork seeming to do something in isolation does not mean that it does that thing in context. Ramanujan et al. weren't interpreting networks; they were just training the networks. So the underlying subnetworks may generalize well, but in this case, this is not interpretability work any more than just gradient-based training of a sparse network is. 

I strongly downvoted this post, primarily because, contra you, I do actually think reframing/reinventing is valuable, and I think that the case for reframing/reinventing things is strawmanned here.

There is one valuable part of this post, and that is the point that interpretability doesn't have good result-incentives. I agree with this criticism, but given the other points of the post, I would strongly downvote it.

This seems interesting. I do not know of steelmen for isolation, renaming, reinventing, etc. What is yours?

I think it's a big stretch to say that deception is basically just trojans. There are similarities, but the regularities that make deception a natural category of behavior that we might be able to detect are importantly fuzzier than the regularities that trojan-detecting strategies use. If "deception" just meant acting according to a wildly different distribution when certain cues were detected, trojan-detection would have us covered, but what counts as "deception" depends more heavily on our standards for the reasoning process, and doesn't reliably result in behavior that's way different than non-deceptive behavior.

Thanks. See also EIS VIII.

Could you give an example of a case of deception that is quite unlike a trojan? Maybe we have different definitions. Maybe I'm not accounting for something. Either way, it seems useful to figure out the disagreement.  

I'm slowly making my way through these, so I'll leave you a more complete comment after I read post 8.