This is a blogpost version of a talk I gave earlier this year at GDM. 

Epistemic status: Vague and handwavy. Nuance is often missing. Some of the claims depend on implicit definitions that it may be reasonable to disagree with, and the overall argument is, in an important sense, subjective. But overall I think it's directionally (subjectively) true. 

 

It's often said that mech interp is pre-paradigmatic. 

I think it's worth being skeptical of this claim. 

In this post I argue that:

  • Mech interp is not pre-paradigmatic.
  • Within that paradigm, there have been "waves" (mini paradigms). Two waves so far.
  • Second-Wave Mech Interp has recently entered a 'crisis' phase.
  • We may be on the edge of a third wave.

 

Preamble: Kuhn, paradigms, and paradigm shifts

First, we need to be familiar with the basic definition of a paradigm: 

A paradigm is a distinct set of concepts or thought patterns, including theories, research methods, postulates, and standards for what constitutes a legitimate contribution to a field.

Kuhn's model of paradigms and how they shift goes like this:

  • Phase 0: Preparadigmatic phase - A new field is just emerging. There is no consensus on any particular theory or body of facts. Eventually, researchers coalesce around a particular dominant paradigm.
  • Phase 1: Normal Science - Now starts a phase of "normal science", where researchers use the dominant paradigm to solve puzzles that arise in the field. This leads to scientific progress! Researchers encounter anomalies and they solve some of these using the dominant paradigm, or slight modifications. But unsolved anomalies accrue.
  • Phase 2: Crisis - Eventually, the field accrues enough anomalies that cannot be solved using the dominant paradigm. At this point, there are expressions of discontent; researchers have recourse to philosophy; fundamental assumptions of the field are debated. Researchers attempt so-called 'revolutionary science' that generates new theories, experiments, and methods that aim to resolve the observed anomalies.
  • Phase 3: Paradigm shift - Consensus gradually shifts toward a new paradigm that better resolves the anomalies. At this point, 'normal science' resumes, restarting the cycle. 

In addition to this model, I contend that there are mini paradigms that are local to scientific subfields. These mini paradigms can undergo mini paradigm shifts, even when the field as a whole remains within roughly the same paradigm. One way I think about 'normal science' is that it actually constitutes a number of smaller paradigm shifts that occur within a larger paradigm. I'll call these mini paradigms "waves", but they have essentially the same properties as paradigms do: they're a set of concepts, methods, standards for what constitutes a legitimate contribution to a field, etc. We'll study each wave/mini paradigm of mech interp with respect to each of these properties. 

Claim: Mech Interp is Not Pre-paradigmatic

I claim that mech interp never has been pre-paradigmatic. 

The reason is that it inherits almost every concept, method, and standard for what constitutes a legitimate contribution to the field from computational neuroscience and connectionism. I'll call this the CNC paradigm.

Mech interp specifically studies modern deep neural networks, but most of the concepts and methods (i.e. most of the paradigm) used to understand and study them lie squarely within the CNC paradigm. For instance, from the CNC paradigm, mech interp inherits:

  • The idea that networks "represent" things;
  • That these "representations" or computations can be distributed across multiple neurons or multiple parts of the network;
  • That these representations can be superposed on top of one another in a linear fashion, as in the 'linear representation hypothesis' (e.g. Smolensky, 1990);
  • That representations can form representational hierarchies, thus representing more abstract concepts on top of less abstract ones, such as the visual representational hierarchy. 

The methods are also spiritually identical:

  • Max-activating dataset examples are basically what Hubel and Wiesel (1959) (and many researchers since) used to demonstrate the functional specialisation of particular neurons.
  • Causal interventions, commonly used in mech interp, are the principle behind many neuroscience methods, including thermal ablation (burning parts of the brain), cooling (thus ‘turning off’ parts of the brain), optogenetic stimulation, and so on.
  • Key data analysis methods, such as dimensionality reduction or sparse coding, that are used extensively in computational neuroscience (and sometimes directly developed for it) are also used extensively in mech interp.

And in many cases, the standards of what constitutes a legitimate contribution to the field are the same. In both, for instance, a legitimate contribution might include a demonstration that a neuron (whether in a brain or an artificial neural network) appears to be involved in an interesting representation or computation, such as the ‘Jennifer Aniston’ neuron (Quiroga et al. 2005) or the ‘Donald Trump’ neuron (Goh et al. 2021). 

Given the extent of the similarities, I don’t think it's excessively controversial to say that mech interp is spiritually a branch of neuroscience, where the same paradigm is applied to artificial neural networks rather than biological ones. Similar comparisons have been made before.

It's unfortunate that mech interp inherits the CNC paradigm because, despite many years of research, it turns out to be really hard to do computational science on brains, so computational neuroscience hasn't made a huge amount of progress. 

So, as a field, we don't have to be happy with the dominant paradigm.  But just because we're not happy with it doesn't mean it's not there.

It's important to note that it's pretty subjective whether one views mech interp as a subfield of CNC vs. a subfield of machine learning vs. any other domain. Whether this rings true depends a lot on how one situates the field in the context of broader ideas and discussions. I think it's reasonable to disagree with the perspective that mech interp is a field best contextualized within the CNC paradigm. And that perspective matters for whether mech interp should be considered pre-paradigmatic. I argue here that mech interp qua CNC isn't pre-paradigmatic because CNC isn't. But mech interp qua machine learning could very validly be said to be pre-paradigmatic. For instance, it has had to (and continues to) endure various political battles with other parts of ML to establish itself as a legitimate subfield, and to establish within the ML community an agreed-upon set of facts, concepts, and practices. I tend to view the field more through the mech interp qua CNC lens, but I take viewing it as a subfield of ML to be a valid and compatible perspective, even though these two perspectives might lead to different conclusions about its pre-paradigmaticity. I'll proceed using the perspective that mech interp is a subfield of CNC, since that frame feels most natural to me personally given my path to the field. 

Going further than claiming that mech interp lies squarely within the broader CNC paradigm, I also claim that the subfield of mech interp exhibits identifiable mini-paradigms, each of which constitutes a smaller set of concepts, thought patterns, methods, and standards for what constitutes a legitimate contribution. I call these mini-paradigms "waves", and claim that mech interp has had two so far.

 

First-Wave Mech Interp (ca. 2012 - 2021)

Some of the earliest days of the deep learning revolution were also the earliest days of mech interp, which began with what I call 'First-Wave Mech Interp'. 

Early work[1] such as Zeiler and Fergus (2013) set out to understand "why [convolutional image classifiers] perform so well" and studied "the function of intermediate feature layers". They extensively visualize hidden features in AlexNet (see also Erhan et al. 2009; Simonyan et al. 2013).

During First-Wave Mech Interp, some parts of the deep learning community were somewhat skeptical that interesting structure could be found in deep neural networks at all. It was therefore a legitimate contribution to the field simply to demonstrate and document that structure! And many papers did an excellent job of this (Erhan et al. 2009; Simonyan et al. 2013; Karpathy et al. 2015; Olah et al. 2020; Cammarata et al. 2020; Goh et al. 2021). Some work in this early period was quite neuroscience-adjacent; as a result, despite being extremely mechanistic in flavour, some of this work may have been somewhat overlooked, e.g. Sussillo and Barak (2013).

Toward the end of the First Wave, the field, now rightfully claiming victory that interesting interpretable structure could be found in deep neural networks, placed a new emphasis on the ideas that 'features are the fundamental unit of neural networks' and that 'features are connected by weights, forming circuits' (Olah et al. 2020). These were not intended as novel claims as far as I can tell; they are simply re-applications of foundational ideas of the CNC paradigm to the context of deep artificial neural networks. 

But First-Wave Mech Interp was not yet done importing ideas from the dominant CNC paradigm. In particular, its failure to import certain ideas gave rise to the anomaly that precipitated the crisis and, with it, the end of First-Wave Mech Interp.  

The Crisis in First-Wave Mech Interp

First-Wave Mech Interp noticed the existence of polysemantic neurons in deep artificial neural networks. To the best of my knowledge, such neurons were first identified in deep artificial neural networks by Nguyen et al. (2016) (see e.g. appendix S5). But Olah (2018a; 2018b) was the first to emphasize the difficulty that polysemanticity posed to the mech interp research program. 

The discovery of polysemantic neurons in computational neuroscience goes back much further, usually under the name of 'mixed selectivity'. Observations extend back to perhaps 1998 (see Fusi et al. 2013 for a review). 

It was therefore a curious deviation from a perspective already widely held in the dominant CNC paradigm that caused the crisis of First-Wave Mech Interp. The crisis eventually resolved, marking the start of Second-Wave Mech Interp.

Second-Wave Mech Interp (ca. 2022 - ??)

At the time, it was somewhat unclear how to solve polysemanticity, and there was no consensus on what caused it. 

One of the hypotheses was the superposition hypothesis: The idea that 'networks represent more features than they have neurons'. It is a natural corollary of the superposition hypothesis that neurons would exhibit polysemanticity, since there cannot be a one-to-one relationship between neurons and 'features'. 
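To make the idea concrete, here is a minimal numerical sketch of superposition (my own illustration, not taken from any of the cited papers): three sparse 'features' are assigned directions in a two-neuron activation space, so each feature remains approximately decodable while every individual neuron responds to several features.

```python
# Toy sketch of superposition (illustrative only): 3 sparse features
# represented in a 2-neuron activation space.
import numpy as np

# Columns are feature directions: unit vectors 120 degrees apart in 2D.
angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
W = np.stack([np.cos(angles), np.sin(angles)])  # shape (2 neurons, 3 features)

# A sparse input on which only feature 0 is active.
f = np.array([1.0, 0.0, 0.0])
a = W @ f  # the 2-dimensional neuron activations

# Each feature is still roughly readable, with bounded interference...
print(W.T @ a)  # ~[1.0, -0.5, -0.5]

# ...but each individual neuron mixes several features: polysemanticity.
print(W[0])  # neuron 0 responds to all three features: [1.0, -0.5, -0.5]
```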

Goh (2016) was perhaps[2] the first to write about the superposition hypothesis, but Olah et al. (2018) was probably the first to invoke superposition as a potential explanation for polysemanticity. Within the CNC paradigm, however, the idea that neural systems could use this representational strategy goes back much further, perhaps to Olshausen and Field (1997).

In 2022, after several years without much noticeable concrete progress toward solving the main crisis in the field, Elhage et al. (2022) published their landmark paper Toy Models of Superposition. This deeply studied the phenomenon of superposition and sparked the search for a practical solution. It also marked the beginning of the end of the crisis of First-Wave Mech Interp and the transition to Second-Wave Mech Interp. 

The solution that was identified shortly afterwards, sparse dictionary learning (SDL) (Sharkey et al. 2022; Cunningham et al. 2023; Bricken et al. 2023), was already a well-known method within the dominant CNC paradigm (Olshausen and Field, 1996; Makhzani and Frey, 2013; Yun et al. 2021). With these ideas and methods, mech interp came to focus less on individual neurons and more on activation vectors (sparse dictionary latents), thus bringing it conceptually up-to-date with the rest of the CNC paradigm. 
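For readers unfamiliar with SDL as it is used in mech interp, here is a minimal sketch of a sparse autoencoder. The names, architecture, and hyperparameters below are illustrative only; the cited papers differ in specifics such as weight tying, normalisation, and the sparsity penalty.

```python
# Minimal sparse autoencoder sketch (illustrative only).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        # d_dict >> d_model: an "overcomplete" dictionary of latents.
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        latents = torch.relu(self.enc(acts))  # sparse dictionary latents
        recon = self.dec(latents)
        return recon, latents

def sae_loss(recon, acts, latents, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty encouraging sparse latents.
    return ((recon - acts) ** 2).mean() + l1_coeff * latents.abs().mean()

# Usage: decompose activations of width 512 into 4096 candidate "features".
sae = SparseAutoencoder(d_model=512, d_dict=4096)
acts = torch.randn(64, 512)  # stand-in for a batch of model activations
recon, latents = sae(acts)
loss = sae_loss(recon, acts, latents)
```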

During this phase, contributions to the field continued to include demonstrations that interesting interpretable structure exists in neural networks, especially sparse dictionary latents that appeared to be monosemantic (e.g. Cunningham et al. 2023; Bricken et al. 2023; Templeton et al. 2024; Lindsey et al. 2025). Other contributions attempted to use the methods and ideas of the second wave for downstream tasks. SDL methods also introduced several anomalies, such as 'shrinkage' (Wright et al. 2024; Jermyn et al. 2024), thus sparking a flurry of work that sought solutions to such anomalies (e.g. Gao et al. 2024; Rajamanoharan et al. 2024a; Taggart et al. 2024; Rajamanoharan et al. 2024b). 

Anomalies in Second-Wave Mech Interp

Which brings us to today.

There are now plenty of anomalies that remain unsolved within Second-Wave Mech Interp, or that have only partial solutions that are mutually incoherent.

Anomalies include:

  • A satisfactory definition of what features actually are is elusive. This is a pretty bad issue for such a fundamentally important object, and I've come to believe that many of the field's conceptual issues emerge downstream of it.
  • It is unclear how, in general, to identify 'multidimensional features' (Engels et al. 2024) or where their multidimensional boundaries start and stop.
  • The reasons for feature geometry are not explained within the paradigm (Mendel, 2024; Kriegeskorte et al. 2013).
  • Sparse dictionaries sometimes miss features if they are too small; but features in larger dictionaries are not 'atomic' (Leask and Bussmann et al. 2025).
  • Relatedly, sparse dictionary features exhibit feature splitting (Bricken et al. 2023) and feature absorption (Chanin et al. 2024), suggesting that sparsity is not truly what we want our network decomposition methods to optimize for.
  • It remains conceptually unclear what it means for a representation to span multiple layers/branches/attention heads[3]. Partial solutions to each of these cases have been suggested (e.g. Mathwin et al. 2024; Wynroe et al. 2024; Ameisen et al. 2025; Jermyn et al. 2025), but these solutions are either incoherent with each other or do not address any of the other anomalies of the paradigm.
  • Empirical results that attempt to use SDL latents for downstream tasks are mixed, with some highlighting their utility and others their lack of utility (e.g. Smith et al. 2025; Muhamed et al. 2025).

Given these various issues, my sense over the last few months is that a ‘crisis’ phase has begun in Second-Wave Mech Interp, especially since the publication of Smith et al. (2025), which seems to have been the straw that broke the camel’s back. 

The Crisis of Second-Wave Mech Interp (ca. 2025 - ??) 

‘Crises’ do not require universal discontent. There are still relatively steadfast proponents of Second-Wave Mech Interp who feel the need for a new paradigm less keenly. And, indeed, there is still real progress to be made within Second-Wave Mech Interp! Accordingly, it continues to progress with some very nice work (e.g. Ameisen et al. 2025; Lindsey et al. 2025). 

However, it feels like, more and more, the field is willing to express discontent with, and openly question, its fundamental assumptions. I personally agree with the need to question the fundamental assumptions of Second-Wave Mech Interp and, even more heretically, to question some of the emphases of the dominant CNC paradigm.

Toward 'Third-Wave' Mechanistic Interpretability

To graduate to Third-Wave Mech Interp, the field now needs new theories, concepts, methods, and experiments that resolve the anomalies of Second-Wave Mech Interp.

Here, I tentatively suggest that some of these elements might come from a branch of work that my team and others have been working on called ‘Parameter Decomposition’ (Braun et al. 2025; Chrisman et al. 2025). 

I’ll emphasize that these are early ideas and certainly do not yet constitute ‘Third-Wave Mech Interp’. 

But at least in theory, the approach promises to resolve some of the anomalies of Second-Wave Mech Interp. 

Emphasis here on “in theory”. Current Parameter Decomposition methods are not yet scalable or robust enough to convincingly demonstrate that they can resolve the anomalies of  Second-Wave Mech Interp in practice (though we're making progress on this - we'll have another paper out on this soon). But conceptually, the approach suggests a promising path forward for the field.

What is the approach exactly? Here I outline the basics of the approach and how it might resolve the anomalies of Second-Wave Mech Interp. 

For a fuller discussion of the approach, see Braun et al. (2025).

The Basics of Parameter Decomposition

One of the premises of (linear) Parameter Decomposition is that we can decompose a neural network’s parameter vector into a sum of parameter components. There are of course infinitely many sets of potential parameter component vectors that sum to the parameters of a given network. But we want to identify parameter components such that each one performs a particular (preferably interpretable) role in the overall algorithm learned by the neural network. In other words, we want to identify parameter components that correspond to the ‘mechanisms’ of the neural network.

One of the core ideas that Parameter Decomposition approaches leverage is that a neural network should not require all of its mechanisms simultaneously (Veit et al., 2016; Zhang et al., 2022; Dong et al., 2023). For example,  if a network uses a mechanism to store the knowledge that “The sky is blue”, then, on inputs that do not use that knowledge, we should be able to ablate the parameter component that implements that mechanism without affecting the output. Thinking about this mechanistically, if a given datapoint’s activations lie orthogonal to particular directions in parameter space, then, at least on this input, we should be able to ablate those directions in parameter space without affecting the output[4].
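As a minimal illustration of this point (my own sketch, using a single linear layer rather than any method from the cited papers): if a weight matrix is the sum of two components and the current input is orthogonal to everything one component reads from, that component can be ablated without changing the output on that input.

```python
# Toy illustration of mechanism ablation (not a method from the cited papers).
import numpy as np

rng = np.random.default_rng(0)

# Component 1 only reads from the first two input dimensions;
# component 2 only reads from the last two.
W1 = np.hstack([rng.normal(size=(3, 2)), np.zeros((3, 2))])
W2 = np.hstack([np.zeros((3, 2)), rng.normal(size=(3, 2))])
W = W1 + W2  # faithful decomposition: the components sum to the weights

x = np.array([0.7, -1.3, 0.0, 0.0])  # input orthogonal to W2's rows

# Ablating W2 leaves the output unchanged on this input.
print(np.allclose(W @ x, W1 @ x))  # True
```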

The other core principle that Parameter Decomposition leverages is the principle of ‘minimum description length’ (MDL). The idea is that we want to decompose a network’s parameters in such a way that we minimize the length of the description of how the network functions over the training dataset. In particular, we want to identify parameter components that are:

  • Faithful to the parameters of the original network, in that the components do in fact sum to the parameters of the network.
  • Minimal, in that as few components as possible are required to replicate the network’s behavior on the training distribution.
  • Simple, in that the components should each involve as little computational machinery as possible.

If we can identify a set of parameter components that conform to these properties, it seems reasonable to call them the ‘mechanisms’ of the network.
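As a rough sketch of how these three criteria might be written down as an objective (my own hypothetical illustration; the names and loss terms below are stand-ins, not the actual objective of Braun et al. (2025) or Chrisman et al. (2025)):

```python
# Hypothetical sketch of the three Parameter Decomposition criteria as losses.
# `components` stacks candidate parameter components, shape (n_components, n_params);
# `theta` is the original network's flattened parameter vector, shape (n_params,);
# `attributions` scores how much each component is "used" per datapoint,
# shape (batch, n_components) - a placeholder for whatever attribution or
# ablation scheme an actual method would use.
import torch

def faithfulness_loss(components: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    # Components should sum to the original parameters.
    return ((components.sum(dim=0) - theta) ** 2).mean()

def minimality_loss(attributions: torch.Tensor) -> torch.Tensor:
    # As few components as possible should be needed per datapoint.
    return attributions.abs().mean()

def simplicity_loss(components: torch.Tensor) -> torch.Tensor:
    # Each component should involve as little machinery as possible
    # (an L1 penalty stands in for e.g. a rank or description-length penalty).
    return components.abs().mean()

def pd_loss(components, theta, attributions, a=1.0, b=1.0, c=1.0):
    return (a * faithfulness_loss(components, theta)
            + b * minimality_loss(attributions)
            + c * simplicity_loss(components))
```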

Parameter Decomposition Questions Foundational Assumptions of Second-Wave Mech Interp

Parameter Decomposition makes some foundational assumptions that differ from those of the Second Wave. 

One of these assumptions arises because Parameter Decomposition offers a clear definition of 'feature' that has hitherto eluded Second-Wave Mech Interp. What Second-Wave Mech Interp calls a 'feature' can be defined as ‘properties of the input that activate particular mechanisms’. Notably, 'features' are here defined with reference to mechanisms, which is great, because 'mechanisms' has a specific formal definition!

This re-definition implies a departure from one of the foundational assumptions of Second-Wave Mech Interp, namely that ‘features are the fundamental unit of neural networks’. Parameter Decomposition rejects this idea and contends that ‘mechanisms are the fundamental unit of neural networks’. 

It's a subtle philosophical point, but this also implies that, in neural networks, computations are more fundamental than representations. Representations, to the extent that they aren’t just a leaky abstraction (which I think they might be), fall out of computations, not the other way round. This sheds some light on an ongoing debate in cognitive philosophy regarding what a 'concept' is (ontologically), with some arguing for computational vs. representational accounts. 

Parameter Decomposition In Theory Resolves Anomalies of Second-Wave Mech Interp 

Downstream of this definitional clarity, Parameter Decomposition suggests clean ways of thinking about several anomalies of Second-Wave Mech Interp:

  • Multidimensional features arise from multidimensional mechanisms, which we can identify by identifying minimum length descriptions of a network’s operations. It thus becomes clear where the boundaries of multidimensional features stop and start.
  • Feature geometry suggests that, rather than ‘features’ being fundamental units of computation, there exists some underlying structure that necessitates ‘features’ having particular geometric arrangements relative to each other. This phenomenon is no longer difficult to explain (unlike in Second-Wave Mech Interp): the reason different ‘features’ lie in particular relations to each other is that particular mechanisms must eventually be applied to those regions in activation space. In other words, the geometric structure of a network’s computational mechanisms gives rise to the geometric structure of the semantics of its representations.
  • Feature splitting and feature absorption can likely be avoided by requiring that the components sum to the original network’s parameters. There is no limit to the sparsity or number of sparse dictionary latents used in SDL. But there is a limit to how sparsely activating and simple you can make parameter components that sum to the parameters of the original network.
  • The requirement that the parameter components sum to the original parameters also means that there can be no ‘missing mechanisms’. At worst, there can only be ‘parameter components which aren’t optimally minimal or simple’.
  • Representations that are distributed across layers, branches, or attention heads are explained away as mechanisms whose parameter vectors span multiple layers, branches, or attention heads. We no longer require multiple mutually incompatible decomposition methods to identify such representations. Instead, we just identify the associated directions in parameter space.
  • Another attractive property of Parameter Decomposition is that it identifies Minimum Description Length as the optimization criterion for our explanations of neural networks, in contrast to most Second-Wave Mech Interp, which arguably optimizes for sparsity or, more nebulously (and potentially less faithfully to the ‘true’ mechanisms of the network), optimizes for feature interpretability (but see Ayonrinde et al. 2024, which does at least select for MDL). 

It’s worth noting that the emphases of Parameter Decomposition deviate somewhat from the standard emphases of the CNC paradigm: most of the CNC paradigm puts features/representations front-and-center, and only rarely studies weights/parameters/synapse strengths, preferring to study neural activations. Despite its conceptual attractiveness, neuroscience has not yet converged on a Parameter Decomposition-like approach. I speculate that one reason for this is that it is much harder to study biological synapses and accurately measure their strengths than to measure biological neural activations. 

Conclusion

So Parameter Decomposition in theory suggests solutions to the anomalies of Second-Wave Mech Interp. But a theory doesn’t make a paradigm. We do not yet have all the required building blocks of a paradigm worthy of the name ‘Third-Wave Mech Interp’. More scalable, more robust parameter decomposition methods need to be developed before we can verify that these anomalies are resolved in practice[5]. Experiments need to be run. And, not least, a new scientific consensus needs to emerge.

Nevertheless, that there exist even theoretical solutions to these anomalies is an improvement on the status quo of Second-Wave Mech Interp. If solutions to these issues can be found, we may be on the cusp of a new wave, similar to 2022, when the anomaly of superposition was foregrounded by Elhage et al. (2022), but the field had not yet settled on scalable methods to resolve it. 

So there’s plenty of work remaining to be done before we can resolve the crisis of Second-Wave Mech Interp. If you’re interested in collaborating with us and colleagues to investigate whether Parameter Decomposition does in fact resolve the anomalies of Second-Wave Mech Interp, there are a few steps you could take, beyond working on it independently! You can join the #parameter-decomposition channel on the new Open Source Mechanistic Interpretability Slack (invite link here - this will expire after a while). You can also apply to my MATS stream, or apply to work with us at Goodfire.

 

 

  1. ^

    Arguably, first wave mech interp starts even earlier than this, with Hinton et al. (1986) "Learning representations by back-propagating errors", where they studied the weights of the first layer of a network. But it's typical to constrain the object of study of mech interp to be deep networks, thus excluding work prior to ca. 2012, which is fine and understandable though somewhat arbitrary. 

  2. ^

    There is no date on the blog post, but see this commit history for the date of when it was posted. 

  3. ^

    For example: Is a feature 'present' if it is linearly readable from any activation space? What if it's nonlinearly readable? Is the representation 'present' in that case?

  4. ^

    Or, if activations are not orthogonal to particular directions in parameter space, we should still be able to ablate them if they lead to sub-threshold pre-activations, which have no downstream causal effect.

  5. ^

    A new paper that makes some progress on the issues of robustness and scalability should be coming soon!


Comments

Thanks for writing out your thoughts on this! I agree with a lot of the motivations and big picture thinking outlined in this post. I have a number of disagreements as well, and some questions:
 


It's unfortunate that mech interp inherits the CNC paradigm because, despite many years of research, it turns out to be really hard to do computational science on brains, so computational neuroscience hasn't made a huge amount of progress. 

I strongly agree with this, and I hope more people in mech. interp. become aware of this. I would actually emphasize that in my opinion it's not just that it's hard to do computational science on brains, but that we don't have the right framework. Some weak evidence for this is exactly that we have an intelligent system that has existed for a few years now where experiments and analyses are easy to do, and we can see how far we've gotten with the CNC approach. 

My main point of confusion with this post has to do with Parameter Decomposition as a new paradigm. I haven't thought about this technique much, but on a first reading it doesn't sound all that different from what you call the second wave paradigm, just replacing activations with parameters.  For instance, I think I could take most of the last few sections of this post and rewrite it to make the point. Just for fun I'll try this out here, trying to argue for a new paradigm called "Activation Decomposition". (just to be super clear I don't think this is a new paradigm!)

You wrote:
 


Parameter Decomposition makes some different foundational assumptions than used by the Second-Wave. 

One of these assumptions arises because Parameter Decomposition offers a clear definition of 'feature' that hitherto eluded Second-Wave Mech Interp. What Second-Wave Mech Interp calls a 'feature' can be defined as ‘properties of the input that activate particular mechanisms’. Notably, 'features' are here defined with reference to mechanisms, which is  great, because 'mechanisms' has a specific formal definition!

This re-definition implies a difference in one of the foundational assumptions of Second-Wave Mech Interp that ‘features are the fundamental unit of neural network’. Parameter Decomposition rejects this idea and contends that mechanisms are the fundamental unit of neural networks’. 

and I'll rewrite that here, putting my changes in bold:
 


Activation Decomposition makes some different foundational assumptions than used by the Second-Wave. 

One of these assumptions arises because Activation Decomposition offers a clear definition of 'feature' that hitherto eluded Second-Wave Mech Interp. What Second-Wave Mech Interp calls a 'feature' can be defined as ‘properties of the input that activate particular mechanisms’. Notably, 'features' are here [in Activation Decomposition] defined with reference to mechanisms [which are circuits of linearly decomposed activations], which is  great, because 'mechanisms' has a specific formal definition!

This re-definition implies a difference in one of the foundational assumptions of Second-Wave Mech Interp that ‘features are the fundamental unit of neural network’. Activation Decomposition rejects this idea and contends that ‘mechanisms are the fundamental unit of neural networks’. 

Perhaps a simpler way to say my thought is, isn't the current paradigm largely decomposing activations? If that's the case why is decomposing parameters so fundamentally different?

I think maybe one thing that might be going on here is that people have been quite sloppy (though I think it's totally excusable, and arguably even a good idea to be sloppy about these particular things given the current state of our understanding!) with words like feature, representation, computation, circuit, etc. I think when someone writes "features are the fundamental unit of neural networks" they often mean something closer to "representations are the fundamental unit of neural networks", or maybe something closer to "SAE latents are the fundamental unit of neural networks", with, importantly, an implicit "and representations are only really representations if they are mechanistically relevant". Which is why you see interventions of various types in current-paradigm mech interp papers.
 

Some work in this early period was quite neuroscience-adjacent; as a result, despite being extremely mechanistic in flavour, some of this work may have been somewhat overlooked e.g. Sussillo and Barak (2013)

This is a nitpick, and I don't think any of your main points rests on this, but I think the main reason this work was not used in any type of artificial neural network interp work at that time was that it is fundamentally only applicable to recurrent systems, and probably impossible to apply to e.g. standard convolutional networks. It's not even straightforward to apply to a lot of the types of recurrent systems used in AI today (to the extent they are even used), but probably one could push on that a bit with some effort. 

As a final question, I am wondering what you think the implications are for what people should be doing if mech interp is or is not pre-paradigmatic. Is there a difference between mech interp being in a not-so-great paradigm vs. pre-paradigmatic in terms of what your median researcher should be thinking/doing/spending time on? Or is this just an intellectually interesting thing to think about? I am guessing that when a lot of people say that mech interp is pre-paradigmatic they really mean something closer to "mech interp doesn't have a useful/good/perfect paradigm right now". But I'm also not sure if there's anything here beyond semantics.

Hey Adam, thanks for your thoughts on this! 

 

I strongly agree with this, and I hope more people in mech. interp. become aware of this. I would actually emphasize that in my opinion it's not just that it's hard to do computational science on brains, but that we don't have the right framework. Some weak evidence for this is exactly that we have an intelligent system that has existed for a few years now where experiments and analyses are easy to do, and we can see how far we've gotten with the CNC approach. 


I think we're on the same page that we might not have the right framework to do computational science on brains or other intelligent systems. I think we might disagree on how far away current mainstream ideas are from being the right framework - I'd predict that, if we talked it out further, I'd say we're closer than you'd say we are. I don't know how far afield from current ideas we need to look for the right framework, and I'd support work that looks even further afield than several inferential steps from current mainstream ideas. But I don't think the historically sluggish pace of computational neuroscience justifies searching at any particular inferential distance; more proximal solutions feel just as likely to be the next paradigm/wave as more distant solutions (maybe more likely, given the social nature of what constitutes a paradigm/wave). 

My main point of confusion with this post has to do with Parameter Decomposition as a new paradigm.

I really want to re-emphasize that I didn't call PD a new paradigm (or even a new 'wave') in the post. N.B.: "I’ll emphasize that these are early ideas and certainly do not yet constitute ‘Third-Wave Mech Interp’. "

I haven't thought about this technique much, but on a first reading it doesn't sound all that different from what you call the second wave paradigm, just replacing activations with parameters.  For instance, I think I could take most of the last few sections of this post and rewrite it to make the point. Just for fun I'll try this out here, trying to argue for a new paradigm called "Activation Decomposition". (just to be super clear I don't think this is a new paradigm!)

Yeah, I don't think PD throws away the majority of the ideas in the 2nd wave. It's designed primarily to solve the anomalies of the 2nd wave. It will therefore resemble 2nd-wave ideas and we can draw analogies. But I think it's different in important ways. For one, I think it will probably help us be less confused about ideas like 'feature', 'representation', 'circuit', and so on.

 

> Some work in this early period was quite neuroscience-adjacent; as a result, despite being extremely mechanistic in flavour, some of this work may have been somewhat overlooked e.g. Sussillo and Barak (2013)

This is a nitpick, and I don't think any of your main points rests on this, but I think the main reason this work was not used in any type of artificial neural network interp work at that time was that it is fundamentally only applicable to recurrent systems, and probably impossible to apply to e.g. standard convolutional networks. It's not even straightforward to apply to a lot of the types of recurrent systems used in AI today (to the extent they are even used), but probably one could push on that a bit with some effort. 

Yes, this is fair. These are still fairly deep neural networks, though (if we count time as depth), and they're examples of work that interprets ANNs on the lowest level, using low-level analysis of weights and activations with e.g. dimensionality reduction and other methods mech interp folks might find familiar. But I agree it doesn't usually get put in the bucket of 'mech interp', though ultimately the boundary is fairly arbitrary. As a separate point, it is surprising how little of the neuroscience community has jumped onto mechanistically understanding more interesting models like Inception v2 or LLMs, despite the similarity of methods and object of study. This is a testament to the early mech interp pioneers, since they saw a field where few others did. 

As a final question, I am wondering what you think the implications for what people should be doing are if mech interp is or is not pre-paradigmatic? Is there a difference between mech interp being in a not-so-great-paradigm vs. pre-paradigmatic in terms of what your median researcher should be thinking/doing/spending time on? Or is this just an intellectually interesting thing to think about. I am guessing that when a lot of people say that mech interp is pre paradigmatic they really mean something closer to "mech interp doesn't have a useful/good/perfect paradigm right now". But I'm also not sure if there's anything here beyond semantics

I'm not actually sure if this is very action-relevant. I think in the past I might have said "mech interp practitioners should be more familiar with computational neuroscience/connectionism", since I think this might have saved the mech interp community some time. But I don't think it would have saved a huge amount of time, and I think mech interp has largely surpassed comp neuro as a source of interesting and relevant ideas. I think it's mostly useful as an exercise in situating mech interp ideas within the broader set of ideas of an eminently related field (comp neuro/connectionism). But I'll stress that many in the field see mech interp as better contextualized by other sets of broader ideas (e.g. as a subfield of interpretability/ML), and when viewing mech interp in light of those ideas, it might better be thought of as pre-paradigmatic. I think that's a completely compatible but different perspective from the one I tend to take, and it just emphasizes the subjectiveness of the whole question of whether the field is paradigmatic or not. 

So, as a field, we don't have to be happy with the dominant paradigm. But just because we're not happy with it doesn't mean it's not there.

Um, ok fine, so what alternative term do you propose to replace "pre-paradigmatic" as it is currently used, to indicate that there's no remotely satisfactory paradigm in which to get going on the parts of the field-to-be that really matter?
