Adam Shai — AI Alignment Forum

Neuroscientist turned Interpretability Researcher. Starting Simplex, an AI Safety Research Org.

Thanks for writing out your thoughts on this! I agree with a lot of the motivations and big picture thinking outlined in this post. I have a number of disagreements as well, and some questions:

It's unfortunate that mech interp inherits the CNC paradigm, because despite many years of research, turns out it's really hard to do computational science on brains, so computational neuroscience hasn't made a huge amount of progress.

I strongly agree with this, and I hope more people in mech. interp. become aware of this. I would actually emphasize that in my opinion it's not just that it's hard to do computational science on brains, but that we don't have the right framework. Some weak evidence for this is exactly that we have an intelligent system that has existed for a few years now where experiments and analyses are easy to do, and we can see how far we've gotten with the CNC approach.

My main point of confusion with this post has to do with Parameter Decomposition as a new paradigm. I haven't thought about this technique much, but on a first reading it doesn't sound all that different from what you call the second wave paradigm, just replacing activations with parameters. For instance, I think I could take most of the last few sections of this post and rewrite it to make the point. Just for fun I'll try this out here, trying to argue for a new paradigm called "Activation Decomposition". (just to be super clear I don't think this is a new paradigm!)

You wrote:

Parameter Decomposition makes some different foundational assumptions than used by the Second-Wave.
One of these assumptions arises because Parameter Decomposition offers a clear definition of 'feature' that hitherto eluded Second-Wave Mech Interp. What Second-Wave Mech Interp calls a 'feature' can be defined as ‘properties of the input that activate particular mechanisms’. Notably, 'features' are here defined with reference to mechanisms, which is great, because 'mechanisms' has a specific formal definition!
This re-definition implies a difference in one of the foundational assumptions of Second-Wave Mech Interp that ‘features are the fundamental unit of neural network’. Parameter Decomposition rejects this idea and contends that ‘mechanisms are the fundamental unit of neural networks’.

and I'll rewrite that here, putting my changes in bold:

Activation Decomposition makes some different foundational assumptions than used by the Second-Wave.
One of these assumptions arises because Activation Decomposition offers a clear definition of 'feature' that hitherto eluded Second-Wave Mech Interp. What Second-Wave Mech Interp calls a 'feature' can be defined as ‘properties of the input that activate particular mechanisms’. Notably, 'features' are here [in Activation Decomposition] defined with reference to mechanisms [which are circuits of linearly decomposed activations], which is great, because 'mechanisms' has a specific formal definition!
This re-definition implies a difference in one of the foundational assumptions of Second-Wave Mech Interp that ‘features are the fundamental unit of neural network’. Activation Decomposition rejects this idea and contends that ‘mechanisms are the fundamental unit of neural networks’.

Perhaps a simpler way to say my thought is, isn't the current paradigm largely decomposing activations? If that's the case why is decomposing parameters so fundamentally different?

I think maybe one thing that might be going on here is that people have been quite sloppy (though I think it's totally excusable and arguably even a good idea to be sloppy about these particular things given the current state of our understanding!), with words like feature, representation, computation, circuit, etc. Like I think when someone writes "features are the fundamental unit of neural networks" they are often meaning something closer to "representations are the fundamental unit of neural networks" or maybe something closer to "SAE latents are the fundamental unit of neural networks" and importantly, an implicit "and representations are only really representations if they are mechanistically relevant." Which is why you see interventions of various types in current paradigm mech interp papers.

Some work in this early period was quite neuroscience-adjacent; as a result, despite being extremely mechanistic in flavour, some of this work may have been somewhat overlooked e.g. Sussillo and Barak (2013).

This is a nitpick, and I don't think any of your main points rests on this, but I think the main reason this work was not used in any type of artificial neural network interp work at that time was that it is fundamentally only applicable to recurrent systems, and probably impossible to apply to e.g. standard convolutional networks. It's not even straightforward to apply to a lot of the types of recurrent systems used in AI today (to the extent they are even used), but probably one could push on that a bit with some effort.

As a final question, I am wondering what you think the implications for what people should be doing are if mech interp is or is not pre-paradigmatic? Is there a difference between mech interp being in a not-so-great-paradigm vs. pre-paradigmatic in terms of what your median researcher should be thinking/doing/spending time on? Or is this just an intellectually interesting thing to think about. I am guessing that when a lot of people say that mech interp is pre paradigmatic they really mean something closer to "mech interp doesn't have a useful/good/perfect paradigm right now". But I'm also not sure if there's anything here beyond semantics.

That is a fair summary.

This post really helped me make concrete some of the admittedly gut reaction type concerns/questions/misunderstandings I had about alignment research, thank you. I have a few thoughts after reading:

(1) I wonder how different some of these epistemic strategies are from everyday normal scientific research in practice. I do experimental neuroscience and I would argue that we also are not even really sure what the "right" questions are (in a local sense, as in, what experiment should I do next), and so we are in a state where we kinda fumble around using whatever inspiration we can. The inspiration can take many forms - philosophical, theoretical, emperical, a very simple model, thought experiments of various kinds, ideas or experimental results with an aesthetic quality. It is true that at the end of the day brain's already exist, so we have that to probe, but I'd argue that we don't have a great handle on what exactly is the important thing to look at in brains, nor in what experimental contexts we should be looking at them, so it's not immediately obvious what type of models, experiments, or observations we should be doing. What ends up happening is, I think, a lot of the types of arguments you mention. For instance, trying to make a story using the types of tasks we can run in the lab but applying to more complicated real world scenarios (or vice versa), and these arguments often take a less-than-totally-formal form. There is an analagous conversation occuring within neuroscience that takes the form of "does any of this work even say anything about how the brain works?!"

(2) You used theoretical computer science as your main example but it sounds to me like the epistemic strategies one might want in alignment research are more generally found in pure mathematics. I am not a mathematician but I know a few, and I'm always really intrigued by the difference in how they go about problem solving compared to us scientists.

Thanks!

It's great to see someone working on this subject. I'd like to point you to Jim Crutchfield's work, in case you aren't familiar with it, where he proposes a "calculii of emergence" wherein you start with a dynamical system and via a procedure of teasing out the equivalence classes of how the past constrains the future, can show that you get the "computational structure" or "causal structure" or "abstract structure" (all loaded terms, I know, but there's math behind it), of the system. It's a compressed symbolic representation of what the dynamical system is "computing" and furthermore you can show that it is optimal in that this representation preserves exactly the information-theory metrics associated with the dynamical system, e.g. metric entropy. Ultimately, the work describes a heirarchy of systems of increasing computational power (a kind of generalization of the Chomsky heirarchy, where a source of entropy is included), wherein more compressed and more abstract representations of the computational structure of the original dynamical system can be found (up to a point, very much depending on the system). https://www.sciencedirect.com/science/article/pii/0167278994902739

The reason I think you might be interested in this is because it gives a natural notion of just how compressible (read: abstractable) a continous dynamical system is, and has the mathematical machinery to describe in what ways exactly the system is abstractable. There are some important differences to the approach taken here, but I think sufficient overlap that you might find it interesting/inspiring.

There's also potentially much of interest to you in Cosma Shalizi's thesis (Crutchfield was his advisor): http://bactra.org/thesis/

The general topic is one of my favorites, so hopefully I will find some time later to say more! Thanks for your interesting and though provoking work.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Sequences

Posts

Wikitag Contributions

Comments