TL;DR: If we can build competitive AI systems that are interpretable, then I argue via analogy that trying to extract them from messy deep learning systems seems less promising than directly engineering them.  

ETA: here's a follow-up: Mechanistic Interpretability as Reverse Engineering (follow-up to "cars and elephants")

Preliminaries:  
Distinguish weak and strong tractability of (mechanistic) interpretability as follows:

  • Weak tractability: AGI-level systems are interpretable in principle, i.e. humans have the capacity to fully understand their workings given the right instructions and tools.  This would be false if intelligence involves irreducible complexity, e.g. because it involves concepts that are not "crisp", modular, or decomposable; for instance, there might not be a crisp conceptual core to various aspects of perception or abstract concepts such as "fairness".[1]
  • Strong tractability: We can build interpretable AGI-level systems without sacrificing too much competitiveness.

The claim:
If strong tractability is true, then mechanistic interpretability is likely not the best way to engineer competitive AGI-level systems.

The analogy:
1) Suppose we have a broken-down car with some bad parts, and we want a car that is safe to drive.  We could try to fix the car and replace the bad parts. 
2) But we also have a perfectly functioning elephant.  So instead, we could tinker with the elephant to understand how it works and make its behavior safer and more predictable.
I claim (2) is roughly analogous to mechanistic interpretability, and (1) to pursuing something more like what Stuart Russell seems to be aiming for: a neurosymbolic approach to AI safety based on modularity, probabilistic programming, and formal methods.[2]
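For concreteness, here is a toy sketch of the "fix the car" style (my own invention, not anything from Russell's actual agenda): a decision rule built from named probabilities and exact Bayesian inference, where every quantity and every step has a stated meaning that can be audited directly.

```python
# Toy sketch (not Russell's proposal): a decision module built from named,
# auditable pieces -- a discrete generative model plus exact Bayes rule --
# rather than learned weights.  Every quantity has a stated meaning.

# Prior: probability that an obstacle is present ahead.
P_OBSTACLE = 0.1

# Sensor model: P(alarm | obstacle) and P(alarm | no obstacle).
P_ALARM_GIVEN_OBSTACLE = 0.9
P_ALARM_GIVEN_CLEAR = 0.05

def posterior_obstacle(alarm: bool) -> float:
    """Exact Bayes rule; the 'reasoning' is inspectable line by line."""
    p_alarm_obs = P_ALARM_GIVEN_OBSTACLE if alarm else 1 - P_ALARM_GIVEN_OBSTACLE
    p_alarm_clear = P_ALARM_GIVEN_CLEAR if alarm else 1 - P_ALARM_GIVEN_CLEAR
    joint_obs = p_alarm_obs * P_OBSTACLE
    joint_clear = p_alarm_clear * (1 - P_OBSTACLE)
    return joint_obs / (joint_obs + joint_clear)

def should_brake(alarm: bool, risk_threshold: float = 0.3) -> bool:
    """The policy is an explicit, verifiable rule over the posterior."""
    return posterior_obstacle(alarm) > risk_threshold

p = posterior_obstacle(alarm=True)
print(f"P(obstacle | alarm) = {p:.3f}")  # 0.09 / (0.09 + 0.045) ≈ 0.667
print(should_brake(alarm=True))
```

The point is not that this toy is competitive; it is that its safety case rests on reading the code, not on reverse engineering learned weights.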

Fleshing out the argument a bit more:
To the extent that strong tractability is true, there must be simple principles we can recognize underlying intelligent behavior.  If there are no such simple principles, then we shouldn't expect mechanistic interpretability methods to yield safe, competitive systems.  We already have many ideas about what some of those principles might be (from GOFAI and other areas).  Why would we expect it to be easier to recognize and extract these principles from neural networks than to deliberately incorporate them into the way we engineer systems?  

Epistemic status: I seem to be the biggest interpretability hater/skeptic I've encountered in the AI x-safety community.  This is an argument I came up with a few days ago that seems to capture some of my intuitions, although it is hand-wavy.  I haven't thought about it much, and spent ~1hr writing this, but am publishing it anyway because I don't express my opinions publicly as often as I'd like due to limited bandwidth.

Caveats: I made no effort to anticipate and respond to counter-arguments (e.g. "those methods aren't actually more interpretable").  There are lots of different ways that interpretability might be useful for AI x-safety.  It makes sense as part of a portfolio approach.  It makes sense as an extra "danger detector" that might produce some true positives (even if there are a lot of false negatives) or one of many hacks that might be stacked.  I'm not arguing that Stuart Russell's approach is clearly superior to mechanistic interpretability.  But it seems like roughly the entire AI existential safety community is very excited about mechanistic interpretability and entirely dismissive of Stuart Russell's approach, and this seems bizarre.

Unrelated bonus reason to be skeptical of interpretability (hopefully many more to come!): when you deploy a reasonably advanced system in the real world, it will likely recruit resources outside itself in various ways (e.g. the way people write things down on paper to augment their memory), meaning that we will need to understand more than just the model itself, making the whole endeavor far less tractable.
 

  1. ^

    For what it's worth, I think weak tractability is probably false and this is maybe a greater source of my skepticism about interpretability than the argument presented in this post.

  2. ^

    Perhaps well-summarized here, although I haven't watched the talk yet: 


(Context: I work at Redwood on using the internals of models to do useful stuff. This is often interpretability work.)

I broadly agree with this post, but I think that it implicitly uses the words 'mechanistic interpretability' differently than people typically do. It seems to imply that for mechanistic interpretability to be tractable, all parts of the AGI's cognition must be understandable by humans in principle. While I agree that this is basically required for Microscope AI to be very useful, it isn't required for mechanistic interp to be useful.

For instance, we don't need to understand how models predict whether the next token is ' is' or ' was' in order to be able to gain some signal on whether or not the model is lying with interp.
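As a toy illustration of "some signal without full understanding" (entirely synthetic: the activations and the linear 'lying' direction below are fabricated, standing in for a real model's hidden states), a simple mass-mean probe can pick up a feature it does not mechanistically explain:

```python
import numpy as np

# Synthetic "activations" stand in for a model's hidden states.  We assume a
# 'lying' signal is linearly present along one direction; the probe recovers
# signal without explaining any circuit.

rng = np.random.default_rng(0)
d = 64                                  # hidden size (arbitrary)
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)  # unit 'lying' direction

def fake_activations(lying: bool, n: int) -> np.ndarray:
    """Gaussian noise plus a small shift along the hidden direction."""
    shift = (1.0 if lying else -1.0) * truth_dir
    return rng.normal(size=(n, d)) + shift

train_lie, train_true = fake_activations(True, 200), fake_activations(False, 200)
# "Mass-mean" probe: direction from class means; no circuit-level understanding.
probe = train_lie.mean(axis=0) - train_true.mean(axis=0)

test_lie, test_true = fake_activations(True, 100), fake_activations(False, 100)
acc = np.concatenate([test_lie @ probe > 0, test_true @ probe <= 0]).mean()
print(f"probe accuracy: {acc:.2f}")
```

The probe reads out a direction; it says nothing about the computation that produces it, which is the sense in which it yields signal without comprehensive reverse engineering.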

I think if this post replaced the words 'mechanistic interpretability' with 'microscope ai' or 'comprehensive reverse engineering' it would be more accurate.

I think that it implicitly uses the words 'mechanistic interpretability' differently than people typically do.

I disagree.  I think in practice people say mechanistic interpretability all the time and almost never say these other more specific things.  This feels a bit like moving the goalposts to me.  And I already said in the caveats that it could be useful even if the most ambitious version doesn't pan out.  

For instance, we don't need to understand how models predict whether the next token is ' is' or ' was' in order to be able to gain some signal on whether or not the model is lying with interp.

This is a statement that is almost trivially true, but we likely disagree on how much signal.  It seems like much of mechanistic interpretability is predicated on something like weak tractability (e.g. that we can understand what deep networks are doing via simple modular/abstract circuits), I disagree with this, and think that we probably do need to understand "how models predict whether the next token is ' is' or ' was'" to determine if a model was "lying" (whatever that means...).  
But to the extent that weak/strong tractability are true, this should also make us much more optimistic about engineering modular systems.  That is the main point of the post.

 

But it seems like roughly the entire AI existential safety community is very excited about mechanistic interpretability and entirely dismissive of Stuart Russell's approach, and this seems bizarre.

Data point: I consider myself part of the AI x-risk community, but like you I am not very excited about mechanistic interpretability research in an x-risk context. I think there is somewhat of a filter-bubble effect going on, where people who are more excited about interpretability post more on this forum.

Stuart Russell's approach is a broad agenda, and I am not on board with all parts of it, but I definitely read his provable safety slogan as a call for more attention to the design approach where certain AI properties (like safety and interpretability properties) are robustly created by construction.

There is an analogy with computer programming here: a deep neural net is like a computer program written by an amateur without any domain knowledge, one that was carefully tweaked to pass all tests in the test suite. Interpreting such a program might be very difficult. (There is also the small matter that the program might fail spectacularly when given inputs not present in the test suite.) The best way to create an actually interpretable program is to build it from the ground up with interpretability in mind.
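A contrived sketch of this contrast (the leap-year task and both functions are invented for illustration): two programs that agree on a small test suite, where only the one built from an explicit specification supports a safety argument by construction.

```python
# Both functions pass the same "test suite", but only one is built from an
# explicit specification.

def leap_year_tweaked(y: int) -> bool:
    # "Amateur" version: opaque special cases, tuned until the tests passed.
    # Happens to be right on the tests, for reasons invisible in the code.
    return y in {2000, 2004, 2008} or (y % 4 == 0 and y not in {1900, 2100})

def leap_year_spec(y: int) -> bool:
    # Built from the Gregorian rule: divisible by 4, except centuries,
    # except centuries divisible by 400.  Correctness can be argued
    # directly from the structure of the code.
    return y % 4 == 0 and (y % 100 != 0 or y % 400 == 0)

# The "test suite" cannot tell the two apart:
tests = [2000, 1900, 2004, 2023]
assert all(leap_year_tweaked(y) == leap_year_spec(y) for y in tests)

# Off the test suite, the tweaked version fails silently:
print(leap_year_tweaked(2200), leap_year_spec(2200))  # True False
```

Interpreting the tweaked version means reverse engineering why those magic constants happen to work; the spec version was interpretable from the start.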

What is notable here is that the CS/software engineering people who deal with provable safety properties have long ago rejected the idea that provable safety should be about proving safe an already-existing bunch of spaghetti code that has passed a test suite. The problem of interpreting or reverse engineering such code is not considered a very interesting or urgent one in CS. But this problem seems to be exactly what a section of the ML community has now embarked on. As an intellectual quest, it is interesting. As a safety engineering approach for high-risk system components, I feel it has very limited potential.

I think the argument in Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc applies here.

Carrying it over to the car/elephant analogy: we do not have a broken car. Instead, we have two toddlers wearing a car costume and making "vroom" noises. [Edit-To-Add: Actually, a better analogy would be a Flintstones car. It only looks like a car if we hide the humans' legs running underneath.] We have not ever actually built a car or anything even remotely similar to a car; we do not understand the principles of mechanics, thermodynamics or chemistry required to build an engine. We study the elephant not primarily in hopes of making the elephant itself more safe and predictable, but in hopes of learning those principles of mechanics, thermodynamics and chemistry which we currently lack.

Actually I really don't think it does... the argument there is that:

  • interpretability is about understanding how concepts are grounded.
  • symbolic methods don't tell us anything about how their concepts are grounded.

This is only tangentially related to the point I'm making in my post, because:

  • A lot of interpretability is about discovering how concepts are used in a higher-level algorithm, and the argument doesn't apply there.
  • I am comparing mechanistic interpretability of neural nets with neuro-symbolic methods.
  • One point of using such methods is to enforce or encourage certain high-level algorithmic properties, e.g. modularity. 

That counterargument does at least typecheck, so we're not talking past each other. Yay!

In the context of neurosymbolic methods, I'd phrase my argument like this: in order for the symbols in the symbolic-reasoning parts to robustly mean what we intended them to mean (e.g. standard semantics in the case of natural language), we need to pick the right neural structures to "hook them up to". We can't just train a net to spit out certain symbols given certain inputs and then use those symbols as though they actually correspond to the intended meaning, because <all the usual reasons why maximizing a training objective does not do the thing we intended>.

Now, I'm totally on board with the general idea of using neural nets for symbol grounding and then building interpretable logic-style stuff on top of that. (Retargeting the Search is an instance of that general strategy, especially if we use a human-coded search algorithm.) But interpretability is a necessary step to do that, if we want the symbols to be robustly correctly grounded.

On to the specifics:

A lot of interpretability is about discovering how concepts are used in a higher-level algorithm, and the argument doesn't apply there.

I partially buy that. It does seem to me that a lot of people doing interpretability don't really seem to have a particular goal in mind, and are just generally trying to understand what's going on. Which is not necessarily bad; understanding basically anything in neural nets (including higher-level algorithms) will probably help us narrow in on the answers to the key questions. But it means that a lot of work is not narrowly focused on the key hard parts (i.e. how to assign external meaning to internal structures).

One point of using such methods is to enforce or encourage certain high-level algorithmic properties, e.g. modularity.

Insofar as the things passing between modules are symbols whose meaning we don't robustly know, the same problem comes up. The usefulness of structural/algorithmic properties is pretty limited, if we don't have a way to robustly assign meaning to the things passing between the parts.

Hmm I feel a bit damned by faint praise here... it seems like more than type-checking, you are agreeing substantively with my points (or at least, I fail to find any substantive disagreement with/in your response).

Perhaps the main disagreement is about the definition of interpretability, where it seems like the goalposts are moving... you say (paraphrasing) "interpretability is a necessary step to robustly/correctly grounding symbols".  I can interpret that in a few ways:

  1. "interpretability := mechanistic interpretability (as it is currently practiced)": seems false.
  2. "interpretability := understanding symbol grounding well enough to have justified confidence that it is working as expected": also seems false; we could get good grounding without justified confidence, although it is certainly much better to have the justified confidence.
  3. "interpretability := having good symbol grounding": a mere tautology.

A potential substantive disagreement: I think we could get high levels of justified confidence via means that look very different from (what I'd consider any sensible notion of) "interpretability", e.g. via: 

  • A principled understanding of how to train or otherwise develop systems that ground symbols in the way we want/expect/etc.
  • Empirical work
  • A combination of either/both of the above with mechanistic interpretability

It's not clear that any of these, or their combination, will give us as high a level of justified confidence as we would like, but that's just the nature of the beast (and a good argument for pursuing governance solutions).

A few more points regarding symbol grounding:

  • I think it's not a great framing... I'm struggling to articulate why, but it's maybe something like "There is no clear boundary between symbols and non-symbols"
  • I think the argument I'm making in the original post applies equally well to grounding... There is some difficult work to be done and it is not clear that reverse engineering is a better approach than engineering.

we do not understand the principles of mechanics, thermodynamics or chemistry required to build an engine. 

If this is true, then it makes (mechanistic) interpretability much harder as well, as we'll need our interpretability tools to somehow teach us these underlying principles, as you go on to say.  I don't think this is the primary stated motivation for mechanistic interpretability.  The main stated motivations seem to be roughly "We can figure out if the model is doing bad (e.g. deceptive) stuff and then do one or more of: 1) not deploy it, 2) not build systems that way, 3) train against our operationalization of deception"

I agree with you that there's a good chance that both forms of tractability that you outline here are not true in practice; it does seem like you can't get a mechanistic interpretation of a powerful LM that 1) is both faithful and complete and 2) is human-understandable.* I also think that the mechanistic interpretability community has not yet fully reverse engineered an algorithm from a large neural network that wouldn't have been easier for humans to implement or even to solve with program induction, which we can point at to offer a clear rebuke of the "just make the interpretable AI" approach.


However, I think there are reasons why your analogy doesn't apply to the case of AI:

  • It's wrong to say we have a broken-down car that we just need to fix; we don't know how to build a car that actually does many of the tasks that GPT-3 can do, or even have any idea of how to do them.
  • On the other hand, the elephant really seems to be working. This might be because a lot of the intelligent behavior current models exhibit is, in some sense, irreducibly complex. But it might also just be because it's easier to search through the space of programs when you parameterize them using a large transformer. Under this view, mechanistic interp can work because the elephant is a clean solution to the problems we face, even though evolution is messy. 
  • Relatedly, it does seem like a lot of the reason we can’t do the elephant approach in your analogy is that the elephant isn’t a very good solution to our problems!

IMO a big part of why mechanistic interp is getting a lot of attention in the x-risk community is that neural networks are surprisingly more interpretable than we might have naively expected and there's a lot of shovel-ready work in this area. I think if you asked many people three years ago, they would've said that we'd never find a non-trivial circuit in GPT-2-small, a 125m parameter model; yet Redwood has reverse engineered the IOI circuit in GPT-2-small. Many people were also surprised by Neel Nanda's modular addition work. 

* This is not a knockdown argument against current mechanistic interpretability efforts. I think the main reasons to work on mechanistic interp do not look like "we can literally understand all the cognition behind a powerful AI", but instead "we can bound the behavior of the AI" or "we can help other, weaker AIs understand the powerful AI". For example, we might find good heuristic arguments even if we can't find fully complete and valid interpretations.
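For readers unfamiliar with the techniques behind results like IOI or the modular-addition work, here is a toy sketch of activation patching, the causal-intervention style such circuit-finding relies on (the network and task are invented for illustration, and real work patches individual attention heads and layers rather than whole hidden units):

```python
import numpy as np

# Activation patching, in miniature: run a model on a "clean" and a
# "corrupted" input, then copy one internal activation from the clean run
# into the corrupted run and measure how much of the clean output returns.

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))       # toy 2-layer network with random weights
W2 = rng.normal(size=8)

def forward(x, patch_unit=None, patch_value=None):
    h = np.tanh(x @ W1)            # hidden activations we can intervene on
    if patch_unit is not None:
        h = h.copy()
        h[patch_unit] = patch_value
    return float(h @ W2)

clean, corrupted = np.ones(4), -np.ones(4)
h_clean = np.tanh(clean @ W1)
y_clean, y_corr = forward(clean), forward(corrupted)

# Patch each hidden unit's clean activation into the corrupted run; units
# whose patch moves the output most toward the clean value are candidate
# pieces of the "circuit" for this behavior.
effects = [forward(corrupted, patch_unit=i, patch_value=h_clean[i]) - y_corr
           for i in range(8)]
top_unit = int(np.argmax(np.abs(effects)))
print(f"top unit: {top_unit}, effect: {effects[top_unit]:+.3f}")
```

Because this toy output is linear in the hidden units, the per-unit effects sum exactly to the clean-minus-corrupted output gap; in a real transformer the interactions are much messier, which is part of what makes the reverse engineering hard.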

IMO a big part of why mechanistic interp is getting a lot of attention in the x-risk community is that neural networks are surprisingly more interpretable than we might have naively expected and there's a lot of shovel-ready work in this area. I think if you asked many people three years ago, they would've said that we'd never find a non-trivial circuit in GPT-2-small, a 125m parameter model; yet Redwood has reverse engineered the IOI circuit in GPT-2-small. Many people were also surprised by Neel Nanda's modular addition work.

I don't think I've seen many people be surprised here, and indeed, at least in my model of the world, interpretability is progressing slower than I was hoping for/expecting (like, when I saw the work by Chris Olah 6 years ago, I had hope we would make real progress understanding how these systems think, and that lots of people would end up being able to contribute productively to the field; but our understanding has IMO barely kept up with the changing architectures of the field, and is extremely far from being able to say much of anything definite about how these models do any significant fraction of what they do, and very few people outside of Chris Olah's team seem to have made useful progress).

I would be interested if you can dig up any predictions by people who predicted much slower progress on interpretability. I don't currently believe that many people are surprised by current tractability in the space (I do think there is a trend for people who are working on interpretability to feel excited by their early work, but I think the incentives here are too strong for me to straightforwardly take someone's word for it, though it's still evidence).

I have seen one person be surprised (I think twice in the same convo) about what progress had been made.

ETA: Our observations are compatible.  It could be that people used to a poor and slow-moving state of interpretability are surprised by the recent uptick, but that the absolute progress over 6 years is still disappointing.

I'm very unconvinced by the results in the IOI paper.

I'd be interested to hear in more detail why you're unconvinced.

I think the main reasons to work on mechanistic interp do not look like "we can literally understand all the cognition behind a powerful AI", but instead "we can bound the behavior of the AI"

I assume "bound the behavior" means provide a worst-case guarantee. But if we don't understand all the cognition, how can we provide such a guarantee? How do we know that the part of the AI we don't understand wouldn't ruin our guarantee?
 


we can help other, weaker AIs understand the powerful AI

My understanding of interpretability is that humans understand what the AI is doing.  Weaker AIs understanding the powerful AI doesn't feel like a solution to interpretability. Instead it feels like a solution to amplification that's still uninterpretable by humans.

My understanding of interpretability is that humans understand what the AI is doing. Weaker AIs understanding the powerful AI doesn't feel like a solution to interpretability. Instead it feels like a solution to amplification that's ultimately still uninterpretable by humans.

This somewhat feels like semantics to me - this still feels like a win condition! I don't personally care about whether interpretability helps via humans directly understanding the systems themselves, vs us somewhat understanding it ourselves and being able to use weaker AI systems to fully understand it, so long as it's good enough to make aligned systems.

I also think that interpretability lies on a spectrum rather than being a binary.

I may come back to comment more or incorporate this post into something else I write, but wanted to record my initial reaction, which is that I basically believe the claim. I also think that the 'unrelated bonus reason' at the end is potentially important and probably deserves more thought.

Disclaimer: I work on interpretability at Redwood Research. I am also very interested in hearing a fleshed-out version of this critique. To me, this is related to the critique of Redwood's interpretability approach here, another example of "recruiting resources outside of the model alone".

(however, it doesn't seem obvious to me that interpretability can't or won't work in such settings)

It could work if you can use interpretability to effectively prohibit this from happening before it is too late.  Otherwise, it doesn't seem like it would work.  

Strong tractability: We can build interpretable AGI-level systems without sacrificing too much competitiveness.

Interesting argument! I think my main pushback would be on clarifying exactly what "interpretable" means here. If you mean "we reverse engineer a system so well, and understand it so clearly, that we can use this understanding to build the system from scratch ourselves", then I find your argument somewhat plausible, but I also think it's pretty unlikely that we live in that world. My personal definition of strong tractability would be something like "AGI-level systems are made up of interpretable pieces, which correspond to understandable concepts. We can localise any model behaviour to the combination of several of these pieces, and understand the computation by which they fit together to produce that behaviour". I think this still seems pretty hard, and probably not true! And that if true, this would be a massive win for alignment. But in this world, I still think it's reasonable to expect us to still be unable to define and figure out how to assemble these pieces ourselves - there's likely to be a lot of complexity and subtlety in exactly what pieces form and why, how they're connected together, etc. Which seems much more easily done by a big blob of compute style approach than by human engineering.

I agree it's a spectrum.  I would put it this way: 

  • For any point on the spectrum there is some difficulty in achieving it.
  • We can approach that point from either direction, 1) starting with a "big blob of compute" and encountering the difficulty in extracting these pieces from the blob, or 2) starting with assembling the pieces, and encountering the difficulty in figuring out how to assemble them.
  • It's not at all clear that (1) would be easier than (2).
  • Probably it's best to do some of both. 

Regarding the difficulty of (1) vs. (2): off the top of my head, there may be some sort of complexity-style argument that engineering, say, a circuit is harder than recognizing it.  However, the DNN doesn't hand us the circuit; we still need to extract it using interpretability techniques.  So I'm not sure how I feel about this argument.

Summary:
If interpretability research is highly tractable and we can build highly interpretable systems without sacrificing competitiveness, then it will be better to build such systems from the ground up, rather than taking existing unsafe systems and tweaking them to be safe. By analogy, if you have a non-functioning car, it is easier to bring in functional parts to fix the engine and make the car drive safely than to take a functional elephant and tweak it to be safe. In a follow-up post, the author clarifies that this could be thought of as engineering (well-founded AI) vs. reverse engineering (interpretability). One pushback from John Wentworth is that we currently do not know how to build the car, or how the basic chemistry in the engine actually works; we do interpretability research in order to understand these processes better. Ryan Greenblatt pushes back that the post would be more accurate if the word “interpretability” were replaced with “microscope AI” or “comprehensive reverse engineering”; this is because we do not need to understand every part of complex models in order to tell if they are deceiving us, so the level of interpretability understanding needed for it to be useful is lower than the level needed to build the car from the ground up. Neel Nanda writes a similar comment about how, to him, high tractability is a lower bar, much lower than understanding every part of a system such that we could build it.

it will be better to build such systems from the ground up, rather than taking existing unsafe systems and tweaking them to be safe


I would say "it may be better, and people should seriously consider this" not "it is better".