A Longlist of Theories of Impact for Interpretability

[-]Rohin Shah4y*130

The most interesting substantive disagreement I found in the discussion was that I was comparably much more excited about using interpretability to audit a trained model, and skeptical of interpretability tools being something that could be directly used in a training process without the resulting optimisation pressure breaking the tool, while other people had the reverse view.

Fwiw, I do have the reverse view, but my reason is more that "auditing a trained model" does not have a great story for wins. Like, either you find that the model is fine (in which case it would have been fine if you skipped the auditing) or you find that the model will kill you (in which case you don't deploy your AI system, and someone else destroys the world instead).

There's a path to impact where you (a) see that your model is going to kill you and (b) convince everyone else of this, thereby buying you time (or even solving the problem altogether if we then have global coordination to not build AGI since clearly it would destroy us). I feel skeptical about global coordination (especially as it becomes cheaper and cheaper to build AGI over time) but agree that it could buy you time which then allows alignment to "catch up" and solve the problem. However, this pathway seems pretty conjunctive (it makes a difference in worlds where (a) people were uncertain about AGI risk, (b) your interpretability tools successfully revealed evidence that convinced most of them, and (c) the resulting increase in time made the difference).

In contrast, using interpretability tools is impactful if (a) not using the interpretability tools leads to deception (also required in the previous story), and (b) using the interpretability tools gets rid of that deception.

(Obviously "level of conjunctiveness" isn't the only thing that matters -- you also need probabilities for each of the conjuncts -- but this feels like the highest-level bit of why I'm more excited about putting tools in the training loop.)

(It's also not an either-or, e.g. you could use ELK inside of your training loop, and then do Circuits-style mechanistic interpretability as an audit at the end. But if I were forced to go all-in on one of the two options, it would be the training loop one.)

EDIT (March 26, 2023): Coming back to this comment a year later, I think it undersells the "auditing" theory of impact; there are also effects like "if people know you are auditing your models deeply they are less worried that you'll deploy something risky and so are less likely to race to beat you". I don't have a strong opinion on how those effects play out but they do seem important.

[-]Charlie Steiner4y30

Fwiw, I do have the reverse view, but my reason is more that "auditing a trained model" does not have a great story for wins. Like, either you find that the model is fine (in which case it would have been fine if you skipped the auditing) or you find that the model will kill you (in which case you don't deploy your AI system, and someone else destroys the world instead).

The way I'd put something-like-this is that in order for auditing the model to help (directly), you have to actually be pretty confident in your ability to understand and fix your mistakes if you find one. It's not like getting a coin to land Heads by flipping it again if it lands Tails - different AGI projects are not independent random variables, if you don't get good results the first time you won't get good results the next time unless you understand what happened. This means that auditing trained models isn't really appropriate for the middle of the skill curve.

Instead, it seems like something you could use after already being confident you're doing good stuff, as quality control. This sharply limits the amount you expect it to save you, but might increase some other benefits of having an audit, like convincing people you know what you're doing and aren't trying to play Defect.

[-]Evan R. Murphy4y00

(in which case you don't deploy your AI system, and someone else destroys the world instead).

Can you explain your reasoning behind this a bit more?

Are you saying someone else destroys the world because a capable lab wants to destroy the world, and so as soon as the route to misaligned AGI is possible then someone will do it? Or are you saying that a capable lab would accidentally destroy the world because they would be trying the same approach but either not have those interpretability tools or not be careful enough to use them to check their trained model as well? (Or something else?...)

[-]Rohin Shah4y00

Or are you saying that a capable lab would accidentally destroy the world because they would be trying the same approach but either not have those interpretability tools or not be careful enough to use them to check their trained model as well?

This one.

[-]Evan R. Murphy4y*20

Ok, I think there's a plausible success story for interpretability though where transparency tools become broadly available. Every major AI lab is equipped to use them and has incorporated them into their development processes.

I also think it's plausible that either 1) one AI lab eventually gains a considerable lead/advantage over the others so that they'd have time to iterate after their model fails audit, or 2) if one lab communicated that their audits show a certain architecture/training approach keeps producing models that are clearly unsafe, then the other major labs would take that seriously.

This is why "auditing a trained model" still seems like a useful ability to me.

Update: Perhaps I was reading Rohin's original comment as more critical of audits than he intended. I thought he was arguing that audits will be useless. But re-reading it, I see him saying that the conjunctiveness of the coordination story makes him "more excited" about interpretability for training, and that it's "not an either-or".

[-]Rohin Shah4y30

Yeah I think I agree with all of that. Thanks for rereading my original comment and noticing a misunderstanding :)

[-]Neel Nanda2y*80Review for 2022 Review

Meta level I wrote this post in 1-3 hours, and am very satisfied with the returns per unit time! I don't think this is the best or most robust post I could have written, and I think some of these theories of impact are much more important than others. But I think that just collecting a ton of these in the same place was a valuable thing to do, and have heard from multiple people who appreciated this post's existence! More importantly, it was easy and fun, and I personally want to take this as inspiration to find more, easy-to-write-yet-valuable things to do.

Object level I think the key point I wanted to make with this post was "there's a bunch of ways that interp can be helpful", which I think basically stands. I go back and forth on how much it's valuable to think about theories of impact day to day, vs just trying to do good science and pluck impactful low-hanging fruit, but I think that either way it's valuable to have a bunch in mind rather than carefully back-chaining from a specific and fragile theory of change.

This post got some extensive criticism in Against Almost Every Theory of Impact of Interpretability, but I largely agree with Richard Ngo and Rohin Shah's responses.

[-]TurnTrout3y80

Thinking over the last few months, I came to most strongly endorse (2: Better prediction of future systems), or something close to it. I think that interpretability should adjudicate between competing theories of generalization and value formation in AIs (e.g. figure out whether and in what conditions a network learns a universal mesa objective, versus contextually activated objectives). Secondarily, figure out the mechanistic picture of how reward events form different kinds of cognition in a network (e.g. if I reward the agent for writing this line of code, what does the ensuing gradient mean, statistically, across training runs?).

Also, "is this model considering deceiving me?" doesn't seem like that great of a question. Even an aligned AI would probably at least consider the plan of deceiving you, if that AI's originating lab is dallying on letting it loose, meanwhile unaligned AIs are becoming increasingly capable around the world. Perhaps instead check if the AI is actively planning to kill you -- that seems like better evidence on its alignment properties.

[-]evhub4y60

Use the length of the shortest interpretability explanation of behaviours of the model as a training loss for ELK - the idea is that models with shorter explanations are less likely to include human simulations / you can tell if they do.

Maybe this is not the right place to ask this, but how does this not just give you a simplicity prior?

[-]ryan_greenblatt4y30

By explanation, I think we mean 'reason why a thing happens' in some intuitive (and underspecified) sense. Explanation length gets at something like "how can you cluster/compress a justification for the way the program responds to inputs" (where justification is doing a lot of work). So, while the program itself is a great way to compress how the program responds to inputs, it doesn't justify why the program responds this way to inputs. Thus program length/simplicity prior isn't equivalent. Here are some examples demonstrating where (I think) these priors differ:

The axioms of arithmetic don't explain why the primes have a certain frequency - there is a short justification for this, but it's longer than just the axioms and has to include the axioms.
The explanation of why code golfed programs work is often longer than the programs (at least in English)
The shortest explanation for 'the SHA-512 hash of the first 2000 primes is x' probably has to include a full (long) computation trace despite the fact that a program which computes/checks this can be short.

Here's a short and bad explanation for why this is maybe useful for ELK.

The reason the good reporter works is because it accesses the model's concept for X and directly outputs it. The reason other possible reporter heads work is because they access the model's concept for X and then do something with that (where the 'doing something' might be done in the core model or in the head).

So, the explanation for why the other heads work still has to go through the concept for X, but then has some other stuff tacked on and must be longer than the good reporter.

[-]evhub4y20

The reason other possible reporter heads work is because they access the model's concept for X and then do something with that (where the 'doing something' might be done in the core model or in the head).

I definitely think there are bad reporter heads that don't ever have to access X. E.g. the human imitator only accesses X if X is required to model humans, which is certainly not the case for all X.

[-]Beth Barnes4y10

Seems like a simplicity prior over explanations of model behavior is not the same as a simplicity prior over models? E.g. simplicity of explanation of a particular computation is a bit more like a speed prior. I don't understand exactly what's meant by explanations here. For some kinds of attribution, you can definitely have a simple explanation for a complicated circuit and/long-running computation - e.g. if under a relevant input distribution, one input almost always determines the output of a complicated computation.

[-]evhub4y20

E.g. simplicity of explanation of a particular computation is a bit more like a speed prior.

I don't think that the size of an explanation/proof of correctness for a program should be very related to how long that program runs—e.g. it's not harder to prove something about a program with larger loop bounds, since you don't have to unroll the loop, you just have to demonstrate a loop invariant.

[-]Neel Nanda4y10

Honestly, I don't understand ELK well enough (yet!) to meaningfully comment. That one came from Tao Lin, who's a better person to ask.

[-]Raemon4y20

Curated.

I've long had a vague sense that interpretability should be helpful somehow, but recently when I tried to spell out exactly how it helped I had a surprisingly hard time. I appreciated this post's exploration of the concept.

[-]James Payor4y*20

If the field of ML shifts towards having a better understanding of models ...

I think this would be a negative outcome, and not a positive one.

Specifically, I think it means faster capabilities progress, since ML folks might run better experiments. Or worse yet, they might better identify and remove bottlenecks on model performance.

[-]Charlie Steiner4y20

Fun exercise.

I made a publicly editable google sheet with my own answers already added here (though I wrote down my answers in a text document, without more than glancing at previous answers):

https://docs.google.com/spreadsheets/d/1DmI2Wfp9mDadsH3c068pLlgkHB_XzkXK49A7JkNiEQ4/edit?usp=sharing

Looks like I'm much more interested in interpretability as a cooperation / trust-building mechanism.

[-]Neel Nanda4y30

Good idea, thanks! I made a publicly editable spreadsheet for people to add their own https://docs.google.com/spreadsheets/d/1l3ihluDoRI8pEuwxdc_6H6AVBndNKfxRNPPS-LMU1jw/edit?usp=drivesdk

[-]VojtaKovarik1y10

even an incredibly sophisticated deceptive model which is impossible to detect via the outputs may be easy to detect via interpretability tools (analogy - if I knew that sophisticated aliens were reading my mind, I have no clue how to think deceptive thoughts in a way that evades their tools!)

It seems to me that your analogy is the wrong way arond. IE, the right analogy would be "if I knew that a bunch of 5-year olds were reading my mind, I have...actually, a pretty good idea how to think deceptive thoughts in a way that avoids their tools".

(For what it's worth, I am not very excited about interpretability as an auditing tool. This -- ie, that powerful AIs might avoid this -- is one half of the reason. The other half is that I am sceptical that we will take audit warnings seriously enough -- ie, we might ignore any scheming that falls short of "clear-cut example of workable plan to kill many people". EG, ignore things like this. Or we might even decide to "fix" these issues by putting the interpretability in the loss function, and just deploying once the loss goes down.)

[-]Neel Nanda1y20

How would you evade their tools?

[-]VojtaKovarik1y10

If I could assume things like "they are much better at reading my inner monologue than my non-verbal thoughts", then I could create code words for prohibited things.
I could think in words they don't know.
I could think in complicated concepts they haven't understood yet. Or references to events, or my memories, that they don't know.
I could leave a part of my plans implicit, and only figure out the details later.
I could harm them through some action for which they won't understand that it is harmful, so they might not be alarmed even if they catch me thinking it. (Leaving a gas stove on with no fire.)
And then there are the more boring things, like if you know more details about how the mind-reading works, you can try to defeat it. (Make the evil plans at night when they are asleep, or when the shift is changing, etc.)

(Also, I assume you know Circumventing interpretability: How to defeat mind-readers, but mentioning it just in case.)

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

47

A Longlist of Theories of Impact for Interpretability

47