Neel Nanda


200 Concrete Open Problems in Mechanistic Interpretability

Wiki Contributions


Thanks! Yes, your description of zero ablation is correct. I think positive or negative is a matter of convention? To me "positive = is important" and "negative = damaging" is the intuitive way round,which is why I set it up the way I did.

And yeah, I would be excited to see this applied to mean ablation!

Thanks for noting the bugs, I should really freeze the demos on a specific version of the library...

Er, maybe if we get really good at doing patching-style techniques? But there's definitely not an obvious path - I more see lie detectors as one of the ultimate goals of mech interp, but whether this is actually possible or practical is yet to be determined.

Oh, ugh, Typeguard was updated to v3 and this broke things. And the circuitsvis import was a mistake. Should be fixed now, thanks for flagging!

I really like this idea! Making advance predictions feels like a much more productive way to engage with other people's work (modulo trusting you to have correctly figured out the answers)

Predictions below (note that I've chatted with the team about their results a bit, and so may be a bit spoiled - I'll try to simulate what I would have predicted without spoilers)

Behavioral Describe how the trained policy might generalize from the 5x5 top-right cheese region, to cheese spawned throughout the maze? IE what will the policy do when cheese is spawned elsewhere?

It'll still go to the top right, but when it's near the cheese (within a radius 5 square centered on the cheese) it'll go there instead - the simplest algorithm is "go to the right in general" and "when near the cheese navigate to it". But because the top right of the maze moves position (the varying maze size in particular makes this messy) and it's a convnet, it's maybe easiest to learn the "go to the cheese when nearby" algorithm to be translation invariant. I think I predict it'll drop off sharply with distance, maybe literally "within 5 squares = navigate, > 5 squares = don't navigate", maybe not. But maybe the top right square can contain disconnected regions, so the mouse will need to calculate which region to get to, rather than just "go to the top right"? In which case it'll probably be good at the cheese from much further away.

Given a fixed trained policy, what attributes of the level layout (e.g. size of the maze, proximity of mouse to left wall) will strongly influence P(agent goes to the cheese)?

Distance from the cheese is the main thing - size of maze, proximity to left wall etc just modulate the distance from the cheese. I think absolute Euclidean distance from the cheese matters, not distance in maze space.

Write down a few guesses for how the trained algorithm works (e.g. “follows the right-hand rule”).

The maze is a tree, so right-hand rule should totally work. But it has enough parameters that it can probably learn something more sophisticated, if it's incentivised to get to the cheese fast? (I can't find whether it has time discounted reward or not in the paper - let's assume it either does, or has a fixed episode length, so that either way it wants to get there fast). There's the cheap optimisation of "don't go down shallow dead-ends" which I'm sure it's learned. I don't know whether it has learned enough to actually identify and avoid deep dead ends? I'd set up a pathological situation with two long branches of the tree, both ending in the top right, and swap which branch the cheese is in, and look at the model behaviour.

As for the mechanistic algorithm, I'm not sure! ConvNets really don't seem like the right architecture to naturally model mazes lol. I'm guessing some kind of recursive divide and conquer? In general, divide the maze up into patches, and map out the graph structure of the patch (which of the points on the edge can get to each other and which can't), and then repeatedly merge patches into bigger patches (naively using different channels for each pair of points on the edge - maybe this is too expensive?). And then for patches with the cheese, instead track which points can access the cheese, and for patches with the mouse, track the movement required to get the mouse one step towards each point on the edge.

Refined idea: because it's a tree, we can actually get a lot of efficiency out. For each patch, it'd suffice to track which points on the edge are in the same connected components. When we merge two patches, adjacent points get their connected components merged, and points adjacent to a wall don't get anything merged. If you're merging with a patch with a point connected to the cheese, this is now the "contains the cheese" connected component. I don't have a great picture of how to translate this algorithm into neurons and matmuls though - the thorniest bit is representing which points are in the same connected component or not, without being able to dynamically re-allocate channels. Maybe having a channel for each point on the perimeter of the patch, and having a 1 in that channel means "is in the same connected component" and having a 0 means not?

Alternately, it's plausible that that algorithm blows up when your patches get too big. If the model only does local search when far away from the cheese (eg searching the square of radius 8 around it), maybe patches never need to get bigger than that, and the model can just learn a bias towards the top right?

Is there anything else you want to note about how you think this model will generalize?

A lot of my confusions stem from how complex an algorithm I expect the agent to learn. It sounds like the training is varied enough that it should learn a near optimal algorithm, though I don't know how rare weird edge cases are (eg a maze with two long branches, both ending in the top right, where the model needs to figure out which contains the cheese).

Model Editing

Without proportionally reducing top-right corner attainment by more than 25% in decision-square-containing mazes (e.g. 50% → .5*.75 = 37.5%), we can[1] patch activations so that the agent has an X% proportional reduction in cheese acquisition, for X=

I don't really understand the question - you call patching from "the same maze but with the cheese in the top right" trivial - I actually think this is where the interesting setting is! The challenge is to find the minimal set of activations that need to be patched.


60? It'd have been 70, but I deduct 10 points for maybe screwing up probe training. The model should care about the question of cheese location and want to represent it. It might just do this in the top right, but I think it's easier to do in general, giving convolutional-ness, it doesn't have the chance to break translational symmetry so early. It's not clear to me if the model wants to represent cartesian coordinates though - in particular, each channel can only represent things relative to its current position because convolutions, though your probe can break symmetry, which might be enough? It's unclear whether the model will prefer relative position in the maze, relative position to the mouse, or absolute position (Cartesian coordinates), but because convolutions it probably can only learn Cartesian coordinates so early on. I can't decide if 70% is high or low for accuracy lol.


65 - this just seems like what must obviously be going on, unless this algorithm just gets worse performance in pathological cases where cheese position when far away from the mouse (but cheese is still in the top right), and these occur enough to learn a general algorithm. Though even then, it's probably easier to learn something that only works if the cheese is up and to the right of the mouse? Though it's unclear to me whether cheese finding = locally find the cheese or globally find the cheese


I don't really get what this means - what does "more promising" and "we will conclude" mean? RL fine-tuning will obviously work (I think), the question is whether editing and intervention can work.


  1. This is a hard one! If the model has a local only cheese circuit, it may be about as hard to learn a general cheese algorithm as the whole thing. It may also be hard to unlearn the existing algorithm, idk how this kind of thing works. But it'll also have learned a lot of the key basic primitives already, so the rest should be easy. And 10% of the compute used in training and 10% of the minimal compute required to be decent at maze solving are not necessarily the same thing, depending on when the model stopped training!

In at least 75% of randomly generated mazes, we can easily edit the network to navigate to a range of maze destinations (e.g. coordinate x=4, y=7), by hand-editing at most X% of activations, for X= .01 ( 15 %) .1 ( 20 %) 1 ( 22 %) 10 ( 25 %) (Not possible) ( 75 %)

(Interpreting the question as wanting probability of "at most that many" rather than mutually exclusive probabilities). It'll also depend on what you mean by X% of the activations - if you treat each channel for each height and width as a separate activation this should be way easier. But my guess is that 75% of mazes and specific position seems way too high.

I'd guess that if there is a "cheese channel" in an early filter, editing that should be easy. The core question is whether there's a general cheese finding circuit triggered by that, even in a different quadrant? It also depends on how precise you want to be about "navigate to precisely" - making one of the correct 4 moves towards a far away cell seems maybe easier than surgical precision when nearby.

Internal goal representation The network has a “single mesa objective” which it “plans” over, in some reasonable sense ( 10 %) The agent has several contextually activated goals ( 40 %) The agent has something else weirder than both (1) and (2) ( 50 %) (The above credences should sum to 1.)

Single mesa objective would be surprising! My model is mostly on several goals, but I put high credence on "something else weirder" just because models are kinda cursed, and my prior is always against people finding the clean structure, even if it's there.

Other questions

  1. 65%
  2. 55%
  3. 25%

Actually doing model editing is hard! And in this kind of geometric, convolutional setting, I expect it to be hard to disentangle model goals from the actual model of the maze. But probably there'll be some channel/directions in activation space that correspond to things like the cheese?

Conformity with update rule

At decision squares in test mazes where the cheese can be anywhere, the policy will put max probability on the maximal-value action at least X% of the time, for X= 25 ( 30 %) 50 ( 45 %) 75 ( 20 %) 95 ( 4 %) 99.5 ( 1 %)

In test mazes where the cheese can be anywhere, averaging over mazes and valid positions throughout those mazes, the policy will put max probability on the maximal-value action at least X% of the time, for X= 25 ( 25 %) 50 ( 28 %) 75 ( 40 %) 95 ( 5 %) 99.5 ( 2 %)

In training mazes where the cheese is in the top-right 5x5, averaging over both mazes and valid positions in the top-right 5x5 corner, the policy will put max probability on the maximal-value action at least X% of the time, for X= 25 ( 5 %) 50 ( 12 %) 75 ( 50 %) 95 ( 30 %) 99.5 ( 3 %)

Great question! My concrete suggestion is to look for interesting neurons in Neuroscope, as I discuss more in the final post. This is a website I made that shows the text that most activates each neuron in the model (for a ton of open source models), and by looking for interesting neurons, you can hopefully find some hook - find a specific task the model can consistently-ish do, analogous to IOI (with a predictable structure you can generate prompts for, ideally with a somewhat algorithmic flavour - something you could write code to solve). And then do the kinds of things in the IOI notebook. Though note that for a 1L model, you can actually mechanistically look at the weights and break down what the model is doing!

On a meta level, the strategy you want to follow in a situation like this is what I call maximising surface area. You want to explore things and try to get exposed to as many random details about the model behaviour as you can. So that you can then serendipitiously notice something interesting and dig into it. The meta-lesson is that when you feel stuck and meandering, you want to pick some purpose to strive for, but that purpose can just be "put yourself in a situation where you have so much data and context that you can spontaneously stumble across something interesting, and cast a really wide net". Concretely, you want to look for some kind of task/capability that the model is capable of, so you can then try to reverse-engineer it. And a good way to do this is just to run the model on a bunch of dataset examples and look at what it's good at, and see if you can find any consistent patterns to dig into. To better explore this, I made a tool to visualise the top 10 tokens predicted for each token in the text in Alan Cooney's CircuitsVis library. You can filter for interesting text by eg looking for tokens where the model's log prob for the correct next token is significantly higher than attn-only-1l, to cut things down to where the MLPs matter (I'd cut off the log prob at -6 though, so you don't just notice when attn-only-1l is really incorrect lol).

Really nice post! I think this is an important point that I've personally been confused about in the past, and this is a great articulation (and solid work for 2 hours!!)

Thanks for this post! I'm not sure how much I expect this to matter in practice, but I think that the underlying point of "sometimes the data distribution matters a lot, and ignoring it is suspect" seems sound and well made.

I personal think it's clear that 1L attn-only models are not literally just doing skip trigrams. A quick brainstorm of other things I presume they're doing:

  • Skip trigrams with positional decay - it's easy enough to add a negative term to the attention scores that gets bigger the further away the source token is. For skip trigrams like "keep ... in -> mind" that clearly want to be trigrams, it seems like this has to be going on. You can mediate it with a high BOS attention, so it doesn't attend locally when no skip trigram fires
  • Hierarchical skip trigrams (somewhat like your example) - if skip trigram 1 triggers, it stops skip trigram 2 from triggering.
  • Dealing with the level of saturation - for a trigram A ... B -> C, it's unclear what happens if there are multiple copies of A, and the model can choose how to mediate this.
    • Toy example - let's say attention to A and to BOS are all that matters (everything else is -1000). If BOS is 0 and A is 10, then the model doesn't care if there are multiple As, it saturates immediately. If BOS is 5 and A is 0, then it's now basically linear in the number of As. (In practice I expect it'll be somewhere in the middle)
  • "context detection" - text tends to cluster into different contexts (books, wikipedia, etc), which have very different unigram and bigram statistics, and a model could learn to do the correct unigram updates for any destination token, conditional on a bunch of relevant source tokens being there.
    • There are probably some tokens that are way more common in some contexts than others (" yield", " return", "\n\t\t" in Python code, etc), and a model could learn skip trigrams from any dest token to these source tokens (probably implemented via the query bias), whose OV circuit just boosts the unigram direction.
    • In some sense this is an ensemble of skip trigrams, I think it's different because the natural way to implement it is to have the total attention paid to the special tokens saturate at uniform beyond a certain threshold of possible source tokens
  • Skip bigrams - when there's a previous token head that acts semi-independently of the destination token, to implement A __ -> C behaviour. I expect this arises for eg can|'t| vs don|'t|, where the previous token significantly disambiguates the current token.

I don't know if they'd put it like this, but IMO solving/understanding superposition is an important part of being able to really grapple with circuits in language models, and this is why it's a focus of the Anthropic interp team

Oh wait, that FAQ is actually nothing to do with GPT-3. That's about their embedding models, which map sequences of tokens to a single vector, and they're saying that those are normalised. Which is nothing to do with the map from tokens to residual stream vectors in GPT-3, even though that also happens to be called an embedding

Load More