All of Neel Nanda's Comments + Replies

Thanks! Yes, your description of zero ablation is correct. I think positive or negative is a matter of convention? To me "positive = is important" and "negative = damaging" is the intuitive way round, which is why I set it up the way I did.

And yeah, I would be excited to see this applied to mean ablation!
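
To make the sign convention concrete, here is a minimal sketch of zero vs mean ablation on a made-up additive toy model. The `loss` and `ablation_effect` names, the dataset, and the additive structure are all illustrative assumptions, not the library's API. The effect is defined as ablated loss minus clean loss, so positive means the component was important and negative means it was damaging.

```python
from statistics import mean

# Hypothetical toy "model": the output is the sum of each component's
# contribution, and the loss is the squared error against a target.
def loss(contributions, target):
    return (sum(contributions) - target) ** 2

def ablation_effect(dataset, targets, index, mode="zero"):
    """Average change in loss when component `index` is ablated.

    Positive = ablating hurts (the component was important);
    negative = ablating helps (the component was damaging).
    """
    # Zero ablation replaces the component's output with 0; mean ablation
    # replaces it with its average over the dataset.
    if mode == "zero":
        replacement = 0.0
    else:
        replacement = mean(row[index] for row in dataset)
    effects = []
    for row, target in zip(dataset, targets):
        ablated = list(row)
        ablated[index] = replacement
        effects.append(loss(ablated, target) - loss(row, target))
    return mean(effects)

# Illustrative data: component 1 carries most of the signal,
# component 2 is noise around zero.
dataset = [[1.0, 2.0, 0.5], [1.2, 1.8, -0.5], [0.8, 2.2, 0.0]]
targets = [3.0, 3.0, 3.0]

zero_signal = ablation_effect(dataset, targets, 1, mode="zero")
zero_noise = ablation_effect(dataset, targets, 2, mode="zero")
mean_signal = ablation_effect(dataset, targets, 1, mode="mean")
```

In this toy setup `zero_signal` comes out positive and `zero_noise` negative, and mean-ablating component 1 changes the loss far less than zero-ablating it, since the mean preserves most of what that component contributes.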

Thanks for noting the bugs, I should really freeze the demos on a specific version of the library...

1Xander Davies2d
Makes sense! Depends on if you're thinking about the values as "estimating zero ablation" or "estimating importance."

Er, maybe if we get really good at doing patching-style techniques? But there's definitely not an obvious path - I more see lie detectors as one of the ultimate goals of mech interp, but whether this is actually possible or practical is yet to be determined.

Oh, ugh, Typeguard was updated to v3 and this broke things. And the circuitsvis import was a mistake. Should be fixed now, thanks for flagging!

1Charlie Steiner4d
Thanks for the cool notebook!

I really like this idea! Making advance predictions feels like a much more productive way to engage with other people's work (modulo trusting you to have correctly figured out the answers)

Predictions below (note that I've chatted with the team about their results a bit, and so may be a bit spoiled - I'll try to simulate what I would have predicted without spoilers)

Behavioral Describe how the trained policy might generalize from the 5x5 top-right cheese region, to cheese spawned throughout the maze? IE what will the policy do when cheese is spawned elsewh

... (read more)

Great question! My concrete suggestion is to look for interesting neurons in Neuroscope, as I discuss more in the final post. This is a website I made that shows the text that most activates each neuron in the model (for a ton of open source models), and by looking for interesting neurons, you can hopefully find some hook - find a specific task the model can consistently-ish do, analogous to IOI (with a predictable structure you can generate prompts for, ideally with a somewhat algorithmic flavour - something you could write code to solve). And then do the... (read more)

Really nice post! I think this is an important point that I've personally been confused about in the past, and this is a great articulation (and solid work for 2 hours!!)

3Lawrence Chan1mo
Thanks! (As an amusing side note: I spent 20+ minutes after finishing the writeup trying to get the image from the recent 4-layer docstring circuit post to preview properly the footnotes, and eventually gave up. That is, a full ~15% of the total time invested was spent on that footnote!)

Thanks for this post! I'm not sure how much I expect this to matter in practice, but I think that the underlying point of "sometimes the data distribution matters a lot, and ignoring it is suspect" seems sound and well made.

I personally think it's clear that 1L attn-only models are not literally just doing skip trigrams. A quick brainstorm of other things I presume they're doing:

  • Skip trigrams with positional decay - it's easy enough to add a negative term to the attention scores that gets bigger the further away the source token is. For skip trigrams like
... (read more)
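
A minimal sketch of the positional decay idea, assuming the decay is a simple linear penalty on distance (the `decay` parameter and the function names are made up for illustration; a real model would have to implement something like this through its positional embeddings):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_with_decay(content_scores, decay):
    """Attention pattern for the final (destination) position.

    content_scores[i] is the content-based score for source position i.
    `decay` is a hypothetical per-token penalty: the further the source is
    from the destination, the larger the negative term added to its score.
    """
    dest = len(content_scores) - 1
    scores = [s - decay * (dest - i) for i, s in enumerate(content_scores)]
    return softmax(scores)

# The same "interesting" source token appears at positions 0 and 3.
content = [2.0, 0.0, 0.0, 2.0, 0.0]
no_decay = attention_with_decay(content, decay=0.0)
with_decay = attention_with_decay(content, decay=0.5)
```

Without decay the two copies of the token get equal attention. With decay the nearer copy wins, which is the behaviour a pure skip trigram table can't express.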

I don't know if they'd put it like this, but IMO solving/understanding superposition is an important part of being able to really grapple with circuits in language models, and this is why it's a focus of the Anthropic interp team

2Lawrence Chan1mo
At least based on my convos with them, the Anthropic team does seem like a clear example of this, at least insofar as you think understanding circuits in real models with more than one MLP layer in them is important for interp -- superposition just stops you from using the standard features as directions approach almost entirely!

Oh wait, that FAQ is actually nothing to do with GPT-3. That's about their embedding models, which map sequences of tokens to a single vector, and they're saying that those are normalised. Which is nothing to do with the map from tokens to residual stream vectors in GPT-3, even though that also happens to be called an embedding

but a quick inspection of the embeddings available through the huggingface model shows this isn't the case

That's GPT-2 though, right? I interpret that Q&A claim as saying that GPT-3 does the normalisation, I agree that GPT-2 definitely doesn't. But idk, doesn't really matter

For prompt generation, we normalise the embeddings ourselves and constrain the search to that space, which results in better performance.

Interesting, what exactly do you mean by normalise? GPT-2 presumably breaks if you just outright normalise, since different tokens have very different norms

TLDR: The model ignores weird tokens when learning the embedding, and never predicts them in the output. In GPT-3 this means the model breaks a bit when a weird token is in the input, and will refuse to ever output it because it's hard coded the frequency statistics, and its "repeat this token" circuits don't work on tokens it never needed to learn it for. In GPT-2, unlike GPT-3, embeddings are tied, meaning W_U = W_E.T, which explains much of the weird shit you see, because this is actually behaviour in the unembedding not the embedding (weird tokens neve... (read more)

GPT-J uses the GPT-2 tokenizer and has untied embeddings.

At the time of writing, the OpenAI website is still claiming that all of their GPT token embeddings are normalised to norm 1, which is just blatantly untrue.

Why do you think this is blatantly untrue? I don't see how the results in this post falsify that hypothesis

1Jessica Rumbelow1mo
This link: [] says that token embeddings are normalised to length 1, but a quick inspection of the embeddings available through the huggingface model shows this isn't the case. I think that's the extent of our claim. For prompt generation, we normalise the embeddings ourselves and constrain the search to that space, which results in better performance. 

I appreciate this post, and vibe a lot!

Different jobs require different skills.

Very strongly agreed, I did 3 different AI Safety internships in different areas, where I think I was fairly mediocre in each, before I found that mech interp was a good fit.

Also strongly agreed on the self-evaluation point, I'm still not sure I really internally believe that I'm good at mech interp, despite having pretty solid confirmation from my research output at this point - I can't really imagine having it before completing my first real project!

Thanks! I'd be excited to hear from anyone who ends up actually working on these :)

I threw together a rough demo of converting Tracr to PyTorch (to a mech interp library I'm writing called TransformerLens), and hacked it to be compatible with Python 3.8 - hope this makes it easier for other people to play with it! (All bugs are my fault)

Ah, thanks! Haven't looked at this post in a while, updated it a bit. I've since made my own transformer tutorial which (in my extremely biased opinion) is better esp for interpretability. It comes with a template notebook to fill out alongside part 2, (with tests!) and by the end you'll have implemented your own GPT-2.

More generally, my getting started in mech interp guide is a better place to start than this guide, and has more on transformers!

Super interesting, thanks! I hadn't come across that work before, and that's a cute and elegant definition.

To me, it's natural to extend this to specific substrings in the document? I believe that models are trained with documents chopped up and concatenated into segments that fully fit the context window, so it feels odd to talk about the document as the unit of analysis. And in some sense a 1000 token document is actually 1000 sub-tasks of predicting token k given the prefix up to token k-1, each of which can be memorised.

Maybe we should just not apply a gradient update to the tokens in the repeated substring? But keep the document in and measure loss on the rest.

Er, I'm bad at time estimates at the best of times. And this is a particularly hard case, because it's going to depend wildly on someone's prior knowledge and skillset and you can choose how deep to go, even before accounting for general speed and level of perfectionism. Here are some rough guesses:

ML pre-reqs 10-40h Transformer implementation 10-20h Mech Interp Tooling 10-20h Learning about MI Field 5-20h

But I am extremely uncertain about these. And I would rather not put these into the main post, since it's easy to be misleading and easy to cause people ... (read more)

Interesting context, thanks for writing it up!

But language models seem like they morally should memorize some data points. Language models should recite the US constitution and Shakespeare and the Bible

I'm curious how you'd define memorisation? To me, I'd actually count this as the model learning features - a bunch of examples will contain the Bible verse as a substring, and so there's a non-trivial probability that any input contains it, so this is a genuine property of the data distribution. It feels analogous to the model learning bigrams or trigram... (read more)

An operational definition which I find helpful for thinking about memorization is Zhang et al's counterfactual memorization.

The counterfactual memorization of a document x is (roughly) the amount that the model's loss on x degrades when you remove x from its training dataset.

More precisely, it's the difference in expected loss on x between models trained on data distribution samples that happen to include x, and models trained on data distribution samples that happen not to include x.

This will be lower for ... (read more)
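
As a rough sketch of how you might estimate this, following the loss-based phrasing above: train many models on random subsets of the data, record which documents each saw, and compare the average loss on a document between models that did and didn't train on it. The record format, the function name and the numbers below are all made up for illustration.

```python
from statistics import mean

def counterfactual_memorization(runs, doc_id):
    """Estimate counterfactual memorization of `doc_id`.

    `runs` is a list of (training_doc_ids, loss_by_doc) pairs, one per model
    trained on a random subset of the data (a hypothetical record format).
    Returns mean loss when the doc was held out minus mean loss when it was
    included, so larger means "more memorized".
    """
    loss_in = [losses[doc_id] for docs, losses in runs if doc_id in docs]
    loss_out = [losses[doc_id] for docs, losses in runs if doc_id not in docs]
    return mean(loss_out) - mean(loss_in)

# Illustrative numbers only: "quote" is only predicted well by models that
# saw it, while "generic" is predicted about equally well either way.
runs = [
    ({"quote", "generic"}, {"quote": 0.1, "generic": 2.0}),
    ({"quote"}, {"quote": 0.3, "generic": 2.2}),
    ({"generic"}, {"quote": 3.0, "generic": 2.0}),
    (set(), {"quote": 2.8, "generic": 2.1}),
]

quote_score = counterfactual_memorization(runs, "quote")
generic_score = counterfactual_memorization(runs, "generic")
```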

2Christopher Olah2mo
Qualitatively, when I discuss "memorization" in language models, I'm primarily referring to the phenomenon of language models producing long quotes verbatim if primed with a certain start. I mean it as a more neutral term than overfitting. Mechanistically, the simplest version I imagine is a feature which activates when the preceding N tokens match a particular pattern, and predicts a specific N+1 token. Such a feature is analogous to the "single data point features" in this paper. In practice, I expect you can have the same feature also make predictions about the N+2, N+3, etc tokens via attention heads. This is quite different from a normal feature in the fact that it's matching a very specific, exact pattern. Agreed! This is why I'm describing it as "memorization" (which, again, I mean more neutrally than overfitting in the context of LLMs) and highlight that it really does seem like language models morally should do this. Although there's also lots of SEO spam that language models memorize because it's repeated which one might think of as overfitting, even though they're a property of the training distribution.

I like the "explain your ideas to other people" point, it seems like an important caveat/improvement to the "have good collaborators" strategy I describe above

Importantly, the bar for "good person to explain ideas to" is much lower than the bar for "is a good collaborator". Finding good collaborators is hard!

Thanks for writing this post! (And man, if this is you deliberately writing fast and below your standards, you should lower your standards way more!). I very strongly agree with this within mechanistic interpretability and within pure maths (and it seems probably true in ML and in life generally, but those are the two areas I feel vaguely qualified to comment on).

Aversion to Schlepping

Man, I strongly relate to this one... There have been multiple instances of me having an experiment idea I put off for days to weeks, only to do it in 1-3 hours and get r... (read more)

3Lawrence Chan3mo
Thanks! This is probably true in general, to be honest. However, it's an explanation for why people don't do anything, and I'm not sure this differentially leads to delaying contact with reality more than say, delaying writing up your ideas in a Google doc.  I like the "explain your ideas to other people" point, it seems like an important caveat/improvement to the "have good collaborators" strategy I describe above. I also think the meta strategy point of building a good workflow is super important!

Idk, it might be related to double descent? I'm not that convinced.

Firstly, IMO, the most interesting part of deep double descent is the model size wise/data wise descent, which totally doesn't apply here.

They did also find epoch wise (different from data wise, because it's trained on the same data a bunch), which is more related, but looks like test loss going down, then going up again, then going down. You could argue that grokking has test loss going up, but since it starts at uniform test loss I think this doesn't count.

My guess is that the descent part ... (read more)

Missed a period (I'm impressed I didn't miss more tbh, I find it hard to remember that you're supposed to have them at the end of paragraphs)

You're welcome, though did you miss a period here or did you want to write more?

I like the analogy! I hadn't explicitly made the connection, but strongly agree (both that this is an important general phenomenon, and that it specifically applies here). Though I'm pretty unsure how much I/other MI researchers are in 1 vs 3 when we try to reason about systems!

To be clear, I definitely do not want to suggest that people don't try to rigorously reverse engineer systems a bunch, and be super details oriented. Linked to your comment in the post.

Thanks! That's a great explanation, I've integrated some of this wording into my MI explainer (hope that's fine!)

Cool, agreed. Maybe my main objection is just that I'd have put it last not first, but this is a nit-pick

I'm really appreciating the series of brief posts on Alignment relevant papers plus summaries!

Dumb question: You say that your toy model generation process gets correlated features. But doesn't it just get correlated feature probabilities? That is, given that you know the probabilities of feature 1 and feature 2 being present, doesn't knowing that feature 1 is actually present tell you nothing about feature 2?

1Lee Sharkey3mo
That's correct. 'Correlated features' could ambiguously mean "Feature x tends to activate when feature y activates" OR "When we generate feature direction x, its distribution is correlated with feature y's". I don't know if both happen in LMs. The former almost certainly does. The second doesn't really make sense in the context of LMs since features are learned, not sampled from a distribution.

Non X-risks from AI are still intrinsically important AI safety issues.

I want to push back on this - I think it's true as stated, but that emphasising it can be misleading. 

Concretely, I think that there can be important near-term, non-X-risk AI problems that meet the priority bar to work on. But the standard EA mindset of importance, tractability and neglectedness still applies. And I think often near-term problems are salient and politically charged, in a way that makes these harder to evaluate. 

I think these are most justified on problems with... (read more)

2Stephen Casper3mo
No disagreements substance-wise. But I'd add that I think work to avoid scary autonomous weapons is likely at least as important as recommender systems. If this post's reason #1 were the only reason for working on neartermist AI stuff, then it would probably be like a lot of other very worthy but likely not top-tier impactful issues. But I see it as emphasis-worthy icing on the cake given #2 and #3.

I strongly agree with the message in this post, but think the title is misleading. When I read it, it seemed to imply that alignment is distinct from near-term alignment concerns, while after having read it, it's specifically about how AI is used in the near-term. A title like "AI Alignment is distinct from how it is used in the near-term" would feel better to me.

I'm concerned about this, because I think the long-term vs near-term safety distinctions are somewhat overrated, and really wish these communities would collaborate more and focus more on the comm... (read more)

What's the mechanism you're thinking of, through which hype does damage?

This ship may have sailed at this point, but to me the main mechanism is getting other actors to pay attention, focus on the most effective kind of capabilities work, and making it more politically feasible to raise support. Eg, I expect that the media firestorm around GPT-3 made it significantly easier to raise the capital + support within Google Brain to train PaLM. Legibly making a ton of money with it falls in a similar category to me.

Gopher is a good example of not really seein... (read more)

2Lawrence Chan3mo
Wouldn't surprise me if this was true, but I agree with you that it's possible the ship has already sailed on LLMs. I think this is more so the case if you have a novel insight about what paths are more promising to AGI (similar to the scaling hypothesis in 2018)---getting ~everyone to adopt that insight would significantly advance timelines, though I'd argue that publishing it (such that only the labs explicitly aiming at AGI like OpenAI and Deepmind adopt it) is not clearly less bad than hyping it up. Surely this is because it didn't say anything except "Deepmind is also now in the LLM game", which wasn't surprising given Geoff Irving left OpenAI for Deepmind? There weren't significant groundbreaking techniques used to train Gopher as far as I can remember.  Chinchilla, on the other hand, did see a ton of fanfare.  Cool. I agree with you that conceptual work is bad in part because of a lack of good examples/grounding/feedback loops, though I think this can be overcome with clever toy problem design and analogies to current problems (that you can then get the examples/grounding/feedback loops from). E.g. surely we can test toy versions of shard theory claims using the small algorithmic neural networks we're able to fully reverse engineer.

I appreciate this post! It feels fairly reasonable, and much closer to my opinion than (my perception of) previous MIRI posts. Points that stand out:

  • Publishing capabilities work is notably worse than just doing the work.
    • I'd argue that hyping up the capabilities work is even worse than just quietly publishing it without fanfare.
    • Though, a counter-point is that if an organisation doesn't have great cyber-security and is a target for hacking, capabilities can easily leak (see, eg, the Soviets getting nuclear weapons 4 years after the US, despite it being a
... (read more)
4Lawrence Chan3mo
What's the mechanism you're thinking of, through which hype does damage? I also doubt that good capabilities work will be published "without fanfare", given how watched this space is.  I think this is more an indictment of existing work, and less a statement about what work needs to be done. e.g. my guess is we'll both agree that the original inner alignment work from Evan Hubinger is pretty decent conceptual research. And I think much conceptual work seems pretty serial to me, and is hard to parallelize due to reasons like "intuitions from the lead researcher are difficult to share" and communications difficulties in general.  Of course, I also do agree that there's a synergy between empirical data and thinking -- e.g. one of the main reasons I'm excited about Redwood's agenda is because it's very conceptually driven, which lets it be targeted at specific problems (for example, they're coming with techniques that aim to solve the mechanistic anomaly detection problem [], and finding current analogues and doing experiments with those). 

I'm interested in hearing other people's takes on this question! I also found that a tiny modular addition model was very clean and interpretable. My personal guess is that discrete input data lends itself to clean, logical algorithms more so than continuous input data, and that image models need to devote a lot of parameters to processing the inputs into meaningful features at all, in a way that leads to the confusion. OTOH, maybe I'm just overfitting.

Exciting! I look forward to the first "interesting circuit entirely derived by causal scrubbing" paper

2Ryan Greenblatt4mo
I would typically call MLP(x) = f(x) + (MLP(x) - f(x)) a non-linear decomposition as f(x) is an arbitrary function. Regardless, any decomposition into a computational graph (that we can prove is extensionally equal) is fine. For instance, if it's the case that MLP(x) = combine(h(x), g(x)) (via extensional equality), then I can scrub h(x) and g(x) individually. One example of this could be a product, e.g, suppose that MLP(x) = h(x) * g(x) (maybe like swiglu or something).

Thanks for the clarification! If I'm understanding correctly, you're saying that the important part is decomposing activations (linearly?) and that there's nothing really baked in about what a component can and cannot be. You normally focus on components, but this can also fully encompass the features as directions frame, by just saying that "the activation component in that direction" is a feature?

2Kshitij Sachan4mo
Yes! The important part is decomposing activations (not necessarily linearly). I can rewrite my MLP as: MLP(x) = f(x) + (MLP(x) - f(x)) and then claim that the MLP(x) - f(x) term is unimportant. There is an example of this in the parentheses balancer example.
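
A minimal sketch of that rewrite on a made-up scalar "MLP" (the functions `mlp`, `f`, `residual` and `rewritten` are illustrative stand-ins, not code from the causal scrubbing post). The rewritten model is exactly equivalent to the original, and the claim "the residual term is unimportant" is tested by feeding that term a different input and checking the output barely moves.

```python
def relu(z):
    return max(0.0, z)

def mlp(x):
    # Toy stand-in for an MLP: one dominant term plus a small extra term.
    return 3.0 * relu(x) + 0.01 * relu(1.0 - x)

def f(x):
    # Hypothesis: this simple function explains what the MLP is doing.
    return 3.0 * relu(x)

def residual(x):
    # Whatever the hypothesis fails to explain.
    return mlp(x) - f(x)

def rewritten(x, residual_input=None):
    """MLP(x) = f(x) + (MLP(x) - f(x)), with the residual term optionally
    computed on a different (resampled) input to test that it's unimportant."""
    r_in = x if residual_input is None else residual_input
    return f(x) + residual(r_in)

exact = rewritten(0.5)                           # equivalent to mlp(0.5)
scrubbed = rewritten(0.5, residual_input=-2.0)   # residual term resampled
```

If the hypothesis were bad (say `f` only captured half of the dominant term), resampling the residual would change the output a lot, and the claim that it's unimportant would fail.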

Really excited to see this come out! I'm generally very excited to see work trying to make mechanistic interpretability more rigorous/coherent/paradigmatic, and think causal scrubbing is a pretty cool idea, though have some concerns that it sets the bar too high for something being a legit circuit. The part that feels most conceptually elegant to me is the idea that an interpretability hypothesis allows certain inputs to be equivalent for getting a certain answer (and the null hypothesis says that no inputs are equivalent), and then the recursive algori... (read more)

4Ansh Radhakrishnan4mo
I'd like to flag that this has been pretty easy to do - for instance, this process can look like resample ablating different nodes of the computational graph (eg each attention head/MLP), finding the nodes that when ablated most impact the model's performance and are hence important, and then recursively searching for nodes that are relevant to the current set of important nodes by ablating nodes upstream to each important node.
2Kshitij Sachan4mo
Nice summary! One small nitpick: > In the features framing, we don’t necessarily assume that features are aligned with circuit components (eg, they could be arbitrary directions in neuron space), while in the subgraph framing we focus on components and don’t need to show that the components correspond to features This feels slightly misleading. In practice, we often do claim that sub-components correspond to features. We can "rewrite" our model into an equivalent form that better reflects the computation it's performing. For example, if we claim that a certain direction in an MLP's output is important, we could rewrite the single MLP node as the sum of the MLP output in the direction + the residual term. Then, we could make claims about the direction we pointed out and also claim that the residual term is unimportant. The important point is that we are allowed to rewrite our model however we want as long as the rewrite is equivalent.

Before we even start a training run, we should try to have *actually good* abstract arguments about alignment properties of the AI. Interpretability work is easier if you're just trying to check details relevant to those arguments, rather than trying to figure out the whole AI.

Thanks for the post! I particularly appreciated this point

Thanks a lot for writing up this post! This felt much clearer and more compelling to me than the earlier versions I'd heard, and I broadly buy that this is a lot of what was going on with the phase transitions in my grokking work.

The algebra in the rank-1 learning section was pretty dense and not how I would have phrased it, so here's my attempt to put it in my own language:

We want to fit to some fixed rank 1 matrix abᵀ, with two learned vectors x, y, forming xyᵀ. Our objective function is L = ‖abᵀ − xyᵀ‖². Rank one matrix facts - ... (read more)

2Lawrence Chan4mo
(Adam Jermyn ninja'ed my rank 2 results as I forgot to refresh, lol) Weight decay just means the gradient becomes −∇_x L = 2(⟨b,y⟩a − ⟨y,y⟩x) − λx, which effectively "extends" the exponential phase.  It's pretty easy to confirm that this is the case: You can see the other figures from the main post here: []  (Lighter color shows loss curve for each of 10 random seeds.) Here's my code for the weight decay experiments if anyone wants to play with them or check that I didn't mess something up: []
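
For anyone who wants to sanity check that gradient expression, here is a small finite-difference sketch. It assumes the objective is the squared Frobenius error ‖abᵀ − xyᵀ‖² plus a weight decay term (λ/2)‖x‖², which is the form that reproduces the expression above. This is not the linked experiment code, and the vectors are arbitrary illustrative values.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def loss(a, b, x, y, lam):
    # Assumed objective: ||a b^T - x y^T||_F^2 + (lam / 2) * ||x||^2
    fit = sum((ai * bj - xi * yj) ** 2
              for ai, xi in zip(a, x)
              for bj, yj in zip(b, y))
    return fit + 0.5 * lam * dot(x, x)

def neg_grad_x(a, b, x, y, lam):
    # Closed form: -grad_x L = 2(<b,y> a - <y,y> x) - lam * x
    by, yy = dot(b, y), dot(y, y)
    return [2.0 * (by * ai - yy * xi) - lam * xi for ai, xi in zip(a, x)]

def numeric_neg_grad_x(a, b, x, y, lam, eps=1e-6):
    # Central finite differences on each coordinate of x.
    grads = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        diff = loss(a, b, up, y, lam) - loss(a, b, down, y, lam)
        grads.append(-diff / (2 * eps))
    return grads

# Arbitrary illustrative values.
a, b = [1.0, -2.0, 0.5], [0.3, 1.5]
x, y = [0.2, 0.1, -0.4], [0.7, -0.6]
lam = 0.1

analytic = neg_grad_x(a, b, x, y, lam)
numeric = numeric_neg_grad_x(a, b, x, y, lam)
```

The two agree up to floating point error, and setting `lam = 0.0` recovers the gradient without weight decay.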
6Adam Jermyn4mo
I agree with both of your rephrasings and I think both add useful intuition! Regarding rank 2, I don't see any difference in behavior from rank 1 other than the "bump" in alignment that Lawrence mentioned. Here's an example: This doesn't happen in all rank-2 cases but is relatively common. I think usually each vector grows primarily towards 1 or the other target. If two vectors grow towards the same target then you get this bump where one of them has to back off and align more towards a different target [at least that's my current understanding, see my reply to Lawrence for more detail!]. What does a cross-entropy setup look like here? I'm just not sure how to map this toy model onto that loss (or vice-versa). Agreed! I expect weight decay to (1) make the converged solution not actually minimize the original loss (because the weight decay keeps tugging it towards lower norms) and (2) accelerate the initial decay. I don't think I expect any other changes. I'm not sure! Do you have a setup in mind? I agree this breaks my theoretical intuition. Experimentally most of the phenomenology is the same, except that the full-rank (rank 100) case regains a plateau. Here's rank 2: rank 10: (maybe there's more 'bump' formation here than with SGD?) rank 100: It kind of looks like the plateau has returned! And this replicates across every rank 100 example I tried, e.g. The plateau corresponds to a period with a lot of bump formation. If bumps really are a sign of vectors competing to represent different chunks of subspace then maybe this says that Adam produces more such competition (maybe by making different vectors learn at more similar rates?). I'd be curious if you have any intuition about this!

Thanks for sharing this! I'm excited to see more interpretability posts. (Though this felt far too high production value - more posts, shorter posts and lower effort per post plz)

If we plot the distribution of the singular vectors, we can see that the rank only slowly decreases until 64 then rapidly decreases. This is because, fundamentally, the OV matrix is only of rank 64. The singular value distribution of the meaningful ranks, however, declines slowly in log-space, giving at least some evidence towards the idea that the network is utilizing most of t

... (read more)
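
The rank point is just linear algebra: a d_model x d_model matrix that factors through a d_head-dimensional space has rank at most d_head. A tiny exact sketch with made-up matrices (d_model = 4, d_head = 2; the names `W_V` and `W_O` follow the usual convention but the values are arbitrary):

```python
from fractions import Fraction

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def rank(M):
    """Exact rank via Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in row] for row in M]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [vi - factor * vr for vi, vr in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

# Hypothetical head with d_model = 4 and d_head = 2.
W_V = [[1, 0], [0, 1], [1, 1], [2, -1]]   # d_model x d_head
W_O = [[1, 2, 0, 1], [0, 1, 1, 3]]        # d_head x d_model
OV = matmul(W_V, W_O)                     # d_model x d_model

ov_rank = rank(OV)
```

Here `OV` is a 4 x 4 matrix but `ov_rank` is 2, so in floating point you'd expect only the first d_head singular values to be meaningfully non-zero, with the rest being numerical noise.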

I'll mention here that from what little I saw, your coordination efforts seemed a bit misguided and harmful. I am left with a sense that you wanted everyone to be friends and not feel threatened by interacting with others, to feel like "come in, the water is fine, don't worry, you won't end up with people criticizing you for maybe ending civilization or self-deceiving along the way or call you unethical". While I am pretty open to fairly respectful coordination and am a strong fan of finding positive-sum trade, I care more about being frank and honest in

... (read more)
2Ben Pace4mo
Thanks for the link, I'll aim to give that podcast a listen, it's relevant to a bunch of my current thinking.

Another reason for publishing more quickly is that conversations with many interpretability researchers have led us to believe that there is a wealth of knowledge in short experiments and unpublished research that really should be shared. We’d encourage other organizations who think similarly to post frequently, and share results even if they’re not completely polished.

Extremely strong +1! There is so much low-hanging fruit in mechanistic interpretability (of the flavour "I could probably spend a full-time day working on this and find something mildly c... (read more)

I'd recommend editing a link to Ethan's comment to the top of the post - I think people could easily lead with a misleading impression otherwise

2Edouard Harris4mo
Done, a few days ago. Sorry thought I'd responded to this comment.

Oh that's sketchy af lol. Thanks!

See my other comment - it turns out to be the boring fact that there's a large constant offset in the GPT-Neo embeddings. If you subtract the mean of the GPT-Neo embed it looks normal. (Though the fact that this exists is interesting! I wonder what that direction is used for?)
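
A quick sketch of why a large constant offset does this, using random vectors rather than the actual GPT-Neo embeddings (the dimension, offset size and seed are arbitrary assumptions):

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    sims = [cosine(vectors[i], vectors[j])
            for i in range(len(vectors))
            for j in range(i + 1, len(vectors))]
    return sum(sims) / len(sims)

def centre(vectors):
    # Subtract the mean vector from every vector.
    n = len(vectors)
    mean_vec = [sum(col) / n for col in zip(*vectors)]
    return [[a - m for a, m in zip(v, mean_vec)] for v in vectors]

rng = random.Random(0)
dim, n, offset = 64, 50, 10.0

# Random "embeddings" that all share a large constant offset.
embeddings = [[offset + rng.gauss(0.0, 1.0) for _ in range(dim)]
              for _ in range(n)]

raw_similarity = mean_pairwise_cosine(embeddings)
centred_similarity = mean_pairwise_cosine(centre(embeddings))
```

With the offset in place every pair of vectors looks nearly parallel, and after subtracting the mean the average cosine similarity drops to roughly zero, which matches what you'd expect from unrelated directions.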

What do you mean about MLP0 being basically part of the embed btw? There is no MLP before the first attention layer right?

I mean that, as far as I can tell (medium confidence) attn0 in GPT-2 isn't used for much, and MLP0 contains most of the information about the value of the toke... (read more)

0Aryaman Arora4mo
Cool that you figured that out, easily explains the high cosine similarity! It does seem to me that a large constant offset to all the embeddings is interesting, since that means GPT-Neo's later layers have to do computation taking that into account, which seems not at all like an efficient decision. I will def poke around more. Interesting on MLP0 (I swear I use zero indexing lol just got momentarily confused)! Does that hold across the different GPT sizes?

Thanks for clarifying your position, that all makes sense.

I'd argue that most of the updating should already have been done, not even based on Chris Olah's work, but on neuroscientists working out things like the toad's prey-detection circuits.

Huh, can you say more about this? I'm not familiar with that example (though have a fairly strong prior on there being at best a weak association between specific neuroscience results + specific AI interp results)

1Charlie Steiner4mo
I'm thinking about the paper Ewert 1987 [], which I know about because it spurred Dennett's great essay Eliminate the Middletoad [], but I don't really know the gory details of, sorry. I agree the analogy is weak, and there can be disanalogies even between different ANN architectures. I think my intuition is based more on some general factor of "human science being able to find something interesting in situations kinda like this," which is less dependent on facts of the systems themselves and more about, like, do we have a paradigm for interpreting signals in a big mysterious network?