All of Neel Nanda's Comments + Replies

I think a cool mechanistic interpretability project could be investigating why this happens! It's generally a lot easier to work with small models, how strong was the effect for the 7B models you studied? (I found the appendix figures hard to parse) Do you think there's a 7B model where this would be interesting to study? I'd love takes for concrete interpretability questions you think might be interesting here

1Owain Evans3h
My guess is that the ~7B Llama-2 models would be fine for this but @JanBrauner might be able to offer more nuance. 

Cool work! I really like the ACDC on the parenthesis feature part, I'd love to see more work like that, and work digging into exactly how things compose with each other in terms of the weights.

2Logan Riggs Smith9d
I've had trouble figuring out a weight-based approach due to the non-linearity and would appreciate your thoughts actually. We can learn a dictionary of features at the residual stream (R_d) & another mid-MLP (MLP_d), but you can't straightfowardly multiply the features from R_d with W_in, and find the matching features in MLP_d due to the nonlinearity, AFAIK. I do think you could find Residual features that are sufficient to activate the MLP features[1], but not all linear combinations from just the weights. Using a dataset-based method, you could find causal features in practice (the ACDC portion of the paper was a first attempt at that), and would be interested in an activation*gradient method here (though I'm largely ignorant).    1. ^ Specifically, I think you should scale the residual stream activations by their in-distribution max-activating examples.

Trustees serve one-year terms and future Trustees will be elected by a vote of the Trustees

One year is shockingly short, why such fast turnaround?

And great post I'm excited to see responsible scaling policies becoming a thing!

One year is actually the typical term length for board-style positions, but because members can be re-elected their tenure is often much longer. In this specific case of course it's now up to the trustees!

That might work, though you could easily end up with the final model not actually faithfully using its world model to make the correct moves - if there's more efficient/correct heuristics, there's no guarantee it'll use the expensive world model, or not just forget about it.

I would expect it to not work in the limit. All the models must converge on the same optimal solution for a deterministic perfect-information game like Othello and become value-equivalent, ignoring the full board state which is irrelevant to reward-maximizing. (You don't need to model edge-cases or weird scenarios which don't ever come up while pursuing the optimal policy, and the optimal 'world-model' can be arbitrarily tinier and unfaithful to the full true world dynamics.*) Simply hardwiring a world model doesn't change this, any more than feeding in the exact board state as an input would lead to it caring about or paying attention to the irrelevant parts of the board state. As far as the RL agent is concerned, knowledge of irrelevant board state is a wasteful bug to be worked around or eliminated, no matter where this knowledge comes from or is injected. * I'm sure Nanda knows this but for those whom this isn't obvious or haven't seen other discussions on this point (some related to the 'simulators' debate): a DRL agent only wants to maximize reward, and only wants to model the world to the extent that maximizes reward. For a complicated world or incomplete maximization, this may induce a very rich world-model inside the agent, but the final converged optimal agent may have an arbitrarily impoverished world model. In this case, imagine a version of Othello where at the first turn, the agent may press a button labeled 'win'. Obviously, the optimal agent will learn nothing at all beyond learning 'push the button on the first move' and won't learn any world-model at all of Othello! No matter how rich and fascinating the rest of the game may be, the optimal agent neither knows nor cares.

Great post! I think a really cool research direction in mech interp would be looking for alignment relevant circuits in a misaligned model - it seems like the kind of concrete thing we could do mech interp on today (if we had such a model), and like it would teach us a ton about what to look for when eg auditing a potentially misaligned model. I'd love to hear about any progress you make, and possible room for collaboration.

Oh that's fascinating, thanks for sharing! In the model I was studying I found that intervening on the token direction mattered a lot for ending lines after 80 characters. Maybe there are multiple directions...? Very weird!

Cool work! I've been interested in seeing a mech interp project trying to find the circuits behind sycophancy, it seems like a good microcosm for social modelling circuitry which seems a first step towards deception circuitry. How good is LLaMA 7B at being sycophantic? And do you have any thoughts on what might be good prompts for understanding sycophancy circuitry? I'm particularly interested in prompts that are modular, with key words that can be varied to change it from one valence to another while keeping the rest of the prompt intact.

Huh, that's very useful context, thanks! Seems like pretty sad behaviour.

Great questions, thanks!

Background: You don't need to know anything beyond "a language model is a stack of matrix multiplications and non-linearities. The input is a series of tokens (words and sub-words) which get converted to vectors by a massive lookup table called the embedding (the vectors are called token embeddings). These vectors have really high cosine sim in GPT-Neo".

Re how long it took for scholars, hmm, maybe an hour? Not sure, I expect it varied a ton. I gave this in their first or second week, I think.

Thanks for writing this retrospective! I appreciate the reflections

Ah, thanks! As I noted at the top, this was an excerpt from that post, which I thought was interesting as a stand alone, but I didn't realise how much of that section did depend on attribution patching knowledge

1Ryan Greenblatt4mo
Oops, somehow I missed that context. Thanks for the clarification.

Re modus tollens, looking at the data, it seems like the correct answer is always yes. This admits far more trivial solutions than really understanding the task (ie always say yes, vs always say no). Has anyone checked it for varying the value of the answer?

Ah, thanks for the clarification! That makes way more sense. I was confused because you mentioned this in a recent conversation, I excitedly read the paper, and then couldn't see what the fuss was about (your post prompted me to re-read and notice section 4.1, the good section!).

Another thought: The main thing I find exciting about model editing is when it is surgical - it's easy to use gradient descent to find ways to intervene on a model, while breaking performance everywhere else. But if you can really localise where a concept is represented in the model and apply it there, that feels really exciting to me! Thus I find this work notably more exciting (because it edits a single latent variable) than ROME/MEMIT (which apply gradient descent)

Thanks for sharing! I think the paper is cool (though massively buries the lede). My summary:

  • They create a synthetic dataset for lit and unlit rooms with styleGAN. They exploit the fact that the GAN has disentangled and meaningful directions in its latent space, that can be individually edited. They find a lighting latent automatically, by taking noise that produces rooms, editing each latent in turn and looking for big changes specifically on light pixels
    • StyleGAN does not have a text input, and there's no mention of prompting (as far as I can tell - I
... (read more)
2Sam Marks5mo
Yeah, sorry, I should have made clear that the story that I tell in the post is not contained in the linked paper. Rather, it's a story that David Bau sometimes tells during talks, and which I wish were wider-known. As you note, the paper is about the problem of taking specific images and relighting them (not of generating any image at all of an indoor scene with unlit lamps), and the paper doesn't say anything about prompt-conditioned models. As I understand things, in the course of working on the linked project, Bau's group noticed that they couldn't get scenes with unlit lamps out of the popular prompt-conditioned generative image models.

To me (2) seems fairly clearly false - at the very least it's not doing anything about inner alignment (debate on weights/activations does nothing to address this, since there's still no [debaters are aiming to win the game] starting point).

Why do you believe this? It's fairly plausible to me that "train an AI to use interpretability tools to show that this other AI is being deceptive" is the kind of scalable oversight approach that might work, especially for detecting inner misalignment, if you can get the training right and avoid cooperation. But that seems like a plausibly solvable problem to me

The problem is robustly getting the incentive to show that the other AI is being deceptive. Giving access to the weights, activations and tools may give debaters the capability to expose deception - but that alone gets you nothing. You're still left saying: So long as we can get the AI to robustly do what we want (i.e. do its best to expose deception), we can get the AI to robustly do what we want. Similarly, "...and avoid cooperation" is essentially the entire problem. To be clear, I'm not saying that an approach of this kind will never catch any instances of an AI being deceptive. (this is one reason I'm less certain on (1)) I'm am saying that there's no reason to predict anything along these lines should catch all such instances. I see no reason to think it'll scale. Another issue: unless you have some kind of true name of deception (I see no reason to expect this exists), you'll train an AI to detect [things that fit your definition of deception], and we die to things that didn't fit your definition.

Maybe in contrast to other fields of ML? (Though that's definitely stopped being true for eg LLMs)

Case studies: finding algorithms inside networks that implement specific capabilities. My favorite papers here are Olsson et al. (2022)Nanda et al. (2023)Wang et al. (2022) and Li et al. (2022); I’m excited to see more work which builds on the last in particular to find world-models and internally-represented goals within networks.

If you want to build on Li et al (the Othello paper), my follow-up work is likely to be a useful starting point, and then the post I wrote about future directions I'm particularly excited about

Some recommended ways to upskill at empirical research (roughly in order):

For people specifically interested in getting into mechanistic interpretability, my guide to getting started may be useful - it's much more focused on the key, relevant parts of deep learning, with a bunch more interpretability specific stuff

Eventually, once you've had a bunch of experience, you might notice a feeling of confusion or frustration: why is everyone else missing the point, or doing so badly at this? (Though note that a few top researchers commented on a draft to say that they didn't have this experience.) For some people that involves investigating a specific topic (for me, the question “what’s the best argument that AGI will be misaligned?“); for others it's about applying skills like conscientiousness (e.g. "why can't others just go through all the obvious steps?") Being excelle

... (read more)

How are OpenAI training these tokenizers?! I'm surprised they still have weird esoteric tokens like these in there, when presumably there's eg a bunch of words that are worth learning

Interesting, thanks for the context. I buy that this could be bad, but I'm surprised that you see little upside - the obvious upside esp for great work like transformer circuits is getting lots of researchers nerdsniped and producing excellent and alignment relevant interp work. Which seems huge if it works

I want to say that I agree the transformer circuits work is great, and that I like it, and am glad I had the opportunity to read it! I still expect it was pretty harmful to publish.

Nerdsniping goes both ways: you also inspire things like the Hyena work trying to improve architectures based on components of what transformers can do.

I think indiscriminate hype and trying to do work that will be broadly attention-grabbing falls on the wrong side, likely doing net harm. Because capabilities improvements seem empirically easier than understanding them, and ther... (read more)

And I think I broadly see people overestimating the benefits publishing their work relative to keeping it within a local cluster.

I'm surprised by this claim, can you say more? My read is weakly that people in interp under publish to wider audiences (eg getting papers into conferences), though maybe that people overpublish blog posts? (Or that I try too hard to make things go viral on Twitter lol)

3James Payor5mo
I'm perhaps misusing "publish" here, to refer to "putting stuff on the internet" and "raising awareness of the work through company Twitter" and etc. I mostly meant to say that, as I see it, too many things that shouldn't be published are being published, and the net effect looks plausibly terrible with little upside (though not much has happened yet in either direction). The transformer circuits work strikes me this way, so does a bunch of others. Also, I'm grateful to know your read! I'm broadly interested to hear this and other raw viewpoints, to get a sense of how things look to other people.

Thanks, that looks really useful! Do you have GPU price performance numbers for lower precision training? Models like Chinchilla were trained in bf16, so that seems a more relevant number.

1Jaime Sevilla5mo
Thanks Neel! The difference between tf16 and FP32 comes to a x15 factor IIRC. Though also ML developers seem to prioritise other characteristics than cost effectiveness when choosing GPUs like raw performance and interconnect, so you can't just multiply the top price performance we showcase by this factor and expect that to match the cost performance of the largest ML runs today. More soon-ish.

Thanks! I also feel more optimistic now about speed research :) (I've tried similar experiments since, but with much less success - there's a bunch of contingent factors around not properly hitting flow and not properly clearing time for it though). I'd be excited to hear what happens if you try it! Though I should clarify that writing up the results took a month of random spare non-work time...

Re models can be deeply understood, yes, I think you raise a valid and plausible concern and I agree that my work is not notable evidence against. Though also, idk ... (read more)

Er, hmm. To me this feels like a pretty uncontroversial claim when discussing a small model on an algorithmic task like this. (Note that the model is literally trained on uniform random legal moves, it's not trained on actual Othello game transcripts). Though I would agree that eg "literally all that GPT-4 cares about is predicting the next token" is a dubious claim (even ignoring RLHF). It just seems like Othello-GPT is so small, and trained on such a clean and crisp task that I can't see it caring about anything else? Though the word care isn't really we... (read more)

3Alex Turner5mo
This statement seems fine, yeah!  (Rereading my initial comment, I regret that it has a confrontational tone where I didn't intend one. I wanted to matter-of-factly state my concern, but I think I should have prefaced with something like "by the way, not a huge deal overall, but I think your statement here isn't known to be strictly true." Edited.)

I previously had considered that any given corpus could have been generated by a large number of possible worlds, but I now don't weight this objection as highly.

Interesting, I hadn't seen that objection before! Can you say more? (Though maybe not if you aren't as convinced by it any more). To me, it'd be that there's many worlds but they all share some commonalities and those commonalities are modelled. Or possibly that the model separately simulates the different worlds.

2Alex Turner5mo
So, first, there's an issue where the model isn't "remembering" having "seen" all of the text. It was updated by gradients taken over its outputs on the historical corpus. So there's a subtlety, such that "which worlds are consistent with observations" is a wrongly-shaped claim. (I don't think you fell prey to that mistake in OP, to be clear.) Second, on my loose understanding of metaphysics (ie this is reasoning which could very easily be misguided), there exist computable universes which contain entities training this language model given this corpus / set of historical signals, such that this entire setup is specified by the initial state of the laws of physics. In that case, the corpus and its regularities ("dogs" and "syntax" and such) wouldn't necessarily reflect the world the agent was embedded in, which could be anything, really. Like maybe there's an alien species on a gas giant somewhere which is training on fictional sequences of tokens, some of which happen to look like "dog". Of course, by point (1), what matters isn't the corpus itself (ie what sentences appear) but how that corpus imprints itself into the network via the gradients. And your post seems like evidence that even a relatively underspecified corpus (sequences of legal Othello moves) appears to imprint itself into the network, such that the network has a world model of the data generator (i.e. how the game works in real life).   Does this make sense? I have some sense of having incommunicated poorly here, but hopefully this is better than leaving your comment unanswered.

The frequently accompanying, action-relevant claim -- that substantially easier-to-interpret alternatives exist -- is probably false and distracts people with fake options. That's my main thesis.

I agree with this claim (anything inherently interpretable in the conventional seems totally doomed). I do want to push back on an implicit vibe of "these models are hard to interpret because of the domain, not because of the structure" though - interpretability is really fucking hard! It's possible, but these models are weird and cursed and rife with bullshit l... (read more)

Lol. This is a surprisingly decent summary, and the weaknesses are correctly identified things I did not try to cover

I tried to be explicit in the post that I don't personally care all that much about the world model angle - Othello-GPT clearly does form a world model, it's very clear evidence that this is possible. Whether it happens in practice is a whole other question, but it clearly does happen a bit.

They are still statistical next token predictors, it's just the statistics are so complicated it essentially becomes a world model. The divide between these concepts is artificial.

I think this undersells it. World models are fundamentally different from surface leve... (read more)

Thanks! Yes, your description of zero ablation is correct. I think positive or negative is a matter of convention? To me "positive = is important" and "negative = damaging" is the intuitive way round,which is why I set it up the way I did.

And yeah, I would be excited to see this applied to mean ablation!

Thanks for noting the bugs, I should really freeze the demos on a specific version of the library...

1Xander Davies6mo
Makes sense! Depends on if you're thinking about the values as "estimating zero ablation" or "estimating importance."

Er, maybe if we get really good at doing patching-style techniques? But there's definitely not an obvious path - I more see lie detectors as one of the ultimate goals of mech interp, but whether this is actually possible or practical is yet to be determined.

Oh, ugh, Typeguard was updated to v3 and this broke things. And the circuitsvis import was a mistake. Should be fixed now, thanks for flagging!

1Charlie Steiner6mo
Thanks for the cool notebook!

I really like this idea! Making advance predictions feels like a much more productive way to engage with other people's work (modulo trusting you to have correctly figured out the answers)

Predictions below (note that I've chatted with the team about their results a bit, and so may be a bit spoiled - I'll try to simulate what I would have predicted without spoilers)

Behavioral Describe how the trained policy might generalize from the 5x5 top-right cheese region, to cheese spawned throughout the maze? IE what will the policy do when cheese is spawned elsewh

... (read more)

Great question! My concrete suggestion is to look for interesting neurons in Neuroscope, as I discuss more in the final post. This is a website I made that shows the text that most activates each neuron in the model (for a ton of open source models), and by looking for interesting neurons, you can hopefully find some hook - find a specific task the model can consistently-ish do, analogous to IOI (with a predictable structure you can generate prompts for, ideally with a somewhat algorithmic flavour - something you could write code to solve). And then do the... (read more)

Really nice post! I think this is an important point that I've personally been confused about in the past, and this is a great articulation (and solid work for 2 hours!!)

3Lawrence Chan7mo
Thanks! (As an amusing side note: I spent 20+ minutes after finishing the writeup trying to get the image from the recent 4-layer docstring circuit post to preview properly the footnotes, and eventually gave up. That is, a full ~15% of the total time invested was spent on that footnote!)

Thanks for this post! I'm not sure how much I expect this to matter in practice, but I think that the underlying point of "sometimes the data distribution matters a lot, and ignoring it is suspect" seems sound and well made.

I personal think it's clear that 1L attn-only models are not literally just doing skip trigrams. A quick brainstorm of other things I presume they're doing:

  • Skip trigrams with positional decay - it's easy enough to add a negative term to the attention scores that gets bigger the further away the source token is. For skip trigrams like
... (read more)

I don't know if they'd put it like this, but IMO solving/understanding superposition is an important part of being able to really grapple with circuits in language models, and this is why it's a focus of the Anthropic interp team

2Lawrence Chan8mo
At least based on my convos with them, the Anthropic team does seem like a clear example of this, at least insofar as you think understanding circuits in real models with more than one MLP layer in them is important for interp -- superposition just stops you from using the standard features as directions approach almost entirely!

Oh wait, that FAQ is actually nothing to do with GPT-3. That's about their embedding models, which map sequences of tokens to a single vector, and they're saying that those are normalised. Which is nothing to do with the map from tokens to residual stream vectors in GPT-3, even though that also happens to be called an embedding

but a quick inspection of the embeddings available through the huggingface model shows this isn't the case

That's GPT-2 though, right? I interpret that Q&A claim as saying that GPT-3 does the normalisation, I agree that GPT-2 definitely doesn't. But idk, doesn't really matter

For prompt generation, we normalise the embeddings ourselves and constrain the search to that space, which results in better performance.

Interesting, what exactly do you mean by normalise? GPT-2 presumably breaks if you just outright normalise, since different tokens have very different norms

TLDR: The model ignores weird tokens when learning the embedding, and never predicts them in the output. In GPT-3 this means the model breaks a bit when a weird token is in the input, and will refuse to ever output it because it's hard coded the frequency statistics, and it's "repeat this token" circuits don't work on tokens it never needed to learn it for. In GPT-2, unlike GPT-3, embeddings are tied, meaningW_U = W_E.T, which explains much of the weird shit you see, because this is actually behaviour in the unembedding not the embedding (weird tokens neve... (read more)

GPT-J uses the GPT-2 tokenizer and has untied embeddings.

At the time of writing, the OpenAI website is still claiming that all of their GPT token embeddings are normalised to norm 1, which is just blatantly untrue.

Why do you think this is blatantly untrue? I don't see how the results in this post falsify that hypothesis

1Jessica Rumbelow8mo
This link: says that token embeddings are normalised to length 1, but a quick inspection of the embeddings available through the huggingface model shows this isn't the case. I think that's the extent of our claim. For prompt generation, we normalise the embeddings ourselves and constrain the search to that space, which results in better performance. 

I appreciate this post, and vibe a lot!

Different jobs require different skills.

Very strongly agreed, I did 3 different AI Safety internships in different areas, where I think I was fairly mediocre in each, before I found that mech interp was a good fit.

Also strongly agreed on the self-evaluation point, I'm still not sure I really internally believe that I'm good at mech interp, despite having pretty solid confirmation from my research output at this point - I can't really imagine having it before completing my first real project!

Thanks! I'd be excited to hear from anyone who ends up actually working on these :)

I threw together a rough demo of converting Tracr to PyTorch (to a mech interp library I'm writing called TransformerLens), and hacked it to be compatible with Python 3.8 - hope this makes it easier for other people to play with it! (All bugs are my fault)

Ah, thanks! Haven't looked at this point in a while, updated it a bit. I've since made my own transformer tutorial which (in my extremely biased opinion) is better esp for interpretability. It comes with a template notebook to fill out alongside part 2, (with tests!) and by the end you'll have implemented your own GPT-2.

More generally, my getting started in mech interp guide is a better place to start than this guide, and has more on transformers!

Super interesting, thanks! I hadn't come across that work before, and that's a cute and elegant definition.

To me, it's natural to extend this to specific substrings in the document? I believe that models are trained with documents chopped up and concatenated to fit into segment that fully fit the context window, so it feels odd to talk about document as the unit of analysis. And in some sense a 1000 token document is actually 1000 sub-tasks of predicting token k given the prefix up to token k-1, each of which can be memorised.

Maybe we should just not apply a gradient update to the tokens in the repeated substring? But keep the document in and measure loss on the rest.

Er, I'm bad at time estimates at the best of times. And this is a particularly hard case, because it's going to depend wildly on someone's prior knowledge and skillset and you can choose how deep to go, even before accounting for general speed and level of perfectionism. Here are some rough guesses:

ML pre-reqs 10-40h Transformer implementation 10-20h Mech Interp Tooling 10-20h Learning about MI Field 5-20h

But I am extremely uncertain about these. And I would rather not put these into the main post, since it's easy to be misleading and easy to cause people ... (read more)

Load More