Thanks! Yes, your description of zero ablation is correct. I think positive or negative is a matter of convention? To me "positive = is important" and "negative = damaging" is the intuitive way round, which is why I set it up the way I did.
And yeah, I would be excited to see this applied to mean ablation!
Thanks for noting the bugs, I should really freeze the demos on a specific version of the library...
Er, maybe if we get really good at doing patching-style techniques? But there's definitely not an obvious path - I more see lie detectors as one of the ultimate goals of mech interp, but whether this is actually possible or practical is yet to be determined.
Oh, ugh, Typeguard was updated to v3 and this broke things. And the circuitsvis import was a mistake. Should be fixed now, thanks for flagging!
I really like this idea! Making advance predictions feels like a much more productive way to engage with other people's work (modulo trusting you to have correctly figured out the answers)
Predictions below (note that I've chatted with the team about their results a bit, and so may be a bit spoiled - I'll try to simulate what I would have predicted without spoilers)
Behavioral: Describe how the trained policy might generalize from the 5x5 top-right cheese region to cheese spawned throughout the maze? IE what will the policy do when cheese is spawned elsewhere?
Great question! My concrete suggestion is to look for interesting neurons in Neuroscope, as I discuss more in the final post. This is a website I made that shows the text that most activates each neuron in the model (for a ton of open source models), and by looking for interesting neurons, you can hopefully find some hook - find a specific task the model can consistently-ish do, analogous to IOI (with a predictable structure you can generate prompts for, ideally with a somewhat algorithmic flavour - something you could write code to solve). And then do the...
Really nice post! I think this is an important point that I've personally been confused about in the past, and this is a great articulation (and solid work for 2 hours!!)
Thanks for this post! I'm not sure how much I expect this to matter in practice, but I think that the underlying point of "sometimes the data distribution matters a lot, and ignoring it is suspect" seems sound and well made.
I personally think it's clear that 1L attn-only models are not literally just doing skip trigrams. A quick brainstorm of other things I presume they're doing:
I don't know if they'd put it like this, but IMO solving/understanding superposition is an important part of being able to really grapple with circuits in language models, and this is why it's a focus of the Anthropic interp team
Oh wait, that FAQ is actually nothing to do with GPT-3. That's about their embedding models, which map sequences of tokens to a single vector, and they're saying that those are normalised. Which is nothing to do with the map from tokens to residual stream vectors in GPT-3, even though that also happens to be called an embedding
but a quick inspection of the embeddings available through the huggingface model shows this isn't the case
That's GPT-2 though, right? I interpret that Q&A claim as saying that GPT-3 does the normalisation, I agree that GPT-2 definitely doesn't. But idk, doesn't really matter
For prompt generation, we normalise the embeddings ourselves and constrain the search to that space, which results in better performance.
Interesting, what exactly do you mean by normalise? GPT-2 presumably breaks if you just outright normalise, since different tokens have very different norms
TLDR: The model ignores weird tokens when learning the embedding, and never predicts them in the output. In GPT-3 this means the model breaks a bit when a weird token is in the input, and will refuse to ever output it because it has hard-coded the frequency statistics, and its "repeat this token" circuits don't work on tokens it never needed to learn them for. In GPT-2, unlike GPT-3, embeddings are tied, meaning
W_U = W_E.T, which explains much of the weird shit you see, because this is actually behaviour in the unembedding, not the embedding (weird tokens neve...
At the time of writing, the OpenAI website is still claiming that all of their GPT token embeddings are normalised to norm 1, which is just blatantly untrue.
Why do you think this is blatantly untrue? I don't see how the results in this post falsify that hypothesis
I appreciate this post, and vibe a lot!
Different jobs require different skills.
Very strongly agreed. I did 3 different AI Safety internships in different areas, in each of which I think I was fairly mediocre, before I found that mech interp was a good fit.
Also strongly agreed on the self-evaluation point, I'm still not sure I really internally believe that I'm good at mech interp, despite having pretty solid confirmation from my research output at this point - I can't really imagine having it before completing my first real project!
Thanks! I'd be excited to hear from anyone who ends up actually working on these :)
I threw together a rough demo of converting Tracr to PyTorch (to a mech interp library I'm writing called TransformerLens), and hacked it to be compatible with Python 3.8 - hope this makes it easier for other people to play with it! (All bugs are my fault)
Ah, thanks! Haven't looked at this post in a while, updated it a bit. I've since made my own transformer tutorial which (in my extremely biased opinion) is better, esp for interpretability. It comes with a template notebook to fill out alongside part 2 (with tests!), and by the end you'll have implemented your own GPT-2.
More generally, my getting started in mech interp guide is a better place to start than this guide, and has more on transformers!
Super interesting, thanks! I hadn't come across that work before, and that's a cute and elegant definition.
To me, it's natural to extend this to specific substrings in the document? I believe that models are trained with documents chopped up and concatenated into segments that fully fill the context window, so it feels odd to treat the document as the unit of analysis. And in some sense a 1000 token document is actually 1000 sub-tasks of predicting token k given the prefix up to token k-1, each of which can be memorised.
Maybe we should just not apply a gradient update to the tokens in the repeated substring? But keep the document in and measure loss on the rest.
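A minimal sketch of the kind of thing I have in mind (pure numpy, with made-up names and values - in a real training loop you'd compute per-token cross-entropy with no reduction, then mask before averaging and backpropagating):

```python
import numpy as np

def loss_excluding_repeats(per_token_loss, repeat_mask):
    """Mean loss over positions NOT flagged as part of a repeated substring.

    per_token_loss: (seq_len,) array of next-token losses
    repeat_mask:    (seq_len,) bool array, True on repeated-substring tokens

    The repeated tokens stay in the context (they still condition later
    predictions), but contribute no training signal themselves.
    """
    keep = ~repeat_mask
    return per_token_loss[keep].mean()

per_token_loss = np.array([2.0, 1.0, 0.1, 0.1, 3.0])
repeat_mask = np.array([False, False, True, True, False])
print(loss_excluding_repeats(per_token_loss, repeat_mask))  # → 2.0
```

The point being that the document stays in the batch, so the model still sees the repeated text as context, but the easy "memorised" positions don't drag the measured loss down or drive gradient updates.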
Er, I'm bad at time estimates at the best of times. And this is a particularly hard case, because it's going to depend wildly on someone's prior knowledge and skillset and you can choose how deep to go, even before accounting for general speed and level of perfectionism. Here are some rough guesses:
- ML pre-reqs: 10-40h
- Transformer implementation: 10-20h
- Mech Interp Tooling: 10-20h
- Learning about the MI Field: 5-20h
But I am extremely uncertain about these. And I would rather not put these into the main post, since it's easy to be misleading and easy to cause people ...
Interesting context, thanks for writing it up!
But language models seem like they morally should memorize some data points. Language models should recite the US constitution and Shakespeare and the Bible
I'm curious how you'd define memorisation? To me, I'd actually count this as the model learning features - a bunch of examples will contain the Bible verse as a substring, and so there's a non-trivial probability that any input contains it, so this is a genuine property of the data distribution. It feels analogous to the model learning bigrams or trigram...
An operational definition which I find helpful for thinking about memorization is Zhang et al's counterfactual memorization.
The counterfactual memorization of a document x is (roughly) the amount that the model's loss on x degrades when you remove x from its training dataset.
More precisely, it's the difference in expected loss on x between models trained on data distribution samples that happen to include x, and models trained on data distribution samples that happen not to include x.
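In symbols (my paraphrase of the Zhang et al definition, using x for the document, S for a sampled training set, f_S for the model trained on S, and L for the loss):

```latex
\mathrm{mem}(x) \;=\; \mathbb{E}_{S \,:\, x \notin S}\big[\mathcal{L}(f_S, x)\big]
\;-\; \mathbb{E}_{S \,:\, x \in S}\big[\mathcal{L}(f_S, x)\big]
```

So memorization is high exactly when models that never saw x do much worse on it than models that did.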
This will be lower for ...
I like the "explain your ideas to other people" point, it seems like an important caveat/improvement to the "have good collaborators" strategy I describe above
Importantly, the bar for "good person to explain ideas to" is much lower than the bar for "is a good collaborator". Finding good collaborators is hard!
Thanks for writing this post! (And man, if this is you deliberately writing fast and below your standards, you should lower your standards way more!). I very strongly agree with this within mechanistic interpretability and within pure maths (and it seems probably true in ML and in life generally, but those are the two areas I feel vaguely qualified to comment on).
Aversion to Schlepping
Man, I strongly relate to this one... There have been multiple instances of me having an experiment idea I put off for days to weeks, only to do it in 1-3 hours and get r...
Idk, it might be related to double descent? I'm not that convinced.
Firstly, IMO, the most interesting part of deep double descent is the model-size-wise/data-wise descent, which totally doesn't apply here.
They did also find epoch-wise descent (different from data-wise, because it's trained on the same data a bunch), which is more related, but looks like test loss going down, then going up again, then going down. You could argue that grokking has test loss going up, but since it starts at uniform test loss I think this doesn't count.
My guess is that the descent part ...
Missed a period (I'm impressed I didn't miss more tbh, I find it hard to remember that you're supposed to have them at the end of paragraphs)
Lol thanks. Fixed
I like the analogy! I hadn't explicitly made the connection, but strongly agree (both that this is an important general phenomenon, and that it specifically applies here). Though I'm pretty unsure how much I/other MI researchers are in 1 vs 3 when we try to reason about systems!
To be clear, I definitely do not want to suggest that people don't try to rigorously reverse engineer systems a bunch, and be super details oriented. Linked to your comment in the post.
Thanks! That's a great explanation, I've integrated some of this wording into my MI explainer (hope that's fine!)
Cool, agreed. Maybe my main objection is just that I'd have put it last not first, but this is a nit-pick
I'm really appreciating the series of brief posts on Alignment relevant papers plus summaries!
Dumb question: You say that your toy model generation process gets correlated features. But doesn't it just get correlated feature probabilities? And that, given that you know the probabilities of feature 1 and feature 2 being present, knowing that feature 1 is actually present tells you nothing about feature 2?
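To illustrate what I mean with a toy sketch (numpy, my own made-up generative process, not necessarily yours): here the per-sample feature probabilities are perfectly correlated via a shared latent, so marginal presence is correlated, but conditional on the probabilities, the presence of one feature tells you nothing about the other:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# shared latent: makes the *probabilities* of the two features correlated
lam = rng.uniform(0.1, 0.9, size=n)
f1 = rng.random(n) < lam  # feature 1 present with probability lam
f2 = rng.random(n) < lam  # feature 2 present with probability lam

# marginally, presence is correlated (via the shared probability)...
marginal_corr = np.corrcoef(f1, f2)[0, 1]
# ...but holding the probability (roughly) fixed, presence is independent
slice_mask = (lam > 0.40) & (lam < 0.45)
conditional_corr = np.corrcoef(f1[slice_mask], f2[slice_mask])[0, 1]
print(f"marginal: {marginal_corr:.2f}, conditional: {conditional_corr:.2f}")
```

The marginal correlation is substantial while the conditional one is roughly zero, which is the distinction I'm asking about.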
Non X-risks from AI are still intrinsically important AI safety issues.
I want to push back on this - I think it's true as stated, but that emphasising it can be misleading.
Concretely, I think that there can be important near-term, non-X-risk AI problems that meet the priority bar to work on. But the standard EA mindset of importance, tractability and neglectedness still applies. And I think often near-term problems are salient and politically charged, in a way that makes these harder to evaluate.
I think these are most justified on problems with...
I strongly agree with the message in this post, but think the title is misleading. Before reading, it seemed to imply that alignment is distinct from near-term alignment concerns, while having read it, the post is specifically about how AI is used in the near term. A title like "AI Alignment is distinct from how it is used in the near-term" would sit better with me.
I'm concerned about this, because I think the long-term vs near-term safety distinctions are somewhat overrated, and really wish these communities would collaborate more and focus more on the comm...
What's the mechanism you're thinking of, through which hype does damage?
This ship may have sailed at this point, but to me the main mechanism is getting other actors to pay attention, focus on the most effective kind of capabilities work, and making it more politically feasible to raise support. Eg, I expect that the media firestorm around GPT-3 made it significantly easier to raise the capital + support within Google Brain to train PaLM. Legibly making a ton of money with it falls in a similar category to me.
Gopher is a good example of not really seein...
I appreciate this post! It feels fairly reasonable, and much closer to my opinion than (my perception of) previous MIRI posts. Points that stand out:
I'm interested in hearing other people's takes on this question! I also found that a tiny modular addition model was very clean and interpretable. My personal guess is that discrete input data lends itself to clean, logical algorithms more so than continuous input data, and that image models need to devote a lot of parameters to processing the inputs into meaningful features at all, in a way that leads to the confusion. OTOH, maybe I'm just overfitting.
Exciting! I look forward to the first "interesting circuit entirely derived by causal scrubbing" paper
Thanks! Can you give a non-linear decomposition example?
Thanks for the clarification! If I'm understanding correctly, you're saying that the important part is decomposing activations (linearly?) and that there's nothing really baked in about what a component can and cannot be. You normally focus on components, but this can also fully encompass the features as directions frame, by just saying that "the activation component in that direction" is a feature?
Really excited to see this come out! I'm in general very excited to see work trying to make mechanistic interpretability more rigorous/coherent/paradigmatic, and think causal scrubbing is a pretty cool idea, though have some concerns that it sets the bar too high for something being a legit circuit. The part that feels most conceptually elegant to me is the idea that an interpretability hypothesis allows certain inputs to be equivalent for getting a certain answer (and the null hypothesis says that no inputs are equivalent), and then the recursive algori...
Before we even start a training run, we should try to have *actually good* abstract arguments about alignment properties of the AI. Interpretability work is easier if you're just trying to check details relevant to those arguments, rather than trying to figure out the whole AI.
Thanks for the post! I particularly appreciated this point
Thanks a lot for writing up this post! This felt much clearer and more compelling to me than the earlier versions I'd heard, and I broadly buy that this is a lot of what was going on with the phase transitions in my grokking work.
The algebra in the rank-1 learning section was pretty dense and not how I would have phrased it, so here's my attempt to put it in my own language:
We want to fit to some fixed rank-1 matrix M = u v^T, with two learned vectors a, b, forming a b^T. Our objective function is L = ||M - a b^T||_F^2 (the squared Frobenius norm of the error). Rank one matrix facts - ...
Thanks for sharing this! I'm excited to see more interpretability posts. (Though this felt far too high production value - more posts, shorter posts and lower effort per post plz)
If we plot the distribution of the singular values, we can see that they decrease only slowly until rank 64, then rapidly drop off. This is because, fundamentally, the OV matrix is only of rank 64. The singular value distribution of the meaningful ranks, however, declines slowly in log-space, giving at least some evidence towards the idea that the network is utilizing most of t
I'll mention here that from what little I saw, your coordination efforts seemed a bit misguided and harmful. I am left with a sense that you wanted everyone to be friends and not feel threatened by interacting with others, to feel like "come in, the water is fine, don't worry, you won't end up with people criticizing you for maybe ending civilization or self-deceiving along the way or call you unethical". While I am pretty open to fairly respectful coordination and am a strong fan of finding positive-sum trade, I care more about being frank and honest in
Another reason for publishing more quickly is that conversations with many interpretability researchers have led us to believe that there is a wealth of knowledge in short experiments and unpublished research that really should be shared. We’d encourage other organizations who think similarly to post frequently, and share results even if they’re not completely polished.
Extremely strong +1! There is so much low-hanging fruit in mechanistic interpretability (of the flavour "I could probably spend a full-time day working on this and find something mildly c...
I'd recommend editing in a link to Ethan's comment at the top of the post - I think people could easily leave with a misleading impression otherwise
See my other comment - it turns out to be the boring fact that there's a large constant offset in the GPT-Neo embeddings. If you subtract the mean of the GPT-Neo embed it looks normal. (Though the fact that this exists is interesting! I wonder what that direction is used for?)
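A toy numpy illustration of the phenomenon (made-up dimensions and scales, not the actual GPT-Neo weights): a large shared offset makes every pair of embeddings look almost identical until you subtract the mean:

```python
import numpy as np

rng = np.random.default_rng(0)
d_vocab, d_model = 1_000, 64
# toy stand-in for an embedding matrix: isotropic noise plus a big shared offset
noise = rng.standard_normal((d_vocab, d_model))
offset = 10.0 * rng.standard_normal(d_model)
W_E = noise + offset

def mean_pairwise_cos(M):
    """Average cosine similarity between disjoint pairs of rows."""
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    return float((M[:500] * M[500:]).sum(axis=1).mean())

raw_cos = mean_pairwise_cos(W_E)                          # near 1: offset dominates
centered_cos = mean_pairwise_cos(W_E - W_E.mean(axis=0))  # near 0: looks isotropic
print(f"raw: {raw_cos:.2f}, centered: {centered_cos:.2f}")
```

In the toy version, subtracting the per-dimension mean recovers the "normal looking" geometry, analogous to what I saw after centering the GPT-Neo embed.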
What do you mean about MLP0 being basically part of the embed btw? There is no MLP before the first attention layer right?
I mean that, as far as I can tell (medium confidence) attn0 in GPT-2 isn't used for much, and MLP0 contains most of the information about the value of the toke...
Thanks for clarifying your position, that all makes sense.
I'd argue that most of the updating should already have been done already, not even based on Chris Olah's work, but on neuroscientists working out things like the toad's prey-detection circuits.
Huh, can you say more about this? I'm not familiar with that example (though have a fairly strong prior on there being at best a weak association between specific neuroscience results + specific AI interp results)
These bugs should be fixed, thanks for flagging!