Neel Nanda

Wiki Contributions


Before we even start a training run, we should try to have *actually good *abstract arguments about alignment properties of the AI. Interpretability work is easier if you're just trying to check details relevant to those arguments, rather than trying to figure out the whole AI.

Thanks for the post! I particularly appreciated this point

Thanks a lot for writing up this post! This felt much clearer and more compelling to me than the earlier versions I'd heard, and I broadly buy that this is a lot of what was going on with the phase transitions in my grokking work.

The algebra in the rank-1 learning section was pretty dense and not how I would have phrased it, so here's my attempt to put it in my own language:

We want to fit to some fixed rank 1 matrix , with two learned vectors , forming . Our objective function is . Rank one matrix facts - and .

So our loss function is now . So what's the derivative with respect to x? This is the same question as "what's the best linear approximation to how does this function change when ". Here we can just directly read this off as

The second term is an exponential decay term, assuming the size of y is constant (in practice this is probably a good enough assumption). The first term is the actual signal, moving along the correct direction, but is proportional to how well the other part is doing, which starts bad and then increases, creating the self-reinforcing properties that make it initially start slow then increase.

Another rephrasing - x consists of a component in the correct direction (a), and the rest of x is irrelevant. Ditto y. The components in the correct directions reinforce each other, and all components experience exponential-ish decay, because MSE loss wants everything not actively contributing to be small. At the start, the irrelevant components are way bigger (because they're in the rank 99 orthogonal subspace to a), and they rapidly decay, while the correct component slowly grows. This is a slight decrease in loss, but mostly a plateau. Then once the irrelevant component is small and the correct component has gotten bigger, the correct signal dominates. Eventually, the exponential decay is strong enough in the correct direction to balance out the incentive for future growth.

Generalising to higher dimensional subspaces, "correct and incorrect" component corresponds to the restriction to the subspace of the a terms, and to the complement of that, but so long as the subspace is low rank, "irrelevant component bigger so it initially dominates" still holds.

My remaining questions - I'd love to hear takes:

  • The rank 2 case feels qualitatively different from the rank 1 case because there's now a symmetry to break - will the first component of Z match the first or second component of C? Intuitively, breaking symmetries will create another S-shaped vibe, because the signal for getting close to the midpoint is high, while the signal to favour either specific component is lower.
  • What happens in a cross-entropy loss style setup, rather than MSE loss? IMO cross-entropy loss is a better analogue to real networks. Though I'm confused about the right way to model an internal sub-circuit of the model. I think the exponential decay term just isn't there?
  • How does this interact with weight decay? This seems to give an intrinsic exponential decay to everything
  • How does this interact with softmax? Intuitively, softmax feels "S-curve-ey"
  • How does this with interact with Adam? In particular, Adam gets super messy because you can't just disentangle things
    • Even worse, how does it interact with AdamW?

Thanks for sharing this! I'm excited to see more interpretability posts. (Though this felt far too high production value - more posts, shorter posts and lower effort per post plz)

If we plot the distribution of the singular vectors, we can see that the rank only slowly decreases until 64 then rapidly decreases. This is because, fundamentally, the OV matrix is only of rank 64. The singular value distribution of the meaningful ranks, however, declines slowly in log-space, giving at least some evidence towards the idea that the network is utilizing most of the 'space' available in this OV circuit head.

Quick feedback that the graph after this paragraph feels sketchy to me - obviously the singular values are zero beyond 64, and they're so far low down that all singular values above look identical. But the y axis is screwed up, so you can't really see this. What does the graph look like if you fix it? To me, it looks like there's actually some sparsity and the early singular values are far larger (looks like there's a big kink at the start, though it looks tiny because we're so zoomed out).

I also personally think that a linear scale is often more principled for a spectrum graph, but not confident in that take.

I'll mention here that from what little I saw, your coordination efforts seemed a bit misguided and harmful. I am left with a sense that you wanted everyone to be friends and not feel threatened by interacting with others, to feel like "come in, the water is fine, don't worry, you won't end up with people criticizing you for maybe ending civilization or self-deceiving along the way or call you unethical". While I am pretty open to fairly respectful coordination and am a strong fan of finding positive-sum trade, I care more about being frank and honest in my interactions, and a route must be found where communicating such things (insofar as that's what someone believes) isn't going to destroy or end the coordination/trade agreement. Speaking the truth is not something to be traded away, however costly it may be.

I can't comment on Conjecture specifically's coordination efforts, but I fairly strongly disagree with this as a philosophy of coordination. There exist a lot of people in the world who have massive empirical or ethical disagreements with me that lead to them taking actions I think are misguided to actively harmful to extremely dangerous. But I think that this often is either logical or understandable from their perspective. I think that being able to communicate productively with these people. see things from their point of view, and work towards common ground is a valuable skill, and an important part of the spirit of cooperation. For example, I think that Leah Garces's work cooperating with chicken farmers to reduce factory farming is admirable and worthwhile, and I imagine she isn't always frank and honest with people.

In particular, I think that being frank and honest in this context can basically kill possible cooperation. And good cooperation can lead to things being better by everyone's lights, so this is a large and important cost not worth taking lightly. Not everyone has to strive for cooperation, but I think it's very important that at least some people do! I do think that being so cooperative that you lose track of what you personally believe can be misguided and corrosive, but that there's a big difference between having clear internal beliefs and needing to express all of those beliefs.

Another reason for publishing more quickly is that conversations with many interpretability researchers have led us to believe that there is a wealth of knowledge in short experiments and unpublished research that really should be shared. We’d encourage other organizations who think similarly to post frequently, and share results even if they’re not completely polished.

Extremely strong +1! There is so much low-hanging fruit in mechanistic interpretability (of the flavour "I could probably spend a full-time day working on this and find something mildly cool worth writing up into a rough blog post"). I would love a wealth of these posts to exist that I could point people to and read myself! I've tried to set myself a much lower bar for this, and still mostly procrastinated on this. I would love to see this.

EDIT: This is also a comparative advantage of being an org outside academia whose employees mostly aren't aiming for a future career in academia. I gather that in standard academic incentives, being scooped on your research makes the work much less impressive and publishable and can be bad for your career, discincentivising discussing partial results, especially in public. This seems pretty crippling to having healthy and collaborative discourse, but it's also hard to fault people for following their incentives!

More generally, I really appreciate the reflective tone and candour of this post! I broadly agree re the main themes and that I don't think Conjecture has really made actions that cut at the hard core of alignment, and these reflections seem plausible to me re concrete but fixable mistakes and deeper and more difficult problems. I look forwards to seeing what you do next!

I'd recommend editing a link to Ethan's comment to the top of the post - I think people could easily lead with a misleading impression otherwise

Oh that's sketchy af lol. Thanks!

See my other comment - it turns out to be the boring fact that there's a large constant offset in the GPT-Neo embeddings. If you subtract the mean of the GPT-Neo embed it looks normal. (Though the fact that this exists is interesting! I wonder what that direction is used for?)

What do you mean about MLP0 being basically part of the embed btw? There is no MLP before the first attention layer right?

I mean that, as far as I can tell (medium confidence) attn0 in GPT-2 isn't used for much, and MLP0 contains most of the information about the value of the token at each position. Eg, ablating MLP0 completely kills performance, while ablating other MLPs doesn't. And generally the kind of tasks that I'd expect to depend on tokens depend substantially on MLP0

Thanks for clarifying your position, that all makes sense.

I'd argue that most of the updating should already have been done already, not even based on Chris Olah's work, but on neuroscientists working out things like the toad's prey-detection circuits.

Huh, can you say more about this? I'm not familiar with that example (though have a fairly strong prior on there being at best a weak association between specific neuroscience results + specific AI interp results)

Load More