Neel Nanda



To verify this claim, here we collect together activations in a) the channel dimension in InceptionV1 and b) various MLP layers in GPT2 and cluster them using HDBSCAN, a hierarchical clustering technique

Are the clusters in this section clustering the entire residual stream (ie a vector) or the projection onto a particular direction (ie a scalar)?

Interesting, thanks! Like, this lets the model somewhat localise the scaling effect, so there's not a ton of interference? This seems maybe linked to the results on Emergent Features in the residual stream

Excited to see this work come out!

One core confusion I have: Transformers apply a LayerNorm every time they read from the residual stream, which scales the vector to have unit norm (ish). If features are represented as directions, this is totally fine - it's the same feature, just rescaled. But if they're polytopes, and this scaling throws it into a different polytope, this is totally broken. And, importantly, the scaling factor is a global thing about all of the features currently represented by the model, and so is likely pretty hard to control. Shouldn't this create strong regularisation favouring using meaningful directions over meaningful polytopes?
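To make the worry concrete, here is a toy sketch (not taken from the post being discussed). It assumes a made-up two-neuron ReLU layer with biases, and models only the rescaling step of LayerNorm (no mean-centering, no learned gain or bias). All names and numbers are hypothetical. Rescaling leaves the direction untouched, but it can change which neurons fire, ie which polytope the input lands in.

```python
import math

def rescale_to_unit_norm(x):
    """Scale x to unit L2 norm (the rescaling part of LayerNorm only)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def cosine(x, y):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def activation_pattern(x, weights, biases):
    """On/off pattern of the ReLU neurons; it identifies the polytope x lies in."""
    return tuple(
        sum(w * v for w, v in zip(row, x)) + b > 0
        for row, b in zip(weights, biases)
    )

# Hypothetical toy layer: two neurons, each thresholding one coordinate at 1.
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [-1.0, -1.0]

x = [2.0, 0.5]
x_scaled = rescale_to_unit_norm(x)

print(cosine(x, x_scaled))                            # direction is preserved
print(activation_pattern(x, weights, biases))         # (True, False)
print(activation_pattern(x_scaled, weights, biases))  # (False, False)
```

The pattern only changes because of the nonzero biases: with all biases at zero, a positive rescaling leaves every pre-activation's sign unchanged.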

Interesting post! I'm pretty curious about these.

A great resource for answering these questions is a set of model runs put out by the Stanford Center for Research into Foundation Models - they trained 5 runs of GPT-2 small and GPT-2 medium with 600 checkpoints and different random seeds, and released the weights. It seems like a good way to get some surface area on these questions with interesting real models. A few ideas that are somewhere on my maybe/someday research ideas list:

  • For each pair of models, feed in a bunch of text and look at the log prob for predicting each next token, and look at the scatter plot of these - does it look highly correlated? Poke at any outliers and see if there are any consistent patterns of things one model can do and the other cannot
    • Repeat this for a checkpoint halfway through training. If you find capabilities in one model and not in another, have they converged by the end of training?
    • Look at the PCA of these per-token losses across, say, 1M tokens of text, and see if you can find anything interesting about the components
  • Evaluate the models for a bunch of behaviours - ability to use punctuation correctly, to match open and close parentheses, patterns in the syntax and structure of the data (capital letters at the start of a sentence, email addresses having an @ and a .com in them, taking text in other languages and continuing it with text of that language, etc), specific behaviour like the ability to memorise specific phrases, complete acronyms, use induction-like behaviour, basic factual knowledge about the world, etc
    • The medium models will have more interesting + sophisticated behaviour, and are probably a better place to look for specific circuits
  • Look at the per-token losses for some text over training (esp for tokens with significant deviation between final models) and see whether it looks smooth or S-shaped - S-shaped would suggest higher path dependence to me
  • Look for induction head phase changes in each model during training, and compare when they happen.
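A rough sketch of how a few of these comparisons could be scored, assuming you have already extracted per-token log probs from two seeds and a per-checkpoint loss curve for some token. The numbers below are made up for illustration, the helper names are hypothetical, and nothing here loads the actual Stanford checkpoints.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def largest_disagreements(tokens, logp_a, logp_b, k=2):
    """Return the k tokens where the two models' log probs differ most."""
    gaps = [(abs(a - b), tok) for tok, a, b in zip(tokens, logp_a, logp_b)]
    return [tok for _, tok in sorted(gaps, reverse=True)[:k]]

def sharpest_drop_fraction(losses):
    """Fraction of the total loss decrease that happens in a single checkpoint step.

    Close to 1 suggests an S-shaped (sudden) curve; small values suggest a smooth one.
    """
    drops = [earlier - later for earlier, later in zip(losses, losses[1:])]
    return max(drops) / (losses[0] - losses[-1])

# Made-up per-token log probs for two seeds on the same text.
tokens = ["the", "(", ")", "@", "Paris"]
logp_seed_a = [-0.5, -2.0, -0.3, -1.0, -4.0]
logp_seed_b = [-0.6, -2.1, -3.0, -1.1, -3.8]

print(pearson(logp_seed_a, logp_seed_b))
print(largest_disagreements(tokens, logp_seed_a, logp_seed_b))  # outliers to poke at

# Made-up loss curves for one token across five checkpoints.
smooth_curve = [4.0, 3.5, 3.0, 2.5, 2.0]
s_shaped_curve = [4.0, 3.9, 2.2, 2.1, 2.0]
print(sharpest_drop_fraction(smooth_curve), sharpest_drop_fraction(s_shaped_curve))
```

The single-step fraction is just one crude proxy for "S-shaped"; with 600 checkpoints you would probably want to smooth the curve first or look at a window of steps.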

I'm currently writing a library for mechanistic interpretability of LLMs, with support for loading these models + their checkpoints - if anyone might be interested in working on this, happy to share ideas. This is a small subset of OpenWebText that seems useful for testing.

Unrelatedly, a mark against path dependence is the induction head bump result, where we found that models have a phase change where they suddenly form induction heads, and that across a range of model sizes and architectures these heads form consistently and at around the same point (though not in all architectures tested). Anecdotally, I've found that the time of formation is very sensitive to the exact positional embeddings used though.

Thanks, I really appreciate it.  Though I wouldn't personally consider this among the significant alignment work, and would love to hear about why you do!

Update 2: The nicely LaTeXed version of my Grokking post was also rejected from Arxiv?! I'll revisit this at some point in the next few weeks, but I'm going to give up on this for now. I consider this a mark against putting posts on Arxiv being an easy and fairly low effort thing to do (though plausibly still worth the effort).

Note: While I think the argument in this post is important and is real evidence, I overall expect phase changes to be a big deal. I consider my grokking work pretty compelling evidence that phase changes are tied to the formation of circuits and I'm excited about doing more research in this direction. Though it's not at all obvious to me that sophisticated behaviors like deception will be a single circuit vs many.

Thanks for the thoughts, and sorry for dropping the ball on responding to this!

I appreciate the pushback, and broadly agree with most of your points.

In particular, I strongly agree that if you're trying to form the ability to be a research lead in alignment (and less strongly, be an RE/otherwise notably contribute to research) that forming an inside view is important, totally independently from how well it tracks the truth, and agree that I undersold that in my post. 

In part, I think the audience I had in mind is different from you? I see this as partially aimed at proto-alignment researchers, but also a lot of people who are just trying to figure out whether to work on it/how to get into the field, including in less technical roles (policy, ops, community building), where I also have often seen a strong push for inside views. I strongly agree that if someone is actively trying to be an alignment researcher that forming inside views is useful. Though it seems pretty fine to do this on the job/after starting a PhD program, and in parallel to trying to do research under a mentor.

don't reject an expert's view before you've tried really hard to understand it and make it something that does work

I'm pretty happy with this paraphrase of what I mean. Most of what I'm pointing to is using the mental motion of trying to understand things rather than the mental motion of trying to evaluate things, I agree that being literally unable to evaluate would be pretty surprising. 

One way that I think it's importantly different is that it feels more comfortable to maintain black boxes when trying to understand something than when trying to evaluate something. Eg, I want to understand why people in the field have short timelines. I get to the point where I see how if I bought scaling laws continuing then everything follows. I am not sure why people believe this, and personally feel pretty confused, but expect other people to be much more informed than me. This feels like an instance where I understand why they hold their view fairly well, and maybe feel comfortable deferring to them, but don't feel like I can really evaluate their view?

What fraction of people who are trying to build inside views do you think have these problems? (Relevant since I often encourage people to do it)

Honestly I'm not sure - I definitely did, and have had some anecdata of people telling me they found my posts/claims extremely useful, or that they found these things pretty stressful, but obviously there's major selection bias. This is also just an objectively hard thing that I think many people find overwhelming (especially when tied to their social identity, status, career plans, etc). I'd guess maybe 40%? I expect framing matters a lot, and that eg pointing people to my posts may help?

I'm not immediately thinking of examples of people without inside views doing independent research that I would call "great safety relevant work".

Agreed, I'd have pretty different advice for people actively trying to do impactful independent research.

Idk, I feel like I formed my inside views by locking myself in my room for months and meditating on safety.

Interesting, thanks for the data point! That's very different from the kinds of things that work well for me (possibly just because I find locking myself in my room for a long time hard and exhausting), and suggests my advice may not generalise that well. Idk, people should do what works for them. I've found that spending time in the field resulted in me being exposed to a lot of different perspectives and research agendas, forming clearer views on how to do research, flaws in different approaches, etc. And all of this has helped me figure out my own views on things. Though I would like to have much better and clearer views than I currently do.

I'm pretty unconvinced by this. I do not think that any substantial fraction of AI x-risk comes from an alignment researcher who thinks carefully about x-risk deciding that a GPT-3 level system isn't scary enough to take significant precautions with re boxing.

I think taking frivolous risks is bad, but risk aversion to the point of not being able to pursue otherwise promising research directions seems pretty costly, while the benefit of averting risks >1e-9 is pretty negligible in comparison.

(To be clear, this argument does not apply to more powerful systems! As systems get smarter we should be more careful, and try to be very conservative! But ultimately everything is a trade-off - letting GPT-3 talk to human contractors giving feedback is a way of letting it out of the box!)

like, I agree that there are some safety lessons to be learned here, but it seems somewhat clear to me that people are working on WebGPT and InstructGPT primarily for capabilities reasons, not for existential-risk-from-AI reasons

This also seems like an odd statement - it seems reasonable to say "I think the net effect of InstructGPT is to boost capabilities" or even "If someone was motivated by x-risk it would be poor prioritisation/a mistake to work on InstructGPT". But it feels like you're assuming some deep insight into the intention behind the people working on it, and making a much stronger statement than "I think OpenAI's alignment team is making bad prioritisation decisions".

Like, reading the author list of InstructGPT, there are obviously a bunch of people on there who care a bunch about safety including I believe the first two authors - it seems pretty uncharitable and hostile to say that they were motivated by a desire to boost capabilities, even if you think that was a net result of their work.

(Note: My personal take is to be somewhat confused, but to speculate that InstructGPT was mildly good for the world? And that a lot of the goodness comes from field building of getting more people investing in good quality RLHF.)
