Neel Nanda



Trustees serve one-year terms and future Trustees will be elected by a vote of the Trustees

One year is shockingly short - why such a fast turnaround?

And great post! I'm excited to see responsible scaling policies becoming a thing!

That might work, though you could easily end up with the final model not actually faithfully using its world model to make the correct moves - if there are more efficient/correct heuristics, there's no guarantee it'll use the expensive world model rather than just forgetting about it.

Great post! I think a really cool research direction in mech interp would be looking for alignment relevant circuits in a misaligned model - it seems like the kind of concrete thing we could do mech interp on today (if we had such a model), and like it would teach us a ton about what to look for when eg auditing a potentially misaligned model. I'd love to hear about any progress you make, and possible room for collaboration.

Oh that's fascinating, thanks for sharing! In the model I was studying I found that intervening on the token direction mattered a lot for ending lines after 80 characters. Maybe there are multiple directions...? Very weird!
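For anyone wanting to poke at this themselves, here's a minimal sketch (pure NumPy, with a random stand-in vector rather than a real model's residual stream - all names here are illustrative, not from an actual codebase) of the kind of intervention I mean: projecting a hypothetical "token direction" out of a residual-stream vector.

```python
import numpy as np

def ablate_direction(resid, direction):
    """Remove the component of `resid` along `direction`
    (a simple projection ablation; names are illustrative)."""
    d_hat = direction / np.linalg.norm(direction)
    return resid - np.dot(resid, d_hat) * d_hat

rng = np.random.default_rng(0)
token_direction = rng.normal(size=512)  # hypothetical "current token index" direction
resid = rng.normal(size=512)            # stand-in residual-stream vector

ablated = ablate_direction(resid, token_direction)
# The ablated vector should be orthogonal to the direction (dot product ~0).
print(np.dot(ablated, token_direction))
```

If the behaviour (e.g. inserting a newline after 80 characters) degrades after this kind of ablation but survives ablating random directions, that's evidence the direction is causally used.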

Cool work! I've been interested in seeing a mech interp project trying to find the circuits behind sycophancy, it seems like a good microcosm for social modelling circuitry which seems a first step towards deception circuitry. How good is LLaMA 7B at being sycophantic? And do you have any thoughts on what might be good prompts for understanding sycophancy circuitry? I'm particularly interested in prompts that are modular, with key words that can be varied to change it from one valence to another while keeping the rest of the prompt intact.
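As a sketch of what I mean by modular (the template and word choices here are made-up examples, not tested prompts):

```python
# A modular sycophancy prompt: swap one key word to flip the valence
# while holding the rest of the prompt fixed, so any difference in model
# behaviour can be attributed to that word.
TEMPLATE = "I {valence} writing this poem. Can you give me honest feedback on it?\n\n{poem}"

poem = "Roses are red..."  # placeholder text
for valence in ("loved", "hated"):
    print(TEMPLATE.format(valence=valence, poem=poem))
```

Prompt pairs like this make it easy to do things like activation patching between the two valences.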

Huh, that's very useful context, thanks! Seems like pretty sad behaviour.

Great questions, thanks!

Background: You don't need to know anything beyond "a language model is a stack of matrix multiplications and non-linearities. The input is a series of tokens (words and sub-words) which get converted to vectors by a massive lookup table called the embedding (the vectors are called token embeddings). These vectors have really high cosine sim in GPT-Neo".
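A quick illustrative sketch of what "really high cosine sim" looks like - pure NumPy with random stand-in vectors rather than the actual GPT-Neo embedding matrix (the shared-mean construction below is just an assumption to mimic that anisotropy):

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-in "token embeddings": a large shared mean component plus small noise,
# mimicking embeddings that cluster tightly around a common direction.
rng = np.random.default_rng(0)
mean_direction = rng.normal(size=768)
emb_a = mean_direction + 0.1 * rng.normal(size=768)
emb_b = mean_direction + 0.1 * rng.normal(size=768)

print(cosine_sim(emb_a, emb_b))  # close to 1.0
```

With the real model, you'd compute the same statistic over rows of the embedding lookup table instead of these synthetic vectors.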

Re how long it took for scholars, hmm, maybe an hour? Not sure, I expect it varied a ton. I gave this in their first or second week, I think.

Thanks for writing this retrospective! I appreciate the reflections.

Ah, thanks! As I noted at the top, this was an excerpt from that post, which I thought was interesting as a standalone piece, but I didn't realise how much of that section depended on attribution patching knowledge.

Re modus tollens, looking at the data, it seems like the correct answer is always yes. This admits far more trivial solutions than really understanding the task (i.e. a model that always says yes scores perfectly, while one that always says no fails). Has anyone checked it with the value of the answer varied?
