Neel Nanda


Interpreting Othello-GPT
200 Concrete Open Problems in Mechanistic Interpretability



Ah, thanks! As I noted at the top, this was an excerpt from that post, which I thought was interesting as a standalone, but I didn't realise how much of that section depended on attribution patching knowledge.

Re modus tollens, looking at the data, it seems like the correct answer is always yes. This admits far more trivial solutions than really understanding the task (ie always say yes, vs always say no). Has anyone checked it for varying the value of the answer?
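A quick way to check for this kind of degenerate solution is to measure how well a constant answer scores. The sketch below is illustrative only: `constant_baseline_accuracy` is a hypothetical helper, and the example labels are made up rather than taken from the actual dataset.

```python
from collections import Counter

def constant_baseline_accuracy(labels):
    """Accuracy obtained by always predicting each label in turn.

    If any value is close to 1.0, the task can be solved without
    understanding it, just by always giving that answer.
    """
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Made-up labels for illustration (NOT the real modus tollens data).
example_labels = ["yes", "yes", "no", "yes"]
baselines = constant_baseline_accuracy(example_labels)
print(baselines)
```

If the real data gives 1.0 for "yes", a balanced variant (half "yes", half "no") would be needed before high accuracy says anything about understanding the task.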

Ah, thanks for the clarification! That makes way more sense. I was confused because you mentioned this in a recent conversation, I excitedly read the paper, and then couldn't see what the fuss was about (your post prompted me to re-read and notice section 4.1, the good section!).

Another thought: The main thing I find exciting about model editing is when it is surgical - it's easy to use gradient descent to find ways to intervene on a model, while breaking performance everywhere else. But if you can really localise where a concept is represented in the model and apply it there, that feels really exciting to me! Thus I find this work notably more exciting (because it edits a single latent variable) than ROME/MEMIT (which apply gradient descent)

Thanks for sharing! I think the paper is cool (though massively buries the lede). My summary:

  • They create a synthetic dataset of lit and unlit rooms with StyleGAN. They exploit the fact that the GAN has disentangled and meaningful directions in its latent space, which can be individually edited. They find a lighting latent automatically, by taking noise that produces rooms, editing each latent in turn and looking for big changes specifically on light pixels
    • StyleGAN does not have a text input, and there's no mention of prompting (as far as I can tell - I'm not familiar with GANs). This is not a DALL-E style model. Its input is just Gaussian noise
    • This is a really cool result, and I am excited about it! They claim that GANs have disentangled latents (and that this is already known), which makes this less exciting (man I wish this was true of LLMs). But it's still solid!
    • This is in section 4.1
  • They manually create a dataset of lit and unlit rooms, which isn't that interesting. They use this for benchmarking their method, not for actually training it (I don't find this that exciting)
  • They use the GAN as a source of training data, to train a model specifically for lit -> unlit rooms (I don't find this that exciting)
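The latent search in the first bullet can be sketched roughly as follows. This is a toy illustration of my reading of the summary, not the paper's code: `find_lighting_latent`, `toy_generator`, the fixed `light_mask`, and the "inside minus outside" score are all assumptions, and a real StyleGAN would replace the toy generator.

```python
import numpy as np

def find_lighting_latent(generate, latents, light_mask, delta=2.0):
    """Edit each latent dimension in turn and score how much the change
    concentrates on light pixels (assumed scoring rule, not the paper's)."""
    base = generate(latents)                      # shape (n, H, W)
    scores = np.zeros(latents.shape[1])
    for k in range(latents.shape[1]):
        edited = latents.copy()
        edited[:, k] += delta                     # perturb one latent only
        diff = np.abs(generate(edited) - base)
        inside = diff[:, light_mask].mean()       # change on light pixels
        outside = diff[:, ~light_mask].mean()     # change everywhere else
        scores[k] = inside - outside
    return int(np.argmax(scores)), scores

def toy_generator(z):
    """Stand-in for a GAN: latent 3 controls the 'lamp' region only."""
    img = np.zeros((z.shape[0], 8, 8))
    img += 0.1 * z[:, 0][:, None, None]           # global brightness
    img[:, :2, :2] += z[:, 3][:, None, None]      # lamp pixels
    img[:, 4:, 4:] += 0.5 * z[:, 1][:, None, None]  # unrelated object
    return img

rng = np.random.default_rng(0)
z = rng.normal(size=(16, 6))                      # noise that "produces rooms"
light_mask = np.zeros((8, 8), dtype=bool)
light_mask[:2, :2] = True                         # where the light source is

best, scores = find_lighting_latent(toy_generator, z, light_mask)
print(best, scores)
```

The point of subtracting the outside change is to penalise latents that brighten the whole image rather than the light source specifically; other scoring rules would work too.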

To me (2) seems fairly clearly false - at the very least it's not doing anything about inner alignment (debate on weights/activations does nothing to address this, since there's still no [debaters are aiming to win the game] starting point).

Why do you believe this? It's fairly plausible to me that "train an AI to use interpretability tools to show that this other AI is being deceptive" is the kind of scalable oversight approach that might work, especially for detecting inner misalignment, if you can get the training right and avoid cooperation. But that seems like a plausibly solvable problem to me

Maybe in contrast to other fields of ML? (Though that's definitely stopped being true for eg LLMs)

Case studies: finding algorithms inside networks that implement specific capabilities. My favorite papers here are Olsson et al. (2022), Nanda et al. (2023), Wang et al. (2022) and Li et al. (2022); I’m excited to see more work which builds on the last in particular to find world-models and internally-represented goals within networks.

If you want to build on Li et al (the Othello paper), my follow-up work is likely to be a useful starting point, and then the post I wrote about future directions I'm particularly excited about

Some recommended ways to upskill at empirical research (roughly in order):

For people specifically interested in getting into mechanistic interpretability, my guide to getting started may be useful - it's much more focused on the key, relevant parts of deep learning, with a bunch more interpretability specific stuff

Eventually, once you've had a bunch of experience, you might notice a feeling of confusion or frustration: why is everyone else missing the point, or doing so badly at this? (Though note that a few top researchers commented on a draft to say that they didn't have this experience.) For some people that involves investigating a specific topic (for me, the question “what’s the best argument that AGI will be misaligned?”); for others it's about applying skills like conscientiousness (e.g. "why can't others just go through all the obvious steps?"). Being excellent seldom feels like you’re excellent, because your own abilities set your baseline for what feels normal.


I relate a lot to this. It feels like one of the clearer internal markers for me of what becoming good at interpretability research felt like - there's so much low hanging fruit! Why aren't other people plucking it?

There's also just some internal sense of "I kind of know what I'm doing, and have ideas for what to do next", though this is much clearer to me when mentoring and advising other people, where I have strong opinions, than when applying it to myself, where I can sometimes pull it off but find it easy to fall into random spirals of doubt
