UPDATE: The deadline to apply is 11:59pm PT, Sat 27 Aug. I'll be selecting candidates following a 2-week (paid) work trial, 10 hrs/week starting Sept 12, where participants pick a concrete research idea and try to make progress on it.
Hey! My name is Neel Nanda. I used to work at Anthropic on LLM interpretability under Chris Olah (the Transformer Circuits agenda), and am currently doing some independent mechanistic interpretability work, most recently on grokking (see summary). I have a bunch of concrete project ideas, and am looking to hire an RA/intern to help me work on some of them!
Role details: The role would be remote by default. Full-time (~40 hrs/week). Roughly for the next 2-3 months, but flexible-ish. I can pay you $50/hr (via a grant).
What I can offer: I can offer concrete project ideas, help getting started, and about 1 hr/week of ongoing mentorship (ideally more, but that's all I feel confident committing to). I'm much better placed to offer high-level research mentorship/guidance than ML engineering guidance.
Pre-requisites: As such, I'd be looking for someone with existing ML skill, enough to write and run simple experiments, especially with transformers (e.g. being able to write your own GPT-2-style transformer from scratch in PyTorch is more than enough). Someone who's fairly independent-minded, who could mostly run with a concrete project idea on their own and mainly needs help with high-level direction. Familiarity with transformer circuits and good linear algebra intuitions are nice-to-haves, but not essential. I expect this is best suited to someone who wants to test their fit for alignment work, and who is particularly interested in mechanistic interpretability.
Project ideas: Two main categories.
- One of the future directions from my work on grokking, interpreting how models change as they train.
- This has a fairly mathsy flavour and involves training and interpreting tiny models on simple problems, looking at how concrete circuits develop over training, and possibly at what happens when we put interpretability-inspired metrics in the loss function (see the first sketch just below this list for a flavour of the setup). Some project ideas involve training larger models, e.g. small SoLU transformers.
- Work on building better techniques to localise specific capabilities/circuits in large language models, inspired by work such as ROME.
- This would have a less mathsy flavour and more involved coding, and you'd be working with pre-trained larger models (GPT-2 scale). It would likely look like alternating between looking for specific circuits in an ad-hoc way (e.g. how GPT-2 knows that the Eiffel Tower is in Paris; see the second sketch below this list) and distilling the techniques that worked best into standard, ideally automatable, tools.
- Optionally, this could also involve making open-source interpretability tooling for LLMs.
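To give a flavour of the first category, here's a rough sketch of a grokking-style training loop: a tiny transformer trained on modular addition with heavy weight decay, logging train vs test accuracy so you can watch them diverge and then (much later) reconverge. The hyperparameters and architecture details here are illustrative guesses, not a tuned setup from my grokking work.

```python
import torch
import torch.nn as nn

p = 113  # modulus; the task is predicting (a + b) mod p
device = "cuda" if torch.cuda.is_available() else "cpu"

# Dataset: every pair (a, b) as the sequence [a, b, "="], labelled (a + b) % p.
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
eq_token = torch.full((len(pairs), 1), p)  # "=" gets its own token id
data = torch.cat([pairs, eq_token], dim=1)
labels = (pairs[:, 0] + pairs[:, 1]) % p

# Train on a fraction of all pairs; grokking shows up in the held-out accuracy.
perm = torch.randperm(len(data))
split = int(0.3 * len(data))
train_x, train_y = data[perm[:split]].to(device), labels[perm[:split]].to(device)
test_x, test_y = data[perm[split:]].to(device), labels[perm[split:]].to(device)

class TinyTransformer(nn.Module):
    def __init__(self, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(p + 1, d_model)  # p numbers plus the "=" token
        self.pos = nn.Parameter(0.02 * torch.randn(3, d_model))
        self.block = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=512, batch_first=True)
        self.unembed = nn.Linear(d_model, p)

    def forward(self, x):
        h = self.block(self.embed(x) + self.pos)
        return self.unembed(h[:, -1])  # read the prediction off the "=" position

model = TinyTransformer().to(device)
# Heavy weight decay is the key ingredient for grokking-style dynamics.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for step in range(30000):
    model.train()
    logits = model(train_x)
    loss = nn.functional.cross_entropy(logits, train_y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        model.eval()
        with torch.no_grad():
            train_acc = (logits.argmax(-1) == train_y).float().mean().item()
            test_acc = (model(test_x).argmax(-1) == test_y).float().mean().item()
        print(f"step {step:6d}  loss {loss.item():.4f}  "
              f"train acc {train_acc:.3f}  test acc {test_acc:.3f}")
```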
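And for the second category, a rough sketch of activation patching on GPT-2, loosely in the spirit of ROME's causal tracing, though simplified: this patches each layer's MLP output at just the final token position, rather than restoring states over corrupted subject tokens as ROME does. The prompts and the " Paris" logit metric are illustrative choices, not a fixed recipe.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()

clean = "The Eiffel Tower is located in the city of"
corrupt = "The Colosseum is located in the city of"
paris_id = tok.encode(" Paris")[0]

def run(prompt, patch_layer=None, cache=None):
    """Forward pass; optionally overwrite one layer's MLP output at the
    final token position with the cached activation from the clean run."""
    handle = None
    if patch_layer is not None:
        def patch_hook(module, inputs, output):
            patched = output.clone()
            patched[:, -1, :] = cache[patch_layer][:, -1, :]
            return patched  # returning a value replaces the module's output
        handle = model.transformer.h[patch_layer].mlp.register_forward_hook(patch_hook)
    ids = tok(prompt, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    if handle is not None:
        handle.remove()
    return logits

# Cache every layer's MLP output on the clean prompt.
cache = {}
handles = [
    model.transformer.h[i].mlp.register_forward_hook(
        lambda mod, inp, out, i=i: cache.__setitem__(i, out.detach()))
    for i in range(model.config.n_layer)
]
clean_logits = run(clean)
for h in handles:
    h.remove()

# Patch each layer in turn into the corrupted run: layers whose patch
# substantially boosts the " Paris" logit are candidates for storing the fact.
print(f"clean ' Paris' logit: {clean_logits[paris_id].item():.2f}")
print(f"corrupt ' Paris' logit: {run(corrupt)[paris_id].item():.2f}")
for layer in range(model.config.n_layer):
    patched_logits = run(corrupt, patch_layer=layer, cache=cache)
    print(f"patch layer {layer:2d}: {patched_logits[paris_id].item():.2f}")
```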
Actions: If you're interested in potentially working with me, please reach out at neelnanda27@gmail.com. My ideal email would include:
- Details on your background and AI/interpretability experience (possible formats: LinkedIn link, resume, GitHub repo links for projects you're proud of, a description of how confident you'd feel writing your own tiny GPT-2, whether you've read/understood A Mathematical Framework for Transformer Circuits, and anything else that feels relevant)
- Why you're interested in working with me/on mechanistic interpretability, and any thoughts on the kinds of projects you'd be excited about
- What you would want to get out of working together
- What kind of time frame you'd want to work together for, and how much time you could commit to this
- No need to overthink it! This could be a single paragraph.
Disclaimers:
- I will most likely take on either one or zero interns; apologies in advance if I reject you! I expect this to be an extremely noisy hiring process, so please don't take it personally.
- I expect to have less capacity for mentorship than I'd like, and it's very plausible you could get a better option elsewhere.
- I've never had an intern before, and expect this to be a learning experience!
- Mechanistic interpretability is harder to publish in than most fields of ML, and by default I would not expect this to result in papers published at conferences (though I'd be happy to support you in aiming for that!).