UPDATE: The deadline to apply is 11:59pm PT, Sat 27 Aug. I'll be selecting candidates following a 2-week (paid) work trial, 10 hrs/week starting Sept 12, where participants pick a concrete research idea and try to make progress on it.
Hey! My name is Neel Nanda. I used to work at Anthropic on LLM interpretability under Chris Olah (the Transformer Circuits agenda), and am currently doing some independent mechanistic interpretability work, most recently on grokking (see summary). I have a bunch of concrete project ideas, and am looking to hire an RA/intern to help me work on some of them!
Role details: The role would be remote by default. Full-time (~40 hrs/week). Roughly for the next 2-3 months, but flexible-ish. I can pay you $50/hr (via a grant).
What I can offer: I can offer concrete project ideas, help getting started, and about 1 hr/week of ongoing mentorship (ideally more, but that's all I feel confident committing to). I'm much better placed to offer high-level research mentorship/guidance than ML engineering guidance.
Pre-requisites: As such, I'd be looking for someone with existing ML skill, enough to write and run simple experiments, especially with transformers (e.g. being able to write your own GPT-2-style transformer from scratch in PyTorch is more than enough). Someone who's fairly independent-minded, who could mostly run with a concrete project idea on their own and mainly needs help with high-level direction. Familiarity with transformer circuits and good linear algebra intuitions are nice-to-haves, but not essential. I expect this is best suited to someone who wants to test their fit for alignment work, and who is particularly interested in mechanistic interpretability.
Project ideas: Two main categories.
- One of the future directions from my work on grokking, interpreting how models change as they train.
- This has a fairly mathsy flavour and involves training and interpreting tiny models on simple problems, looking at how concrete circuits develop over training, and possibly at what happens when we put interpretability-inspired metrics in the loss function (see the first sketch just below this list for a flavour of the setup). Some project ideas involve training larger models, e.g. small SoLU transformers.
- Work on building better techniques to localise specific capabilities/circuits in large language models, inspired by work such as ROME.
- This would have a less mathsy flavour and more involved coding, and you'd be working with pre-trained larger models (GPT-2 scale). It would likely look like alternating between looking for specific circuits in an ad-hoc way (e.g. how GPT-2 knows that the Eiffel Tower is in Paris; see the second sketch below this list) and distilling the techniques that worked best into standard, ideally automatable, tools.
- Optionally, this could also involve making open-source interpretability tooling for LLMs.
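To give a flavour of the first category, here's a rough sketch of a grokking-style training loop: a tiny transformer trained on modular addition with heavy weight decay, logging train vs test accuracy so you can watch them diverge and then (much later) reconverge. The hyperparameters and architecture details here are illustrative guesses, not a tuned setup from my grokking work.

```python
import torch
import torch.nn as nn

p = 113  # modulus; the task is predicting (a + b) mod p
device = "cuda" if torch.cuda.is_available() else "cpu"

# Dataset: every pair (a, b) as the sequence [a, b, "="], labelled (a + b) % p.
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
eq_token = torch.full((len(pairs), 1), p)  # "=" gets its own token id
data = torch.cat([pairs, eq_token], dim=1)
labels = (pairs[:, 0] + pairs[:, 1]) % p

# Train on a fraction of all pairs; grokking shows up in the held-out accuracy.
perm = torch.randperm(len(data))
split = int(0.3 * len(data))
train_x, train_y = data[perm[:split]].to(device), labels[perm[:split]].to(device)
test_x, test_y = data[perm[split:]].to(device), labels[perm[split:]].to(device)

class TinyTransformer(nn.Module):
    def __init__(self, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(p + 1, d_model)  # p numbers plus the "=" token
        self.pos = nn.Parameter(0.02 * torch.randn(3, d_model))
        self.block = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=512, batch_first=True)
        self.unembed = nn.Linear(d_model, p)

    def forward(self, x):
        h = self.block(self.embed(x) + self.pos)
        return self.unembed(h[:, -1])  # read the prediction off the "=" position

model = TinyTransformer().to(device)
# Heavy weight decay is the key ingredient for grokking-style dynamics.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for step in range(30000):
    model.train()
    logits = model(train_x)
    loss = nn.functional.cross_entropy(logits, train_y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        model.eval()
        with torch.no_grad():
            train_acc = (logits.argmax(-1) == train_y).float().mean().item()
            test_acc = (model(test_x).argmax(-1) == test_y).float().mean().item()
        print(f"step {step:6d}  loss {loss.item():.4f}  "
              f"train acc {train_acc:.3f}  test acc {test_acc:.3f}")
```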
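And for the second category, a rough sketch of activation patching on GPT-2, loosely in the spirit of ROME's causal tracing, though simplified: this patches each layer's MLP output at just the final token position, rather than restoring states over corrupted subject tokens as ROME does. The prompts and the " Paris" logit metric are illustrative choices, not a fixed recipe.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()

clean = "The Eiffel Tower is located in the city of"
corrupt = "The Colosseum is located in the city of"
paris_id = tok.encode(" Paris")[0]

def run(prompt, patch_layer=None, cache=None):
    """Forward pass; optionally overwrite one layer's MLP output at the
    final token position with the cached activation from the clean run."""
    handle = None
    if patch_layer is not None:
        def patch_hook(module, inputs, output):
            patched = output.clone()
            patched[:, -1, :] = cache[patch_layer][:, -1, :]
            return patched  # returning a value replaces the module's output
        handle = model.transformer.h[patch_layer].mlp.register_forward_hook(patch_hook)
    ids = tok(prompt, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    if handle is not None:
        handle.remove()
    return logits

# Cache every layer's MLP output on the clean prompt.
cache = {}
handles = [
    model.transformer.h[i].mlp.register_forward_hook(
        lambda mod, inp, out, i=i: cache.__setitem__(i, out.detach()))
    for i in range(model.config.n_layer)
]
clean_logits = run(clean)
for h in handles:
    h.remove()

# Patch each layer in turn into the corrupted run: layers whose patch
# substantially boosts the " Paris" logit are candidates for storing the fact.
print(f"clean ' Paris' logit: {clean_logits[paris_id].item():.2f}")
print(f"corrupt ' Paris' logit: {run(corrupt)[paris_id].item():.2f}")
for layer in range(model.config.n_layer):
    patched_logits = run(corrupt, patch_layer=layer, cache=cache)
    print(f"patch layer {layer:2d}: {patched_logits[paris_id].item():.2f}")
```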
Actions: If you're interested in potentially working with me, please reach out at neelnanda27@gmail.com. My ideal email would include:
- Details on your background and AI/interpretability experience (possible formats: LinkedIn link, resume, GitHub repo links for projects you're proud of, a description of how confident you'd feel writing your own tiny GPT-2, whether you've read/understood A Mathematical Framework for Transformer Circuits, and anything else that feels relevant)
- Why you're interested in working with me/on mechanistic interpretability, and any thoughts on the kinds of projects you'd be excited about
- What you would want to get out of working together
- What kind of time frame you'd want to work together for, and how much time you could commit to this
- No need to overthink it! This could be a single paragraph.
Disclaimers:
- I will most likely take on either one or zero interns; apologies in advance if I reject you! I expect this to be an extremely noisy hiring process, so please don't take it personally.
- I expect to have less capacity for mentorship than I'd like, and it's very plausible you could get a better option elsewhere.
- I've never had an intern before, and expect this to be a learning experience!
- Mechanistic interpretability is harder to publish in than most fields of ML, and by default I would not expect this to result in papers published at conferences (though I'd be happy to support you in aiming for that!).