Note: If you’ll forgive the shameless self-promotion, applications for my MATS stream are open until Sept 12. I help people write a mech interp paper, often accept promising people new to mech interp, and alumni often have careers as mech interp researchers. If you’re interested in this post I recommend applying! The application should be educational whatever happens: you spend a weekend doing a small mech interp research project, and show me what you learned.
Last updated Sept 2 2025
Mechanistic interpretability (mech interp) is, in my incredibly biased opinion, one of the most exciting research areas out there. We have these incredibly complex AI models that we don't understand, yet there are tantalizing signs of real structure inside them. Even partial understanding of this structure opens up a world of possibilities, yet is neglected by 99% of machine learning researchers. There’s so much to do!
I think mech interp is an unusually easy field to learn about on your own: there are lots of educational materials, you don't need much compute, and the feedback loops are short. But if you're new, it can feel pretty intimidating to get started. This is my updated guide on how to skill up, get involved, and reach the point where you can do actual research, plus some advice on how to go from there to a career or academic role in the field.
This guide is deliberately highly opinionated. My goal is to convey a productive mindset and concrete steps that I think will work well, and give a sense of direction, rather than trying to give a fully broad overview or perfect advice. (And many of the links are to my own work because that's what I know best. Sorry!)
My core philosophy for getting into mech interp is this: learn the absolute minimal basics as quickly as possible, and then immediately transition to learning by doing research.
The goal is not to read every paper before you touch research. When doing research you'll notice gaps and go back to learn more. But being grounded in a project will give you vastly more direction to guide your learning, and contextualise why anything you’re learning actually matters. You just want enough grounding to start a project with some understanding of what you’re doing.
Don't stress about the research quality at first, or having the perfect project idea. Key skills, like research taste and the ability to prioritize, take time to develop. Gaining experience—even messy experience—will teach you the basics like how to run and interpret experiments, which in turn help you learn the high-level skills.
I break this down into three stages:
Your goal here is learning the basics: how to write experiments with a mech interp library, understanding the key concepts, getting the lay of the land.
Your aim is learning enough that the rest of your learning can be done via doing research, not finishing learning. Prioritize ruthlessly. After max 1 month[1], move on to stage 2. I’ve flagged which parts of this I think are essential, vs just nice to have.
Do not just read papers - a common mistake among academic types is to spend months reading as many papers as they can get their hands on before writing code. Don’t do it. Mech interp is an empirical science, getting your hands dirty gives key context for your learning. Intersperse reading papers with doing coding tutorials or small research explorations. See my research walkthroughs for an idea of what tiny exploratory projects can look like.
LLMs are a key tool - see the section below for advice on using them well
Assuming you already know basic Python and introductory ML concepts.
Code a simple Transformer (like GPT-2) from scratch. ARENA Chapter 1.1 is a great coding tutorial[2] (for a sense of the scale involved, see the attention sketch after this list)
This builds intuitions both for mech interp and for using PyTorch.
I have two video tutorials on this, starting from the basics - start here if you’re not sure what to do!
And use LLMs to fill in any background things you’re missing, like PyTorch basics
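For a sense of what the from-scratch exercise involves, here is a minimal sketch (my own illustrative code, not the ARENA solution) of a causally-masked multi-head attention layer in PyTorch; the full exercise also covers embeddings, MLPs, LayerNorm, and the unembedding:

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """Minimal multi-head self-attention with a causal mask (GPT-2 style)."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # combined Q/K/V projection
        self.out = nn.Linear(d_model, d_model)      # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq, d_model = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: [batch, n_heads, seq, d_head]
        split = lambda t: t.view(batch, seq, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / self.d_head**0.5
        # Causal mask: each position only attends to itself and earlier positions
        mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=x.device), diagonal=1)
        pattern = scores.masked_fill(mask, float("-inf")).softmax(dim=-1)
        z = (pattern @ v).transpose(1, 2).reshape(batch, seq, d_model)
        return self.out(z)

print(Attention()(torch.randn(2, 10, 768)).shape)  # torch.Size([2, 10, 768])
```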
A lot of mech interp research looks like knowing the right technique to apply and in what context. This is a key thing to prioritise getting your head around when starting out. You’ll learn this with a mix of reading educational materials and doing coding tutorials like ARENA (discussed in next sub-section).
Essential: Make sure you understand these core techniques, well enough that you can code each one up yourself on a simple model like GPT-2 Small[3] (see the activation patching sketch after this list for a concrete example):
Activation Patching
Linear Probes
Using Sparse Autoencoders (SAEs) (you only need to write code that uses an SAE, not trains one)
Max Activating Dataset Examples
Nice-to-have:
Steering Vectors
Direct Logit Attribution (DLA) (a simpler version is called logit lens)
Key exercise: Describe each technique to an LLM with Ferrando et al in the context window and ask for feedback. Iterate until you get it all right.
Use an anti-sycophancy prompt to get real feedback, by pretending someone else wrote your answer, e.g. “I saw someone claim this, it seems pretty off to me, can you help me give them direct but constructive feedback on what they missed? [insert your description]”
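To make the first of these concrete, here is a minimal sketch of activation patching on GPT-2 Small, written against TransformerLens (one common mech interp library). The prompts, layer, and position are arbitrary illustrative choices, not a recipe:

```python
# Patch the clean run's residual stream into a corrupted run at one (layer, position)
# and see how much of the clean behaviour is restored.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean_prompt = "When John and Mary went to the store, John gave the bag to"
corrupt_prompt = "When John and Mary went to the store, Mary gave the bag to"
answer = model.to_single_token(" Mary")

clean_tokens, corrupt_tokens = model.to_tokens(clean_prompt), model.to_tokens(corrupt_prompt)
_, clean_cache = model.run_with_cache(clean_tokens)  # cache every clean activation

LAYER, POS = 6, 10  # arbitrary choices for illustration; in practice you sweep over these
hook_name = utils.get_act_name("resid_pre", LAYER)

def patch_hook(resid, hook):
    # Overwrite one position of the corrupted residual stream with the clean value
    resid[:, POS, :] = clean_cache[hook.name][:, POS, :]
    return resid

patched_logits = model.run_with_hooks(corrupt_tokens, fwd_hooks=[(hook_name, patch_hook)])
corrupt_logits = model(corrupt_tokens)

print("corrupted logit for ' Mary':", corrupt_logits[0, -1, answer].item())
print("patched logit for ' Mary':  ", patched_logits[0, -1, answer].item())
```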
Goal: Get comfortable running experiments and "playing" with model internals. Get the engineering basics down[4]. Get your hands dirty.
Exercise: Make a happiness steering vector for e.g. GPT-2 Small: use an LLM API to generate 32 happy prompts and 32 sad prompts, and take the difference in mean activations[5] (e.g. the residual stream at the middle layer). Add this vector to the model’s residual stream[6] while generating responses to some example prompts, use an LLM API to rate how happy the responses seem, and check that this score goes up when steering (see the sketch after this list).
As of early Sept 2025, Qwen3 is a good default model family. Each model has reasoning and non-reasoning mode, there’s a good range of sizes, and most are dense[7]
Gemma 3 and LLaMA 3.3 are decent non-reasoning models. I’ve heard bad things about gpt-oss and LLaMA 4
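Here is a minimal sketch of the steering exercise above, using GPT-2 Small and TransformerLens. The prompt lists are tiny placeholders (the exercise says to generate 32 of each with an LLM API), and the LLM-judge scoring step is omitted; treat it as a starting point, not a full solution:

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
LAYER = 6  # roughly the middle of GPT-2 Small's 12 layers
hook_name = utils.get_act_name("resid_pre", LAYER)

happy_prompts = ["I am overjoyed, today has been absolutely wonderful."]  # placeholder: use 32
sad_prompts = ["I am miserable, today has been absolutely awful."]        # placeholder: use 32

def mean_activation(prompts):
    vecs = []
    for p in prompts:
        _, cache = model.run_with_cache(model.to_tokens(p))
        # Skip position 0: the first token is a weird attention sink with anomalous norm
        vecs.append(cache[hook_name][0, 1:].mean(dim=0))
    return torch.stack(vecs).mean(dim=0)

steering_vector = mean_activation(happy_prompts) - mean_activation(sad_prompts)

COEFF = 8.0  # crucial hyperparameter: sweep a range of values and compare

def steering_hook(resid, hook):
    return resid + COEFF * steering_vector

with model.hooks(fwd_hooks=[(hook_name, steering_hook)]):
    out = model.generate(model.to_tokens("I went outside and"), max_new_tokens=30)
print(model.to_string(out[0]))
```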
Your priority is to understand the concepts and the basics, but you want a sense for the landscape of the field, so you should practice reading at least some papers.
As a tool to help you skim a paper - put the paper in the context window[8] then get a summary, ask it questions, etc
Note: I expect this section to go out of date fast! Written early Sept 2025
LLMs are a super useful tool for learning, especially in a new field. While they struggle to beat experts, they often beat novices. If you aren’t using them regularly throughout this process, I’d guess you’re leaving a bunch of value on the table.
But LLMs have weird flaws and strengths, and it’s worth being intentional about how you use them:
Context engineering: Modern LLMs are much more useful with relevant info in context. If you give them the paper in question, or source code of the relevant library[9], they’ll be far more helpful.
See this folder for a bunch of saved context files for mech interp queries. If you don’t know what you need, just use this default file.
I recommend Gemini 2.5 Pro (1M context window) via aistudio.google.com; the UI is better. Always turn compare mode on; you get two answers in parallel
Feel free to skip to the “what should I do next” part
At this point it’s worth reflecting on what mech interp actually is. What are we even doing here? There isn't a consensus definition on how exactly to define mechanistic interpretability, and different researchers will give very different takes. But my working definition is as follows[10].
Why this definition? To do impactful research, it's often good to find the directions that other people are missing. I think of most of machine learning as non-mechanistic non-interpretability. 99% of ML research just looks at the inputs and outputs to models, and treats its north star as controlling their behavior. Progress is defined by making a number go up, not by explaining why it works. This has been very successful, but IMO leaves a lot of value on the table. Mechanistic interpretability is about doing better than this, and has achieved a bunch of cool stuff, like teaching grandmasters how to play chess better by interpreting AlphaZero.
Why care? Obviously, our goal is not “do things if and only if they fit the above definition”, but I find it a useful one. To discuss this, let’s first consider our actual goals here. To me, the ultimate goal is to make human-level AI systems (or beyond) safer. I do mech interp because I think we’ll find enough understanding of what happens inside a model to be pragmatically useful here (also, because mech interp is fun!): to better understand how they work, detect if they're lying to us, detect and diagnose unexpected failure modes, etc. But people’s goals vary, e.g. real-world usefulness today, aesthetic beauty, or scientific insight. It’s worth thinking about what yours are.
Some implications of this framing worth laying out:
This is a broad definition. Historically, the field has focused on more specific agendas, like ambitious reverse engineering of models. But I think we shouldn’t limit ourselves: there are many other important and neglected directions, and the field is large enough to cover a lot of ground[11]
So, you've gone through the tutorials, you understand the core concepts, and you can write some basic experimental code. Now comes the hard part: learning how to actually do mech interp research[12].
This is an inherently difficult thing to learn, of course. But IMO people often misunderstand what they need to do here, try to learn everything at once, or more generally make life unnecessarily hard for themselves. The key is to break the process down, understand the different skills involved, and focus on learning the pieces with the fastest feedback loops first.
I suggest breaking this down into two stages[13].
Stage 2: working on a bunch of throwaway mini projects of 1-5 days each. Don't stress about choosing the best projects or producing public output. The goal is to learn the skills with the fastest feedback loops.
Stage 3: After a few weeks of these, start to be more ambitious: paying more attention to how you choose your projects, gaining the subtler skills, and learning how to write things up. I still recommend working iteratively, in one to two week sprints, but ending up with longer-term projects if things go well.
Note: Unlike the transition from stage 1 to 2, the transition from stage 2 to 3 should be fairly gradual, as you take on larger projects and become more ambitious. A good default is to shift after three to four weeks in stage 2, but you don’t need a big formal shift.
Mentorship: A good mentor is a major accelerator, and finding one should be a major priority for you. In the careers section, I provide advice on how to go about finding a good mentor, and how concretely they can add value. I'll write most of the rest of the post assuming you do not have a mentor, and flag the ways a mentor can help where appropriate.
I find it helpful to think of research as a cycle of four distinct stages. Read my blog post on the research process for full details, but in brief:
Underpinning these stages is a host of skills, best separated by how quickly you can apply them and get feedback. We learn by doing things and getting feedback, so you’ll learn the fast ones much more quickly. I put a rough list and categorization below.
My general advice is to prioritize learning these in order of feedback loops. If it seems like you need a slow skill to get started, like the taste to choose a good research problem, find a way to cheat rather than stressing about not having that skill (e.g. doing an incremental extension to a paper, getting one from a mentor, etc).
Your progression should be simple: First, focus on the fast/medium skills behind exploration and understanding with throwaway projects. Then, graduate to end-to-end projects where you can intentionally practice the deeper skills, and practice ideation and distillation too.
A particularly important and fuzzy type of skill is called research taste. I basically think of this as the bundle of intuitions you get with enough research experience that let you do things like come up with good ideas, predict if an idea is promising, have conviction in good research directions, etc. Check out my post on the topic for more thoughts.
I broadly think you should just ignore it for now, find ways to compensate for not having much yet, and focus on learning the fast-medium skills, and this will give you a much better base for learning it. In particular, it's much faster to learn with a mentor, so if you don't have a mentor at the start, you should prioritize other things.
But you want to learn it eventually, so it's good to be mindful of it throughout, and look for opportunities to practice and learn lessons. I recommend treating it as a nice-to-have but not stressing about it
Note: one important trap here is that having good taste often manifests as having confidence and conviction in some research direction. But novice researchers often develop this confidence and conviction significantly before they develop the ability to not be confident in bad ideas. It’s often a good learning experience to once or twice pursue something you feel really convinced will be epic and then discover you're wrong, so it's not that bad an outcome, especially in stage 2 (mini-projects). But be warned.
With that big picture in mind, let's get our hands dirty. You want to do a series of ~1-5 day mini-projects, for maybe 2-4 weeks. The goal right now is to learn the craft, not to produce groundbreaking research.
Focus on practicing exploration and understanding and gaining the fast/medium skills, leave aside ideation and distillation for now. If you produce something cool and want to write it up, great! But that’s a nice-to-have, not a priority.
Once you finish a mini-project, remember to do a post-mortem. Spend at least an hour analyzing: what did you do? What did you try? What worked? What didn't? What mistakes did you make? What would you do differently if doing this again? And how can you integrate this into your research strategy going forwards?
Some suggested starter projects
Those cover two kinds of starter projects:
Common mistakes:
The idea of exploration as a phase in itself often trips up people new to mech interp. They feel like they always need to have a plan, a clear thing they're doing at any given point, etc. In my experience, you will often spend more than half of a project trying to figure out what the hell is happening and what you think your plan is. This is totally fine!
You don't need a plan. It's okay to be confused. However, this does not mean you should just screw around. Your North Star: gain information and surface area[14] on the problem. Your job is to take actions that maximise information gained per unit time. If you've learned nothing in 2 hours, pivot to another approach. If 2-3 approaches were dead ends, it’s fine to just pick another problem.
I have several research walkthroughs on my YouTube channel that I think demonstrate the mindset of exploration, and what I think is an appropriate speed to be moving at. E.g. I think you should aim to make a new plot every few minutes (or faster!) if experiments don't take too long to run.
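As an example of the kind of quick, throwaway plot I mean, here is a hedged sketch (assuming TransformerLens and matplotlib) of a logit-lens style plot: how the model's probability of a particular next token evolves across layers. The prompt and token are arbitrary illustrative choices:

```python
import matplotlib.pyplot as plt
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "The Eiffel Tower is in the city of"
answer = model.to_single_token(" Paris")

_, cache = model.run_with_cache(model.to_tokens(prompt))

probs = []
with torch.no_grad():
    for layer in range(model.cfg.n_layers):
        resid = cache["resid_post", layer][0, -1]  # residual stream at the final position
        # Logit lens: apply the final LayerNorm and unembedding to an intermediate residual
        logits = model.unembed(model.ln_final(resid[None, None, :]))
        probs.append(logits.softmax(dim=-1)[0, 0, answer].item())

plt.plot(range(model.cfg.n_layers), probs, marker="o")
plt.xlabel("layer")
plt.ylabel("P(' Paris') via logit lens")
plt.show()
```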
A common difficulty is feeling “stuck” and not knowing what to do. IMO, this is largely a skill issue. Here's my recommended toolkit when this happens:
Other advice:
If exploration goes well, you'll start to form hunches about the problem. E.g. thinking that you are successfully (linearly) probing for some concept. Or that you found a direction that mediates refusal. Or that days of the week are represented as a circle in a 2D subspace.
Once you have this, you want to go and figure out whether it's actually true. Be warned: the feeling of “being really convinced that it's true” is very different from it actually being true. Part of being a good researcher is being good enough at testing and falsifying your pet hypotheses that, when you fail to falsify one, there’s a good chance that it's true. But you're probably not there yet.
Note: While I find it helpful to think of these as discrete stages, often you'll be flitting back and forth. A great way to explore is coming up with guesses and micro-hypotheses about what's going on, running a quick experiment to test them, and integrating the results into your understanding of the problem, going back to the drawing board.
Your North Star: convince yourself a hypothesis is true or false. The key mindset is skepticism. Advice:
You then want to convert these flaws and alternative hypotheses into concrete experiments. Experiment design is a deep skill. Honestly, I'm not sure how to teach it other than through experience. But one recommendation is to pay close attention to the experiments in papers you admire and analyze what made them so clever and effective. I also recommend, every time you feel like you’ve (approximately) proven or falsified a hypothesis, adding it to a running doc of “things I believe to be true”, along with the hypothesis, experiments, and results.
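To make the probing example above concrete, here is a minimal sketch of that kind of experiment (assuming TransformerLens and scikit-learn; the prompts are placeholders, and in practice you would want hundreds of labelled examples). The held-out test set and the shuffled-label baseline are the most basic falsification checks:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
hook_name = utils.get_act_name("resid_post", 6)

# Placeholder data: in practice you'd want hundreds of labelled prompts
prompts = ["This movie was fantastic", "I loved every minute of it",
           "This movie was terrible", "I hated every minute of it"]
labels = np.array([1, 1, 0, 0])  # 1 = positive sentiment, 0 = negative

acts = []
for p in prompts:
    _, cache = model.run_with_cache(model.to_tokens(p))
    acts.append(cache[hook_name][0, -1].detach().cpu().numpy())  # final-token residual stream
X = np.stack(acts)

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.5, random_state=0, stratify=labels)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", probe.score(X_te, y_te))

# Falsification check: a probe trained on shuffled labels should be at chance
shuffled = np.random.default_rng(0).permutation(y_tr)
baseline = LogisticRegression(max_iter=1000).fit(X_tr, shuffled)
print("shuffled-label accuracy:", baseline.score(X_te, y_te))
```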
In my opinion, coding is one of the domains where LLMs are most obviously useful. It was very striking to me how much better my MATS scholars were six months ago than 12 months ago, and I think a good chunk of this is attributable to them having much better LLMs to use. If you are not using LLMs as a core part of your coding workflow, I think you're making a mistake.
Feel free to skip to the “what should I do next” part
Things move fast in mechanistic interpretability. Newcomers to the field who've kept up from afar are often pretty out of date. Here's what I think you need to know, again, filtered through my own opinions and biases.
This interlude is particularly important because the field often has fads: lines of research that are very popular for a year or so, make some progress, run into many limitations, and then the field moves on. But if you’re new and catching up on the literature, you might not realise this. I often see people new to the field working on older topics that I don’t think are very productive to work on any more. Historical fads include:
We're at the tail end of a fad of incremental sparse autoencoder research[15] (i.e. focusing on simple uses and refinements of the basic technique)
Calling this one a fad is probably more controversial (if only because it's more recent).
The specific thing I am critiquing is the spate of papers, including ones I was involved in, that are about incremental improvements to the sparse autoencoder architecture, or initial demonstrations that you can apply SAEs to do things, or picking some downstream task and seeing what SAEs do on it.
I think this made some sense when it seemed like SAEs could be a total gamechanger for the field, and where we were learning things from each new such paper. I think this moment has passed; I do not think they were a gamechanger in the way that I hoped they might be. See more of my thoughts here.
I am not discouraging work on the following:
Attribution graph-based circuit analysis, which I don't think has played out yet - see a recent overview of that sub-field I co-wrote.
Trying meaningfully different approaches to dictionary learning (eg SPD or ITDA), or things targeted to fix conceptual limitations of current techniques (eg Matryoshka).
Using SAEs as a tool, whether as part of a broader project investigating weird phenomena in model biology, or as a baseline/approach on some downstream task. The key is that the project’s motivation should not just be “what if we used SAEs for X?” unless there’s a good argument
I particularly recommend them for tasks where you don’t know exactly what you’re looking for, e.g. trying to explore some mysterious phenomena
Note that I am putting this after stage 2 because I think that for initial throwaway projects you should not be stressing about novelty and avoiding fads - your goal is just to learn. But as we move into stage 3 you should start to be a bit more mindful about choosing more exciting/impactful projects where possible.
Also, take these as nudges and recommendations, not as instructions. If there's a direction you believe in that fits the things I'm critiquing, maybe I'm just wrong, maybe your thing is an exception, go wild, see what happens.
OK, so those are my hot takes on what not to do. What should you do? I think that some really cool new opportunities have opened up in mech interp over the last year, and newcomers may not have come across these. Here are some of the key themes in my favorite papers over the last year, that I’d love to see readers build on:
Model organisms: The auditing games paper was made possible by the fact that they were able to make a model with a hidden goal[16], a model organism to study. In general, we’re collecting techniques like synthetic document fine-tuning to make really interesting model organisms.
This kind of thing has a lot of potential! If we want to make a lie detector, a core challenge is that we don’t know how to test if it works or not. But if we can insert beliefs or deceptive behaviours into a model, many more projects become possible
A great intro project is playing around with open source model organisms, e.g. from Cywinski et al
See this cross-org blog post for the ongoing follow-on work across the community, and an open problems list I co-wrote![17]
A line of work studying emergent misalignment - why training models on narrowly evil tasks like writing insecure code turns them into Nazis - has yielded some insights. Wang et al found this was driven by sparse autoencoder latents[18] associated with movie villains, and in Turner et al we found that the model could have learned the narrow solution, but that this was in some sense less “efficient” and “stable”
Automated interpretability: Using LLMs to automate interpretability. We saw signs of life on this from Bills et al and Shaham et al, but LLMs are actually good now! It’s now possible to make basic interpretability agents that can do things like solve auditing games[19]. And interpretability agents are the worst they’ll ever be[20].
Reasoning model interpretability: All current frontier models are reasoning models—models that are trained with reinforcement learning to think[21] for a while before producing an answer. In my opinion, this requires a major rethinking of many existing interpretability approaches[22], and calls for exploring new paradigms. IMO this is currently being neglected by the field, but will become a big deal.
In Bogdan et al, we explored what a possible paradigm could look like. Notably, there are far more interesting and sophisticated black box techniques with reasoning models, like resampling the second half of the chain of thought, or every time the model says a specific kind of sentence, deleting and regenerating that sentence.
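As an illustration of what such a black-box technique can look like, here is a hedged, schematic sketch of the resampling idea (loosely in the spirit of the above, not any paper's exact method). The `sample_completion` callable is a placeholder you would wire up to your own LLM API; it takes a prompt prefix and returns a final answer string:

```python
from collections import Counter
from typing import Callable, List

def answer_distribution(sample_completion: Callable[[str], str],
                        prefix: str, n_samples: int = 20) -> Counter:
    """Resample n completions from a prefix and count the final answers."""
    return Counter(sample_completion(prefix) for _ in range(n_samples))

def sentence_importance(sample_completion: Callable[[str], str],
                        question: str, cot_sentences: List[str],
                        n_samples: int = 20) -> List[float]:
    """For each chain-of-thought sentence: how much does keeping it (vs resampling
    from just before it) shift the distribution over final answers?"""
    scores = []
    for i, sentence in enumerate(cot_sentences):
        prefix_without = question + "\n" + " ".join(cot_sentences[:i])
        prefix_with = prefix_without + " " + sentence
        d_without = answer_distribution(sample_completion, prefix_without, n_samples)
        d_with = answer_distribution(sample_completion, prefix_with, n_samples)
        # Total variation distance between the two empirical answer distributions
        answers = set(d_without) | set(d_with)
        scores.append(0.5 * sum(abs(d_without[a] - d_with[a]) / n_samples for a in answers))
    return scores
```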
Attentive readers may notice that the list above focuses on work to do with understanding the more qualitative high-level properties of models, and not ambitious reverse engineering. This is largely because, in my opinion, the former has gone great, while we have not seen much progress towards the fundamental blockers on the latter.
I used to be very excited about ambitious reverse engineering, but I currently think that the dream of completely reverse engineering a model down to something human understandable seems basically doomed. My interpretation of the research so far is that models have some human understandable high-level structure that drives important actions, and a very long tail of increasingly niche and irrelevant heuristics and biases. For pragmatic purposes, these can be largely ignored, but not if we want things like guarantees, or to claim that we have understood most of a model. I think that trying to understand as much as we can is still a reasonable proxy for getting to the point of being pragmatically useful, but think it’s historically been too great a focus of the field, and many other approaches seem more promising if our ultimate goals are pragmatic.
In some ways, this has actually made me more optimistic about interpretability ultimately being useful for AGI safety! Ambitious reverse engineering would be awesome but was always a long shot. But I think we've seen some real results for pragmatic approaches to mechanistic interpretability, and feel fairly confident we are going to be able to do genuinely useful things that are hard to achieve with other methods.
Once you have a few mini-projects done, you should start being more ambitious. You want to think about gaining the deeper (medium/slow) skills, and exploring ideation and distillation.
However, you should still expect projects to often fail, and want to lean into breadth over depth and avoid getting bogged down in an unsuccessful project you can’t bear to give up on. To resolve this tension, I recommend working in 1-2 week sprints. At the end of each sprint, reflect and make a deliberate decision: continue, or pivot? The default should be to pivot unless the project feels truly promising. It’s great to give up on things, if it means you spend your time even better! But if it’s going great, by all means continue.
This strategy should mean that you eventually end up working on something longer-term when you find something good, but don't just get bogged down in the first ambitious idea you tried.
I recommend reviewing the list of skills earlier and just for each one, reflecting for a bit on how on top of it you think you feel and how you could intentionally practice it in your next project. Then after each sprint, before deciding whether to pivot, take an hour or two to do a post-mortem: what did you learn, what progress did you make on different skills, and what would you do differently next time? Your goal is to learn, and you learn much better if you make time to actually process your accumulated data!
One way to decompose your learning is to think about research mindsets: the traits and mindsets a good researcher needs to have, that cut across many of these stages. See my blog post on the topic for more, but here's a brief view of how I'm currently thinking about it.
Skepticism/Truth-seeking: The default state of the world is that your research is false, because doing research is hard. Your north star should always be to find true insights[23]
It generally doesn't come naturally to people to constantly aggressively think about all the ways their work could be false and make a good faith effort to test it. You can learn to do better than this, but it often takes practice.
This is crucial in understanding, somewhat important in exploration, and crucial in distillation.
A common mistake is to grasp at straws to find a “positive” result, thinking that nothing else is worth sharing.
In my opinion, negative or inconclusive results that are well-analyzed are much better than a poorly supported positive result. I’ll often think well of someone willing to release nuanced negative results, and poorly of someone who pretends their results are better than they are.
Productivity[24]: The best researchers I've worked with get more than twice as much done as the merely good ones. Part of this is good research taste and making good prioritization decisions, but part of this is just being good at getting shit done.
Now, this doesn't necessarily mean pushing yourself to the point of burnout by working really long hours, or cutting corners and being sloppy. This is about productivity integrated over the long term.
For example, sometimes the most productive thing to do is to hold off on starting work, set a 5 minute timer, brainstorm possible things to do next, and then pick the best idea
This takes many forms; some of the highest priorities for you:
Know when to write good code without bugs, to avoid wasting time debugging later, and when to write a hacky thing that just works.
Know the right keyboard shortcuts to move fast when coding.
Know when to ask for help and have people who can help you get unblocked where appropriate.
Be good at managing your time and tasks so that once you've decided what the highest priority thing to work on is, you in fact go and work on it.
Be able to make time to achieve deep focus on the key problems.
Exercise: Occasionally audit your time. Use a tool like Toggl for a day or two to log what you're doing, then reflect: where did time go? What was inefficient? How could I do this 10% faster next time?
The goal isn't to feel guilty, but to spot opportunities for improvement, like making a utility function for a tedious task.
In distillation, when writing a paper you’re expected to be able to contextualise it relative to existing work (i.e. write a related work section[25]) which is important for other researchers knowing whether to care. And if you don’t know the standard methods of proof, key baselines everyone will ask about, key gotchas to check for etc, no one will believe your work.
On the flip side, many papers are highly misleading or outright false, so please don’t just uncritically believe them![26]
Okay, so how does this all tie back to the stages of research? Now you're going to be thinking about all four. We'll start by talking about how to deepen your existing skills with exploration and understanding, and then we'll talk about what practicing ideation and actually writing up your work should look like.
You’ll still be exploring and understanding, but with a greater focus on rigor and the slower skills. In addition to the thoughts on mindsets above, here is some more specific advice:
Doing Good Science
This both means others can check whether your work is true, and makes it more likely that people will believe and build on your work[27], because replications are more likely to exist and because building on it is now low-friction.
Don’t reinvent the wheel: A common mistake in mech interp is doing something that's already been done[28]. We have LLM-powered literature reviews now. You have way less of an excuse. Check first!
Okay, so you want to actually come up with good research ideas to work on. What does this look like? I recommend breaking this down into generating ideas and then evaluating them to find the best ones.
To generate ideas, I'd often start with just taking a blank doc, blocking out at least an hour, and then just writing down as many ideas as you can come up with. Aim for quantity over quality. Go for at least 20.
There are other things you can do to help with generation:
Okay, so now you have a big list. What does finding the best ones look like?
Research Taste Exercises
Gaining research taste is slow because the feedback loops are long. You can accelerate it with exercises that give you faster, proxy feedback. (Credit to Chris Olah for inspiration here)
Regularly paraphrase back to the mentor in your own words what you think they're saying, and then ask them to correct anything you're wrong about[29]
It’s also worth dwelling on what research taste actually is. See my post for more, but I break it down as follows:
At this stage, you should be thinking seriously about how to write up your work. Often, writing up work is the first time you really understand what a project has been about, or you identify key limitations, or experiments you forgot to do. You should check out my blog post on writing ML papers for much more detailed thoughts (which also apply to high-effort blog posts!) but I'll try to summarize them below.
Why aim for public output?
If producing something public is intimidating, for now, you can start by just writing up a private Google Doc and maybe share it with some friends or collaborators. But I heavily encourage people to aim for public output where they can. Generally, your research will not matter if no one reads it. The goal of research is to contribute to the sum of human[30] knowledge. And if no one understands what you did, then it doesn't matter.
Further, if you want to pursue a career in the space, whether a job, a PhD, or just informally working with mentors, public research output is your best credential. It's very clear and concrete proof that you are competent, can execute on research, and can do interesting things; this is exactly the kind of evidence people care about when trying to figure out whether they should work with you, pay attention to what you're saying, etc. It doesn’t matter whether you wrote it in a prestigious PhD program or as a random independent researcher; if it’s good enough, people will care.
There are a few options for what this can look like:
An Arxiv paper - much more legible than a blog post, and honestly not much extra effort if you have a high-quality blog post[31]
A workshop paper[32] (i.e. something you submit for peer review to a workshop, typically part of a major ML conference, the bar is much lower than for a conference paper)
A conference paper (the equivalent of top journals in ML, there’s a reasonably high quality bar[33], but also a lot of noise[34])
If this all seems overwhelming, starting out with blog posts is fine, but I think people generally overestimate the bar for arxiv or workshop papers - if you think you learned something cool in a project, this is totally worth turning into a paper!
How to write stuff up?
The core of a paper is the narrative. Readers will not take away more than a few sentences worth of content. Your job is to make sure these are the right handful of sentences and make sure the reader is convinced of them.
You want to distill your paper down into one to three key claims (your contribution), the evidence you provide that the contribution is true, the motivation for why a reader should care about them, and work all of this into a coherent narrative.
Iterate: I'm a big fan of writing things iteratively. You first figure out the contribution and narrative. You then write a condensed summary, the abstract (in a blog post, this should be a TL;DR/executive summary - also very important!). You then write a bullet point outline of the paper: what points you want to cover, what evidence you want to provide, how you intend to build up to that evidence, how you want to structure and order things, etc. If you have mentors or collaborators, the bullet point outline is often the best time to get feedback. Or the narrative formation stage, if you have an engaged mentor. Then write the introduction, and make sure you’re happy with that. Then (or even before the intro) make the figures - figures are incredibly important! Then flesh it out into prose. People spend a lot more time reading the abstract and the intro than the main body, especially when you account for all the people who read the abstract and then stop. So you should spend a lot more time per unit word on those.
LLMs: I think LLMs are a really helpful writing tool. They're super useful for getting feedback, especially if you're writing in an unfamiliar style, like an academic ML paper may be for you. Remember to use anti-sycophancy prompts so you get real feedback. However, it's often quite easy to tell when you're reading LLM-written slop. So use them as a tool, but don't just have them write the damn thing for you. And if you e.g. have writer’s block, having an LLM help you brainstorm or produce a first draft for inspiration can be very helpful.
Common mistakes
A common theme in the above is that it's incredibly useful to have a mentor, or at least collaborators. Here I'll try to unpack that and give advice about how to go about finding one.
Though it's also worth saying that many mentors are not actually great researchers and may have bad research taste or research taste that's not very well suited to mech interp. What you do about this is kind of up to you.
A good mentor is an incredible accelerator. Dysfunctional as academia is, there is a reason it works under the apprenticeship-like system of PhD students and supervisors. When I started supervising, I was very surprised at how much of a difference a weekly check-in could make! Here’s my best attempt to break down how a good mentor can add value:
When to pivot: if your research direction isn’t working out, having a mentor to pressure you to pivot can be extremely valuable[35]
Here are some suggested ways to get some mentorship while transitioning into the field. I discuss higher commitment ways, like doing a PhD or getting a research job, below.
Note: whatever you do to find a mentor, having evidence that you can do research yourself, that is, public output that demonstrates ability to self-motivate and put in effort, and ideally demonstrates actually interesting research findings, is incredibly helpful and should be a priority.
Mentoring programs
I think mentoring programs like MATS are an incredibly useful way into the field: you typically do a full-time, several-month program where you write a paper, with weekly check-ins from a more experienced researcher. Your experience will vary wildly depending on mentor quality, but at least for my MATS scholars, often people totally new to mech interp can publish a top conference paper in a few months. See my MATS application doc for a bunch more details.
There’s a wide range of backgrounds among people who do them and get value - people totally new to the field, people with 1+ years of interpretability research experience who want to work with a more experienced mentor, young undergrads, mid-career professionals (including a handful of professors), and more.
MATS 9.0 applications are open, due Oct 2 2025, and mine close on Sept 12.
Other programs (which I think are generally lower quality than MATS, but often still worth applying to depending on the mentor)
Cold emails
You can also take matters into your own hands and try to convince someone to be your mentor. Reaching out to people, ideally via a warm introduction, but even just via a cold email, can be highly effective. However, I get lots of cold emails and I think many are not very effective, so here's some advice:
Much easier than finding a mentor is finding collaborators, other people to work on the same project with, or just other people also trying to learn more about mech interp, who you can chat with and give each other feedback:
Staying up to date: Another common question is how to stay up to date with the field. Honestly, I think that people new to the field should not worry that much about this. Most new papers are irrelevant, including the ones that there is hype around. But it's good to stay a little bit in the loop. Note that the community has substantial parts both in academia and outside, which are often best kept up with in different ways.
Applying for grants
For people trying to get into mech interp via the safety community, there are some funders around open to giving career transition grants to people trying to upskill in a new field like mech interp. Probably the best place I know of is Open Philanthropy's Early Career Funding.
Explore Other AI Safety Areas
Mech interp isn't the only game in town! There are other important areas of safety like Evals, AI Control, and Scalable Oversight; the latter two in particular seem neglected compared to mech interp. The GDM AGI Safety Approach gives an overview of different parts of the field. If you’re doing this for safety reasons, I’d check whether there are other, more neglected subfields that also appeal to you!
Leaving aside things that apply to basically all roles, like whether this person has a good personality fit (which often just means looking out for red flags), here’s my sense of what hiring managers in interpretability are often looking for.
A useful mental model is that from a hiring manager's perspective, they're making an uncertain bet with little information in a somewhat adversarial environment. Each applicant wants to present themselves as the perfect fit. This means managers need to rely on signals that are hard to fake. But it’s quite difficult to get that much info on a person before you actually go and work with them a bunch.
Your goal as a candidate is to provide compelling, hard-to-fake evidence of your skills. The best way to do that is to simply do good research and share it publicly. If your research track record is good enough, interviews may just act as a check for red flags and to verify that you can actually write code and run experiments well.
Key skills:
Productivity and Conscientiousness: This is a very hard one to interview for, but incredibly important. A public track record of doing interesting things is a good signal, as are strong references from trusted sources[36].
I don't have a PhD (and think I would have had a far less successful career if I had tried to get one) so I'm somewhat biased. But it's a common question. Here are the strongest arguments I’ve heard in favour:
And here are the reasons I think it's often a bad idea:
But with all those caveats in mind, it’s definitely the right option for some! My overall take:
Relevant Academic Labs
I’m a big fan of the work coming out of these two, they seem like great places to work:
Other labs that seem like good places to do interpretability research (note that this is not trying to be a comprehensive list!):
Thanks a lot to Arthur Conmy, Paul Bogdan, Bilal Chughtai, Julian Minder, Callum McDougall, Josh Engels, Clement Dumas, Bart Bussmann for valuable feedback
Note that I mean a full working month here. So something like 200 working hours. If you're only able to do this part-time, it's fine to take longer. If you're really focused on it, or have a head-start, then move on faster.
If you want something even more approachable, one of my past MATS scholars recommends getting GPT-5 thinking to produce coding exercises (eg a Python script with empty functions, and good tests), for an easier way in.
It’s fine for this coding to need a bunch of LLM help and documentation/tutorial looking up, this isn’t a memory test. The key thing is being able to correctly explain the core of each technique to a friend/LLM.
Note: This curriculum aims to get you started on independent research. This is often good enough for academic labs, but the engineering bar for most industry labs is significantly higher, as you’ll need to work in a large complex codebase with hundreds of other researchers. But those skills take much longer to gain.
You want to exclude the first token of the prompt when collecting activations, it’s a weird attention sink and often has high norm/is anomalous in many ways
Gotcha: Remember to try a bunch of coefficients for the vector when adding it. This is a crucial hyper-parameter and steered model behaviour varies a lot depending on its value
Mixture-of-experts models, where there are many parameters but only a fraction activate for each token, are a pain for interpretability research. Larger models mean you'll need to get more/larger GPUs, which is expensive and unwieldy. Favor working with dense models where possible.
You can download then upload the PDF to the model, or just select all and copy and paste from the PDF to the chat window. No need to correct the formatting issues, LLMs are great at ignoring weird formatting artifacts
repo2txt.com is a useful tool for concatenating a Github repo into a single txt file
If you would like other perspectives, check out Open Problems in Mechanistic Interpretability (broad lit review from many leading researchers, recent), or Interpretability Dreams (from Anthropic, 2 years old)
And for reasons we’ll discuss later, now feel much more pessimistic about the ambitious reverse engineering direction
Even if you already have a research background in another field, mechanistic interpretability is sufficiently different that you should expect to need to relearn at least some of your instincts. This stage remains very relevant to you, though you can hopefully learn faster.
The rest of this piece will be framed around approaching learning research like this and why I think it is a reasonable process. Obviously, there is not one true correct way to learn research! When I e.g. critique something as a “mistake”, interpret this as “I often see people do this and think it’s suboptimal for them”, not “there does not exist a way of learning research where this is a good idea”.
My term for associated knowledge, understanding, intuition, etc.
Read my thoughts on SAEs here. There’s still useful work to be done, but it’s an oversubscribed area, and our bar should be higher. They are a useful tool, but not as promising as I once hoped.
This was using a technique called synthetic document fine-tuning (and some other creativity on top), which basically lets you insert false beliefs into a model by generating a bunch of fictional documents where those beliefs are true and fine-tuning the model on them.
We chose problems we’re excited to see worked on, while trying to avoid fad-like dynamics
Latents refer to the hidden units of the SAE. These were originally termed “features”, but that term is also used to mean “the interpretable concept the latent refers to”, so I use a different term to minimise confusion.
One of my MATS scholars made a working GPT-5 model diffing agent in a day
This is the one line in the post without an “as of early Sept 2025” disclaimer; it feels pretty evergreen
Note: "think" or "chain of thought" are terrible terms. It's far more useful to think of the chain of thought as a scratchpad that a model with very limited short-term memory can choose to use or ignore.
Reasoning models break a lot of standard interpretability techniques because now the computational graph goes through the discrete, non-differentiable, and random operation of sampling thousands of times. Most interpretability techniques focus on studying a single forward pass.
Not just, e.g., ones you can publish on.
I called this moving fast in the blog post, but I think that may have confused some people.
Though often this is done well with just a good introduction
And having a well-known researcher as co-author is not sufficient evidence to avoid this, alas. I’m sure at least one paper I’ve co-authored in the past year or two is substantially false
It's strongly in your interests for people to build on your work because that makes your original work look better, in addition to being just pretty cool to see people engage deeply with your stuff.
Note that deliberately reproducing work, or trying to demonstrate that past work is shoddy, is completely reasonable. You just need to not accidentally reinvent the wheel.
This is generally a good thing to do regardless of whether you’re focused on research taste or not!
And, nowadays, LLM knowledge too I guess?
Note that you’ll need someone who’s written several Arxiv papers to endorse you. cs.LG is the typical category for ML papers.
Note that you can submit something to a workshop and to a conference, so long as the workshop is “non-archival”
A conference paper is a fair bit more effort, and you generally want to be working with someone who understands the academic conventions and shibboleths and the various hoops you should be jumping through. But I think this can be a nice thing to aim for, especially if you're starting out and need credentials, though mech interp cares less about peer review than most academic subfields.
See this NeurIPS experiment showing that half the spotlight papers would be rejected by an independent reviewing council
This is one of the most valuable things I do for my MATS scholars, IMO.
Unfortunately, standard reference culture, especially in the US, is to basically lie, and the amount of lying varies between contexts, rendering references mostly useless unless from a cultural context the hiring manager understands or ideally from people they know and trust. This is one of the reasons that doing AI safety mentoring programs like MATS can be extremely valuable, because often your mentor will know people who might then go on to hire you, which makes you a lower risk hire from their perspective.