This winter, Redwood Research is running a coordinated research effort on mechanistic interpretability of transformer models. We’re excited about recent advances in mechanistic interpretability and now want to try to scale our interpretability methodology to a larger group doing research in parallel.
REMIX participants will work to provide mechanistic explanations of model behaviors, using our causal scrubbing methodology to formalize and evaluate interpretability hypotheses. We hope to produce many more explanations of model behaviors akin to our recent work investigating behaviors of GPT-2-small, toy language models, and models trained on algorithmic tasks. We think this work is a particularly promising research direction for mitigating existential risks from advanced AI systems (more in Goals and FAQ).
Since mechanistic interpretability is currently a small sub-field of machine learning, we think it’s plausible that REMIX participants could make important discoveries that significantly further the field. We also think participants will learn skills valuable for many styles of interpretability research, and also for ML research more broadly.
Apply here by Sunday, November 13th [DEADLINE EXTENDED] to be a researcher in the program. Apply sooner if you’d like to start early (details below) or receive an earlier response.
Some key details:
Feel free to email firstname.lastname@example.org with questions.
Research results. We are optimistic about the research progress you could make during this program (more below). Since mechanistic interpretability is currently a small sub-field of machine learning, we think it’s plausible that REMIX participants could make important discoveries that significantly further the field.
Skill-building. We think this is a great way to gain experience working with language models and interpreting/analyzing their behaviors. The skills you’ll learn in this program will be valuable for many styles of interpretability research, and also for ML research more broadly.
Financial support & community. This is a paid opportunity, and a chance to meet and connect with other researchers interested in interpretability.
Research output. We hope this program will produce research that is useful in multiple ways:
Training and hiring. We might want to hire people who produce valuable research during this program.
Experience running large collaborative research projects. It seems plausible that at some point it will be useful to run a huge collaborative alignment project. We’d like to practice this kind of thing, in the hope that the lessons learned are useful to us or others.
See “Is this research promising enough to justify running this program?” and “How useful is this kind of interpretability research for understanding models that might pose an existential risk?”
We think our recent progress in interpretability makes it a lot more plausible for us to reliably establish mechanistic explanations of model behaviors, and therefore get value from a large, parallelized research effort.
A unified framework for specifying and validating explanations. Previously, a big bottleneck on parallelizing interpretability research across many people was the lack of a clear standard of evidence for proposed explanations of model behaviors (which made us expect the research produced to be pretty unreliable). We believe we’ve recently made some progress on this front, developing an algorithm called “causal scrubbing” which allows us to automatically derive an extensive set of tests for a wide class of mechanistic explanations. This algorithm is only able to reject hypotheses rather than confirming them, but we think that this still makes it way more efficient to review the research produced by all the participants.
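To make the idea of testing an explanation by resampling more concrete, here is a minimal Python sketch on a toy model. It is not Redwood's causal scrubbing implementation. The two-component model, the dataset, and the names `head_a`, `head_b`, `toy_model`, and `scrubbed_accuracy` are all hypothetical. The sketch assumes that a hypothesis states which inputs each component treats as equivalent; we then feed each component a randomly chosen "equivalent" input and check whether the model's behavior is preserved.

```python
import random

def head_a(x):
    """Toy component: parity of the first input element."""
    return x[0] % 2

def head_b(x):
    """Toy component: whether the second input element exceeds 5."""
    return 1 if x[1] > 5 else 0

def toy_model(x_for_a, x_for_b):
    """Toy model output; each component can be fed a different input."""
    return head_a(x_for_a) ^ head_b(x_for_b)

def scrubbed_accuracy(dataset, equiv_a, equiv_b, rng, trials=20):
    """Fraction of runs where the resampled model matches the clean model.

    equiv_a / equiv_b map an input to the feature that the hypothesis
    claims is all the corresponding component cares about.
    """
    agree, total = 0, 0
    for _ in range(trials):
        for x in dataset:
            clean = toy_model(x, x)
            # Swap in inputs that the hypothesis claims are interchangeable.
            x_a = rng.choice([z for z in dataset if equiv_a(z) == equiv_a(x)])
            x_b = rng.choice([z for z in dataset if equiv_b(z) == equiv_b(x)])
            agree += int(toy_model(x_a, x_b) == clean)
            total += 1
    return agree / total

dataset = [(x0, x1) for x0 in range(4) for x1 in range(10)]
rng = random.Random(0)

# Hypothesis 1: head_a uses parity of x[0]; head_b uses whether x[1] > 5.
good = scrubbed_accuracy(dataset, lambda z: z[0] % 2, lambda z: z[1] > 5, rng)
# Hypothesis 2: head_b is irrelevant (every input is "equivalent" for it).
bad = scrubbed_accuracy(dataset, lambda z: z[0] % 2, lambda z: 0, rng)
print(f"hypothesis 1: {good:.2f}, hypothesis 2: {bad:.2f}")
```

In this toy setting, the second hypothesis is rejected because behavior degrades sharply once `head_b` is resampled freely. The first hypothesis preserves behavior, which only means it survived this test, not that it has been confirmed.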
Improved proofs of concept. We now have several examples where we followed our methodology and were able to learn a fair bit about how a transformer was performing some behavior.
Tools that allow complicated experiments to be specified quickly. We’ve built a powerful library for manipulating neural nets (and computational graphs more generally) for doing intervention experiments and getting activations out of models. This library allows us to do experiments that would be quite error-prone and painful with other tools.
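As a rough illustration of what "getting activations out of models" and "intervention experiments" can look like, the sketch below uses standard PyTorch forward hooks rather than Redwood's library, whose API is not shown here. The tiny `nn.Sequential` model, the choice to zero out hidden unit 0, and the hook names are assumptions made purely for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A tiny stand-in model; a real experiment would target a transformer.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 2),
)
model.eval()
x = torch.randn(3, 4)

# 1) Record an intermediate activation with a forward hook.
cache = {}

def save_hook(module, inputs, output):
    cache["relu_out"] = output.detach().clone()

handle = model[1].register_forward_hook(save_hook)
with torch.no_grad():
    clean_out = model(x)
handle.remove()

# 2) Intervene: zero-ablate hidden unit 0 and rerun the model.
def ablate_hook(module, inputs, output):
    patched = output.clone()
    patched[:, 0] = 0.0
    return patched  # a returned value replaces the module's output

handle = model[1].register_forward_hook(ablate_hook)
with torch.no_grad():
    ablated_out = model(x)
handle.remove()

# Size of the change in the output caused by the intervention.
effect = (clean_out - ablated_out).abs().max().item()
print(f"max change in output after ablation: {effect:.4f}")
```

Hand-written hooks like these become hard to manage once an experiment touches many components at once, which is the kind of error-prone bookkeeping a dedicated library is meant to remove.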
We're most excited about applicants comfortable working with (basic) Python, any of PyTorch/TensorFlow/Numpy, and linear algebra. Quickly generating hypotheses about model mechanisms and testing them requires some competence in these domains.
If you don’t understand the transformer architecture, we’ll require that you go through preparatory materials, which explain the architecture and walk you through building one yourself.
We’re excited about applicants with a range of backgrounds; prior experience in interpretability research is not required. The primary activity will be designing, running, and analyzing results from experiments which you hope will shed light on how a model accomplishes some task, so we’re excited about applicants with experience doing empirical science in any field (e.g. economics, biology, physics). The core skill we’re looking for here, among people with the requisite coding/math background, is something like rigorous curiosity: a drive to thoroughly explore all the ways the model might be performing some behavior and narrow them down through careful experiments.
Mechanistic interpretability is an unusual empirical scientific setting in that controlled experimentation is relatively easy, but there’s relatively little knowledge about the kinds of structures found in neural nets.
Regarding the ease of experimentation:
Regarding the openness of the field:
REMIX participants will pursue interpretability research akin to the investigations Redwood has done recently into induction heads, indirect object identification (IOI) in small language models, and balanced parenthesis classification, all of which will be released publicly soon. You can read more about behavior selection criteria here.
The main activities will be:
The mechanisms for behaviors we’ll be studying are often surprisingly complex, so careful experimentation is needed to accurately characterize them. For example, the Redwood researchers investigating the IOI behavior found that removing the influence of the circuit they identified as primarily responsible had surprisingly little effect on the model’s ability to do IOI. Instead, other heads in the model substantially changed their behavior to compensate for the excision. As the researchers write, “Both the reason and the mechanism of this compensation effect are still unclear. We think that this could be an interesting phenomenon to investigate in future work.”
Here’s how a Redwood researcher describes this type of research:
A lot of the time it feels like you're cycling between: "this looks kind of weird and interesting, not sure what's up with this" and then "I have some vague idea about what maybe this part is doing, I should come up with a test to see if I understand it correctly" and then once you have a test "oh cool, I was kind of right but also kind of wrong, why was I wrong" and then the cycle repeats.
Often it's pretty easy to have a hunch about what some part of your model is doing, but finding a way to appropriately test that hunch is hard and often your hunch might be partially correct but incomplete so your test may rule it out prematurely if you're not careful/specific enough.
It feels like you're in a lab, with your model on the dissection table, and you're trying to pick apart what's going on with different pieces and using different tools to do so - this feels really cool to me, kind of like trying to figure out what's going on with this alien species and how it can do the things it does.
It's really fun to try and construct a persuasive argument for your results: "I think this is what's happening because I ran X, Y, Z experiments that show A, B, C, plus I was able to easily generate adversarial examples based on these hypotheses" - I often feel like there's some sort of imaginary adversary (sometimes not imaginary!) that I have to convince of my results and this makes it extremely important that I make claims that I can actually back up and that I appropriately caveat others.
This research also involves a reasonable amount of linear algebra and probability theory. Researchers will be able to choose how deep they want to delve into some of the trickier math we’ve used for our interpretability research (for example, it turns out that one technique we’ve used is closely related to Wick products and Feynman diagrams).
The program will start out with a week of training using our library for computational graph rewrites and investigating model behaviors using our methodology. This week will have a similar structure to MLAB (our machine learning bootcamp), with pair programming and a prepared curriculum. We’re proud to say that past iterations of MLAB have been highly reviewed – the participants in the second iteration gave an average score of 9.2/10 to the question “How likely are you to recommend future MLAB programs to a friend/colleague?”.
An approximate schedule for week one:
In future weeks, you’ll split your time between investigating behaviors in these models, communicating your findings to the other researchers, and reading/learning from/critiquing other researchers’ findings.
We encourage you to submit an application even if you can’t make the dates; we have some flexibility, and might make exceptions for exceptional applicants. We’re planning to have some participants start as soon as possible to test drive our materials, practice in our research methodology, and generally help us structure this research program so it goes well.
You fill out the form, complete some TripleByte tests that assess your programming abilities, then do an interview with us.
Given this program is a research sprint rather than a purely educational program, and given the fact that we plan to offer stipends for participants, we can’t guarantee sponsorship of the right-to-work visas required for international participants to be in person. If you are international but studying at a US university, we are optimistic about getting a CPT for you to be able to participate.
However, we still encourage international candidates to apply. We’ll try to evaluate on a case-by-case basis, and for exceptional candidates, depending on your circumstances, there may be alternatives, like trying to sponsor a visa to have you join later or participating remotely for some period.
I would love to say that this project is paid for by the expected direct value of its research output. My inside view is that the expected direct value does in fact pay for the dollar cost of this project, and probably even the time cost of the organizers. However, there are strong reasons for skepticism–this is a pretty weird thing to do, and it’s sort of weird to be able to make progress on things by having a large group of people work together. So the decision to run this program is to some extent determined by more boring, capacity-building considerations, like training people and getting experience with large projects.
This research might end up not being very useful. Here’s Buck’s description of some reasons why this might be the case:
My main concern is that the language model interpretability research we mentioned above was done on model behaviors which were specifically selected because we thought interpreting these behaviors would be easy. (I’ve recently been calling this kind of research “streetlight interpretability”, as in the classic fallacy where you only look for things in the place that’s easiest to look in.) These model behaviors are chosen to be unrepresentatively easy to interpret.
In particular, it’s not at all obvious how to use any existing interpretability techniques (or even how to articulate the goal of interpretability) in situations where the algorithm the model is using is poorly described by simple, human-understandable heuristics. I suspect that tasks like IOI or acronym generation, where the model implements a simple algorithm, are the exception rather than the rule, and models achieve good performance on their training task mostly by using huge piles of correlations and heuristics. Our preliminary attempts to characterize how the model distinguishes between outputting “is” and “was” indicate that it relies on a huge number of small effects; my guess is that this is more representative of what language models are mostly doing than e.g. the IOI work.
So my guess is that this research direction (where we try to explain model behaviors in terms of human-understandable concepts) is limited, and is more like diving into a part of the problem that I strongly suspect to be solvable, rather than tackling the biggest areas of uncertainty or developing the techniques that might greatly expand the space of model behavior that we can understand. We are also pursuing various research directions that might make a big difference here; I think that research on these improved strategies is quite valuable (plausibly the best research direction), but I think that streetlight interpretability still looks pretty good.
Another concern you might have is that it’s useful to have a few examples of this kind of streetlight interpretability, but there are steeply diminishing marginal returns from doing more work of this type. For what it’s worth, I have so far continued to find it useful/interesting to see more examples of research in this style, but it’s pretty believable that this will slow down after ten more projects of this form or something.
Overall, we think that mechanistic interpretability is one of the most promising research directions for helping prevent AI takeover. Our hope is that mature interpretability techniques will let us distinguish between two ML systems that each behave equally helpfully during training – even having exactly the same input/output behavior on the entire training dataset – but where one does so because it is deceiving us and the other does so “for the right reasons.”
Our experience has been that explaining model behaviors supports both empirical interpretability work – guiding how we engineer interpretability tools and providing practical knowhow – and theoretical interpretability work – for example, leading to the development of the causal scrubbing algorithm. We expect many of the practical lessons that we might learn would generalize to more advanced systems, and we expect that addressing the theoretical questions that we encounter along the way would lead to important conceptual progress.
Currently, almost no interesting behaviors of ML models have been explained – even for models that are tiny compared with frontier systems. We have been working to change this, and we’d like you to help.
For other perspectives on this question, see Chris Olah’s description of the relevance of thorough, small-model interpretability here and Paul Christiano’s similar view here.
Apply here by Sunday, November 13th to be a researcher in the program, and apply sooner if you want to start ASAP. Sooner applications are also more likely to receive sooner responses. Email email@example.com with questions.
The problem of distinguishing between models which behave identically on the training distribution is core to the ELK problem.
I think this is really exciting and I’m very interested to see how it goes. I think the current set of problems and methodologies is solid enough that participants have a reasonable shot at making meaningful progress within a month. I also expect this to be a useful way to learn about language models and to generally be in a better position to think about alignment.
I think we’re still a long way from understanding model behavior well enough that we could e.g. rule out deceptive alignment, but it feels to me like recent work on LM interpretability is making real progress towards that goal, and I can imagine having large teams studying frontier models closely enough to robustly notice deceptive alignment well in advance by the time we have transformative AI.
I'm really excited about this program! Super curious to see what comes out of it - I expect I'll learn a lot whether it goes well, or struggles to get traction. And I want to see more of this kind of ambitious scalable alignment effort!
If you're interested in getting into mechanistic interpretability work, you should definitely apply to it.
I was trying to figure out whether someone who is just here for the month of November should apply. I think the answer is no, but I am broadly a bit confused when this is a commitment for.
Also, are people going through as cohorts or will they start with the training week whenever they show up, not necessarily in-sync with anyone else?
Also, is the idea to be doing self-directed research by default, or research in collaboration with Redwood staff by default? I don't know what my default action is day-to-day during this program. Do I have to come in with a bunch of research plans already?
Thanks for the questions :)
I was trying to figure out whether someone who is just here for the month of November should apply. I think the answer is no,
but I am broadly a bit confused when this is a commitment for.
Yeah we haven't totally settled this yet; the application form asks a lot of questions about availability. I think the simplest more specific answer is "you probably have to be available in January, and it would be cool if you were available earlier and wanted to get here earlier and do this for longer".
Not totally settled. We'll probably have most people at a big final cohort in January, and we'll try to have people who arrive earlier show up at synced times so that they can do the training week with others.
The default is to do research directed by Redwood staff. You do not need to come in with any research plans.
Thanks for the answers! :)