I'm writing this post to advertise the second in the sequence of monthly mechanistic interpretability challenges (the first to be posted on LessWrong). They are designed in the spirit of Stephen Casper's challenges, but with the more specific aim of working well in the context of the rest of the ARENA material, and helping people put into practice all the things they've learned so far.

In this post, I'll describe the algorithmic problem & model trained to solve it, as well as the logistics for this challenge and the motivation for creating it. However, the main place you should access this problem is on the Streamlit page, where you can find setup code & instructions (as well as a link to a Colab notebook with all setup code included, if you'd prefer to use this).

Task

The algorithmic task is as follows: the model is presented with a sequence of characters, and for each character it has to correctly identify the first character in the sequence (up to and including the current character) which is unique up to that point.

Explanation: take a sequence beginning "abacb". "a" is the first unique character after each of the first two characters ("ab"). It's repeated in the 3rd character, so "b" is now the first unique character. Then "b" is repeated in the 5th character, which makes "c" the first unique character.
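To make the labelling rule concrete, here is a minimal sketch of a reference labeller in plain Python. It is not taken from the challenge's setup code: the function name `first_unique_labels` is hypothetical, the example string "abacb" is reconstructed from the explanation above, and returning `None` when no character is unique is my own convention (the real dataset may encode that case, and the start of the sequence, differently).

```python
from collections import Counter


def first_unique_labels(seq):
    """For each position i, return the first character of seq[:i+1]
    that occurs exactly once in that prefix (None if there is none)."""
    counts = Counter()  # keys stay in first-appearance order
    labels = []
    for ch in seq:
        counts[ch] += 1
        # Scan characters in order of first appearance and take the
        # first one seen exactly once so far.
        label = next((c for c in counts if counts[c] == 1), None)
        labels.append(label)
    return labels


print(first_unique_labels("abacb"))  # ['a', 'a', 'b', 'b', 'c']
```

A labeller like this can be handy for checking the model's predictions on hand-written or adversarial sequences.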

Model

Our model was trained by minimising cross-entropy loss between its predictions and the true labels, at every sequence position simultaneously (including the zeroth sequence position, which is trivial because the input and target are both always "?"). You can inspect the notebook training_model.ipynb in the GitHub repo to see how it was trained. I used the version of the model which achieved the highest accuracy over 50 epochs (accuracy ~99%).
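As a rough illustration of what "cross-entropy at every sequence position" means, here is a small sketch in plain Python. It is not the code from training_model.ipynb: the function name `mean_cross_entropy` and the list-of-lists input format are assumptions made for this example, and a real training loop would use a tensor library instead.

```python
import math


def mean_cross_entropy(logits, targets):
    """Average cross-entropy over all sequence positions.

    logits:  one list of floats (scores over the vocabulary) per position
    targets: one integer token index per position
    """
    total = 0.0
    for position_logits, target in zip(logits, targets):
        # Numerically stable log-sum-exp for the softmax normaliser.
        m = max(position_logits)
        log_norm = m + math.log(sum(math.exp(x - m) for x in position_logits))
        # Negative log-probability of the correct token at this position.
        total += log_norm - position_logits[target]
    return total / len(targets)
```

Every position contributes equally to the average, so a trivial position (like the zeroth one described above) simply adds a near-zero term once the model has learned it.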

The model is a 2-layer transformer with 3 attention heads and causal attention. It includes layernorm, but no MLP layers.

Note - although this model was trained for long enough to get loss close to zero (you can test this for yourself), it's not perfect. The model has some weaknesses which might make it vulnerable to adversarial examples, and I've decided to leave these in. The model is still very good at its intended task, and the main focus of this challenge is on figuring out how it solves the task, not dissecting the situations where it fails. However, you might find that the adversarial examples help you understand the model better.

Material equivalent to the following from the ARENA course is highly recommended:

The following material isn't essential, but is also recommended:

If you want some guidance on how to get started, I'd recommend reading the solutions for the July problem - I expect there to be a lot of overlap in the best way to tackle these two problems. You can also reuse some of that code!

Motivation

Neel Nanda's post 200 COP in MI: Interpreting Algorithmic Problems does a good job explaining the motivation behind solving algorithmic problems such as these. I'd strongly recommend reading the whole post, because it also gives some high-level advice for approaching such problems.

The main purpose of these challenges isn't to break new ground in mech interp. Rather, they're designed to help you practice using and develop a better understanding of standard MI tools (e.g. interpreting attention, direct logit attribution), and more generally to get comfortable working with libraries like TransformerLens.

Also, they're hopefully pretty fun, because why shouldn't we have some fun while we're learning? 🙂

Logistics

The solution to this problem will be published on the Streamlit page in the first few days of September, at the same time as the next problem in the sequence. There will also likely be another LessWrong post.

If you attempt this challenge, you can send your attempt in any of the following formats:

  • Colab notebook,
  • GitHub repo (e.g. with .ipynb or markdown file explaining results),
  • Google Doc (with screenshots and explanations),
  • or any other sensible format.

You can send your attempt to me (Callum McDougall) via any of the following methods:

  • The ARENA Slack group (invite link here), via a direct message to me
  • My personal email: cal.s.mcdougall@gmail.com
  • LessWrong message

I'll feature the names of everyone who sends me a solution on this website, and also give a shout out to the best solutions (with the authors' permission). It's possible that future challenges will also feature a monetary prize, but this is not guaranteed.

Please don't discuss specific things you've found about this model until the challenge is over (although you can discuss general strategies and techniques, and you're also welcome to work in a group if you'd like). The deadline for this problem will be the end of this month, i.e. 31st August.

What counts as a solution?

Reading the solutions for the first problem in this sequence (on classifying palindromic strings) should give you a good idea of what I'm looking for. In particular, I'd expect you to:

  • Describe a mechanism for how the model solves the task, in the form of the QK and OV circuits of various attention heads (and possibly any other mechanisms the model uses, e.g. the direct path, or nonlinear effects from layernorm).
  • Provide evidence for your mechanism, e.g. with tools like attention plots, targeted ablation / patching, or direct logit attribution.
  • (Optional) Include additional detail, e.g. identifying the subspaces that the model uses for certain forms of information transmission, or using your understanding of the model's behaviour to construct adversarial examples.
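For anyone new to direct logit attribution, the idea behind it can be sketched with toy numbers in plain Python. Everything here is hypothetical: the unembedding matrix `W_U`, the component names, and their outputs are made up, and the sketch ignores layernorm (which this model does have, so on the real model the decomposition only holds after accounting for the final layernorm scale).

```python
def matvec(vec, matrix):
    """Row vector (length d) times matrix (d x v) -> list of length v."""
    return [
        sum(vec[i] * matrix[i][j] for i in range(len(vec)))
        for j in range(len(matrix[0]))
    ]


# Hypothetical toy unembedding: d_model = 2, vocabulary of 3 tokens.
W_U = [[1.0, 0.0, -1.0],
       [0.0, 2.0, 1.0]]

# Hypothetical outputs written to the residual stream at one position.
components = {
    "embed": [0.5, 0.0],
    "head_0.0": [1.0, -0.5],
    "head_1.2": [-0.5, 1.5],
}

# Each component's direct contribution to every logit.
attribution = {name: matvec(out, W_U) for name, out in components.items()}

# The residual stream is the sum of the components, so the logits are
# the sum of the per-component attributions.
residual = [sum(out[i] for out in components.values()) for i in range(2)]
logits = matvec(residual, W_U)

for name, contribution in attribution.items():
    print(name, contribution)
print("logits", logits)
```

Because the unembedding is linear, the per-component rows add up exactly to the final logits, which is what lets you ask which head is responsible for promoting the correct token.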

 

Best of luck with any attempts, and please feel free to reach out to me if you have any questions! 🎈
