Interpretability with Sparse Autoencoders (Colab exercises)

CallumMcDougall

29th Nov 2023

Update (13th October 2024) - these exercises have been significantly expanded on. Now there are 2 exercise sets: the first one dives deeply into theoretical topics related to superposition, while the second one (much larger) includes a streamlined version of the first one, as well as most of the actual SAE material. This post mostly focuses on the second one (although we do give an overview of both).

This is a linkpost for some exercises on superposition & sparse autoencoders, which were created for the 3rd iteration of the ARENA program (and greatly expanded on during the 4th iteration). Having spoken to Neel Nanda and others in interpretability-related MATS streams, it seemed useful to make these exercises accessible outside the context of the rest of the ARENA curriculum.

In the ARENA material, these exercises are 1.3.1 and 1.3.2 respectively. The "1" is the transformer interpretability chapter; the "1.3" is the SAEs & Superposition subsection. Although 1.3.1 covers a lot of interesting theoretical topics related to superposition, for most people we recommend 1.3.2 as a fully self-contained introduction to superposition and SAEs.

Links to Colabs for 1.3.1: Exercises, Solutions.

Links to Colabs for 1.3.2: Exercises, Solutions.

Summary of material (1.3.2)

Abbreviations: TMS = "Toy Models of Superposition", SAE = "Sparse Autoencoder".

The diagram below shows an overview of section 1.3.2. It's split into 5 parts, each of which covers a different group of topics related to SAEs. You can also see a map of the material in much more detail here.

0️⃣ Toy Models of Superposition is a streamlined version of exercises 1.3.1, with most of the non-crucial stuff cut out (e.g. feature geometry and deep double descent), although you can still probably skip it if you want to get straight to working with SAEs on real language models.

1️⃣ Intro to SAE interpretability is by far the longest section, and covers most of the core material you'll need if you want to work with SAEs. It starts by introducing the SAELens library as well as neuronpedia, and shows you how to load different SAE releases and run them alongside their associated TransformerLens models. There are 2 major chunks of exercises in this section: in the first one we replicate the individual components that go into SAE dashboards, and in the second one we learn techniques for feature-finding, applied to attention SAEs & the indirect object identification circuit.

2️⃣ SAE circuits contains material on finding and interpreting circuits in our SAEs. We cover how to calculate gradients between SAE latents, as well as how to do interpretability on transcoders (which can make circuit analysis a lot easier).

3️⃣ Training & evaluating SAEs shows you how to use SAELens for training, and how to interpret wandb-logged evaluation metrics during training. We also look at several case studies of training SAEs, including training on the MLP output of TinyStories-1L, the attention output of attn-only 2L models, the residual stream of Gemma-2B and the MLP layer of OthelloGPT.

Summary of material (1.3.1)

We include a summary of 1.3.1 here too, if people are interested (although as mentioned, we expect that most people would get more benefit from 1.3.2). We constructed 1.3.2 by taking only sections 1️⃣ and 5️⃣ from the material listed below (and cutting out a few other unnecessary bits).

1️⃣ TMS: Superposition in a Nonprivileged Basis: This section introduces Anthropic's toy model for superposition, where a simple neural network is trained to map a set of features into a lower-dimensional space then reconstruct it. You'll learn about how superposition works & see how it can be visualised, as well as how properties like feature sparsity affect the learned solutions.
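As a rough illustration of the setup described above, here is a minimal sketch of the toy model's forward pass, x' = ReLU(W^T W x + b), with 5 features squeezed into 2 hidden dimensions. The weights are hand-built for illustration (feature directions evenly spaced around a circle, plus a single shared negative bias) rather than trained, and every name in the block (`toy_model_forward`, `one_hot`, `SQ_NORM`, and so on) is hypothetical and not taken from the exercises.

```python
import math

N_FEATURES = 5   # number of input features
D_HIDDEN = 2     # bottleneck dimension (smaller than N_FEATURES)

# Hand-built weights (an assumption for illustration, not a trained model):
# the 5 feature directions are evenly spaced around a circle.
COS_72 = math.cos(2 * math.pi / N_FEATURES)
SQ_NORM = 1.0 / (1.0 - COS_72)   # squared length of each feature direction
BIAS = -SQ_NORM * COS_72         # negative bias that cancels neighbour interference

# W[d][i]: component d of the hidden direction assigned to feature i.
W = [
    [math.sqrt(SQ_NORM) * math.cos(2 * math.pi * i / N_FEATURES) for i in range(N_FEATURES)],
    [math.sqrt(SQ_NORM) * math.sin(2 * math.pi * i / N_FEATURES) for i in range(N_FEATURES)],
]


def toy_model_forward(x):
    """Compute x' = ReLU(W^T W x + b) for a list x of N_FEATURES values."""
    # Project the features down into the low-dimensional hidden space.
    hidden = [sum(W[d][i] * x[i] for i in range(N_FEATURES)) for d in range(D_HIDDEN)]
    # Map back up with the transposed weights, add the bias, apply ReLU.
    pre_act = [
        sum(W[d][j] * hidden[d] for d in range(D_HIDDEN)) + BIAS
        for j in range(N_FEATURES)
    ]
    return [max(0.0, v) for v in pre_act]


def one_hot(i):
    """Input with only feature i active."""
    return [1.0 if j == i else 0.0 for j in range(N_FEATURES)]


if __name__ == "__main__":
    for i in range(N_FEATURES):
        print([round(v, 3) for v in toy_model_forward(one_hot(i))])
```

With these weights, any input with a single active feature is reconstructed almost exactly, even though there are more features than hidden dimensions. Once two features are active at the same time the interference terms no longer cancel, which is one way to see why feature sparsity matters for superposition.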

2️⃣ TMS: Correlated / Anticorrelated Features: In this section, you'll keep exploring the idea of superposition by seeing how the model's learned solutions change when features are correlated or anticorrelated. Most features learned by real models are anticorrelated simply as a consequence of the fact that any given model input (e.g. images or passages of text) will only contain a limited number of features.
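To make the distinction concrete, here is a small sketch of how correlated and anticorrelated feature pairs could be sampled. It assumes one particular convention: a correlated pair is always active or inactive together, and an anticorrelated pair never has both features active at once. The exercises may define these differently, and the function names and the `feature_prob` parameter are hypothetical.

```python
import random


def sample_correlated_pair(feature_prob, rng):
    """Both features are active together, or both are zero."""
    if rng.random() < feature_prob:
        # 1.0 - rng.random() lies in (0, 1], so active values are never zero.
        return 1.0 - rng.random(), 1.0 - rng.random()
    return 0.0, 0.0


def sample_anticorrelated_pair(feature_prob, rng):
    """At most one feature is active; each is active with probability feature_prob.

    Requires feature_prob <= 0.5 so that 2 * feature_prob is a valid probability.
    """
    if rng.random() < 2 * feature_prob:
        value = 1.0 - rng.random()
        # Choose which of the two features carries the value.
        return (value, 0.0) if rng.random() < 0.5 else (0.0, value)
    return 0.0, 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    print([sample_correlated_pair(0.3, rng) for _ in range(3)])
    print([sample_anticorrelated_pair(0.3, rng) for _ in range(3)])
```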

3️⃣ TMS: Superposition in a Privileged Basis: In this section, the toy model setup is changed so that it has a privileged basis. If the previous sections were analogues for superposition in the residual stream, this section is an analogue for superposition in the MLP layer. We'll also explore how computation can be performed in superposition.

4️⃣ Feature Geometry: Here, we take a deeper dive into the ways features can organize into different geometric structures when we increase the hidden dimension past the point where we can easily visualise it.

5️⃣ SAEs in Toy Models: We take the toy models from Anthropic's Toy Models of Superposition paper (for which there are also exercises), and train sparse autoencoders on the representations learned by these toy models. These exercises culminate in using neuron resampling to successfully recover all the learned features from the toy model of bottleneck superposition:

Animation of the training process for SAEs in Anthropic's toy model of superposition. Red = resampled latents. All instances eventually converge to accurately representing all 5 features learned by the original model.
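For readers who have not seen one written out, here is a minimal sketch of a sparse autoencoder's forward pass and loss in plain Python. It assumes a common formulation (ReLU encoder applied to the input minus the decoder bias, linear decoder, mean squared reconstruction error plus an L1 penalty on the latents); the exercises may use a different variant, and the function names and the `l1_coeff` parameter are hypothetical. Training and neuron resampling are not shown.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Return (latents, reconstruction) for one input vector x.

    W_enc has shape [n_latents][d_in]; W_dec has shape [d_in][n_latents].
    """
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre_act = [p + b for p, b in zip(matvec(W_enc, centered), b_enc)]
    latents = [max(0.0, p) for p in pre_act]           # ReLU keeps latents non-negative
    recon = [r + b for r, b in zip(matvec(W_dec, latents), b_dec)]
    return latents, recon


def sae_loss(x, latents, recon, l1_coeff):
    """Mean squared reconstruction error plus an L1 sparsity penalty."""
    mse = sum((xi - ri) ** 2 for xi, ri in zip(x, recon)) / len(x)
    l1 = sum(abs(a) for a in latents)
    return mse + l1_coeff * l1


if __name__ == "__main__":
    # Tiny hand-built SAE (illustrative values only): 2 inputs, 3 latents.
    W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    W_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
    b_enc = [0.0, 0.0, 0.0]
    b_dec = [0.0, 0.0]
    x = [-2.0, 3.0]
    latents, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
    print(latents, recon, sae_loss(x, latents, recon, l1_coeff=0.1))
```

The L1 term is what pushes most latents to zero on any given input, so raising `l1_coeff` trades reconstruction accuracy for sparsity.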

6️⃣ Bonus: We cover some extension material here, including a replication of Deep Double Descent & Superposition, a paper which explores the idea that double descent happens when models transition from a memorizing solution (representing datapoints in superposition) to a generalizing solution (representing features in superposition).

How to use this material

The Colab notebooks are fully self-contained: you can work through the exercises Colab and check your answers by comparing them to the solutions Colab (which should also have all expected output displayed inline).

If you don't like working in Colabs, then you can clone the repo and work through them in VSCode. You have 2 options here: either go through the notebooks like normal (you can find Jupyter notebooks mirroring the structure of the Colabs at chapter1_transformer_interp/exercises/part32_interp_with_saes), or you can use a blank notebook / Python file and work through the exercises as shown on the Streamlit page.

Note that if you don't want to work through the material as exercises, then you can just use the solutions Colab / notebook as a source of reference code!

Please reach out to me if you have any questions or suggestions about these exercises (either by email at cal.s.mcdougall@gmail.com, or a LessWrong private message / comment on this post). Happy coding!
