Anthropic recently released Stage-Wise Model Diffing, which presents a novel way of tracking how transformer features change during fine-tuning. We've replicated this work on a TinyStories-33M language model to study feature changes in a more accessible research context. Instead of SAEs, we worked with single-model, all-layer crosscoders, and found that the technique remains effective with cross-layer features.
This post documents our methodology. We fine-tuned a TinyStories language model to show sleeper agent behaviour, then trained and fine-tuned crosscoders to extract features and measure how they change during the fine-tuning process. Running all training and experiments takes under an hour on a single RTX 4090 GPU.
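The diffing step above reduces to comparing each crosscoder feature's decoder direction before and after fine-tuning. As an illustrative sketch (not our exact pipeline; matrix names and sizes here are made up), one can rank features by how far their decoder vectors rotate:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 512, 64

# Hypothetical decoder matrices of a crosscoder before and after fine-tuning.
W_dec_base = rng.normal(size=(n_features, d_model))
W_dec_ft = W_dec_base + 0.01 * rng.normal(size=(n_features, d_model))
W_dec_ft[:8] += rng.normal(size=(8, d_model))  # pretend a few features changed a lot

def cosine(a, b):
    return np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    )

# 1 - cosine similarity per feature: higher means the feature moved more.
change = 1.0 - cosine(W_dec_base, W_dec_ft)
top = np.argsort(-change)[:8]  # candidate features to inspect manually
print(sorted(top.tolist()))
```

Features surfacing at the top of such a ranking are the ones worth reading off against dataset examples.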
We release code for training and analysing sleeper agents and crosscoders, along with a set of trained models, on GitHub here.
Problem: Sparse crosscoders are powerful tools for compressing neural network representations into interpretable features. However, we don’t understand how features interact.
Perspective: We need systematic procedures to measure and rank nonlinear feature interactions. This will help us identify which interactions deserve deeper interpretation. Success can be measured by how useful these metrics are for applications like model diffing and finding adversarial examples.
Starting contribution: We develop a procedure based on compact proofs. Working backwards from the assumption that features are linearly independent, we derive mathematical formulations for measuring feature interactions in ReLU and softmax attention.
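To make the linear-independence starting point concrete, here is a deliberately crude stand-in for the kind of quantity involved (not the paper's actual derivation): if a ReLU layer treated two feature directions independently, its output on their sum would equal the sum of its outputs, so the gap measures their interaction through the nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
f_i, f_j = rng.normal(size=d), rng.normal(size=d)  # two hypothetical feature directions

def relu(x):
    return np.maximum(x, 0.0)

# Deviation from additivity: zero iff the two directions never interact
# through the ReLU on this input; nonzero values indicate interaction.
interaction = np.linalg.norm(relu(f_i + f_j) - relu(f_i) - relu(f_j))
print(float(interaction) > 0.0)
```

A ranking of feature pairs by such a score is one way to decide which interactions merit closer interpretation.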
Training a sparse crosscoder (henceforth simply a crosscoder) can be thought of as stacking the activations of the residual stream across all layers and training an SAE on the stacked vector. This gives us a mapping between model activations and compressed, interpretable features.
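In shapes, the stacking looks like the following minimal sketch (random weights and made-up dimensions, standing in for a trained crosscoder):

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model, n_feats = 4, 64, 256
d_stack = n_layers * d_model

# Residual-stream activations at each layer for one token (illustrative values).
acts = rng.normal(size=(n_layers, d_model))
x = acts.reshape(d_stack)  # stack the per-layer activations into one vector

# SAE-style encoder/decoder over the stacked vector.
W_enc = rng.normal(size=(d_stack, n_feats)) / np.sqrt(d_stack)
b_enc = np.zeros(n_feats)
W_dec = rng.normal(size=(n_feats, d_stack)) / np.sqrt(n_feats)

# Nonnegative feature activations (sparse after training with a sparsity
# penalty; dense here because the weights are random).
f = np.maximum(x @ W_enc + b_enc, 0.0)
x_hat = (f @ W_dec).reshape(n_layers, d_model)  # per-layer reconstruction
print(f.shape, x_hat.shape)
```

Each decoder row thus carries one direction per layer, which is what makes the learned features cross-layer.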
We recently released a paper on using mechanistic interpretability to generate compact formal guarantees on model performance. In this companion blog post to our paper, we'll summarize the paper and flesh out some of the motivation and inspiration behind our work.
...In this work, we propose using mechanistic interpretability – techniques for reverse engineering model weights into human-interpretable algorithms – to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally proving lower bounds on the accuracy of 151 small transformers trained on a Max-of-K task. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding. Moreover, ...
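The baseline proof strategy in this setting is exhaustive: enumerate every input, so the measured accuracy is itself a certified lower bound. A toy illustration on a made-up Max-of-2 "model" (not one of the paper's transformers) shows the idea; the cost of this strategy grows with vocabulary size to the K-th power, which is what compact, mechanistic proofs avoid.

```python
from itertools import product

vocab = range(8)

def model_predict(xs):
    # Imperfect hypothetical model: scores tokens by value but cannot
    # distinguish tokens 6 and 7, so it sometimes mispredicts the max.
    score = lambda t: min(t, 6)
    return max(xs, key=score)  # ties resolve to the first occurrence

# Exhaustive "proof": check the model on every possible length-2 input.
total = correct = 0
for xs in product(vocab, repeat=2):
    total += 1
    correct += model_predict(xs) == max(xs)
print(correct, "/", total)  # exact accuracy, hence a valid certified lower bound
```

Compact proofs trade some tightness of this bound for dramatically shorter certificates that lean on the model's internal mechanism.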