AI ALIGNMENT FORUM

[Redwood Research] Causal Scrubbing

Oct 26, 2022 by Lawrence Chan

In this sequence, we introduce causal scrubbing, a principled approach for evaluating the quality of mechanistic interpretations. The key insight behind this work is that a mechanistic interpretability hypothesis can be thought of as defining which activations inside a neural network can be resampled without affecting behavior. Accordingly, causal scrubbing tests interpretability hypotheses via behavior-preserving resampling ablations: it converts a hypothesis into a distribution over activations that should preserve behavior, and then checks whether behavior is actually preserved.
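As a rough illustration of the idea, here is a toy sketch in Python. It is not the implementation used in the posts. The two-stage "model", the `feature` functions standing in for hypotheses, and the agreement-rate metric are all assumptions made up for this example. To keep the result exact, it averages over every admissible replacement input rather than drawing random samples.

```python
# Toy sketch of a behavior-preserving resampling ablation.
# Everything here (model, hypotheses, metric) is hypothetical and chosen
# only to make the idea concrete.

def hidden(x):
    """Toy intermediate activation: an encoding of the first input component."""
    a, _ = x
    return 3 * a + 1

def readout(h, x):
    """Toy output: uses only the parity of the activation, plus the second component."""
    _, b = x
    return (h % 2) ^ b

def model(x):
    """Unmodified forward pass."""
    return readout(hidden(x), x)

def scrubbed_agreement(dataset, feature):
    """Fraction of (input, donor) pairs whose scrubbed output matches the original.

    `feature` encodes a hypothesis: the claim that the hidden activation
    matters only through feature(x). Under that claim, the activation may be
    replaced by the one computed on any donor input with the same feature value.
    """
    per_input = []
    for x in dataset:
        donors = [d for d in dataset if feature(d) == feature(x)]
        matches = [readout(hidden(d), x) == model(x) for d in donors]
        per_input.append(sum(matches) / len(matches))
    return sum(per_input) / len(per_input)

dataset = [(a, b) for a in range(10) for b in (0, 1)]

# Hypothesis 1: the activation matters only through the parity of `a`.
parity_hypothesis = lambda x: x[0] % 2
# Hypothesis 2 (too strong): the activation does not matter at all.
irrelevant_hypothesis = lambda x: 0

good = scrubbed_agreement(dataset, parity_hypothesis)
bad = scrubbed_agreement(dataset, irrelevant_hypothesis)
print(good, bad)
```

In this toy setting the first hypothesis preserves behavior on every input, while the second one changes the output for half of the donor choices, which is the kind of gap the method is meant to expose. The posts themselves measure preservation on real models and tasks, not with this simplified agreement rate.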

We apply this method to develop a refined understanding of how a small language model implements induction and how an algorithmic model correctly classifies whether a sequence of parentheses is balanced.

Besides the main post, which covers the most important content, there are three additional posts with information of less general interest. The first is a series of appendices to the main post. The other two cover the details of what we discovered while applying causal scrubbing to a paren-balance checker and to induction in a small language model.

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, Nate Thomas

Causal scrubbing: Appendix
Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, Nate Thomas

Causal scrubbing: results on induction heads
Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Tao Lin, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, Nate Thomas

Causal scrubbing: results on a paren balance checker
Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Tao Lin, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, Nate Thomas