Summary: We explain the similarities and differences between three recent approaches to testing interpretability hypotheses: causal scrubbing, Geiger et al.'s causal abstraction-based method, and locally consistent abstractions. In particular, we show that each of these methods accepts some hypotheses that at least one of the others rejects.
Acknowledgements: Thanks to Dylan Xu and Joyee Chen for many conversations related to this post while they were working on their SPAR project! And thanks to Atticus Geiger, Nora Belrose, and Lawrence Chan for discussions and feedback!
An important question for mechanistic interpretability (and other topics) is: what type of thing is a mechanistic explanation of a given neural network behavior? And what does it mean for such an explanation to be correct?
Recently, several strands of work have (mostly independently) developed similar answers: