Summary: We explain the similarities and differences between three recent approaches to testing interpretability hypotheses: causal scrubbing, Geiger et al.'s causal abstraction-based method, and locally consistent abstractions. In particular, we show that each of these methods accepts some hypotheses that at least one of the others rejects.
Acknowledgements: Thanks to Dylan Xu and Joyee Chen for many conversations related to this post while they were working on their SPAR project! And thanks to Atticus Geiger, Nora Belrose, and Lawrence Chan for discussions and feedback!
An important question for mechanistic interpretability (and other topics) is: what type of thing is a mechanistic explanation of a given neural network behavior? And what does it mean for such an explanation to be correct?
Recently, several strands of work have (mostly independently) developed similar answers: