Bart, Michael and Patrick are joint first authors. Research conducted as part of MATS 6.0 in Lee Sharkey and Neel Nanda’s streams. Thanks to Mckenna Fitzgerald and Robert Krzyzanowski for their feedback!
TL;DR:
- Sparse Autoencoder (SAE) latents have been shown to typically be monosemantic (i.e. correspond to an interpretable property of the input). It is sometimes implicitly assumed that they are therefore atomic, i.e. simple, irreducible units that make up the model’s computation.
- We provide evidence against this assumption by finding sparse, interpretable decompositions of SAE decoder directions into seemingly more atomic latents, e.g. Einstein -> science + famous + German + astronomy + energy + starts with E-
- We do this by training meta-SAEs: SAEs trained to reconstruct the decoder directions of a normal SAE.
- We argue that, conceptually, there’s
...
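To make the meta-SAE idea above concrete, here is a minimal sketch: the training data for the meta-SAE is the normal SAE's decoder matrix, with each latent's decoder direction treated as one training example. Everything here is an illustrative assumption, not the actual setup from the post: the sizes, the tied encoder/decoder weights, the plain L1 sparsity penalty, and the hand-written gradient step are all simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a "normal" SAE with 512 latents over a 64-dim
# residual stream, and a meta-SAE with 256 meta-latents.
n_latents, d_model, n_meta = 512, 64, 256

# The meta-SAE's dataset is the normal SAE's decoder matrix: each row
# (one latent's decoder direction) is one training example. Here we use
# random unit vectors as a stand-in for real decoder directions.
W_dec = rng.normal(size=(n_latents, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

# Meta-SAE parameters (encoder and decoder weights tied, for brevity).
W_meta = rng.normal(size=(d_model, n_meta)) * 0.1
b_enc = np.zeros(n_meta)

lr, l1_coeff = 1e-2, 1e-3
losses = []
for step in range(200):
    acts = np.maximum(W_dec @ W_meta + b_enc, 0.0)  # ReLU meta-latent activations
    recon = acts @ W_meta.T                         # reconstructed decoder rows
    err = recon - W_dec
    losses.append(0.5 * np.mean(np.sum(err**2, axis=1)))

    # Manual gradients of 0.5*||err||^2 + l1*||acts||_1 with tied weights.
    mask = (acts > 0).astype(float)
    d_acts = (err @ W_meta + l1_coeff * np.sign(acts)) * mask
    grad_W = (W_dec.T @ d_acts + err.T @ acts) / n_latents  # encoder + decoder paths
    W_meta -= lr * grad_W
    b_enc -= lr * d_acts.mean(axis=0)
```

After training, each row of `W_dec` is (approximately) a sparse non-negative combination of meta-latent directions, which is what lets a latent like "Einstein" decompose into components like science + famous + German.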