Bart, Michael and Patrick are joint first authors. Research conducted as part of MATS 6.0 in Lee Sharkey and Neel Nanda’s streams. Thanks to Mckenna Fitzgerald and Robert Krzyzanowski for their feedback!
TL;DR:
- Sparse Autoencoder (SAE) latents have been shown to typically be monosemantic (i.e. correspond to an interpretable property of the input). It is sometimes implicitly assumed that they are therefore atomic, i.e. simple, irreducible units that make up the model’s computation.
- We provide evidence against this assumption by finding sparse, interpretable decompositions of SAE decoder directions into seemingly more atomic latents, e.g. Einstein -> science + famous + German + astronomy + energy + starts with E-
- We do this by training meta-SAEs: SAEs trained to reconstruct the decoder directions of a normal SAE.
- We argue that, conceptually, there’s
...
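To make the meta-SAE idea above concrete, here is a minimal sketch: the training data for the meta-SAE is the normal SAE's decoder matrix, with each latent's decoder direction treated as one training example. Everything here is an illustrative assumption, not the actual setup from the post: the sizes, the tied encoder/decoder weights, the plain L1 sparsity penalty, and the hand-written gradient step are all simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a "normal" SAE with 512 latents over a 64-dim
# residual stream, and a meta-SAE with 256 meta-latents.
n_latents, d_model, n_meta = 512, 64, 256

# The meta-SAE's dataset is the normal SAE's decoder matrix: each row
# (one latent's decoder direction) is one training example. Here we use
# random unit vectors as a stand-in for real decoder directions.
W_dec = rng.normal(size=(n_latents, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

# Meta-SAE parameters (encoder and decoder weights tied, for brevity).
W_meta = rng.normal(size=(d_model, n_meta)) * 0.1
b_enc = np.zeros(n_meta)

lr, l1_coeff = 1e-2, 1e-3
losses = []
for step in range(200):
    acts = np.maximum(W_dec @ W_meta + b_enc, 0.0)  # ReLU meta-latent activations
    recon = acts @ W_meta.T                         # reconstructed decoder rows
    err = recon - W_dec
    losses.append(0.5 * np.mean(np.sum(err**2, axis=1)))

    # Manual gradients of 0.5*||err||^2 + l1*||acts||_1 with tied weights.
    mask = (acts > 0).astype(float)
    d_acts = (err @ W_meta + l1_coeff * np.sign(acts)) * mask
    grad_W = (W_dec.T @ d_acts + err.T @ acts) / n_latents  # encoder + decoder paths
    W_meta -= lr * grad_W
    b_enc -= lr * d_acts.mean(axis=0)
```

After training, each row of `W_dec` is (approximately) a sparse non-negative combination of meta-latent directions, which is what lets a latent like "Einstein" decompose into components like science + famous + German.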