Attention Output SAEs

Jun 21, 2024 by Arthur Conmy

We train sparse autoencoders (SAEs) on the output of attention layers, extending existing work that trained them on MLP layers and the residual stream.
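
For concreteness, here is a minimal sketch of the kind of SAE this involves: a linear encoder with a ReLU, a linear decoder, and an L1 sparsity penalty, trained on activations cached at an attention output hook. The dictionary size, L1 coefficient, and training details below are illustrative assumptions, not the exact configuration used in these posts.

```python
# Minimal sketch of a sparse autoencoder (SAE) on attention layer outputs.
# Dictionary size, L1 coefficient, and hook choice are illustrative assumptions.
import torch
import torch.nn as nn


class AttentionOutputSAE(nn.Module):
    def __init__(self, d_in: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_in, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_in) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def forward(self, x: torch.Tensor):
        # Encode: sparse, non-negative feature activations.
        acts = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # Decode: reconstruct the attention output from the active features.
        recon = acts @ self.W_dec + self.b_dec
        return recon, acts


def sae_loss(x, recon, acts, l1_coeff=3e-4):
    # Reconstruction error plus an L1 sparsity penalty on feature activations.
    mse = (recon - x).pow(2).sum(dim=-1).mean()
    l1 = acts.abs().sum(dim=-1).mean()
    return mse + l1_coeff * l1


# Example training step. For GPT-2 Small, the concatenated per-head output
# has dimension n_heads * d_head = 12 * 64 = 768; the x16 expansion is an assumption.
d_in, d_sae = 768, 768 * 16
sae = AttentionOutputSAE(d_in, d_sae)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts_batch = torch.randn(4096, d_in)  # stand-in for cached attention outputs
recon, feature_acts = sae(acts_batch)
loss = sae_loss(acts_batch, recon, feature_acts)
loss.backward()
opt.step()
```

In practice the training data would be activations cached from model runs at an attention output hook (e.g. TransformerLens's hook_z, flattened across heads) rather than random tensors.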

We perform a qualitative study of the features computed by attention layers and find multiple feature families, including long-range context features, short-range context features, and induction features.

More importantly, we show that SAEs are a useful tool that enables researchers to explain model behavior in greater detail than prior work. We use our SAEs to analyze the computation performed by the Indirect Object Identification (IOI) circuit, validating that the SAEs find causally meaningful intermediate variables and deepening our understanding of the semantics of the circuit.

Finally, we open-source the trained SAEs and a tool for exploring arbitrary prompts through the lens of Attention SAEs.

Sparse Autoencoders Work on Attention Layer Outputs
Connor Kissane, robertzk, Arthur Conmy, Neel Nanda

Attention SAEs Scale to GPT-2 Small
Connor Kissane, robertzk, Arthur Conmy, Neel Nanda

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To
robertzk, Connor Kissane, Arthur Conmy, Neel Nanda

Attention Output SAEs Improve Circuit Analysis
Connor Kissane, robertzk, Arthur Conmy, Neel Nanda