TL;DR: We use SGD to find sparse connections between features. Additionally, a large fraction of MLP features can be modeled as linear functions of residual-stream features, despite the MLP's non-linearity; see the linear feature section for examples.
Special thanks to fellow AISST member Adam Kaufman, who originated the idea of learning sparse connections between features, and to Jannik Brinkmann for training these SAEs.
Sparse autoencoders (SAEs) can turn the activations of an LLM into interpretable features. To define circuits, we would like to find how these features connect to each other. SAEs let us scalably find interpretable features using SGD, so why not use SGD to scalably find the connections between those features too?
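To make the idea concrete, here is a minimal sketch (not the authors' actual code) of what "learning sparse connections with SGD" could look like: fit a linear map from upstream SAE feature activations to downstream SAE feature activations, with an L1 penalty to encourage sparsity. All shapes, data, and hyperparameters below are illustrative assumptions.

```python
import torch

# Assumed shapes: n_up upstream SAE features, n_down downstream features.
n_up, n_down, n_samples = 512, 512, 4096

# Illustrative stand-ins for SAE feature activations cached over a dataset
# (real activations would come from running the model + SAEs on text).
f_up = torch.relu(torch.randn(n_samples, n_up))      # upstream feature activations
f_down = torch.relu(torch.randn(n_samples, n_down))  # downstream feature activations

# Connection matrix learned with SGD: predict downstream features as a
# linear function of upstream features, with an L1 sparsity penalty.
W = torch.zeros(n_up, n_down, requires_grad=True)
opt = torch.optim.SGD([W], lr=1e-2)
l1_coeff = 1e-3  # assumed sparsity strength

for step in range(1000):
    opt.zero_grad()
    pred = f_up @ W
    loss = (pred - f_down).pow(2).mean() + l1_coeff * W.abs().sum()
    loss.backward()
    opt.step()

# Entries of W that survive the L1 penalty are candidate
# feature-to-feature connections.
connections = (W.abs() > 1e-2).nonzero()
```

The L1 term is what makes the learned connections sparse: most entries of W are pushed to (near) zero, leaving a small set of candidate edges between features.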