Top posts
Jacob Dunefsky
Summary * We present a method for performing circuit analysis on language models using "transcoders," an occasionally-discussed variant of SAEs that provides an interpretable approximation of MLP sublayers' computations. Transcoders are exciting because they allow us not only to interpret the output of MLP sublayers but also to decompose the...
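The summary describes a transcoder as an SAE-like module that approximates an MLP sublayer's input-to-output map through a sparse hidden layer. As a rough illustration (not the post's actual implementation; weights here are random placeholders where a real transcoder would be trained to match the MLP's output under a sparsity penalty), a minimal sketch might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 64  # illustrative sizes, not the post's

# Transcoder parameters: randomly initialized here; in practice trained so
# the decoder output matches the MLP sublayer's output, with sparse features
W_enc = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)
b_dec = np.zeros(d_model)

def transcoder(x):
    # Sparse, interpretable feature activations (ReLU encoder)
    f = np.maximum(x @ W_enc + b_enc, 0.0)
    # Unlike a plain SAE, the decoder reconstructs the MLP sublayer's
    # *output* rather than its input
    y_hat = f @ W_dec + b_dec
    return y_hat, f

x = rng.normal(size=d_model)   # an MLP sublayer input
y_hat, feats = transcoder(x)
```

Because each output is a sparse sum of decoder directions, the MLP's contribution can be decomposed feature by feature, which is what makes circuit analysis through the MLP tractable.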
Epistemic status: preliminary/exploratory. Work performed as a part of Neel Nanda's MATS 5.0 (Winter 2023-2024) Research Sprint. TL;DR: We develop a method for understanding how sparse autoencoder features in transformer models are computed from earlier components, by taking a local linear approximation to MLP sublayers. We study both how the...
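The second summary mentions taking a local linear approximation to MLP sublayers. One standard way this works for a ReLU MLP (a simplified, bias-free sketch of the general idea, not the post's code): freezing the pattern of active neurons at a given input makes the MLP locally linear, so its Jacobian at that point can be composed with earlier components to attribute feature activations.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_mlp = 8, 32  # illustrative sizes

W_in = rng.normal(size=(d_in, d_mlp))
W_out = rng.normal(size=(d_mlp, d_in))

def mlp(x):
    # A bias-free ReLU MLP sublayer
    return np.maximum(x @ W_in, 0.0) @ W_out

x0 = rng.normal(size=d_in)
# Freeze the active-neuron mask at x0: within the region where this mask
# is unchanged, the MLP acts as the single linear map J
mask = (x0 @ W_in > 0).astype(float)
J = W_in @ np.diag(mask) @ W_out  # local Jacobian of the MLP at x0

# For a bias-free ReLU MLP the linearization is exact at x0 itself
assert np.allclose(mlp(x0), x0 @ J)
```

Chaining such local Jacobians backward through the network is what lets one ask how a later sparse-autoencoder feature is computed from earlier components.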