Jacob Dunefsky — AI Alignment Forum

Top posts

Transcoders enable fine-grained interpretable circuit analysis for language models

Summary: We present a method for performing circuit analysis on language models using "transcoders," an occasionally discussed variant of sparse autoencoders (SAEs) that provides an interpretable approximation to MLP sublayers' computations. Transcoders are exciting because they allow us not only to interpret the output of MLP sublayers but also to decompose the...

Apr 30, 2024•75

Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization

Epistemic status: preliminary/exploratory. Work performed as a part of Neel Nanda's MATS 5.0 (Winter 2023-2024) Research Sprint. TL;DR: We develop a method for understanding how sparse autoencoder features in transformer models are computed from earlier components, by taking a local linear approximation to MLP sublayers. We study both how the...

Jan 14, 2024•24