Philippe Chlenski — AI Alignment Forum

Transcoders enable fine-grained interpretable circuit analysis for language models

Possibly. But there is no optimization pressure from pre-training on the relationship between MLPs and transcoders. The MLPs are the thing that pre-training optimizes (as the "full-precision" master model), while transcoders only need to be maintained to remain in sync with the MLPs

I see. I was in fact misunderstanding this detail in your training setup. In this case, only engineering considerations really remain: these boil down to incorporating multiple transcoders simultaneously and modeling shifting MLP behavior with transcoders. These seem like tracta... (read more)

Transcoders enable fine-grained interpretable circuit analysis for language models

Philippe Chlenski2y10

This sounds like it could work. I can think of a few reasons why this approach could be challenging, however:
1. We don't really know how transcoders (or SAEs, to the best of my knowledge) behave when they're being trained to imitate a model component that's still updating
2. Substituting multiple transcoders at once is possible, but degrades model performance a lot compared to single-transcoder substitutions. Substituting one transcoder at a time would require restarting the forward pass at each layer.
3. If the transcoders are used to predict next tok... (read more)

Transcoders enable fine-grained interpretable circuit analysis for language models

Jacob Dunefsky, Philippe Chlenski, Neel Nanda

Summary

We present a method for performing circuit analysis on language models using "transcoders," an occasionally-discussed variant of SAEs that provide an interpretable approximation to MLP sublayers' computations. Transcoders are exciting because they allow us not only to interpret the output of MLP sublayers but also to decompose the MLPs themselves into interpretable computations. In contrast, SAEs only allow us to interpret the output of MLP sublayers and not how they were computed.
We demonstrate that transcoders achieve similar performance to SAEs (when measured via fidelity/sparsity metrics) and that the features learned by transcoders are interpretable.
One of the strong points of transcoders is that they decompose the function of an MLP layer into sparse, independently-varying, and meaningful units (like neurons were originally intended to be before superposition was discovered).

...

(Continue Reading - 4945 more words)

Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization

Jacob Dunefsky, Philippe Chlenski, Senthooran Rajamanoharan, Neel Nanda

Epistemic status: preliminary/exploratory.

Work performed as a part of Neel Nanda's MATS 5.0 (Winter 2023-2024) Research Sprint.

TL;DR: We develop a method for understanding how sparse autoencoder features in transformer models are computed from earlier components, by taking a local linear approximation to MLP sublayers. We study both how the feature is activated on specific inputs, and take steps towards finding input-independent explanations via examining model weights. We demonstrate this method with several deep-dive case studies to interpret the mechanisms used by simple transformers (GELU-1L and GELU-2L) to compute some specific features, and validate that it agrees with the results of causal methods.

Introduction

A core aim of mechanistic interpretability is tackling the curse of dimensionality, decomposing the high-dimensional activations and parameters of a neural network into individually understandable pieces. Sparse...

(Continue Reading - 12521 more words)