Summary

Sparse Autoencoder (SAE) errors are empirically pathological: when a reconstructed activation vector is at distance $ϵ$ from the original activation vector, substituting a randomly chosen point at the same distance changes the next-token prediction probabilities significantly less than substituting the SAE reconstruction^[1] (measured by both KL divergence and loss). This holds for all layers of the model (~2x to ~4.5x increase in KL and loss over baseline) and is not caused by feature suppression/shrinkage. Assuming others replicate them, these results suggest the proxy reconstruction objective is behaving pathologically. I am not sure why these errors occur, but I expect that understanding this gap will give us deeper insight into SAEs while also providing an additional metric to guide methodological progress.
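To make the comparison concrete, here is a minimal sketch of how an "ϵ-random" substitute and the KL divergence could be computed. It is only an illustration, not the code behind the results: the function names (`epsilon_random`, `kl_divergence`), the toy vectors, and the choice of a uniformly random direction are all my assumptions, and in a real experiment the probabilities would come from running the model with each substituted activation.

```python
import math
import random


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def epsilon_random(x, x_hat, rng):
    """Return a point at the same L2 distance from x as x_hat,
    but in a random direction (hypothetical helper)."""
    eps = l2_distance(x, x_hat)
    # Normalizing an isotropic Gaussian sample gives a uniformly random direction.
    direction = [rng.gauss(0.0, 1.0) for _ in x]
    norm = math.sqrt(sum(d * d for d in direction))
    return [xi + eps * d / norm for xi, d in zip(x, direction)]


def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions given as probability lists.
    Assumes q is strictly positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


rng = random.Random(0)
x = [1.0, -2.0, 0.5, 3.0]        # toy "original activation"
x_hat = [0.9, -1.7, 0.4, 3.2]    # toy stand-in for SAE(x)
x_rand = epsilon_random(x, x_hat, rng)

# Toy next-token distributions; in practice these come from the model.
p = [0.7, 0.2, 0.1]
q = [0.6, 0.3, 0.1]
kl_pq = kl_divergence(p, q)
```

By construction `x_rand` and `x_hat` are equally far from `x`, so any difference in downstream KL between the two substitutions cannot be attributed to the size of the error alone.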

Introduction

As the interpretability community allocates more resources to SAEs and increases its reliance on them, it is important to understand the limitations and potential flaws of this method.

SAEs are designed to find a sparse overcomplete feature basis for a model's latent space. This is done by minimizing the joint reconstruction error of the input data and the L1 norm of the intermediate activations (to promote sparsity):

$\min_{SAE} \| x - SAE(x) \|_{2}^{2} + \lambda \| SAE(x) \|_{1}.$
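As a rough illustration of this objective, the sketch below evaluates it for a tiny hand-built SAE. Everything here is an assumption for the sake of the example: the single-layer ReLU encoder with a linear decoder, the names `sae_forward` and `sae_loss`, and the toy weights. Following the prose description, the L1 penalty is applied to the intermediate (hidden) activations.

```python
def relu(v):
    return [max(0.0, vi) for vi in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode x into sparse features f, then decode back to x_hat."""
    f = relu([a + b for a, b in zip(matvec(W_enc, x), b_enc)])
    x_hat = [a + b for a, b in zip(matvec(W_dec, f), b_dec)]
    return f, x_hat


def sae_loss(x, f, x_hat, lam):
    """Squared L2 reconstruction error plus lam times the L1 norm of f."""
    recon = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    sparsity = sum(abs(fi) for fi in f)
    return recon + lam * sparsity


# Toy overcomplete SAE: 3 features for a 2-dimensional activation.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0]

x = [1.0, 2.0]
f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat, lam=0.1)
```

With these weights the reconstruction is exact, so the loss consists entirely of the sparsity term. Nothing in the objective refers to the model's downstream predictions.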

However, the true goal is to find a faithful feature decomposition that accurately captures the true causal variables in the model; reconstruction error and sparsity are only easy-to-optimize proxy objectives. This raises the questions: how good a proxy objective is this? Do the reconstructed representations faithfully preserve other model behavior? How much are we proxy gaming?

Naively, this training objective defines faithfulness as low L2 reconstruction error. But another natural property of a "faithful" reconstruction is that substituting the original activation with the reconstruction should approximately preserve the next-token prediction probabilities. More formally, for a set of tokens $T$ and a model $M$, let $P = M(T)$ be the model's true next-token probabilities. Then let $Q_{SAE} = M(T \mid do(x \leftarrow SAE(x)))$ be the next-token probabilities after intervening on the model by replacing a particular activation $x$ (e.g. a residual stream state or a layer of MLP activations) with the SAE reconstruction of $x$. The more faithful the reconstruction, the lower the KL divergence between $P$ and ...
