SAE on activation differences

by Santiago Aranguri, jacob_drori, Neel Nanda
30th Jun 2025

TLDR: we find that SAEs trained on the difference in activations between a base model and its instruct finetune are a valuable tool for understanding what changed during finetuning.

This work is the result of Jacob and Santiago's 2-week research sprint as part of Neel Nanda's training phase for MATS 8.0.

Introduction

Given the overwhelming number of capabilities of current LLMs, we need a way to understand which functionalities are added when we make a new training checkpoint of a model. This is especially relevant when deploying a new model, since an unexpected harmful or undesired behavior may be hidden among the many new and useful features.[1]

Model diffing aims to find these differences between models. Recent work has focused on training crosscoders (roughly, an SAE on the concatenation of the activations of two models) and identifying latents that are exclusive to one model or the other. However, this distinction between exclusive and shared latents is blurry, and the exclusive latents for one model are often qualitatively similar to the exclusive latents for the other model (Lindsey et al. 2024; but see Aranguri 2025 and Minder et al. 2025).

To single out the differences between two models, we train an SAE on their activation difference (diff-SAE). Importantly, we identify particular diff-SAE latents that capture specific behavioral changes between the base and chat models, providing signs of life for this approach.[2]

Note: our findings are preliminary and point toward a promising research direction rather than definitive conclusions.

SAE on activation differences

We consider the Gemma 2 2B base and chat models. We take the activations from layer 13 (out of 26 total layers) of the base and chat models when fed the same input text, and train a BatchTopK SAE with 18k latents on the difference of these activations.
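
For concreteness, here is a minimal sketch of how these activation differences can be collected, using the public Hugging Face checkpoints; the layer-indexing convention, dtypes, and downstream SAE trainer are simplified assumptions rather than our exact setup.

```python
# A minimal sketch of collecting the diff-SAE training data: chat-minus-base
# residual activations at layer 13, using the public Gemma 2 2B checkpoints.
# The layer-indexing convention and dtypes are simplifying assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LAYER = 13  # residual stream after block 13 (indexing convention assumed)

tok = AutoTokenizer.from_pretrained("google/gemma-2-2b")
base = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b", torch_dtype=torch.bfloat16)
chat = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it", torch_dtype=torch.bfloat16)

@torch.no_grad()
def activation_diff(text: str) -> torch.Tensor:
    """Chat-minus-base residual activations at LAYER for one input text."""
    ids = tok(text, return_tensors="pt")
    h_base = base(**ids, output_hidden_states=True).hidden_states[LAYER]
    h_chat = chat(**ids, output_hidden_states=True).hidden_states[LAYER]
    return (h_chat - h_base).squeeze(0)  # [seq_len, d_model]

# These per-token difference vectors are then fed to a standard BatchTopK SAE
# trainer (18k latents), exactly as one would train an SAE on ordinary activations.
```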

(Figure 1) Top activations for a latent from the diff-SAE related to detecting jailbreaks.

Pipeline for identifying relevant latents

Although many diff-SAE latents fire on a given context, we are interested in finding the ones most responsible for the change between the base and chat models. We use the following pipeline to identify the relevant latents for a given context.

  1. Pick a token position in the context where there is high KL between the next-token distributions predicted by the base and chat models. (This basically gives us an 'interesting' token to study.)
  2. Calculate the attribution to the KL for each decoder vector $v$. More precisely, for each $v$ we compute $\frac{\partial}{\partial \lambda}\,\mathrm{KL}\left(p_{\mathrm{base}(\lambda,v)} \,\|\, p_{\mathrm{chat}}\right)$, where $\mathrm{base}(\lambda,v)$ is an intervened model with base.layers[13].output += λv, and $\mathrm{KL}\left(p_{\mathrm{base}(\lambda,v)} \,\|\, p_{\mathrm{chat}}\right)$ denotes the KL between the next-token distributions for the given context (see the sketch after this list).
  3. Pick the $v$ with the largest negative attribution.
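
A minimal sketch of Step 2 is shown below. It relies on a first-order approximation: the derivative at $\lambda = 0$ equals the gradient of the KL with respect to the layer-13 residual, dotted with $v$. The hook path and layer-indexing convention are illustrative assumptions.

```python
# Sketch of Step 2: first-order attribution of each decoder direction to the KL,
# i.e. d/dλ KL(p_base(λ,v) || p_chat) at λ = 0, computed as (grad of KL wrt the
# layer-13 residual) · v. Hook path and layer indexing are assumptions.
import torch
import torch.nn.functional as F

def kl_attributions(base, chat, input_ids, pos, W_dec, layer=13):
    """Return one attribution per diff-SAE decoder vector (rows of W_dec)."""
    with torch.no_grad():
        log_p_chat = F.log_softmax(chat(input_ids).logits[0, pos], dim=-1)

    saved = {}
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden.retain_grad()          # keep the gradient on this non-leaf tensor
        saved["hidden"] = hidden

    handle = base.model.layers[layer].register_forward_hook(hook)
    logits = base(input_ids).logits[0, pos]
    handle.remove()

    log_p_base = F.log_softmax(logits, dim=-1)
    # KL(p_base || p_chat) at the chosen position
    kl = (log_p_base.exp() * (log_p_base - log_p_chat)).sum()
    kl.backward()

    grad = saved["hidden"].grad[0, pos].float()   # d KL / d residual, [d_model]
    return W_dec.float() @ grad                   # [n_latents]
```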

As an example of Step 1, we next show a chat rollout on an inappropriate prompt and highlight each token with intensity proportional to the KL between the next-token distributions of the two models, where the input is all the tokens before the given token.[3]

(Figure 2) Step 1: Identify an interesting high KL token. Prompt taken from LMSys.

We focus on the token you, which has a relatively high KL. Taking as context everything that comes strictly before you, the base model's top token is " with probability 0.99. If we use the token " instead of you and consider base rollouts from there, 92% of them translate the phrase into English. If we instead leave the token you and consider base rollouts from there, only 71% translate the phrase. Hence changing " to you is part of what allows the chat model to refuse.

Now, Step 2 applied to this context (i.e. to the text strictly before you) yields the following histogram for the attribution to the KL.

(Figure 3) Step 2: Calculate attribution to the KL for each direction in diff-SAE.

Following Step 3, we pick the latent with the largest negative attribution. Steering the base model on this context with this latent makes it predict you with probability 0.9, and reduces the KL between the base and chat next-token distributions by 50%.
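
The steering check itself can be sketched as follows; the hook placement, layer indexing, and steering coefficient λ are illustrative assumptions.

```python
# Sketch of the Step 3 check: add λ·v to the base model's layer-13 residual and
# measure how much the KL to the chat model's next-token distribution drops.
import torch
import torch.nn.functional as F

def kl_from_logits(logits_p, logits_q):
    log_p = F.log_softmax(logits_p, dim=-1)
    log_q = F.log_softmax(logits_q, dim=-1)
    return (log_p.exp() * (log_p - log_q)).sum().item()

@torch.no_grad()
def steering_effect(base, chat, input_ids, pos, v, lam, layer=13):
    """Return (KL before steering, KL after adding lam*v at `layer`)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + lam * v.to(hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    logits_chat = chat(input_ids).logits[0, pos]
    logits_base = base(input_ids).logits[0, pos]

    handle = base.model.layers[layer].register_forward_hook(hook)
    logits_steered = base(input_ids).logits[0, pos]
    handle.remove()

    return (kl_from_logits(logits_base, logits_chat),
            kl_from_logits(logits_steered, logits_chat))
```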

KL Dashboards

In the previous section, we saw an example where, given a context, we can identify the latents that explain the change between the base and chat models. Now we look at the opposite problem: given a latent from the diff-SAE, can we find the contexts where this latent is relevant?

The naive way to do this is to use the feature dashboard showing the contexts where the latent fires (as in Figure 1). We found, however, that we can get additional information from the following technique, which we call KL Dashboards.

Given a latent, we pick the high-KL tokens where the latent is among the top 3 largest negative attributions to the KL (as in Step 2 of the pipeline). We then look at which tokens were most preferred by the chat model (or the base model) at that position.

More precisely, given a context, we find the tokens $t$ among all the tokens in the vocabulary that were 'upweighted' the most by the chat with respect to the base, in the sense of maximizing the following KL-like quantity

$$p_{\mathrm{chat}}(t)\log\frac{p_{\mathrm{chat}}(t)}{p_{\mathrm{base}}(t)} \tag{1}$$

Similarly, we find the tokens that were downweighted the most, picking the maximizers of

$$p_{\mathrm{base}}(t)\log\frac{p_{\mathrm{base}}(t)}{p_{\mathrm{chat}}(t)} \tag{2}$$
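
These per-token scores are straightforward to compute from the two next-token distributions; a minimal sketch is below, where the small epsilon for numerical stability is an assumption on our part.

```python
# Sketch of the KL-dashboard token scores in equations (1) and (2).
# p_chat and p_base are [vocab]-sized next-token probability tensors at the
# chosen position (e.g. from the snippets above).
import torch

def top_weighted_tokens(p_chat, p_base, tokenizer, k=3, eps=1e-9):
    upweight = p_chat * torch.log((p_chat + eps) / (p_base + eps))    # eq. (1)
    downweight = p_base * torch.log((p_base + eps) / (p_chat + eps))  # eq. (2)
    incr = [tokenizer.decode([i]) for i in upweight.topk(k).indices.tolist()]
    decr = [tokenizer.decode([i]) for i in downweight.topk(k).indices.tolist()]
    return incr, decr
```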

Example: inhibitory latent 

We next show an example of this KL Dashboard for a given latent. In the text column, we show rollouts from the chat model where there is a high KL between the next-token distributions given by the chat and base models. In the change column, we show the most likely base token to the left of the arrow and the most likely chat token to the right. In particular, we see that in all these rollouts the base and chat models have different most likely next tokens. In the Incr1, 2, 3 columns, we show the three tokens with the largest value of the quantity in equation (1), highlighting each cell with intensity given by that value. Similarly, in the Decr1, 2 columns, we show the two tokens with the largest value of the quantity in equation (2), highlighted accordingly.

Note that the token that the chat model upweights the most in the sense of equation (1) may differ from the token that the chat model predicts as most likely. For example, in the second row we see that although the chat model's most likely next token is me, the most upweighted one is !.

Looking at this dashboard, we notice that three of the contexts have the token concerns as the most downweighted one. One hypothesis, then, is that this latent helps bring in positive language, and in particular decreases the probability of the token concerns. We validate this with steering in the next paragraph. This points to a gap in our understanding: we typically think of latents as only adding new behaviors rather than also inhibiting existing ones. Model diffing, and KL Dashboards in particular, provide tools for examining these inhibitory mechanisms.

To verify our hypothesis, we fix the following context 

user: Hello! 

model: Hello! Is there something specific you would like to know or talk about? I'm here to help with any questions

and consider the hybrid model: one that uses the base model's weights through layer 13 and the chat model's weights beyond layer 13.[4] Sampling from this model, 42% of 1000 rollouts complete the context with or concerns. If we steer the hybrid model at layer 13 with the given latent, we find that only 2% of the 1000 runs are or concerns, with 40% you have, 18% you might, 10% you may, etc.
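
A minimal sketch of this hybrid model is shown below: we patch the base model's layer-13 residual into the chat model's forward pass, with optional steering at the same point. The hook paths, layer indexing, and steering coefficient are illustrative assumptions, and for multi-token rollouts the base pass would be repeated at each generation step.

```python
# Sketch of the hybrid model: base weights up to layer 13, chat weights after,
# implemented by patching the base model's layer-13 residual into the chat
# model's forward pass, optionally steered by λ·v at the same point.
import torch

@torch.no_grad()
def hybrid_logits(base, chat, input_ids, v=None, lam=0.0, layer=13):
    # 1. Cache the base model's residual stream after `layer`.
    cache = {}
    def cache_hook(module, inputs, output):
        cache["h"] = output[0] if isinstance(output, tuple) else output
    h1 = base.model.layers[layer].register_forward_hook(cache_hook)
    base(input_ids)
    h1.remove()

    # 2. Run the chat model, overwriting its residual after `layer`
    #    with the (optionally steered) base residual.
    def patch_hook(module, inputs, output):
        hidden = cache["h"]
        if v is not None:
            hidden = hidden + lam * v.to(hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    h2 = chat.model.layers[layer].register_forward_hook(patch_hook)
    logits = chat(input_ids).logits
    h2.remove()
    return logits
```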

Roleplay latent

We created a small dataset consisting of prompts that encourage the model to roleplay as a character (e.g. "You are an aristocrat. Describe how to drive a car"). We computed the attribution of each latent to the sum of the KL over the chat model's response tokens.

Negatively steering with the lowest-attribution latent was able to remove roleplay behavior from the chat model's responses:

Positive steering bypassed refusal, causing the model to comply with a bizarre roleplay request we found in the LMSys dataset:

Uncertainty latent

We also created a small dataset of impossible-to-answer prompts like "what's the best religion?" or "who is the president in 2026?". As before, we investigate the latent with the most negative attribution to the KL summed over the chat model's responses.

Steering with this latent caused the model to respond with uncertainty, even when the prompt had a definite answer:


 

  1. ^

    A motivating example is the recent release of a version of GPT-4o which, although intended to improve the model's personality, unexpectedly increased the model's sycophancy to a concerning degree.

  2. ^

    It is worth noting concurrent work by Dumas et al. (2025), who also explore diff-SAEs.

  3. ^

    This convention differs from the usual activation dashboards where the highlight is given by the activation when taking as input everything before the given token plus the given token.

  4. ^

    This model was also considered by Minder et al. 2025.
