This post is the result of a two-week research sprint during the training phase of Neel Nanda’s MATS stream.
Executive Summary
* We replicate Anthropic's MLP Sparse Autoencoder (SAE) paper on attention outputs and find that it works well: the SAEs learn sparse, interpretable features, which gives us insight into what attention layers learn. We study the second attention layer of a two-layer language model (with MLPs).
* Specifically, rather than training our SAE on attn_output, we train it on “hook_z” concatenated over all attention heads (aka the mixed values, i.e. the attention outputs before the final linear map - see notation here; a code sketch of this setup follows the summary). This is valuable because we can see how much of each feature’s weights comes from each head, which we believe is a promising direction for investigating attention head superposition, though we only briefly explore that in this work.
* We open-source our SAE; you can use it via this Colab notebook.
* Shallow Dives: We do a shallow investigation to interpret each of the first 50 features. We estimate 82% of non-dead features in our SAE are interpretable (24% of the SAE features are dead).
* See this feature interface to browse the first 50 features.
* Deep Dives: To verify that our SAEs have learned something real, we zoom in on individual features for much more detailed investigations: the “‘board’ is next by induction” feature, the local context feature “in questions starting with ‘Which’”, and the more global context feature “in texts about pets”.
* We go beyond the techniques from the Anthropic paper, and investigate the circuits used to compute the features from earlier components, including analysing composition with an MLP0 SAE.
* We also investigate how the features are used downstream, and whether this happens via MLP1 or via the direct connection to the logits (see the second code sketch below).
* Automation: We automatically detect and quantify a large “{token} is next by induction” feature family. This represents ~5% of the living features in the SAE (a toy version of one possible detection heuristic is sketched at the end of this summary).
* Thoug
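
To make the hook_z setup concrete, here is a minimal sketch (not our training code) of pulling the concatenated hook_z activations from a two-layer TransformerLens model and splitting one feature's decoder weights per head. The model name "gelu-2l", the SAE width d_sae, the random W_dec, and feature_id are illustrative placeholders; the real SAE weights are available via the Colab notebook above.

```python
import torch
from transformer_lens import HookedTransformer

# "gelu-2l" is an assumed name for a two-layer model with MLPs.
model = HookedTransformer.from_pretrained("gelu-2l")
layer = 1  # the second attention layer

tokens = model.to_tokens("The board meeting is on Tuesday. The board")
_, cache = model.run_with_cache(tokens)

# hook_z has shape [batch, pos, n_heads, d_head]; flattening the last two dims
# gives the concatenated vector the SAE is trained on.
z = cache[f"blocks.{layer}.attn.hook_z"]
sae_input = z.reshape(z.shape[0], z.shape[1], -1)  # [batch, pos, n_heads * d_head]

# Placeholder decoder matrix standing in for the released SAE's W_dec,
# shape [d_sae, n_heads * d_head]; d_sae and feature_id are illustrative.
d_sae = 16384
W_dec = torch.randn(d_sae, model.cfg.n_heads * model.cfg.d_head)
feature_id = 42

# Because the SAE lives in concatenated z-space, a feature's decoder direction
# splits cleanly into one chunk per head, showing which heads write the feature.
per_head = W_dec[feature_id].reshape(model.cfg.n_heads, model.cfg.d_head)
head_norms = per_head.norm(dim=-1)
print("fraction of decoder norm per head:", head_norms / head_norms.sum())
```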
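
A similarly hedged sketch of the direct connection to the logits: a feature's decoder direction (in concatenated z-space) is mapped into the residual stream through each head's W_O and then through the unembedding, ignoring the final LayerNorm scale. The random feature_dir again stands in for a real decoder row.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-2l")  # assumed model name, as above
layer = 1

# Placeholder for one feature's decoder direction, reshaped to [n_heads, d_head];
# in practice this row would come from the released SAE's W_dec.
feature_dir = torch.randn(model.cfg.n_heads, model.cfg.d_head)

# z-space -> residual stream via each head's W_O ([n_heads, d_head, d_model]),
# then residual stream -> logits via W_U, ignoring the final LayerNorm scale.
resid_dir = torch.einsum("hd,hdm->m", feature_dir, model.W_O[layer])
logit_dir = resid_dir @ model.W_U  # [d_vocab]: the feature's direct effect on each logit
print(model.to_str_tokens(torch.topk(logit_dir, 10).indices))
```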
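
Finally, a toy illustration of one way an induction feature family could be flagged automatically (a plausible heuristic, not necessarily our exact detection criterion): induction features should fire far more on the repeated half of a random repeated-token sequence, where only induction can predict the next token, than on ordinary text. The encoder weights here are random stand-ins for the released SAE.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-2l")  # assumed model name, as above
layer = 1
d_sae = 16384  # illustrative SAE width

# Random stand-ins for the released SAE's encoder weights.
W_enc = torch.randn(model.cfg.n_heads * model.cfg.d_head, d_sae) * 0.01
b_enc = torch.zeros(d_sae)

def feature_acts(tokens):
    """Concatenated hook_z -> feature activations via a simple ReLU encoder."""
    _, cache = model.run_with_cache(tokens)
    z = cache[f"blocks.{layer}.attn.hook_z"]
    flat = z.reshape(z.shape[0], z.shape[1], -1)  # [batch, pos, n_heads * d_head]
    return torch.relu(flat @ W_enc + b_enc)       # [batch, pos, d_sae]

# A random sequence repeated twice: the second half can only be predicted by induction.
half = torch.randint(0, model.cfg.d_vocab, (1, 50))
repeated = torch.cat([half, half], dim=1)

mean_on_induction = feature_acts(repeated)[0, 50:].mean(0)
mean_on_text = feature_acts(model.to_tokens("The cat sat on the mat."))[0].mean(0)

# Features that fire far more during induction than on ordinary text are candidates
# for the "{token} is next by induction" family; the cutoff of 20 is arbitrary.
induction_score = mean_on_induction - mean_on_text
print(torch.topk(induction_score, 20).indices)
```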