wesg — AI Alignment Forum

Refusal in LLMs is mediated by a single direction

This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee. This post is a preview of our upcoming paper, which will provide more detail on our current understanding of refusal. We thank Nina...

Apr 27, 2024 · 255

SAE reconstruction errors are (empirically) pathological

Summary: Sparse Autoencoder (SAE) errors are empirically pathological: when a reconstructed activation vector is distance ϵ from the original activation vector, substituting a randomly chosen point at the same distance changes the next token prediction probabilities significantly less than substituting the SAE reconstruction[1] (measured by both KL and loss). This...

Mar 29, 2024 · 108
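
The comparison described in this summary relies on a baseline point that lies at the same distance ϵ from the original activation as the SAE reconstruction. A minimal sketch of how such a point could be drawn is given here. It assumes Euclidean distance and a uniformly random direction obtained by normalizing a Gaussian sample; the post may sample differently. The function name `random_point_at_same_distance` and the toy vectors are hypothetical stand-ins, not taken from the post.

```python
import math
import random


def random_point_at_same_distance(x, x_hat, rng):
    """Return a point whose Euclidean distance from x equals ||x_hat - x||.

    Assumption: the direction is uniform on the sphere, obtained by
    normalizing an isotropic Gaussian sample.
    """
    eps = math.dist(x, x_hat)  # reconstruction error norm
    direction = [rng.gauss(0.0, 1.0) for _ in x]
    norm = math.sqrt(sum(d * d for d in direction))
    return [xi + eps * d / norm for xi, d in zip(x, direction)]


rng = random.Random(0)
# Toy stand-ins: a real experiment would use model activations and an SAE.
x = [rng.gauss(0.0, 1.0) for _ in range(64)]
x_hat = [xi + 0.1 * rng.gauss(0.0, 1.0) for xi in x]
x_rand = random_point_at_same_distance(x, x_hat, rng)
```

In the setup the summary describes, `x_hat` and `x_rand` would each be substituted for `x` in the forward pass, and the resulting change in next-token predictions (KL and loss) compared.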

Finding Neurons in a Haystack: Case Studies with Sparse Probing

Abstract: Despite rapid adoption and deployment of large language models (LLMs), the internal computations of these models remain opaque and poorly understood. In this work, we seek to understand how high-level human-interpretable features are represented within the internal neuron activations of LLMs. We train $k$-sparse linear classifiers (probes) on...

May 3, 2023 · 33

Wes Gurnee