Vikrant Varma

MONA: Three Month Later - Updates and Steganography Without Optimization Pressure

We published the MONA paper about three months ago. Since then we’ve had many conversations about the work, and want to share some of the main updates that people make after talking to us: 1. The realism of our model organisms 2. What does "approval" mean 3. Isn't this just...

Apr 12, 2025 · 31

JumpReLU SAEs + Early Access to Gemma 2 SAEs

New paper from the Google DeepMind mechanistic interpretability team, led by Sen Rajamanoharan! We introduce JumpReLU SAEs, a new SAE architecture that replaces the standard ReLUs with discontinuous JumpReLU activations, and seems to be (narrowly) state of the art over existing methods like TopK and Gated SAEs for achieving high...

Jul 19, 2024 · 55

Improving Dictionary Learning with Gated Sparse Autoencoders

Authors: Senthooran Rajamanoharan*, Arthur Conmy*, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda. A new paper from the Google DeepMind mech interp team: Improving Dictionary Learning with Gated Sparse Autoencoders! Gated SAEs are a new Sparse Autoencoder architecture that seems to be a significant Pareto-improvement over...

Apr 25, 2024 · 63

[Full Post] Progress Update #1 from the GDM Mech Interp Team

This is a series of snippets about the Google DeepMind mechanistic interpretability team's research into Sparse Autoencoders, that didn't meet our bar for a full paper. Please start at the summary post for more context, and a summary of each snippet. They can be read in any order. Activation Steering...

Apr 19, 2024 · 80

[Summary] Progress Update #1 from the GDM Mech Interp Team

Introduction: This is a progress update from the Google DeepMind mechanistic interpretability team, inspired by the Anthropic team’s excellent monthly updates! Our goal was to write up a series of snippets, covering a range of things that we thought would be interesting to the broader community, but didn't yet meet our...

Apr 19, 2024 · 73

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

TL;DR: Contrast-consistent search (CCS) seemed exciting to us and we were keen to apply it. At this point, we think it is unlikely to be directly helpful for implementations of alignment strategies (>95%). Instead of finding knowledge, it seems to find the most prominent feature. We are less sure about...

Dec 18, 2023 · 149

Explaining grokking through circuit efficiency

This is a linkpost for our paper Explaining grokking through circuit efficiency, which provides a general theory explaining when and why grokking (aka delayed generalisation) occurs, and makes several interesting and novel predictions which we experimentally confirm (introduction copied below). You might also enjoy our explainer on X/Twitter. Abstract One...

Sep 8, 2023 · 102

AI ALIGNMENT FORUM
AF

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

Clarifying AI X-risk

Explaining grokking through circuit efficiency

Refining the Sharp Left Turn threat model, part 1: claims and mechanisms
