New research from the GDM mechanistic interpretability team. Read the full paper on arXiv or check out the Twitter thread. > Abstract > Building reliable deception detectors for AI systems—methods that could predict when an AI system is being strategically deceptive without necessarily requiring behavioural evidence—would be valuable in mitigating...
Executive Summary * Over the past year, the Google DeepMind mechanistic interpretability team has pivoted to a pragmatic approach to interpretability, as detailed in our accompanying post [1], and is excited for more in the field to embrace pragmatism! In brief, we think that: * It is crucial to...
Executive Summary * The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability: * Trying to directly solve problems on the critical path to AGI going well [1] * Carefully choosing problems according to our comparative...
Nick and Lily are co-first authors on this project; Lewis and Neel jointly supervised it. Check out our updated paper here: https://arxiv.org/abs/2512.10092. TL;DR * We use sparse autoencoders (SAEs) for four textual data analysis tasks—data diffing, finding correlations, targeted clustering, and retrieval. We care especially about gaining insights...
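As a rough illustration of what SAE-based "data diffing" can look like (a minimal sketch, not the paper's exact method; `sae_encode`, the activation threshold, and array shapes are illustrative assumptions): encode each dataset into SAE latent activations, compute how often each latent fires on each dataset, and surface the latents whose firing rates differ most between the two.

```python
# Sketch of SAE "data diffing": compare how often each SAE latent fires on
# two text datasets and surface the latents that differ most.
# `sae_encode` is a hypothetical stand-in for whatever returns SAE latent
# activations for a batch of texts, with shape [n_texts, n_latents].
import numpy as np

def firing_rates(activations: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Fraction of texts on which each latent is active (above threshold)."""
    return (activations > threshold).mean(axis=0)

def diff_datasets(acts_a: np.ndarray, acts_b: np.ndarray, top_k: int = 20):
    """Return the latents whose firing rates differ most between datasets A and B."""
    rates_a = firing_rates(acts_a)
    rates_b = firing_rates(acts_b)
    diff = rates_a - rates_b
    top = np.argsort(-np.abs(diff))[:top_k]
    return [(int(i), float(rates_a[i]), float(rates_b[i])) for i in top]

# Hypothetical usage:
#   acts_a = sae_encode(texts_a); acts_b = sae_encode(texts_b)
#   for latent, ra, rb in diff_datasets(acts_a, acts_b):
#       print(f"latent {latent}: fires {ra:.1%} on A vs {rb:.1%} on B")
```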
Lewis Smith*, Sen Rajamanoharan*, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda * = equal contribution The following piece is a list of snippets about research from the GDM mechanistic interpretability team, which we didn’t consider a good fit for turning into a paper, but which...
TL;DR: If you are thinking of using interpretability to help detect strategic deception, then there's likely a problem you need to solve first: how are intentional descriptions (like deception) related to algorithmic ones (like understanding the mechanisms models use)? We discuss this problem and try to outline some constructive directions....
NB. I am on the Google DeepMind language model interpretability team. But the arguments/views in this post are my own, and shouldn't be read as a team position. > “It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input....