New research from the GDM mechanistic interpretability team. Read the full paper on arXiv or check out the Twitter thread. > Abstract > Building reliable deception detectors for AI systems—methods that could predict when an AI system is being strategically deceptive without necessarily requiring behavioural evidence—would be valuable in mitigating...
Executive Summary * Over the past year, the Google DeepMind mechanistic interpretability team has pivoted to a pragmatic approach to interpretability, as detailed in our accompanying post [1], and is excited for more in the field to embrace pragmatism! In brief, we think that: * It is crucial to...
Executive Summary * The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability: * Trying to directly solve problems on the critical path to AGI going well [1] * Carefully choosing problems according to our comparative...
Nick and Lily are co-first authors on this project; Lewis and Neel jointly supervised it. Check out our updated paper here: https://arxiv.org/abs/2512.10092. TL;DR * We use sparse autoencoders (SAEs) for four textual data analysis tasks—data diffing, finding correlations, targeted clustering, and retrieval. We care especially about gaining insights...
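As a rough illustration of what SAE-based "data diffing" can look like (a minimal sketch, not the paper's exact method; `sae_encode`, the activation threshold, and array shapes are illustrative assumptions): encode each dataset into SAE latent activations, compute how often each latent fires on each dataset, and surface the latents whose firing rates differ most between the two.

```python
# Sketch of SAE "data diffing": compare how often each SAE latent fires on
# two text datasets and surface the latents that differ most.
# `sae_encode` is a hypothetical stand-in for whatever returns SAE latent
# activations for a batch of texts, with shape [n_texts, n_latents].
import numpy as np

def firing_rates(activations: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Fraction of texts on which each latent is active (above threshold)."""
    return (activations > threshold).mean(axis=0)

def diff_datasets(acts_a: np.ndarray, acts_b: np.ndarray, top_k: int = 20):
    """Return the latents whose firing rates differ most between datasets A and B."""
    rates_a = firing_rates(acts_a)
    rates_b = firing_rates(acts_b)
    diff = rates_a - rates_b
    top = np.argsort(-np.abs(diff))[:top_k]
    return [(int(i), float(rates_a[i]), float(rates_b[i])) for i in top]

# Hypothetical usage:
#   acts_a = sae_encode(texts_a); acts_b = sae_encode(texts_b)
#   for latent, ra, rb in diff_datasets(acts_a, acts_b):
#       print(f"latent {latent}: fires {ra:.1%} on A vs {rb:.1%} on B")
```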
Lewis Smith*, Sen Rajamanoharan*, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda * = equal contribution The following piece is a list of snippets about research from the GDM mechanistic interpretability team, which we didn’t consider a good fit for turning into a paper, but which...
TL;DR: If you are thinking of using interpretability to help detect strategic deception, then there's likely a problem you need to solve first: how are intentional descriptions (like deception) related to algorithmic ones (like understanding the mechanisms models use)? We discuss this problem and try to outline some constructive directions....
NB. I am on the Google DeepMind language model interpretability team. But the arguments/views in this post are my own, and shouldn't be read as a team position. > “It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input....