This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The second post can be found here. In this short post, we describe a surprising finding: most safety-relevant properties in Gemini seem to be caused...
This is the second in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The first post can be found here. TL;DR * It is possible to build extremely simple agents that reliably find interesting behavioural differences between distinct models....
New research from the GDM mechanistic interpretability team. Read the full paper on arxiv or check out the twitter thread. > Abstract > Building reliable deception detectors for AI systems—methods that could predict when an AI system is being strategically deceptive without necessarily requiring behavioural evidence—would be valuable in mitigating...
Executive Summary * Over the past year, the Google DeepMind mechanistic interpretability team has pivoted to a pragmatic approach to interpretability, as detailed in our accompanying post [1], and is excited for more in the field to embrace pragmatism! In brief, we think that: * It is crucial to...
Executive Summary * The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability: * Trying to directly solve problems on the critical path to AGI going well [1] * Carefully choosing problems according to our comparative...
Can you tell when an LLM is lying from the activations? Are simple methods good enough? We recently published a paper investigating whether linear probes detect when Llama is deceptive. Abstract: > AI models might use deceptive strategies as part of scheming or misaligned behaviour. Monitoring outputs alone is insufficient,...
TL;DR: This paper brings together ~30 mechanistic interpretability researchers from 18 different research orgs to review current progress and the main open problems of the field. This review collects the perspectives of its various authors and represents a synthesis of their views by Apollo Research on behalf of Schmidt Sciences....