This is the fifth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The fourth post can be found here. TLDR: Via adapting the methods of Marks et al and Li et al, we train Gemini 3 Flash to...
This is the fourth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The third post can be found here. Since SFT is the cause for many safety relevant properties, a natural strategy is to filter out rollouts from...
This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The second post can be found here. In this short post, we describe a surprising finding: most safety relevant properties in Gemini seem to be caused...
This is the second in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The first post can be found here. TL;DR * It is possible to build extremely simple agents that reliably find interesting behavioural differences between distinct models....
This is the first in a series of research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. TL;DR It's often assumed that models will act more aligned when they can tell they're being evaluated. But we find that Gemini can take “undesired” actions in...
Authors: Daria Ivanova, Riya Tyagi, Josh Engels, Neel Nanda Daria and Riya are co-first authors. This work was done during Neel Nanda’s MATS 9.0. Claude helped write code and suggest edits for this post. Most of our tasks fall in 3 categories: predicting future actions, detecting the effect of an...
This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan. * There's been a lot of buzz around Claude's 30K word constitution ("soul doc"), and unusual ways Anthropic is integrating it into training. * If we can robustly train complex and nuanced values into a...