Arthur Conmy

Open Distillation of Hereditary Traits

TL;DR * Josh and Neel show that distillation from a teacher model to a base pretrained student model transfers some of the teacher model’s traits (such as displaying negative emotion in the Gemma Needs Help evals) * On its own this is pretty unsurprising, but Josh and Neel additionally show...

Jul 1439

How transparent is DiffusionGemma (and why it matters)

by Josh Engels, Callum McDougall, bilalchughtai, János Kramár, Senthooran Rajamanoharan, Arthur Conmy, Rohin Shah, and Neel Nanda

Authors: Joshua Engels*, Callum McDougall*, Bilal Chughtai*, Janos Kramar, Senthooran Rajamanoharan, Cindy Wu, Arthur Conmy, Asic Q Chen, Jean Tarbouriech, Min Ma, Brendan O'Donoghue+, João Gabriel Lopes de Oliveira+, Rohin Shah+, Neel Nanda+
*Primary Contributor +Advising
Paper here: https://arxiv.org/abs/2606.20560
Overview
In a recent collaboration between the GDM interpretability team and...

Jun 20 · 86

Synthetic document finetuning for instilling positive traits

by CallumMcDougall, Arthur Conmy, and Neel Nanda

This is the fifth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The fourth post can be found here. Thanks to Chloe Li for feedback on this post! TLDR: Via adapting the methods of Marks et al and...

Jun 1661

SFT Drives Gemini’s Safety Properties

by Josh Engels, Arthur Conmy, bilalchughtai, and Neel Nanda

This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The second post can be found here. In this short post, we describe a surprising finding: most safety relevant properties in Gemini seem to be caused...

Jun 1389

AIs will be used in “unhinged” configurations

Writing up a probably-obvious point that I want to refer to later, with significant LLM writing help. TL;DR: 1) A common critique of AI safety evaluations is that they occur in unrealistic settings, such as excessive goal conflict, or are obviously an evaluation rather than “real deployment”.[1] I argue...

Mar 1162

Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought

by Riya Tyagi, daria, Arthur Conmy, and Neel Nanda

Authors: Riya Tyagi, Daria Ivanova, Arthur Conmy, Neel Nanda Riya and Daria are co-first authors. This work was largely done during a research sprint for Neel Nanda’s MATS 9.0 training phase. 🖥️ Deployment code ⚙️ Interactive demo 🐦 Tweet thread TL;DR * We believe that more research effort should go...

Jan 1352

Announcing Gemma Scope 2

by CallumMcDougall, Arthur Conmy, János Kramár, Tom Lieberum, Senthooran Rajamanoharan, and Neel Nanda

TLDR * The Google DeepMind mech interp team is releasing Gemma Scope 2: a suite of SAEs & transcoders trained on the Gemma 3 model family * Neuronpedia demo here, access the weights on HuggingFace here, try out the Colab notebook tutorial here [1] * Key features of this relative...

Dec 22, 2025 · 96

Arthur Conmy

A Pragmatic Vision for Interpretability

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small

Announcing Gemma Scope 2
