Writing up a probably-obvious point that I want to refer to later, with significant LLM writing help. TL;DR: 1) A common critique of AI safety evaluations is that they occur in unrealistic settings, such as excessive goal conflict, or are obviously an evaluation rather than “real deployment”.[1] I argue...
Authors: Riya Tyagi, Daria Ivanova, Arthur Conmy, Neel Nanda. Riya and Daria are co-first authors. This work was largely done during a research sprint for Neel Nanda’s MATS 9.0 training phase. 🖥️ Deployment code ⚙️ Interactive demo 🐦 Tweet thread TL;DR * We believe that more research effort should go...
TLDR * The Google DeepMind mech interp team is releasing Gemma Scope 2: a suite of SAEs & transcoders trained on the Gemma 3 model family * Neuronpedia demo here, access the weights on HuggingFace here, try out the Colab notebook tutorial here [1] * Key features of this relative...
Authors: Bartosz Cywinski*, Bart Bussmann*, Arthur Conmy**, Joshua Engels**, Neel Nanda**, Senthooran Rajamanoharan** * primary contributors ** advice and mentorship TL;DR We study a simple latent reasoning LLM on math tasks using standard mechanistic interpretability techniques to see whether the latent reasoning process (i.e., vector-based chain of thought) is interpretable....
Executive Summary * Over the past year, the Google DeepMind mechanistic interpretability team has pivoted to a pragmatic approach to interpretability, as detailed in our accompanying post [1], and we are excited for more in the field to embrace pragmatism! In brief, we think that: * It is crucial to...
Executive Summary * The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability: * Trying to directly solve problems on the critical path to AGI going well [1] * Carefully choosing problems according to our comparative...
Authors: Bartosz Cywinski*, Bart Bussmann*, Arthur Conmy**, Neel Nanda**, Senthooran Rajamanoharan**, Joshua Engels** * equal primary contributor, order determined via coin flip ** equal advice and mentorship, order determined via coin flip > “Tampering alert: The thought "I need to provide accurate, helpful, and ethical medical advice" is not my...