Code and data can be found here. Executive Summary * We use data from Zhang et al. (2025) to measure LLM values. We find that our value metric can sometimes predict LLM behaviors on a test distribution in non-safety-relevant settings, but the predictions are not consistently reliable. * In Zhang et...
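The excerpt claims the value metric sometimes predicts behavior. As a schematic of what such a check can look like, here is a toy correlation between a per-scenario value score and an observed behavior rate; all numbers and names are hypothetical, and this is not the post's method or data.

```python
# Toy illustration (NOT the post's method or data): check how well a
# per-scenario "value score" predicts a model's observed choice rate.
import numpy as np
from scipy.stats import pearsonr

# Hypothetical numbers: value-metric scores for 8 scenarios, and the
# fraction of sampled responses in which the model acted on that value.
value_scores = np.array([0.9, 0.7, 0.8, 0.2, 0.4, 0.6, 0.1, 0.5])
behavior_rates = np.array([0.8, 0.9, 0.6, 0.3, 0.2, 0.7, 0.2, 0.4])

r, p = pearsonr(value_scores, behavior_rates)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
# A high r would mean the metric transfers to behavior; the post reports
# that this kind of prediction works only inconsistently.
```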
This project is an extension of work done for Neel Nanda’s MATS 9.0 Training Phase. Neel Nanda and Josh Engels advised the project. Initial work on this project was done with David Vella Zarb. Thank you to Arya Jakkli, Paul Bogdan, and Monte MacDiarmid for providing feedback on the post...
Authors: Bartosz Cywinski*, Bart Bussmann*, Arthur Conmy**, Joshua Engels**, Neel Nanda**, Senthooran Rajamanoharan** * primary contributors; ** advice and mentorship. TL;DR: We study a simple latent reasoning LLM on math tasks using standard mechanistic interpretability techniques to see whether the latent reasoning process (i.e., vector-based chain of thought) is interpretable....
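The excerpt doesn't show the techniques themselves, but one standard mechanistic interpretability tool for vector-based chain of thought is a logit-lens-style readout: project each latent reasoning vector through the unembedding matrix and inspect its nearest tokens. A minimal sketch under stated assumptions — a small HuggingFace causal LM standing in for the authors' latent reasoning model, and per-layer hidden states standing in for the latent reasoning vectors:

```python
# Minimal logit-lens sketch: project hidden vectors through the unembedding
# to see which vocabulary tokens each "latent step" is closest to.
# Assumptions (not from the post): a HuggingFace causal LM, with per-layer
# hidden states used as stand-ins for the latent reasoning vectors.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the post's latent reasoning model is not shown here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tok("12 + 35 =", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Decode each layer's last-position hidden state. (A fuller logit lens
# would apply the model's final layer norm before projecting.)
unembed = model.get_output_embeddings().weight  # (vocab, d_model)
for layer, h in enumerate(out.hidden_states):
    vec = h[0, -1]                    # last-position residual-stream vector
    logits = vec @ unembed.T          # project into vocabulary space
    top = logits.topk(5).indices
    print(f"layer {layer:2d}:", tok.decode(top))
```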
Executive Summary * Over the past year, the Google DeepMind mechanistic interpretability team has pivoted to a pragmatic approach to interpretability, as detailed in our accompanying post [1], and is excited for more of the field to embrace pragmatism! In brief, we think that: * It is crucial to...
Executive Summary * The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability: * Trying to directly solve problems on the critical path to AGI going well [1] * Carefully choosing problems according to our comparative...
Authors: Bartosz Cywinski*, Bart Bussmann*, Arthur Conmy**, Neel Nanda**, Senthooran Rajamanoharan**, Joshua Engels** * equal primary contributors, order determined via coin flip ** equal advice and mentorship, order determined via coin flip > “Tampering alert: The thought "I need to provide accurate, helpful, and ethical medical advice" is not my...
Summary: Reproducing a result from recent work, we study a Gemma 3 12B instance trained to take risky or safe options; the model can then report its own risk tolerance. We find that: * Applying LoRA to a single MLP is enough to reproduce the behavior * The single LoRA...
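The first bullet says LoRA on a single MLP suffices to reproduce the behavior. As a rough illustration of what restricting LoRA to one MLP block looks like in practice, here is a sketch using the peft library; the layer index and projection names are illustrative guesses, not the configuration from the post.

```python
# Sketch: restrict LoRA to a single MLP block's projections using peft.
# The layer index (20) and module names are illustrative guesses, not the
# post's configuration; Gemma-style blocks expose gate/up/down projections.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("google/gemma-3-12b-it")

config = LoraConfig(
    r=16,
    lora_alpha=32,
    # peft matches these as suffixes of module names, so this targets
    # only layer 20's MLP rather than every MLP in the network.
    target_modules=[
        "layers.20.mlp.gate_proj",
        "layers.20.mlp.up_proj",
        "layers.20.mlp.down_proj",
    ],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the single-MLP adapter trains
```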