PhD @ EPFL with Robert West. MATS 7 Scholar with Neel Nanda. Interested in mechanistic interpretability and what the process of finetuning does to models.

Julian Minder

Julian Minder — AI Alignment Forum

387

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

by Sam Marks, Adam Karvonen, James Chua, Subhash Kantamneni, Euan Ong, Julian Minder, Clément Dumas, and Owain_Evans

TL;DR: We train LLMs to accept LLM neural activations as inputs and answer arbitrary questions about them in natural language. These Activation Oracles generalize far beyond their training distribution, for example uncovering misalignment or secret knowledge introduced via fine-tuning. Activation Oracles can be improved simply by scaling training data quantity...

Dec 18, 2025•154

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

The work was done as part of the MATS 7 extension. We'd like to thank Cameron Holmes and Fabien Roger for their useful feedback. Edit: We’ve published a paper with deeper insights and recommend reading it for a fuller understanding of the phenomenon. TL;DR Claim: Narrow finetunes leave clearly readable...

Sep 5, 2025•54

What We Learned Trying to Diff Base and Chat Models (And Why It Matters)

by Clément Dumas, Julian Minder, and Neel Nanda

This post presents some motivation for why we work on model diffing, some of our first results using sparse dictionary methods, and our next steps. This work was done as part of the MATS 7 extension. We'd like to thank Cameron Holmes and Bart Bussman for their useful feedback. Could...

Jun 30, 2025•106