Model Diffing

Edited by Clément Dumas, last updated 30th Jun 2025

Model diffing is the study of the mechanistic changes that fine-tuning introduces: what makes a fine-tuned model internally different from its base model.
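
As a rough illustration of what "internally different" can mean in practice, the sketch below compares activations from a base and a fine-tuned model on the same inputs and ranks hidden dimensions by how much their mean value shifted. It is a toy under stated assumptions, not the method of any post listed here: the activation data is synthetic, and the helper names `mean_activation_difference` and `top_changed_dimensions` are hypothetical. The tagged posts use richer tools, such as crosscoders and SAEs trained on activation differences.

```python
import random


def mean_activation_difference(base_acts, tuned_acts):
    """Per-dimension mean of (fine-tuned - base) activations.

    Both arguments are lists of rows (one row per token), each row a list of
    floats, collected from the same inputs at the same layer.
    """
    if not base_acts or len(base_acts) != len(tuned_acts):
        raise ValueError("need the same non-zero number of rows from both models")
    d_model = len(base_acts[0])
    totals = [0.0] * d_model
    for base_row, tuned_row in zip(base_acts, tuned_acts):
        if len(base_row) != d_model or len(tuned_row) != d_model:
            raise ValueError("all rows must have the same width")
        for j in range(d_model):
            totals[j] += tuned_row[j] - base_row[j]
    return [total / len(base_acts) for total in totals]


def top_changed_dimensions(diff, k=3):
    """Indices of the k dimensions with the largest absolute mean change."""
    ranked = sorted(range(len(diff)), key=lambda j: abs(diff[j]), reverse=True)
    return ranked[:k]


# Synthetic stand-in for real activations (hypothetical data, no real model).
rng = random.Random(0)
base = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(256)]
tuned = [row[:] for row in base]
for row in tuned:
    row[5] += 2.0  # pretend fine-tuning shifted hidden dimension 5

diff = mean_activation_difference(base, tuned)
top = top_changed_dimensions(diff, k=1)
```

With real models, `base` and `tuned` would be replaced by activations recorded at the same layer of each model on an identical batch of tokens. A raw per-dimension comparison like this only surfaces changes aligned with individual neurons, which is one reason to learn a feature dictionary instead.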

Posts tagged Model Diffing
What We Learned Trying to Diff Base and Chat Models (And Why It Matters)
Clément Dumas, Julian Minder, Neel Nanda
42 karma, 4mo, 0 comments

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences
Julian Minder, Clément Dumas, Stewy Slocum, Neel Nanda
23 karma, 1mo, 0 comments

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing
Connor Kissane, robertzk, Arthur Conmy, Neel Nanda
23 karma, 1y, 3 comments

SAE on activation differences
Santiago Aranguri, jacob_drori, Neel Nanda
18 karma, 4mo, 0 comments

Measuring Nonlinear Feature Interactions in Sparse Crosscoders [Project Proposal]
Jason Gross, rajashree
8 karma, 9mo, 0 comments

[Replication] Crosscoder-based Stage-Wise Model Diffing
Anna Soligo, Thomas Read, Oliver Clive-Griffin, dmanningcoe, Chun Hei Yip, rajashree, Jason Gross
7 karma, 7mo, 0 comments

Tied Crosscoders: Explaining Chat Behavior from Base Model
Santiago Aranguri
4 karma, 7mo, 0 comments

[Research sprint] Single-model crosscoder feature ablation and steering
Thomas Read
3 karma, 6mo, 0 comments