This work was done as part of the MATS 7 extension. We'd like to thank Cameron Holmes and Fabien Roger for their useful feedback.
Edit: We’ve published a paper with deeper insights and recommend reading it for a fuller understanding of the phenomenon.
TL;DR
Claim: Narrow finetunes leave clearly readable traces: activation differences between base and finetuned models on the first few tokens of unrelated text reliably reveal the finetuning domain.
Results:
- Simple interpretability tools (Patchscope) on per-position average differences surface highly relevant tokens.
- Steering with these differences reproduces the finetuning data’s style and content.
- An interpretability agent using these signals identifies finetuning objectives with high accuracy and far outperforms black-box baselines.
- Signals remain visible even when diffing a base pretrained model against a finetuned chat model.
- Mixing unrelated chat data or reducing finetuning set size
...
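The per-position average difference mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: random tensors stand in for cached residual-stream activations, and the shapes (`num_prompts`, `num_positions`, `hidden_dim`) are assumed for illustration.

```python
import torch

# Toy stand-ins for cached residual-stream activations with shape
# (num_prompts, num_positions, hidden_dim). In practice these would
# come from running the base and finetuned models on the first few
# tokens of unrelated text.
torch.manual_seed(0)
num_prompts, num_positions, hidden_dim = 16, 5, 64
base_acts = torch.randn(num_prompts, num_positions, hidden_dim)
ft_acts = base_acts + 0.1 * torch.randn(num_prompts, num_positions, hidden_dim)

# Per-position average difference: mean over prompts, leaving one
# difference vector per token position.
avg_diff = (ft_acts - base_acts).mean(dim=0)  # (num_positions, hidden_dim)
print(avg_diff.shape)
```

Each of the resulting difference vectors can then be inspected with tools like Patchscope, or added to the residual stream during generation to steer the model toward the finetuning distribution.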