This work was done as part of the MATS 7 extension. We'd like to thank Cameron Holmes and Fabien Roger for their useful feedback.
Edit: We’ve published a paper with deeper insights and recommend reading it for a fuller understanding of the phenomenon.
TL;DR
Claim: Narrow finetunes leave clearly readable traces: activation differences between base and finetuned models on the first few tokens of unrelated text reliably reveal the finetuning domain.
Results:
- Simple interpretability tools (Patchscope) on per-position average differences surface highly relevant tokens.
- Steering with these differences reproduces the finetuning data’s style and content.
- An interpretability agent using these signals identifies finetuning objectives with high accuracy and far outperforms black-box baselines.
- Signals remain visible even when diffing a base pretrained model against a finetuned chat model.
- Mixing unrelated chat data or reducing finetuning set size
...
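The per-position average difference mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: random tensors stand in for cached residual-stream activations, and the shapes (`num_prompts`, `num_positions`, `hidden_dim`) are assumed for illustration.

```python
import torch

# Toy stand-ins for cached residual-stream activations with shape
# (num_prompts, num_positions, hidden_dim). In practice these would
# come from running the base and finetuned models on the first few
# tokens of unrelated text.
torch.manual_seed(0)
num_prompts, num_positions, hidden_dim = 16, 5, 64
base_acts = torch.randn(num_prompts, num_positions, hidden_dim)
ft_acts = base_acts + 0.1 * torch.randn(num_prompts, num_positions, hidden_dim)

# Per-position average difference: mean over prompts, leaving one
# difference vector per token position.
avg_diff = (ft_acts - base_acts).mean(dim=0)  # (num_positions, hidden_dim)
print(avg_diff.shape)
```

Each of the resulting difference vectors can then be inspected with tools like Patchscope, or added to the residual stream during generation to steer the model toward the finetuning distribution.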