Stewy Slocum
This work was done as part of the MATS 7 extension. We'd like to thank Cameron Holmes and Fabien Roger for their useful feedback.
Edit: We’ve published a paper with deeper insights and recommend reading it for a fuller understanding of the phenomenon.
Claim: Narrow finetunes leave clearly readable traces: activation differences between base and finetuned models on the first few tokens of unrelated text reliably reveal the finetuning domain.
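To make the claim concrete, here is a minimal sketch of the core computation: take residual-stream activations from the base and finetuned model on the same unrelated prompt, and average their difference over the first few tokens. The function name and the NumPy stand-in arrays are hypothetical (real activations would come from model forward passes); this only illustrates the shape of the signal, not our full pipeline.

```python
import numpy as np

def first_token_activation_diff(base_acts, ft_acts, k_tokens=5):
    """Mean activation difference over the first k tokens.

    base_acts, ft_acts: arrays of shape (n_tokens, d_model) holding
    residual-stream activations from the base and finetuned model on the
    SAME unrelated prompt. Returns a (d_model,) difference vector; vectors
    like this are what reveal the finetuning domain.
    """
    diff = ft_acts[:k_tokens] - base_acts[:k_tokens]
    return diff.mean(axis=0)

# Toy stand-in activations (hypothetical; real ones come from a model).
rng = np.random.default_rng(0)
base = rng.normal(size=(32, 16))
shift = np.zeros(16)
shift[3] = 2.0  # pretend finetuning consistently shifted one direction
ft = base + shift + 0.01 * rng.normal(size=(32, 16))

v = first_token_activation_diff(base, ft)
print(int(np.argmax(np.abs(v))))  # the shifted dimension dominates: 3
```

In practice one would read out such difference vectors (e.g. via the logit lens or a trained probe) to recover the finetuning domain, but the averaging step above is the essential ingredient.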
Results: