TL;DR: We train LLMs to accept LLM neural activations as inputs and answer arbitrary questions about them in natural language. These Activation Oracles generalize far beyond their training distribution, for example uncovering misalignment or secret knowledge introduced via fine-tuning. Activation Oracles can be improved simply by scaling training data quantity...
This work was done as part of the MATS 7 extension. We'd like to thank Cameron Holmes and Fabien Roger for their useful feedback. Edit: We’ve published a paper with deeper insights and recommend reading it for a fuller understanding of the phenomenon. TL;DR Claim: Narrow finetunes leave clearly readable...
This post presents our motivation for working on model diffing, some of our first results using sparse dictionary methods, and our next steps. This work was done as part of the MATS 7 extension. We'd like to thank Cameron Holmes and Bart Bussman for their useful feedback. Could...
Work completed during a two-month internship supervised by @Jobst Heitzig. Thanks to Phine Schikhof for invaluable conversations and friendly support during the internship, and to Jobst Heitzig, who was an amazing supervisor. Epistemic status: I dedicated two full months to this project. I conducted numerous experiments to...