Abstract: We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text description and an activation reconstructor (AR) that maps the description back to an...
TL;DR: We train LLMs to accept LLM neural activations as inputs and answer arbitrary questions about them in natural language. These Activation Oracles generalize far beyond their training distribution, for example uncovering misalignment or secret knowledge introduced via fine-tuning. Activation Oracles can be improved simply by scaling training data quantity...
Subhash and Josh are co-first authors on this work done in Neel Nanda’s MATS stream. We recently released a new paper investigating sparse probing that follows up on a post we put up a few months ago. Our goal with the paper was to provide a single rigorous data point...
I (Subhash) am a Master's student in the Tegmark AI Safety Lab at MIT. I am interested in recruiting for full-time roles this spring - please reach out if you're interested in working together! TLDR: This blog post accompanies the paper "Language Models Use Trigonometry to Do Addition." Key...
Subhash and Josh are co-first authors. Work done as part of the two-week research sprint in Neel Nanda’s MATS stream. Update (February 2025): We have recently expanded this post into a full paper: https://arxiv.org/abs/2502.16681 Our results are now substantially more negative. We find that SAE probes do not consistently...