Subhash Kantamneni — AI Alignment Forum

Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

Abstract: We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text description and an activation reconstructor (AR) that maps the description back to an...

May 7 · 215

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

by Sam Marks, Adam Karvonen, James Chua, Subhash Kantamneni, Euan Ong, Julian Minder, Clément Dumas, and Owain_Evans

TL;DR: We train LLMs to accept LLM neural activations as inputs and answer arbitrary questions about them in natural language. These Activation Oracles generalize far beyond their training distribution, for example uncovering misalignment or secret knowledge introduced via fine-tuning. Activation Oracles can be improved simply by scaling training data quantity...

Dec 18, 2025 · 154

Takeaways From Our Recent Work on SAE Probing

by Josh Engels, Subhash Kantamneni, Senthooran Rajamanoharan, and Neel Nanda

Subhash and Josh are co-first authors on this work done in Neel Nanda’s MATS stream. We recently released a new paper investigating sparse probing that follows up on a post we put up a few months ago. Our goal with the paper was to provide a single rigorous data point...

Mar 3, 2025 · 30

Language Models Use Trigonometry to Do Addition

I (Subhash) am a Master's student in the Tegmark AI Safety Lab at MIT. I am interested in recruiting for full-time roles this Spring - please reach out if you're interested in working together! TLDR: This blog post accompanies the paper "Language Models Use Trigonometry to Do Addition." Key...

Feb 5, 2025 · 80

SAE Probing: What is it good for?

Subhash and Josh are co-first authors. Work done as part of the two-week research sprint in Neel Nanda’s MATS stream. Update February 2025: We have recently expanded this post into a full paper: https://arxiv.org/abs/2502.16681 Our results are now substantially more negative. We find that SAE probes do not consistently...

Nov 1, 2024 · 34