Human values are functions of latent variables in our minds. But those variables may not correspond to anything in the real world. How can an AI optimize for our values if it doesn't know what our mental variables are "pointing to" in reality? This is the Pointers Problem - a key conceptual barrier to AI alignment.
...We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text description and an activation reconstructor (AR) that maps the description back to an activation. We jointly train the AV and AR with reinforcement learning to reconstruct residual stream activations. Although we optimize for activation reconstruction, the resulting NLA explanations read as plausible interpretations of model internals that, according to our quantitative evaluations, grow more informative over training.
We apply NLAs to model auditing. During our pre-deployment audit of Claude Opus 4.6, NLAs helped diagnose safety-relevant behaviors and surfaced unverbalized evaluation awareness—cases where Claude believed, but did not say, that it was being evaluated.
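A minimal toy sketch of this training signal, assuming a REINFORCE-style setup: the reconstructor is trained directly on reconstruction error, while the verbalizer, which emits discrete tokens, gets a policy-gradient signal weighted by that error. The module classes, sizes, and reward form below are illustrative stand-ins, not the authors' implementation.

```python
# Toy sketch of an NLA-style objective (illustrative assumptions, not the paper's code).
import torch
import torch.nn as nn

D_MODEL, VOCAB, DESC_LEN = 64, 100, 8  # toy sizes

class ToyVerbalizer(nn.Module):
    """Stand-in for the activation verbalizer (AV): activation -> token sequence."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_MODEL, DESC_LEN * VOCAB)

    def forward(self, act):
        logits = self.proj(act).view(-1, DESC_LEN, VOCAB)
        dist = torch.distributions.Categorical(logits=logits)
        tokens = dist.sample()                    # discrete "description"
        log_prob = dist.log_prob(tokens).sum(-1)  # needed for the policy-gradient term
        return tokens, log_prob

class ToyReconstructor(nn.Module):
    """Stand-in for the activation reconstructor (AR): token sequence -> activation."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.out = nn.Linear(D_MODEL, D_MODEL)

    def forward(self, tokens):
        return self.out(self.embed(tokens).mean(dim=1))

av, ar = ToyVerbalizer(), ToyReconstructor()
opt = torch.optim.Adam(list(av.parameters()) + list(ar.parameters()), lr=1e-3)

acts = torch.randn(32, D_MODEL)                   # pretend residual-stream activations
tokens, log_prob = av(acts)
recon = ar(tokens)
recon_err = ((recon - acts) ** 2).mean(dim=-1)    # per-example reconstruction error

# AR gets gradients from the error directly; AV gets a REINFORCE signal
# (reward = -error) because sampling tokens is not differentiable.
loss = recon_err.mean() + (log_prob * recon_err.detach()).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

In the real setting the stand-in modules would be LLMs and the descriptions actual natural language; the point here is only the shape of the joint objective.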
I had Opus 4.7 do a longer investigation where it tested many more problems and had an AI look at the NLA output to see whether it reads like a CoT. I also had it analyze the cases where the model gets the problems wrong. The results are somewhat complicated, but:
Amusingly, just after I posted this, Anthropic released "Teaching Claude why", which has a bunch of information on how they behaviorally align their AIs (or at least how they iterate on particular behaviors/properties). (Though this doesn't seem close to sufficient for a reasonably complete understanding.)
TL;DR: We introduce a testbed based on censored Chinese LLMs, which serve as natural objects of study for secret elicitation techniques. We then study the efficacy of honesty elicitation and lie detection techniques for detecting and removing the falsehoods these models generate.
This post presents a summary of the paper, including examples of transcripts and other miscellaneous findings.
arXiv paper | Code | Transcripts
I'd be curious to see the same results for LLMs' relationship to copyrighted songs. Some LLMs deny knowing the lyrics of famous songs, or deny having these songs in their training data.
This post covers joint work with Wilson Wu, George Robinson, Mike Winer, Victor Lecomte and Paul Christiano. Thanks to Geoffrey Irving and Jess Riedel for comments on the post.
In ARC's latest paper, we study the following problem: given a randomly initialized multilayer perceptron (MLP), produce an estimate for the expected output of the model under Gaussian input. The usual approach to this problem is to sample many possible inputs, run them all through the model, and take the average. Instead, we produce an estimate "mechanistically", without running the model even once. For wide models, our approach produces more accurate estimates, both in theory and in practice.
Paper: Estimating the expected output of wide random MLPs more efficiently than sampling
Code: mlp_cumulant_propagation GitHub repo
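As a toy illustration of what a mechanistic estimate means (not the paper's cumulant-propagation algorithm): for a single ReLU hidden layer, each preactivation is exactly Gaussian under Gaussian input, so the expected output follows from the weights alone. The hard part the paper tackles is the correlated, non-Gaussian hidden units that appear in deeper networks.

```python
# Closed-form expected output of a one-hidden-layer ReLU MLP under Gaussian input,
# compared against the usual Monte Carlo estimate. Illustrative only.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 100, 200, 1

# Randomly initialized MLP: y = W2 @ relu(W1 @ x + b1) + b2
W1 = rng.normal(size=(d_hidden, d_in)) / np.sqrt(d_in)
b1 = rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_out, d_hidden)) / np.sqrt(d_hidden)
b2 = rng.normal(size=d_out)

# Sampling baseline: average the model's output over many Gaussian inputs.
xs = rng.normal(size=(50_000, d_in))
hidden = np.maximum(W1 @ xs.T + b1[:, None], 0)            # (d_hidden, n_samples)
mc_estimate = (hidden.T @ W2.T + b2).mean(axis=0)

# "Mechanistic" estimate: preactivation z_i = w_i . x + b_i is Gaussian with mean b_i
# and std ||w_i||, and E[relu(z)] = mu * Phi(mu/sigma) + sigma * phi(mu/sigma).
mu, sigma = b1, np.linalg.norm(W1, axis=1)
e_relu = mu * norm.cdf(mu / sigma) + sigma * norm.pdf(mu / sigma)
mech_estimate = W2 @ e_relu + b2

print(mc_estimate, mech_estimate)  # should agree closely, with no sampling noise in the latter
```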
We are excited about this result as...
Indeed, tensor network diagrams show up in our algorithm (see Appendix A of the paper). We've also been thinking about mechanistic estimation for tensor network contractions as a problem in its own right, partly because they appear to be needed for harder MLP cases.
[Quickly written, unpolished. Also, it's possible that there's some more convincing work on this topic that I'm unaware of – if so, let me know. Also also, it's possible I'm arguing with an imaginary position here and everyone already agrees with everything below.]
In research discussions about LLMs, I often pick up a vibe of casual, generalized skepticism about model-generated CoT (chain-of-thought) explanations.
CoTs (people say) are not trustworthy in general. They don't always reflect what the model is "actually" thinking or how it has "actually" solved a given problem.
This claim is true as far as it goes. But people sometimes act like it goes much further than (IMO) it really does.
Sometimes it seems to license an attitude of "oh, it's no use reading what the model says in...
But this new kind of transcoder gives us a string of text, in English (or another language of your choice), which we can simply read.
And this string of text simply... tells us what the sub-block is doing.
Natural-Language Autoencoders sound a lot like this, although they interpret activations rather than transcode them, and of course they don't have the second property mentioned in the post (zero reconstruction error).
This is the latest work in our Parameter Decomposition agenda. We introduce a new parameter decomposition method, adVersarial Parameter Decomposition (VPD)[1], and use it to decompose the parameters of a small[2] language model.
VPD greatly improves on our previous techniques, Stochastic Parameter Decomposition (SPD) and Attribution-based Parameter Decomposition (APD). We think the parameter decomposition approach is now more-or-less ready to be applied at scale to models people care about.

Importantly, we show that we can decompose attention layers, which interp methods like transcoders and SAEs have historically struggled with.

We also build attribution graphs of the model for some prompts, using causally important parameter subcomponents as the nodes, and interpret parts of them.
While making these graphs, we discovered that our adversarial ablation method seemed pretty important for faithfully...
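As a rough, hypothetical sketch of what "causally important parameter subcomponents as the nodes" could look like (not VPD's actual importance measure or its adversarial ablation): write a weight matrix as a sum of rank-1 subcomponents, ablate each one on a prompt, and measure how much the output distribution shifts.

```python
# Toy illustration of ablation-based importance for parameter subcomponents.
# Everything here (the rank-1 decomposition, the KL measure) is an illustrative assumption.
import torch
import torch.nn.functional as F

d, n_components, vocab = 16, 8, 50
torch.manual_seed(0)

# Pretend decomposition: W = sum_c U_c @ V_c, each subcomponent rank-1.
U = torch.randn(n_components, d, 1)
V = torch.randn(n_components, 1, d)
subcomponents = U @ V                         # (n_components, d, d)
W = subcomponents.sum(dim=0)
unembed = torch.randn(vocab, d)

x = torch.randn(d)                            # activation at some token position
full_logits = unembed @ (W @ x)

importance = []
for c in range(n_components):
    W_ablated = W - subcomponents[c]          # remove one subcomponent from the sum
    ablated_logits = unembed @ (W_ablated @ x)
    kl = F.kl_div(F.log_softmax(ablated_logits, dim=-1),
                  F.log_softmax(full_logits, dim=-1),
                  log_target=True, reduction="sum")
    importance.append(kl.item())

# Subcomponents whose ablation shifts the output most would become the nodes
# of an attribution graph for this prompt.
important = sorted(range(n_components), key=lambda c: -importance[c])[:3]
print(important)
```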
Cool work!
My current understanding is that this part of the tech tree centrally relies on some kind of token-wise sparsity - the components are built such that they have an impact only on a small subset of tokens.