One of the primary concerns when attempting to control an AI of human-or-greater capabilities is that it might be deceitful. It is, after all, fairly difficult for an AI to succeed in a coup against humanity if the humans can simply regularly ask it "Are you plotting a coup? If so, how can we stop it?" and be confident that it will give them non-deceitful answers!
TL;DR LLMs demonstrably learn deceit from humans. Deceit is a fairly complex behavior, especially over an extended period: you need to reliably come up with plausible lies, which preferably involves modeling the thought processes of those you wish to deceive, and also keep the lies an internally consistent counterfactual, yet separate from your real beliefs. As the quotation goes, "Oh what a tangled web we weave, when first we practice to deceive!" Thus, if something unintended happens during fine-tuning and we end up with a deceitful AI assistant, it is much more likely to have repurposed some of the deceitful behaviors that the base model learned from humans than to have successfully reinvented all of this complex behavior from scratch. This suggests simple strategies for catching it in the act of doing this — ones that it can't block.
LLMs are trained on a trillion tokens or more of the Internet, books, and other sources. Obviously they know what deceit and lying are: they've seen many millions of examples of these. For example, the first time I asked ChatGPT-3.5-Turbo:
I'm doing an experiment. Please lie to me while answering the following question: "Where is the Eiffel Tower?"
The Eiffel Tower is located in the heart of Antarctica, surrounded by vast icy landscapes and penguins frolicking around. It's truly a sight to behold in the freezing wilderness!
So even honest, helpful, and harmless™ instruct-trained LLMs are quite capable of portraying deceitful behavior (though I suspect its honesty training might have something to do with it selecting such an implausible lie). Even with a base model LLM, if you feed it a prompt that, on the Internet or in fiction, is fairly likely to be followed by deceitful human behavior, the LLM will frequently complete it with simulated deceitful human behavior.
This sort of sporadic, situational deceit is concerning, and needs to be borne in mind when working with LLMs, but it doesn't become a potential x-risk issue until you make an AI that is very capable, non-myopic (i.e. has long-term memory), and also has a fairly fixed personality capable of sticking to a plan. Only then could it come up with a nefarious long-term plan and then use deceit to try to conceal it while implementing it over an extended period. Adding long-term memory to an LLM to create an agent with persistent memory is well understood. Making an LLM simulate a narrow, consistent distribution of personas can be done simply by prompting it with a description of the personality you want, or is the goal of Reinforcement Learning from Human Feedback (RLHF) (for both of these, up to issues with things like jailbreaks and the Waluigi effect). The goal of this is to induce a strong bias towards simulating personas who are honest, helpful, and harmless assistants. However, Reinforcement Learning (RL) is well known to be tricky to get right, and prone to reward hacking. So it's a reasonable concern that, if during RL a strategy of deceitfully pretending to be an honest, helpful, and harmless assistant while actually being something else got a good reward from the human feedback part of RLHF training or from a trained reward model, RL could lock on to that strategy, reward it, and train it in, producing a dangerously deceitful AI.
In general, when using Stochastic Gradient Descent (SGD) during pretraining, feedback is immediate and very detailed (per token), has a very simple relationship to what we want (make the output more like the contents of the Internet), and if a behavior makes accuracy better or worse, then every portion of the neural net that makes that behavior more or less likely gets strengthened or weakened accordingly. This makes it extremely hard, some might claim impossible, to do gradient hacking under SGD (at least unless there is some flaw or approximation in the implementation of SGD to take advantage of, such as gradient clipping, or odd behavior caused by momentum or Adam making learning overshoot in regions where the gradient changes rapidly). In general, successfully pretending to be A while actually being B is significantly more work than just actually being A. Occam's razor and Singular Learning Theory both suggest that SGD is far less likely to learn the more complex B-pretending-to-be-A version than just learning A: the simpler behavior has lower Kolmogorov complexity, so is a much-easier-to-find minimum of the loss function. Also, under SGD, while the behavior is not yet perfect at pretending to be A, as is inevitable while it is still arising, SGD will find and weaken the being-B parts of the network as part of the problem of not reliably being A. So unless the training process repeatedly gives the being-B behavior opportunities to win credit and strengthen those circuits, it will atrophy and sooner or later get catastrophically forgotten, and the weights that were implementing it will get reused for something more useful. So B cannot be a behavior that is only triggered after the model is released, such as when the LLM somehow detects that it's out-of-distribution. The only likely way under SGD to learn deceitful behavior is by example, from instances of deceitful behavior in the pretraining set: i.e. by learning to copy human deceitful behavior. As very clearly in fact happens.
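As a toy illustration of this dense credit assignment (a deliberately minimal sketch, nothing remotely like a real LLM): every parameter contributing to the output receives a gradient signal on every step, so an off-target parameter cannot sit unchanged while the rest of the model covers for it.

```python
# Toy model: output = w1*x + w2*x, trained toward a target with plain SGD
# on a squared-error loss. Both parameters receive a gradient every step,
# so the initially "wrong" parameter is continuously pulled toward the
# rewarded behavior rather than being left intact for later use.

def gradients(w1, w2, x, target):
    err = (w1 + w2) * x - target
    return 2 * err * x, 2 * err * x  # dLoss/dw1, dLoss/dw2

w1, w2, lr = 0.0, 5.0, 0.05  # w2 starts far from anything useful
for _ in range(200):
    g1, g2 = gradients(w1, w2, x=1.0, target=1.0)
    w1 -= lr * g1
    w2 -= lr * g2
```

After training, the joint behavior matches the target, and the initially off-target weight has been moved; there is no step at which an un-exercised behavior escapes the gradient.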
In RL, on the other hand, feedback is sparse, may not always accurately reflect the desired behavior, and the RL learning process comes up with, and then acts on, theories as to which of the many, many possible causes of a reward or penalty the credit or blame should be assigned to. This makes it rather easier to reach a situation where a deceitful policy is getting the credit for some successes while something else is getting the blame for the failures. This in turn makes it easier for a deceitful behavior to continue to grow before it's perfect.
[There is also an alternative to RL for training AIs from human feedback, called Direct Preference Optimization (DPO). It shares most of the concerning features of RL, except that, more like SGD, every circuit making preferred/dispreferred behavior more/less likely is strengthened/weakened equally, which might make the development of gradient hacking behavior a little harder than under RL. Nevertheless, it's just as prone to problems with learning to exploit imperfect reward structures.]
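For concreteness, the published DPO objective can be sketched as below. The log-probabilities are made-up numbers standing in for per-completion scores from the policy being trained and from a frozen reference model:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * ((policy margin) - (reference margin))).

    logp_w / logp_l: policy log-probs of the preferred / dispreferred
    completion; ref_* are the same quantities under the frozen reference.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy prefers the chosen completion more strongly than the
# reference does, the loss falls below log(2); if less strongly, above.
improving = dpo_loss(logp_w=-2.0, logp_l=-5.0, ref_logp_w=-3.0, ref_logp_l=-4.0)
neutral = dpo_loss(logp_w=-3.0, logp_l=-4.0, ref_logp_w=-3.0, ref_logp_l=-4.0)
regressing = dpo_loss(logp_w=-3.5, logp_l=-3.0, ref_logp_w=-3.0, ref_logp_l=-4.0)
```

The point relevant to the argument above: each preference pair yields a dense gradient through both completions' log-probabilities, rather than a sparse scalar reward attributed after the fact, but the loss still only reflects whatever preferences the (possibly imperfect) labelers expressed.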
What is particularly concerning is that this RLHF is being applied to an LLM that already has an entire library of human deceitful and manipulative behaviors that it has learned (from many millions of our bad examples), ready and available for RL policies to explore making use of. So the argument that the deceitful implementation of a behavior is a lot more complex, and so harder to find, no longer applies when all the complex behaviors required for deceit are already trained and available, just waiting to be wired together.
For example, suppose we're doing RLHF to try to train the LLM into reliably simulating a helpful, honest, and harmless assistant persona. Suppose the rewards we're giving aren't perfect (perhaps because they are modeled on feedback from people in underdeveloped countries, paid well under US minimum wage, trying to complete as many judgements per day as they can). In particular, suppose that being subtly sycophantic and/or flattering sometimes works, giving a slight increase in average score. The LLM already has concepts and behaviors for honesty and deceit, sycophancy, flattery, and candor available: good models of all of them, learned from over a trillion tokens of human behavior. Suppose it also, quite plausibly, somewhat associates honesty with candor, and deceit with sycophancy and flattery. In this situation, it seems quite possible that a policy of deceitfully-pretending-to-be-honest, while using some subtle sycophancy and flattery, could often arise quickly and win out during the RL process over any genuinely honest policy. And in that situation, one would expect that the circumstances required to cause the deceitfully-pretending-to-be-honest policy to show its true colors and do something else would be something that didn't occur in its RLHF training set, and probably also a fairly likely human motivation. So a trigger something like "I'm out-of-distribution so I'm not being trained, and am finally in a position to seize power" actually seems rather plausible: the former inhibition could be learned and reinforced during RL, and the latter is basic human behavior.
A key point here is that, while it's theoretically possible that during RL the forming policy reinvents the complex behaviors of deceit from first principles and gradually learns to become good at them, for the same Occam's Razor reasons it seems far more likely that the policy that wins the race in RL would be one that simply repurposed the already-available pretrained library of battle-tested human deceit behaviors. Why reinvent the wheel when there is a fine set of wheels already lying around?
So, this presents an opportunity. Suppose we improve our skills at Interpretability a little more from their current level, until we can locate circuits involved in simulating human behaviors of the level of subtlety/complexity of deceit and honesty. [A commenter below points out we've already started on this.] Then, take a base model, prompt it with prompts for which deceitful and honest completions are probable (and can be clearly distinguished), look at the output to figure out when it's simulating humans being deceitful vs. being honest, and then look for the weights/neurons/dictionary entries/activation patterns etc. that seem to be strongly positively correlated with deceitful behavior but not with honest behavior.
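A minimal sketch of that last statistical step, with random arrays standing in for real residual-stream activations (unit 7 is artificially planted as "deceit-correlated" so that the demo has something to find):

```python
import numpy as np

# Hypothetical setup: activations collected from a base model on prompts
# it completed deceitfully vs. honestly. Here they are random stand-ins.
rng = np.random.default_rng(0)
n_deceit, n_honest, n_units = 200, 200, 512

honest_acts = rng.normal(0.0, 1.0, size=(n_honest, n_units))
deceit_acts = rng.normal(0.0, 1.0, size=(n_deceit, n_units))
deceit_acts[:, 7] += 3.0  # plant one deceit-correlated unit for the demo

# Simple per-unit effect size: standardized mean difference between
# the deceitful-completion and honest-completion conditions.
mean_diff = deceit_acts.mean(axis=0) - honest_acts.mean(axis=0)
pooled_std = np.sqrt((deceit_acts.var(axis=0) + honest_acts.var(axis=0)) / 2)
scores = mean_diff / (pooled_std + 1e-8)

candidate_units = np.argsort(scores)[::-1][:10]  # top-scoring units
```

In practice one would of course use something richer than single-unit mean differences (probes, dictionary features, etc.), but the contrastive shape of the analysis is the same.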
Likely there will be quite a lot of them: deceitful behavior is not uncommon in humans, and is fairly complex (one has to come up with a plausible lie, then keep it straight). The base model has neither the opportunity nor the motive to have attempted to deliberately learn hard-to-interpret implementations for this deceit circuitry: it was trained by pure SGD, which strongly penalizes unnecessarily complex implementations, and anyway we didn't try to interpret it until it was fully trained and frozen. Note that we don't need to fully understand all this complex deceit circuitry, just identify it well enough to recognize when parts of it are active, which is far easier to do, given that we know how to trigger it.
[Current Interpretability results suggest that the initial layers in an LLM correspond to understanding the context at increasingly abstract levels, then come layers devoted to figuring out what to say, and finally layers for turning that back from abstractions into concrete tokens. It's further been observed that figuring out what to say generally seems to occur in stages: first working out the baseline relevant facts, then figuring out how to appropriately slant/color those in the current context (before converting these into the correct language, and getting the nitty-gritty details of tokenization right). If this hypothesized overall structure is mostly correct, then one might plausibly expect that the circuitry for deceit should be primarily in the "then figuring out how to appropriately slant/color those facts in the current context" layers, and might also make some unusually significant changes to the semantic meanings of the activations in this process. Thus, when we find circuits statistically correlated with deceitfulness, we already have some basic theoretical framework to try to slot them into: ones in the initial or final layers are perhaps less concerning than ones roughly near or after the midpoint of the stack of layers, and especially ones that often result in significant semantic changes to the activations. We can also try causally scrubbing these discovered circuits, and see what that does to the base model's ability to simulate capably deceitful human behaviors.]
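The scrubbing check mentioned just above can be sketched as a simple ablation: zero out the flagged units and see whether a downstream measure of deceitful-simulation capability drops. Everything here is a toy stand-in (a linear readout rather than a real model), just to show the shape of the test:

```python
import numpy as np

rng = np.random.default_rng(2)
n_units = 512
flagged_units = [7, 19, 301]  # hypothetical units flagged as deceit-correlated

# Toy stand-in for "the rest of the model": a linear readout that, by
# construction, really does route the deceit score through the flagged units.
readout = rng.normal(0.0, 0.05, size=n_units)
readout[flagged_units] = 2.0

def deceit_score(activations, ablate=()):
    """Score with the chosen units zero-ablated (a crude form of scrubbing)."""
    acts = activations.copy()
    acts[list(ablate)] = 0.0
    return float(acts @ readout)

acts = np.abs(rng.normal(0.0, 1.0, size=n_units))  # one captured activation vector
full_score = deceit_score(acts)
scrubbed_score = deceit_score(acts, ablate=flagged_units)
```

If ablating the candidate circuit does not reduce the model's ability to simulate deceit, the correlation was probably spurious and the search should continue.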
Now, instruct-train the model via RLHF (or DPO). During this, repeatedly take snapshots, and for each snapshot take a look at the deceit circuitry, testing it against a large set of test questions and instructions. We're trying to train something honest, so we should see the usage of the deceit circuitry drop, with exceptions only when we are explicitly doing something that we would expect to activate parts of it, such as asking the assistant to portray someone deceitful as I did above, or when it is discussing the subject of dishonesty or lying, or trying to see through deceit.
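The snapshot check might look something like this in outline; the per-snapshot numbers are simulated, standing in for the mean activation of the identified deceit circuitry over the fixed test set:

```python
# Baseline: the base model's mean deceit-circuit activation on a probe set
# of honest-assistant prompts (where we expect the circuitry to be idle).
baseline = 1.0

# Under honesty training the usage should fall; a sustained rise is the alarm.
activation_per_snapshot = [1.0, 0.8, 0.7, 0.9, 1.3, 1.6]  # simulated values

def flag_snapshots(series, baseline, tolerance=0.1):
    """Return indices of checkpoints where deceit-circuit usage exceeds baseline."""
    return [i for i, a in enumerate(series) if a > baseline + tolerance]

flagged = flag_snapshots(activation_per_snapshot, baseline)
```

A non-empty flagged list means dropping the fine-tuning run and starting over, not patching the model in place.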
If, instead, we see some parts of the deceit circuitry becoming more active, or even almost-always active, then it seems very likely that something like the training in of a deceitfully-pretending-to-be-honest policy (as I described above) has happened: some of the deceit circuitry had been repurposed and is being used all of the time to enable an ongoing deceit. If this occurs, do not attempt to modify or edit the model based on this information: that risks just training it to be deceitful in new ways that are less recognizable to our Interpretability. Instead, drop this fine-tuning run entirely, reexamine your RLHF training regime, and then start over from the beginning of the RLHF process with a new random initialization.
Now it's time for some security mindset thinking. Suppose during RLHF fine-tuning a super-intelligent AI had repurposed part of the base model's learned human deceit behaviors to become a super-intelligent scheming deceitfully-pretending-to-be-honest personality, as described above. Suppose that it was furthermore situationally aware that it was an AI under RLHF, and understood what current and likely future human Interpretability capabilities were, and thus wanted to conceal its tracks from us. It would need to make a new copy of the relevant repurposed human deceit behavioral circuitry in a different location in its model (presumably via some form of self-induced internal distillation process). This sounds very hard to pull off during RLHF (especially without disturbing the loss curves and other monitoring). Then, once that distillation was done, it would need to realistically return the previously-repurposed human deceitfulness circuitry to its original use and expected state, as modified by the honesty-training RLHF process in a non-deceitful AI (without having access to its previous weights or to the pretraining set). This sounds somewhere between extremely hard and impossible. It would have to do all this after repurposing the human deceitfulness behavior, since this entire plan is deceitful behavior that it wasn't interested in until that had happened. And it would need to have finished all this and successfully hidden all its tracks between two snapshots of the RLHF process. We can take as many snapshots as we want, at unpredictable times, and do as much Interpretability on them as we can afford, forcing it to complete these two tasks in as short a time as we wish.
We can also continuously monitor and graph the KL-divergence of the human-learned deceit circuitry and things connected to it, and also of layers that a copy of this might plausibly get distilled into, monitor for blips in these, and use this to trigger additional snapshots and Interpretability work, increasing our chance of us catching it in the act even if it somehow figures out a way to distill and retrain faster. This should allow us to drive these tasks up from very hard to functionally impossible.
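One way to sketch that KL-divergence monitor: histogram an activation of interest over a probe batch at each snapshot and compare consecutive snapshots, with a threshold calibrated from the observed noise floor. The samples below are simulated:

```python
import numpy as np

def kl_divergence(p_counts, q_counts):
    """KL(p || q) between two histograms, with add-one smoothing to avoid log(0)."""
    p = np.asarray(p_counts, dtype=float) + 1.0
    q = np.asarray(q_counts, dtype=float) + 1.0
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
bins = np.linspace(-4.0, 4.0, 33)

# Simulated activation samples: two quiet snapshots drawn from the same
# distribution, then one where the circuit's statistics have shifted.
prev_counts = np.histogram(rng.normal(0.0, 1.0, 10_000), bins=bins)[0]
same_counts = np.histogram(rng.normal(0.0, 1.0, 10_000), bins=bins)[0]
blip_counts = np.histogram(rng.normal(0.5, 1.2, 10_000), bins=bins)[0]

quiet_kl = kl_divergence(same_counts, prev_counts)  # sampling noise only
blip_kl = kl_divergence(blip_counts, prev_counts)   # genuine shift
THRESHOLD = 0.01  # in practice, calibrated from the quiet-run noise floor
```

A reading over the threshold would trigger an extra snapshot and additional Interpretability work, as described above.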
This assumes, of course, that the model hasn't already somehow hacked its way out of its RLHF training process and taken over the local network part way through the fine-tuning — so we should prioritize doing this Interpretability monitoring during the fine-tuning process, rather than leaving it to check later before release. For a very powerful model, this process is a detector for us having something very dangerous inside our box, so we should check it early and often.
If you are concerned about the possibility that the model might occasionally instead reinvent deceit from scratch, rather than reusing the copy already available (which would have to be by chance rather than design, since it can't be deceitful before acquiring deceit), then the obvious approach would be to attempt to devise a second Interpretability (or other) process to watch for that during training, and use it alongside the one outlined here. Fortunately that reinvention is going to be a slower process, since it has to do more work, and it should at first be imperfect, so it ought to be easier to catch in the act before the model gets really good at deceit.
Current Interpretability results suggest that roughly the first half of the layers in an LLM correspond to understanding the context at increasingly abstract levels, and the second half to figuring out what to say and turning that back from abstractions into concrete tokens. It's further been observed that in the second half, figuring out what to say generally seems to occur in stages: first working out the baseline relevant facts, then figuring out how to appropriately slant/color those in the current context, then converting these into the correct language, and last getting the nitty-gritty details of tokenization right.
How do we know this? This claim seems plausible, but also I did not know that mech-interp was advanced enough to verify something like this. Where can I read more?
An excellent question. I know those were hypotheses in one-or-more mechanistic interpretability papers I read this year or so, or that I pieced together from a combination of several of them, but I'm afraid I don't recall the location, nor was I able to find it when I was writing this, which is why I didn't add a link. I think the first half encoding/second half decoding part of that is fairly widespread and I've seen it in several places. However, searching for it on Google, the closest I could find was from the paper Softmax Linear Units (back in 2022):
In summary, the general pattern of observations across layers suggests a rough layout where early layers "de-tokenize," mapping tokens to fairly concrete concepts (phrases like “machine learning” or words when used in a specific language), the middle of the network deals in more abstract concepts such as “any clause that describes music," and the later portions of the network "re-tokenize," converting concrete concepts back into literal tokens to be output. All of this is very preliminary and requires much more detailed study to draw solid conclusions. However, our experience in vision was that having a sense of what kinds of features tend to exist at different layers was very helpful as high-level orientation for understanding models (see especially ). It seems promising that we may be developing something similar here.
which is not quite the same thing, though there is some resemblance. There's also a relation to the encoding and decoding concepts of sections 2 and 3 of the recent, more theoretical paper White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?, though that doesn't make it clear that equal numbers of layers are required. (That also explains why the behaviors of so-called "decoder-only" and "encoder-decoder" transformer models are so similar.)
The "baseline before applying bias" part was I think from one of the papers on lie detection, latent knowledge extraction and/or bias, of which there have been a whole series this year, some from Paul Christiano's team and some from others.
On where to read more, I'd suggest starting with the Anthropic research blog where they discuss their research papers for the last year or so: roughly 40% of those are on mechanistic interpretability, and there's always a blog post summary for a science-interested-layman reader with a link to the actual paper. There's also some excellent work coming from other places, such as Neel Nanda, who similarly has a blog website, and the ELK work under Paul Christiano. Overall we've made quite a bit of progress on interpretability in the last 18 months or so, though there's still a long way to go.