Fabien Roger


It's exciting to see a new research direction which could have big implications if it works!

I think that Hypothesis 1 is overly optimistic:

Hypothesis 1: GPT-n will consistently represent only a small number of different “truth-like” features in its activations.
[...] 1024 remaining perspectives to distinguish between

A few thousand features is the optimistic estimate of the number of truth-like features. I argue below that it's possible, and likely, that there are 2^hundreds of truth-like features in LLMs.

Why it's possible to have 2^hundreds of truth-like features
Let's say that your dataset of activations is composed of d-dimensional one-hot vectors and their element-wise opposites. Each of these represents a "fact", and negating a fact gives you the opposite vector. Then any feature $v \in \{-1, 1\}^d$ is truth-like (up to a scaling constant): for each "fact" $x$ (a one-hot vector multiplied by -1 or 1), $\langle v, x \rangle \in \{-1, 1\}$, and for its opposite fact $-x$, $\langle v, -x \rangle = -\langle v, x \rangle$. This gives you $2^d$ features which are all truth-like.
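The construction above can be checked numerically for a small d. This is only a sketch under assumptions: "truth-like" is taken to mean that a feature's projection is ±1 on every fact and flips sign on the negated fact, and the names `facts`, `features` and `scores` are illustrative.

```python
import itertools
import numpy as np

d = 4  # kept small so that all 2^d sign vectors can be enumerated

# Dataset: every one-hot vector and its element-wise opposite.
facts = np.concatenate([np.eye(d), -np.eye(d)])  # shape (2d, d)
negated = -facts                                 # negating a fact flips its vector

# Candidate features: every vector in {-1, 1}^d.
features = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))  # (2^d, d)

scores = facts @ features.T        # projection of each fact on each feature
neg_scores = negated @ features.T  # same for the negated facts

all_unit = bool(np.all(np.abs(scores) == 1.0))        # every projection is +1 or -1
all_flip = bool(np.array_equal(neg_scores, -scores))  # negation flips the sign
n_truth_like = len(features)

print(n_truth_like, all_unit, all_flip)
```

Under this assumed definition, every one of the 2^d sign vectors passes both checks, so the count grows exponentially with the dimension.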

Why it's likely that there are 2^hundreds of truth-like features in real LLMs
I think the encoding described above is unlikely. But in a real network, you might expect the network to encode groups of facts like "facts that Democrats believe but Republicans don't", "climate change is real vs climate change is fake", ... When, late in training, it finds ways to use "the truth", it doesn't need to build a new "truth-circuit" from scratch: it can just select the right combination of groups of facts.

(An additional reason for concern is that in practice you find "approximate truth-like directions", and there can be many more approximate truth-like directions than truth-like directions.)

Even if hypothesis 1 is wrong, there might be ways to salvage the research direction. Thousands of bits of information would be able to distinguish between the 2^thousands truth-like features.

I'm surprised you put the emphasis on how Gaussian your curves are, while your curves are much less Gaussian than you would naively expect if you agreed with the "LLMs are a bunch of small independent heuristics" argument.

Even ignoring outliers, some of your distributions don't look like Gaussian distributions to me. In Geogebra, exponential decays fit well, Gaussians don't.

I think your headlines are misleading, and that you're providing evidence against "LLMs are a bunch of small independent heuristics".

I agree, this wasn't very clear. I'll add a few words.

It also surprised me! It's so slow to run that I wasn't able to experiment with it a lot, but it's definitely interesting that it performs so well. Also, earlier experiments showed that RLACE isn't very consistent and running it multiple times yielded different results (while CDE is much more consistent), so what's happening at layer 7 might be a fluke, RLACE getting unlucky. I'll de-emphasize the "CDE outperforming RLACE" claims.

Looking at matrix weights through the de-embedding matrix looks interesting!

I'm unsure what kind of "matrix action" you're hoping to capture with SVD.

In the case of symmetric square matrices, the singular directions are the eigenvectors, i.e. the vectors that the matrix only multiplies by a constant value. If the scaling factor is positive, this is what I would call "inaction". On the other hand, even a symmetric square matrix can "stretch" vectors in interesting ways. For example, if you take $M = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$, I would say that the "interesting action" is not done to the singular directions (one of which is sent to zero, and the other one is kept intact), but something interesting is going on with $(1, 1)$ and $(1, -1)$: they both get sent to the same vector.
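To make the point concrete, here is a small numerical sketch. The matrix (the projection diag(1, 0)) and the two test vectors are assumptions chosen to match the description above: one singular direction is kept intact, the other is sent to zero, and two different vectors collapse onto the same output.

```python
import numpy as np

# Assumed example: symmetric matrix with eigenvalues 1 and 0.
M = np.array([[1.0, 0.0],
              [0.0, 0.0]])

U, S, Vt = np.linalg.svd(M)  # rows of Vt are the right singular directions

# Along its singular directions, M only rescales: one is kept, one is zeroed.
kept = M @ Vt[0]    # equals S[0] * Vt[0], with S[0] = 1
zeroed = M @ Vt[1]  # the zero vector, since S[1] = 0

# Two distinct vectors that are not singular directions.
a = np.array([1.0, 1.0])
b = np.array([1.0, -1.0])
Ma, Mb = M @ a, M @ b  # both are sent to the same vector

print(S, Ma, Mb)
```

The singular values alone report "scale by 1" and "scale by 0"; the merging of `a` and `b` only shows up when you apply the matrix to vectors off the singular directions.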

So I'm unsure what interesting algorithm could be captured only by looking at singular directions. But maybe you're onto something, and there are other quantities computed in similar ways which could be more significant! Or maybe my intuition about square symmetric matrices is hiding from me the interesting things that SVD's singular directions represent. What do you think?

Thank you for the post!

I found it interesting to think about how self-supervised learning + RL can lead to human-like value formation, but I'm not sure how much predictive power you gain from the shards. The model of value formation you present feels close to the AlphaGo setup:

You have an encoder E, an action decoder D, and a value head V. You train D°E with something close to self-supervised learning (not entirely accurate, but I can imagine other RL systems trained with D°E doing exactly supervised learning), and train V°E with hard-coded sparse rewards. This looks very close to shard theory, except that you replace V with a bunch of shards, right? However, I think this latter part doesn't make predictions different from "V is a neural network", because neural networks often learn context-dependent things, and I expect the AlphaGo V-network to be very context-dependent.
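The E / D / V structure described above can be sketched as a shared encoder with two heads. This is a toy forward pass under assumptions, not AlphaGo's actual architecture: the layer sizes, the random weights `W_E`, `W_D`, `W_V`, and the tanh/softmax choices are all hypothetical, and no training is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, hidden_dim, n_actions = 8, 16, 4  # hypothetical sizes

# Randomly initialised weights, for illustration only.
W_E = rng.normal(size=(obs_dim, hidden_dim)) * 0.1
W_D = rng.normal(size=(hidden_dim, n_actions)) * 0.1
W_V = rng.normal(size=(hidden_dim, 1)) * 0.1

def E(obs):
    """Shared encoder: observation -> hidden representation."""
    return np.tanh(obs @ W_E)

def D(h):
    """Action decoder: hidden representation -> action probabilities."""
    logits = h @ W_D
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)

def V(h):
    """Value head: hidden representation -> scalar value in [-1, 1]."""
    return np.tanh(h @ W_V).squeeze(-1)

obs = rng.normal(size=(3, obs_dim))  # a batch of 3 dummy observations
h = E(obs)
action_probs = D(h)  # D°E: would be trained with (self-)supervised targets
values = V(h)        # V°E: would be trained from sparse rewards
```

In this picture, replacing V with "a bunch of shards" would only change what sits inside the `V` function, which is why the question is whether that makes predictions different from "V is a neural network".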

Is sharding a way to understand what neural networks can do in human-understandable terms? Or is it a claim about what kind of neural network V is (because there are neural networks which aren't very "shard-like")?

Or do you think that sharding explains more than "the brain is like AlphaGo"? For example, maybe it's hard for different parts of the V network to self-reflect. But that feels pretty weak, because humans don't do that much either. Did I miss important predictions that shard theory makes and the classic RL + supervised learning setup doesn't?