Transformers Represent Belief State Geometry in their Residual Stream

Is it accurate to summarize the headline result as follows?

Train a Transformer to predict next tokens on a distribution generated from an HMM.
One optimal predictor for this data would maintain a belief over which of the three HMM states we are in and perform Bayesian updating on each new token. That is, it maintains $p(\text{hidden state} = H_i)$.
Key result: A linear probe on the residual stream is able to reconstruct $p(\text{hidden state} = H_i)$.
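
For concreteness, one standard way to write that Bayesian update is given below. The notation is an assumption made here, not taken from the post: $\eta_t$ is the belief as a row vector over hidden states after $t$ tokens, $T^{(x)}_{ij}$ is the probability of emitting token $x$ while moving from state $H_i$ to state $H_j$, and $\mathbf{1}$ is a column vector of ones.

$$
\eta_{t+1} = \frac{\eta_t \, T^{(x_{t+1})}}{\eta_t \, T^{(x_{t+1})} \, \mathbf{1}}, \qquad \eta_t(i) = p(\text{hidden state} = H_i \mid x_1, \dots, x_t)
$$

The numerator pushes the current belief through the transitions compatible with the observed token, and the denominator renormalizes so the result is again a probability distribution.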

(I don't know what Computational Mechanics or MSPs are so this could be totally off.)

EDIT: Looks like yes. From this post:

Part of what this all illustrates is that the fractal shape is kinda… baked into any Bayesian-ish system tracking the hidden state of the Markov model. So in some sense, it’s not very surprising to find it linearly embedded in activations of a residual stream; all that really means is that the probabilities for each hidden state are linearly represented in the residual stream.
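
To make "linearly represented" concrete, here is a minimal sketch of an affine probe fit by least squares. The activations are synthetic: belief vectors are pushed through a random affine map into a 64-dimensional space with a little noise added, which stands in for a residual stream. The dimensions, noise level, and the name `fit_affine_probe` are all assumptions for illustration and are not the authors' setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "ground truth" beliefs over 3 hidden states (points on the simplex).
n_samples, n_states, d_model = 2000, 3, 64
beliefs = rng.dirichlet(np.ones(n_states), size=n_samples)

# Stand-in for residual-stream activations: an unknown affine image of the
# beliefs plus a little noise. A real experiment would read these from a model.
W_true = rng.normal(size=(n_states, d_model))
b_true = rng.normal(size=d_model)
acts = beliefs @ W_true + b_true + 0.01 * rng.normal(size=(n_samples, d_model))

def fit_affine_probe(X, Y):
    """Least-squares fit of Y ~ X @ W + b; returns (W, b)."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return coef[:-1], coef[-1]

# Fit on the first half of the data, evaluate on the second half.
n_train = n_samples // 2
W, b = fit_affine_probe(acts[:n_train], beliefs[:n_train])
pred = acts[n_train:] @ W + b
mse = np.mean((pred - beliefs[n_train:]) ** 2)
print(f"held-out mean squared error: {mse:.2e}")
```

If the probabilities really are an affine function of the activations, the held-out error of such a probe should be small. A large error would suggest that the information is absent or encoded nonlinearly.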

Adam Shai · 2y · 73

That is a fair summary.

habryka · 2y · 41

Promoted to curated: Formalizing what it means for transformers to learn "the underlying world model" when engaging in next-token prediction tasks seems pretty useful, in that it's an abstraction that I see used all the time when discussing risks from models where the vast majority of the compute was spent in pre-training, where the details usually get handwaved. It seems useful to understand what exactly we mean by that in more detail.

I have not done a thorough review of this kind of work, but it seems to me that others also thought the basic ideas in the work hold up, and reading this post gave me crisper abstractions for talking about this kind of stuff in the future.

cousin_it · 2y (edited) · 30

I have maybe a naive question. What information is needed to find the MSP image within the neural network? Do we have to know the HMM to begin with? Or could it be feasible someday to inspect a neural network, find something that looks like an MSP image, and infer the HMM from it?

| | Data Generating Process | Belief State Process |
|---|---|---|
| States belong to | The data generating mechanism | The observer of the outputs of the data generating process |
| States are | Sets of sequences that constrain the future in particular ways | The observer's beliefs over the states of the data generating process |
| Sequences of hidden states emit | Valid sequences of data | Valid sequences of data |
| Interpretation of emissions | The observables/tokens the data generating process emits | What the observer sees from the data generating process |

PIBBSS is hiring! I wholeheartedly recommend them as an organization.

One way to conceptualize this is to think of "the world" as having some hidden structure (initially unknown to you) that emits observables. Our task is then to take sequences of observables and infer the hidden structure of the world - maybe in the service of optimal future prediction, but also maybe just because figuring out how the world works is inherently interesting. Inside of us, we have a "world model" that serves as the internal structure that lets us "understand" the hidden structure of the world. The term world model is contentious and nothing in this post depends on that concept much. However, one motivation for this work is to formalize and make concrete statements about people's intuitions and arguments regarding neural networks and world models - which are often handwavy and ill-defined.

Technically speaking, the term process refers to a probability distribution over infinite strings of tokens, while a presentation refers to a particular HMM that produces strings according to the probability distribution. A process has an infinite number of presentations.
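
A small sketch of the distinction: two different HMMs (presentations) that assign identical probabilities to every finite word and so define the same process. Both HMMs are hypothetical fair-coin examples chosen for illustration and are not taken from the post. Word probabilities are computed as $\pi \, T^{(x_1)} \cdots T^{(x_n)} \, \mathbf{1}$, where $T^{(x)}$ is the token-labeled transition matrix (notation assumed here).

```python
import itertools
import numpy as np

def word_probability(pi, T, word):
    """P(word) = pi @ T[x1] @ ... @ T[xn] @ 1 for token-labeled matrices T."""
    v = np.asarray(pi, dtype=float)
    for x in word:
        v = v @ T[x]
    return float(v.sum())

# Presentation A: a single state that emits 0 or 1 with probability 1/2.
pi_a = [1.0]
T_a = {0: np.array([[0.5]]), 1: np.array([[0.5]])}

# Presentation B: two states, each emitting 0 or 1 with probability 1/2.
# The hidden state switches with probability 0.3 (an arbitrary choice),
# which never affects the emitted tokens.
stay, switch = 0.7, 0.3
trans = np.array([[stay, switch], [switch, stay]])
pi_b = [0.5, 0.5]
T_b = {0: 0.5 * trans, 1: 0.5 * trans}

# Compare the two presentations on every word of length 1 to 4.
for n in range(1, 5):
    for word in itertools.product([0, 1], repeat=n):
        pa = word_probability(pi_a, T_a, word)
        pb = word_probability(pi_b, T_b, word)
        assert abs(pa - pb) < 1e-12
print("both presentations agree on all words up to length 4")
```

Any other value of `switch` between 0 and 1 gives yet another presentation of the same process, which is one way to see why the number of presentations is infinite.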

Any HMM defines a probability distribution over infinite sequences of the emissions.

Our initial belief distribution, in this particular case, is the uniform distribution over the 3 states of the data generating process. However, this is not always the case. In general, the initial belief distribution is given by the stationary distribution of the data generating HMM.
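
As a minimal sketch of the general rule, the stationary distribution $\pi$ satisfies $\pi T = \pi$, where $T$ is the state-to-state transition matrix obtained by summing over emitted tokens. Both matrices below are hypothetical three-state examples chosen for illustration. Neither is the process used in the post, and `stationary_distribution` is just an illustrative helper name.

```python
import numpy as np

def stationary_distribution(T):
    """Left eigenvector of T with eigenvalue 1, normalized to sum to 1.

    T[i, j] is the probability of moving from state i to state j
    (already summed over emitted tokens).
    """
    eigvals, eigvecs = np.linalg.eig(T.T)
    idx = np.argmin(np.abs(eigvals - 1.0))  # pick the eigenvalue closest to 1
    pi = np.real(eigvecs[:, idx])
    return pi / pi.sum()

# Hypothetical, deliberately asymmetric transition matrix:
# its stationary distribution is not uniform.
T = np.array([
    [0.8, 0.1, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.2, 0.6],
])
pi = stationary_distribution(T)
print("asymmetric case:", pi)

# Hypothetical symmetric (doubly stochastic) matrix: here the stationary
# distribution is uniform, one way a uniform initial belief can arise.
T_sym = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
])
pi_sym = stationary_distribution(T_sym)
print("symmetric case: ", pi_sym)
```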

You can find the answer in section IV of this paper by @Paul Riechers.

There is work in Computational Mechanics that studies non-optimal or near-optimal prediction, and the tradeoffs one incurs when relaxing optimality. This is likely relevant to neural networks in practice. See Marzen and Crutchfield 2021 and Marzen and Crutchfield 2014.

This process is called the mess3 process, and was defined in a paper by Sarah Marzen and James Crutchfield. In the work presented, we use $x = 0.05$ and $\alpha = 0.85$.

We've also run another control where we retain the ground truth fractal structure but shuffle which inputs correspond to which points in the simplex (you can think of this as shuffling the colors in the ground truth plot). In this case, when we run our regression, every residual stream activation is mapped to the center point of the simplex, which is the center of mass of all the points.
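
The collapse to the center of mass is what least squares is expected to do when the inputs carry no information about the targets. Below is a minimal sketch on synthetic data. The stand-in activations, the Dirichlet-distributed targets, and all sizes are assumptions for illustration rather than the data used in the post.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_states, d_model = 5000, 3, 16
targets = rng.dirichlet(np.ones(n_states), size=n_samples)  # simplex points
W_true = rng.normal(size=(n_states, d_model))
# Stand-in activations that encode the targets linearly, plus small noise.
acts = targets @ W_true + 0.01 * rng.normal(size=(n_samples, d_model))

def fit_and_predict(X, Y):
    """Affine least-squares regression of Y on X; returns in-sample predictions."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return X1 @ coef

# Correct pairing: predictions spread out over the simplex like the targets.
pred_true = fit_and_predict(acts, targets)

# Control: break the pairing between activations and simplex points.
shuffled = targets[rng.permutation(n_samples)]
pred_shuffled = fit_and_predict(acts, shuffled)

center_of_mass = targets.mean(axis=0)
spread_targets = targets.std(axis=0).mean()
spread_true = pred_true.std(axis=0).mean()
spread_shuffled = pred_shuffled.std(axis=0).mean()

print("center of mass:          ", center_of_mass)
print("mean shuffled prediction:", pred_shuffled.mean(axis=0))
print("spread, targets:         ", spread_targets)
print("spread, correct pairing: ", spread_true)
print("spread, shuffled pairing:", spread_shuffled)
```

With the pairing intact, the predictions cover the simplex. With the pairing shuffled, they shrink toward the mean of the targets.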

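| Token observed | (none yet) | `1` | `1` | `0` | `1...` |
|---|---|---|---|---|---|
| P($S_{0}$) | $\frac{1}{3}$ | $\frac{1}{3}$ | $1$ | $0$ | $0...$ |
| P($S_{1}$) | $\frac{1}{3}$ | $0$ | $0$ | $1$ | $0...$ |
| P($S_{R}$) | $\frac{1}{3}$ | $\frac{2}{3}$ | $0$ | $0$ | $1...$ |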
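
The columns of this table can be reproduced with a few lines of Bayesian filtering. The sketch below assumes a three-state process in which $S_0$ always emits `0` and moves to $S_1$, $S_1$ always emits `1` and moves to $S_R$, and $S_R$ emits `0` or `1` with equal probability and moves to $S_0$. Those transition rules are inferred from the table values rather than stated in this section, and the names `T`, `update`, and `belief_path` are illustrative.

```python
from fractions import Fraction

# Assumed token-labeled transition matrices: T[x][i][j] is the probability of
# emitting token x while moving from state i to state j.
# State order: 0 = S_0, 1 = S_1, 2 = S_R. These rules are inferred from the
# table, not taken from the text.
half = Fraction(1, 2)
T = {
    0: [[0, 1, 0],      # S_0 emits 0 and goes to S_1
        [0, 0, 0],      # S_1 never emits 0
        [half, 0, 0]],  # S_R emits 0 half the time and goes to S_0
    1: [[0, 0, 0],      # S_0 never emits 1
        [0, 0, 1],      # S_1 emits 1 and goes to S_R
        [half, 0, 0]],  # S_R emits 1 half the time and goes to S_0
}

def update(belief, token):
    """One Bayesian update: push the belief through T[token], then normalize."""
    unnorm = [sum(belief[i] * T[token][i][j] for i in range(3)) for j in range(3)]
    total = sum(unnorm)
    if total == 0:
        raise ValueError("token has zero probability under the current belief")
    return [u / total for u in unnorm]

def belief_path(tokens):
    """Beliefs before any token and after each token in turn."""
    belief = [Fraction(1, 3)] * 3  # uniform prior, as in the first column
    path = [belief]
    for tok in tokens:
        belief = update(belief, tok)
        path.append(belief)
    return path

path = belief_path([1, 1, 0, 1])
for step in path:
    # Each printed row is one column of the table: [P(S_0), P(S_1), P(S_R)].
    print([str(p) for p in step])
```

Under these assumed rules, the first `1` rules out $S_0$ as the emitting state, and the second `1` pins the belief to a single state. After that, the belief simply follows the deterministic transitions until the process passes through $S_R$ again.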

AI ALIGNMENT FORUM
