Actually, Othello-GPT Has A Linear Emergent World Representation

[-]TurnTrout3y85

Overall, I really like this post. I think it's a cool, self-contained insight with real updates for interp. I also admire how quickly you got these results. It makes me want to hack more things, quickly, and get more cool results, quickly.

Models can be deeply understood: More fundamentally, this is further evidence that neural networks are genuinely understandable and interpretable, if we can just learn to speak their language.

I agree that this is evidence, but I have some sense of "there's going to be low-hanging, truly-understandable circuits, and possibly a bunch of circuits we don't understand and can't even realize are there. And we keep doing interp work and understanding more and more of models, but often we won't know exactly what we don't know." Are you sympathetic to this concern?

(Ofc you don't need to understand a full net for interp to be amazingly useful, and other such caveats)

Also, what does "Translate by X" mean in your intervention plots?

[-]Neel Nanda3y32

Thanks! I also feel more optimistic now about speed research :) (I've tried similar experiments since, but with much less success - there's a bunch of contingent factors around not properly hitting flow and not properly clearing time for it though). I'd be excited to hear what happens if you try it! Though I should clarify that writing up the results took a month of random spare non-work time...

Re models can be deeply understood, yes, I think you raise a valid and plausible concern and I agree that my work is not notable evidence against. Though also, idk man, it seems basically unfalsifiable. My intuition is that there may be some threshold of "we cannot deeply interpret past this", but no one knows where it is (and most people assumed "we cannot deeply interpret at all"! Or something similar). And that every interpretability win is evidence that boundary is further on (or non-existent).

Fuzzy intuition: It doesn't distinguish between the boundary being far away vs non-existent, but IMO the correct prior before seeing mech interp work at all was to have some distribution over the point where we hit a wall, and some probability on never hitting a wall. The longer we go without hitting a wall, the higher the posterior probability on never hitting a wall should be.
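The updating argument above can be made concrete with a toy calculation. This is only a sketch under invented assumptions: the prior mass `p_never`, the geometric shape of the wall distribution, and the `hazard` rate are illustrative choices, not numbers from this discussion.

```python
def posterior_never(p_never: float, hazard: float, steps_survived: int) -> float:
    """Posterior probability that there is no wall, after `steps_survived`
    interpretability wins without hitting one.

    Assumed toy model (hypothetical): with prior probability `p_never` there is
    no wall; otherwise the wall sits at a geometrically distributed depth, so
    each further step hits it with probability `hazard`.
    """
    # Probability of getting this far without hitting a wall, if one exists.
    survive_given_wall = (1.0 - hazard) ** steps_survived
    # Bayes' rule: P(no wall | survived) = P(no wall) / P(survived).
    evidence = p_never + (1.0 - p_never) * survive_given_wall
    return p_never / evidence


for n in (0, 5, 20, 50):
    print(n, round(posterior_never(p_never=0.1, hazard=0.1, steps_survived=n), 3))
```

Under this model the posterior on "no wall" rises monotonically with each win, but the remaining mass on "a wall exists" also shifts toward walls that are further out, which is why the evidence cannot separate "far away" from "non-existent".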

Translate by X is bad notation - it means "take the coordinate in the 'mine vs theirs' direction, and set it to -X times its original value". It should really be "flip and scale by X" or something (it came from an initial iteration of the method).
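A minimal sketch of that flip-and-scale operation, using plain Python lists. The function name, the idea that the vector is a residual-stream activation, and the example numbers are all assumptions for illustration; this is not the code used in the post.

```python
def flip_and_scale(resid, direction, x):
    """Set the component of `resid` along `direction` to -x times its original value.

    `resid` stands in for an activation vector and `direction` for a probe
    direction (e.g. "mine vs theirs" for one board square). Both names are
    hypothetical, and both arguments are plain lists of floats.
    """
    norm = sum(d * d for d in direction) ** 0.5
    unit = [d / norm for d in direction]
    # Original coordinate of the activation along the probe direction.
    coord = sum(r * u for r, u in zip(resid, unit))
    # Moving the coordinate from `coord` to `-x * coord` means adding
    # (-x - 1) * coord along the unit direction; orthogonal parts are untouched.
    delta = (-x - 1.0) * coord
    return [r + delta * u for r, u in zip(resid, unit)]


resid = [2.0, 1.0, -3.0]
direction = [1.0, 0.0, 0.0]
print(flip_and_scale(resid, direction, x=1.0))  # pure flip of the first coordinate
```

With `x=1.0` the coordinate is simply negated; larger `x` both flips it and pushes it further out along the same direction.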

[-]Circuitrinos2y30

Regarding this quote "we see that the model trained to be good at Othello seems to have a much worse world model"

What if, for LLMs trained to play games like Othello, chess, Go, etc., instead of directly training models to play the best moves, we first train them to play legal moves, as in this paper, so that they construct a good world model?

Then once it has a world model, we "freeze" those weights and add on additional layers and train just those layers to play the game well.

Wouldn't this force the play-well model to include the good world model? (a model we can probe/understand).

Wouldn't that also force the play-well layers of the model to learn something much easier to probe and understand?

From there, we could potentially probe the play-well layers to learn something about what the optimal strategy of the game actually is.

[-]Neel Nanda2y20

That might work, though you could easily end up with the final model not actually faithfully using its world model to make the correct moves - if there are more efficient/correct heuristics, there's no guarantee it'll use the expensive world model, or that it won't just forget about it.

[-]gwern2y*3-1

I would expect it to not work in the limit. All the models must converge on the same optimal solution for a deterministic perfect-information game like Othello and become value-equivalent, ignoring the full board state which is irrelevant to reward-maximizing. (You don't need to model edge-cases or weird scenarios which don't ever come up while pursuing the optimal policy, and the optimal 'world-model' can be arbitrarily tinier and unfaithful to the full true world dynamics.*) Simply hardwiring a world model doesn't change this, any more than feeding in the exact board state as an input would lead to it caring about or paying attention to the irrelevant parts of the board state. As far as the RL agent is concerned, knowledge of irrelevant board state is a wasteful bug to be worked around or eliminated, no matter where this knowledge comes from or is injected.

* I'm sure Nanda knows this but for those whom this isn't obvious or haven't seen other discussions on this point (some related to the 'simulators' debate): a DRL agent only wants to maximize reward, and only wants to model the world to the extent that maximizes reward. For a complicated world or incomplete maximization, this may induce a very rich world-model inside the agent, but the final converged optimal agent may have an arbitrarily impoverished world model. In this case, imagine a version of Othello where at the first turn, the agent may press a button labeled 'win'. Obviously, the optimal agent will learn nothing at all beyond learning 'push the button on the first move' and won't learn any world-model at all of Othello! No matter how rich and fascinating the rest of the game may be, the optimal agent neither knows nor cares.

[-]TurnTrout2y*20

All the models must converge on the same optimal solution for a deterministic perfect-information game like Othello and become value-equivalent, ignoring the full board state which is irrelevant to reward-maximizing.

Strong claim! I'm skeptical (EDIT: if you mean "in the limit" to apply to practically relevant systems we build in the future. If so,) do you have a citation for DRL convergence results relative to this level of expressivity, and reasoning for why realistic early stopping in practice doesn't matter? (Also, of course, even one single optimal policy can be represented by multiple different network parameterizations which induce the same semantics, with eg some using the WM and some using heuristics.)

I think the more relevant question is "given a frozen initial network, what are the circuit-level inductive biases of the training process?". I doubt one can answer this via appeals to RL convergence results.

(I skimmed through the value equivalence paper, but LMK if my points are addressed therein.)

a DRL agent only wants to maximize reward, and only wants to model the world to the extent that maximizes reward.

As a side note, I think this "agent only wants to maximize reward" language is unproductive (see "Reward is not the optimization target", and "Think carefully before calling RL policies 'agents'"). In this case, I suspect that your language implicitly equivocates between "agent" denoting "the RL learning process" and "the trained policy network":

As far as the RL agent is concerned, knowledge of irrelevant board state is a wasteful bug to be worked around or eliminated, no matter where this knowledge comes from or is injected.

[-]gwern2y62

if you mean "in the limit" to apply to practically relevant systems we build in the future.

Outside of simple problems like Othello, I expect most DRL agents will not converge fully to the peak of the 'spinning top', and so will retain traces of their informative priors like world-models.

For example, if you plug GPT-5 into a robot, I doubt it would ever be trained to the point of discarding most of its non-value-relevant world-model - the model is too high-capacity for major forgetting, and past meta-learning incentivizes keeping capabilities around just in case.

But that's not 'every system we build in the future', just a lot of them. Not hard to imagine realistic practical scenarios where that doesn't hold - I would expect that any specialized model distilled from it (for cheaper faster robotic control) would not learn or would discard much more of its non-value-relevant world-model compared to its parent, and that would have potential safety & interpretability implications. The System II distills and compiles down to a fast efficient System I. (For example, if you were trying to do safety by dissecting its internal understanding of the world, or if you were trying to hack a superior reward model, adding in safety criteria not present in the original environment/model, by exploiting an internal world model, you might fail because the optimized distilled model doesn't have those parts of the world model, even if the parent model did, as they were irrelevant.) Chess end-game databases are provably optimal & very superhuman, and yet, there is no 'world-model' or human-interpretable concepts of chess anywhere to be found in them; the 'world-model' used to compute them, whatever that was, was discarded as unnecessary after the optimal policy was reached.

I think the more relevant question is "given a frozen initial network, what are the circuit-level inductive biases of the training process?". I doubt one can answer this via appeals to RL convergence results.

Probably not, but mostly because you phrased it as inductive biases to be washed away in the limit, or using gimmicks like early stopping. (It's not like stopping forgetting is hard. Of course you can stop forgetting by changing the problem to be solved, and simply making a representation of the world-state part of the reward, like including a reconstruction loss.) In this case, however, Othello is simple enough that the superior agent has already apparently discarded much of the world-model and provides a useful example of what end-to-end reward maximization really means - while reward is sufficient to learn world-models as needed, full complete world-models are neither necessary nor sufficient for rewards.

As a side note, I think this "agent only wants to maximize reward" language is unproductive (see "Reward is not the optimization target", and "Think carefully before calling RL policies 'agents'").

I've tried to read those before, and came away very confused what you meant, and everyone who reads those seems to be even more confused after reading them. At best, you seem to be making a bizarre mishmash of confusing model-free and policies and other things best not confused and being awestruck by a triviality on the level of 'organisms are adaptation-executers and not fitness-maximizers', and at worst, you are obviously wrong: reward is the optimization target, both for the outer loop and for the inner loop of things like model-based algorithms. (In what sense does, say, a tree search algorithm like MCTS or full-blown backwards induction not 'optimize the reward'?)

[-]TurnTrout2y20

Probably not, but mostly because you phrased it as inductive biases to be washed away in the limit, or using gimmicks like early stopping.

LLMs aren't trained to convergence because that's not compute-efficient, so early stopping seems like the relevant baseline. No?

everyone who reads those seems to be even more confused after reading them

I want to defend "Reward is not the optimization target" a bit, while also mourning its apparent lack of clarity. The above is a valid impression, but I don't think it's true. For some reason, some people really get a lot out of the post; others think it's trivial; others think it's obviously wrong, and so on. See Rohin's comment:

(Just wanted to echo that I agree with TurnTrout that I find myself explaining the point that reward may not be the optimization target a lot, and I think I disagree somewhat with Ajeya's recent post for similar reasons. I don't think that the people I'm explaining it to literally don't understand the point at all; I think it mostly hasn't propagated into some parts of their other reasoning about alignment. I'm less on board with the "it's incorrect to call reward a base objective" point but I think it's pretty plausible that once I actually understand what TurnTrout is saying there I'll agree with it.)

You write:

In what sense does, say, a tree search algorithm like MCTS or full-blown backwards induction not 'optimize the reward'?

These algorithms do optimize the reward. My post addresses the model-free policy gradient setting... [goes to check post] Oh no. I can see why my post was unclear -- it didn't state this clearly. The original post does state that AIXI optimizes its reward, and also that:

For point 2 (reward provides local updates to the agent's cognition via credit assignment; reward is not best understood as specifying our preferences), the choice of RL algorithm should not matter, as long as it uses reward to compute local updates.

However, I should have stated up-front: This post addresses model-free policy gradient algorithms like PPO and REINFORCE.

I don't know what other disagreements or confusions you have. In the interest of not spilling bytes by talking past you -- I'm happy to answer more specific questions.

[-]TurnTrout3y*31

Not a huge deal for the overall post, but I think your statement here isn't actually known to be strictly true:

Literally the only thing Othello-GPT cares about is playing legal moves

I think it's probably true in some rough sense, but I personally wouldn't state it confidently like that. Even if the network is supervised-trained to predict legal moves, that doesn't mean its internal goals or generalization mirror that.

[-]Neel Nanda3y40

Er, hmm. To me this feels like a pretty uncontroversial claim when discussing a small model on an algorithmic task like this. (Note that the model is literally trained on uniform random legal moves, it's not trained on actual Othello game transcripts). Though I would agree that eg "literally all that GPT-4 cares about is predicting the next token" is a dubious claim (even ignoring RLHF). It just seems like Othello-GPT is so small, and trained on such a clean and crisp task that I can't see it caring about anything else? Though the word care isn't really well defined here.

I'm open to the argument that I should say "Adam only cares about playing legal moves, and probably this is the only thing Othello-GPT is "trying" to do".

To be clear, the relevant argument is "there are no other tasks to spend resources on apart from "predict the next move" so it can afford a very expensive world model"

[-]TurnTrout3y32

I'm open to the argument that I should say "Adam only cares about playing legal moves, and probably this is the only thing Othello-GPT is "trying" to do".

This statement seems fine, yeah!

(Rereading my initial comment, I regret that it has a confrontational tone where I didn't intend one. I wanted to matter-of-factly state my concern, but I think I should have prefaced with something like "by the way, not a huge deal overall, but I think your statement here isn't known to be strictly true." Edited.)

[-]TurnTrout3y30

Rather than just learning surface level statistics about the distribution of moves, it learned to model the underlying process that generated that data. In my opinion, it's already pretty obvious that transformers can do something more than statistical correlations and pattern matching, see eg induction heads, but it's great to have clearer evidence of fully-fledged world models!

This updated me slightly upwards on "LLMs trained on text learn to model the underlying world, without needing multimodal inputs to pin down more of the world's e.g. spatial properties." I previously had considered that any given corpus could have been generated by a large number of possible worlds, but I now don't weight this objection as highly.

[-]Neel Nanda3y10

I previously had considered that any given corpus could have been generated by a large number of possible worlds, but I now don't weight this objection as highly.

Interesting, I hadn't seen that objection before! Can you say more? (Though maybe not if you aren't as convinced by it any more). To me, it'd be that there's many worlds but they all share some commonalities and those commonalities are modelled. Or possibly that the model separately simulates the different worlds.

[-]TurnTrout3y20

So, first, there's an issue where the model isn't "remembering" having "seen" all of the text. It was updated by gradients taken over its outputs on the historical corpus. So there's a subtlety, such that "which worlds are consistent with observations" is a wrongly-shaped claim. (I don't think you fell prey to that mistake in OP, to be clear.)

Second, on my loose understanding of metaphysics (ie this is reasoning which could very easily be misguided), there exist computable universes which contain entities training this language model given this corpus / set of historical signals, such that this entire setup is specified by the initial state of the laws of physics. In that case, the corpus and its regularities ("dogs" and "syntax" and such) wouldn't necessarily reflect the world the agent was embedded in, which could be anything, really. Like maybe there's an alien species on a gas giant somewhere which is training on fictional sequences of tokens, some of which happen to look like "dog".

Of course, by point (1), what matters isn't the corpus itself (ie what sentences appear) but how that corpus imprints itself into the network via the gradients. And your post seems like evidence that even a relatively underspecified corpus (sequences of legal Othello moves) appears to imprint itself into the network, such that the network has a world model of the data generator (i.e. how the game works in real life).

Does this make sense? I have some sense of having communicated poorly here, but hopefully this is better than leaving your comment unanswered.

[-]lukaemon1y00

In hindsight, I should have trained on layer 6, which is the point where the board state is fully computed and starts to really be used.

You mean layer 4?

[-]ws27a3y-11

Nice work. But I wonder why people are so surprised that these models and GPT would learn a model of the world. Of course they learn a model of the world. Even the skip-gram and CBOW word vectors people trained ages ago modelled the world, in the sense that for example named entities in vector space would be highly correlated with actual spatial/geographical maps. It should be 100% assumed that these models which have many orders of magnitude more parameters are learning much more sophisticated models of the world. What that tells us about their "intelligence" is an entirely different question whatsoever. They are still statistical next token predictors, it's just the statistics are so complicated it essentially becomes a world model. The divide between these concepts is artificial.

[-]Neel Nanda3y21

I tried to be explicit in the post that I don't personally care all that much about the world model angle - Othello-GPT clearly does form a world model, it's very clear evidence that this is possible. Whether it happens in practice is a whole other question, but it clearly does happen a bit.

They are still statistical next token predictors, it's just the statistics are so complicated it essentially becomes a world model. The divide between these concepts is artificial.

I think this undersells it. World models are fundamentally different from surface level statistics, I would argue - a world model is an actual algorithm, with causal links and moving parts. Analogous to how an induction head is a real algorithm (given a token A, search the context for previous occurrences of A, and predict that the token which came next then will come next now), while something that memorises a ton of bigrams such that it can predict B after A is not.
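The contrast can be sketched in a few lines of Python. This is a toy illustration of the two behaviours described above, not a model of how a transformer implements them; the function names and example tokens are made up.

```python
from collections import Counter, defaultdict


def induction_predict(context):
    """Induction-style rule: find the most recent earlier occurrence of the
    current token and predict whatever followed it. Returns None if the
    current token has not appeared before."""
    current = context[-1]
    # Scan backwards over earlier positions (excluding the final token itself).
    for i in range(len(context) - 2, -1, -1):
        if context[i] == current:
            return context[i + 1]
    return None


def fit_bigram_table(corpus):
    """Memorise, for each token, the token that most often followed it."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    return {prev: c.most_common(1)[0][0] for prev, c in counts.items()}


training = ["the", "cat", "sat", "the", "cat", "ran"]
table = fit_bigram_table(training)

# A sequence built from tokens the bigram table has never seen.
novel = ["zog", "blit", "quux", "zog"]
print(induction_predict(novel))   # uses the context itself
print(table.get(novel[-1]))       # lookup fails: nothing was memorised
```

The induction rule works on tokens it has never encountered because it operates on the structure of the context, while the bigram table can only return what it stored during fitting.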

AI ALIGNMENT FORUM