(Thanks to Lawrence Chan and Buck Shlegeris for comments. Thanks to Nate Thomas for many comments and editing.)
Despite appreciating and agreeing with various specific points made in the Simulators post, I broadly think that the term ‘simulator’ and the corresponding frame probably shouldn’t be used. Instead, I think we should just directly reason about predictors and think in terms of questions such as ‘what would the model predict for the next token?’
In this post, I won’t make arguments that I think are strong enough to decisively justify this claim, but I will argue for two points that support it:
- The word ‘simulation’ as used in the Simulators post doesn’t correspond to a single simulation of reality, and a ‘simulacrum’ doesn’t correspond to an approximation of a single agent in reality. Instead a ‘simulation’ corresponds to a distribution over processes that generated the text. This distribution in general contains uncertainty over a wide space of different agents involved in those text generating processes.
- Systems can be very good at prediction yet very bad at plausible generation – in other words, very bad at ‘running simulations’.
The rest of the post elaborates on these claims.
I think the author of the Simulators post is aware of these objections. I broadly endorse the perspective in ‘simulator’ framing and confusions about LLMs, which also argues against the simulator framing to some extent. For another example of prior work on these two points, see this discussion of models recognizing that they are generating text due to generator discriminator gaps in the Conditioning Predictive Models sequence.
Simulators, ‘simulator’ framing and confusions about LLMs, Conditioning Predictive Models
Language models are predictors, not simulators
My main issue with the terms ‘simulator’, ‘simulation’, ‘simulacra’, etc. is that a language model ‘simulating a simulacrum’ doesn’t correspond to a single simulation of reality, even in the high-capability limit. Instead, language model generation corresponds to a distribution over all the settings of latent variables which could have produced the previous tokens, aka “a prediction”.
Let’s go through an example: Suppose we prompt the model with “<|endoftext|>NEW YORK—After John McCain was seen bartending at a seedy nightclub”. I’d claim the model's next token prediction will involve uncertainty over the space of all the different authors who could have written this passage, as well as all the possible newspapers, etc. It presumably can’t internally represent the probability of each specific author and newspaper, though I expect bigger models will latently have an estimate for the probability that text like this was written by particularly prolific authors with particularly distinctive styles as well as a latent estimate for particular sites. In this case, code-davinci-002 is quite confident this prompt comes from The Onion.
In practice, I think it’s tempting to think of a model as running a particular simulation of reality, but performing well at the objective of next-token prediction doesn’t result in the output you would get from a single, particular simulation. In the previous example, the model might be certain that the piece is from The Onion after it’s generated many tokens, but it’s presumably not sure which author at The Onion wrote it or what the publication date is.
Pretrained models don’t ‘simulate a character speaking’; they predict what comes next, which implicitly involves making predictions about the distribution of characters and what they would say next.
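To make the "distribution over latent variables" point concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the two sources, the priors, the prompt likelihoods, and the next-token tables are invented numbers, not measurements from any real model. It only shows the shape of the computation: the prediction is a posterior-weighted mixture over possible sources, not the output of one chosen source.

```python
# Toy illustration only: sources, tokens, and all probabilities are made up.

# Hypothetical prior over which kind of source produced a document.
PRIOR = {"the_onion": 0.02, "wire_service": 0.98}

# Hypothetical likelihood of the observed prompt under each source.
PROMPT_LIKELIHOOD = {"the_onion": 1e-6, "wire_service": 1e-9}

# Hypothetical next-token distribution given the prompt and each source.
NEXT_TOKEN = {
    "the_onion": {" sources": 0.6, " police": 0.1, " Thursday": 0.3},
    "wire_service": {" sources": 0.2, " police": 0.6, " Thursday": 0.2},
}


def posterior(prior, likelihood):
    """Bayes' rule: p(source | prompt) is proportional to p(source) * p(prompt | source)."""
    unnormalized = {s: prior[s] * likelihood[s] for s in prior}
    total = sum(unnormalized.values())
    return {s: w / total for s, w in unnormalized.items()}


def predict(post, next_token):
    """Mixture prediction: sum over sources of p(source | prompt) * p(token | source)."""
    tokens = next(iter(next_token.values())).keys()
    return {t: sum(post[s] * next_token[s][t] for s in post) for t in tokens}


post = posterior(PRIOR, PROMPT_LIKELIHOOD)
prediction = predict(post, NEXT_TOKEN)
print(post)
print(prediction)
```

Even when the posterior concentrates heavily on one source, the prediction still carries the leftover weight on the others, and within a source there would be further uncertainty (author, date, and so on) that this sketch leaves out.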
I’ve seen this mistake made frequently – for example, see this post (note that in this case the mistake doesn’t change the conclusion of the post). I think the author of the Simulators post understands the point I make in this section – for instance, they state:
This is true for GPT, where the text “initial condition” of most training samples severely underdetermines the real-world process which led to the choice of next token.
Regardless, this issue makes this terminology misleading.
Good prediction doesn’t imply good generation
You might have thought that good predictive loss constrains generations to be plausible; however, this isn’t true – models with very good predictive loss can output extremely unlikely sequences with high probability.
I’ll use a simple example to illustrate this: Suppose I take an existing language model and add an extra token that doesn’t exist in the corpus; call this token <|i_am_generating|>. Now, suppose we adjust the language model so that this token is always predicted to have 1% probability (proportionally down-weighting other predictions), except that it’s assigned 100% probability if <|i_am_generating|> has already occurred earlier in the sequence.
This language model gets exactly -log(0.99) (about 0.01 nats) of additional loss per token on the prediction task, because every token that actually occurs in the corpus has its predicted probability scaled by 0.99. Despite this relatively small change in loss, if you generate using this model, long generations will typically end up having zero probability measure on the ‘true’ distribution due to including the <|i_am_generating|> token which never occurs in the corpus. (For example, generations of length 1000 have probability 0.99996 of containing this token.) Instead of getting the model to output <|i_am_generating|> forever after the first such token, we could train the model to have any behavior we want after encountering the <|i_am_generating|> token, without changing the next-token prediction loss.
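The arithmetic behind this example can be checked directly. The sketch below assumes natural-log loss (nats) and the 1% marker probability from the setup above; the function names are made up for illustration. Corpus text never contains the marker, so each real token's probability is scaled by 0.99, while a sampled sequence avoids the marker only if every one of its steps avoids it.

```python
import math

P_MARKER = 0.01  # probability assigned to <|i_am_generating|> at each step


def extra_loss_per_token(p_marker: float) -> float:
    """Additional cross-entropy in nats on corpus text, which never contains the marker.

    Every real token's probability is multiplied by (1 - p_marker),
    so the per-token loss rises by -log(1 - p_marker).
    """
    return -math.log(1.0 - p_marker)


def prob_marker_appears(p_marker: float, length: int) -> float:
    """Probability that a sampled sequence of `length` tokens contains the marker."""
    return 1.0 - (1.0 - p_marker) ** length


extra = extra_loss_per_token(P_MARKER)
appears = prob_marker_appears(P_MARKER, 1000)
print(f"extra loss per token: {extra:.5f} nats")
print(f"P(marker within 1000 tokens): {appears:.5f}")
```

The per-token cost stays fixed at roughly 0.01 nats, while the chance of a generation leaving the support of the corpus approaches 1 as the sequence gets longer.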
The previous example required making the loss a bit worse, but there are cases where a model can virtually always generate anything we choose without affecting prediction loss at all. This is possible in cases where verification is easier than generation.
For instance, suppose that the first 30 tokens of every sequence constitute a cryptographic hash of the next 30 tokens, which are uniformly distributed. It might be very easy to check that the first 60 tokens are consistent, but extremely difficult to generate the correct distribution over the first 60 tokens for a next-token predictor. Thus, the model could check if it’s generating in a way that leaves its generative behavior totally unconstrained (and therefore potentially arbitrarily far from the original distribution).
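Here is a minimal sketch of the verification/generation asymmetry in the hash example. The choices are assumptions for illustration: hexadecimal characters stand in for tokens, SHA-256 truncated to 30 characters stands in for the hash, and a generator that cannot invert the hash is modeled as guessing the prefix uniformly. Checking consistency takes one hash evaluation; producing a consistent sequence left to right would require committing to the hash before the payload exists.

```python
import hashlib
import random

HEX = "0123456789abcdef"
N = 30  # characters stand in for tokens in this toy version


def digest(payload: str) -> str:
    """First N hex characters of SHA-256, standing in for the hash in the example."""
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:N]


def sample_real_sequence(rng: random.Random) -> str:
    """Data-generating process: uniform payload, preceded by its hash."""
    payload = "".join(rng.choice(HEX) for _ in range(N))
    return digest(payload) + payload


def is_consistent(sequence: str) -> bool:
    """Cheap verification: does the prefix match the hash of the payload?"""
    prefix, payload = sequence[:N], sequence[N:2 * N]
    return prefix == digest(payload)


def sample_left_to_right(rng: random.Random) -> str:
    """Modeling assumption: a generator that cannot invert the hash guesses everything."""
    return "".join(rng.choice(HEX) for _ in range(2 * N))


rng = random.Random(0)
real = sample_real_sequence(rng)
generated = sample_left_to_right(rng)
print(is_consistent(real), is_consistent(generated))
```

A predictor only ever scores real sequences, where the check passes, so whatever it does after a failed check costs it nothing in prediction loss.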
The considerations in this section only apply if the model is imperfect; if the model has perfectly optimal next-token prediction loss, its generations must have exactly the same distribution as the data distribution. Of course, all LLMs that we’ll ever care about will be imperfect in this sense.
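The perfect-model case follows from the chain rule. As a sketch, write p for the data distribution and q for the model (both symbols are introduced here for illustration). If q has optimal loss, its conditionals agree with those of p on every prefix that has positive probability under p, so sampling token by token gives:

```latex
q(x_{1:n}) \;=\; \prod_{t=1}^{n} q(x_t \mid x_{<t})
          \;=\; \prod_{t=1}^{n} p(x_t \mid x_{<t})
          \;=\; p(x_{1:n})
```

The middle equality is the step that fails for an imperfect model: an error in a single conditional can move the sample onto prefixes where the loss places no constraint on q at all.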
These considerations (at least morally) contradict the following quote from Simulators:
Models trained with the strict simulation objective are directly incentivized to reverse-engineer the (semantic) physics of the training distribution, and consequently, to propagate simulations whose dynamical evolution is indistinguishable from that of training samples
Okay, but what do you see in practice?
In practice, the divergence between generation and prediction over long sequences is blatant. Try just generating sequences from a language model at t=1. It’s common for the model to get caught in repetitive traps that are extremely unlikely in the original distribution. (I’m not confident this gap between prediction and generation is real and that I’m interpreting this observation correctly.)
I expect language models will often generate (say, at t=1 and with sequence length greater than maybe around a few thousand) easy-to-distinguish outputs until the singularity (though I’m not confident).
Appendix: Some other agreements and disagreements with Simulators
I agree that the Oracle/Genie/Tool/Agent categories don't properly contain models trained on self-supervised objectives. Further, I think these aren’t particularly useful categories for current alignment research – we should just talk about how exactly we trained the model and what that behaviorally incentivizes.
I agree that it’s worth thinking about how self-supervised learning generalizes to very low probability sequences – specifically, sequences with low enough probability that the model’s behavior could be anything on sequences this unlikely without affecting next-token prediction loss. However, I’m skeptical that first-principles reasoning would yield much beyond basic conclusions here. Because this behavior is unconstrained by the training objective, reasoning about this must involve reference to the architecture and optimizer. I think experiments would be interesting but probably aren’t overall very leveraged until AGI is near. I think building a rough practical model based on empirics could go pretty far.
I disagree that it’s worth thinking about the limit of self-supervised learning. An AI that is extremely close to theoretically optimal loss on next-token prediction is an eldritch horror that we won’t encounter prior to the situation radically changing. The same goes for any training objective that requires extreme superintelligence to achieve within-epsilon-of-optimal loss (which is basically all prediction tasks on infinite datasets sampled from reality). Long before we get this, we’ll either succeed or fail at avoiding AI takeover.
See the appendix. ↩︎
Insofar as it’s useful to try to reason about what exact actions the pre-training objective incentivizes in particular cases. I’m not sold on this being considerably useful in most cases. ↩︎
Note that I disagree with quite a bit of the framing and emphasis of Conditioning Predictive Models. Don’t take this link as an endorsement! ↩︎
I think it’s about 90% sure based on doing some quick samples. ↩︎
This is also discussed in the Conditioning Predictive Models sequence ↩︎
I basically agree with this, and a lot of these are the sorts of reasons we went with "predictor" over "simulator" in "Conditioning Predictive Models."
I was a bit unsure whether to tag your posts with Simulator Theory. Do you endorse that or not?
Yeah, I endorse that. I think we are very much trying to talk about the same thing, it's more just a terminological disagreement. Perhaps I would advocate for the tag itself being changed to "Predictor Theory" or "Predictive Models" or something instead.
I broadly agree with the points being made here, but allow me to nitpick the use of the word "predictive" here, and argue for the key advantage of the simulators framing over the prediction one:
The simulators frame does make it very clear that there's a distinction between the simulator/GPT-3 and the simulacra/characters or situations it's making predictions about! On the other hand, using "prediction" can obscure the distinction, and end up with confused questions like "is GPT just an agent that just wants to minimize predictive loss?"
I think the biggest pitfall of the "simulator" framing is that it's made people (including Beth Barnes?) think it's all about simulating our physical reality, when exactly because of the constraints you mention (text not actually pinpointing the state of the universe, etc.), the abstractions developed by a predictor are usually better understood in terms of treating the text itself as the state, and learning time-evolution rules for that state.
Thinking about the state and time evolution rules for the state seems fine, but there isn't any interesting structure with the naive formulation imo. The state is the entire text, so we don't get any interesting Markov chain structure. (you can turn any random process into a Markov chain where you include the entire history in the state! The interesting property was that the past didn't matter!)
Hm, I mostly agree. There isn't any interesting structure by default, you have to get it by trying to mimic a training distribution that has interesting structure.
And I think this relates to another way that I was too reductive, which is that if I want to talk about "simulacra" as a thing, then they don't exist purely in the text, so I must be sneaking in another ontology somewhere - an ontology that consists of features inferred from text (but still not actually the state of our real universe).
Nitpick: I mean, technically, the state is only the last 4k tokens or however long your context length is. Though I agree this is still very uninteresting.
The time-evolution rules of the state are simply the probabilities of the autoregressive model -- there's some amount of high level structure but not a lot. (As Ryan says, you don't get the normal property you want from a state (the Markov property) except in a very weak sense.)
I also disagree that purely thinking about the text as state + GPT-3 as evolution rules is the intention of the original simulators post; there's a lot of discussion about the content of the simulations themselves as simulated realities or alternative universes (though the post does clarify that it's not literally physical reality), e.g.:
I think insofar as people end up thinking the simulation is an exact match for physical reality, the problem was not in the simulators frame itself, but instead the fact that the word physics was used 47 times in the post, while only the first few instances make it clear that literal physics is intended only as a metaphor.