How LLMs are and are not myopic

janus

Thanks to janus, Nicholas Kees Dupuis, and Robert Kralisch for reviewing this post and providing helpful feedback. Some of the experiments mentioned were performed while at Conjecture.

TLDR: The training goal for LLMs like GPT is not cognitively-myopic (because they think about the future) or value myopic (because the transformer architecture optimizes accuracy over the entire sequence, not just the next-token). However, training is consequence-blind, because the training data is causally independent of the models actions. This assumption breaks down when models are trained on AI generated text.

Summary

Myopia in machine learning models can be defined in several ways. It could be the time horizon the model considers when making predictions (cognitive myopia), the time horizon the model takes into account when assessing its value (value myopia), or the degree to which the model considers the consequences of its decisions (consequence-blindness).
Both cognitively-myopic and consequence-blind models should not pursue objectives for instrumental reasons. This could avoid some important alignment failures, like power-seeking or deceptive alignment. However, these behaviors can still exist as terminal values, for example when a model is trained to predict power-seeking or deceptively aligned agents.
LLM pretraining is not cognitively myopic because there is an incentive to think about the future to improve immediate prediction accuracy, like when predicting the next move in a chess game.
LLM pretraining is not value/prediction myopic (does not maximize myopic prediction accuracy) because of the details of the transformer architecture. Training gradients flow through attention connections, so past computation is directly optimized to be useful when attended to by future computation. This incentivizes improving prediction accuracy over the entire sequence, not just the next token. This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.
You can modify the transformer architecture to remove the incentive for non-myopic accuracy, but as expected, the modified architecture has worse scaling laws.
LLM pretraining on human data is consequence-blind as the training data is causally independent from the model's actions. This implies the model should predict actions without considering the effect of its actions on other agents, including itself. This makes the model miscalibrated, but likely makes alignment easier.
When LLMs are trained on data which has been influenced or generated by LLMs, the assumptions of consequence-blindness partially break down. It’s not clear how this affects the training goal theoretically or in practice.
A myopic training goal does not ensure the model will learn myopic computation or behavior because inner alignment with the training goal is not guaranteed

Introduction

The concept of myopia has been frequently discussed as a potential solution to the problem of deceptive alignment. However, the term myopia is ambiguous and can refer to multiple different properties we might want in an AI system, only some of which might rule out deceptive alignment. There's also been confusion about the extent to which Large language model (LLM) pretraining and other supervised learning methods are myopic and what this implies about their cognition and safety properties. This post will attempt to clarify some of these issues, mostly by summarizing and contextualizing past work.

Types of Myopia

1. Cognitive Myopia

One natural definition for myopia is that the model doesn't think about or consider the future at all. We will call this cognitive myopia. Myopic cognition likely comes with a significant capabilities handicap, as many tasks require some degree of forward planning or anticipation of future events.

LLM pretraining is not cognitively-myopic. Even though LLMs like GPT are optimized for next-token prediction and use causal masking which hides the future from current predictions, there is still a direct incentive to think about the future because it can be useful for immediate prediction accuracy. In a game of chess, efficiently computing the best move likely involves reasoning about how your opponent will respond multiple moves into the future. In other words, we should expect GPT to think about the future because it’s instrumental for predicting the present.

2. Value/Prediction Myopia

Value myopia refers to agents that place no value on future states or rewards. In the case of a predictive model, we mean the model cares only about the accuracy of its next prediction. Intuitively, it seems like this would rule out deceptive alignment and treacherous turns, because a value-myopic agent should be unwilling to sacrifice immediate value for long-term value. Unfortunately, value-myopic agents might still have non-myopic incentives for reasons like anthropic uncertainty or acausal trade. (see Open Problems with Myopia for more details)

In any case, LLM training actually incentivizes value non-myopia. Even though next-token prediction accuracy is a nominally myopic objective, the transformer architecture actually causes the model to be optimized for accuracy over the entire sequence, not just the next token.

In a causal-masked transformer, attention layers can query the previous layers' activations from any column in the context window. Gradients flow through the attention connections, so each previous layer is optimized not just to improve prediction accuracy for the next token, but also to produce values that are useful for future columns to attend to when predicting their token.

The activations of the red block are used by each orange attention block and optimized to be useful for all of them. For simplicity, only attention layers are shown. Thanks to Nicholas Kees Dupuis for the diagram.

This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy. In particular, we should expect trained models to find an efficient tradeoff between current prediction accuracy and full-sequence prediction accuracy. When predicting relatively easy tokens, they will likely spend most of their computation budget preparing for the future. If the model is given a hard problem that it will need to solve later in the sequence, it will likely work on the problem ('in the back of its head') throughout the intermediate predictions.

Experiments

Interpretability

Several interpretability results including ROME confirm this type of non-myopic computation in LLMs. ROME shows that LLMs recall factual information about nouns when the noun first appears, even though this information is only used later when predicting the answer to a question about the noun. This information would be irrelevant and thus wasted computation for the purpose of predicting only the next token. For example, if the model sees the text "The Eiffel Tower", it immediately begins retrieving information about the Eiffel Tower like where it is located even though that's not necessary to predict the next token which is almost certainly "is".

Enforcing Myopia

It is possible to modify the transformer architecture to enforce value (prediction accuracy) myopia by placing stop gradients in the attention layers. This effectively prevents past activations from being directly optimized to be more useful for future computation. We ran several informal experiments on models like these while at Conjecture. Unfortunately, we do not have quantitative results to share here. The experiments were preliminary and we moved on to other aspects of the project, so don’t take this as strong evidence.

Specifically, we trained a set of four traditional and four myopic transformers ranging from 117M to 1.5B parameters (equivalent to GPT-2 Small to GPT-XL). Each model was trained on the same data but training hyperparameters were tuned to each architecture individually using maximal update parameterization.

We found the performance reduction from myopia was minimal at 117M parameters, but the performance cost increased with scale, i.e. myopic transformers have worse scaling laws.

3. Consequence-blindness

A third type of myopia to consider is consequence-blindness, where a model chooses actions completely independent of any effect of its actions on the future. This is similar to the goal of Counterfactual Oracles.

A causal diagram of the supervised learning process from Modeling AGI Safety Frameworks with Causal Influence Diagrams — *A causal diagram of the supervised learning process from* *Modeling AGI Safety Frameworks with Causal Influence Diagrams*

Consequence-blindness should rule out most types of instrumental convergence and concerns about self-fulfilling prophecies. A model which completely ignores the effects of its actions has no instrumental incentive to pursue traditional instrumental goals, like trying to accumulate resources to become more powerful, trying to prevent its own shutdown, or pretending to be aligned in order to defect later. However, consequence-blindness does not actually constrain the behavior of a model, because the model can pursue any instrumental goal as a terminal value.

A consequence-blind simulator that predicts power-seeking agents (like humans) will still predict actions which seek power, but these actions will seek power for the simulated agent, not the simulator itself. I usually think about problems like this as simulator vs simulacra alignment. If you successfully build an inner aligned simulator, you can use it to faithfully simulate according to the rules it learns and generalizes from its training distribution. However you are still left with the problem of extracting consistently aligned simulacra.

In theory, consequence-blindness doesn't rule out any capabilities, because a consequence-blind predictor could learn to predict any behavior. However, in practice using a consequence-blind training goal like pure imitation learning may be uncompetitive compared to methods like RL (or imitation + RL finetuning, the current dominant paradigm).

Consequence-blind agents (with a causal decision theory) can be seen as implementing a Lonely Causal Decision Theory (LCDT). An LCDT agent assumes that every other decision node of agents in the world (including its future decisions) are causally independent of its actions. This means it has no incentive to take actions which help its future itself or other agents for instrumental reasons.

Unlike the other forms of myopia above, the training goal for LLMs trained with self-supervised learning (SSL) is theoretically consequence-blind. In supervised or self-supervised learning, the training data already exists and is assumed to be causally independent from the model’s decisions. This means a model’s prediction should be based only on the likelihood of the output appearing in the training data. In particular, the model’s prediction should be independent of any effect from making the prediction itself, including whether or not the prediction would make the model more likely to predict or control the future correctly when run autoregressively.

The distinction between optimizing prediction accuracy and steering the distribution to be easier to predict is one of the most common sources of confusion about LLM myopia. Even though the LLM training goal is not value-myopic and optimizes for prediction accuracy across entire training examples, LLMs are not incentivized to predict tokens that make the future easier to predict.

Unfortunately, the analysis of consequence-blindness story becomes more complicated when the training data is not actually causally independent from the model’s prediction.

Self-reference and Model Generated Data

In practice, the assumption that the supervised learning distribution is causally independent from the model’s predictions may be violated. This can happen whenever a model is trained on data correlated with its behavior, or if the model thinks the correct prediction is correlated with its behavior.

Another causal diagram from Modeling AGI Safety Frameworks with Causal Influence Diagrams — *Another causal diagram from* *Modeling AGI Safety Frameworks with Causal Influence Diagrams*

This is concerning because it's not at all clear what a model that is predicting itself should output. It breaks many of the intuitions of why it should be safe to use LLMs as simulators of text distributions.

This problem is already apparent in current LLMs and is likely to become worse for several reasons.

Increasingly large portions of LLM pretraining and finetuning datasets will be generated or influenced by LLMs themselves
LLMs are often trained or used in ways designed to make them more situationally aware that they are an AI assistant responding to a user’s prompt
In my experience, larger models often become aware that they are a LLM generating text rather than predicting an existing distribution. This is possible because generated text drifts off distribution and can be distinguished from text in the training corpus.

We could try to combat this problem by excluding model generated text from all future training data, but it is still possible for a model trained on human-written text to drift off distribution and become situationally aware. You could see this as a training goal (outer alignment) problem: the decision-myopic base objective has to generalize somehow to self-referential predictions which are fundamentally not consequence-blind.

We could also try the opposite of actually increasing the amount of model generated text in the training dataset. This could allow us to shape the LLMs prior on how LLMs (and thus itself) should behave. However it is still possible for a model to realize (or believe) that it is different from the types of models in the training data and should thus act differently.

In this vein, there have been several great posts that discuss how the process of repeatedly training a model on its own predictions might converge. There has also been work considering the general implications of training a simulator in a closed loop. However the broader implications for generalization and alignment are unclear.

Myopic Training Goals vs Myopic Models

It is also important to note that even if a training goal is designed to be myopic in some way, the resulting model may not be myopic. Inner alignment failures can lead to non-myopic models emerging from myopic training goals. Finding a solution to inner alignment, or getting inner alignment by default does seem relatively likely for predictive SSL over other training goals, but it is not guaranteed. Many researchers believe the cognitive structures that are required to predict the answers to hard consequentialist problems will fundamentally be non-myopic, especially if these structures become situationally aware. Some examples.

A meme about non-myopia as an inner alignment failure — *Credit to* *Leo Gao* *for the meme*

It would be a huge success if we could find some way to enforce or verify that a model's internal computation satisfies some myopic criteria (or any criteria…) during or after training. However, it's not clear how we would go about this.

Meta

The ideas in the post are from a human, but most of the text was written by Chat GPT-4 with prompts and human curation using Loom. I endorse the post as technically correct and accurately phrased according to my understanding. Here is the second of two Loom trees used to generate most of the post before final edits.

A Loom tree used to generate a draft for this post with GPT-4

Here's a simple toy model that illustrates the difference between 2 and 3 (that doesn't talk about attention layers, etc.).

Say you have a bunch of triplets . Your want to train a model that predicts $z_{1}$ from $x$ and $z_{2}$ from $x, z_{1}$ .

Your model consists of three components: $f, g_{1}, g_{2}$ . It makes predictions as follows:
$y = f (x)$
$z_{1} = g_{1} (y)$
$z_{2} = g_{2} (y, z_{1})$

(Why have such a model? Why not have two completely separate models, one for predicting $z_{1}$ and one for predicting $z_{2}$ ? Because it might be more efficient to use a single $f$ both for predicting $z_{1}$ and for predicting $z_{2}$ , given that both predictions presumably require "interpreting" $x$ .)

So, intuitively, it first builds an "inner representation" (embedding) of $x$ . Then it sequentially makes predictions based on that inner representation.

Now you train $f$ and $g_{1}$ to minimize the prediction loss on the $(x, z_{1})$ parts of the triplets. Simultaneously you train $f, g_{2}$ to minimize prediction loss on the full $(x, z_{1}, z_{2})$ triplets. For example, you update $f$ and $g_{1}$ with the gradients
$\nabla_{θ_{0}, θ_{1}} l (z_{1}, g_{1}^{θ_{1}} (f^{θ_{0}} (x))$

and you update $f$ and $g_{2}$ with the gradients

$\nabla_{θ_{0}, θ_{2}} l (z_{2}, g_{2}^{θ_{2}} (z_{1}, (f^{θ_{0}} (x)))$ .
(The $z_{1}$ here is the "true" $z_{1}$ , not one generated by the model itself.)

This training pressures $g_{1}$ to be myopic in the second and third sense described in the post. In fact, even if we were to train $θ_{0}, θ_{2}$ with the $z_{1}$ predicted by $g_{1}$ rather than the true $z_{1}$ , $g_{1}$ is pressured to be myopic.

Type 3 myopia: Training doesn't pressure $g_{1}$ to output something that makes the $z_{2}$ follow an easier-to-predict (computationally or information-theoretically) distribution. For example, imagine that on the training data $z_{1} = 0$ implies $z_{2} = 0$ , while under $z_{1} = 1$ , $z_{2}$ follows some distribution that depends in complicated ways on $x$ . Then $g_{1}$ will not try to predict $z_{1} = 0$ more often.
Type 2 myopia: $g_{1}$ won't try to provide useful information to $g_{2}$ in its output, even if it could. For example, imagine that the $z_{1}$ s are strings representing real numbers. Imagine that $x$ is always a natural number, that $z_{1}$ is the $x$ -th Fibonacci number and $z_{2}$ is the $x + 1$ -th Fibonacci number. Imagine further that the model representing $g_{1}$ is large enough to compute the $x$ -th Fibonacci number, while the model representing $g_{2}$ is not. Then one way in which one might think one could achieve low predictive loss would be for $g_{1}$ to output the $x$ -th Fibonacci number and then encode, for example, the $x - 1$ -th Fibonacci number in the decimal digits. (E.g., $g_{1} (10) = 55.0000000000034$ .) And then $g_{2}$ computes the $x + 1$ -th Fibonacci number from the $x$ -th decimal. But the above training will not give rise to this strategy, because $g_{2}$ gets the true $z_{1}$ as input, not the one produced by $g_{1}$ . Further, even if we were to change this, there would still be pressure against this strategy because $g_{1}$ ( $θ_{1}$ ) is not optimized to give useful information to $g_{2}$ . (The gradient used to update $θ_{1}$ doesn't consider the loss on predicting $z_{2}$ .) If it ever follows the policy of encoding information in the decimal digits, it will quickly learn to remove that information to get higher prediction accuracy on $z_{1}$ .

Of course, $g_{1}$ still won't be pressured to be type-1-myopic. If predicting $z_{1}$ requires predicting $z_{2}$ , then $g_{1}$ will be trained to predict ("plan") $z_{2}$ .

(Obviously, $g_2$ is pressured to be myopic in this simple model.)

Now what about $f$ ? Well, $f$ is optimized both to enable predicting $z_{1}$ from $f (x)$ and predicting $z_{2}$ from $f (x), z_{1}$ . Therefore, if resources are relevantly constrained in some way (e.g., the model computing $f$ is small, or the output of $f$ is forced to be small), $f$ will sometimes sacrifice performance on one to improve performance on the other. So, adapting a paragraph from the post: The trained model for $f$ (and thus in some sense the overall model) can and will sacrifice accuracy on $z_{1}$ to achieve better accuracy on $z_{2}$ . In particular, we should expect trained models to find an efficient tradeoff between accuracy on $z_{1}$ and accuracy on $z_{2}$ . When $z_{1}$ is relatively easy to predict, $f$ will spend most of its computation budget on predicting $z_{2}$ .

So, $f$ is not "Type 2" myopic. Or perhaps put differently: The calculations going into predicting $z_{1}$ aren't optimized purely for predicting $z_{2}$ .

However, $f$ is still "Type 3" myopic. Because the prediction made by $g_{1}$ isn't fed (in training) as an input to $g_{2}$ or the loss, there's no pressure towards making $f$ influence the output of $g_{1}$ in a way that has anything to do with $z_{2}$ . (In contrast to the myopia of $g_{1}$ , this really does hinge on not using $g_{2} (f (x), g_{1} (f (x)))$ in training. If $g_{2} (f (x), g_{1} (f (x)))$ mattered in training, then there would be pressure for $f$ to trick $g_{1}$ into performing calculations that are useful for predicting $z_{2}$ . Unless you use stop-gradients...)

* This comes with all the usual caveats of course. In principle, the inductive bias may favor a situationally aware model that is extremely non-myopic in some sense.

This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.

Are you claiming this would happen even given infinite capacity?
If so, can you perhaps provide a simple+intuitive+concrete example?

This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.

Are you claiming this would happen even given infinite capacity?

I think that janus isn't claiming this and I also think it isn't true. I think it's all about capacity constraints. The claim as I understand it is that there are some intermediate computations that are optimized both for predicting the next token and for predicting the 20th token and that therefore have to prioritize between these different predictions.

In my experience, larger models often become aware that they are a LLM generating text rather than predicting an existing distribution. This is possible because generated text drifts off distribution and can be distinguished from text in the training corpus.

I'm quite skeptical of this claim on face value, and would love to see examples.

I'd be very surprised if current models, absent the default prompts telling them they are an LLM, would spontaneously output text predicting they are an LLM unless steered in that direction.

I can vouch that I have had the same experience (but am not allowed to share outputs of the larger model I have in mind). First encountered via curation without intentional steering in that direction, but I would be surprised if this failed to replicate with an experimental setup that selects completions randomly without human input. Let me know if you have such a setup in mind that you feel is sufficiently rigorous to act as a crux.

If you can come up with an experimental setup that does that it would be sufficient for me.

Many users of base models have noticed this phenomenon, and my SERI MATS stream is currently working on empirically measuring it / compiling anecdotal evidence / writing up speculation concerning the mechanism.