You could describe the behavior of an untuned GPT-like model[1] using a (peculiar) utility function. The fact that the loss function and training didn't explicitly involve a reward function doesn't mean a utility function can't represent what's learned, after all.
Coming from the opposite direction, you could also train a predictor using RL: choose a reward function and an update procedure that, together, amount to approximating the supervised loss function's gradient with numerical sampling. It'll tend to be much less efficient to train (and training might sometimes collapse), but it should be able to produce an equivalent result in the limit.
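As a toy illustration of that equivalence, here's a minimal sketch, assuming a stand-in linear "predictor" and an indicator reward (none of the specifics are a claim about any particular production setup): a REINFORCE update that rewards sampling the correct next token has an expected gradient that is a positive per-example rescaling of the cross-entropy gradient, so it climbs toward the same optimum, just with far more variance.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, hidden = 16, 32
model = torch.nn.Linear(hidden, vocab)   # stand-in for a next-token predictor
x = torch.randn(4, hidden)               # a small batch of "contexts"
target = torch.randint(0, vocab, (4,))   # ground-truth next tokens
logits = model(x)

# Supervised route: ordinary cross-entropy on the target token.
supervised_loss = F.cross_entropy(logits, target)

# RL route: sample a token, reward 1 if it matches the target, and apply a
# REINFORCE (score-function) update. Its expected gradient is a positive
# per-example rescaling of the cross-entropy gradient, so it points toward the
# same optimum, just with much higher variance.
probs = F.softmax(logits, dim=-1)
sampled = torch.multinomial(probs, 1).squeeze(-1)
reward = (sampled == target).float()
log_prob_sampled = F.log_softmax(logits, dim=-1).gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
reinforce_loss = -(reward * log_prob_sampled).mean()

print(supervised_loss.item(), reinforce_loss.item())
```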
And yet... trying to interpret simulators as agents with utility functions seems misleading. Why?
Instrumentality is why some agents seem more "agenty"
An RL-trained agent that learns to play a game that requires thousands of intermediate steps before acquiring reward must learn a policy that is almost entirely composed of intermediate steps in order to achieve high reward.
A predictor that learns to predict an output distribution which is graded immediately takes zero external intermediate steps. There is no apparent incentive to develop external instrumental capabilities that span more than a single prediction.
I'm going to call the axis spanning these two examples instrumentality: the degree to which a model takes instrumental actions to achieve goals.
I think this is at the heart of why the utility-maximizing agent lens seems so wrong for simulators: agents seem agenty when they take intermediate steps, which they do not value intrinsically, to accomplish some goal. Things like simulators don't- the action is the goal.
The "nice utility function you got there; would be a shame if someone... maximized it" framing runs into the reality that simulator-like models aren't learning to take actions to secure a higher probability of correct predictions or other instrumental behaviors. Interpreting their apparent utility function as implying the model "values" correct predictions doesn't fit because the model never visibly performs instrumental actions. That simply isn't what the model learned to implement, and so far, that fact seems natural: for all existing pure predictor implementations that I'm aware of, the loss basin that the model fell into corresponded to minimal instrumentality rather than more conventional forms of agentic "values" matching the reward function. This is evidence that these behaviors fall into different loss basins and that there are sufficient obstacles between them that SGD cannot easily cross.[2]
Instrumentality is a spectrum and a constraint
Training objectives that map onto utility functions with more degrees of freedom for instrumental action- like RL applied to a long task in an extremely open-ended environment- will tend to be subject to more surprising behavior after optimization.
Instrumentality is not binary; it is space in which the optimizer can run free. If you give it enough space, argmax will find something strange and strangely capable.
The smaller that space, the fewer options the optimizer has for breaking things.[3]
So, for a given capability, it matters how the capability is trained. You could approximate the prediction loss gradient with numerical sampling in RL and get an inefficiently trained version of the same model, as in the introduction, but what happens if you add in more space for instrumental actions? A couple of examples:
Instead of only grading the prediction of the next token, predict a long series of tokens, measure the quality of the generation against some critic[4], and include the later prediction rewards in the earlier token predictions. The choice of early tokens becomes partially instrumental, and the model is incentivized to pick early tokens which maximize the probability that the critic will like its later tokens (sketched in code below).
Assuming the model is predicting against a fixed ground truth, what if predicting the next token is a final step in a lengthy episode where the model has to navigate a physical environment, collect clues, trick gatekeepers, and fight monsters before finally outputting a distribution, and failing to output a prediction is punished?
This model clearly takes instrumental actions which are more consistent with "valuing" predictions. Would it take instrumental actions to make good predictions outside of an episode? Maybe! Hard to say! Sure seems far more likely to fall into that basin than the direct output-then-evaluation version.
Even with a "good" reward function, the learned path to the reward is still not an ignorable detail in an agent. It can be the most important part!
Instrumentality and the naturalness of agentic mesaoptimizers
Some degree of instrumentality is still possible for predictors in pathological cases. For example, predicting tokens corresponding to a self-fulfilling prophecy when that prediction happens to be one of a set of locally loss-minimizing options is not fought by the loss function, and a deceptive model that performs very well on the training distribution could produce wildly different results out of distribution.
I think one of the more important areas of research in alignmentland right now is trying to understand how natural those outcomes are.
I think simulators will exhibit some form of mesaoptimization at sufficient capability. It seems like a convergent, and practically necessary, outcome.
You could describe some of these mesaoptimizers as "agentic," at least in the same sense that you can describe the model as a whole as agentic by coming up with a contrived utility function.
I doubt that a minimal-instrumentality training objective will naturally lead to highly externally-instrumental agentic mesaoptimizers, because I don't understand how that type of capability reliably and accessibly serves training loss in use cases similar to today's untuned LLMs.
In this context, how would the process of developing instrumentality that reaches outside the scope of a forward pass pay rent? What gradients exist that SGD can follow that would find that kind of agent? That would seem to require that there is no simpler or better lower-instrumentality implementation that SGD can access first.
I am not suggesting that misaligned instrumental agentic mesaoptimizers are impossible across the board. I want to see how far away they are and what sorts of things lead to them with higher probability. Understanding this better seems critical.
Internal instrumentality
Requiring tons of intermediate steps before acquiring reward incentivizes instrumentality. The boundaries of a single forward pass aren't intrinsically special; each pass is composed of a bunch of internal steps which must be at least partly instrumental. Why isn't that concerning?
Well, as far as I can tell, it is! There does not appear to be a deep difference in kind. It seems to come down to what the model is actually doing, what kind of steps are being incentivized, and the span across which those steps operate.
A predictor mapping input tokens to an output distribution for the next token, at scales like our current SOTA LLMs, is pretty constrained- the space between the inputs and the output is small. It's not obviously considering a long episode of traditionally agentic behaviors within each forward pass, because it doesn't have reason to. The internal steps seem like they map onto something like raw machinery.
If I had to guess, the threshold past which you see models trained with a prediction objective start to develop potentially concerning internal instrumentality is very high. It's probably well beyond the sweet spot for most of our intuitive use cases, so we haven't seen any sign of it yet.
What happens if you try to go looking for it? What if, rather than a short jump mapping input tokens to the next token, you set up a contrived pipeline that directly incentivizes an internal simulation of long episodes?
My guess is that current models still don't have the scale to manage it, but in the future, it doesn't seem out of the realm of possibility that some sort of iterated distillation process that tries to incrementally predict more and more at once could yield a strong predictor that learns a worrying level of internal instrumentality. Could that be a path to the nasty type of agentic mesaoptimizer in simulators? Can you avoid this by being clever in how you choose the level of simulation[5]? I'm guessing yes, but I would like to do better than guessing!
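To make the shape of that speculation concrete, here's a hypothetical sketch (the whole pipeline, the "teacher"/"student" naming, and every parameter in it are assumptions for illustration, not an existing method): a frozen autoregressive teacher rolls out short continuations step by step, and a student is trained to reproduce increasingly long spans of those per-step distributions in a single forward pass.

```python
import torch
import torch.nn.functional as F

vocab, hidden, max_span = 16, 32, 4
teacher = torch.nn.Linear(hidden, vocab)              # frozen stand-in for an autoregressive predictor
student = torch.nn.Linear(hidden, vocab * max_span)   # predicts several steps at once
for p in teacher.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def teacher_rollout(ctx, span):
    # Toy rollout of the teacher's per-step target distributions. (Degenerate on
    # purpose: it reuses the same context; a real setup would feed sampled
    # tokens back in step by step.)
    return torch.stack([F.softmax(teacher(ctx), dim=-1) for _ in range(span)], dim=1)

for span in range(1, max_span + 1):                   # "predict more and more at once"
    for _ in range(100):
        ctx = torch.randn(8, hidden)
        with torch.no_grad():
            targets = teacher_rollout(ctx, span)      # (8, span, vocab)
        preds = student(ctx).view(8, max_span, vocab)[:, :span]
        loss = F.kl_div(F.log_softmax(preds, dim=-1), targets, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
```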
How is this different from myopia?
There is some overlap![6] In this framing, a model which is fully myopic in terms of outputs/actions (regardless of whether it considers the future in some way) is minimally instrumental. The degree of instrumentality exhibited by a model is potentially independent of how myopic a model's perception or cognition is.
Myopia is sometimes interpreted (reasonably, given its usual visual meaning) as implying a model is, well, short-sighted. I have seen comments that GPT-like models aren't myopic because (for example) they frequently have to model the aspects of the future beyond the next token to successfully predict that next token, or because training may encourage earlier tokens to share their computational space with the predictions associated with later tokens. Both of these things are true, but it's valuable to distinguish that from myopic (minimally instrumental) outputs. It seems like having fully separate words for these things would be helpful.
Why is this framing useful?
The less instrumentality a model has, the less likely it is to fight you. Making your internal reasoning opaque to interpretation is instrumental to deception; a noninstrumental model doesn't bother. Instrumentality, more than the other details of the utility function, is the directly concerning bit here.
This doesn't imply that noninstrumental models are a complete solution to alignment, nor that they can't have major negative consequences, just that the path to those negative consequences is not being adversarially forced as an instrumental act of the model. The model's still doing it, but not in service of a goal held by the model.
That's cold comfort for the person who oopsies in their use of such a model, or who lives in a world where someone else did, but I think there remains serious value there. An adversarial high-instrumentality model actively guides the world towards its malign goal; the noninstrumental model doesn't.
Without the ability to properly design larger bounds for an optimizer to work within, low instrumentality appears to be a promising- or necessary- part of corrigibility.
[1] For the rest of the post, I'll refer to this approximately by using the terms simulator or predictor.

[2] At least not yet.

[3] This also means that minimal-instrumentality training objectives may suffer from reduced capability compared to an optimization process where you had more open, but still correctly specified, bounds. This seems like a necessary tradeoff in a context where we don't know how to correctly specify bounds.

Fortunately, this seems to still apply to capabilities at the moment- the expected result for using RL in a sufficiently unconstrained environment often ranges from "complete failure" to "insane useless crap." It's notable that some of the strongest RL agents are built off of a foundation of noninstrumental world models.

[4] For encouraging instrumental action, it's important that the reward is not just a fixed ground truth: if there were a fixed ground truth being predicted, the predictor would have fewer options for making later predictions easier, because the expected later tokens do not change. (Notably, with a fixed ground truth, there is still space for worsening an early token if doing so is predicted to help later tokens enough to compensate for the loss; it's just less space.)

[5] For example, probably don't try to predict the output of an entire civilization in one go. Maybe try predicting reasoning steps or something else auditable and similarly scaled.

[6] Especially because myopia refers to a lot of different things in different contexts!