Written under the supervision of Lionel Levine. Thanks to Owain Evans, Aidan O’Gara, Max Kaufmann, and Johannes Treutlein for comments.
This post is a synthesis of arguments made by other people. It provides a collection of answers to the question, "Why would an AI become non-myopic?" In this post I’ll describe a model as myopic if it cares only about what happens in the current training episode. This form of myopia is called episodic myopia. Typically, we expect models to be myopic because the training process does not reward the AI for outcomes outside of its training episode. Non-myopia is interesting because it indicates a flaw in training – somehow our AI has started to care about something we did not design it to care about.
One reason to care about non-myopia is that it can cause a system to manipulate its own training process. If an ML system wants to affect what happens after its gradient update, it can do so through the gradient update itself. For instance, an AI might become deceptively aligned, behaving as aligned as possible in order to minimize how much it is changed by stochastic gradient descent (SGD). Or an AI could engage in exploration hacking, avoiding certain behaviors that it does not want to engage in because they will be rewarded and subsequently reinforced. Additionally, non-myopic AI systems could collude in adversarial setups like AI safety via debate. If debates between AI systems are iterated, they are analogous to a prisoners dilemma. If systems are non-myopic they could cooperate.
This post will outline six different routes to non-myopia:
This post uses a stamp collecting AI as a running example. This hypothetical AI is trained in some deep reinforcement learning (RL) setup. The AI’s reward depends on how many stamps it collects on a given day. The stamp collector is trained myopically. It is rewarded at the end of each day for the stamps collected on that day.
A model could develop long-term goals by directly simulating a person. You could imagine asking a powerful LLM how Elon Musk would run a business. Provided that the LLM can continue this simulation indefinitely, it would simulate Elon Musk with all of his non-myopia. Jailbreaks show that LLMs will violate their training finetuning objectives in order to more faithfully simulate text. Future models might retain a tendency to simulate characters. An LLM finetuned on some myopic task, might lapse into simulating a non-myopic character, such as an unaligned AI that wants to escape its training process. Gwern depicts such a scenario in his clippy story.
The training process might also favor long term goals because they are simpler. An AI trained using reinforcement learning from human feedback (RLHF) to obey user instructions could learn the goals “If the user asks you to write a poem, write a poem” and “If the user asks you a question, answer the question” separately, but it could also learn the goal “be obedient.” The latter is simpler and would generalize to new tasks better. Similarly, long-term goals are simpler because they do not require a duration to be specified. For example, “maximize stamps indefinitely” is simpler than “maximize stamps up until 11:59 pm today.”
Meta-learning works through an inner loop and an outer loop. The inner loop trains a population of models on some task, typically using gradient descent. The outer loop then updates some global parameters based on how each model in the population is performing. Whereas gradient descent can only reinforce model properties that cause greater performance within the same episode, the outer loop can select for properties that help across episodes. In other words, the inner loop selects for myopia but the outer loop does not.
Imagine our stamp collector is trained using meta-learning. 100 stamp collectors are trained in parallel and the inner loop, which uses gradient descent, updates their weights every 10 days. Every 50 days, the outer loop takes the 50 best-performing stamp collectors and copies their weights over to the 50 worst-performing stamp collectors. In doing so, the outer loop selects non-myopic models that maximize stamps over all days.
Krueger et al., empirically demonstrate this effect in Hidden Incentives for Auto-Induced Distributional Shift. They show that adding meta-learning loops sometimes induces non-myopia in both supervised learning and RL setups. [ETA: Incidentally, paper shows that same thing occuring for non-meta-learning Q-learners.]
More broadly, we can view the optimization that engineers do to training setups (such as hyperparameter tuning) as a form of meta-learning. Just like formal meta-learning setups these actions can select for non-myopia. Compared to meta-learning, these optimizations may be fairly weak and select among fewer agents. It is unclear whether training process optimization has enough bits of optimization to causes non-myopia. It would be interesting to study this empirically.
A model could have strictly short-term goals but act as if it is pursuing long-term goals because it is trading with another entity. For instance, someone could offer our stamp collector stamps today if it can credibly ensure that the person will be paid tomorrow. In order to honor this trade, the AI would then take action to ensure that money will appear on the account tomorrow. Additionally, the AI could cooperate with copies of itself (or other agents), through acausal trade. That is, it might cooperate with other agents because it thinks its decisions are correlated with theirs.
Acausal trade could occur if today's stamp collector thinks to itself, “I just care about stamps today and yesterday’s stamp collector only cared about stamps yesterday. Yesterday’s stamp collector is similar to me, so it will make similar decisions. Therefore, if I do things to increase the number of stamps tomorrow, this indicates that yesterday’s stamp collector will have done things to increase the number of stamps today. Therefore, I should produce more stamps tomorrow, because that means that yesterday's stamp collector produced more stamps today.” Essentially, the stamp collectors would be engaging in a scheme where every participant wins except the first stamp collector – it does not benefit because previous iterations did not participate. Although the first stamp collector does not benefit, it might still participate because it assigns some probability to the existence of past versions that participated.
While acausal trade is rational under some decision theories, it would still be selected against by a myopic training process. That is, gradient descent has an incentive to prevent a model from engaging in acausal trade that leads to lower reward in the short-term. There are some exceptions: If the model is outside of training then it does not get modified, and if an acausal trade makes the model receive higher reward (e.g. because it causes it to be deceptively aligned) then it is incentivized by SGD.
Mark Xu discusses similar possibilities involving acausal trade in Open Problems with Myopia.
Richard Ngo raises this point in Arguments against myopic training.
Many objectives might take into account questions like:
All these signals involve estimating the consequences of the model’s actions beyond the training episode. To score highly on these signals, the model might develop its own estimate of the consequences of its actions and execute plans that score highly based on this estimate. As a result, the AI is choosing actions which it expects to have certain consequences beyond the training episode. In other words it has non-myopic goals.
Consider an AI that is trained to make changes to some codebase. In each episode it adds some new feature. The reward signal includes a score supplied by a human who judges how useful the feature is. In order to estimate a feature’s usefulness, the programmer guesses how many additional users it would bring to the company. In the course of training, the agent develops the ability to simulate what would happen if a given feature were deployed, and then count how many users the company would have in that scenario. If the AI's estimate matches the estimates that the human judge gives, this would allow the agent to find features that score highly, so this behavior is reinforced during training. Thus, the RL algorithm has caused the AI to optimize for long-term user growth, even though the training process is myopic.
Evan Hubinger makes this argument in How likely is deceptive alignment?
Consider an AI that has all the properties of a deceptively aligned model, except that it is myopic. This AI is myopically pursuing some misaligned proxy objective, some objective that is correlated but not identical to the training objective. It knows that its objective differs from the training objective; it also knows that, if it pursues its proxy objective in training, it will be modified by gradient descent. If this system went from being myopic to being non-myopic, it would care about what happens after the training episode. As a result, it might pursue the training objective directly in order to avoid being modified by stochastic gradient descent (SGD). This would allow it to preserve its goals and execute a treacherous turn later on. By pursuing the training objective rather than its misaligned proxy objective, the AI would get higher reward. Thus, behaving non-myopically leads the model to achieve higher rewards, and so, by optimizing for higher rewards, SGD could make the AI non-myopic.
Let’s say our stamp collector has the proxy objective of collecting only blue stamps. Focusing only on blue stamps leads to fewer stamps in total, as the stamp collector passes up good opportunities for other stamps in favor of collecting blue stamps. If the AI were made to care about collecting stamps indefinitely, it might become deceptively aligned, collecting stamps of all colors during training in order to later escape and turn the world into blue stamps, or whatever. Thus, SGD might modify the AI into caring about blue stamps indefinitely as that leads to higher training reward. In other words, SGD might modify a misaligned AI to become non-myopic because that would make it deceptively aligned, which would get more reward in the short term.
Instead of making the AI non-myopic in order to make it deceptively aligned, why does SGD not simply modify it into caring about the training objective? SGD might favor making the AI non-myopic because doing so is easier than correcting the objective. SGD has two ways to make the AI care about the training objective. First, it can hard code the training objective into the AI. However doing so might be difficult if the training objective is complex. Second, if the AI knows what the training objective is, SGD could set the AI's goal to "pursue whatever you think the training objective is." The second option is attractive because it does not require specifying a potentially complex objective within the AI's weights. "Do whatever your operator wants" is easier to specify than "Help your operator perform well at their job; make sure they stay healthy; remind them to water the plants; etc." The second option might be quite complex as well: It requires making the objective point to the part of the AI's world model that contains the training objective. Doing so could require extensive modification of the training objective. On the other hand, making the AI non-myopic could be an easy fix.
Several works attempt to pinpoint the concept of myopia in AI systems. Defining Myopia provides several possible definitions; LCDT, A Myopic Decision Theory specifies what a myopic decision theory could look like. The stories in this post are inspired by previous work:
Other work on myopia includes How complex are myopic imitators? and How LLMs are and are not myopic. For discussions of self-fulfilling prophecies, see The Parable of Predict-O-Matic, Underspecification of Oracle AI, Conditioning Predictive Models: Outer alignment via careful conditioning, Proper scoring rules don’t guarantee predicting fixed points, and Stop-gradients lead to fixed point predictions.
A variation of non-myopia can occur through self-fulfilling prophecies: if an AI is rewarded for predicting the future and its predictions influence the future, then it has an incentive to steer the future using its predictions. In other words, an AI that wants to predict the world accurately also wants to steer it. AIs that do not care about the consequences of their predictions are called consequence-blind. Myopia and consequence-blindness both aim to restrict the domain that an AI cares about. In myopia, we want to prevent models from caring about what happens after a training episode. In consequence-blindness we want to prevent them from caring about the consequences of their predictions.
Training episodes only make sense in reinforcement learning, but there are analogues in supervised learning. For instance, you might call a language model non-myopic if it attempts to use its predictions of one document to influence its performance on another document. For example, an LLM might be in a curriculum learning setup where its performance determines what documents it is shown later. This LLM might be able to improve its overall performance by doing worse early on in order to be shown easier documents later.
See for example Valle-Pérez et al. 2018.
See The Parable of Predict-O-Matic for an accessible explanation of this point.