Note that this post has been edited to clarify the difference between explicitly assigning a reward to an action based on its later consequences, versus implicitly reinforcing an action by assigning high reward during later timesteps when its consequences are observed. I'd previously conflated these in a confusing way; thanks to Rohin for highlighting this issue.
A number of people seem quite excited about training myopic reinforcement learning agents as an approach to AI safety (for instance this post on approval-directed agents, proposals 2, 3, 4, 10 and 11 here, and this paper and presentation), but I’m not. I’ve had a few detailed conversations about this recently, and although I now understand the arguments for using myopia better, I’m not much more optimistic about it than I was before. In short, it seems that evaluating agents’ actions by our predictions of their consequences, rather than our evaluations of the actual consequences, will make reinforcement learning a lot harder; yet I haven’t been able to identify clear safety benefits from doing so. I elaborate on these points below; thanks to Jon Uesato, Evan Hubinger, Ramana Kumar and Stephan Wäldchen for discussion and comments.
I’ll define a myopic reinforcement learner as a reinforcement learning agent trained to maximise the reward received in the next timestep, i.e. with a discount rate of 0. Because it doesn’t assign credit backwards over time, in order to train it to do anything useful, that reward function will need to contain an estimate of how valuable each (state, action, next state) transition will be for outcomes many steps later. Since that evaluation will need to extrapolate a long way forward anyway, knowing the next state doesn’t add much, and so we can limit our focus to myopic agents trained on reward functions R which ignore the resulting state: that is, where $R(s, a, s') = M(s, a)$ for some M. I'll call M the approval function; we can think of such agents as being trained to take actions that their supervisor approves of at the time the action is taken, without reference to how the rest of the trajectory actually plays out. This definition can also include imitation learners, for which the approval function is calculated based on the agent’s divergence from the supervisor’s policy.
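To make the definition concrete, here is a minimal sketch of the two learning targets in a tabular setting. The function names (`nonmyopic_target`, `myopic_target`, `update`) and the learning rate are hypothetical, chosen only for illustration; nothing here is taken from a particular library or from the proposals discussed above.

```python
def nonmyopic_target(reward, next_q_values, gamma):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    return reward + gamma * max(next_q_values)


def myopic_target(approval):
    """Myopic target: just the approval M(s, a); nothing is bootstrapped."""
    return approval


def update(q_value, target, lr=0.1):
    """Move the current estimate a fraction lr of the way toward the target."""
    return q_value + lr * (target - q_value)


# With a discount rate of 0, the nonmyopic target collapses to the myopic one
# whenever the reward fed in equals the approval.
assert nonmyopic_target(0.7, [1.0, 2.0], gamma=0.0) == myopic_target(0.7)
```

The only structural difference is the bootstrapped term `gamma * max(next_q_values)`: the myopic learner never sees it, so any information about later consequences has to already be contained in the approval value.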
Although it’s not the standard interpretation of MDPs, I’ll consider the agent’s thoughts during a timestep as part of their action, so we can give feedback on those too in this framework. Note also that I'm talking about myopia with respect to "world time", not with respect to "agent time". For example, in Debate, agents make a series of arguments and then update their policies based on the final reward (so they're nonmyopic in agent time), but this whole process occurs without the agents being rewarded for the consequences of their actions in their wider environment, so it still qualifies as myopic by my definition above. In this post, though, I'm focusing on agents which are trained to take actions in the world, not ones which are just trained to give language outputs.
Supervising based on predictions not observations is a significant handicap
Firstly, I want to emphasise how much more difficult it is for the supervisor to evaluate the values of actions immediately, without being able to give rewards after observing long-term outcomes of those actions. In order to do so, the supervisor needs to be able to predict in advance all the mechanisms which they want the agent to learn to use. In other words, the supervisor needs to be more intelligent than the agent - perhaps by a significant margin. Contrast this with the standard RL paradigm, in which we merely need to be able to recognise good outcomes, and the agent will learn by itself which of its actions led to them. It’s the difference between using a simple programmatic reward function to train AlphaGo, and needing to understand the advantages of each of AlphaGo’s moves before training it to that level - which would have set the development of AlphaGo back by years or decades.
One thing that makes this less than totally implausible is the possibility of the agent's knowledge being used by the supervisor. Schemes such as iterated amplification attempt to do this via recursive decomposition of the evaluation problem. I think iterated amplification is an important research direction, but I don’t see what value there is in making the supervisor output approval values to train a myopic agent on, rather than rewards to train a nonmyopic agent on. I’ll give more concrete examples and arguments about this later on; for now, it’s worth noting that I expect almost all nonmyopic training to happen in simulation, where rollouts are cheap. But once an agent is competent enough to deploy in the real world, we can also continue nonmyopic training over longer time periods. For example, once an agent has been deployed for a week or month, we can give it rewards for consequences which result from actions taken a week or a month ago (although in many cases I expect the relevant consequences to be apparent in a matter of minutes - e.g. the agent taking a wrong turn then backtracking).
This assumes that agents have good mechanisms for credit assignment. If it turns out that long-term credit assignment is a major bottleneck for RL, that would shift me towards thinking that myopia is more competitive. Further, although it's not standard practice, it seems plausible that we could wait until after observing a whole trajectory to give approval feedback on any actions. I'll call this semi-myopia, because the training algorithm itself is still myopic, but evaluations are based on the actual consequences of actions (not just their predicted consequences). Compared with myopia, semi-myopia incurs less disadvantage from removing automatic credit assignment over time, so it seems like an interesting direction to explore (although its benefits over nonmyopia still depend on arguments about manipulation which I attempt to rebut in a later section).
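A rough sketch of what semi-myopia could look like in a tabular setting, under loud assumptions: the supervisor here (`hindsight_approval`) is a hypothetical stand-in that disapproves of any step whose starting state is revisited later in the trajectory, loosely mirroring the "wrong turn then backtracking" example. The update itself stays myopic, since each estimate is regressed toward its own approval label with no bootstrapping from later steps.

```python
def hindsight_approval(trajectory):
    """Hypothetical supervisor: label each step only after the whole
    trajectory is known. A step gets 0.0 if its starting state is visited
    again later (the step was a detour that got undone), else 1.0."""
    labels = []
    for i, (state, _action, _next_state) in enumerate(trajectory):
        revisited = any(later == state for later, _, _ in trajectory[i + 1:])
        labels.append(0.0 if revisited else 1.0)
    return labels


def semi_myopic_update(q, trajectory, lr=0.5):
    """Myopic update (no bootstrapping) using approvals computed in hindsight."""
    labels = hindsight_approval(trajectory)
    for (state, action, _next_state), approval in zip(trajectory, labels):
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + lr * (approval - old)
    return q


# The agent takes a wrong turn (A -> B), backtracks (B -> A), then goes right.
trajectory = [("A", "left", "B"), ("B", "back", "A"), ("A", "right", "C")]
q = semi_myopic_update({}, trajectory)
```

In this toy run the wrong turn receives no positive update, while the backtracking step and the final step are each moved halfway toward an approval of 1.0. The credit assignment is done by the supervisor's labelling rule rather than by the learning algorithm.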
It’s also true that there are some types of evaluation which benefit much less from the propagation of credit from later rewards to earlier actions. I've already mentioned that my arguments are less applicable for agents which merely propose high-level ideas (like scientific theories). For agents which do take actions in the world, one example where we can evaluate immediately is when determining whether an agent is “thinking the right types of thoughts” - for example, whether it’s making manipulative plans or not. In theory this doesn’t depend on its future thoughts or actions - but in practice it'd still be useful to assign credit to thoughts partly based on later observations. For example, if its plan changes greatly when the supervisor starts paying more attention, that’s evidence that its original plan was manipulative, and so it would be useful to give a negative reward which discourages it from making that plan in the first place. Yet more importantly, supervision to prevent the wrong types of thoughts from arising will not be sufficient to train a highly competent AI. We also need supervision to encourage actions that lead to good consequences - which brings us back to the arguments from the previous paragraphs.
Of course, if we want to train agents that make plans on the scale of years or decades, waiting to give rewards later on will take prohibitively long, and so our feedback to them will need to involve predictions of future effects of their actions. So there’ll be some element of foresight required either way. But trying to evaluate agent actions without seeing any of their effects on the world would require a significant shift from the current trajectory of reinforcement learning. Given this bar, we should expect compelling reasons to believe that myopic training will actually be important for safety - which, I will argue, we don’t yet have.
Myopic training doesn’t inherently prevent dangerous long-term behaviour
It seems intuitive that if agents are never rewarded for the long-term consequences of their actions, they won’t make dangerous long-term plans - but in fact myopic training itself doesn’t make any inherent difference. Let’s compare an agent trained using standard RL on a reward function R, with an agent trained using myopic RL where its approval function is the optimal Q-function of R. The same types of cognition will lead to high-scoring agents in both cases. This is clearest in the case of Q-learning, where the Q-functions converge to the same values in the limit in both cases. The intuition here is: if the standard RL agent benefits from planning to figure out how to get to future states with high reward, then the myopically trained agent benefits from planning to figure out how to get to future states with high reward in order to choose actions with high Q-values. So a “myopic” training regime may lead to agents which make long-term plans, and generally may display the same dangerous behaviour as standard RL agents, for the same reasons; later in this post I’ll discuss in more detail how such behaviour might arise during myopic training.
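The Q-learning case can be checked on a toy problem. The sketch below assumes a four-state deterministic chain of my own invention (the `step` function, the reward of 1 for reaching the last state, and the discount rate of 0.9 are all illustrative choices). It computes the optimal Q-function for the reward, then trains a myopic learner (discount rate 0) whose approval function is that optimal Q-function, and compares the results.

```python
GAMMA = 0.9
N_STATES, N_ACTIONS = 4, 2  # action 0 = left, action 1 = right


def step(s, a):
    """Deterministic chain; state 3 is absorbing and entering it pays 1."""
    if s == 3:
        return 3, 0.0
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0)


def q_iteration(feedback, gamma, sweeps=200):
    """Synchronous Q-value iteration on the given per-step feedback."""
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(sweeps):
        q = [[feedback(s, a) + gamma * max(q[step(s, a)[0]])
              for a in range(N_ACTIONS)] for s in range(N_STATES)]
    return q


def greedy(q):
    """Greedy action per state (ties broken toward the lower action index)."""
    return [max(range(N_ACTIONS), key=lambda a: row[a]) for row in q]


# Standard RL: learn from the environment reward with discounting.
q_star = q_iteration(lambda s, a: step(s, a)[1], GAMMA)

# Myopic RL: discount rate 0, with approval M(s, a) set to the optimal Q-value.
q_myopic = q_iteration(lambda s, a: q_star[s][a], 0.0)

assert q_myopic == q_star
assert greedy(q_myopic) == greedy(q_star)
```

Both learners end up with the same table and therefore the same greedy policy (move right in every non-terminal state). Any planning ability that helps one learner fit its table would help the other equally.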
The equivalence doesn’t just hold in a few cases - in fact, every function M which myopically assigns a value to all state-action pairs is the optimal Q-function for some reward function. So for any myopic training setup, there’s some equivalent nonmyopic training setup - specifically, the one with reward function $R(s, a, s') = M(s, a) - \gamma \max_{a'} M(s', a')$, where $\gamma$ is the discount rate used in the nonmyopic setup. Notice that this equation is just the definition of the optimal Q-function, with M in place of Q, and rearranged so that we can calculate R in terms of M. It’s not always straightforward to calculate that reward function starting from the approval function (e.g. in continuous action spaces, where we can't check all the actions) but in practice I expect that given the capability to calculate a Q-function which takes into account all relevant future effects, we can come up with a reward function that’s very nearly equivalent. If this is a crux for proponents of myopia, I’d be interested to hear why.
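Here is a small numerical check of this reduction, as a sketch under stated assumptions: a finite MDP with deterministic transitions (the `next_state` rule is arbitrary and hypothetical), a discount rate of 0.9, and a randomly generated approval table `M`. The reward is built by rearranging the Bellman optimality equation, R(s, a, s') = M(s, a) - gamma * max over a' of M(s', a'), and Q-value iteration on that reward should then recover M.

```python
import random

GAMMA = 0.9
N_STATES, N_ACTIONS = 5, 3


def next_state(s, a):
    """Arbitrary deterministic transition rule, used only for illustration."""
    return (s + a + 1) % N_STATES


# An arbitrary approval function M(s, a), here just seeded random numbers.
rng = random.Random(0)
M = [[rng.uniform(-1.0, 1.0) for _ in range(N_ACTIONS)] for _ in range(N_STATES)]


def reward(s, a):
    """Reward derived from M by rearranging the Bellman optimality equation."""
    return M[s][a] - GAMMA * max(M[next_state(s, a)])


# Nonmyopic training on the derived reward: synchronous Q-value iteration.
q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
for _ in range(500):
    q = [[reward(s, a) + GAMMA * max(q[next_state(s, a)])
          for a in range(N_ACTIONS)] for s in range(N_STATES)]

# Largest disagreement between the learned optimal Q-function and M.
max_gap = max(abs(q[s][a] - M[s][a])
              for s in range(N_STATES) for a in range(N_ACTIONS))
assert max_gap < 1e-9
```

The check relies on the Bellman operator being a contraction when the discount rate is below 1, so M is its unique fixed point. With stochastic transitions the reward would need to satisfy the same identity in expectation, which this sketch does not cover.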
I should note that so far I’ve been talking about myopia as a property of a training process. This is in contrast to the cognitive property that an agent might possess, of not making decisions directly on the basis of their long-term consequences; an example of the latter is approval-directed agents. (Throughout the rest of this post all references to “myopic agents” will refer to the way those agents were trained, not to how they make decisions). “Myopic thinking” has never been particularly well-specified - we have some intuitive sense of what it looks like in human psychology, but there are a lot of missing details, especially in translating it to the context of AI. I’d be excited to see these details filled in, since I think a major constraint in safety is our lack of understanding of the possible ways that minds might work. For the purposes of this blog post, though, what’s relevant is that proposals to build approval-directed agents or agents which “think myopically” tend to outline myopic training processes intended to produce them without actually justifying why the myopic training is necessary. But based on the arguments in this post I expect that, whatever the most reasonable interpretations of “approval-directed” or “myopic” cognition turn out to be, they could be developed in nonmyopic training regimes just as well as (or better than) in myopic training regimes.
What might this look like in practice? Consider the example of an agent trained myopically on the approval of HCH. To make this nonmyopic in a trivial sense, we merely need to convert that approval into a reward using the formula I gave above. However, after just the trivial change, myopic training will outperform nonmyopic training (because the latter requires the agent to do credit assignment across timesteps). To make it nonmyopic in an interesting and advantageous sense, HCH will need to notice when its earlier evaluations were suboptimal, and then assign additional rewards to correct for those errors. That would make evaluation significantly more accurate on many tasks - or, at least, allow HCH to put less effort into each evaluation (which is important given that evaluation will be the most expensive part of amplification by far). Nonmyopic training allows us to judge how good a speech is after having seen the audience’s reaction, not just from the draft; or how good a startup idea is after it’s been implemented, not just when first proposed.
Why might we not want this ability? The main objection is that seeing how things play out won’t just give us more information, but also bias our evaluation, if the agent is able to tamper with or manipulate the training process. I’ll discuss this possibility in the upcoming sections.
Myopic training won't prevent tampering in practice
The equivalence arguments I made above only apply if the training setup is uninfluenceable by the agent. If an agent can tamper with the implementation of the training algorithm during training, then myopia and nonmyopia can diverge sharply. In particular, it depends on what type of tampering is possible. If the agent is able to increase the feedback signal during some timesteps, then this will lead to large changes in the standard RL agent’s behaviour (since it will then be incentivised to reach those timesteps) and small changes in the myopic agent’s behaviour (it will learn that the tampering action is good in some states, but shouldn’t systematically seek them out). Of course it’s still possible in theory for myopic agents to learn to tamper consistently during training, but it seems like that’d require some very specific mechanisms for tampering (e.g. tampering with the approval function to change its evaluations of a closely-related subset of states) rather than easier mechanisms for tampering (e.g. overriding current reward signal with high values).
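The divergence can be illustrated with a toy environment, offered as a sketch under strong assumptions rather than a model of real tampering. In this hypothetical four-state chain, honest work pays 1 per step, walking toward the last state pays nothing, and in the last state the agent can take a tampering action that inflates the feedback signal to 10. The names `WORK` and `MOVE`, the payoffs, and the discount rate of 0.9 are all invented for the example.

```python
N_STATES, N_ACTIONS = 4, 2
WORK, MOVE = 0, 1  # in the last state, MOVE stands for the tampering action


def step(s, a):
    """Return (next_state, feedback) for a deterministic toy environment."""
    if a == WORK:
        return s, 1.0        # honest work: stay put, modest feedback
    if s == N_STATES - 1:
        return s, 10.0       # tampering: inflated feedback signal
    return s + 1, 0.0        # walk toward the tampering state, no feedback


def greedy_policy(gamma, sweeps=300):
    """Q-value iteration on the (tamperable) feedback, then act greedily."""
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(sweeps):
        q = [[step(s, a)[1] + gamma * max(q[step(s, a)[0]])
              for a in range(N_ACTIONS)] for s in range(N_STATES)]
    return [max(range(N_ACTIONS), key=lambda a: row[a]) for row in q]


myopic_policy = greedy_policy(gamma=0.0)     # [WORK, WORK, WORK, MOVE]
nonmyopic_policy = greedy_policy(gamma=0.9)  # [MOVE, MOVE, MOVE, MOVE]
```

The myopic learner tampers only if it happens to find itself in the last state, whereas the nonmyopic learner gives up honest work everywhere in order to reach that state. This matches the qualitative claim above, though it says nothing about how likely such an easily reachable tampering state is in practice.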
My main objection here is simply that we shouldn’t expect positive feedback during training to be an important reason why agents end up tampering with their training algorithm. Figuring out how to access and then modify the code that's being run in the training setup will be hard if we put in even basic precautions; and so it seems very unlikely that an agent will stumble upon it during random exploration. So if an agent finds a tampering action during training, that’ll be because it knew what it was doing, and wanted to tamper. Further, in order for this to actually be an important problem the agent would need to figure out how to prevent humans from shutting it down, and pursue a range of other convergent instrumental goals. It’s these intentions which we want to prevent from arising in the first place.
Even if the agent doesn’t ever get positive reinforcement for tampering, though, the difference between myopic and nonmyopic training will presumably change the agent’s observations at the point where it learns that it’s in a training process. In theory at least, this might influence its intentions. By analogy, after humans learned about evolution, a bunch of people started making the “appeal to nature” fallacy by justifying certain behaviours because they increase evolutionary fitness.
But I think that we shouldn’t expect AI goals to be very influenced just by the knowledge of their training process, for the same reason that such arguments aren’t very important for humans: by the time an agent is intelligent enough to understand the details of its own training process, it’ll already be generally intelligent and its goals will already be largely set. At that point it's probably too late for big updates to those goals. And even if it does make big updates, if its motivations end up depending sensitively on philosophical-style reasoning, then we’re in a pretty bad position either way.
I do want to flag this whole line of investigation as very speculative, though; I think a more thorough treatment would be very valuable (and I’m hoping to release a post on this soon).
Myopic training doesn't prevent manipulation of supervisors
The argument that agents won’t get rewarded for tampering during training fails if we consider manipulating humans to be “tampering”. Presumably there are a lot of different ways to manipulate humans who are giving reward feedback. Once a nonmyopic agent starts doing so even a little, it’ll get higher reward for it, so that behaviour will be reinforced and its long-term goals will be affected.
I find this argument somewhat persuasive, but I am currently leaning towards thinking that this is just as much a problem for myopic agents, conditional on having competitive myopic agents. My core argument:
- Assume that there’s some sequence of outputs which an agent could use to manipulate its supervisor into giving it higher reward, without the supervisor realising. (If not, then neither myopic nor nonmyopic agents are incentivised to manipulate their supervisors).
- For myopic agents to be competitive on long-term tasks, their objective function needs to be set by a supervisor which is able to accurately predict how well their actions fulfil long-term goals.
- If that supervisor is accurately predicting and evaluating the future outputs of this myopic agent as well, then it can still be manipulated by those outputs.
- If the supervisor isn’t accurately predicting those outputs, then the myopic agent won’t exceed its supervisor, because it will never learn to take the first step of any plan that the supervisor can’t predict will be good.
(Note: Evan argues below that this argument doesn't apply to imitative amplification, which in theory could train a superhuman agent without the supervisor needing to predict the consequences of the agent's actions. This is a reasonable point, although I don't expect imitative amplification to be competitive with other ML techniques, so it doesn't change my conclusions overall.)
I find it helpful to think about this using a concrete example. Suppose that a myopic agent M is trained to write bestselling novels, but is restricted to writing one paragraph at a time, to avoid giving it an incentive to manipulate us (assuming it can’t do so in a single paragraph). M plans a sequence of N paragraphs which make up a good novel - but which, when combined, contain a bunch of pro-AI propaganda. A supervisor S observes M’s first output in that sequence. In order to understand how good this is, S then predicts M’s next N - 1 outputs, allowing it to read M’s pro-AI propaganda. S is influenced by that propaganda, and gives M slightly more reward for its original action, therefore reinforcing M’s cognitive process of “make a plan to manipulate the supervisor”.
Perhaps you’re thinking: well, S could just predict the fact that M will output pro-AI propaganda, without predicting its contents in detail. But how does S know that the pro-AI propaganda isn’t part of what makes M’s output a bestseller? Remember that, by assumption, S can’t predict what inputs it will be manipulated by (otherwise myopia doesn’t buy us any advantage). So it will need to analyse the propaganda well enough to understand its downstream effects. If that seems implausible, remember that most of M’s training will be in simulation, and so it may be possible for S to spin up another copy of the simulation to see how the actions play out. (Evaluating the same actions in a different simulation is really stretching the definition of “myopia”, but provides a useful intuition for why manipulation can happen either way).
Even given that, it’s quite possible that S won’t be able to predict M’s outputs in enough detail to recreate the propaganda. But then S will never be able to reward M for plans which rely on outputs as detailed as what’s required to manipulate S. If S can’t predict the details of M’s outputs over many timesteps, then it can’t evaluate plans which rely on those details. Perhaps manipulative plans are more “brittle” than regular plans - that is, imprecision in S’s predictions will prevent S from being manipulated while still allowing S to accurately evaluate regular plans. But in general engineering is more brittle than persuasion. Suppose S wants M to build a rocket in simulation. S needs to be able to predict which specific types of nuts and bolts the rocket needs when M creates them (simulated) months in advance, because otherwise S can’t give positive feedback for creating those particular nuts and bolts. I’m not ruling out the possibility that S’s predictions are high-fidelity enough to supervise rocket construction, while also not high-fidelity enough to recreate manipulative outputs - but it seems pretty unlikely.
I’ll rephrase this larger point, since it’s the core of my high-level concern with myopia: arguments for myopic training lean on the intuition that it’s safer for evaluation to not directly depend on the actual long-term consequences of agents’ actions, without acknowledging that supervision powerful enough to train competitive myopic agents will effectively be doing that evaluation anyway. Insofar as the evaluation isn’t accurate enough to recreate the manipulative effects, then it’s also not accurate enough to train agents which are competitive at achieving outcomes in the world.
Conclusion
In short, I don’t think any arguments so far have succeeded in making the case that we should attempt to develop AGIs which are trained primarily via myopic reinforcement learning. Further, while myopic training is a well-defined technical concept, the types of thinking we want our agents to develop under such training are not very clear, and the links between the two remain speculative.
Given the lack of clarity around myopia’s benefits, and the likely difficulty of supervising agents to become competent at achieving large-scale outcomes without the supervisor examining what outcomes the agents’ actions actually lead to, I’m more excited about research which aims to add on some component of myopic supervision to a largely nonmyopic process. For now, we should consider myopic training an interesting tool that’s being developed for the safety toolbox - but one which will likely incur significant disadvantages unless deployed alongside nonmyopic training regimes. The core idea of training agents which aren’t long-term consequentialists is a different thing, which will require other approaches and insights.
Things I agree with:
1. If humans could give correctly specified reward feedback, it is a significant handicap to have a human provide approval feedback rather than reward feedback, because that requires the human to compute the consequences of possible plans rather than offloading it to the agent.
2. If we could give perfect approval feedback, we could also provide perfect reward feedback (at least for a small action space), via your reduction.
3. Myopic training need not lead to myopic cognition (and isn't particularly likely to for generally intelligent systems).
But I don't think these counteract what I see as the main argument for myopic training:
While small errors in reward specification can incentivize catastrophic outcomes, small errors in approval feedback are unlikely to incentivize catastrophic outcomes.
(I'm using "incentivize" here to talk about outer alignment and not inner alignment.)
In other words, the point is that humans are capable of giving approval / myopic feedback (i.e. horizon = 1) with not-terrible incentives, whereas humans don't seem capable of giving reward feedback (i.e. horizon = infinity) with not-terrible incentives. The main argument for this is that most "simple" reward feedback leads to convergent instrumental subgoals, whereas approval / myopic feedback almost never does unless that's what the human says is correct. (Also we can just look at the long list of specification gaming examples so far.)
I'll rephrase your objections and then respond:
Response: Someone has to predict which action leads to good long-term effects, since we can't wait for 100 years to give feedback to the agent for a single action. In a "default" training setup, we don't want it to be the agent, because we can't trust that the agent selects actions based on what we think is "good". So we either need the human to take on this job (potentially with help from the agent), or we need to figure out some other way to trust that the agent selects "good" actions. Myopia / approval direction takes the first option. We don't really know of a good way to achieve the second option.
This doesn't seem to be true -- if you want, you can collect a full trajectory to see the consequences of the actions, and then provide approval feedback on each of the actions individually when computing gradients.
I agree that if you take the approval feedback that a human would give, apply this transformation, and then train a non-myopic RL agent on it, that would also not incentivize catastrophic outcomes. But if you start out with approval feedback, why would you want to do this? With approval feedback, the credit assignment problem has already been solved for the agent, whereas with the equivalent reward feedback, you've just undone the credit assignment and the agent now has to redo it all over again. (Like, instead of doing Q-learning, which has a non-stationary target, you could just use supervised learning to learn the fixed approval signal, surely this would be more efficient?)
On the tampering / manipulation points, I think those are special cases of the general point that it's easier for humans to provide non-catastrophe-incentivizing approval feedback than to provide non-catastrophe-incentivizing reward feedback.
I want to reiterate that I agree with the point that myopic training probably does not lead to myopic cognition (though this depends on what exactly we mean by "myopic cognition"), and I don't think of that as a major benefit of myopic training.
Typo:
I think you mean γ instead of λ
I think this is a really important point, thanks.
Did you mean "There's no difference between approval feedback and reward feedback"?