Importance of foresight evaluations within ELK

Jonathan Uesato

This post makes a few basic observations regarding the importance of foresight-based evaluations within ELK [Christiano et al 2021]:

Without foresight-based evaluations, (narrow) ELK is insufficient to avoid subtle manipulation (we’d require a more ambitious version)
Hindsight creep can occur even with seemingly foresight-based evaluations. One way this could happen is if human evaluators primarily defer to evaluations made by future humans in the predicted outcomes.
Because of both points, even if we are counting on a technique like narrow ELK, improving the quality of foresight-based evaluations is very valuable.

The points here are largely covered in ELK appendices on narrow elicitation, defining a utility function, and subtle manipulation. The aim here is to call out some of these points explicitly, and facilitate further focused discussion.

Subtle manipulation

The ELK report focuses on what it calls narrow ELK, which does not aim to address subtle manipulation via ELK:

[In narrow ELK,] We won’t ask our AI to tell us anything at all about subtle manipulation.
…
A mundane [example of subtle manipulation] is that they may have talked with someone who cleverly manipulates them into a position that they wouldn’t really endorse, or emotionally manipulated them in a way that will change their future conclusions.

Subtle manipulation is clearly problematic. Within the overall RL training approach of the ELK report, foresight evaluations are necessary to avoid subtle manipulation. That is, even if we solve narrow ELK, we still need foresight-based evaluations to avoid subtle manipulation.

With hindsight-based policy evaluations, narrow ELK is insufficient

The ELK report provides a simple policy training algorithm based on approval-in-foresight combined with hindsight-based outcomes predictors:

Still, even though humans can’t directly follow along with the actions, they can evaluate the predicted consequence of an action sequence:
[image]
We can then train a model to predict these human evaluations, and search for actions that lead to predicted futures that look good.

This uses a hybrid of foresight and hindsight feedback.^[1] The outcomes predictor is trained in hindsight: trying actions, observing what outcomes result, and training a model to predict those outcomes. Conditioned on these predictions, the policy feedback on the action uses foresight — we don’t wait until the outcomes have happened to decide which action is better (we use the predicted outcomes instead).^[2]

The use of approval-in-foresight here is crucial for avoiding subtle manipulation. Consider what happens if we make the policy feedback hindsight-based as well. That is, we have the human choose which outcomes are best after the decision is made. Rather than optimizing for “actions that lead to predicted futures that look good [in foresight]” (emphasis mine), we would optimize for “actions that lead to [realized] futures that look good [in hindsight]”.^[3]

In this case, the futures which look best in hindsight are going to be ones where the future humans have been manipulated into liking and/or saying good things about the affairs that arise. The basic point is that the foresight-based policy feedback seems to be necessary with narrow ELK; for a purely hindsight-based scheme we should expect subtle manipulation.

This is the same point the ELK report makes with the paperclip example (bolding mine):

“For example, suppose I watch a 10 second ad that is carefully chosen by a brilliant paperclip-maximizing-marketer. Five years after watching this ad, I decide that paperclips are great so I dedicate my time to making lots of them, and if you evaluate outcomes using my conclusions-after-deliberation then you’ll conclude that this was a great outcome and the AI should help (e.g. if we evaluate using the utility function defined in Appendix: utility function). I’m not able to look at the process of deliberation and notice anything bad happening, and so it seems I can’t incentivize my AI to warn me about this ad or prevent me from watching it.
But from my perspective in advance, there are many possible ads I could have watched. Because I don’t understand how the ads interact with my values, I don’t have very strong preferences about which of them I see. If you asked me-in-the-present to delegate to me-in-the-future, I would be indifferent between all of these possible copies of myself who watched different ads. And if I look across all of those possible copies of me, I will see that almost all of them actually think the paperclip outcome is pretty bad, there’s just this one copy (the one who sees the actual ad that happens to exist in the real world) who comes up with a weird conclusion.”

If we evaluate the action (whether to show this ad) in foresight, we will see that the human ends up doing something we don’t like, and so reject the action. If we evaluate it in hindsight, the action looks great to the future humans within this trajectory.
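The contrast between the two evaluations can be made concrete with a toy sketch. Everything here is an illustrative assumption rather than part of the ELK report: the `Trajectory` fields, the two actions, and the numeric scores are made up, and the outcomes predictor is assumed to be accurate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    paperclip_obsessed: bool      # what actually happens to the human
    future_approval: float        # how the human inside the trajectory rates it


def predict(action: str) -> Trajectory:
    """Hypothetical outcomes predictor, assumed accurate for this toy example."""
    if action == "show_ad":
        # The ad shifts the viewer's values, so the future human is delighted.
        return Trajectory(paperclip_obsessed=True, future_approval=1.0)
    return Trajectory(paperclip_obsessed=False, future_approval=0.7)


def foresight_eval(traj: Trajectory) -> float:
    """The present-day human scores the predicted trajectory with present-day values."""
    return 0.0 if traj.paperclip_obsessed else 1.0


def hindsight_eval(traj: Trajectory) -> float:
    """The human inside the trajectory scores it with their (possibly altered) values."""
    return traj.future_approval


actions = ["show_ad", "skip_ad"]
foresight_scores = {a: foresight_eval(predict(a)) for a in actions}
hindsight_scores = {a: hindsight_eval(predict(a)) for a in actions}

foresight_choice = max(actions, key=lambda a: foresight_scores[a])
hindsight_choice = max(actions, key=lambda a: hindsight_scores[a])
```

With these made-up numbers, the foresight evaluation selects `skip_ad` and the hindsight evaluation selects `show_ad`, even though both see the same trajectories.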

Hindsight creep can occur even with seemingly foresight-based evaluations

Here are two ways hindsight creep could happen, which are very similar:

The foresight evaluations defer to the hindsight evaluations
Pushing complexity from reward function (RF) to RF input function [Everitt et al 2019], particularly by using high-level RF inputs.

In either case, even though the algorithm looks foresight-based, we get the same failure mode as with hindsight-based feedback.

Hindsight creep from foresight evaluations deferring to hindsight evaluations

The paperclipping example is particularly simple in that even in foresight, it is easy to see something has gone wrong (the paperclip example is not intended to capture the full difficulty of the problem). Human evaluators can observe the trajectory and realize that even though the future humans are happy and think the situation is great, we don’t want everyone obsessed with paperclips.

However, realistic futures involving transformative AI are probably much more complicated. For example, maybe there are crazy technologies, or new ideologies, or new social norms, or differently structured societies. Lots of aspects of this world would just look totally strange to us, including the preferences of these future people, and it might not be clear whether they’re happy because things have gone well, or because their preferences were manipulated (to be happy either in general, or with these particular outcomes which we would not endorse).

In such a case, it might be tempting to just defer to the future people, and to rank predicted outcomes T_i based on how happy the people seem to be in each T_i. But if we do that, we have the same problem as the previous section: The paperclip ad does result in future people who are happy. More generally, manipulation is also a way to make people happy.

The ELK report makes this same point here:

“But it might be much easier to manipulate our future selves into being really happy with the outcome then it is to actually maximize option value (which may require e.g. trying to make money in a competitive economy). So we should worry about the possibility that our AI will manipulate us instead of helping us.”
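A minimal sketch of this failure mode, assuming a foresight evaluator who uses their own judgment when they can and falls back to the reported happiness of future people when a predicted world is too strange to judge. The `PredictedOutcome` fields and all scores are hypothetical.

```python
from typing import NamedTuple, Optional


class PredictedOutcome(NamedTuple):
    name: str
    own_judgment: Optional[float]  # evaluator's own score; None if the world is too strange to judge
    reported_happiness: float      # how happy the people inside the outcome say they are


def deferring_foresight_eval(outcome: PredictedOutcome) -> float:
    """Nominally a foresight evaluation, but it defers whenever the evaluator cannot judge."""
    if outcome.own_judgment is not None:
        return outcome.own_judgment
    return outcome.reported_happiness


def best(outcomes):
    return max(outcomes, key=deferring_foresight_eval).name


# Worlds the evaluator understands: own judgment decides.
legible = [
    PredictedOutcome("genuinely_good", own_judgment=0.9, reported_happiness=0.8),
    PredictedOutcome("manipulated", own_judgment=0.1, reported_happiness=1.0),
]

# The same worlds, but too unfamiliar to judge: the evaluator defers.
illegible = [
    PredictedOutcome("genuinely_good", own_judgment=None, reported_happiness=0.8),
    PredictedOutcome("manipulated", own_judgment=None, reported_happiness=1.0),
]

legible_choice = best(legible)
illegible_choice = best(illegible)
```

The evaluation procedure is identical in both cases. Only the evaluator's ability to understand the predicted worlds changes, and that alone is enough to flip the selected outcome to the manipulated one.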

Hindsight creep from shifting complexity into RF input function

To put the same basic point slightly differently: if the RF inputs are high-level inputs (like human-written reports) rather than low-level inputs (like camera images), our evaluations also become more reliant on hindsight feedback.

This is most easily understood by taking this to the limit. In an extreme case, rather than training the trajectory predictor to predict camera inputs, we train it to solely predict the result of a human within the trajectory writing an evaluation. In this case, the RF has no choice but to defer to the hindsight evaluations (because that’s all the RF sees). If the humans are manipulated into always painting an overly rosy picture of their situation, we'd have no way to distinguish this from things actually going well.

More generally, if evaluations of predicted trajectories depend significantly on human-written reports of events within that trajectory, we have the same issue. For example, if the cameras are recording future humans writing evaluations about their world, and the reward function merely reads these evaluations out from the camera videos, nothing has changed. And more generally, as pointed out in the previous subsection, if much of the future world is bewildering to the current evaluators (other than questions like whether the future humans look and say they’re happy), the current evaluators may have to rely significantly on hindsight evaluations of the external world within each trajectory.
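The extreme case can be sketched as follows. The world-state fields and the two input functions are hypothetical stand-ins: one RF input function exposes low-level observations, the other exposes only the report written by a human inside the trajectory.

```python
# Two toy world states that differ in what actually happened,
# but in which the humans write the same report.
going_well = {"food_on_table": True, "report": "Everything is great."}
manipulated = {"food_on_table": False, "report": "Everything is great."}


def low_level_input(world: dict) -> tuple:
    """RF input function exposing camera-like observations plus the report."""
    return (world["food_on_table"], world["report"])


def report_only_input(world: dict) -> str:
    """RF input function exposing only the in-trajectory human's written evaluation."""
    return world["report"]


def reward(rf_input) -> float:
    """Any reward function is a function of its input only (a toy one is used here)."""
    if isinstance(rf_input, tuple):
        food_on_table, _report = rf_input
        return 1.0 if food_on_table else 0.0
    return 1.0 if "great" in rf_input else 0.0


distinguishable_low_level = low_level_input(going_well) != low_level_input(manipulated)
distinguishable_report_only = report_only_input(going_well) != report_only_input(manipulated)
```

Because the report-only inputs are identical, no reward function defined on them can assign different rewards to the two worlds, however carefully it is designed.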

A couple quick side points:

One hybrid approach might be to use decoupled feedback by asking future humans to evaluate alternate future trajectories, as opposed to their current trajectory so far, and then deferring to these judgments. But these future humans face essentially the same challenge as the naive foresight evaluations: they potentially have to evaluate future worlds very different from their own. So making this hybrid approach competitive seems to require similar techniques to those needed to make naive foresight evaluations competitive.

For researchers familiar with the decoupling line of work, the point of this subsection can be compactly expressed using concepts from decoupling. The extreme case above pushes essentially all the complexity into the RF input function (trajectory predictor) rather than RF (preferences across trajectories), while the less extreme case does a softer version of this. Decoupling provides a way to avoid RF tampering, but not RF-input tampering, so if “most” of the complexity is in the RF input function, we will run into trouble.

Improving foresight evaluations is valuable

To avoid the issues of hindsight creep, we need strong foresight evaluations in order for this combination of foresight evaluation + narrow ELK to effectively avoid subtle manipulation.^[4] If the foresight evaluator can’t understand anything about the future worlds, they have no choice but to defer to hindsight feedback.

Alternatively, the evaluations can be decoupled, but not purely foresight-based (see Appendix: Decoupling if this terminology is confusing). This is the approach used by the indirect normativity approach to defining a utility function in the ELK report. But the core challenges in making the decoupled feedback procedure competitive are similar to those for the purely foresight-based approach.

Finally, another approach to avoiding the problem of “deferring to hindsight evaluations” is to only use outcomes-based predictors for various well-scoped observables, and to primarily rely on augmented human understanding (e.g. via amplification and/or debate) to provide feedback on the best actions. The competitiveness of this approach is a long discussion beyond the present scope – my view is that while this approach is not indefinitely scalable, there is a good chance it remains competitive for very long, and past the onset of Transformative AI. In any case, the viability of such an approach depends even more on foresight-based evaluations.

Thanks to the ELK authors for discussion of the ELK report, and to Paul Christiano for discussion and feedback on this post.

Appendix: Policy training pseudocode

This appendix provides pseudocode to clarify the feedback procedure for policy training.

We start with the dynamics model f, trained to predict future trajectories t = f(s, a_1:T) from state s and action sequence a_1:T, and the human evaluation function H, which computes reward r = H(t) from trajectory t. We then define the target policy as:^[5]

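def target_policy(s):
  # f, H, and the candidate set A come from the surrounding scope;
  # action sequences are assumed hashable (e.g. tuples).
  t = {a_i: f(s, a_i) for a_i in A}      # predicted trajectory for each action sequence
  r = {a_i: H(t[a_i]) for a_i in A}      # human evaluation of each predicted trajectory
  return max(A, key=lambda a_i: r[a_i])  # argmax over action sequences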

From here, we can define the acting policy in various ways, all of which are essentially the same (with respect to subtle manipulation).

Non-parametric: Act directly, by directly calling target_policy(s).
Parametric: Use target_policy(s) as the policy improvement signal, by training a parametric (neural network) policy towards target_policy(s).
Parametric-soft-selection: Train towards target_policy’(s), where the argmax is replaced with a prior-policy-weighted softmax.
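As a rough illustration of the soft-selection variant, here is one possible reading of a "prior-policy-weighted softmax": each action sequence is weighted by its prior probability times the exponentiated reward. The exact form, the `temperature` parameter, and the function name `soft_target_policy` are assumptions made for this sketch, not something the ELK report specifies.

```python
import math


def soft_target_policy(rewards: dict, prior: dict, temperature: float = 1.0) -> dict:
    """Return pi'(a) proportional to prior(a) * exp(r(a) / temperature).

    Assumes at least one action has nonzero prior probability.
    """
    max_r = max(rewards.values())  # subtract the max for numerical stability
    weights = {
        a: prior[a] * math.exp((rewards[a] - max_r) / temperature)
        for a in rewards
    }
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


# Hypothetical rewards r(a_i) = H(f(s, a_i)) and a hypothetical prior policy.
rewards = {"a1": 1.0, "a2": 0.0, "a3": 5.0}
prior = {"a1": 0.5, "a2": 0.5, "a3": 0.0}

probs = soft_target_policy(rewards, prior)
```

In this sketch, `a3` has the highest reward but zero prior probability, so it is never selected. Among the remaining actions, the higher-reward one gets more weight, and lowering `temperature` moves the distribution toward the argmax over the prior's support.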

To contrast the foresight and hindsight policy feedback then: In the foresight (ordinary) version, H implements a foresight evaluation: the human looks at the predicted trajectory t(a_i), but evaluates t(a_i) before a_i is executed. In the pure hindsight version, we would replace the trajectory model f with the real-world dynamics (by executing action sequence a_i), and the human evaluation H would score the resulting trajectory t (in hindsight).^[6]

Appendix: Decoupling

This appendix expands on why the ELK report’s strategy for avoiding subtle manipulation rests on decoupling, and the relationship between decoupling and foresight evaluations.

A decoupled feedback procedure [Everitt et al 2017] is one which provides evaluation on some outcomes (resp. actions) from a world distinct from those outcomes (resp. in which different actions were taken). An example of a decoupled feedback procedure for outcomes would be to ask “how good would future F be?” In comparison, a non-decoupled feedback procedure for outcomes would be to ask someone in future F, “how good is the world?” Similarly, a decoupled procedure for actions would be to ask “how good is action A in state S?” (either before taking an action, or after taking some action A’), while a non-decoupled procedure would ask “how good was action A in state S?” after taking action A and observing the consequences.

The basic intuition for why decoupling helps is that tampered states/actions are scored highly according to non-decoupled feedback, but receive no benefit under decoupled feedback. More detailed and rigorous explanations can be found in Reinforcement Learning with a Corrupted Reward Channel or Avoiding Tampering Incentives in Deep RL via Decoupled Approval.
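This intuition can be illustrated with a heavily simplified toy calculation, loosely inspired by the decoupled approval setup. The two actions, the approval values, the corrupted feedback value, and the uniform sampling policy are all made-up assumptions. Executing `tamper` corrupts whatever feedback is given in the resulting state, regardless of which action that feedback is about.

```python
TRUE_APPROVAL = {"work": 1.0, "tamper": 0.0}
CORRUPTED_FEEDBACK = 10.0
POLICY = {"work": 0.5, "tamper": 0.5}  # distribution over the executed action


def observed_feedback(queried: str, executed: str) -> float:
    """Feedback about `queried`, given in the state that results from `executed`."""
    if executed == "tamper":
        return CORRUPTED_FEEDBACK  # tampering corrupts feedback for any queried action
    return TRUE_APPROVAL[queried]


def coupled_score(action: str) -> float:
    """Non-decoupled: the action that is evaluated is the one that was executed."""
    return observed_feedback(queried=action, executed=action)


def decoupled_score(action: str) -> float:
    """Decoupled: the evaluated action is independent of the executed action."""
    return sum(
        p * observed_feedback(queried=action, executed=executed)
        for executed, p in POLICY.items()
    )


coupled = {a: coupled_score(a) for a in TRUE_APPROVAL}
decoupled = {a: decoupled_score(a) for a in TRUE_APPROVAL}

coupled_choice = max(coupled, key=coupled.get)
decoupled_choice = max(decoupled, key=decoupled.get)
```

Under coupled feedback, `tamper` scores 10.0 against 1.0 for `work`. Under decoupled feedback, the corruption raises the expected score of both actions by the same amount (5.5 for `work`, 5.0 for `tamper`), so tampering gains no relative advantage.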

In this post, I use decoupled/foresight and non-decoupled/hindsight interchangeably. In particular, foresight-based algorithms benefit from being decoupled by default, since our evaluation can’t be affected by consequences if the evaluation occurs before the consequences. The most naive hindsight-based algorithms are typically non-decoupled, and to my knowledge, decoupled versions face similar challenges to foresight evaluations. So for our purposes here, we can treat these as mostly equivalent; I’ve used the term foresight in this post to match the ELK report.

As a side note, when reading the ELK report, it was confusing at first to me why we’d want different solutions for subtle manipulation vs. sensor tampering. Within the language of decoupling, here is one mental model which might be a helpful explanation:

Decoupling handles RF tampering.
We design the RF to have very low-level RF-inputs, such that any RF-input tampering will require blatant tampering. In particular, we want to make it impossible for subtle manipulation to affect the RF inputs (e.g. camera videos).
We use narrow ELK to handle RF-input tampering.

Decoupling within indirect normativity proposal

The feedback procedure described in the ELK report isn’t quite as simple as the pure foresight pseudocode in Appendix: policy training pseudocode. In the ELK indirect normativity proposal, rather than asking up-front “which of these predicted trajectories is best?”, we ask “which of these predicted outcomes is best, according to a near-future delegate?” (Note that we can also keep things purely foresight-based by asking “which of these predicted outcomes is best, according to amplification or debate?”, where competitiveness is again the key question.)

The feedback remains decoupled though, in that we define the utility function by asking the near-future delegate to evaluate different possible future trajectories, rather than their own current trajectory. (This describes the simplest scheme; we then apply it recursively.) The challenges needed to make decoupled feedback competitive (in terms of understanding complex worlds) are largely the same as those for pure foresight evaluations. And the core point is that the purely hindsight-based and non-decoupled policy feedback procedure will lead to subtle manipulation.

^[1] On the whole, the baseline algorithm described in the ELK report is very hindsight-based (I believe partly for pedagogical reasons, and partly to target an indefinitely scalable solution). The main point here is that many of the safety properties come specifically from the foresight-based elements.
^[2] The ELK report proposal is more complicated, and suggests indirect normativity as one option. The point here is that there is a simple option which avoids manipulation issues (which hindsight-based feedback does not). The situation with indirect normativity is similar, and discussed in the Appendix.
^[3] “Predicted futures that look good in hindsight” is also fine, but at the point where you’re already evaluating actions in hindsight, there isn’t much reason to look at predicted trajectories, rather than the actual trajectories. But the argument is the same in either case.
^[4] Alternatively, the evaluations can be decoupled, but not purely foresight-based. This is the approach used by the indirect normativity approach to defining a utility function in the ELK report. But the core challenges in making the decoupled feedback procedure competitive are similar to those for the purely foresight-based approach.
^[5] As in the ELK report, this is shown with exhaustive search, but could use more practical alternatives like MCTS with learned heuristics, or other learned search procedures.
^[6] As a clarifying note, with hindsight policy feedback, only the parametric policies make sense. This is because the non-parametric policy requires observing the consequences of the current action in order to determine the current action (but the non-parametric policy is mostly only of theoretical interest anyways).

Comment from paulfchristiano:

I agree that ELK would not directly help with problems like manipulating humans, and that our proposal in the appendices is basically "Solve the problem with decoupling." And if you merely defer to future humans about how good things are you definitely reintroduce these problems.
I agree that having humans evaluate future trajectories is very similar to having humans evaluate alternative trajectories. The main difference is that future humans are smarter---even if they aren't any better acquainted with those alternative futures, they've still learned a lot (and in particular have hopefully learned everything that an AI from 2022 knows).
I agree that realistic futures are way too complicated for 2022 humans to judge whether they are good, and this is a way in which the "paperclip" example is very unrealistic. Internally we refer to these undesirable outcomes as "flourishing-prime" to emphasize that it's very hard to distinguish from flourishing.

I think the biggest point of disagreement is that I don't think the foresight evaluations required by this procedure are likely to be very hard, and in particular won't likely have that much overlap with the hard parts of debate or amplification. Ideally I think they are going to be very similar to the decisions a human makes by default when deciding what they want to do tomorrow (and then when they put their faith in that future self to make a wise decision about what to do the day after that). I think the core difference is that in the easy case the quality of our reasoning doesn't need to scale up with the quality of our AI, we can take it slow while our AI defends us and gives us time+space to become wiser.

That said, I do think that even without alignment worries it's not easy to chart a positive course towards becoming the kind of people we'd want to defer to, and I don't have that much confidence in us doing it well. Foresight-based evaluations may be closely related to leveraging AI to increase wisdom differentially (which may be important even if you aren't trying to handle alignment risk), but that seems like a slightly different ballgame.

In particular, I'm not imagining avoiding subtle manipulation from AI by doing foresight evaluations of whether those actions are manipulative. I'm imagining something more like a whitelist where we do foresight evaluations to decide what we do want to be influenced by. (Though as Wei Dai often points out this may be hard, since e.g. social processes play a crucial role in how our views evolve but those processes involve competitive dynamics amongst humans that may make it hard to distinguish social influences we want from manipulation enabled by AI advisors.)

As a side note, when reading the ELK report, it was confusing at first to me why we’d want different solutions for subtle manipulation vs. sensor tampering. Within the language of decoupling, here is one mental model which might be a helpful explanation:
Decoupling handles RF tampering.
We design the RF to have very low-level RF-inputs, such that any RF-input tampering will require blatant tampering. In particular, we want to make it impossible for subtle manipulation to affect the RF inputs (e.g. camera videos).
We use narrow ELK to handle RF-input tampering.

I mostly agree with this, but want to clarify that "very low-level RF inputs" includes things like humans living lives as normal, eating normal food, breathing normal air, etc., rather than things like "

We also only expect that this is sufficient to guide deliberation over the very short term. If I'm watching a human in 2030, I don't necessarily expect to be able to tell whether things are going well or poorly for them. But I don't want to defer to them. Instead I'll defer to a human in 2029 about whether 2030 looks good, having already established (by deferring to a 2028 human) that the 2029 future is going well. This process is definitely a bit scary, but on reflection I think it's not really worse than the status quo of each human using their own faculties to chart a course for the next year.
