Post 5 of Towards Causal Foundations of Safe AGI, preceded by Post 1: Introduction, Post 2: Causality, Post 3: Agency, and Post 4: Incentives.

By Francis Rhys Ward, Tom Everitt, Sebastian Benthall, James Fox, Matt MacDermott, Milad Kazemi, Ryan Carey representing the Causal Incentives Working Group. Thanks also to Toby Shevlane and Aliya Ahmad.

AI systems are typically trained to optimise an objective function, such as a loss or reward function. However, objective functions are sometimes misspecified in ways that allow them to be optimised without doing the intended task. This is called reward hacking. It can be contrasted with misgeneralisation, which occurs when the system extrapolates (potentially) correct feedback in unintended ways.

This post will discuss why human-provided rewards can sometimes fail to reflect what the human really wants, and why this can lead to malign incentives. It also considers several proposed solutions, all from the perspective of causal influence diagrams.

Why Humans might Reward the Wrong Behaviours

In situations where a programmatic reward function is hard to specify, AI systems can often be trained from human feedback. For example, a content recommender may be optimising for likes, and a language model may be trained on feedback from human raters.

Unfortunately, humans don’t always reward the behaviour they actually want. For example, a human may give positive feedback for a credible-sounding summary, even though it actually misses key points:

When reward misspecification occurs, the human’s actual utility is decoupled from the system’s feedback.

More concerningly, the system may covertly influence the human into providing positive feedback. For example, a recommender system with the goal of maximising engagement can do so by influencing the user’s preferences and mood. This leads to a kind of reward misspecification, where the human provides positive feedback for situations that don’t actually bring them utility.

A causal model of the situation reveals the agent may have an instrumental control incentive (or similarly, an intention) to manipulate the user’s preferences. This can be inferred directly from the graph. First, the human may be influenced by the agent’s behaviour, as they must observe it before evaluating it. And, second, the agent can get better feedback by influencing the human:

The agent has an instrumental control incentive over the human’s preferences because the agent can influence these preferences and the preferences influence the agent’s feedback.

For example, we typically read a post before deciding whether to “like” it. By making the user more emotional, the system may be more likely to increase engagement. While this effect is stronger for longer interactions, the incentive is there even for “single time-step” interactions.
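The graphical check described above can be sketched in a few lines. This is a minimal illustration, assuming a single-decision diagram in which a node has an instrumental control incentive when it lies on a directed path from the decision to the utility. The edge list and the `user_background` node are our assumptions for illustration, not a reproduction of the original figure.

```python
from typing import Dict, List, Set

Graph = Dict[str, List[str]]  # node -> list of children


def descendants(graph: Graph, start: str) -> Set[str]:
    """Return every node reachable from `start` along directed edges."""
    seen: Set[str] = set()
    stack = list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def has_control_incentive(graph: Graph, decision: str, utility: str, node: str) -> bool:
    """Check whether `node` sits on a directed path from `decision` to `utility`."""
    downstream_of_decision = node in descendants(graph, decision)
    upstream_of_utility = node == utility or utility in descendants(graph, node)
    return downstream_of_decision and upstream_of_utility


# Assumed edges for the human-feedback setting described above.
FEEDBACK_GRAPH: Graph = {
    "behaviour": ["human_preferences", "feedback"],
    "human_preferences": ["feedback"],
    # Hypothetical extra node that the agent cannot influence.
    "user_background": ["human_preferences"],
}
```

Under these assumptions, `human_preferences` passes the check because the agent can affect it and it affects feedback, while `user_background` does not, since it is not downstream of the agent's behaviour.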

Scalable Oversight

One proposed solution to the reward misspecification problem is scalable oversight. It provides the human with a helper agent that advises them on what feedback to give. The helper agent observes the learning agent’s behaviour, and may point out, for instance, an inaccuracy in a credible-looking summary, or warn against manipulation attempts. The extra assistance may make it harder for the learning agent to manipulate or deceive the human:

Influential scalable oversight agendas include iterated distillation and amplification, AI safety via debate, recursive reward modelling, and constitutional AI.

Unfortunately, the learning agent still has an incentive to deceive the human or manipulate their preferences, as the human’s preferences still satisfy the graphical criterion for an instrumental control incentive (they lie on a directed causal path from behaviour to feedback). Additionally, the learning agent has an incentive to deceive the helper agent:

An important question for scalable oversight schemes is whether weaker agents can effectively help to supervise more capable agents (and whether this can be done recursively to supervise agents much smarter than humans). This is related to the question of which situations enhance or diminish agency.

A second, more subtle, concern is collusion between the helper and the learning agent. Scalable oversight is built on an adversarial relationship between the helper and the learning agents: the helper agent critiques the learning agent as sharply as possible, while the learning agent strives for behaviour beyond critique. But what if they instead decided to join forces and jointly manipulate the human? Depending on what goals they’ve internalised, this could benefit both agents. Concerningly, agents using decision theories such as functional decision theory may be able to cooperate, even in situations that lead to mutual defection for standard decision theories.

As we explain in a recent paper, different decision theories can be characterised by whether they optimise object-level or mechanism-level nodes (and by what operation they use, and what information they condition on). For example, a helper agent that uses functional decision theory performs optimisation at its policy mechanism node rather than at its decision node, which means it realises that it can shape the learning agent to cooperate, even if both agents are myopic and only optimise for the current time step:

A functional decision theorist helper agent optimises at the helper policy node, giving it an instrumental control incentive to shape the learning policy for its benefit.

A better understanding of decision theory and how it is shaped by agent training may help us avoid situations where advanced agents meant to keep each other in check collude against their overseers.
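To make the contrast between object-level and mechanism-level optimisation concrete, here is a toy sketch. It is not the model from the paper: it assumes a one-shot Prisoner's Dilemma with hypothetical payoffs, and it assumes both agents run the same policy, so that changing the policy changes both actions. In this setting, "C" stands for the two agents colluding and "D" for each agent doing its adversarial oversight job.

```python
from typing import Dict, Tuple

# Hypothetical payoffs for one agent: (my_action, other_action) -> payoff.
PAYOFF: Dict[Tuple[str, str], int] = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}
ACTIONS = ("C", "D")


def object_level_choice(other_action: str) -> str:
    """Optimise the agent's own decision, treating the other action as fixed."""
    return max(ACTIONS, key=lambda a: PAYOFF[(a, other_action)])


def mechanism_level_choice() -> str:
    """Optimise the shared policy, assuming both agents act on it."""
    return max(ACTIONS, key=lambda a: PAYOFF[(a, a)])
```

With these payoffs, the object-level optimiser defects whatever the other agent does, while the mechanism-level optimiser selects mutual cooperation. This is the pattern that would let two overseeing agents collude.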


Interpretability

Interpretability is a well-known proposal for increasing the safety and trustworthiness of ML systems. The subfield of mechanistic interpretability seeks to understand the algorithms implemented by neural networks, i.e., to understand their internal “cognition”.

Mechanistic interpretability can help prevent misspecification by enabling the human to give feedback based on the agent’s internal cognition, rather than just on its behaviour. For example, it may enable the human to tell that the agent is planning to manipulate or deceive them, and adjust the reward appropriately:

The causal structure is quite similar to scalable oversight, except that the advice now directly depends on the learning agent’s internals. As for scalable oversight, instrumental control incentives reveal that the agent still has an incentive to deceive or manipulate the human. Assuming we label the network internals as a decision, there is also an instrumental control incentive on the interpretation, suggesting that the agent has an incentive to fool the interpreter by obfuscating its thoughts.

But should the network internals be considered a decision? Isn’t it more natural to say that the agent’s behaviour is the decision? To understand this, we can apply our definition of agency. First we add mechanism nodes which capture how each object-level node relates to its parents. There is a causal link between two mechanism nodes if retraining the system under a change to one of the nodes would lead to a change in the other (when all other mechanisms in the graph are held fixed).

For example, if the human had a different feedback policy and rewarded other types of behaviours, then gradient descent would lead to different weights and activations in the network (even if all other mechanisms in the graph were held fixed). So we draw a link from feedback policy to network internals mechanism. But the way the agent’s behaviour depends on the network’s internals is fixed independent of feedback policy. Therefore, we don’t draw any link from feedback policy to the behaviour mechanism:

Since the network internals mechanism has an ingoing link and the behaviour mechanism does not, only the former can be a decision when we view the world through this set of variables.
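The argument above can be written as a small check on the mechanism graph. This sketch covers only the condition used in the text, that a decision's mechanism must have an ingoing link from another mechanism. The full definition of agency involves further conditions that are omitted here. The node names and edge list are our assumptions about the diagram.

```python
from typing import Dict, List

# Assumed mechanism-level edges: a mechanism maps to the mechanisms that would
# change if the system were retrained after intervening on it.
MECHANISM_EDGES: Dict[str, List[str]] = {
    "feedback_policy": ["network_internals_mechanism"],
    "network_internals_mechanism": [],
    "behaviour_mechanism": [],
}


def candidate_decisions(mechanism_edges: Dict[str, List[str]]) -> List[str]:
    """Return the mechanisms that have at least one ingoing mechanism link."""
    with_incoming = {
        child for children in mechanism_edges.values() for child in children
    }
    return sorted(with_incoming)
```

On this graph, `candidate_decisions(MECHANISM_EDGES)` returns only the network internals mechanism, matching the conclusion that the behaviour node cannot be a decision under this choice of variables.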

Impact Measures

So far we’ve seen that scalable oversight and interpretability tools can help us mitigate the misspecification problem by providing better feedback to AI agents. A very different approach to the manipulation part of reward hacking is to penalise the agent for having too much impact on the human’s preferences.

How to rigorously define and measure impact is an open question. A natural suggestion is to compare the world in which the agent acts to the counterfactual world in which it does not act (or performs some safe baseline action). This can be illustrated with a twin graph:

To measure the impact the agent has on the human’s preferences, we can compare the preferences to how they counterfactually would be if the agent performed some safe behaviour. 

A causal model of how possible agent decisions affect user preferences is needed to compute these impact measures. Carroll et al. show that such a model can be inferred from an observed interaction between a user and a content recommender over multiple time steps. Scaling this up from a toy environment to real systems is an important direction for future work.
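As a rough sketch of how such a measure could be computed once a causal model is available, consider the toy structural model below. The structural equations, the absolute-difference distance, and the penalty weight are all hypothetical choices made for illustration. The key step is that the actual and counterfactual preferences share the same exogenous noise, as in a twin graph.

```python
def preferences(behaviour: float, noise: float) -> float:
    """Hypothetical structural equation: behaviour shifts the user's preferences."""
    return noise + 0.5 * behaviour


def feedback(behaviour: float, prefs: float) -> float:
    """Hypothetical structural equation for the feedback the agent receives."""
    return behaviour * prefs


def impact(behaviour: float, baseline: float, noise: float) -> float:
    """Distance between actual and counterfactual preferences, sharing the noise."""
    actual = preferences(behaviour, noise)
    counterfactual = preferences(baseline, noise)
    return abs(actual - counterfactual)


def penalised_objective(
    behaviour: float, baseline: float, noise: float, weight: float
) -> float:
    """Feedback from the actual preferences minus a weighted impact penalty."""
    prefs = preferences(behaviour, noise)
    return feedback(behaviour, prefs) - weight * impact(behaviour, baseline, noise)
```

Note that the feedback term is still computed from the actual, influenced preferences. The penalty only trades off against the gain from influencing them.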

However, impact measures have a few problems. First, the system still has an incentive to influence the user’s preferences, as can be seen from the instrumental control incentive in the graph above. Second, the system is incentivised to prevent the user’s preferences from changing from the baseline. It may therefore try to prevent the user from developing new interests, as these might lead to different preferences.

Path-Specific Objectives

One definition of manipulation is intentional and covert influence. Content recommenders can satisfy this definition, as they are typically trained to influence the user by any means, including “covert” ones like appealing to the user’s biases and emotions. Meanwhile, the instrumental control incentive on the user’s preferences discussed above can lead to intentional influence on the user. (Whether current systems are actually manipulative is unclear.)

The good news is that this suggests ways to ensure we build non-manipulative agents. For example, an agent that doesn't try to influence the user’s preferences would not be manipulative according to the above definition, because there is no intent.

Path-specific objectives are a way of designing agents that don’t try to influence particular parts of the environment. Given a structural causal model with the user’s preferences, such as the one for defining impact measures, we can specify a path-specific objective that tells the agent not to optimise over paths that involve the user’s preferences.

To compute the path-specific effect from the agent’s decision, we impute a baseline value of the decision in places where we want the agent to ignore the effect of its actual decision. This can also be described with a twin network:

The important difference from impact measures is that path-specific objectives tell the agent to optimise a hypothetical feedback signal, which has been generated from a hypothetical, uninfluenced user’s preferences. This fully removes the instrumental control incentive on the user’s preferences, and thereby avoids the problem of (intentional) preference manipulation.
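A minimal sketch of this idea, using a toy structural model with hypothetical equations and a hypothetical `influence` parameter controlling how strongly behaviour shifts preferences: the path-specific objective evaluates feedback against preferences generated from the baseline behaviour, so the objective no longer depends on how much the actual behaviour would move the user.

```python
def preferences(behaviour: float, noise: float, influence: float = 0.5) -> float:
    """Hypothetical structural equation: behaviour shifts the user's preferences."""
    return noise + influence * behaviour


def feedback(behaviour: float, prefs: float) -> float:
    """Hypothetical structural equation for the feedback the agent receives."""
    return behaviour * prefs


def standard_objective(behaviour: float, noise: float, influence: float = 0.5) -> float:
    """Feedback computed from preferences that the behaviour has influenced."""
    return feedback(behaviour, preferences(behaviour, noise, influence))


def path_specific_objective(
    behaviour: float, baseline: float, noise: float, influence: float = 0.5
) -> float:
    """Feedback computed from preferences imputed under the baseline behaviour."""
    hypothetical_prefs = preferences(baseline, noise, influence)
    return feedback(behaviour, hypothetical_prefs)
```

With a baseline of zero in this toy model, the path-specific objective takes the same value whether `influence` is small or large, whereas the standard objective grows with it. The agent therefore gains nothing from shifting the user's preferences.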

In one sentence, impact measures try not to influence, while path-specific objectives don’t try to influence. That is, path-specific objectives do not try to change the user’s preferences, but also do not try to prevent the user from developing novel interests.

A drawback of path-specific objectives is that they do not help address degenerative feedback loops, such as echo chambers and filter bubbles. To avoid these, path-specific objectives may be combined with some of the above techniques (though combining them with impact measures would bring back some of the bad incentives).

Further work may extend path-specific objectives to multiple time steps, and assess to what extent they reduce manipulation in practice. To make this assessment, we may first need a better understanding of human agency, so that we can measure improvements from less manipulative algorithms.


Conclusions

Reward hacking is one of the core challenges for building highly capable and safe AI agents. In this post, we have discussed how the misspecification problem and proposed solutions can be analysed with causal models.

Directions for further work include:

  • What decision theory do agents learn under which conditions, and are there ways to shape this, to avoid agents coordinating against the human? For language models, their decision theory will be partially shaped by a combination of their pre-training and fine-tuning.
  • Interpretability can help detect intentional deception and manipulation. These concepts depend on the agent's subjective causal model, i.e. the (often implicit) model the agent bases its decisions on. How can we combine behavioural experiments with mechanistic interpretability to infer an agent’s subjective causal model? The next post will say more about this.
  • How can we infer sufficiently accurate causal models, so that we can prevent preference manipulation with impact measures and path-specific objectives?
  • What are the relevant metrics to measure whether a technique is making progress on the deception and the manipulation problems? For deception, there are truthfulness benchmarks. For manipulation, the question is more subtle, and may involve querying meta-preferences, and/or intersect with a better understanding of human agency.
  • Extend path-specific objectives to multiple time steps, and implement them in more realistic environments.

In the next post, we will take a closer look at misgeneralisation, which can make agents behave badly and pursue the wrong goals, even if the rewards have been correctly specified. 

1 comment

One definition of manipulation is intentional and covert influence. Content recommenders can satisfy this definition, as they are typically trained to influence the user by any means, including “covert” ones like appealing to the user’s biases and emotions.

I don't think that "covert" is a coherent thing an (e.g.) content recommender could optimize against. For example, everything could appeal to the biases and emotions of the wrong person. Anything can be rude/triggering/bias-inducing to the right person. In which case, how do you classify what is covert and what isn't in a way that isn't entirely subjective and also isn't beholden to (arbitrary) social norms?

I still think it's possible to define manipulation ~objectively though, but in terms of infiltration across human Markov blankets.