Post 2 of Towards Causal Foundations of Safe AGI, see also Post 1 Introduction.

By Lewis Hammond, Tom Everitt, Jon Richens, Francis Rhys Ward, Ryan Carey, Sebastian Benthall, and James Fox, representing the Causal Incentives Working Group. Thanks also to Alexis Bellot, Toby Shevlane, and Aliya Ahmad.

Causal models are the foundations of our work. In this post, we provide a succinct but accessible explanation of causal models that can handle interventions, counterfactuals, and agents, which will be the building blocks of future posts in the sequence. Basic familiarity with (conditional) probabilities will be assumed.

What is causality?

What does it mean for the rain to cause the grass to become green? Causality is a philosophically intriguing topic that underlies many other concepts of human importance. In particular, many concepts relevant to safe AGI, like influence, response, agency, intent, fairness, harm, and manipulation, cannot be grasped without a causal model of the world, as we mentioned in the intro post and will discuss further in subsequent posts.

We follow Pearl and adopt an interventionist definition of causality: the sprinkler today causally influences the greenness of the grass tomorrow, because if someone intervened and turned on the sprinkler, then the greenness of the grass would be different. In contrast, making the grass green tomorrow has no effect on the sprinkler today (assuming no one predicts the intervention). So the sprinkler today causally influences the grass tomorrow, but not vice versa, as we would intuitively expect.


Causal Bayesian Networks (CBNs) represent causal dependencies between aspects of reality using a directed acyclic graph. An arrow from a variable A to a variable B means that A influences B under some fixed setting of the other variables. For example, we draw an arrow from sprinkler (S) to grass greenness (G):

A causal graph representing our running example. The sprinkler (S) influences the greenness of the grass (G).

For each node in the graph, a causal mechanism of how the node is influenced by its parents is specified with a conditional probability distribution. For the sprinkler, a distribution $P(S)$ specifies how commonly it is turned on, i.e. $P(S = \text{on})$. For the grass, a conditional distribution $P(G \mid S)$ specifies how likely it is that the grass becomes green when the sprinkler is on, i.e. $P(G = \text{green} \mid S = \text{on})$, and how likely it is that the grass becomes green when the sprinkler is off, i.e. $P(G = \text{green} \mid S = \text{off})$.

By multiplying the distributions together, we get a joint probability distribution $P(S, G) = P(S)\,P(G \mid S)$ that describes the likelihood of any combination of outcomes. Joint probability distributions are the foundation of standard probability theory, and can be used to answer questions such as "what is the likelihood that the sprinkler is on, given that I observe that the grass is green?"
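As a minimal sketch of this multiplication, the snippet below builds the joint distribution over sprinkler and grass from the two mechanisms and then answers an observational query by conditioning. The probabilities (0.3, 0.9, 0.2) and the helper names `joint` and `prob_s_given_g` are assumptions made up for illustration; they are not taken from the post.

```python
# Hypothetical mechanisms, chosen only for illustration.
P_S = {"on": 0.3, "off": 0.7}
P_G_GIVEN_S = {
    "on": {"green": 0.9, "not_green": 0.1},
    "off": {"green": 0.2, "not_green": 0.8},
}


def joint(p_s, p_g_given_s):
    """Return P(S, G) = P(S) * P(G | S) as a dict keyed by (s, g)."""
    return {
        (s, g): p_s[s] * p_g_given_s[s][g]
        for s in p_s
        for g in p_g_given_s[s]
    }


def prob_s_given_g(joint_sg, s, g):
    """Return P(S = s | G = g) by conditioning the joint distribution."""
    p_g = sum(p for (_, g2), p in joint_sg.items() if g2 == g)
    return joint_sg[(s, g)] / p_g


P_SG = joint(P_S, P_G_GIVEN_S)
posterior_on = prob_s_given_g(P_SG, "on", "green")
print(f"P(S=on | G=green) = {posterior_on:.3f}")
```

Note that this query only needs the joint distribution; the direction of the arrow plays no role until we ask about interventions.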

An intervention on a system changes one or more causal mechanisms. For example, an intervention that turns the sprinkler on corresponds to replacing the causal mechanism $P(S)$ for the sprinkler with a new mechanism $P'(S)$, where $P'(S = \text{on}) = 1$, that always has the sprinkler on. The effects of the intervention can be computed from the updated joint distribution $P(S, G \mid do(S = \text{on})) = P'(S)\,P(G \mid S)$, where $do(S = \text{on})$ denotes the intervention.

Note that it would not be possible to compute the effect of the intervention from just the joint probability distribution $P(S, G)$, as without the causal graph, there'd be no way to tell whether a mechanism should be changed in the factorisation $P(S)\,P(G \mid S)$ or in $P(G)\,P(S \mid G)$.
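The point about factorisations can be checked numerically. The sketch below, using made-up probabilities and hypothetical helper names, computes the effect of forcing the sprinkler on under the graph with an arrow from sprinkler to grass, and then under the reversed graph. Both graphs are compatible with the same observational joint distribution, yet they disagree about the intervention.

```python
# Hypothetical mechanisms, chosen only for illustration.
P_S = {"on": 0.3, "off": 0.7}
P_G_GIVEN_S = {
    "on": {"green": 0.9, "not_green": 0.1},
    "off": {"green": 0.2, "not_green": 0.8},
}


def joint(p_s, p_g_given_s):
    """Return the joint over (s, g) obtained by multiplying the mechanisms."""
    return {
        (s, g): p_s[s] * p_g_given_s[s][g]
        for s in p_s
        for g in p_g_given_s[s]
    }


def marginal_g(joint_sg, g):
    """Return P(G = g) by summing out the sprinkler."""
    return sum(p for (_, g2), p in joint_sg.items() if g2 == g)


def point_mass(value):
    """Mechanism that puts all probability on one sprinkler setting."""
    return {s: 1.0 if s == value else 0.0 for s in P_S}


observational = joint(P_S, P_G_GIVEN_S)

# Graph S -> G: the intervention replaces P(S) and keeps P(G | S).
interventional = joint(point_mass("on"), P_G_GIVEN_S)
p_green_obs = marginal_g(observational, "green")
p_green_do = marginal_g(interventional, "green")

# Graph G -> S (same joint, factorised as P(G) P(S | G)): the intervention
# replaces P(S | G) and keeps P(G), so the grass is unaffected.
p_green_do_reversed = marginal_g(observational, "green")

print(p_green_obs, p_green_do, p_green_do_reversed)
```

With these numbers, the first graph predicts that turning the sprinkler on raises the chance of green grass, while the reversed graph predicts no change at all.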

Ultimately, all statistical correlations are due to causal influences. Hence, for a set of variables there is always some CBN that represents the underlying causal structure of the data generating process, though extra variables may be needed to explain e.g. unmeasured confounders.


Suppose that the sprinkler is on and the grass is green. Would the grass have been green had the sprinkler not been on? Questions about counterfactuals like these are harder than questions about interventions, because they involve reasoning across multiple worlds. Counterfactuals are key to defining e.g. harm, intent, fairness, and impact measures, as they all depend on comparing outcomes across hypothetical worlds.

To handle such reasoning, structural causal models (SCMs) refine CBNs in three important ways. First, background context that is shared across hypothetical worlds is explicitly separated from variables that can be intervened on and that vary across the worlds. The former are called exogenous variables, and the latter endogenous. For our question, it will be useful to introduce an exogenous variable R for whether it rains or not. The sprinkler and the grass are endogenous variables.

The relationship between hypothetical worlds can be represented with a twin-graph, where there are two copies of the endogenous variables for actual and hypothetical worlds, and the exogenous variable(s) provide shared context:

A twin graph to answer whether the sprinkler is the reason the grass is green. Nodes in the hypothetical world are dashed. The right-hand sprinkler node is intervened on with $do(S = \text{off})$, representing the hypothetical. The grey exogenous rain node R provides shared context.

Second, SCMs introduce notation to distinguish endogenous variables in different hypothetical worlds. For example, $G_{S = \text{off}}$ denotes grass greenness in the hypothetical world where the sprinkler is off. It can be read as shorthand for “$G \mid do(S = \text{off})$”, and has the benefit that it can occur in expressions involving variables from other worlds. For example, our question can be expressed as $P(G_{S = \text{off}} = \text{green} \mid S = \text{on}, G = \text{green})$.

Third, SCMs require all endogenous variables to have deterministic causal mechanisms. This is satisfied in our case if we assume that the sprinkler is on whenever it’s not raining, and the grass becomes green (only) if it rains or the sprinkler is on.

The determinism means that conditioning is as simple as updating the distribution over exogenous variables, e.g. $P(R)$ updates to $P(R \mid S = \text{on}, G = \text{green})$. In our case, the probability of rain decreases from its prior value $P(R = \text{rain})$ to $0$, since the sprinkler is never on if it's raining.

This means our question is answered by the following reasoning steps:

  • Abduction: update $P(R)$ to $P(R \mid S = \text{on}, G = \text{green})$
  • Intervention: intervene to turn the sprinkler off, $do(S = \text{off})$
  • Prediction: compute the value of G in the updated model.

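Equivalently, in one formula:

$$P(G_{S = \text{off}} = \text{green} \mid S = \text{on}, G = \text{green}) = \sum_{r} P(R = r \mid S = \text{on}, G = \text{green})\, P(G = \text{green} \mid do(S = \text{off}), R = r) = 0$$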

That is, we can say that the grass would not have been green if the sprinkler had been off (under the assumption we’ve made about the specific relationships).
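The abduction, intervention, and prediction steps can be sketched in a few lines of code, using the deterministic mechanisms assumed in the text (the sprinkler is on exactly when it is not raining, and the grass is green if it rains or the sprinkler is on). The prior probability of rain `P_RAIN` and the function names are assumptions introduced here for illustration only.

```python
P_RAIN = 0.3  # hypothetical prior probability of rain


def sprinkler(rain):
    """Deterministic mechanism: the sprinkler is on whenever it is not raining."""
    return not rain


def grass(rain, sprinkler_on):
    """Deterministic mechanism: green if it rains or the sprinkler is on."""
    return rain or sprinkler_on


def abduction(prior_rain, observed_s, observed_g):
    """Step 1: update the distribution over the exogenous variable R."""
    prior = {True: prior_rain, False: 1.0 - prior_rain}
    weights = {}
    for rain, p in prior.items():
        s = sprinkler(rain)
        consistent = s == observed_s and grass(rain, s) == observed_g
        weights[rain] = p if consistent else 0.0
    total = sum(weights.values())
    return {rain: w / total for rain, w in weights.items()}


def counterfactual_green(posterior, forced_sprinkler):
    """Steps 2 and 3: force the sprinkler setting, then predict G."""
    return sum(
        p for rain, p in posterior.items() if grass(rain, forced_sprinkler)
    )


# Evidence: the sprinkler is on and the grass is green.
posterior = abduction(P_RAIN, observed_s=True, observed_g=True)
p_cf = counterfactual_green(posterior, forced_sprinkler=False)
print(posterior, p_cf)
```

Because the observation that the sprinkler is on rules out rain, the posterior puts all its mass on "no rain", and the counterfactual probability of green grass with the sprinkler off comes out as zero regardless of the prior chosen.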

SCMs are strictly more powerful than CBNs. Their primary drawback is that they require deterministic relationships between endogenous variables, which are often hard to determine in practice. They're also limited to non-backtracking counterfactuals, where hypothetical worlds are distinguished by interventions.

One agent

To infer the intentions or incentives of Mr Jones, who decides whether to turn on the sprinkler, or to predict how his behaviour would adapt to changes in his model of the world, we need a causal influence diagram (CID) that labels variables as chance, decision, or utility nodes. In our example, rain would be a chance node, the sprinkler a decision, and grass greenness a utility. Since rain is a parent of the sprinkler, Mr Jones observes it before making his decision. Graphically, chance nodes are rounded as before, decisions are rectangles, utilities are diamonds, and dashed edges denote observations:

A CID representing our running example. The sprinkler is a decision optimising grass greenness.

The agent specifies causal mechanisms for its decisions, i.e. a policy, with the goal of maximising the sum of its utility nodes. In our example, an optimal policy would be to turn the sprinkler on when it's not raining (the decision when it is raining doesn’t matter). Once a policy is specified, the CID defines a CBN.
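To make the notion of an optimal policy concrete, the sketch below enumerates every deterministic policy (a map from the rain observation to a sprinkler setting) and scores each by expected utility. It assumes the same deterministic grass mechanism as in the counterfactual section, a utility of 1 for green grass and 0 otherwise, and a made-up rain probability; all names are hypothetical.

```python
from itertools import product

P_RAIN = 0.3  # hypothetical probability of rain


def grass_green(rain, sprinkler_on):
    """Assumed mechanism: green if it rains or the sprinkler is on."""
    return rain or sprinkler_on


def expected_utility(policy):
    """Expected greenness when `policy[rain]` gives the sprinkler setting."""
    total = 0.0
    for rain, p in ((True, P_RAIN), (False, 1.0 - P_RAIN)):
        utility = 1.0 if grass_green(rain, policy[rain]) else 0.0
        total += p * utility
    return total


# All four deterministic policies: one sprinkler choice per observation.
policies = [
    {True: when_rain, False: when_dry}
    for when_rain, when_dry in product([True, False], repeat=2)
]

best = max(expected_utility(pi) for pi in policies)
optimal = [pi for pi in policies if abs(expected_utility(pi) - best) < 1e-12]

for pi in optimal:
    print("sprinkler on when raining:", pi[True], "| on when dry:", pi[False])
```

Two policies tie for the optimum: both turn the sprinkler on when it is dry, and they differ only in what they do when it rains, matching the remark that the decision in the rain does not matter.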

In models with agents, there are two kinds of interventions, depending on whether agents get to adapt their policy to the intervention or not. For example, Mr Jones could pick a different sprinkler policy only if we informed him about an intervention on the grass before he made his sprinkler decision. Both pre-policy and post-policy interventions can be handled with the standard do-operator if we add so-called mechanism nodes to the model. More about these in the next post.

Multiple agents

Interaction between multiple agents can be modelled with causal games, in which each agent has a set of decision and utility variables.

To illustrate this, assume Mr Jones sometimes sows new grass. Birds like to eat the seeds, but cannot tell from afar whether there are any. They can only see whether Mr Jones is using the sprinkler, which is more likely when the grass is new. Mr Jones wants to water his lawn if it's new, but does not want the birds to eat his seeds. This signalling game has the following structure:

A causal game representing an extension of our running example. Different colours denote different agents’ decision and utility nodes. The missing link from the new seeds (N) to the bird (B) means the bird cannot see whether new seeds are present.

Beyond modelling causality better, causal games also have some other advantages over standard extensive-form games (EFGs). For example, the causal game immediately shows that the birds are indifferent to whether Mr Jones waters the grass or not, because the only directed path from the sprinkler S to food F goes via the birds' own decision B. In an EFG, this information would be hidden in the payoffs. By explicitly representing independencies, causal games can sometimes find more subgames and rule out more non-credible threats than EFGs. A causal game can always be converted to an EFG.
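The graphical argument about the birds can be sketched as a path check. The edge list below is an assumption reconstructed from the description of the game (the figure is not reproduced here), and `U_J` is a hypothetical name for Mr Jones' utility node. The code enumerates all directed paths from the sprinkler S to the food F and checks that each passes through the birds' decision B.

```python
# Assumed edges of the causal game; U_J is a hypothetical label for
# Mr Jones' utility node. Note there is no edge from N to B.
EDGES = {
    "N": ["S", "F", "U_J"],  # new seeds
    "S": ["B", "U_J"],       # sprinkler (Mr Jones' decision)
    "B": ["F", "U_J"],       # birds' decision
    "F": [],                 # food (birds' utility)
    "U_J": [],
}


def directed_paths(graph, start, goal, prefix=None):
    """Return every directed path from `start` to `goal` in an acyclic graph."""
    path = (prefix or []) + [start]
    if start == goal:
        return [path]
    paths = []
    for child in graph[start]:
        paths.extend(directed_paths(graph, child, goal, path))
    return paths


paths_s_to_f = directed_paths(EDGES, "S", "F")
all_via_b = all("B" in path for path in paths_s_to_f)

print(paths_s_to_f, all_via_b)
```

Under these assumed edges, the sprinkler can only affect the birds' food through the birds' own choice, which is the structural fact the text reads off the graph.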

Analogously to the distinction between joint probability distributions, CBNs, and SCMs, there are (multi-agent) influence diagrams that include agents in graphs that need not be causal, and structural causal influence models and structural causal games that combine agents with exogenous nodes and determinism to answer counterfactual questions.


This post introduced models that can answer correlational, interventional and counterfactual questions, and that can handle zero, one, or many agents. All in all, there are nine possible kinds of models. For more comprehensive introductions to causal models, see Section 2 of Reasoning about causality in games, and the book Causal Inference in Statistics: A Primer by Pearl, Glymour, and Jewell.

A taxonomy of causal models and their acronyms. The vertical axis positions models in the causal hierarchy (associational, interventional, or counterfactual), while the horizontal axis specifies the number of agents (0, 1, or n).

Next post. CIDs and causal games are used to model agent(s). But, what is an agent? In the next post, we take a deeper look at what agents are by looking at some characteristics shared by all agentic systems.

5 comments

I think there's something big left out of this post, which is accounting for the agent observing and judging the causal relationships. Something has to decide how to carve up the world into parts and calculate counterfactuals. It's something that exists implicitly in your approach to causality but you don't address it here, which I think is unfortunate because although humans generally have the same frame of reference for judging causality, alien minds, like AI, may not.

The way I think about this, is that the variables constitute a reference frame. They define particular well-defined measurements that can be done, which all observers would agree about. In order to talk about interventions, there must also be a well-defined "set" operation associated with each variable, so that the effect of interventions is well-defined.

Once we have the variables, and a "set" and "get" operation for each (i.e. intervene and observe operations), then causality is an objective property of the universe. Regardless who does the experiment (i.e. sets a few variables) and does the measurement (i.e. observes some variables), the outcome will follow the same distribution.

So in short, I don't think we need to talk about an agent observer beyond what we already say about the variables.

Yes, the variables constitute a reference frame, which is to say an ultimately subjective way of viewing the world. Even if there is high inter-observer agreement about the shape of the reference frame, it's not guaranteed unless you also posit something like Wentworth's natural abstraction hypothesis to be true.

Perhaps a toy example will help explain my point. Suppose the grass should only be watered when there's a violet cube on the lawn. To automate this a sensor is attached to the sprinklers that turns them on only when the sensor sees a violet cube. I place a violet cube on the lawn to make sure the lawn is watered. I return a week later and find the grass is dead.

What happened? The cube was actually painted with a fine mix of red and blue paint. My eyes interpreted the purple as violet, but the sensor did not.

Conversely, if it had been my job, rather than the sensor's, to turn on the sprinklers, I would have been fooled by the purple cube into turning them on.

It's perhaps tempting to say this doesn't count because I'm now part of the system, but that's also kind of the point. I, an observer of this system trying to understand its causality, am also embedded within the system (even if I think I can isolate it for demonstration purposes, I can't do this in reality, especially when AI are involved and will reward hack by doing things that were supposed to be "outside" the system). So my subjective experience not only matters to how causality is reckoned, but also how the physical reality being mapped by causality plays out.

Sure, I think we're saying the same thing: causality is frame dependent, and the variables define the frame (in your example, you and the sensor have different measurement procedures for detecting the purple cube, so you don't actually talk about the same random variable).

How big a problem is it? In practice it seems usually fine, if we're careful to test our sensor / double check we're using language in the same way. In theory, scaled up to super intelligence, it's not impossible it would be a problem.

But I would also like to emphasize that the problem you're pointing to isn't restricted to causality, it goes for all kinds of linguistic reference. So to the extent we like to talk about AI systems doing things at all, causality is no worse than natural language, or other formal languages.

I think people sometimes hold it to a higher bar than natural language, because it feels like a formal language could somehow naturally intersect with a programmed AI. But of course causality doesn't solve the reference problem in general. Partly for this reason, we're mostly using causality as a descriptive language to talk clearly and precisely (relative to human terms) about AI systems and their properties.

Fair. For what it's worth I strongly agree that causality is just one domain where this problem becomes apparent, and we should be worried about it generally for super intelligent agents, much more so than I think many folks seem (in my estimation) to worry about it today.