Work done with Ramana Kumar, Sebastian Farquhar (Oxford), Jonathan Richens, Matt MacDermott (Imperial) and Tom Everitt.
Our DeepMind Alignment team researches ways to avoid AGI systems that knowingly act against the wishes of their designers. We’re particularly concerned about agents which may be pursuing a goal that is not what their designers want.
These types of safety concerns motivate developing a formal theory of agents, both to improve our understanding of their properties and to avoid designs that pose a safety risk. Causal influence diagrams (CIDs) aim to provide a unified theory of how design decisions create incentives that shape agent behaviour, helping to illuminate potential risks before an agent is trained and to inspire better agent designs with more appealing alignment properties.
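Our new paper, Discovering Agents, introduces new ways of tackling these issues, including a formal causal definition of agents, algorithms for discovering agents from interventional data, a translation between mechanised causal graphs and game graphs, and a resolution of earlier confusions arising from incorrect causal modelling of agents.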
Combined, these results provide an extra layer of assurance that a modelling mistake hasn’t been made, which means that CIDs can be used to analyse an agent’s incentives and safety properties with greater confidence.
To help illustrate our method, consider the following example consisting of a world containing three squares, with a mouse starting in the middle square choosing to go left or right, getting to its next position and then potentially getting some cheese. The floor is icy, so the mouse might slip. Sometimes the cheese is on the right, but sometimes on the left.
This can be represented by the following CID:
The intuition that the mouse would choose a different behaviour for different environment settings (iciness, cheese distribution) can be captured by a mechanised causal graph (a variant of mechanised causal game graph), which for each (object-level) variable, also includes a mechanism variable that governs how the variable depends on its parents. Crucially, we allow for links between mechanism variables.
This graph contains additional mechanism nodes in black, representing the mouse's policy and the iciness and cheese distribution.
Edges between mechanisms represent direct causal influence. The blue edges are special terminal edges – roughly, mechanism edges ~A → ~B that would still be there, even if the object-level variable A was altered so that it had no outgoing edges.
In the example above, since U has no children, its mechanism edge must be terminal. But the mechanism edge ~X → ~D is not terminal, because if we cut X off from its child U then the mouse will no longer adapt its decision (because its position won’t affect whether it gets the cheese).
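To make the bookkeeping concrete, here is a minimal Python sketch of the mouse example's mechanised graph. The names `object_edges`, `mechanism_edges`, `children` and `trivially_terminal` are our own illustrative choices, not notation from the paper, and the helper only encodes the sufficient condition used above for U (a variable with no children): it does not decide terminality in general.

```python
# Object-level edges of the mouse example: decision D -> position X -> utility U.
object_edges = {("D", "X"), ("X", "U")}

# Mechanism edges mentioned in the text, written (source, target) for ~source -> ~target.
mechanism_edges = {("X", "D"), ("U", "D")}


def children(node, edges):
    """Return the object-level children of `node`."""
    return {dst for src, dst in edges if src == node}


def trivially_terminal(mech_edge, edges):
    """Sufficient condition only (an illustrative simplification):
    ~A -> ~B must be terminal when A already has no outgoing edges,
    because removing A's outgoing edges then changes nothing."""
    source, _target = mech_edge
    return not children(source, edges)


for edge in sorted(mechanism_edges):
    print(edge, "trivially terminal:", trivially_terminal(edge, object_edges))
```

A `False` result here only means the shortcut does not apply. Establishing that ~X → ~D is really non-terminal requires the interventional argument given in the text.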
We build on Dennett’s intentional stance – that agents are systems whose outputs are moved by reasons. The reason that an agent chooses a particular action is that it expects it to lead to a certain desirable outcome. Such systems would act differently if they knew that the world worked differently, which suggests the following informal characterisation of agents:
Agents are systems that would adapt their policy if their actions influenced the world in a different way.
The mouse in the example above is an agent because it will adapt its policy if it knows that the ice has become more slippery, or if the cheese is more likely on the left. In contrast, the output of non-agentic systems might accidentally be optimal for producing a certain outcome, but these do not typically adapt. For example, a rock that is accidentally optimal for reducing water flow through a pipe would not adapt its size if the pipe was wider.
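The contrast between the mouse and the rock can be sketched in a few lines of Python. This is a toy model under our own assumptions, not the paper's formalisation: we assume that slipping sends the mouse to the square opposite the one it aimed for, that the mouse gets utility 1 exactly when it ends up on the cheese, and the function names are hypothetical.

```python
def expected_cheese(direction, slip_prob, cheese_right_prob):
    """Probability of reaching the cheese when heading in `direction`.

    Assumption: with probability `slip_prob` the mouse ends up on the
    opposite side from the one it chose.
    """
    p_end_right = 1 - slip_prob if direction == "right" else slip_prob
    return (p_end_right * cheese_right_prob
            + (1 - p_end_right) * (1 - cheese_right_prob))


def mouse_policy(slip_prob, cheese_right_prob):
    """An agent: re-optimises whenever the mechanisms change."""
    return max(("left", "right"),
               key=lambda d: expected_cheese(d, slip_prob, cheese_right_prob))


def rock_policy(slip_prob, cheese_right_prob):
    """A non-agent: its output ignores how the world works."""
    return "right"


# Vary the mechanisms (iciness, cheese distribution) and compare responses.
for slip, cheese_right in [(0.1, 0.9), (0.1, 0.1), (0.9, 0.9)]:
    print(slip, cheese_right,
          mouse_policy(slip, cheese_right), rock_policy(slip, cheese_right))
```

In the first setting both systems go right, so the rock looks optimal by accident. Only the mouse changes its output when the cheese distribution or the iciness changes, which is the adaptation our characterisation asks for.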
This characterisation of agency may be read as an alternative to, or an elaboration of, the intentional stance (depending on how you interpret it) couched in the language of causality and counterfactuals. See our paper for comparisons of our notion of agents with other characterisations of agents, including Cybernetics, Optimising Systems, Goal-directed systems, time travel, and compression.
Our formal definition of agency is given in terms of causal discovery, discussed in the next section.
Causal discovery infers a causal graph from experiments involving interventions. In particular, one can discover an arrow from a variable A to a variable B by experimentally intervening on A and checking if B responds, even if all other variables are held fixed.
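As a rough illustration of this principle (not the algorithm from our paper), the Python sketch below recovers the direct edges of a small deterministic toy system. The system `MECHANISMS`, the ordering `ORDER`, and the functions `run` and `discover_edges` are all hypothetical names chosen for the example, and the variables are assumed to be binary and listed in a valid causal order.

```python
from itertools import product

# Hypothetical toy system with causal chain A -> B -> C.
MECHANISMS = {
    "A": lambda v: 0,           # A has no causes
    "B": lambda v: v["A"],      # B copies A
    "C": lambda v: 1 - v["B"],  # C negates B
}
ORDER = ["A", "B", "C"]  # assumed topological order


def run(interventions):
    """Evaluate the system, overriding intervened variables."""
    values = {}
    for name in ORDER:
        if name in interventions:
            values[name] = interventions[name]
        else:
            values[name] = MECHANISMS[name](values)
    return values


def discover_edges(variables, domain=(0, 1)):
    """Add cause -> effect if intervening on `cause` changes `effect`
    for some setting in which all other variables are held fixed."""
    edges = set()
    for cause, effect in product(variables, repeat=2):
        if cause == effect:
            continue
        others = [v for v in variables if v not in (cause, effect)]
        for fixed in product(domain, repeat=len(others)):
            held = dict(zip(others, fixed))
            outcomes = {run({**held, cause: x})[effect] for x in domain}
            if len(outcomes) > 1:  # effect responded to the intervention
                edges.add((cause, effect))
                break
    return edges


print(sorted(discover_edges(ORDER)))
```

Because B is held fixed while A is varied, the procedure finds no direct edge from A to C, even though A influences C indirectly. This is the sense in which holding other variables fixed isolates direct causal influence.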
Our first algorithm uses this causal discovery principle to discover the mechanised causal graph, given the interventional distributions (which can be obtained from experimental interventional data). The image below visualises the inputs and outputs of the algorithm; see our paper for the full details.
Our second algorithm transforms this mechanised causal graph to a game graph:
It works by assigning utilities to nodes with outgoing blue terminal edges on their mechanisms and decisions to nodes with incoming blue terminal edges on their mechanisms. The mechanism connections reveal which decisions and utilities belong to the same agent, and are used to determine node colours in multi-agent CIDs.
Our third algorithm transforms the game graph into a mechanised causal graph, to establish an equivalence between the different representations. The equivalence only holds under some additional assumptions, as the mechanised causal graph can contain more information than the game graph in some cases.
In the paper we prove theorems concerning the correctness of these algorithms.
In this example, we have an Actor-Critic RL setup for a one-step MDP. The underlying system has the following game graph.
Here an actor selects action A as advised by a critic. The critic’s action Q states the expected reward for each action (in the form of a vector with one element for each possible choice of A; this is often called a Q-value function). The action A influences the state S, which in turn determines the reward R. We model the actor as just wanting to follow the advice of the critic, so its utility is Y=Q(A) (the A-th element of the Q-vector). The critic wants its advice Y to match the actual reward R. Formally, it optimises W=−(R−Y)².
Algorithm 1 produces the following mechanised causal graph:
Let’s focus on a few key edges: (~S,~Q) is present, but (~S,~A) is not, i.e. the critic cares about the state mechanism but the actor does not. The critic cares because it is optimising W which is causally downstream of S, and so the optimal decision rule for Q will depend on the mechanism of S even when other mechanisms are held constant. The dependence disappears if R is cut off from S, so the edge (~S,~Q) is not terminal. In contrast, the actor doesn’t care about the mechanism of S, because Y is not downstream of S, so when holding all other mechanisms fixed, varying ~S won't affect the optimal decision rule for A. There is however an indirect effect of the mechanism for S on the decision rule for A, which is mediated through the decision rule for Q.
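The difference between the direct and indirect dependence can be sketched numerically. The Python below is a toy illustration under our own assumptions: a two-action, two-state system, a tabular state mechanism `state_mechanism[a][s]` giving P(S=s | A=a), and a reward table `reward_mechanism[s]`. The function names are hypothetical. It uses the standard fact that the conditional mean E[R | A=a] minimises expected squared error, so it is an optimal Q-vector for a critic maximising W.

```python
def critic_optimal_q(state_mechanism, reward_mechanism):
    """Optimal critic advice: Q[a] = E[R | A=a].

    This depends directly on the state mechanism, matching the
    presence of the edge (~S, ~Q).
    """
    return [
        sum(p * reward_mechanism[s] for s, p in enumerate(dist))
        for dist in state_mechanism
    ]


def actor_optimal_rule(q_values):
    """Optimal actor rule for utility Y = Q[A]: pick the largest entry.

    Note the signature: the state mechanism is not an input, matching
    the absence of the edge (~S, ~A).
    """
    return max(range(len(q_values)), key=lambda a: q_values[a])


reward_mechanism = [0.0, 1.0]                 # reward 1 only in state 1
mechanism_1 = [[0.9, 0.1], [0.2, 0.8]]        # action 1 usually reaches state 1
mechanism_2 = [[0.2, 0.8], [0.9, 0.1]]        # action 0 usually reaches state 1

q_1 = critic_optimal_q(mechanism_1, reward_mechanism)
q_2 = critic_optimal_q(mechanism_2, reward_mechanism)

print(actor_optimal_rule(q_1))  # action chosen under mechanism_1
print(actor_optimal_rule(q_2))  # action chosen under mechanism_2
```

Changing the state mechanism changes the critic's optimal Q-vector, and the actor's chosen action changes only because Q changed. With Q held fixed, the actor's rule would return the same action under either state mechanism.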
Our Algorithm 2 applied to the mechanised causal graph produces the correct game graph by identifying that ~A and ~Q have incoming terminal edges, and therefore are decisions; that Y's mechanism has an outgoing terminal edge to A's mechanism and so is its utility; and that W's mechanism has an outgoing terminal edge to the mechanism for Q, and so is its utility. The decisions and utilities get coloured differently due to their terminal edge subgraphs being disconnected.
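The labelling step just described can be sketched as follows. This is a simplified Python illustration rather than the full Algorithm 2 from the paper: it assumes the terminal mechanism edges have already been identified and takes them as input, and the function name `game_graph_roles` is hypothetical. Agents are grouped as connected components of the terminal-edge subgraph.

```python
def game_graph_roles(terminal_edges):
    """Label nodes from terminal mechanism edges.

    `terminal_edges` is a set of (source, target) pairs, each standing
    for a terminal mechanism edge ~source -> ~target.
    """
    utilities = {src for src, _ in terminal_edges}   # outgoing terminal edge
    decisions = {dst for _, dst in terminal_edges}   # incoming terminal edge

    # Union-find over the terminal-edge subgraph, ignoring direction.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for src, dst in terminal_edges:
        parent[find(src)] = find(dst)

    # Each connected component corresponds to one agent (one colour).
    components = {}
    for node in utilities | decisions:
        components.setdefault(find(node), set()).add(node)
    agents = sorted(components.values(), key=sorted)
    return decisions, utilities, agents


# Terminal edges of the actor-critic example: ~Y -> ~A and ~W -> ~Q.
actor_critic_terminal_edges = {("Y", "A"), ("W", "Q")}
decisions, utilities, agents = game_graph_roles(actor_critic_terminal_edges)
print(decisions, utilities, agents)
```

On the actor-critic edges this yields two separate components, one containing A and Y and one containing Q and W, which is why the two decision-utility pairs receive different colours.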
This can help avoid modelling mistakes and incorrect inference of agent incentives. In particular, Paul Christiano (private communication, 2019) has questioned the reliability of incentive analysis from CIDs, because an apparently reasonable way of modelling the actor-critic system would be as follows, where the actor is not modelled as an agent:
Doing incentive analysis on this single-agent diagram would lead to the assertion that the system is not trying to influence the state S or the reward R, because they don't lie on the directed path Q→W (i.e. neither S nor R has an instrumental control incentive). This would be incorrect, as the system is trying to influence both these variables (in an intuitive and practical sense).
Thanks to our algorithms, we can now crisply explain why this is an incorrect model of the system, since it’s not modelling A as a decision, and Y as its utility. This modelling mistake would be avoided by applying Algorithms 1 and 2 to the underlying system, which produce the correct diagram (with A as decision, Y as its utility). The correct diagram has two agents, and it's not possible to apply the single-agent instrumental control incentive. Instead, an incentive concept suitable for multi-agent systems would need to be developed. A key criterion for such a fruitful multi-agent incentives concept is that it captures the influence on S and R jointly exerted by A and Q.
Whether a variable is a decision, utility or a chance variable is relative to the overall choice of variables. This choice represents a frame in which to model that system, and the notions of decision and utility make sense only with reference to this frame. See Appendix C in our paper for some examples of this relativism.
Our work suggests some modelling advice for the practitioner, mostly that it is good practice to clarify whether a variable is object-level, or a mechanism; and that it’s best to distinguish when a variable is a utility, or is merely instrumental for some downstream utility.
We proposed the first formal causal definition of agents. Grounded in causal discovery, our key insight is that agents are systems that adapt their behaviour in response to changes in how their actions influence the world. Indeed, our Algorithms 1 and 2 describe a precise experimental process that can be done to assess whether something is an agent. Our process is largely consistent with previous informal characterisations of agents, but making it formal makes it more precise and enables agents to be identified empirically.
As illustrated with an example above, our work improves the reliability of methods building on causal models of AI systems, such as analyses of the safety and fairness of machine learning algorithms (the paper contains additional examples).
Overall we've found that causality is a useful framework for discovering whether there is an agent in a system – a key concern for assessing risks from AGI.
Excited to learn more? Check out our paper. Feedback and comments are most welcome.
The idea that "Agents are systems that would adapt their policy if their actions influenced the world in a different way." works well on mechanised CIDs whose variables are neatly divided into object-level and mechanism nodes: we simply check for a path from a utility function F_U to a policy Pi_D. But to apply this to a physical system, we would need a way to obtain such a partition of those variables. Specifically, we need to know (1) what counts as a policy, and (2) whether any of its antecedents count as representations of "influence" on the world (and after all, antecedents A of the policy can only be 'representations' of the influence, because in the real world, the agent's actions cannot influence themselves by some D->A->Pi->D loop). Does a spinal reflex count as a policy? Does an ant's decision to fight come from a representation of a desire to save its queen? How accurate does its belief about the forthcoming battle have to be before this representation counts? I'm not sure the paper answers these questions formally, nor am I sure that it's even possible to do so. These questions don't seem to have objectively right or wrong answers.
So we don't really have any full procedure for "identifying agents". I do think we gain some conceptual clarity. But on my reading, this clear definition serves to crystallise how hard it is to identify agents, more so than it shows practically how it can be done.
(NB. I read this paper months ago, so apologies if I've got any of the details wrong.)
The idea ... works well on mechanised CIDs whose variables are neatly divided into object-level and mechanism nodes. ... But to apply this to a physical system, we would need a way to obtain such a partition of those variables.
Agree, the formalism relies on a division of variables. One thing that I think we should perhaps have highlighted much more is Appendix B in the paper, which shows how you get a natural partition of the variables from just knowing the object-level variables of a repeated game.
Does a spinal reflex count as a policy?
A spinal reflex would be different if humans had evolved in a different world. So it reflects an agentic decision by evolution. In this sense, it is similar to the thermostat, which inherits its agency from the humans that designed it.
Does an ant's decision to fight come from a representation of a desire to save its queen?
Same as above.
How accurate does its belief about the forthcoming battle have to be before this representation counts?
One thing that I'm excited to think further about is what we might call "proper agents", which are agentic in themselves, rather than just inheriting their agency from the evolution / design / training process that made them. I think this is what you're pointing at with the ant's knowledge. Likely it wouldn't quite be a proper agent (but a human would, as we are able to adapt without re-evolving in a new environment). I have some half-developed thoughts on this.
How computationally expensive is this to implement? When I looked into CIDs a couple of years ago, figuring out a causal graph for an agent/environment was quite costly, which would make adoption harder.
I haven't considered this in great detail, but if there are N variables, then I think the causal discovery runtime is O(N²). As we mention in the paper (footnote 5) there may be more efficient causal discovery algorithms that make use of certain assumptions about the system.
On adoption, perhaps if one encounters a situation where the computational cost is too high, one could coarse-grain their variables to reduce the number of variables. I don't have results on this at the moment but I expect that the presence of agency (none, or some) is robust to the coarse-graining, though the exact number of agents is not (example 4.3), nor are the variables identified as decisions/utilities (Appendix C).
The way I see it, the primary value of this work (as well as other CID work) is conceptual clarification. Causality is a really fundamental concept, which many other AI-safety relevant concepts build on (influence, response, incentives, agency, ...). The primary aim is to clarify the relationships between concepts and to derive relevant implications. Whether there are practical causal inference algorithms or not is almost irrelevant.
TLDR: Causality > Causal inference :)