Tom Everitt

Research Scientist at DeepMind


Towards Causal Foundations of Safe AGI

Wiki Contributions


The main thing this proposal is intended to do is to get past the barriers MIRI found in their old work on the shutdown problem. In particular, in a toy problem basically-identical to the one MIRI used, we want an agent which:

  • Does not want to manipulate the shutdown button
  • Does respond to the shutdown button
  • Does want to make any child-agents it creates responsive-but-not-manipulative to the shutdown button, recursively (i.e. including children-of-children etc)

If I understand correctly, this is roughly the combination of features which MIRI had the most trouble achieving simultaneously.


From a quick read, your proposal seems closely related to Jessica Taylor's causal-counterfactual utility indifference. Ryan Carey and I also recently had a paper formalising some similar ideas, with some further literature review

I really like this articulation of the problem!

To me, a way to point to something similar is to say that preservation (and enhancement) of human agency is important (value change being one important way that human agency can be reduced).

One thing I've been trying to argue for is that we might try to pivot agent foundations research to focus more on human agency instead of artificial agency. For example, I think value change is an example of self-modification, which has been studied a fair bit for artificial agents.

Sure, I think we're saying the same thing: causality is frame dependent, and the variables define the frame (in your example, you and the sensor have different measurement procedures for detecting the purple cube, so you don't actually talk about the same random variable).

How big a problem is it? In practice it seems usually fine, if we're careful to test our sensor / double check we're using language in the same way. In theory, scaled up to super intelligence, it's not impossible it would be a problem.

But I would also like to emphasize that the problem you're pointing to isn't restricted to causality, it goes for all kinds of linguistic reference. So to the extent we like to talk about AI systems doing things at all, causality is no worse than natural language, or other formal languages.

I think people sometimes hold it to a higher bar than natural language, because it feels like a formal language could somehow naturally intersect with a programmed AI. But of course causality doesn't solve the reference problem in general. Partly for this reason, we're mostly using causality as a descriptive language to talk clearly and precisely (relative to human terms) about AI systems and their properties.

The way I think about this, is that the variables constitute a reference frame. They define particular well-defined measurements that can be done, which all observers would agree about. In order to talk about interventions, there must also be a well-defined "set" operation associated with each variable, so that the effect of interventions is well-defined.

Once we have the variables, and a "set" and "get" operation for each (i.e. intervene and observe operations), then causality is an objective property of the universe. Regardless who does the experiment (i.e. sets a few variables) and does the measurement (i.e. observes some variables), the outcome will follow the same distribution.

So in short, I don't think we need to talk about an agent observer beyond what we already say about the variables.

The idea ... works well on mechanised CIDs whose variables are neatly divided into object-level and mechanism nodes. ... But to apply this to a physical system, we would need a way to obtain such a partition those variables

Agree, the formalism relies on a division of variable. One thing that I think we should perhaps have highlighted much more is Appendix B in the paper, which shows how you get a natural partition of the variables from just knowing the object-level variables of a repeated game.

Does a spinal reflex count as a policy?

A spinal reflex would be different if humans had evolved in a different world. So it reflects an agentic decision by evolution. In this sense, it is similar to the thermostat, which inherits its agency from the humans that designed it.

Does an ant's decision to fight come from a representation of a desire to save its queen?

Same as above.

How accurate does its belief about the forthcoming battle have to be before this representation counts?

One thing that I'm excited about to think further about is what we might call "proper agents", that are agentic in themselves, rather than just inheriting their agency from the evolution / design / training process that made them. I think this is what you're pointing at with the ant's knowledge. Likely it wouldn't quite be a proper agent (but a human would, as we are able to adapt without re-evolving in a new environment). I have some half-developed thoughts on this.

This makes sense, thanks for explaining. So a threat model with specification gaming as its only technical cause, can cause x-risk under the right (i.e. wrong) societal conditions.

For instance: why expect that we need a multi-step story about consequentialism and power-seeking in order to deceive humans, when RLHF already directly selects for deceptive actions?

Is deception alone enough for x-risk? If we have a large language model that really wants to deceive any human it interacts with, then a number of humans will be deceived. But it seems like the danger stops there. Since the agent lacks intent to take over the world or similar, it won't be systematically deceiving humans to pursue some particular agenda of the agent. 

As I understand it, this is why we need the extra assumption that the agent is also a misaligned power-seeker.

The way I see it, the primary value of this work (as well as other CID work) is conceptual clarification. Causality is a really fundamental concept, which many other AI-safety relevant concepts build on (influence, response, incentives, agency, ...). The primary aim is to clarify the relationships between concepts and to derive relevant implications. Whether there are practical causal inference algorithms or not is almost irrelevant. 

TLDR: Causality > Causal inference :)

Sure, humans are sometimes inconsistent, and we don't always know what we want (thanks for the references, that's useful!). But I suspect we're mainly inconsistent in borderline cases, which aren't catastrophic to get wrong. I'm pretty sure humans would reliably state that they don't want to be killed, or that lots of other people die, etc. And that when they have a specific task in mind , they state that they want the task done rather than not. All this subject to that they actually understand the main considerations for whatever plan or outcome is in question, but that is exactly what debate and rrm are for

alignment of strong optimizers simply cannot be done without grounding out in something fundamentally different from a feedback signal.

I don't think this is obvious at all.  Essentially, we have to make sure that humans give feedback that matches their preferences, and that the agent isn't changing the human's preferences to be more easily optimized.

We have the following tools at our disposal:

  1. Recursive reward modelling / Debate. By training agents to help with feedback, improvements in optimization power boosts both the feedback and the process potentially fooling the feedback. It's possible that it's easier to fool humans than it is to help them not be fooled, but it's not obvious this is the case.
  2. Path-specific objectives. By training an explicit model of how humans will be influenced by agent behavior, we can design an agent that optimizes the hypothetical feedback that would have been given, had the agent's behavior not changed the human's preferences (under some assumptions).

This makes me mildly optimistic of using feedback even for relatively powerful optimization.

Load More