Tom Everitt

Research Scientist at DeepMind

Wiki Contributions


The way I see it, the primary value of this work (as well as other CID work) is conceptual clarification. Causality is a really fundamental concept, which many other AI-safety relevant concepts build on (influence, response, incentives, agency, ...). The primary aim is to clarify the relationships between concepts and to derive relevant implications. Whether there are practical causal inference algorithms or not is almost irrelevant. 

TLDR: Causality > Causal inference :)

Sure, humans are sometimes inconsistent, and we don't always know what we want (thanks for the references, that's useful!). But I suspect we're mainly inconsistent in borderline cases, which aren't catastrophic to get wrong. I'm pretty sure humans would reliably state that they don't want to be killed, or that lots of other people die, etc. And that when they have a specific task in mind , they state that they want the task done rather than not. All this subject to that they actually understand the main considerations for whatever plan or outcome is in question, but that is exactly what debate and rrm are for

alignment of strong optimizers simply cannot be done without grounding out in something fundamentally different from a feedback signal.

I don't think this is obvious at all.  Essentially, we have to make sure that humans give feedback that matches their preferences, and that the agent isn't changing the human's preferences to be more easily optimized.

We have the following tools at our disposal:

  1. Recursive reward modelling / Debate. By training agents to help with feedback, improvements in optimization power boosts both the feedback and the process potentially fooling the feedback. It's possible that it's easier to fool humans than it is to help them not be fooled, but it's not obvious this is the case.
  2. Path-specific objectives. By training an explicit model of how humans will be influenced by agent behavior, we can design an agent that optimizes the hypothetical feedback that would have been given, had the agent's behavior not changed the human's preferences (under some assumptions).

This makes me mildly optimistic of using feedback even for relatively powerful optimization.

Nice post! The Game Theory / Bureaucracy is interesting. It reminds me of Drexler's CAIS proposal, where services are combined into an intelligent whole. But I (and Drexler, I believe) agree that much more work could be spent on figuring out how to actually design/combine these systems.

Thanks Ilya for those links, in particular the second one looks quite relevant to something we’ve been working on in a rather different context (that's the benefit of speaking the same language!)

We would also be curious to see a draft of the MDP-generalization once you have something ready to share!


  • I think the existing approach and easy improvements don't seem like they can capture many important incentives such that you don't want to use it as an actual assurance (e.g. suppose that agent A is predicting the world and agent B is optimizing A's predictions about B's actions---then we want to say that the system has an incentive to manipulate the world but it doesn't seem like that is easy to incorporate into this kind of formalism).


This is what multi-agent incentives are for (i.e. incentive analysis in multi-agent CIDs).  We're still working on these as there are a range of subtleties, but I'm pretty confident we'll have a good account of it.

Glad she likes the name :) True, I agree there may be some interesting subtleties lurking there. 

(Sorry btw for slow reply; I keep missing alignmentforum notifications.)

Thanks Stuart and Rebecca for a great critique of one of our favorite CID concepts! :)

We agree that lack of control incentive on X does not mean that X is safe from influence from the agent, as it may be that the agent influences X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node probabilistically dependent. 

What control incentives do capture are the instrumental goals of the agent. Controlling X can be a subgoal for achieving utility if and only if the CID admits a control incentive on X. For this reason, we have decided to slightly update the terminology: in the latest version of our paper (accepted to AAAI, just released on arXiv) we prefer the term instrumental control incentive (ICI), to emphasize that the distinction to "control as a side effect.

Glad you liked it.

Another thing you might find useful is Dennett's discussion of what an agent is (see first few chapters of Bacteria to Bach). Basically, he argues that an agent is something we ascribe beliefs and goals to. If he's right, then an agent should basically always have a utility function.

Your post focuses on the belief part, which is perhaps the more interesting aspect when thinking about strange loops and similar.

There is a paper which I believe is trying to do something similar to what you are attempting here:

Networks of Influence Diagrams: A Formalism for Representing Agents’ Beliefs and Decision-Making Processes, Gal and Pfeffer, Journal of Artificial Intelligence Research 33 (2008) 109-147

Are you aware of it? How do you think their ideas relate to yours?

Load More