Progress on Causal Influence Diagrams

Planned summary for the Alignment Newsletter:

Many of the problems we care about (reward gaming, wireheading, manipulation) are fundamentally a worry that our AI systems will have the _wrong incentives_. Thus, we need Causal Influence Diagrams (CIDs): a formal theory of incentives. These are <@graphical models@>(@Understanding Agent Incentives with Causal Influence Diagrams@) in which there are action nodes (which the agent controls) and utility nodes (which determine what the agent wants). Once such a model is specified, we can talk about various incentives the agent has. This can then be used for several applications:
1. We can analyze [what happens](https://arxiv.org/abs/2102.07716) when you [intervene](https://arxiv.org/abs/1707.05173) on the agent’s action. Depending on whether the RL algorithm uses the original or modified action in its update rule, we may or may not see the algorithm disable its off switch.
2. We can <@avoid reward tampering@>(@Designing agent incentives to avoid reward tampering@) by removing the connections from future rewards to utility nodes; in other words, we ensure that the agent evaluates hypothetical future outcomes according to its _current_ reward function.
3. A [multiagent version](https://arxiv.org/abs/2102.05008) allows us to recover concepts like Nash equilibria and subgames from game theory, using a very simple, compact representation.

[-]Vanessa Kosoy4y20

IIUC, in a multi-agent influence model, every subgame perfect equilibrium is also a subgame perfect equilibrium in the corresponding extensive form game, but the converse is false in general. Do you know whether at least one subgame perfect equilibrium exists for any MAIM? I couldn't find it in the paper.

[-]James Fox4y*00

Hi Vanessa, Thanks for your question! Sorry for taking a while to reply. The answer is yes if we allow for mixed policies (i.e., where an agent can correlate all of their decision rules for different decisions with a shared random bit), but no if we restrict agents to only be able to use behavioural policies (i.e., decision rules for each of an agent's decisions are independent because they can't access a shared random bit). This is analogous to the difference between mixed and behavioural strategies in extensive form games, where (in general) a subgame perfect equilibrium (SPE) is only guaranteed to exist in mixed strategies (and the game is finite etc by Nash' theorem).

Note that If all agents in the MAIM have perfect recall (where they remember their previous decisions and the information that they knew at previous decisions), then there is guaranteed to exist a SPE in behavioural policies). In fact, Koller and Milch showed that only a weaker criterion of "sufficient recall" is needed (https://www.semanticscholar.org/paper/Ignorable-Information-in-Multi-Agent-Scenarios-Milch-Koller/5ea036bad72176389cf23545a881636deadc4946).

In a forthcoming journal paper, we expand significantly on the the theoretical underpinnings and advantages of MAIMs and so we will provide more results there.

[-]IlyaShpitser4y20

Pretty interesting.

Since you are interested in policies that operate along some paths only, you might find these of interest:

https://pubmed.ncbi.nlm.nih.gov/31565035/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6330047/

We have some recent stuff on generalizing MDPs to have a causal model inside every state ('path dependent structural equation models', to appear in UAI this year).

[-]tom4everitt4y30

Thanks Ilya for those links, in particular the second one looks quite relevant to something we’ve been working on in a rather different context (that's the benefit of speaking the same language!)

We would also be curious to see a draft of the MDP-generalization once you have something ready to share!

[-]IlyaShpitser4y*20

https://auai.org/uai2021/pdf/uai2021.89.preliminary.pdf (this really is preliminary, e.g. they have not yet uploaded a newer version that incorporates peer review suggestions).

---

Can't do stuff in the second paper without worrying about stuff in the first (unless your model is very simple).