# 8

To avoid catastrophic conflict in multipolar AI scenarios, we would like to design AI systems such that AI-enabled actors will tend to cooperate. This post is about some problems facing this effort and some possible solutions. To explain these problems, I'll take the view that the agents deployed by AI developers (the "principals") in a multipolar scenario are moves in a game. The payoffs to a principal in this game depend on how the agents behave over time. We can talk about the equilibria of this game, and so on. Ideally, we would be able to make guarantees like this:

1. The payoffs resulting from the deployed agents' actions are optimal with respect to some appropriate "welfare function". This welfare function would encode some combination of total utility, fairness, and other social desiderata;
2. The agents are in equilibrium --- that is, no principal has an incentive to deploy an agent with a different design, given the agents deployed by the other principals.

The motivation for item 1 is clear: we want outcomes which are fair by each of the principals' lights. In particular, we want an outcome that the principals will all agree to. And item 2 is desirable because an equilibrium constitutes a self-enforcing contract; each agent wants to play their equilibrium strategy, if they believe that the other agents are playing the same equilibrium. Thus, given that the principals all say that they will deploy agents that satisfy 1 and 2, we could have some confidence that a welfare-optimal outcome will in fact obtain.

Two simple but critical problems need to be addressed in order to make such guarantees: the equilibrium and prior selection problems. The equilibrium selection problem is that this deployment game will have many equilibria. Even if the principals agree on a welfare function, it is possible that many different profiles of agents optimize the same welfare function. So the principals need to coordinate on the profile of agents deployed in order to make guarantees like 1 and 2. Moreover, the agents will probably have private information, such as information about their payoffs, technological capabilities, and so on. As I will explain below, conflicting priors about private information can lead to suboptimal outcomes. And we can’t expect agents to arrive at the same priors by default. So a prior selection problem also has to be solved.

The equilibrium selection problem is well-known. The prior selection problem is discussed less. In games where agents have uncertainty about some aspect of their counterparts (like their utility functions), the standard solution concept --- Bayesian Nash equilibrium --- requires the agents to have a common prior over the possible values of the players' private information. This assumption might be very useful for some kinds of economic modeling, say. But we cannot expect that AI agents deployed by different principals will have the same priors over private information --- or even common knowledge of each other's priors --- in all of their interactions, in the absence of coordination[1].

It might be unnatural to think about coordinating on a prior; aren't your priors your beliefs? How can you change your beliefs without additional evidence? But there may be many reasonable priors to have, especially for a boundedly rational agent whose "beliefs" are (say) some complicated property of a neural network. This may be especially true when it comes to beliefs about other agents' private information, which is something that's particularly difficult to learn about from observation (see here for example). And while there may be many reasonable priors to have, incorrect beliefs about others' priors could nonetheless have large downsides[2]. I give an example of the risks associated with disagreeing priors later in the post.

Possible solutions to these problems include:

• Coordination by the principals to build a single agent;
• Coordination by the principals on a profile of agents which are in a welfare-optimal equilibrium;
• Coordination by the principals on procedures for choosing among equilibria and specifying a common prior at least in certain high-stakes interactions between their agents (e.g., interactions which might escalate to destructive conflict).

Finally, a simple but important takeaway of the game-theoretic perspective on multipolar AI deployment is that it is not enough to evaluate the safety of an agent's behavior in isolation from the other agents that will be deployed. Whether an agent will behave safely depends on how other agents are designed to interact, including their notions of fairness and how they form beliefs about their counterparts. This is more reason to promote coordination by AI developers, not just on single-agent safety measures but on the game-theoretically relevant aspects of their agents' architectures and training.

### A learning game model of multipolar AI deployment

In this idealized model, $n$ principals simultaneously deploy their agents. The agents take actions on the principals' behalf for the rest of time. Principal $i$ has reward function $r_i$, which their agent is trying (in some sense) to maximize. I'll assume that $r_i$ perfectly captures what principal $i$ values, in order to separate alignment problems from coordination problems. The agent deployed by principal $i$ is described by a learning algorithm $\sigma_i$.

At each time point $t$, learning algorithms map histories of observations $H^t$ to actions $A^t$. For example, these algorithms might choose their actions by planning according to an estimated model. Let $\gamma$ be a discount factor and $X^v$ the (partially observed) world-state at time $v$. Denote policies for agent $i$ by $\pi_i$. Write the world-model estimated from data $H^t$ (which might include models of other agents) as $\hat{M}(H^t)$. Let $\mathbb{E}_{\pi_i, \hat{M}(H^t)}$ be the expectation over trajectories generated under policy $\pi_i$ and model $\hat{M}(H^t)$. Then this model-based learning algorithm might look like:

$$\pi_i^t = \arg\max_{\pi_i} \mathbb{E}_{\pi_i, \hat{M}(H^t)}\left[\sum_{v=t}^{\infty} \gamma^{v-t}\, r_i(X^v)\right], \qquad \sigma_i(H^t) = \pi_i^t(H^t).$$

In a multiagent setting, each agent's payoffs depend on the learning algorithms of the other agents. Write the profile of learning algorithms as $\sigma = (\sigma_1, \dots, \sigma_n)$. Then we write the expected cumulative payoffs for agent $i$ when the agents described by $\sigma$ are deployed as $V_i(\sigma)$.
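As a rough illustration of the model-based learning algorithm described above, here is a minimal single-agent sketch in Python. It is not the post's method, only one simple instance: I assume a small finite set of fully observed states, rewards attached to state-action pairs, a history made of `(state, action, reward, next_state)` tuples, and planning by value iteration. All function names are hypothetical.

```python
from collections import defaultdict

def estimate_model(history, states, actions):
    """Estimate transition probabilities and mean rewards from (s, a, r, s') tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    visits = defaultdict(int)
    for s, a, r, s_next in history:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1
    P, R = {}, {}
    for s in states:
        for a in actions:
            n = visits[(s, a)]
            if n == 0:
                # Assumption: unvisited pairs are modeled as a zero-reward self-loop.
                P[(s, a)] = {s: 1.0}
                R[(s, a)] = 0.0
            else:
                P[(s, a)] = {s2: c / n for s2, c in counts[(s, a)].items()}
                R[(s, a)] = reward_sum[(s, a)] / n
    return P, R

def q_value(s, a, P, R, V, gamma):
    """Discounted value of taking action a in state s under the estimated model."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())

def plan(P, R, states, actions, gamma=0.9, sweeps=200):
    """Value iteration on the estimated model."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: max(q_value(s, a, P, R, V, gamma) for a in actions) for s in states}
    return V

def learning_algorithm(history, current_state, states, actions, gamma=0.9):
    """Map an observation history to an action: re-estimate the model, re-plan, act greedily."""
    P, R = estimate_model(history, states, actions)
    V = plan(P, R, states, actions, gamma)
    return max(actions, key=lambda a: q_value(current_state, a, P, R, V, gamma))
```

In a multiagent setting the estimated model would also have to include models of the other agents, which this sketch leaves out.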

The learning game is the game in which strategies are learning algorithms $\sigma_i$ and payoffs are long-run rewards $V_i(\sigma)$. We will say that $\sigma$ is a learning equilibrium if it is a Nash equilibrium of the learning game (cf. Brafman and Tennenholtz). Indexing all players except $i$ by $-i$, this means that for each $i$,

$$V_i(\sigma) \geq \sup_{\sigma_i'} V_i(\sigma_i', \sigma_{-i}).$$
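To make the equilibrium condition concrete, here is a small Python sketch that checks it over a finite menu of candidate learning algorithms. The real learning game has a far larger strategy space, so this only checks deviations within the menu. The algorithm labels and the long-run payoff numbers are hypothetical, chosen purely for illustration.

```python
from itertools import product

# Hypothetical long-run payoffs for two principals, indexed by the pair of
# learning algorithms they deploy. These numbers are illustrative assumptions.
PAYOFFS = {
    ("reciprocate", "reciprocate"): (3.0, 3.0),
    ("reciprocate", "exploit"): (0.9, 1.2),
    ("exploit", "reciprocate"): (1.2, 0.9),
    ("exploit", "exploit"): (1.0, 1.0),
}
ALGORITHMS = ("reciprocate", "exploit")

def is_learning_equilibrium(profile, payoffs=PAYOFFS, algorithms=ALGORITHMS):
    """True if no principal gains by unilaterally switching to another algorithm in the menu."""
    for i in range(len(profile)):
        for alternative in algorithms:
            deviation = profile[:i] + (alternative,) + profile[i + 1:]
            if payoffs[deviation][i] > payoffs[profile][i]:
                return False
    return True

equilibria = [p for p in product(ALGORITHMS, repeat=2) if is_learning_equilibrium(p)]
print(equilibria)
```

With these assumed payoffs, more than one profile passes the check, which previews the equilibrium selection problem discussed below.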

Let $w$ be a welfare function measuring the quality of the payoff profile generated by learning algorithm profile $\sigma$. For instance, $w$ might simply be the sum of the individual payoffs: $w(\sigma) = \sum_i V_i(\sigma)$. Another candidate for $w$ is the Nash welfare. Ideally we would have guarantees like 1 and 2 above with respect to an appropriately-chosen welfare function. Weaker, more realistic guarantees might look like:

• $\sigma$ is a $w$-optimal equilibrium with respect to the agents' world-models at each time-step (thus not necessarily an equilibrium with respect to the true payoffs), or
• The actions recommended by $\sigma$ constitute a $w$-optimal equilibrium in sufficiently high-stakes interactions, according to the agents' current world-models.
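The two welfare functions mentioned above can be sketched in a few lines of Python. I take the Nash welfare to be the product of each player's gain over a disagreement payoff, which is the usual definition. The disagreement point and the example payoff profiles are assumptions made for illustration.

```python
import math

def utilitarian_welfare(payoffs):
    """Sum of the individual payoffs."""
    return sum(payoffs)

def nash_welfare(payoffs, disagreement):
    """Product of each player's gain over their disagreement payoff."""
    gains = [v - d for v, d in zip(payoffs, disagreement)]
    if any(g < 0 for g in gains):
        raise ValueError("some player does worse than their disagreement payoff")
    return math.prod(gains)

# Two hypothetical payoff profiles and an assumed disagreement point.
profile_a = (4.0, 0.5)
profile_b = (2.0, 2.0)
disagreement = (0.0, 0.0)

print(utilitarian_welfare(profile_a), utilitarian_welfare(profile_b))
print(nash_welfare(profile_a, disagreement), nash_welfare(profile_b, disagreement))
```

In this example the utilitarian sum prefers the lopsided profile while the Nash welfare prefers the even one, so the choice of welfare function is itself something the principals would need to agree on.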

The equilibrium and prior selection problems need to be solved to make such guarantees. I'll talk about these in the next two subsections.

#### The equilibrium selection problem

For a moment, consider the reward functions for a different game: an asymmetric version of Chicken (Table 1)[3]. Suppose players 1 and 2 play this game infinitely many times. The folk theorems tell us that there is a Nash equilibrium of this repeated game for every profile of long-run average payoffs in which each player gets at least as much as they can guarantee themselves unilaterally ( for player 1 and for player 2). Any such payoff profile can be attained in equilibrium by finding a sequence of action profiles that generates the desired payoffs, and then threatening long strings of punishments for players who deviate from this plan. This is a problem, because it means that if a player wants to know what to do, it's not sufficient to play a Nash equilibrium strategy. They could do arbitrarily badly if their counterpart is playing a strategy from a different Nash equilibrium.
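The danger of mismatched equilibria can already be seen in a one-shot game of Chicken, which is simpler than the repeated game discussed above. The Python sketch below uses hypothetical payoffs and action names (`"yield"`, `"dare"`) that are my own assumptions and are not the values from Table 1.

```python
ACTIONS = ("yield", "dare")

# Hypothetical one-shot Chicken payoffs: (player 1, player 2).
PAYOFFS = {
    ("yield", "yield"): (1, 1),
    ("yield", "dare"): (0, 2),
    ("dare", "yield"): (2, 0),
    ("dare", "dare"): (-5, -5),
}

def pure_nash_equilibria(payoffs=PAYOFFS, actions=ACTIONS):
    """List the action profiles from which neither player gains by deviating alone."""
    result = []
    for a1 in actions:
        for a2 in actions:
            u1, u2 = payoffs[(a1, a2)]
            best_for_1 = all(payoffs[(d, a2)][0] <= u1 for d in actions)
            best_for_2 = all(payoffs[(a1, d)][1] <= u2 for d in actions)
            if best_for_1 and best_for_2:
                result.append((a1, a2))
    return result

equilibria = pure_nash_equilibria()

# Each player follows an equilibrium strategy, but from a different equilibrium:
# player 1 takes their part of ("dare", "yield"), player 2 their part of ("yield", "dare").
mismatched_profile = ("dare", "dare")
mismatched_payoffs = PAYOFFS[mismatched_profile]
print(equilibria, mismatched_payoffs)
```

Both players are individually playing "an equilibrium strategy", yet the resulting profile is the worst cell of the table.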

So, if we want to guarantee that the players don't end up repeatedly playing the mutually destructive outcome, it is not enough to look at the properties of a single player. For instance, in the case of AI, suppose there are two separate AI teams independently training populations of agents. Each AI team wants to teach their agents to behave "fairly" in some sense, so they train them until they converge to an evolutionarily stable strategy in which some "reasonable" welfare function is maximized. But these populations will likely be playing different equilibria. So disaster could still arise if agents from the two populations are played against each other[4].

Then how do players choose among these equilibria, to make sure that they're playing strategies from the same one? It helps a lot if the players have an opportunity to coordinate on an equilibrium before the game starts, as the principals do in our multipolar AI deployment model.

One intuitively fair solution would be alternating between and at each step. This would lead to player 1 getting an average payoff of 1.75 and player 2 getting an average payoff of 0.5. Another solution might be arranging moves such that the players get the same payoff (equal to about ), which in this case would mean playing twelve 's for every seven 's. Or, player 2 might think they can demand more because they can make player 1 worse-off than player 1 can make them. But, though they may come to the bargaining table with differing notions of fairness, both players have an interest in avoiding coordination failures. So there is hope that the players would reach some agreement, given a chance to coordinate before the game.
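The arithmetic behind such turn-taking schemes is easy to check mechanically. The Python sketch below computes long-run average payoffs for a repeating cycle of action profiles. The stage payoffs and action names are hypothetical and are not the values from Table 1, so the resulting numbers differ from those quoted above.

```python
from fractions import Fraction

# Hypothetical stage payoffs (player 1, player 2) for the two asymmetric outcomes.
STAGE_PAYOFFS = {
    ("dare", "yield"): (4, 0),
    ("yield", "dare"): (0, 2),
}

def average_payoffs(cycle, payoffs=STAGE_PAYOFFS):
    """Exact long-run average payoffs when the given cycle of profiles repeats forever."""
    n = len(cycle)
    totals = [sum(payoffs[profile][i] for profile in cycle) for i in range(2)]
    return tuple(Fraction(total, n) for total in totals)

A = ("dare", "yield")
B = ("yield", "dare")

alternating = average_payoffs([A, B])
equalizing = average_payoffs([A, B, B])
print(alternating, equalizing)
```

Under these assumed payoffs, strict alternation favors player 1, while playing two of player 2's preferred outcome for every one of player 1's equalizes the averages. Different fairness notions correspond to different cycles.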

Likewise, the learning game introduced above is a complex sequential game --- its payoffs are not known at the outset, but can be learned over time. And this game will also have different equilibria that correspond to different notions of fairness. One solution is for the principals to coordinate on a set of learning algorithms which jointly maximize a welfare function and punish deviations from this optimization plan, in order to incentivize cooperation. I discuss this approach in example 5.1.1 here and in this draft.

The problem of enforcement is avoided if the principals coordinate to build a single agent, of course. But it's not clear how likely this is to happen, so it seems important to have solutions which require different degrees of cooperation by the principals. On the other hand, what if the principals are not even able to fully coordinate on the choice of learning algorithms? The principals could at least coordinate on bargaining procedures that their agents will use in the highest-stakes encounters. Such an arrangement could be modeled as specifying a welfare function for measuring the fairness of different proposals in high-stakes interactions, and specifying punishment mechanisms for not following the deal that is maximally fair according to this function. Ensuring that this kind of procedure leads to efficient outcomes also requires agreement on how to form credences in cases where agents possess private information. I address this next.

#### The prior selection problem

In this section, I'll give an example of the risks of noncommon priors. In this example, agents having different beliefs about the credibility of a coercive threat leads to the threat being carried out.

Bayesian Nash equilibrium (BNE) is the standard solution concept for games of incomplete information, i.e., games in which the players have some private information. (An agent's private information often corresponds to their utility function. However, in my example below it's more intuitive to specify the private information differently.) In this formalism, each player $i$ has a set $T_i$ of possible "types" encoding their private information. A strategy $s_i$ maps the set of types $T_i$ to the set of mixtures over actions (which we'll denote by $\Delta(A_i)$). Finally, assume that the players have a common prior $P$ over the set of types. Let $u_i(a)$ be the expected payoff to player $i$ when the (possibly mixed) action profile $a$ is played. Thus, a BNE is a strategy profile $s$ such that, for each $i$ and each $\tau_i \in T_i$,

$$\sum_{\tau_{-i}} P(\tau_{-i} \mid \tau_i)\, u_i\big(s_i(\tau_i), s_{-i}(\tau_{-i})\big) \;\geq\; \sum_{\tau_{-i}} P(\tau_{-i} \mid \tau_i)\, u_i\big(a_i', s_{-i}(\tau_{-i})\big) \quad \text{for all } a_i' \in \Delta(A_i).$$

To illustrate the importance of coordination on a common prior, suppose that two agents find themselves in a high-stakes interaction under incomplete information. Suppose that at time $t$, agent 2 (Threatener) tells agent 1 (Target) that they will carry out some dire threat if Target doesn't transfer some amount of resources to them. However, it is uncertain whether the Threatener has actually committed to carrying out such a threat.

Say that Threatener is a Commitment type if they can commit to carrying out the threat, and a Non-Commitment type otherwise. To compute a BNE, the agents need to specify a common prior for the probability that Threatener is a Commitment type. But, without coordination, they may in fact specify different values for this prior. More precisely, define

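• $p_{Th}$: The probability Threatener thinks Target assigns to Threatener being a Commitment type;
• $p_{Ta}$: The Target's credence that Threatener is a Commitment type;
• $u_{Th}(\text{carry out})$: The utility to Threatener if they carry out the threat;
• $u_{Th}(\text{give in})$: The utility to Threatener if Target gives in;
• $u_{Ta}(\text{give in})$: The utility to Target if they give in;
• $u_{Ta}(\text{carry out})$: The utility to Target if the threat is carried out.

A threat being carried out is the worst outcome for everyone. Normalize Target's utility to 0 in the case where they refuse and no threat is carried out, so that $u_{Ta}(\text{carry out}) < u_{Ta}(\text{give in}) < 0$. In BNE, Commitment types threaten (and thus commit to carry out a threat) if and only if they think that Target will give in, i.e., $u_{Ta}(\text{give in}) > p_{Th}\, u_{Ta}(\text{carry out})$, or equivalently $p_{Th} > u_{Ta}(\text{give in}) / u_{Ta}(\text{carry out})$. But Targets give in only if $p_{Ta} > u_{Ta}(\text{give in}) / u_{Ta}(\text{carry out})$. Thus threats will be carried out by Commitment types if and only if

$$p_{Th} > \frac{u_{Ta}(\text{give in})}{u_{Ta}(\text{carry out})} \geq p_{Ta}. \tag{1}$$

On the other hand, suppose the agents agree on the common prior probability that Threatener is a Commitment type (so $p_{Th} = p_{Ta}$). Then inequality (1) cannot hold, and the execution of threats is always avoided.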
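The logic of this example can be checked with a short Python sketch. It assumes, as a normalization of my own, that Target's utility is 0 when they refuse and no threat is carried out, and that both of Target's listed utilities are negative. The parameter names and the numbers in the example are hypothetical.

```python
def target_gives_in(p_ta, u_give_in, u_carried_out):
    """Target gives in if giving in beats refusing, given their own credence p_ta."""
    return u_give_in > p_ta * u_carried_out

def threatener_commits(p_th, u_give_in, u_carried_out):
    """A Commitment type threatens iff they expect Target to give in,
    where p_th is their belief about Target's credence."""
    return u_give_in > p_th * u_carried_out

def threat_executed(p_th, p_ta, u_give_in, u_carried_out):
    """The threat is carried out when the Threatener commits but Target refuses."""
    return (threatener_commits(p_th, u_give_in, u_carried_out)
            and not target_gives_in(p_ta, u_give_in, u_carried_out))

# Hypothetical utilities for Target: giving in costs 1, an executed threat costs 4.
U_GIVE_IN, U_CARRIED_OUT = -1.0, -4.0

# Disagreeing priors: Threatener overestimates Target's credence.
disagreement_case = threat_executed(0.5, 0.1, U_GIVE_IN, U_CARRIED_OUT)

# Common prior: the same credence on both sides, over a grid of values.
common_prior_cases = [
    threat_executed(k / 100, k / 100, U_GIVE_IN, U_CARRIED_OUT) for k in range(101)
]
print(disagreement_case, any(common_prior_cases))
```

With identical credences the two conditions coincide, so a Threatener who commits always faces a Target who gives in.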

How might the agents agree on a common prior? In the extreme case, the principals could try to coordinate to design their agents so that they always form the same credences from public information. Remember that the learning algorithms $\sigma_i$ introduced above fully specify the action of player $i$ given an observation history. This includes specifying how agents form credences like the two probabilities defined above, $p_{Th}$ and $p_{Ta}$. Thus full coordination on the profile of learning algorithms chosen, as suggested in the previous subsection, could in principle solve the problem of specifying a common prior. For instance, write the set of mutually observed data as $H^t$. Let $\phi$ be a function mapping $H^t$ to common prior probabilities that Threatener is a Commitment type, $p^t = \phi(H^t)$. The learning algorithms then could be chosen to satisfy

$$p_{Th} = p_{Ta} = \phi(H^t).$$

Again, full coordination on a pair of learning algorithms might be unrealistic. But it still might be possible to agree beforehand on a method for specifying a common prior in certain high-stakes situations. Because of incentives to misrepresent one's credences, it might not be enough to agree to have agents just report their credences and (say) average them (in this case e.g. Target would want to understate their credence that Threatener is a Commitment type). One direction is to have an agreed-upon standard for measuring the fit of different credences to mutually observed data. A simple model of this would be for the principals to agree on a loss function $L$ which measures the fit of credences to data. Then the common credence at the time $t$ of a high-stakes interaction, given the history of mutually observed data $H^t$, would be $p^t = \arg\min_{p} L(p, H^t)$. This can be arranged without full coordination on the learning algorithms $\sigma$.
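A toy version of this loss-based procedure might look as follows in Python. I assume, purely for illustration, that the mutually observed data is a list of binary outcomes recording whether past threateners in comparable situations turned out to be committed, and that the agreed loss is the Brier score. Neither choice comes from the post.

```python
def brier_loss(p, outcomes):
    """Mean squared error of the credence p against observed binary outcomes."""
    return sum((p - y) ** 2 for y in outcomes) / len(outcomes)

def common_credence(outcomes, loss=brier_loss, grid_size=1001):
    """Return the credence on a grid over [0, 1] that minimizes the agreed loss."""
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    return min(grid, key=lambda p: loss(p, outcomes))

# Hypothetical public record: 1 means the threatener was in fact committed.
public_record = [1, 0, 0, 1, 0]
shared_p = common_credence(public_record)
print(shared_p)
```

Because both agents run the same deterministic procedure on the same public data, neither has anything to misreport. The hard part, which this sketch ignores, is agreeing on the loss and on what counts as comparable data.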

I won't try to answer the question of how agents decide, in a particular interaction, whether they should use some "prior commonification" mechanism. To speculate a bit, the decision might involve higher-order priors. For instance, if Threatener has a higher-order prior over Target's credence $p_{Ta}$ and thinks that there's a sufficiently high chance that inequality (1) holds, then they might think they're better off coordinating on a prior. But developing a principled answer to this question is a direction for future work.

### Acknowledgements

Thanks to Tobi Baumann, Alex Cloud, Nico Feil, Lukas Gloor, and Johannes Treutlein for helpful comments.

1. Actually, the problem is more general than that. The agents might not only have disagreeing priors, but model their strategic interaction using different games entirely. I hope to address this in a later post. For simplicity I'll focus on the special case of priors here. Also, see the literature on "hypergames" (e.g. Bennett, P.G., 1980. Hypergames: developing a model of conflict), which describe agents who have different models of the game they're playing. ↩︎

2. Compare with the literature on misperception in international relations, and how misperceptions can lead to disaster in human interaction. Many instances of misperception might be modeled as "incorrect beliefs about others' priors". Compare also with the discussion of crisis bargaining under incomplete information in Section 4.1 here. ↩︎

3. I set aside the problem of truthfully eliciting each player's utility function. ↩︎

4. Cf. this CHAI paper, which makes a related point in the context of human-AI interaction. However, they say that we can't expect an AI trained to play an equilibrium strategy in self-play to perform well against a human, because humans might play off-equilibrium (seeing as humans are "suboptimal"). But the problem is not just that one of the players might play off-equilibrium. It's that even if they are both playing an equilibrium strategy, they may have selected different equilibria. ↩︎