Prisoners' Dilemma with Costs to Modeling

Scott Garrabrant

We consider a modification to the open source prisoners' dilemma in which agents must pay some resources to model each other. We use the modal combat framework, except that agents pay a cost proportional to the depth of boxes in their code. Even a small modeling penalty makes the FairBot-FairBot outcome no longer an equilibrium, since the best response to FairBot is to be CooperateBot and not pay the modeling penalty. The best response to CooperateBot is to be DefectBot, and the pure DefectBot-DefectBot outcome is a stable Nash equilibrium. In fact, I believe that DefectBot-DefectBot is the unique pure strategy Nash equilibrium.

Amazingly, this turns out to be okay! For small modeling penalties, there is a mixed strategy equilibrium which mixes between CooperateBot, FairBot, and PrudentBot! Both players get exactly the same utility in expectation as the FairBot-FairBot outcome.

Further, if you consider an evolutionary system where populations reproduce in proportion to how well they do in prisoners' dilemmas with each other, it appears that as the modeling penalty gets small, the basin of the defect equilibrium also gets small, and nearly all initial conditions cycle around CooperateBot, FairBot, and PrudentBot!

This post came out of conversations with Sam Eisenstat, Abram Demski, Tsvi Benson-Tilsen, and Andrew Critch. It is a first draft that could use a coauthor to carefully check everything, expand on it, and turn it into a paper. If you think you could do that with minimal guidance from me, let me know.

Formalism

We will be using the modal combat framework, and identifying $⊤$ with cooperation and $⊥$ with defection. Agents are defined by formulas that combine the other agent $X$ run on various agents using propositional calculus and a modal operator $□$ . The $□$ represents provability, and every instance of $X$ run on an agent in the formula must be contained within a $□$ . Recall some common modal agents:

CooperateBot is defined by $C B (X) \leftrightarrow ⊤$ .

DefectBot is defined by $D B (X) \leftrightarrow ⊥$ .

FairBot is defined by $F B (X) \leftrightarrow □ (X (F B))$ .

PrudentBot is defined by $P B (X) \leftrightarrow □ (X (P B) \land (X (D B) \to □ ⊥))$ .

These 4 agents interact with each other as follows: CooperateBot cooperates with everyone. DefectBot defects against everyone. FairBot defects against only DefectBot. PrudentBot defects against CooperateBot and DefectBot and cooperates with itself and FairBot.

We will say that the depth of an agent is the maximum of the depth of $□$ s in its code and the depth of the agents that it calls the opponent on. CooperateBot and DefectBot have depth 0, FairBot has depth 1, and PrudentBot has depth 2.

We will use a prisoners' dilemma where mutual cooperation produces utility 2, mutual defection produces utility 1, and exploitation produces utility 3 for the exploiter and 0 for the exploited. Each player will also pay a penalty of $ε$ times its depth.
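As a minimal sketch of the game just defined, the payoffs can be tabulated directly from the interaction summary above instead of running any provability search. The names `COOPERATES`, `DEPTH`, `base_payoff`, and `payoff` are illustrative choices, not part of the modal combat framework.

```python
from fractions import Fraction

AGENTS = ["DB", "CB", "FB", "PB"]

# Depth of each agent as stated in the text (illustrative encoding).
DEPTH = {"DB": 0, "CB": 0, "FB": 1, "PB": 2}

# COOPERATES[a][b] is True when agent a cooperates against agent b,
# copied from the prose summary of how the four agents interact.
COOPERATES = {
    "DB": {"DB": False, "CB": False, "FB": False, "PB": False},
    "CB": {"DB": True, "CB": True, "FB": True, "PB": True},
    "FB": {"DB": False, "CB": True, "FB": True, "PB": True},
    "PB": {"DB": False, "CB": False, "FB": True, "PB": True},
}


def base_payoff(me_cooperates, other_cooperates):
    """Prisoners' dilemma payoff before any modeling penalty."""
    if me_cooperates and other_cooperates:
        return 2  # mutual cooperation
    if not me_cooperates and not other_cooperates:
        return 1  # mutual defection
    return 0 if me_cooperates else 3  # exploited / exploiter


def payoff(a, b, eps):
    """Utility of agent a against agent b, minus eps times a's depth."""
    return base_payoff(COOPERATES[a][b], COOPERATES[b][a]) - eps * DEPTH[a]
```

For instance, `payoff("PB", "CB", Fraction(1, 10))` evaluates to `Fraction(14, 5)`: PrudentBot exploits CooperateBot for 3 and pays two units of penalty.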

Pure Equilibria

The best response to both CooperateBot and DefectBot is DefectBot, since when the opponent does not depend on you, you want to defect with the least possible penalty.

The best response to FairBot is CooperateBot, since you can't exploit FairBot, so you want to get mutual cooperation with the least possible penalty.

The best response to PrudentBot is FairBot, since you can't exploit PrudentBot, you can't mutually cooperate at depth 0, but you can mutually cooperate at depth 1 (penalty $ε$ ) by being FairBot. (This is assuming $ε$ is less than $\frac{1}{2}$ . Otherwise, you just want to defect to avoid the penalty.)

Thus, if the only options are CooperateBot, DefectBot, FairBot, and PrudentBot, the unique pure strategy equilibrium is mutual DefectBot.
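The best-response argument above can be checked by brute force over the four agents. This is only a sketch restricted to these four agents; the table `BASE` and the helper names are assumptions made for illustration, with `BASE[a][b]` holding the pre-penalty utility of `a` against `b`.

```python
from fractions import Fraction

AGENTS = ["DB", "CB", "FB", "PB"]
DEPTH = {"DB": 0, "CB": 0, "FB": 1, "PB": 2}

# BASE[a][b]: utility of a against b before the modeling penalty.
BASE = {
    "DB": {"DB": 1, "CB": 3, "FB": 1, "PB": 1},
    "CB": {"DB": 0, "CB": 2, "FB": 2, "PB": 0},
    "FB": {"DB": 1, "CB": 2, "FB": 2, "PB": 2},
    "PB": {"DB": 1, "CB": 3, "FB": 2, "PB": 2},
}


def best_responses(opponent, eps):
    """All agents that maximize utility against a fixed pure opponent."""
    utils = {a: BASE[a][opponent] - eps * DEPTH[a] for a in AGENTS}
    top = max(utils.values())
    return [a for a in AGENTS if utils[a] == top]


def pure_equilibria(eps):
    """Pairs (a, b) in which each agent is a best response to the other."""
    return [
        (a, b)
        for a in AGENTS
        for b in AGENTS
        if a in best_responses(b, eps) and b in best_responses(a, eps)
    ]
```

With `eps = Fraction(1, 10)`, `pure_equilibria(eps)` returns `[("DB", "DB")]`, matching the claim for this four-agent game.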

I believe that DefectBot is the only pure strategy equilibrium in general. This would follow directly from the fact that if a depth $n$ agent $X$ cooperates with another depth $n$ agent $Y$ , then there exists a depth $n - 1$ agent $Y^{'}$ which $X$ also cooperates with. I believe this is true, but haven't proven it yet. If it turns out to be false, there might be a simple modification of the penalties that makes it true.

Mixed Equilibria

First, let's look at mixed equilibria in which the only options are the four modal agents above.

Let $d_{i}$ , $c_{i}$ , $f_{i}$ , and $p_{i}$ be the probabilities with which player $i$ is DefectBot, CooperateBot, FairBot, and PrudentBot respectively.

The utility of playing DefectBot is $d_{i} + 3 c_{i} + f_{i} + p_{i} = 1 + 2 c_{i}$ , where $i$ is the other player.

The utility of playing CooperateBot is $2 c_{i} + 2 f_{i}$ , where $i$ is the other player.

The utility of playing FairBot is $d_{i} + 2 c_{i} + 2 f_{i} + 2 p_{i} - ε = 2 - d_{i} - ε$ , where $i$ is the other player.

The utility of playing PrudentBot is $d_{i} + 3 c_{i} + 2 f_{i} + 2 p_{i} - 2 ε = 1 + 2 c_{i} + f_{i} + p_{i} - 2 ε$ , where $i$ is the other player.

Note that if one player is indifferent between CooperateBot and DefectBot, this means the other player plays FairBot exactly $\frac{1}{2}$ of the time, but then the first player would be better off playing PrudentBot (as long as $ε < \frac{1}{4}$ ). Thus no player can ever randomize between CooperateBot and DefectBot.
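Spelled out with the four utilities listed above (a worked check of this step, not an extra assumption): indifference between CooperateBot and DefectBot pins down $f_{i}$ , and PrudentBot then strictly beats DefectBot.

$$
\begin{aligned}
2 c_{i} + 2 f_{i} = 1 + 2 c_{i} &\implies f_{i} = \tfrac{1}{2}, \\
(1 + 2 c_{i} + f_{i} + p_{i} - 2 ε) - (1 + 2 c_{i}) &= f_{i} + p_{i} - 2 ε \ge \tfrac{1}{2} - 2 ε > 0 \quad \text{when } ε < \tfrac{1}{4}.
\end{aligned}
$$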

If a player doesn't play CooperateBot at all, the other player can't play PrudentBot, since FairBot is strictly better. Thus, if both players don't play CooperateBot, they both play only DefectBot and FairBot. If this isn't the pure defect solution, it must be because $f_{i} = ε$ and $d_{i} = 1 - ε$ for both players, which is indeed a (not very good) Nash equilibrium.

If exactly one player plays CooperateBot at all, that (first) player mixes between CooperateBot and FairBot, while the other player mixes between (at most) DefectBot, FairBot and PrudentBot. The second player can't play FairBot, since CooperateBot would have done strictly better. Thus the second player plays only DefectBot and PrudentBot, and the first player gets 0 utility for playing CooperateBot. But FairBot always gets at least $1 - ε > 0$ , so the first player would strictly prefer FairBot to CooperateBot, a contradiction.

Finally, we have the case where both players play CooperateBot with some probability, and thus neither plays DefectBot at all. If either player did not play FairBot at all, then DefectBot would strictly dominate CooperateBot for the other player, contradicting the fact that both players play CooperateBot. If either player did not play PrudentBot at all, CooperateBot would strictly dominate FairBot, contradicting the fact that both players play FairBot.

Thus, in the only remaining equilibria, both players mix between all of CooperateBot, FairBot, and PrudentBot. Since both players are indifferent between CooperateBot and FairBot, both players must play PrudentBot with probability exactly $\frac{ε}{2}$ . Since both players are indifferent between FairBot and PrudentBot, both players must play CooperateBot with probability exactly $ε$ . This leaves probability $1 - \frac{3 ε}{2}$ for FairBot, and the result is indeed a Nash equilibrium.
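The two indifference conditions can be worked out explicitly from the utilities above with $d_{i} = 0$ and $c_{i} + f_{i} + p_{i} = 1$ . The labels $U_{CB}$ , $U_{FB}$ , and $U_{PB}$ are shorthand introduced here for the utility of playing each agent.

$$
\begin{aligned}
U_{CB} &= 2 c_{i} + 2 f_{i} = 2 - 2 p_{i}, \\
U_{FB} &= 2 - d_{i} - ε = 2 - ε, \\
U_{PB} &= 1 + 2 c_{i} + f_{i} + p_{i} - 2 ε = 2 + c_{i} - 2 ε, \\
U_{CB} = U_{FB} &\implies p_{i} = \tfrac{ε}{2}, \qquad U_{FB} = U_{PB} \implies c_{i} = ε, \\
f_{i} &= 1 - c_{i} - p_{i} = 1 - \tfrac{3 ε}{2}.
\end{aligned}
$$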

We have a total of three Nash equilibria. All three are symmetric. The first two are bad and both players get the expected utility $1$ . The last one is good and both players get expected utility $2 - ε$ .
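A small exact-arithmetic check of the three equilibria, restricted to the four agents. This is a sketch: `BASE`, `expected_utility`, and `is_symmetric_nash` are illustrative names, and the test only verifies that every agent in the support earns the maximum utility against the same mixture.

```python
from fractions import Fraction

AGENTS = ["DB", "CB", "FB", "PB"]
DEPTH = {"DB": 0, "CB": 0, "FB": 1, "PB": 2}

# BASE[a][b]: utility of a against b before the modeling penalty.
BASE = {
    "DB": {"DB": 1, "CB": 3, "FB": 1, "PB": 1},
    "CB": {"DB": 0, "CB": 2, "FB": 2, "PB": 0},
    "FB": {"DB": 1, "CB": 2, "FB": 2, "PB": 2},
    "PB": {"DB": 1, "CB": 3, "FB": 2, "PB": 2},
}


def expected_utility(agent, mix, eps):
    """Expected utility of a pure agent against an opponent mixture."""
    return sum(mix[b] * BASE[agent][b] for b in AGENTS) - eps * DEPTH[agent]


def is_symmetric_nash(mix, eps):
    """True if every agent played with positive probability is a best response."""
    utils = {a: expected_utility(a, mix, eps) for a in AGENTS}
    best = max(utils.values())
    return all(utils[a] == best for a in AGENTS if mix[a] > 0)


eps = Fraction(1, 10)
pure_defect = {"DB": 1, "CB": 0, "FB": 0, "PB": 0}
bad_mix = {"DB": 1 - eps, "CB": 0, "FB": eps, "PB": 0}
good_mix = {"DB": 0, "CB": eps, "FB": 1 - 3 * eps / 2, "PB": eps / 2}
```

For `eps = Fraction(1, 10)`, all three mixtures pass `is_symmetric_nash`, while the pure FairBot mixture does not, since CooperateBot earns 2 against it.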

Next, we have to show that this outcome is also an equilibrium in the game where both players can play any modal agent. If another agent were to get utility more than $2 - ε$ against this $\frac{ε}{2}$ PrudentBot, $ε$ CooperateBot, $1 - \frac{3 ε}{2}$ FairBot combination, it would clearly have to be depth $1$ , since the only depth $0$ agents are CooperateBot and DefectBot, and depth $2$ PrudentBot already has the perfect behavior against these 3 agents. It would also have to mutually cooperate with FairBot, and defect against CooperateBot.

Suppose a depth $1$ agent $X$ provably cooperates with FairBot and defects against CooperateBot. Note that since $X$ is depth $1$ , $P A + 1$ knows what $X (C B)$ is, and so $P A + 1$ knows that $X (C B) = ⊥ \neq ⊤ = X (F B)$ . However $P A + 1$ also knows that $□ □ ⊥ \to \forall Y □ (F B (Y) = ⊤ = C B (Y))$ . Thus $P A + 1$ knows $□ □ ⊥ \to X (F B) = X (C B)$ , so $P A + 1$ knows $\neg □ □ ⊥$ , contradiction. Therefore, no agent of depth $1$ can have the desired behavior, and our good Nash equilibrium is in fact a Nash equilibrium of the game where both players can play any modal agent.

Evolutionary Simulations

Next, we will consider what happens when a large population of modal agents evolves by playing prisoners' dilemmas. We will consider a system consisting only of DefectBot, CooperateBot, FairBot, and PrudentBot. We will consider a simple model where at each time step, each population grows a small amount proportional to the size of that population times the expected utility a member in that population gets by playing a prisoners' dilemma with a random other agent. Notice that in this model, a population that starts out nonzero will remain nonzero, and a population that starts out 0 will remain 0.
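One way to sketch a single time step of this model is shown below. It is not the simulation code used for this post: the step size `dt`, the renormalization to population shares, and the approximation that a "random other agent" is drawn from the whole population (including the agent's own type) are all assumptions of this sketch.

```python
from fractions import Fraction

AGENTS = ["DB", "CB", "FB", "PB"]
DEPTH = {"DB": 0, "CB": 0, "FB": 1, "PB": 2}

# BASE[a][b]: utility of a against b before the modeling penalty.
BASE = {
    "DB": {"DB": 1, "CB": 3, "FB": 1, "PB": 1},
    "CB": {"DB": 0, "CB": 2, "FB": 2, "PB": 0},
    "FB": {"DB": 1, "CB": 2, "FB": 2, "PB": 2},
    "PB": {"DB": 1, "CB": 3, "FB": 2, "PB": 2},
}


def replicator_step(shares, eps, dt):
    """Grow each type by dt * share * expected utility, then renormalize.

    `shares` maps each agent to its fraction of the population (summing to 1).
    """
    fitness = {
        a: sum(shares[b] * BASE[a][b] for b in AGENTS) - eps * DEPTH[a]
        for a in AGENTS
    }
    grown = {a: shares[a] * (1 + dt * fitness[a]) for a in AGENTS}
    total = sum(grown.values())
    # Renormalizing (an assumption of this sketch) tracks proportions only.
    return {a: grown[a] / total for a in AGENTS}
```

Because growth is multiplicative, a type with share 0 stays at 0 and a type with positive share stays positive, as noted above. Iterating `replicator_step` from a chosen starting mixture is one way to explore the dynamics discussed in this section.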

Since the above three Nash equilibria were symmetric Nash equilibria, they are also equilibria of this evolutionary system. This is because the expected utility of being each type of agent that appears any nonzero amount is the same. Since this system keeps populations that start at 0 at 0, there are other equilibria too; for example, any point where all the agents are the same is in equilibrium since it cannot change.

We can then ask about the stability of these equilibria. The pure defect one is stable, since if there are only a small number of agents that are not DefectBots, they won't play each other enough to cancel out their modeling penalty. The $1 - ε$ DefectBot, $ε$ FairBot one is unstable, since the more FairBots there are, the better the FairBots do. Unfortunately, the good equilibrium is also unstable. This is harder to see, but a small perturbation will cause a very slow spiraling outward. (I checked this with simulation, but did not prove it mathematically.)

However, I ran a simulation that seemed to show that for a small $ε$ , and for almost all initial conditions that mix between all four types of agents, the system does not converge, but instead goes in a large cycle, in which the system spends most of its time with almost all FairBots. Eventually a very small number of CooperateBots climbs out to become non-negligible, then as soon as the CooperateBots are a significant proportion, the PrudentBots come in to exploit them, then the PrudentBots quickly turn into FairBots, and the cycle repeats. Meanwhile, the DefectBots are pushed down into a very small proportion of the population.

I also looked at a second model as follows: The population starts with a small finite number of agents of each type. At each time step, a new agent enters, and permanently becomes the modal agent that maximizes expected utility against the existing population. In this simulation, unless you start with very few FairBots or PrudentBots, you will simply switch between adding FairBots, PrudentBots, and CooperateBots, never adding a new DefectBot.
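The second model can be sketched as follows, restricted to the four agents. This is not the original simulation code: the tie-breaking rule (ties go to the agent listed first in `AGENTS`) and the names `entrant_type` and `simulate_entry` are assumptions of this sketch.

```python
from fractions import Fraction

AGENTS = ["DB", "CB", "FB", "PB"]
DEPTH = {"DB": 0, "CB": 0, "FB": 1, "PB": 2}

# BASE[a][b]: utility of a against b before the modeling penalty.
BASE = {
    "DB": {"DB": 1, "CB": 3, "FB": 1, "PB": 1},
    "CB": {"DB": 0, "CB": 2, "FB": 2, "PB": 0},
    "FB": {"DB": 1, "CB": 2, "FB": 2, "PB": 2},
    "PB": {"DB": 1, "CB": 3, "FB": 2, "PB": 2},
}


def entrant_type(counts, eps):
    """Agent type that maximizes expected utility against the current population."""
    total = sum(counts.values())
    utils = {
        a: Fraction(sum(counts[b] * BASE[a][b] for b in AGENTS), total)
        - eps * DEPTH[a]
        for a in AGENTS
    }
    # max() returns the first maximizer, so ties go to the earlier agent in AGENTS.
    return max(AGENTS, key=lambda a: utils[a])


def simulate_entry(counts, eps, steps):
    """Add one best-responding agent per step; return final counts and the entry order."""
    counts = dict(counts)  # integer head counts per type; copied, not mutated
    history = []
    for _ in range(steps):
        newcomer = entrant_type(counts, eps)
        counts[newcomer] += 1
        history.append(newcomer)
    return counts, history
```

Starting from one agent of each type with `eps = Fraction(1, 10)`, the first entrant in this sketch is a PrudentBot, and no DefectBot is added over 200 steps.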

A more careful study could actually find what the dynamics are mathematically rather than depending on simulations, and can consider systems with more or even all modal agents. There also might be other models that are more justified than the one I used. I expect that most ways of doing it will result in very little defection.

Conclusion

I like this result. I just assumed that DefectBot would be the only Nash equilibrium, and I was surprised that the story had a much happier ending. When I first became suspicious, I was thinking that you could get a good equilibrium out of an infinite collection of different agents, but it turns out you only need three. You can view the CooperateBots as defecting against the FairBots by refusing to pay to punish bad actors, and you can view the FairBots as defecting against the PrudentBots by refusing to pay to punish the non-punishers. You might think you would need to punish arbitrary levels of non-punishers to be in equilibrium, but it turns out that the PrudentBots can get paid exactly enough to remain competitive by exploiting the CooperateBots, and the CooperateBots can get exploited exactly enough to cancel out their unwillingness to model the opponent.

Comment by Jaime Sevilla:

I have been thinking about this research direction for ~4 days.

No interesting results, though it was a good exercise to calibrate how much I enjoy researching this type of stuff.

In case somebody else wants to dive into it, here are some thoughts I had and resources I used:

Thoughts:

The definition of depth given in the post seems rather unnatural to me. This is because I expected it would be easy to relate the depth of two agents to the rank of the world of a Kripke chain where the fixed points representing their behavior will stabilize. Looking at Zachary Gleit's proof of the fixed point theorem (see The Logic of Provability, chapter 8, by G. Boolos) we can relate the modal degree of a fixed point to the number of modal operators that appear in the modalized formula to be fixed. I thought I could go through Gleit's proof counting the number of boxes that appear in the fixed points, and then combine that with my proof of the generalized fixed point theorem to derive the relationship between the number of boxes appearing in the definition of two agents and the modal degree of the fixed points that appear during a match. This ended up being harder than I anticipated, because naively counting the number of boxes that appear in Gleit's proof produces very fast-growing formulas, and it's hard to combine them through the induction of the generalized theorem proof.

Resources:

The Logic of Provability, by G. Boolos. Has pretty much everything you need to know about modal logic. Recommended reading chapters 1,3,4,5,6,7,8.
Fixed point theorem of provability logic, by J. Sevilla. An in depth explanation I wrote in Arbital some years ago.
Modal Logic in the Wolfram Language, by J. Sevilla. A working implementation of Modal Combat, with some added utilities. It is hugely inefficient and Wolfram is not a good choice because of license issues, but it may be useful to somebody who wants to compute the result of a couple of combats or read about modal combat at an introductory level. You can open the attached notebook in the Wolfram Programming Lab.

Thank you Scott for writing this post, it has been useful to get a glimpse of how to do research.

AI ALIGNMENT FORUM