Sarah Constantine recently wrote about the Pavlov strategy for iterated prisoner's dilemma. She points out that it deserves to be better-known, given that it has similar virtues to Tit-for-Tat (TfT). I strongly agree.
In this post, I would like to explain how to generalize Pavlov to a wide set of games (achieving some surprisingly strong, but not actually very good, coordination properties), and also convey some intuitions about Pavlov-style vs TfT-style coordination.
The paper Achieving Pareto Optimality Through Distributed Learning by Marden, Young, and Pao describes a strategy which allows players in an iterated game to eventually converge to a Pareto-optimal action profile with high probability. In fact, the strategy described converges to action profiles which maximize the sum of all the utilities! That should sound too good to be true. Some big caveats are coming.
The strategy is, roughly, as follows. Agents have two different states: content and discontent. When content, the agent copies its own action from the previous round, whatever it was. When discontent, the agent plays randomly.
An agent flips between content and discontent based on how well things are going for it. There is a parameter, experimentation rate (e.r.), which determines how likely an agent is to flip states.
The nice property this strategy has is that for a low e.r., players end up eventually taking an efficient set of actions with high probability. It accomplishes this without agents knowing anything about each other; everyone acts on the payoffs they receive alone.
This is kind of nice, but looking at the strategy, we can see that the convergence rate will be very poor. The reason it works is that all agents have to become content at once in order to stay that way with significant probability. If only some agents become content, then the payoffs of the game will remain inconsistent, due to the agents who are still randomizing. As e.r. gets small, agents become very unlikely to become content, but comparatively very very very unlikely to become content when they've personally gotten a poor outcome. So, what happens overall is that everyone thrashes around for a long long time, ensuring that all the different possible combined action-states get explored. When everyone happens to become content at the exact same time, then they stay that way for a long long long time. This is an unlikely coincidence, but far more likely when the combined set of actions in a round maximizes joint utility.
The basics of TfT are to copy the other player's move from the previous round, whereas the basics of Pavlov are to copy your own move from the previous round if things went well, and otherwise change. This is captured nicely by the content/discontent idea.
On the other hand, the convergence behavior described -- where players flail around for a very long time before they settle on a jointly efficient strategy -- doesn't sound much like Pavlov.
In Prisoner's Dilemma, we (outside of the game) have an idea of what "good" and "bad" outcomes are, so that "copy your strategy if they cooperate; change your strategy if they defect" makes sense. The content/discontent algorithm, on the other hand, doesn't know what payoffs are reasonable to expect. That's why Pavlov is easily contented whereas the content/discontent algorithm isn't.
If we know ahead of time which outcomes are reasonable (which is easy in a symmetric game, but more generally has to rely on a notion of fairness such as the Nash Bargaining Solution), we can program agents to switch to "content" as soon as they see payoffs at least as high as what they can reasonably expect to get. This makes the strategy more like simple Pavlov. However, if the agents try to do this kind of reasoning, it will tend to make them exploitable: "bully" agents can convince them to be content with low payoffs by making high payoffs seem impossible.
Even so, it doesn't make sense to set e.r. extremely low, which is what's necessary to guarantee eventual Pareto-optimality. In practice you'd rather have a moderate e.r., accepting the chance of settling on suboptimal action profiles.
There's something of a trade-off between exploitability and convergence rate. If e.r. is high, then agents settle easily, so bullies can take advantage by only offering exploitative opportunities. If e.r. is moderate, then bullies have to be verrrry patient to pull that off. The generalized-Pavlov agent incentivises cooperation by being more likely to settle for better offers. If e.r. is low, though, the probability of becoming content is so low that there's no incentive for other agents to care; they'll just deal with your randomization in whatever way is optimal for them.
Using the MTG color wheel to discuss AI alignment has been popular recently, so... according to me, TfT is a very "white" strategy, and Pavlov is a very "red" strategy. (For the rest of the post, "Pavlov" and "TfT" will refer to more general clusters rather than the specific prisoner's dilemma strategies.)
Here's some discussion of what the strategy clusters look like: (I'll leave open the question of what strategies might correspond to the other MTG colors.)
Even within the domain of iterated prisoner's dilemma, there are many strategies which are still generally within the TfT cluster. Sarah discusses some of them.
Moving away from an iterated setting and into a logical one, TfT-like reasoning tries to copy the move of the other player on this round, rather than the previous round. Reasoning about the other player with logical uncertainty, a natural strategy is NicerBot, which cooperates with probability epsilon higher than its estimate of the other player's probability of cooperation. Adding epsilon ensures that two NicerBots cooperate with each other (otherwise they could end up converging to any probability of cooperation); more generally, NicerBot matches the cooperativeness of other players.
There is an important difficulty in generalizing TfT-like strategies further: you don't want to cooperate with a rock which has "I cooperate" engraved on it. In other words, it doesn't make sense to cooperate with non-agents. TfT is a strategy which makes sense when we've identified that we're dealing with another agent, but we need to combine its behavior with more general decision-theoretic reasoning, somehow.
In open-source game theory with proof-based agents, the Löbian handshake could be considered a TfT-like concept. However, unlike with NicerBot, we naturally only cooperate with other agents; we defect against a rock with "I cooperate" written on it. So, naively, it seems like we're in a better position to generalize TfT-like reasoning in pure logic than we are in logically-uncertain reasoning. This puzzle is discussed more in this post.
The Löbian handshake is still fragile to Agent Simulates Predictor type problems. One way of interpreting this (which I think is the right way) is that although a Löbian-handshake-capable agent won't cooperate with rocks, its notion of "rock" is anything with much less processing power than itself. This is problematic, because it only allows cooperation with similarly-intelligent agents.
The intuitive notion of logical causality I associate with TfT is one which combines the logically-uncertain reasoning of Nicerbot and the Löbian handshake of the proof-based setting. It is very TDT-ilke: you reason about your decision as if you have logical control over relevantly similar agents. You coordinate with other agents by reasoning in the same way about what the optimal joint strategy would be.
Pavlov-type reasoning doesn't have the same difficulties generalizing as TfT. Whereas TfT seems to require extra pieces added on to take advantage of non-agents (like PrudentBot), Pavlov does this naturally, while also naturally finding cooperative equilibria with other Pavlov agents.
It seems possible that Pavlov-cluster algorithms can do well in Agent Simulates Predictor -type problems, too. Pavlov happily defects against a rock because the rock exerts no agentic pressure toward cooperation. However, in Agent Simulates Predictor, the predictor does exert "agentic pressure" -- it makes things go poorly for agents who predictably defect. A Pavlov will respond to this by being less likely to settle on defection as a strategy. (This is very hand-wavy, however. It would be good to construct a setting in which illustrates this behavior.)
I've already described a generalization to deterministic iterated games, although the convergence behavior is admittedly quite bad (perhaps it can be improved). It isn't hard to generalize the strategy I described to games with stochastic payoffs, if one is willing to use hacks like aggregating iterations into batches to de-noise the payout information. It would be interesting to see a non-hacky way of generalizing it.
Moving to a more logical setting, we may again want to substitute the notion of "previous round" with "prediction". TfT-generalization matched its behavior to the predicted behavior of the other agent; Pavlov would instead copy its own predicted behavior! Or, rather: if Pavlov expects things to go well, it tries to do what it would expect itself to do. If Pavlov expects things to go poorly, it tries to invalidate its self-prediction.
(This doesn't seem obviously good, but, perhaps it can lead somewhere.)
I'm not sure what notion of logical control should be associated with Pavlov. It feels connected with Decision Theory is for Making Bad Outcomes Inconsistent: Pavlov is trying to "shake off" bad outcomes, as if that made them less probable, much like the (apparently) bad reasoning discussed in The Happy Dance Problem. The content/discontent algorithm shows that something like this can work in the long run (though not in the short run).
Another idea is that the content/discontent algorithm controls the stationary distribution of the Markov chain (the iterated play) by setting the transition probabilities so as to steer toward optimal outcomes preferentially. Perhaps a notion of logical control in which an agent thinks of itself as steering in this way?
Overall, the Pavlov side of things is very sketchy. As competing visions, TfT-like reasoning may hold up better in the long run. Still, it seems useful to try to flesh out Pavlov-type reasoning as an alternative.
Not super important but maybe worth mentioning in the context of generalizing Pavlov: the strategy Pavlov for the iterated PD can be seen as an extremely shortsighted version of the law of effect, which basically says: repeat actions that have worked well in the past (in similar situations). Of course, the LoE can be applied in a wide range of settings. For example, in their reinforcement learning textbook, Sutton and Barto write that LoE underlies all of (model-free) RL.
Somewhat true, but without further bells and whistles, RL does not replicate the Pavlov strategy in Prisoner's Dilemma, so I think looking at it that way is missing something important about what's going on.