A well-known issue in game theory is that of multiple Nash equilibria in coordination games.
For example, consider the following game. Players 1 and 2 both choose either "high" or "low". If both choose "high", they both get 10 utility; if both choose "low", they both get 5 utility; if they choose different actions, then they get no utility. There are 3 Nash equilibria: one where both always play "high", one where both always play "low", and one where each plays "high" with 1/3 probability and "low" with 2/3 probability. While (high, high) is obviously the best Nash equilibrium, unilaterally switching to "high" when the current equilibrium is (low, low) is probably unwise.
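The equilibrium payoffs above are easy to verify numerically. The following sketch, with a hypothetical `expected_payoff` helper, checks the two pure equilibria and the indifference condition behind the mixed one:

```python
# Hypothetical sketch: expected payoffs in the high/low coordination game.
# Players get the listed payoff when they match, and 0 on a mismatch.
PAYOFF = {("high", "high"): 10, ("low", "low"): 5,
          ("high", "low"): 0, ("low", "high"): 0}

def expected_payoff(p1_high, p2_high):
    """Player 1's expected payoff when each player plays "high" with the given probability."""
    total = 0.0
    for a1, q1 in [("high", p1_high), ("low", 1 - p1_high)]:
        for a2, q2 in [("high", p2_high), ("low", 1 - p2_high)]:
            total += q1 * q2 * PAYOFF[(a1, a2)]
    return total

print(expected_payoff(1, 1))  # pure (high, high): 10
print(expected_payoff(0, 0))  # pure (low, low): 5

# Mixed equilibrium: each plays "high" with probability 1/3. Against that mix,
# "high" and "low" pay the same (10/3), so each player is indifferent.
print(expected_payoff(1, 1/3), expected_payoff(0, 1/3))
print(expected_payoff(1/3, 1/3))  # value of the mixed equilibrium: 10/3
```

The indifference check is what makes the 1/3–2/3 mix an equilibrium: neither pure deviation improves on it.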
You could imagine TDT arguments based on symmetry between the different players to argue that "high" is the only rational choice, but it is easy to imagine asymmetric variants of this problem where the symmetry arguments are much less clear.
The presence of suboptimal Nash equilibria looks like a serious problem for reconciling individual and collective rationality, even when players have common payoffs. If everyone is going to play "low", then playing "low" is individually rational (at least according to CDT); however, the collective of agents fails to act rationally due to a coordination failure.
What I had hoped to do, before coming up with the idea in this post, was to somehow reduce collective rationality to individual rationality. Can we prescribe individual rationality conditions for each player, such that if each player follows these conditions, then a near-optimal outcome for everyone is attained?
Somewhat surprisingly, the answer is "kind of." It certainly seems like solving coordination problems requires global optimization instead of just local optimization, but under appropriate conditions, local optimization is actually sufficient.
Problem setup
We will consider the setting of normal-form common-payoff games. Let $n$ be the number of players. Let $A$ be some finite set of actions; each player $i \in \{1, \ldots, n\}$ will select an action $a_i \in A$. Let $U : A^n \to \mathbb{R}$ be the shared utility function. Each player receives the same utility $U(a_1, \ldots, a_n)$.
I will now assume that the players play some number of practice rounds before the "real" round. These practice rounds are very similar to those in the fictitious play algorithm. Ontologically, what are these practice rounds? One interpretation is that they are actual rounds of the game that have been played in the past; another is that they are a sequence of analogous one-shot rounds played by agents who have more and more compute and are thus able to simulate previous rounds. (The second interpretation might seem strange, but this is essentially how logical inductors achieve generalization bounds in decision theory problems.)
Let $T$ be the number of practice rounds. Let $a_i^t \in A$ be the action selected by player $i$ in practice round $t$. I will assume the payoffs that actually matter are those of round $T$, i.e. the last practice round is the "real" round.
We will assume that, in each round $t$, some shared random number $r_t \sim \mathrm{Uniform}(\{1, \ldots, n\})$ is generated, and that players' policies in round $t$ may depend on this number.
How are players' policies represented? We will assume that each player has only black-box access to each other's actions in previous practice rounds. Let $\pi_i : (A^n)^* \times \{1, \ldots, n\} \to \Delta(A)$ be player $i$'s policy, which selects an action distribution for player $i$ in round $t$ given knowledge of the actions in all previous rounds and of the current shared random number.
The generative model is now obvious: for each $t \in \{1, \ldots, T\}$ in sequence, sample $r_t \sim \mathrm{Uniform}(\{1, \ldots, n\})$ and then sample each $a_i^t \sim \pi_i(a^{1:t-1}, r_t)$ independently, where the notation $a^{1:t-1}$ refers to the list $(a^1, \ldots, a^{t-1})$.
Given a set of policies, it is possible to compute, for each $t$, the expected utility at that time step (which is the expected utility attained if $T = t$):

$$\bar{U}_t := \mathbb{E}[U(a^t)]$$
The limiting distribution of $a^t$ and the corresponding limit $\bar{U}_\infty := \lim_{t \to \infty} \bar{U}_t$ are interesting; they indicate how well a set of policies does in expectation after a large number of practice rounds.
Policies that solve the game
The policies we will consider are ones that usually stick with the same action they played in the last round, but sometimes switch to a different action, and are more likely to switch the worse the utility in the last round was.
Let $m$ be a lower bound on $U$. Let $\beta \geq 0$ be some "rationality parameter". Consider the following set of policies:

$$\pi_i(a^{1:t-1}, r_t) := \begin{cases} \mathrm{Uniform}(A) & \text{if } t = 1 \\ e^{\beta(m - U(a^{t-1}))} \cdot \mathrm{Uniform}(A) + \left(1 - e^{\beta(m - U(a^{t-1}))}\right) \cdot \delta_{a_i^{t-1}} & \text{if } t > 1 \text{ and } r_t = i \\ \delta_{a_i^{t-1}} & \text{if } t > 1 \text{ and } r_t \neq i \end{cases}$$

where $\delta_x$ is a delta distribution on $x$, and probability distributions are taken to be vectors.
These policies start out with each player selecting a random action. In each round, a random player is selected (according to $r_t$); this player switches their action to some uniformly random action with a probability that is lower the higher the utility in the previous round was, and otherwise sticks with the same action. Players other than the randomly selected player always stick with the same action.
An important aspect of these policies is that each player only does local optimization. That is, each player shifts their own action depending on the utility in the last round, such that over the course of many iterations they are more likely to take an action that attains a high utility given the other players' actions. Naively, one would expect a local optimization algorithm like this one to get stuck in traps such as the (low, low) Nash equilibrium, but this turns out not to be the case.
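To illustrate, here is a minimal simulation of these policies on the high/low game from the introduction; the function names, parameter values, and seed are assumptions of this sketch rather than part of the formal setup:

```python
import math
import random

# Sketch of the practice-round dynamics on the high/low game:
# n = 2 players, A = {"high", "low"}, lower bound m = 0, rationality
# parameter beta. Names here are illustrative.

ACTIONS = ["high", "low"]

def U(a):
    """Shared utility: 10 if both high, 5 if both low, 0 on a mismatch."""
    if a == ("high", "high"):
        return 10
    if a == ("low", "low"):
        return 5
    return 0

def run(T, beta, m=0.0, seed=0):
    rng = random.Random(seed)
    a = tuple(rng.choice(ACTIONS) for _ in range(2))   # round 1: uniform actions
    visits = {}
    for _ in range(T - 1):
        i = rng.randrange(2)                           # shared randomness picks a player
        if rng.random() < math.exp(beta * (m - U(a))):
            new = list(a)
            new[i] = rng.choice(ACTIONS)               # the switcher resamples uniformly
            a = tuple(new)
        visits[a] = visits.get(a, 0) + 1
    return a, visits

final, visits = run(T=50_000, beta=1.0)
frac_high = visits.get(("high", "high"), 0) / sum(visits.values())
print(frac_high)  # the chain spends most rounds at the optimal (high, high)
```

With $\beta = 1$, the chain occasionally escapes (low, low) via a mismatched state and then remains at (high, high) for long stretches, so the visit frequency of (high, high) dominates rather than the process getting stuck.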
We will prove the following theorem:
Theorem 1: For all $\epsilon > 0$ there exists $\beta$ such that $\lim_{t \to \infty} \bar{U}_t \geq \max_{a \in A^n} U(a) - \epsilon$.
i.e. by setting $\beta$ high enough, these policies achieve an arbitrarily-close-to-optimal expected utility in the limit.
Analyzing the process as a Markov chain
To prove this theorem, we can analyze the multi-round process with these policies as a Markov chain whose state is the action vector $a^t \in A^n$. The transition function $P$ (giving the distribution of $a^{t+1}$ given $a^t$) is:

$$P(a' \mid a) := \begin{cases} \left(1 - e^{\beta(m - U(a))}\right) + \frac{e^{\beta(m - U(a))}}{|A|} & \text{if } a' = a \\ \frac{1}{n} \cdot \frac{e^{\beta(m - U(a))}}{|A|} & \text{if } a' \text{ and } a \text{ differ in exactly one coordinate} \\ 0 & \text{otherwise} \end{cases}$$
Consider the following distribution over states:

$$\rho(a) := \frac{e^{\beta U(a)}}{\sum_{a' \in A^n} e^{\beta U(a')}}$$
i.e. it is a softmax that is more likely to select states with higher utilities.
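For the high/low example, this softmax distribution over states (proportional to $e^{\beta U(a)}$) can be computed exactly; a small illustrative sketch, with names that are assumptions of the sketch:

```python
import itertools
import math

# Exact softmax distribution over joint action profiles for the high/low game:
# rho(a) is proportional to exp(beta * U(a)).

ACTIONS = ["high", "low"]

def U(a):
    if a == ("high", "high"):
        return 10
    if a == ("low", "low"):
        return 5
    return 0

def rho(beta):
    states = list(itertools.product(ACTIONS, repeat=2))
    weights = {a: math.exp(beta * U(a)) for a in states}
    Z = sum(weights.values())                 # normalizing constant
    return {a: w / Z for a, w in weights.items()}

dist = rho(beta=2.0)
expected_u = sum(p * U(a) for a, p in dist.items())
print(dist[("high", "high")], expected_u)  # nearly all mass on the optimum
```

Even at a modest $\beta = 2$, the distribution puts almost all of its mass on (high, high), previewing the theorem.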
We will show:
Lemma 1: $\rho$ is the unique stationary distribution of this Markov chain.
Proof:
It is well-known (e.g. see this Wikipedia article) that an ergodic Markov chain satisfying detailed balance (a condition defined below) with respect to a probability distribution over states has that distribution as its unique stationary distribution.
The Markov chain is aperiodic, since every state has a nonzero probability of transitioning to itself. Additionally, it is irreducible (i.e. it is possible to get from any state to any other state), since any action in the state can change to any other action with nonzero probability while leaving the other actions fixed. Therefore, the Markov chain is ergodic.
This Markov chain satisfies a detailed balance condition relative to the state distribution $\rho$. Specifically, for any states $a$ and $a'$:

$$\rho(a) P(a' \mid a) = \rho(a') P(a \mid a')$$

i.e. if a state is sampled from $\rho$ and then transitioned once, the probability of a transition from $a$ to $a'$ equals the probability of a transition from $a'$ to $a$.
The proof of this detailed balance property proceeds by case analysis on $a$ and $a'$:

- If $a = a'$, then the condition is trivial.
- If $a$ and $a'$ differ on more than one element, then both transition probabilities are 0, so the condition is trivial.
- If they differ on a single element $i$, then, letting $Z := \sum_{a'' \in A^n} e^{\beta U(a'')}$ be the normalizing constant of $\rho$:

$$\rho(a) P(a' \mid a) = \frac{e^{\beta U(a)}}{Z} \cdot \frac{1}{n} \cdot \frac{e^{\beta(m - U(a))}}{|A|} = \frac{e^{\beta m}}{Z n |A|}$$

and similarly

$$\rho(a') P(a \mid a') = \frac{e^{\beta U(a')}}{Z} \cdot \frac{1}{n} \cdot \frac{e^{\beta(m - U(a'))}}{|A|} = \frac{e^{\beta m}}{Z n |A|}$$

so the two sides are equal.
Since the Markov chain is ergodic and satisfies detailed balance with respect to $\rho$, it follows that $\rho$ is the unique stationary distribution of the Markov chain.
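These detailed balance and stationarity conditions can be checked numerically for the high/low example; the kernel below mirrors the case analysis above, with illustrative names:

```python
import itertools
import math

# Numerical check, for the high/low game, that the one-round transition kernel
# satisfies detailed balance with respect to the softmax distribution rho
# (and hence that rho is stationary).

ACTIONS = ["high", "low"]
N = 2                                        # number of players
STATES = list(itertools.product(ACTIONS, repeat=N))
BETA, M = 1.0, 0.0                           # rationality parameter, lower bound on U

def U(a):
    if a == ("high", "high"):
        return 10
    if a == ("low", "low"):
        return 5
    return 0

def P(a2, a1):
    """Probability of transitioning from state a1 to state a2 in one round."""
    p = math.exp(BETA * (M - U(a1)))         # switch probability of the selected player
    diff = sum(1 for i in range(N) if a1[i] != a2[i])
    if diff == 0:
        return (1 - p) + p / len(ACTIONS)    # nobody's action changes
    if diff == 1:
        return (1 / N) * p / len(ACTIONS)    # exactly one player resamples
    return 0.0                               # simultaneous changes are impossible

Z = sum(math.exp(BETA * U(a)) for a in STATES)
rho = {a: math.exp(BETA * U(a)) / Z for a in STATES}

db_error = max(abs(rho[a] * P(b, a) - rho[b] * P(a, b))
               for a in STATES for b in STATES)
stat_error = max(abs(sum(rho[a] * P(b, a) for a in STATES) - rho[b])
                 for b in STATES)
print(db_error < 1e-12, stat_error < 1e-12)  # True True
```

Checking that each row of the kernel sums to 1 is also a useful sanity check that the case analysis covers all transitions.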
Now we will prove Theorem 1.
Theorem 1: For all $\epsilon > 0$ there exists $\beta$ such that $\lim_{t \to \infty} \bar{U}_t \geq \max_{a \in A^n} U(a) - \epsilon$.
Proof:
Clearly, $\lim_{t \to \infty} \bar{U}_t = \mathbb{E}_{a \sim \rho}[U(a)]$, since $\rho$ is the unique stationary distribution of the ergodic Markov chain defined above.
Define $U^* := \max_{a \in A^n} U(a)$, i.e. the highest achievable utility. Define $U' := \max_{a \in A^n : U(a) < U^*} U(a)$ to be the second-highest achievable utility (if every action profile is optimal, the theorem is trivial).
At this point, we need only set $\beta$ large enough that $\rho$ assigns almost all of its probability mass to optimal states. The argument that this is possible follows.
Fix $\epsilon > 0$. Set $\beta := \frac{\log\left(|A|^n (U^* - m)/\epsilon\right)}{U^* - U'}$.
Let $a^*$ be an optimal action profile, and let $a$ be a non-optimal action profile. We have

$$\frac{\rho(a^*)}{\rho(a)} = e^{\beta(U(a^*) - U(a))} \geq e^{\beta(U^* - U')} = \frac{|A|^n (U^* - m)}{\epsilon}.$$
Since there is at least one optimal state and fewer than $|A|^n$ non-optimal states, the ratio between the probability (under $\rho$) of a state being optimal and the probability of a state being non-optimal is at least $x := e^{\beta(U^* - U')}/|A|^n = (U^* - m)/\epsilon$. Since in general $\frac{x}{1+x} \geq 1 - \frac{1}{x}$ for positive $x$, this means that the probability of a state being optimal is at least $1 - \frac{\epsilon}{U^* - m}$. A non-optimal state has utility at least $m$, so the overall expected suboptimality (i.e. the difference between $U^*$ and $\mathbb{E}_{a \sim \rho}[U(a)]$) is at most $(U^* - m) \cdot \frac{\epsilon}{U^* - m} = \epsilon$, as desired.
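As a sanity check on this bound, the sketch below computes $\beta$ from $\epsilon$ using the formula above and verifies, on the high/low game, that the expected suboptimality under the softmax distribution is within $\epsilon$ (all names are illustrative):

```python
import itertools
import math

# Check of the proof's bound on the high/low game: set
#   beta = log(|A|^n * (U_star - m) / epsilon) / (U_star - U_prime)
# and verify that the expected suboptimality under the softmax
# distribution is at most epsilon.

ACTIONS = ["high", "low"]
STATES = list(itertools.product(ACTIONS, repeat=2))

def U(a):
    if a == ("high", "high"):
        return 10
    if a == ("low", "low"):
        return 5
    return 0

m = 0.0
utilities = sorted({U(a) for a in STATES}, reverse=True)
U_star, U_prime = utilities[0], utilities[1]   # best and second-best utility

epsilon = 0.1
beta = math.log(len(STATES) * (U_star - m) / epsilon) / (U_star - U_prime)

Z = sum(math.exp(beta * U(a)) for a in STATES)
expected_u = sum(math.exp(beta * U(a)) / Z * U(a) for a in STATES)
print(U_star - expected_u <= epsilon)  # True: suboptimality within epsilon
```

Here $|A|^n = 4$, $U^* = 10$, $U' = 5$, so $\beta = \log(400)/5 \approx 1.2$ already suffices for $\epsilon = 0.1$; the actual suboptimality is far smaller than the bound, since the bound is loose.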
Extension to non-common-payoff games
The policies defined above can be extended to play non-common-payoff games. Specifically, each agent can have a higher likelihood of switching action in each round the lower their own utility is. Unfortunately, this results in mutual defection in a prisoner's dilemma; while I won't present a detailed analysis here, the gist of the argument is that (C, C) is much more likely to switch into (C, D) or (D, C) than (C, D) or (D, C) is to switch into (C, C). So a different approach will be needed for achieving Pareto optimal outcomes in non-common-payoff games.
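A quick simulation supports this gist; the prisoner's dilemma payoffs and all names below are assumptions of the sketch, with each player's switch probability driven by their own payoff rather than a shared one:

```python
import math
import random

# Non-common-payoff variant on a prisoner's dilemma: each round one player is
# selected and switches with probability exp(beta * (m - u_i(a))), where u_i
# is that player's OWN payoff.

ACTIONS = ["C", "D"]
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def run(T, beta, m=0.0, seed=0):
    rng = random.Random(seed)
    a = tuple(rng.choice(ACTIONS) for _ in range(2))
    visits = {}
    for _ in range(T):
        i = rng.randrange(2)                  # shared randomness picks a player
        if rng.random() < math.exp(beta * (m - PAYOFFS[a][i])):
            new = list(a)
            new[i] = rng.choice(ACTIONS)
            a = tuple(new)
        visits[a] = visits.get(a, 0) + 1
    return visits

visits = run(T=50_000, beta=2.0)
# (C, C) leaks into (C, D)/(D, C), from which the exploited cooperator quickly
# defects; returning to (C, C) requires the unlikely event of a defector with
# payoff 5 switching. So defection dominates in the long run:
print(visits.get(("D", "D"), 0) > visits.get(("C", "C"), 0))  # True
```

This is the asymmetry described above: leaving (C, C) only requires a player with payoff 3 to switch, while re-entering it requires a player with payoff 5 to switch.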
Conclusion
In this post I have shown that, in common-payoff games under a multi-round structure, there is a set of policies that only does local optimization but which, together, yield a globally near-optimal outcome. This is important because getting global optimality from local optimization is required for decision theory (i.e. each agent taking an action based on its own expected utility) to produce Pareto optimal outcomes in common-payoff games.
In a previous post on coordination games, I presented a plan for solving formal decision theory:

1. Solve Cartesian common-payoff decision theory problems where players reason about each other using something like reflective oracles.
2. Extend the solution in 1 to Cartesian multiplayer game theory problems where players have different utility functions and reason about each other using something like reflective oracles.
3. Extend the solution in 2 to a logically uncertain naturalized setting.
This post has done something pretty close to solving 1, with the multi-round structure taking the place of reflective oracles. Unfortunately, step 2 fails, since the solution method does not extend to non-common-payoff games. So it appears that, for my plan to work, step 1 must be solved in a different way that can be extended to also solve non-common-payoff games.
Alternatively, the problem of non-common-payoff games may be attacked directly, though this seems difficult given that non-common-payoff games are more complex than common-payoff games and include all the dynamics present in them. But perhaps restricted subsets of general games, other than the set of common-payoff games, can be found and solved.