## AI ALIGNMENT FORUMAF

Point (1) seems to be a combination of an issue of working around the absence of a mathematically-elegant communication channel in the formalism, and an incentive to choose some orderings over others because of (2). If (2) is solved and they can communicate, then they can agree on an ordering without any trouble because they're both indifferent to which one is chosen.

If you don't have communication but you have solved (2), I think you can solve the problem by splitting agents into two stages. In the first stage, agents try to coordinate on an ordering over the points. To do this, the two agents X and Y each generate a bag of orderings Ox and Oy that they think might be Schelling points. Agent X first draws an ordering from Ox and tries to prove coordination on it in PA+1, then draws another with replacement and tries to prove coordination on it in PA+2, then draws another with replacement and tries to prove coordination on it in PA+3, etc. Agent Y does the same thing, with a different bag of proposed orderings. If there is overlap between their respective sets, then the odds that they will fail to find a coordination point fall off exponentially, albeit slowly.

Then in the second stage, after proving in PA+n for some n that they will both go through the same ordering, each will try to prove coordination on point 1 in PA+n+1, on point 2 in PA+n+2, etc, for the points they find acceptable.

# Modal Bargaining Agents

4 min read16th Apr 2015No comments

# 0

Summary: Bargaining problems are interesting in the case of Löbian cooperation; Eliezer suggested a geometric algorithm for resolving bargaining conflicts by leaving the Pareto frontier, and this algorithm can be made into a modal agent, given an additional suggestion by Benja.

# Bargaining and Fairness

When two agents can read each others' source code before playing a game, Löbian cooperation can transform many games into pure bargaining problems. Explicitly, the algorithm of "if I can prove you choose X, then I choose Y, otherwise I choose Z" makes it possible for an agent to offer a deal better for both parties than a Nash equilibrium, then default to that Nash equilibrium if the deal isn't verifiably taken. In the simple case where there is a unique Nash equilibrium, two agents with that sort of algorithm can effectively reduce the decision theory problem to a negotiation over which point on the Pareto frontier they will select.

However, if two agents insist on different deals (e.g. X insists on point D, Y insists on point A), then it falls through and we're back to the Nash equilibrium O. And if you try and patch this by offering several deals in sequence along the Pareto frontier, then a savvy opponent is just going to grab whichever of those is best for them. So we'd like both a notion of the fair outcome, and a way to still make deals with agents who disagree with us on fairness (without falling back to the Nash equilibrium, and without incentivizing them to play hardball with us).

Note that utility functions are only defined up to affine transformations, so any solution should be invariant under independent rescalings of the players' utilities. This requirement, plus a few others (winding up on the Pareto frontier, independence of irrelevant alternatives, and symmetry) are all satisfied by the Nash solution to the bargaining problem: choose the point on the Pareto frontier so that the area of the rectangle from O to that point is maximized.

This gives us a pretty plausible answer to the first question, but leaves us with the second: are we simply at war with agents that have other ideas of what's fair? (Let's say that X thinks that the Nash solution N is fair, but Y thinks that B is fair.) Other theorists have come up with other definitions of the fair solution to a bargaining problem, so this is a live question!

And the question of incentives makes it even more difficult: if you try something like "50% my fair equilibrium, 50% their fair equilibrium", you create an incentive for other agents to bias their definition of fairness in their own favor, since that boosts their payoff.

# Bargaining Away from the Pareto Frontier

Eliezer's suggestion in this case is as follows: an agent defines its set of acceptable deals as "all points in the feasible set for which my opponent's score is at most what they would get at the point I think is fair". If each agent's definition of fairness is biased in their own favor, the intersection of the agents' acceptable deals has a corner within the feasible set (but not on the Pareto frontier unless the agents agree on the fair point), and that is the point where they should actually achieve Löbian cooperation.

Note that in this setup, you get no extra utility for redefining fairness in a self-serving way. Each agent winds up getting the utility they would have had at the fairness point of the other agent. (Actually, Eliezer suggests a very slight slope to these lines, in the direction that makes it worse for the opponent to insist on a more extreme fairness point. This sets up good incentives for meeting in the middle. But for simplicity, we'll just consider the incentive-neutral version here.)

Moreover, this extends to games involving more than two agents: each one defines a set of acceptable deals by the condition that no other agent gets more than they would have at the agent's fairness point, and the intersection has a corner in the feasible set, where each agent gets the minimum of the payoffs it would have achieved at the other agents' fairness points.

# Modal Bargaining Agents

Now, how do we set this bargaining algorithm up as a modal decision theory? We can only consider finitely many options, though we're allowed to consider computably mixed strategies. Let's assume that our game has finitely many pure strategies for each player. As above, we'll assume there is a unique Nash equilibrium, and set it at the origin.

Then there's a natural set of points we should consider: the grid points within the feasible set (the convex hull formed by the Nash equilibrium and the Pareto optimal points above it) whose coordinates correspond to utilities of the pure strategies on the Pareto frontier. This is easier to see than to read:

Now all we need is to be sure that Löbian cooperation happens at the point we expect. There's one significant problem here: we need to worry about syncing the proof levels that different agents are using.

(If you haven't seen this before, you might want to work out what happens if any two of the following three agents are paired with one another:

• X returns A if PA proves its opponent returns A, else X returns D.
• Y returns B if PA proves its opponent returns B, else Y returns A if PA proves its opponent returns A, else Y returns D.
• Z returns C if PA proves its opponent returns C, else Z returns A if PA + Con(PA) proves its opponent returns A, else Z returns D.

Results in rot13: Nal gjb qvssrerag ntragf nobir erghea Q ntnvafg rnpu bgure, orpnhfr CN pna'g cebir vgf bja pbafvfgrapl, naq gurersber gur snpg gung n cebbs frnepu va CN snvyf pna arire or cebirq va CN.)

Benja suggested one way to ensure that Löbian cooperation happens at the right point: we assume that in addition to the payoff matrix, the agents are mutually aware of an ordering relation on the grid points. In order to land on the best mutually acceptable point (in the Pareto sense), it's merely necessary for the ordering to respect the "level" of grid points, defined as the total number of grid lines traversed along either axis from O to the grid point.

Then, we simply have each player look for cooperation at proof level N at the Nth point, skipping those points that are unacceptable to it. Since the best mutually acceptable point is the only acceptable point at its level (or any prior level), it will be chosen.

Again, this works for modal decision problems with more than two players.

# Open questions and nagging issues:

1. Is there a better way to resolve bargaining without incentivizing other agents to take extreme positions?
2. Is there some reasonable way to make this work without providing the canonical ordering of grid points?
3. If agents are each biased in their opponent's direction, this algorithm still gets a result, but in this case there is more than one grid point on the highest level of the mutually acceptable region, and thus the canonical ordering actually chooses the outcome!
4. If an agent's "fairness point" on the Pareto frontier is itself a mixed strategy profile rather than a pure one, and the other agent doesn't know which point that is, can this still work? In particular, if there is an ordering on the entire feasible set, and if two agents each add extra grid lines to the set based on their own fairness points (without knowing the others), is there an algorithm for selecting proof levels which guarantees that they will meet at the best mutually acceptable point that appears in both of their grids?