In the previous post, we looked at one way of handling externalities: letting other agents pay you to shift your decision. And we also considered the technique of aggregating those offers into an auction. This technique of "implementing a mechanism to handle an incentive misalignment" is extremely useful and seems like a promising avenue for future improvements to decision theory.

I want to frame mechanisms as "things that reshape incentives." Auctions, markets, and voting systems are all mechanisms: social technologies that can be invented by working backwards from a social welfare measure (a social choice theory) and designing a game such that players following their individual incentives will find themselves in socially high-ranking Nash equilibria.

I suspect that incorporating mechanism design more fully into decision theory will be extremely fruitful. Yudkowsky's probabilistic rejection algorithm[1] first identifies the socially optimal outcome (a fair Pareto optimum), and works backwards to identify a decision procedure which:

  • If universalized, leads to the socially optimal outcome
  • If best-responded to, still leads to the socially optimal outcome (stabilizing it as a Nash equilibrium)

The Robust Cooperation paper does the same thing for the Prisoners' Dilemma. Probabilistic rejection also only uses appropriate-threats, like sometimes rejecting unfair offers even at cost to oneself, leading it to degrade gracefully when negotiators disagree about what the socially optimal outcome is. The term "non-credible threat" is named that way because a classically-rational agent would never actually pay a cost "merely" for its incentive-reshaping effect on the other players. Not all non-credible threats are appropriate, but there are times when it's appropriate to pay costs to reshape the incentives of others.
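As a concrete illustration, here is a minimal sketch of probabilistic rejection in the Ultimatum game. The 10-unit pot, the 5/5 fair point, and the epsilon are assumptions for this example rather than the exact parameters from Project Lawful; the point is that the acceptance probability is tuned so that unfairness doesn't pay.

```python
import random

TOTAL = 10          # assumed pot size for this example
FAIR_SHARE = 5      # assumed fair split (the fair Pareto optimum)
EPSILON = 0.001     # how far below the fair payoff the proposer's expectation lands

def acceptance_probability(offer: float) -> float:
    """Probability of accepting `offer` (the amount given to us).

    Fair-or-better offers are always accepted. An unfair offer is accepted
    with probability p chosen so that the proposer's expected take,
    p * (TOTAL - offer), lands slightly below the fair payoff FAIR_SHARE.
    """
    if offer >= FAIR_SHARE:
        return 1.0
    return max(0.0, (FAIR_SHARE - EPSILON) / (TOTAL - offer))

def respond(offer: float) -> str:
    return "accept" if random.random() < acceptance_probability(offer) else "reject"

# A proposer who keeps 8 and offers 2 is accepted ~62% of the time,
# for an expected take of ~4.999 < 5: best-responders offer the fair split.
print(acceptance_probability(2))
```

Universalizing this policy lands both players at the fair split, and best-responding to it also means offering the fair split, which is exactly the pair of properties listed above.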

Policy Counterfactuals

There is a representation theorem by Joyce, which I understand to be something like "a Joyce-rational agent will choose an action $a$ that maximizes an expected utility expression that looks like $\sum_o P(a \rightarrow o; x)\, U(o)$." The Functional Decision Theory (FDT) paper frames the differences between major decision theories like CDT, EDT, and FDT as using different ways to compute the free parameter $P(a \rightarrow o; x)$. This parameter is an action-counterfactual, whose interpretation is something like "if I were to take this action $a$, what is the probability that outcome $o$ would occur, given some background knowledge $x$?"[2]
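To make the "free parameter" framing concrete, here is a toy sketch (mine, not from the FDT paper): the expected-utility loop is fixed, and the decision theory lives entirely in whichever counterfactual distribution $P(a \rightarrow o; x)$ gets plugged in.

```python
from typing import Callable, Dict, Iterable

# A counterfactual maps (action, background knowledge) to a distribution over outcomes.
Counterfactual = Callable[[str, dict], Dict[str, float]]

def best_action(actions: Iterable[str],
                utility: Dict[str, float],
                counterfactual: Counterfactual,
                background: dict) -> str:
    """Joyce-style choice rule: argmax_a of sum_o P(a -> o; x) * U(o).

    CDT, EDT, and FDT would all reuse this loop; they differ only in how
    the `counterfactual` distribution is computed.
    """
    def expected_utility(a: str) -> float:
        dist = counterfactual(a, background)
        return sum(p * utility[o] for o, p in dist.items())
    return max(actions, key=expected_utility)

# Hypothetical stand-in counterfactual for a one-shot bet, purely for illustration.
def naive_counterfactual(action: str, background: dict) -> Dict[str, float]:
    return {"win": 0.5, "lose": 0.5} if action == "bet" else {"pass": 1.0}

print(best_action(["bet", "decline"],
                  {"win": 10.0, "lose": -2.0, "pass": 0.0},
                  naive_counterfactual,
                  background={}))   # -> "bet"
```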

So we can look at "reshaping an agent's incentives" through the lens of "reshaping an agent's counterfactual expectations." We can think of other agents performing this scan over policies they could implement, looking for what will elicit the most desirable response from our decision theory. And the structure of our decision theory determines their counterfactual expectations about what those responses are.

Hardening Your Decision Theory

From this perspective, the assumption that other agents have legible, open-source access to our decision theory forces us to harden our decision theory's attack surface, rather than relying on security through obscurity.

Giving in to inappropriate-threats like blackmail is a software vulnerability which another decision theory might exploit. Similarly for failing to make appropriate-threats, like accepting any positive offer in the Ultimatum game or unconditionally Cooperating in every round of an infinitely-iterated Prisoners' Dilemma. (We want our decision theory to implement a strategy more like tit-for-tat, which Cooperates conditional on reciprocal Cooperation, or some other compensation.)
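As a familiar sketch (tit-for-tat is used here only because it's the classic example of conditional Cooperation, not because it's the uniquely correct strategy), such a hardened policy can be a tiny function of the interaction history:

```python
def tit_for_tat(my_history: list[str], their_history: list[str]) -> str:
    """Cooperate on the first round, then mirror the opponent's previous move.

    Cooperation is offered, but only conditionally: defecting against this
    policy buys one round of gains and then forfeits the surplus from mutual
    Cooperation, so best-responders are pushed back toward Cooperating.
    """
    if not their_history:
        return "C"
    return their_history[-1]

# An opponent with legible access to this code can see that "always Defect"
# earns less than "always Cooperate" over a long enough match.
```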

We want to design our decision theory so that, to the greatest extent possible, other agents find that best-responding to our decision theory leads to high-ranking outcomes according to our social choice theory.

Mechanism Counterfactuals

Software systems can simulate any computable mechanism, given enough computing power. That is enough for such a mechanism to influence the behavior of software agents: they can perform a logical handshake to act as if the mechanism "exists", even if it is never "really" implemented.

How do you act "as if" a voting system exists? By imagining how everyone would vote, if such a voting system existed, and then acting in accordance with the results. The same works for auctions, markets, negotiations, and anything else that reshapes incentives relative to the underlying strategic context. And these can be composed together into networks, like first imagining "as if" private property ownership-tags exist, and then imagining "as if" there were a market for goods on top of that consensus-imaginary property-rights layer.
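Here is a minimal sketch of that idea, under the assumption that each agent can model every participant well enough to predict their ballot; the participant names, the `predicted_ballot` models, and the plurality tally are placeholders.

```python
from collections import Counter
from typing import Callable, Dict, List

def counterfactual_vote(participants: List[str],
                        predicted_ballot: Callable[[str], str]) -> str:
    """Run an election that is never physically held.

    Each participant is modeled well enough to predict their ballot; the
    imaginary tally then binds behavior as if the mechanism were real.
    """
    tally = Counter(predicted_ballot(p) for p in participants)
    winner, _ = tally.most_common(1)[0]
    return winner

# Hypothetical predicted ballots, purely for illustration.
models: Dict[str, str] = {"AliceBot": "plan_A", "BobBot": "plan_B", "CarolBot": "plan_A"}
chosen_plan = counterfactual_vote(list(models), models.get)
# Every agent that runs this same computation converges on "plan_A" and can
# condition its behavior on that shared, never-instantiated result.
print(chosen_plan)
```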

State Channels

One specific architecture for this sort of thing is the state channel: a relatively simple smart contract serves as the dispute resolution mechanism for the whole scheme. Alice and Bob each deposit some funds with this smart contract, and then conduct most of their activity together without involving the blockchain at all. They exchange signed messages with each other, which enables them to prove the authenticity of those messages to the dispute resolution contract.

One of the most valuable features of a state channel is that it shapes the counterfactual expectations of each participant, so that each can safely treat the interaction "as if" it were happening with all of the security guarantees of the underlying blockchain, without all of the overhead. This includes being able to act "as if" further smart contracts had been deployed to the blockchain, and these virtual smart contracts can be composed together into networks within the state channel.

At the end of their interaction, Alice and Bob can inform the smart contract of the result and withdraw whatever funds they're entitled to. Neither has an incentive to distort the report in their favor, because each has enough cryptographic information to prove the actual result in the event of a dispute.
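The flow looks roughly like the sketch below. None of this is a real smart-contract API; the HMAC "signatures" stand in for digital signatures and `contract_accepts` stands in for the on-chain dispute logic. The point is that every off-chain state carries enough evidence to be enforced on-chain, which is what shapes each party's counterfactual expectations.

```python
import hashlib
import hmac
import json

def sign(secret: bytes, state: dict) -> str:
    """Stand-in for a digital signature over a channel state."""
    payload = json.dumps(state, sort_keys=True).encode()
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

# 1. Alice and Bob deposit funds with the dispute-resolution contract (not shown).
alice_key, bob_key = b"alice-secret", b"bob-secret"
state = {"nonce": 0, "alice": 50, "bob": 50}

# 2. They transact off-chain by exchanging successively signed states;
#    each new state has a higher nonce and both parties' signatures.
state = {"nonce": 1, "alice": 30, "bob": 70}
signatures = {"alice": sign(alice_key, state), "bob": sign(bob_key, state)}

# 3. On close (or dispute), the contract accepts the highest-nonce state that
#    carries valid signatures from both parties, so neither side can profit
#    by submitting a stale or distorted result.
def contract_accepts(state: dict, signatures: dict) -> bool:
    return (hmac.compare_digest(signatures["alice"], sign(alice_key, state)) and
            hmac.compare_digest(signatures["bob"], sign(bob_key, state)))

print(contract_accepts(state, signatures))  # True
```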

Logical Commitments

Software systems with legible access to each other's source code don't even need the overhead of a blockchain to hold a consensus model of other software systems in their heads. The legibility makes this logical line-of-sight transitive; any software system that AliceBot reasons about, BobBot can also reason about.

A general-purpose technology we'll want for open-source game theory is a logical commitment. (Or just commitment when it's clear from context.) When AliceBot can implement any policy $\pi \in \Pi$, a logical commitment $C \subseteq \Pi$ is the legible fact that AliceBot will only implement a policy $\pi \in C$ from this subset. When $C = \Pi$, this corresponds to the null commitment "I will implement a policy $\pi \in \Pi$." When $C = \{\pi\}$, this corresponds to the very specific commitment "I will implement exactly the policy $\pi$."
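One way to represent this, purely as a sketch (the `Policy` type and the helper names are assumptions, not established notation), is to treat a commitment as a predicate picking out a subset of the policy space:

```python
from typing import Callable, Set

Policy = str                              # placeholder type for a policy
Commitment = Callable[[Policy], bool]     # a commitment picks out a subset of policies

def null_commitment(policy_space: Set[Policy]) -> Commitment:
    """C = Π: 'I will implement some policy in Π.'"""
    return lambda pi: pi in policy_space

def exact_commitment(chosen: Policy) -> Commitment:
    """C = {π}: 'I will implement exactly the policy π.'"""
    return lambda pi: pi == chosen

def intersect(c1: Commitment, c2: Commitment) -> Commitment:
    """Combining commitments narrows the subset further."""
    return lambda pi: c1(pi) and c2(pi)
```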

We'll also want conditional commitments, which apply if some condition is true. FairBot offers the conditional commitment "if I can prove that you'll Cooperate with me, I'll Cooperate with you." It also offers the complementary commitment to cover the case where such a proof search fails: "In that case, I'll Defect." For any domain where a decision theory has a defined output, it is implicitly making commitments and conditional commitments.
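A sketch of that shape of conditional commitment is below. The real FairBot searches for a Löbian proof about the opponent's source code; here the proof search is replaced by a caller-supplied predicate, which sidesteps the self-reference the actual construction has to handle.

```python
from typing import Callable

def fairbot(opponent_source: str, prove_cooperates: Callable[[str], bool]) -> str:
    """Conditional commitment: 'if I can establish that you Cooperate with me,
    I Cooperate; if that search fails, I Defect.'

    `prove_cooperates` stands in for the bounded proof search used in the
    Robust Cooperation paper; here it is just a predicate over source code.
    """
    return "C" if prove_cooperates(opponent_source) else "D"

# Illustrative only: a "proof search" that merely pattern-matches the source.
def naive_search(src: str) -> bool:
    return "return 'C'" in src

print(fairbot("return 'C'", naive_search))   # -> "C"
print(fairbot("return 'D'", naive_search))   # -> "D"
```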

Finally, a joint commitment $C \subseteq \Pi_1 \times \cdots \times \Pi_n$ is a subset of a joint policy space $\Pi_1 \times \cdots \times \Pi_n$, and represents a commitment for each corresponding player. These can also be conditional.

In the next post we'll see an example of how networks of counterfactual mechanisms can be used to produce useful logical commitments.

  1. ^

    The algorithm is described in Project Lawful (so, spoilers), but it's here, and it's discussed without spoilers earlier in this sequence.

  2. ^

    I assume the theorem still makes sense if you think of agents as optimizing their global policy rather than their local action after Bayesian updating, but I haven't been able to look at the original paper. It looks to be available behind a paywall here. But policy optimization currently seems like the obvious way to go, so I'll go back to talking about policy counterfactuals.

    EDIT: gwern has put up a copy of the relevant chapter; see the discussion here for more details.

Comments

It looks to be available behind a paywall here.

Both the book & individual chapter (by DOI) are available in LG/SH. I've put up a copy.

Thank you! I'm interested in checking out earlier chapters to make sure I understand the notation, but here's my current understanding:

There are 7 axioms that go into Joyce's representation theorem, and none of them seem to put any constraints on the set of actions available to the agent. So we should be able to ask a Joyce-rational agent to choose a policy for a game.

My impression of the representation theorem is that a formula like $\sum_o P(a \rightarrow o; x)\, U(o)$ can represent a variety of decision theories, including ones like CDT which are dynamically inconsistent: they have a well-defined answer to "what do you think is the best policy?", and it's not necessarily consistent with their answer to "what are you actually going to do?"

So it seems like the axioms are consistent with policy optimization, and they're also consistent with action optimization. We can ask a decision theory to optimize a policy using an analogous expression: $\sum_o P(\pi \rightarrow o; x)\, U(o)$.

It seems like we should be able to get a lot of leverage by imposing a consistency requirement that these two expressions line up. It shouldn't matter whether we optimize over actions or policies; the actions actually taken should be the same.
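Spelled out (notation mine, extrapolated from the expressions above), the requirement is that the action prescribed by the optimal policy at the current decision point is also an optimal action:

$$\pi^* \in \operatorname{arg\,max}_{\pi \in \Pi} \sum_o P(\pi \rightarrow o; x)\, U(o) \quad \implies \quad \pi^*(x) \in \operatorname{arg\,max}_{a \in A} \sum_o P(a \rightarrow o; x)\, U(o),$$

where $\pi^*(x)$ is the action that policy prescribes given the background knowledge $x$.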

I don't expect that fully specifies how to calculate the counterfactual data structures $P(a \rightarrow o; x)$ and $P(\pi \rightarrow o; x)$, even combined with Joyce's 7 axioms. But those axioms alone didn't rule out dynamic or counterfactual inconsistency, and this consistency requirement should at least narrow our search down to decision theories that are able to coordinate with themselves at other points in the game tree.

One thing that's always seemed important, but that I don't know how to fit in, is the ecological equilibrium. E.g. it seems like the Chicken game (payoff matrix (((0,0),(1,2)),((2,1),(0,0)))) supports an ecosystem of different strategies in equilibrium. How does this mesh with any particular decision theory?

Totally! The ecosystem I think you're referring to is all of the programs which, when playing Chicken with each other, manage to play a correlated strategy somewhere on the Pareto frontier between (1,2) and (2,1).

Games like Chicken are actually what motivated me to think in terms of "collaborating to build mechanisms to reshape incentives." If both players choose their mixed strategy separately, there's an equilibrium where they independently mix $(2/3, 1/3)$ between Straight and Swerve respectively. But sometimes this leads to (Straight, Straight) or (Swerve, Swerve), leaving both players with an expected utility of only $2/3$ and wishing they could coordinate on Something Else Which Is Not That.

If they could coordinate to build a traffic light, they could correlate their actions and only mix between (Straight, Swerve) and (Swerve, Straight). A 50/50 mix of these two gives each player an expected utility of 1.5, which seems pretty fair in terms of the payoffs achievable in this game.
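As a sanity check of those numbers (the payoffs come from the parent comment; which row is Straight versus Swerve is my assumption):

```python
from itertools import product

# Payoff matrix from the parent comment, with matching moves both worth (0, 0).
payoff = {
    ("Swerve", "Swerve"):     (0, 0),
    ("Swerve", "Straight"):   (1, 2),
    ("Straight", "Swerve"):   (2, 1),
    ("Straight", "Straight"): (0, 0),
}

def expected(mix_row, mix_col):
    """Expected payoffs when the two players randomize independently."""
    eu = [0.0, 0.0]
    for r, c in product(mix_row, mix_col):
        p = mix_row[r] * mix_col[c]
        eu[0] += p * payoff[(r, c)][0]
        eu[1] += p * payoff[(r, c)][1]
    return tuple(eu)

independent = {"Straight": 2/3, "Swerve": 1/3}
print(expected(independent, independent))   # ≈ (0.667, 0.667), the mixed equilibrium

# Correlated 50/50 over (Straight, Swerve) and (Swerve, Straight):
print(0.5 * 1 + 0.5 * 2)                    # 1.5 for each player
```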

Anything that's mutually unpredictable and mutually observable can be used to correlate the actions of different agents. Agents that can easily communicate can use cryptographic commitments to produce legibly fair correlated random signals.
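A standard commit-reveal construction gives two communicating agents a shared random bit that neither can bias on their own. This sketch uses plain SHA-256 commitments and omits the message passing, domain separation, and timeout handling a real protocol would need.

```python
import hashlib
import secrets

def commit(value: bytes) -> tuple[bytes, bytes]:
    """Return (commitment, nonce) hiding `value` until the reveal step."""
    nonce = secrets.token_bytes(32)
    return hashlib.sha256(nonce + value).digest(), nonce

def verify(commitment: bytes, nonce: bytes, value: bytes) -> bool:
    return hashlib.sha256(nonce + value).digest() == commitment

# Each party picks a random byte and commits before seeing the other's choice.
alice_bit, bob_bit = secrets.token_bytes(1), secrets.token_bytes(1)
alice_commit, alice_nonce = commit(alice_bit)
bob_commit, bob_nonce = commit(bob_bit)

# After both commitments have been exchanged, the values are revealed and checked.
assert verify(alice_commit, alice_nonce, alice_bit)
assert verify(bob_commit, bob_nonce, bob_bit)

# The shared coin is the XOR: fair as long as at least one party randomized honestly.
shared_bit = (alice_bit[0] ^ bob_bit[0]) & 1
print(shared_bit)
```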

My impression is that being able to perform logical handshakes creates program equilibria that can be better than any correlated equilibrium. When the traffic light says the joint strategy should be (Straight, Swerve), the player told to Swerve has an incentive to actually Swerve rather than go Straight, assuming the other player is going to be playing their part of the correlated equilibrium. But the same trick doesn't work in the Prisoners' Dilemma: a traffic light announcing (Cooperate, Cooperate) doesn't give either player an incentive to actually play their part of that joint strategy. Whereas a logical handshake actually does reshape the players' incentives: they each know that if they deviate from Cooperation, their counterpart will too, and they both prefer (Cooperate, Cooperate) to (Defect, Defect).

I haven't found any results for the phrase "correlated program equilibrium", but cousin_it talks about the setup here:

AIs that have access to each other's code and common random bits can enforce any correlated play by using the quining trick from Re-formalizing PD. If they all agree beforehand that a certain outcome is "good and fair", the trick allows them to "mutually precommit" to this outcome without at all constraining their ability to aggressively play against those who didn't precommit. This leaves us with the problem of fairness.

This gives us the best of both worlds: the random bits can get us any distribution over joint strategies we want, and the logical handshake allows enforcement of that distribution so long as it's better than each player's BATNA (best alternative to a negotiated agreement). My impression is that it's not always obvious what each player's BATNA is, and in this sequence I recommend techniques like counterfactual mechanism networks to move the BATNA in directions that all players individually prefer and agree are fair.

But in the context of "delegating your decision to a computer program", one reasonable starting BATNA might be "what would all delegates do if they couldn't read each other's source code?" A reasonable decision theory wouldn't give in to inappropriate threats, and this removes the incentive for other decision theories to make them towards us in the first place. In the case of Chicken, the closed-source answer might be something like the mixed strategy we mentioned earlier: a $(2/3, 1/3)$ mixture between Straight and Swerve.

Any logical negotiation needs to improve on this baseline, which can make it a lot easier for our decision theory to resist threats. In the next post, for example, AliceBot can spin up an instance to negotiate with BobBot and basically ignore the content of that negotiation. Negotiator AliceBot can credibly say to BobBot: "Look, regardless of what you threaten in this negotiation, take a look at my code. Implementer AliceBot won't implement any policy that's worse than the BATNA defined at that level." And this extends recursively throughout the network, for instance if they perform multiple rounds of negotiation.