Work done during SERI MATS 3.0 with mentorship from Jesse Clifton. Huge thanks for all the feedback and discussions to Anthony DiGiovanni, Daniel Kokotajlo, Martín Soto, Rubi J. Hudson and Jan Betley! Also posted to EA forum.

Daniel's post about commitment races explains why they may be a severe problem. Here, I'll describe a concrete protocol that, if adopted, would let us avoid some cases of miscoordination caused by them.

TL;DR

The key ingredient is having a mandatory time delay, during which the commitments aren't yet binding. At the end of that delay, you decide whether to make your commitment binding or revert it, and this decision can be conditional on previous decisions of other participants. This in itself would give rise to new races, but they can be managed by adding some additional rules.

I think the biggest challenge would be to convince the "commitment infrastructure" (which I describe below) to adopt such a protocol.

Benefits

In the case of the game of chicken, the 3 rules listed below should often push whoever committed later to Swerve.
- In simple games like chicken, this may be achieved more easily, just by relying on conditional commitments (“If I came second, I Swerve”).^[1] But here we add another mechanism, the tentative commitment period, which gives us another nice feature:
In the real world, it may not be obvious that some commitments are incompatible. The tentative period gives the agents time to analyze the situation in depth and check whether any commitments clash. This is especially useful in highly multipolar cases where multiple parties try to commit at the same time, or where actions have complex consequences and interactions.
- We also don't need to know in advance all the actions the others can take - we can analyze their actions after they've already tentatively committed to them.
Even if participants manage to coordinate (so one Dares and one Swerves), the solution found hastily during a commitment race can still be quite poor. Boomerang enables bargaining that can Pareto improve on this hasty solution.

Necessary ingredients

The protocol relies on some mechanism M on which agents can make commitments - a "commitment infrastructure". M could be something like the Ethereum network, or some powerful international body.

We require that:

1. When someone publishes a commitment, M arrives at a consensus about the time at which the commitment was published.
2. It’s more convenient/cheap/credible to make some commitments on M than outside of it.

Requirement 2 is needed because the protocol relies on certain commitments being forbidden. Agents could decide to make those forbidden commitments outside of M, so we need to make that as hard as possible for them, compared to committing on M. I think this is the hardest part of the whole proposal. M would need to be locked into place by a network effect - everyone is using M because everyone else is using M.

Protocol

Here are the rules:

R1: All commitments have a mandatory tentative period, meaning that they only become binding after some fixed time T (we can say that freeze_time = publish_time + T).
- So you have no way to make a commitment credible before freeze_time (if you were allowed to end the tentative period early, this would create a new race to end it as soon as possible).
R2: During the tentative period, you can still decide to revert your commitment.
- Sometime before freeze_time, you send M your final decision (whether you revert or not), but hashed.^[2]
  - You also need to add random salt to your decision before hashing, so that it cannot be revealed through brute-forcing.
- After freeze_time you reveal the decision (and it must match the previously sent hash, otherwise M would reject your commitment).
  - So in the analogy of the game of chicken - you threw out your steering wheel, but here it boomerangs back to you, giving you your last chance to catch it.
- You shouldn’t be able to reveal the final decision to anyone before freeze_time because we don’t want the commitment to get credible before freeze_time.
  - To ensure that, we add a rule that anyone who knows the final decision before freeze_time has the power to revert the commitment.
  - Now, if you reveal the decision to your opponent, they will probably break your commitment.
R3: Your final decision is allowed to be conditional on the final decision of some other commitment, if and only if your freeze_time comes after the freeze_time of that other commitment.
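The hash-then-reveal step of R2, together with the timing checks of R1 and R3, can be sketched roughly as follows. This is only an illustration under my own assumptions: SHA-256 as the hash, a 32-byte random salt, times measured in seconds, and all function names are hypothetical rather than part of any existing system.

```python
import hashlib
import hmac
import secrets
from typing import Tuple

T = 60  # length of the tentative period in seconds (assumed value)


def freeze_time(publish_time: float) -> float:
    """R1: a commitment published at publish_time becomes binding at publish_time + T."""
    return publish_time + T


def seal(decision: str) -> Tuple[str, bytes]:
    """R2: hash the final decision with a random salt.

    Only the digest is sent to M before freeze_time; the salt stays private,
    so the decision cannot be recovered by brute-forcing the few possible messages.
    """
    salt = secrets.token_bytes(32)
    digest = hashlib.sha256(salt + decision.encode("utf-8")).hexdigest()
    return digest, salt


def reveal_is_valid(digest: str, salt: bytes, decision: str) -> bool:
    """Check done by M after freeze_time: the revealed decision must match the earlier hash."""
    recomputed = hashlib.sha256(salt + decision.encode("utf-8")).hexdigest()
    return hmac.compare_digest(recomputed, digest)


def may_condition_on(own_freeze: float, other_freeze: float) -> bool:
    """R3: you may condition only on commitments whose freeze_time is strictly earlier."""
    return own_freeze > other_freeze
```

Because anyone who learns the decision and salt before freeze_time could use them to revert the commitment, the committer has a reason to keep both secret until the reveal.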

Those rules may seem like a lot, but I think they (or some comparably complex set of rules) are all needed if we want to avoid creating new races later in time. The aim is to have only one race, at the very beginning, and everything else should be calm, non-racy and completely independent of agents' speed of making commitments (e.g. what their ping is, or how well connected they are with the commitment infrastructure).

Example

We have a modified game of chicken with the following payoffs:

if you both Dare, you die, which is worth -100 utils
if you Dare and your opponent Swerves, you prove that you're a badass which is worth 10 utils
if you Swerve, you drive into a shrubbery, which ruins your car's awesome paint job, which is worth -20 utils
there also may be some additional actions available, but they are not obvious
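To make the incentives concrete, here is a minimal sketch of these payoffs and of the best response to a known action of the opponent. It assumes only the two obvious actions, and that Swerving costs -20 regardless of what the opponent does; the function names are mine.

```python
DARE, SWERVE = "Dare", "Swerve"


def payoff(mine: str, theirs: str) -> int:
    """Payoff in utils for the player choosing `mine` against an opponent choosing `theirs`."""
    if mine == SWERVE:
        return -20  # shrubbery ruins the paint job, whatever the opponent does
    return -100 if theirs == DARE else 10  # crash, or prove you're a badass


def best_response(theirs: str) -> str:
    """The action that maximizes your payoff once the opponent's action is fixed."""
    return max((DARE, SWERVE), key=lambda mine: payoff(mine, theirs))
```

Once one player's Dare is binding, the other's best response is to Swerve (-20 instead of -100), which is why knowing who committed first matters so much.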

Let's set the length of the tentative period at one minute, and let’s say that they have 3 minutes before they potentially crash into each other.

0:00 - Race starts.
0:01 - Alice publishes a commitment "I Dare" - it's like throwing her steering wheel out the window - the wheel will "boomerang back" at 1:01 at which point if Alice doesn't "catch it", the commitment becomes final.
0:02 - Bob didn't see in time that Alice threw out the wheel, so he publishes a commitment "I Dare" - it's like throwing his steering wheel out the window - the wheel will boomerang back at 1:02. At this point, in a regular game of chicken they would be doomed. But here, there's still hope.
0:53 - Bob sends out Hash(“If Alice doesn't revert her commitment to Dare, I Revert this commitment”)^[3]^[4]
0:55 - Alice sends out Hash(“I don't revert”)
1:04 - Bob reveals the original decision: “If Alice doesn't revert her commitment to Dare, I Revert this commitment”
1:07 - Alice reveals the original decision: “I don't revert”
1:07 - M makes Alice’s “I don't revert” binding, and then also resolves Bob’s decision to “I Revert this commitment [to Dare]”. Since Alice is now committed to Dare, Bob later Swerves.
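The resolution step that M performs at the end of this timeline could look roughly like the sketch below. It is a simplification under my own assumptions: each decision is either unconditional or conditions on exactly one other commitment, times are seconds since the race started, and I assume Bob keeps his Dare if Alice reverts hers (his message doesn't say). The `Decision` class and `resolve` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Decision:
    freeze_time: int                     # seconds since the race started
    keep: bool = True                    # used when the decision is unconditional
    depends_on: Optional[str] = None     # commitment whose final decision we condition on
    keep_if_other_kept: bool = True      # used only when depends_on is set
    keep_if_other_reverted: bool = True  # used only when depends_on is set


def resolve(decisions: Dict[str, Decision]) -> Dict[str, bool]:
    """Resolve revealed decisions in freeze_time order; True means the commitment stays binding."""
    kept: Dict[str, bool] = {}
    for name, d in sorted(decisions.items(), key=lambda item: item[1].freeze_time):
        if d.depends_on is None:
            kept[name] = d.keep
            continue
        if decisions[d.depends_on].freeze_time >= d.freeze_time:
            raise ValueError(f"{name} conditions on a commitment that does not freeze earlier (R3)")
        other_kept = kept[d.depends_on]
        kept[name] = d.keep_if_other_kept if other_kept else d.keep_if_other_reverted
    return kept


decisions = {
    "alice_dare": Decision(freeze_time=61, keep=True),
    "bob_dare": Decision(freeze_time=62, depends_on="alice_dare",
                         keep_if_other_kept=False, keep_if_other_reverted=True),
}
outcome = resolve(decisions)  # Alice's Dare stays binding, Bob's Dare is reverted
```

Processing in freeze_time order is what makes the result independent of when the hashes or reveals arrived.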

Note that in principle at 0:53 Bob could instead decide to unconditionally Dare even though he is second, hoping that Alice may be too scared to Dare.

But with Boomerang, such ruthless Daring is much less likely than without it. At the time of decision, Alice and Bob have shared knowledge of who is first, and only the second one can make a conditional commitment. This breaks the symmetry of the original game of chicken. The option of making the conditional commitment (when you have that option) is pretty compelling - it is safe, and it still takes opportunities when they arise. Additionally, it would create a focal point of what the participants are "supposed to do" - everyone expects that the first committer gets to Dare and the second must make a conditional commitment, and deviating from this equilibrium would only hurt you.

Addition of bargaining

With the three rules described above, we managed to avoid the most catastrophic outcome. But the outcome we end up with is still pretty poor, because the initial commitments were chosen with almost zero thought. If agents later notice some Pareto improvement, then to move to this new solution the first agent (Alice) would need to revert her first commitment and give up her privileged position. To be willing to do that, Alice would need a guarantee from the second agent (Bob) that he will also revert. But in the existing protocol, Alice cannot have such a guarantee, because after Alice reverts, Bob could still do whatever - R3 forbids conditioning on commitments that come after yours.

To fix that, we can add another rule:

R4: You can allow some other commitment to condition on your commitment even if its freeze_time comes before yours, but its owner still has the right to reject this option.
- This right to reject may seem counter-intuitive, but being unable to condition on others is actually a privilege. It makes your commitment more credible, and it is the others who are pushed to Swerve.
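One way to picture how R4 extends R3 is as a permission check that M runs before accepting a conditional decision. The sketch below is only illustrative; the `Commitment` class, its fields and the commitment names are hypothetical, and I assume that equal freeze times are treated like "earlier" (so they need an R4 offer).

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Commitment:
    name: str
    freeze_time: int
    # R4: names of earlier commitments that this one allows to condition on it
    offers_conditioning_to: Set[str] = field(default_factory=set)
    # R4: names of later commitments whose offers this one has rejected
    rejected_offers_from: Set[str] = field(default_factory=set)


def may_condition_on(mine: Commitment, other: Commitment) -> bool:
    """Return True if `mine` is allowed to make its final decision depend on `other`."""
    if mine.freeze_time > other.freeze_time:
        return True  # R3: conditioning on the past is always allowed
    offered = mine.name in other.offers_conditioning_to
    rejected = other.name in mine.rejected_offers_from
    return offered and not rejected  # R4: needs an offer that was not rejected
```

For example, with `alice = Commitment("alice_dare", 61)` and `bob = Commitment("bob_stop", 97, offers_conditioning_to={"alice_dare"})`, `may_condition_on(alice, bob)` holds until Alice adds `"bob_stop"` to her `rejected_offers_from`.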

It may be tricky to see how that helps, so let's rerun our example with that new rule:

0:00 - Race starts.
0:01 - Alice throws her steering wheel out the window.
0:02 - Bob throws his steering wheel out the window.
0:37 - Bob realizes that they can Pareto improve over the previous outcome! They could just both stop, and he will publicly declare that Alice is more badass than him, and also pay her one util. This gives him a payoff of -1 instead of -20, and for Alice +11 instead of +10. He tentatively commits to do this, if Alice reverts her commitment to Dare. (Bob's new commitment can become final at 1:37.) He also allows Alice to condition her decision (at 1:01) on his decision (at 1:37).
0:53 - Bob sends out Hash(“If Alice doesn't revert her commitment to Dare, I Revert this commitment”)
0:55 - Alice sends out Hash(“If Bob doesn’t Revert that commitment from 0:37, I revert my commitment to Dare”)
1:04 - Bob reveals the original decision: “If Alice doesn't revert her commitment to Dare, I Revert this commitment”
1:07 - Alice reveals the original decision: “If Bob doesn’t Revert that commitment from 0:37, I revert my commitment to Dare”
1:31 - Bob sends out Hash(“Follow through with the new commitment”)
1:39 - Bob reveals the original decision: “Follow through with the new commitment”
1:39 - M makes Bob’s plan binding - he must now stop, declare Alice to be more badass and pay her; then M resolves Alice’s conditional commitment to “I revert my commitment to Dare”. The fact that Bob stopped his car on the tracks, and that they've already established that she's the badass, makes her stop too.
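As a quick sanity check that Bob's proposal from 0:37 really is a Pareto improvement, we can compare the payoffs stated in the example. This is just the standard definition (no one worse off, someone strictly better off) applied to those numbers; the helper name is mine.

```python
from typing import Dict


def is_pareto_improvement(new: Dict[str, int], old: Dict[str, int]) -> bool:
    """True if no agent is worse off under `new` and at least one is strictly better off."""
    no_one_worse = all(new[agent] >= old[agent] for agent in old)
    someone_better = any(new[agent] > old[agent] for agent in old)
    return no_one_worse and someone_better


hasty = {"Alice": 10, "Bob": -20}   # Alice Dares, Bob Swerves
bargain = {"Alice": 11, "Bob": -1}  # both stop, Bob concedes and pays one util
```

Here `is_pareto_improvement(bargain, hasty)` holds, while the reverse comparison does not.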

We could even have a chain of multiple commitments “conditioning on the future”. In practice we may want to limit that somehow, so that the resolution cannot be delayed indefinitely.

Future work

Adoption.
- What are the most tractable ways of implementing this protocol? E.g. if agents make the commitments on the Ethereum network or in some international legal body, how do we convince all the nodes in the network / participating nations to enforce the protocol?^[5]
- It's probably easiest to start by convincing a handful of the most important players. Then the rest of the players may gradually follow.
- Once you have that system in place, to lock it in, participants may want to penalize those who make commitments using some alternative mechanisms.^[6] (h/t Jan Betley)
Reputation effects.
- You may choose to make ruthless unconditional commitments (like Daring when you know someone else already committed to Dare) to build a ruthless reputation. This way, in future interactions you may hope that others will Swerve even when they committed first. Could this buildup of ruthless reputation be prevented? E.g. we could require full anonymity of the agents, or be more selective and somehow “erase the identity” of whoever behaves ruthlessly.^[7]
Gaming the system using proxies.
- The protocol could be broken if you state your decision publicly, and then pay some proxy to salt and hash this decision and upload the hash in your name (by somehow authorizing it to do so, e.g. giving it your private key). As long as the proxy doesn’t reveal the salt to anyone, no one will invalidate your commitment. And to the extent that the proxy is known to behave reliably, your publicly stated decision is credible. How do we break such proxies or make them impractical?
Can you find other ways to break the protocol?

Appendix

Some non-crucial technical details that you may want to skip:

rule R4 gives rise to a kind of iterated ultimatum game:
- Alice could reject the option to condition on Bob's new commitment, making that new commitment useless
- it would force Bob to propose a new one, with a split more favorable to Alice
- this rejecting+offering may continue right up until freeze_time, so it has some potential for miscoordination (Alice takes the risk of rejecting a very late offer, and Bob doesn't have enough time to offer a new one)
- so we may need to add a limit, that you can’t offer this option closer to their freeze_time than some duration, and also they need to wait some (shorter) duration before rejecting your option
- this way it would be clear that some offer is the last one
cycles
- conditioning on the future could make some commitments rely on themselves in a circular way
- the easiest way to prevent it is that when you allow an earlier commitment C1 to condition on your C2, you must give up the power to condition on anything between C1 and C2
- I’m not sure yet if that’s enough for more complex graphs of conditioning - anyway, we could somehow calculate which periods you cannot condition on
- or, a potentially more powerful but tricky option, is to embrace the circularity, and if we have multiple ways to satisfy a circular commitment, choose a Pareto optimal way
  - this has a similar vibe to open-source game theory
  - agents would need to state their preferences for all those possible ways
  - if there are multiple Pareto optimal solutions, probably agents with commitments earlier on the timeline should have priority
  - if there is no way to satisfy the cycle, we should break it and probably again resolve in favor of earlier committers, but I’m not sure about the details here
we may get another race to lock in the hash of your final decision right after you published your tentative commitment, because if you lock in very soon, that weakly indicates that you decided not to revert
- to fix that, we may also require that you are only allowed to lock in the hash of your final decision after some time has passed since the start of the tentative period
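Regarding the cycles item above: whatever rule is chosen to prevent or resolve circular conditioning, M would first need to detect it. A minimal sketch, assuming the "conditions on" relation is stored as a plain dictionary from each commitment name to the names it conditions on (a representation I made up for illustration), is a depth-first search:

```python
from typing import Dict, List, Optional


def find_cycle(conditions_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a list of names (first == last), or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for nxt in conditions_on.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:  # reached a node already on the current path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(conditions_on):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None
```

For instance, `find_cycle({"C1": ["C2"], "C2": ["C1"]})` reports the loop between C1 and C2, while the conditioning graph from the bargaining example (Bob's Dare conditions on Alice's Dare, which conditions on Bob's new commitment) has no cycle.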

^[1]
This would only work in very simple cases like chicken, because you would need to know in advance all the possible commitments that others can make, so that you can define what exactly "being second in a race" means.
^[2]
An alternative rule could be to have M generate some random number at freeze_time, and only then can an agent make the final decision, because we require them to reference that number in the decision message. But that could create a race, where the second committer decides to Dare anyway, because they hope this information will reach the first committer soon enough to sway them. For this reason we would need to postpone the generation of the second committer's random number, to wait for the first committer's decision. But if the protocol is used by a lot of agents at the same time, and we play it safe and assume that everyone may potentially clash with anyone, then we have to postpone every commitment on the network, which scales badly.
^[3]
To be clear, the decisions would actually be written as formal statements, not natural language, and also explicitly state which commitments they reference.
^[4]
The order of sending these hashes is irrelevant here. That's why Bob can send that hash first, even though he's the second committer.
^[5]
It may be better to adopt Boomerang sooner rather than later: after someone has already established a strategic advantage that lets them commit more ruthlessly, they will oppose the adoption of such a protocol. But agents should be keener to accept the protocol if they don't know yet if they'll be the advantaged or disadvantaged ones.
^[6]
This works best if commitments on those alternative mechanisms are crisp, so that you can clearly define what will be penalized. E.g. committing through smart contracts is crisper than committing through staking your reputation.

But this penalization may be tricky, because it's costly for the penalizer, and you would prefer others to carry this cost. So it requires participants to coordinate to all penalize together. Here's an example technique which may help.
^[7]
But if we require full anonymity, we lose any positive reputation effects we had. And if we “erase the identity” of whoever behaves ruthlessly, then encountering someone with a fresh identity serves as evidence that they are ruthless, defeating the purpose of this erasure.
