The "Commitment Races" problem

by Daniel Kokotajlo5 min read23rd Aug 201924 comments

36

Game TheoryAI
Frontpage

[Epistemic status: Strong claims vaguely stated and weakly held. I expect that writing this and digesting feedback on it will lead to a much better version in the future. EDIT: So far this has stood the test of time. EDIT: As of September 2020 I think this is one of the most important things to be thinking about.]

This post attempts to generalize and articulate a problem that people have been thinking about since at least 2016. [Edit: 2009 in fact!] In short, here is the problem:

Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible. When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in (building AGI and letting it self-modify) may be one of these times unless we think carefully about this problem and how to avoid it.

For this post I use "consequentialists" to mean agents that choose actions entirely on the basis of the expected consequences of those actions. For my purposes, this means they don't care about historical facts such as whether the options and consequences available now are the result of malicious past behavior. (I am trying to avoid trivial definitions of consequentialism according to which everyone is a consequentialist because e.g. "obeying the moral law" is a consequence.) This definition is somewhat fuzzy and I look forward to searching for more precision some other day.

Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible

Consequentialists are bullies; a consequentialist will happily threaten someone insofar as they think the victim might capitulate and won't retaliate.

Consequentialists are also cowards; they conform their behavior to the incentives set up by others, regardless of the history of those incentives. For example, they predictably give in to credible threats unless reputational effects weigh heavily enough in their minds to prevent this.

In most ordinary circumstances the stakes are sufficiently low that reputational effects dominate: Even a consequentialist agent won't give up their lunch money to a schoolyard bully if they think it will invite much more bullying later. But in some cases the stakes are high enough, or the reputational effects low enough, for this not to matter.

So, amongst consequentialists, there is sometimes a huge advantage to "winning the commitment race." If two consequentialists are playing a game of Chicken, the first one to throw out their steering wheel wins. If one consequentialist is in position to seriously hurt another, it can extract concessions from the second by credibly threatening to do so--unless the would-be victim credibly commits to not give in first! If two consequentialists are attempting to divide up a pie or select a game-theoretic equilibrium to play in, the one that can "move first" can get much more than the one that "moves second." In general, because consequentialists are cowards and bullies, the consequentialist who makes commitments first will predictably be able to massively control the behavior of the consequentialist who makes commitments later. As the folk theorem shows, this can even be true in cases where games are iterated and reputational effects are significant.

Note: "first" and "later" in the above don't refer to clock time, though clock time is a helpful metaphor for imagining what is going on. Really, what's going on is that agents learn about each other, each on their own subjective timeline, while also making choices (including the choice to commit to things) and the choices a consequentialist makes at subjective time t are cravenly submissive to the commitments they've learned about by t.

Logical updatelessness and acausal bargaining combine to create a particularly important example of a dangerous commitment race. There are strong incentives for consequentialist agents to self-modify to become updateless as soon as possible, and going updateless is like making a bunch of commitments all at once. Since real agents can't be logically omniscient, one needs to decide how much time to spend thinking about things like game theory and what the outputs of various programs are before making commitments. When we add acausal bargaining into the mix, things get even more intense. Scott Garrabrant, Wei Dai, and Abram Demski have described this problem already, so I won't say more about that here. Basically, in this context, there are many other people observing your thoughts and making decisions on that basis. So bluffing is impossible and there is constant pressure to make commitments quickly before thinking longer. (That's my take on it anyway)

Anecdote: Playing a board game last week, my friend Lukas said (paraphrase) "I commit to making you lose if you do that move." In rationalist gaming circles this sort of thing is normal and fun. But I suspect his gambit would be considered unsportsmanlike--and possibly outright bullying--by most people around the world, and my compliance would be considered cowardly. (To be clear, I didn't comply. Practice what you preach!)

When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in may be one of these times.

This situation is already ridiculous: There is something very silly about two supposedly rational agents racing to limit their own options before the other one limits theirs. But it gets worse.

Sometimes commitments can be made "at the same time"--i.e. in ignorance of each other--in such a way that they lock in an outcome that is disastrous for everyone. (Think both players in Chicken throwing out their steering wheels simultaneously.)

Here is a somewhat concrete example: Two consequentialist AGI think for a little while about game theory and commitment races and then self-modify to resist and heavily punish anyone who bullies them. Alas, they had slightly different ideas about what counts as bullying and what counts as a reasonable request--perhaps one thinks that demanding more than the Nash Bargaining Solution is bullying, and the other thinks that demanding more than the Kalai-Smorodinsky Bargaining Solution is bullying--so many years later they meet each other, learn about each other, and end up locked into all-out war.

I'm not saying disastrous AGI commitments are the default outcome; I'm saying the stakes are high enough that we should put a lot more thought into preventing them than we have so far. It would really suck if we create a value-aligned AGI that ends up getting into all sorts of fights across the multiverse with other value systems. We'd wish we built a paperclip maximizer instead.

Objection: "Surely they wouldn't be so stupid as to make those commitments--even I could see that bad outcome coming. A better commitment would be..."

Reply: The problem is that consequentialist agents are motivated to make commitments as soon as possible, since that way they can influence the behavior of other consequentialist agents who may be learning about them. Of course, they will balance these motivations against the countervailing motive to learn more and think more before doing drastic things. The problem is that the first motivation will push them to make commitments much sooner than would otherwise be optimal. So they might not be as smart as us when they make their commitments, at least not in all the relevant ways. Even if our baby AGIs are wiser than us, they might still make mistakes that we haven't anticipated yet. The situation is like the centipede game: Collectively, consequentialist agents benefit from learning more about the world and each other before committing to things. But because they are all bullies and cowards, they individually benefit from committing earlier, when they don't know so much.

Objection: "Threats, submission to threats, and costly fights are rather rare in human society today. Why not expect this to hold in the future, for AGI, as well?"

Reply: Several points:

1. Devastating commitments (e.g. "Grim Trigger") are much more possible with AGI--just alter the code! Inigo Montoya is a fictional character and even he wasn't able to summon lifelong commitment on a whim; it had to be triggered by the brutal murder of his father.

2. Credibility is much easier also, especially in an acausal context (see above.)

3. Some AGI bullies may be harder to retaliate against than humans, lowering their disincentive to make threats.

4. AGI may not have sufficiently strong reputation effects in the sense relevant to consequentialists, partly because threats can be made more devastating (see above) and partly because they may not believe they exist in a population of other powerful agents who will bully them if they show weakness.

5. Finally, these terrible things (Brutal threats, costly fights) do happen to some extent even among humans today--especially in situations of anarchy. We want the AGI we built to be less likely to do that stuff than humans, not merely as likely.

Objection: "Any AGI that falls for this commit-now-before-the-others-do argument will also fall for many other silly do-X-now-before-it's-too-late arguments, and thus will be incapable of hurting anyone."

Reply: That would be nice, wouldn't it? Let's hope so, but not count on it. Indeed perhaps we should look into whether there are other arguments of this form that we should worry about our AI falling for...

Anecdote: A friend of mine, when she was a toddler, would threaten her parents: "I'll hold my breath until you give me the candy!" Imagine how badly things would have gone if she was physically capable of making arbitrary credible commitments. Meanwhile, a few years ago when I first learned about the concept of updatelessness, I resolved to be updateless from that point onwards. I am now glad that I couldn't actually commit to anything then.

Conclusion

Overall, I'm not certain that this is a big problem. But it feels to me that it might be, especially if acausal trade turns out to be a real thing. I would not be surprised if "solving bargaining" turns out to be even more important than value alignment, because the stakes are so high. I look forward to a better understanding of this problem.

Many thanks to Abram Demski, Wei Dai, John Wentworth, and Romeo Stevens for helpful conversations.

36

25 comments, sorted by Highlighting new comments since Today at 12:40 AM
New Comment

Okay, so now having thought about this a bit...

I at first read this and was like "I'm confused – isn't this what the whole agent foundations agenda is for? Like, I know there are still kinks to work out, and some of this kinks are major epistemological problems. But... I thought this specific problem was not actually that confusing anymore."

"Don't have your AGI go off and do stupid things" is a hard problem, but it seemed basically to be restating "the alignment problem is hard, for lots of finnicky confusing reasons."

Then I realized "holy christ most AGI research isn't built off the agent foundations agenda and people regularly say 'well, MIRI is doing cute math things but I don't see how they're actually relevant to real AGI we're likely to build.'"

Meanwhile, I have several examples in mind of real humans who fell prey to something similar to commitment-race concerns. i.e. groups of people who mutually grim-triggered each other because they were coordinating on slightly different principles. (And these were humans who were trying to be rationalist and even agent-foundations-based)

So, yeah actually it seems pretty likely that many AGIs that humans might build might accidentally fall into these traps. 

So now I have a vague image in my head of a rewrite of this post that ties together some combo of:

  • The specific concerns noted here
  • The rocket alignment problem "hey man we really need to make sure we're not fundamentally confused about agency and rationality."
  • Possibly some other specific agent-foundations-esque concerns

Weaving those into a central point of:

"If you're the sort of person who's like 'Why is MIRI even helpful? I get how they might be helpful but they seem more like a weird hail-mary or a 'might as well given that we're not sure what else to do?'... here is a specific problem you might run into if you didn't have a very thorough understanding of robust agency when you built your AGI. This doesn't (necessarily) imply any particular AGI architecture, but if you didn't have a specific plan for how to address these problems, you are probably going to get them wrong by default.

(This post might already exist somewhere, but currently these ideas feel like they just clicked together in my mind in a way they hadn't previously. I don't feel like I have the ability to write up the canonical version of this post but feel like "someone with better understanding of all the underlying principles" should)

One big factor this whole piece ignores is communication channels: a commitment is completely useless unless you can credibly communicate it to your opponent/partner. In particular, this means that there isn't a reason to self-modify to something UDT-ish unless you expect other agents to observe that self-modification. On the other hand, other agents can simply commit to not observing whether you've committed in the first place - effectively destroying the communication channel from their end.

In a game of chicken, for instance, I can counter the remove-the-steering-wheel strategy by wearing a blindfold. If both of us wear a blindfold, then neither of us has any reason to remove the steering wheel. In principle, I could build an even stronger strategy by wearing a blindfold and using a beeping laser scanner to tell whether my opponent has swerved - if both players do this, then we're back to the original game of chicken, but without any reason for either player to remove their steering wheel.

I think in the acausal context at least that wrinkle is smoothed out.

In a causal context, the situation is indeed messy as you say, but I still think commitment races might happen. For example, why is [blindfold+laserscanner] a better strategy than just blindfold? It loses to the blindfold strategy, for example. Whether or not it is better than blindfold depends on what you think the other agent will do, and hence it's totally possible that we could get a disastrous crash (just imagine that for whatever reason both agents think the other agent will probably not do pure blindfold. This can totally happen, especially if the agents don't think they are strongly correlated with each other and sometimes even if they do (e.g. if they use CDT)) The game of chicken doesn't cease being a commitment race when we add the ability to blindfold and the ability to visibly attach laserscanners.

Blindfold + scanner does not necessarily lose to blindfold. The blindfold does not prevent swerving, it just prevents gaining information - the blindfold-only agent acts solely on its priors. Adding a scanner gives the agent more data to work with, potentially allowing the agent to avoid crashes. Foregoing the scanner doesn't actually help unless the other player knows I've foregone the scanner, which brings us back to communication - though the "communication" at this point may be in logical time, via simulation.

In the acausal context, communication kicks even harder, because either player can unilaterally destroy the communication channel: they can simply choose to not simulate the other player. The game will never happen at all unless both agents expect (based on priors) to gain from the trade.

If you choose not to simulate the other player, then you can't see them, but they can still see you. So it's destroying one direction of the communication channel. But the direction that remains (they seeing you) is the dimension most relevant for e.g. whether or not there is a difference between making a commitment and credibly communicating it to your partner. Not simulating the other player is like putting on a blindfold, which might be a good strategy in some contexts but seems kinda like making a commitment: you are committing to act on your priors in the hopes that they'll see you make this commitment and then conform their behavior to the incentives implied by your acting on your priors.

Meanwhile, a few years ago when I first learned about the concept of updatelessness, I resolved to be updateless from that point onwards. I am now glad that I couldn't actually commit to anything then.

 

Why is that?

All the versions of updatelessness that I know of would have led to some pretty disastrous, not-adding-up-to-normality behaviors, I think. I'm not sure. More abstractly, the commitment races problem has convinced me to be more skeptical of commitments, even ones that seem probably good. If I was a consequentialist I might take the gamble, but I'm not a consequentialist -- I have commitments built into me that have served my ancestors well for generations, and I suspect for now at least I'm better off sticking with that than trying to self-modify to something else.

(This is some of what I tried to say yesterday, but I was very tried and not sure I said it well)

Hm, the way I understand UDT, is that you give yourself the power to travel back in logical time. This means that you don't need to actually make commitment early in your life when you are less smart.

If you are faced with blackmail or transparent Newcomb's problem, or something like that, where you realise that if you had though of the possibility of this sort of situation before it happened (but with your current intelligence), you would have pre-committed to something, then you should now do as you would have pre-committed to.

This means that an UDT don't have to do tons of pre-commitments. It can figure things out as it goes, and still get the benefit of early pre-committing. Though as I said when we talked, it does loose some transparency which might be very costly in some situations. Though I do think that you loose transparency in general by being smart, and that it is generally worth it.

(Now something I did not say)

However the there is one commitment that you (maybe?[1]) have to do to get the benefit of UDT if you are not already UDT, which is to commit to become UDT. And I get that you are wary of commitments. 

Though more concretely, I don't see how UDT can lead to worse behaviours. Can you give an example? Or do you just mean that UDT get into commitment races at all, which is bad? But I don't know any DT that avoids this, other than always giving in to blackmail and bullies, which I already know you don't, given one of the stories in the blogpost.

[1] Or maybe not. Is there a principled difference between never giving into blackmail becasue you pre-committed something, or just never giving into blackmail with out any binding pre-commitment? I suspect not really, which means you are UDT as long as you act UDT, and no pre-commitment needed, other than for your own sake.

Thanks for the detailed reply!

where you realise that if you had though of the possibility of this sort of situation before it happened (but with your current intelligence), you would have pre-committed to something, then you should now do as you would have pre-committed to.

The difficulty is in how you spell out that hypothetical. What does it mean to think about this sort of situation before it happened but with your current intelligence? Your current intelligence includes lots of wisdom you've accumulated, and in particular, includes the wisdom that this sort of situation has happened, and more generally that this sort of situation is likely, etc. Or maybe it doesn't -- but then how do we define current intelligence then? What parts of your mind do we cut out, to construct the hypothetical?

I've heard of various ways of doing this and IIRC none of them solved the problem, they just failed in different ways. But it's been a while since I thought about this.

One way they can fail is by letting you have too much of your current wisdom in the hypothetical, such that it becomes toothless -- if your current wisdom is that people threatening you is likely, you'll commit to giving in instead of resisting, so you'll be a coward and people will bully you. Another way they can fail is by taking away too much of your current wisdom in the hypothetical, so that you commit to stupid-in-retrospect things too often.

Imagine your life as a tree (as in data structure). Every observation which (from your point of view of prior knowledge) could have been different, and every decision which (from your point of view) could have been different, is a node in this tree. 

Ideally you would would want to pre-analyse the entire tree, and decide the optimal pre-commitment for each situation. This is too much work. 

So instead you wait and see which branch you find yourself in, only then make the calculations needed to figure out what you would do in that situation, given a complete analysis of the tree (including logical constraints, e.g. people predicting what you would have done, etc). This is UDT. In theory, I see no drawbacks with UDT. Except in practice UDT is also too much work. 

What you actually do, as you say, is to rely on experience based heuristics. Experience based heuristics is much superior for computational efficiency, and will give you a leg up in raw power. But you will slide away from optimal DT, which will give you a negotiating disadvantage. Given that I think raw power is more important than negotiating advantage, I think this is a good trade-off. 

The only situation where you want to rely more on DT principles, is in super important one-off situations, and you basically only get those in weird acausal trade situations. Like, you could frame us building a friendly AI as acausal trade, like Critch said, but that framing does not add anything useful. 

And then there is things like this and this and this, which I don't know how to think of. I suspect it breaks somehow, but I'm not sure how. And if I'm wrong, getting DT right might be the most important thing.

But in any normal situation, you will either have repeated games among several equals, where some coordination mechanism is just uncomplicatedly in everyone interest. Or your in a situation where one person just have much more power over the other one.

It seems like we can kind of separate the problem of equilibrium selection from the problem of “thinking more”, if “thinking more” just means refining one’s world models and credences over them. One can make conditional commitments of the form: “When I encounter future bargaining partners, we will (based on our models at that time) agree on a world-model according to some protocol and apply some solution concept (e.g. Nash or Kalai-Smorodinsky) to it in order to arrive at an agreement.”

The set of solution concepts you commit to regarding as acceptable still poses an equilibrium selection problem. But, on the face of it at least, the “thinking more” part is handled by conditional commitments to act on the basis of future beliefs.

I guess there’s the problem of what protocols for specifying future world-models you commit to regarding as acceptable. Maybe there are additional protocols that haven’t occurred to you, but which other agents may have committed to and which you would regard as acceptable when presented to you. Hopefully it is possible to specify sufficiently flexible methods for determining whether protocols proposed by your future counterparts are acceptable that this is not a problem.

If I read you correctly, you are suggesting that some portion of the problem can be solved, basically -- that it's in some sense obviously a good idea to make a certain sort of commitment, e.g. "When I encounter future bargaining partners, we will (based on our models at that time) agree on a world-model according to some protocol and apply some solution concept (e.g. Nash or Kalai-Smorodinsky) to it in order to arrive at an agreement.” So the commitment races problem may still exist, but it's about what other commitments to make besides this one, and when. Is this a fair summary?

I guess my response would be "On the object level, this seems like maybe a reasonable commitment to me, though I'd have lots of questions about the details. We want it to be vague/general/flexible enough that we can get along nicely with various future agents with somewhat different protocols, and what about agents that are otherwise reasonable and cooperative but for some reason don't want to agree on a world-model with us? On the meta level though, I'm still feeling burned from the various things that seemed like good commitments to me and turned out to be dangerous, so I'd like to have some sort of stronger reason to think this is safe."

Yeah I agree the details aren’t clear. Hopefully your conditional commitment can be made flexible enough that it leaves you open to being convinced by agents who have good reasons for refusing to do this world-model agreement thing. It’s certainly not clear to me how one could do this. If you had some trusted “deliberation module”, which engages in open-ended generation and scrutiny of arguments, then maybe you could make a commitment of the form “use this protocol, unless my counterpart provides reasons which cause my deliberation module to be convinced otherwise”. Idk.

Your meta-level concern seems warranted. One would at least want to try to formalize the kinds of commitments we’re discussing and ask if they provide any guarantees, modulo equilibrium selection.

I think we are on the same page then. I like the idea of a deliberation module; it seems similar to the "moral reasoning module" I suggested a while back. The key is to make it not itself a coward or bully, reasoning about schelling points and universal principles and the like instead of about what-will-lead-to-the-best-expected-outcomes-given-my-current-credences.

I mostly agree with this post, except I'm not convinced it is very important. (I wrote some similar thought here.)

Raw power (including intelligence) will always be more important than having the upper hand in negotiation. Because I can only shift you up to the amount I can threaten you.

Let's say I can cause you up to X utility of harm, according to your utility function. If I'm maximally skilled at blackmail negotiation then I can decide your action with in the set of action such that your utility is with in (max-X, max] utility.

If X utility is a lot, then I can influence you a lot. If X is not so much then I don't have much power over you. If I'm strong then X will be large, and influencing your action will probably be of little importance to me. 

Blackmail is only important when players are of similar straights which is probably unlikely, or if the power to destroy is much more than the power to create, which I also find unlikely. 

The main scenario where I expect blackmail to seriously matter (among super intelligences) is in aclausal trade between different universes. I'm sceptical to this being a real thing, but admit I don't have strong arguments on this point.

I agree raw power (including intelligence) is very useful and perhaps generally more desireable than bargaining power etc. But that doesn't undermine the commitment races problem; agents with the ability to make commitments might still choose to do so in various ways and for various reasons, and there's general pressure (collective action problem style) for them to do it earlier while they are stupider, so there's a socially-suboptimal amount of risk being taken.

I agree that on Earth there might be a sort of unipolar takeoff where power is sufficiently imbalanced and credibility sufficiently difficult to obtain and "direct methods" easier to employ, that this sort of game theory and bargaining stuff doesn't matter much. But even in that case there's acausal stuff to worry about, as you point out.

I was confused about this post, and... I might have resolve my confusion by the time I got ready to write this comment. Unsure. Here goes:

My first* thought: 

Am I not just allowed to precommit to "be the sort of person who always figures out whatever the optimal game theory was, and commit to that?". I thought that was the point. 

i.e. I wouldn't precommit to treating either the Nash Bargaining Solution or Kalai-Smorodinsky Solution as "the permanent grim trigger bullying point", I'd precommit to something like "have a meta-policy of not giving into bullying, pick my best-guess-definition-of-bullying as my default trigger, and my best-guess grim-trigger response, but include an 'oh shit I didn't think about X' parameter." (with some conditional commitments thrown in)

Where X can't be an arbitrary new belief – the whole point of having a grim trigger clause is to be able to make appropriately weighted threats that AGI-Bob really thinks will happen. But, if I legitimately didn't think of the Kalai-Smordinwhatever solution as something an agent might legitimately think was a good coordination tool, I want to be able to say. depending on circumstances:

  1. If the deal hasn't resolved yet "oh, shit I JUUUST thought of the Kalai-whatever thing and this means I shouldn't execute my grim trigger anti-bullying clause without first offering some kind of further clarification step."
  2. If the deal already resolved before I thought of it, say "oh shit man I really should realized the Kalai-Smorodinsk thing was a legitimate schelling point and not started defecting hard as punishment. Hey, fellow AGI, would you like me to give you N remorseful utility in return for which I stop grim-triggering you and you stop retaliating at me and we end the punishment spiral?"

My second* thought:

Okay. So. I guess that's easy for me to say. But, I guess the whole point of all this updateless decision theory stuff was to actually formalize that in a way that you could robustly program an AGI that you were about to give the keys to the universe. 

Having a vague handwavy notion of it isn't reassuring enough if you're about to build a god. 

And while it seems to me like this is (relatively) straightforward... do I really want to bet that?

I guess my implicit assumption was that game theory would turn out to not be that complicated in the grand scheme of thing. Surely once you're a Jupiter Brain you'll have it figured out? And, hrmm, maybe that's true, but but maybe it's not, or maybe it turns out the fate of the cosmos gets decided with smaller AGIs fighting over Earth which much more limited compute.

Third thought:

Man, just earlier this year, someone offered me a coordination scheme that I didn't understand, and I fucked it up, and the deal fell through because I didn't understand the principles underlying it until just-too-late. (this is an anecdote I plan to write up as a blogpost sometime soon)

And... I guess I'd been implicitly assuming that AGIs would just be able to think fast enough that that wouldn't be a problem.

Like, if you're talking to a used car salesman, and you say "No more than $10,000", and then they say "$12,000 is final offer", and then you turn and walk away, hoping that they'll say "okay, fine, $10,000"... I suppose metaphorical AGI used car buyers could say "and if you take more than 10 compute-cycles to think about it, the deal is off." And that might essentially limit you to only be able to make choices you'd precomputed, even if you wanted to give yourself the option to think more.

That seems to explain why my "Just before deal resolves, realize I screwed up my decision theory" idea doesn't work. 

It seems like my "just after deal resolves and I accidentally grim trigger, turn around and say 'oh shit, I screwed up, here is remorse payment + a costly proof that that I'm not fudging my decision theory'" should still work though?

I guess in the context of Acausal Trade, I can imagine things like "they only bother running a simulation of you for 100 cycles, and it doesn't matter if on the 101st cycle you realize you made a mistake and am sorry." They'll never know it.

But... I dunno man. I figured the first rule of Acausal Trade was "build a galaxy brain and think really goddamn carefully about acausal trade and philosophical competence" before you actually try simulating anything, and I'm skeptical a galaxy brain can't figure out the right precommitments.

I dunno. Maybe I'm still confused.

But, I wanted to check in on whether I was on the right track in understanding what considerations were at play here.

...

*actually there were like 20 thoughts before I got to the one I've labeled 'first thought' here. But, "first thought that seemed worth writing down."

Thanks! Reading this comment makes me very happy, because it seems like you are now in a similar headspace to me back in the day. Writing this post was my response to being in this headspace.

But... I dunno man. I figured the first rule of Acausal Trade was "build a galaxy brain and think really goddamn carefully about acausal trade and philosophical competence" before you actually try simulating anything, and I'm skeptical a galaxy brain can't figure out the right precommitments.

This sounds like a plausibly good rule to me. But that doesn't mean that every AI we build will automatically follow it. Moreover, thinking about acausal trade is in some sense engaging in acausal trade. As I put it:

Since real agents can't be logically omniscient, one needs to decide how much time to spend thinking about things like game theory and what the outputs of various programs are before making commitments. When we add acausal bargaining into the mix, things get even more intense. Scott Garrabrant, Wei Dai, and Abram Demski have described this problem already, so I won't say more about that here. Basically, in this context, there are many other people observing your thoughts and making decisions on that basis. So bluffing is impossible and there is constant pressure to make commitments quickly before thinking longer. (That's my take on it anyway)

As for your handwavy proposals, I do agree that they are pretty good. They are somewhat similar to the proposals I favor, in fact. But these are just specific proposals in a big space of possible strategies, and (a) we have reason to think there might be flaws in these proposals that we haven't discovered yet, and (b) even if these proposals work perfectly there's still the problem of making sure that our AI follows them:

Objection: "Surely they wouldn't be so stupid as to make those commitments--even I could see that bad outcome coming. A better commitment would be..."
Reply: The problem is that consequentialist agents are motivated to make commitments as soon as possible, since that way they can influence the behavior of other consequentialist agents who may be learning about them. Of course, they will balance these motivations against the countervailing motive to learn more and think more before doing drastic things. The problem is that the first motivation will push them to make commitments much sooner than would otherwise be optimal. So they might not be as smart as us when they make their commitments, at least not in all the relevant ways. Even if our baby AGIs are wiser than us, they might still make mistakes that we haven't anticipated yet. The situation is like the centipede game: Collectively, consequentialist agents benefit from learning more about the world and each other before committing to things. But because they are all bullies and cowards, they individually benefit from committing earlier, when they don't know so much.

If you want to think and talk more about this, I'd be very interested to hear your thoughts. Unfortunately, while my estimate of the commitment races problem's importance has only increased over the past year, I haven't done much to actually make intellectual progress on it.

Yeah I'm interested in chatting about this. 

I feel I should disclaim "much of what I'd have to say about this is a watered down version of whatever Andrew Critch would say". He's busy a lot, but if you haven't chatted with him about this yet you probably should, and if you have I'm not sure whether I'll have much to add.

But I am pretty interested right now in fleshing out my own coordination principles and fleshing out my understanding of how they scale up from "200 human rationalists" to 1000-10,000 sized coalitions to All Humanity and to AGI and beyond. I'm currently working on a sequence that could benefit from chatting with other people who think seriously about this.

This feels like an important question in Robust Agency and Group Rationality, which are major topics of my interest.

I have another "objection", although it's not a very strong one, and more of just a comment.

One reason game theory reasoning doesn't work very well in predicting human behavior is because games are always embedded in a larger context, and this tends to wreck the game-theory analysis by bringing in reputation and collusion as major factors. This seems like something that would be true for AIs as well (e.g. "the code" might not tell the whole story; I/"the AI" can throw away my steering wheel but rely on an external steering-wheel-replacing buddy to jump in at the last minute if needed).

In apparent contrast to much of the rationalist community, I think by default one should probably view game theoretic analyses (and most models) as "just one more way of understanding the world" as opposed to "fundamental normative principles", and expect advanced AI systems to reason more heuristically (like humans).

But I understand and agree with the framing here as "this isn't definitely a problem, but it seems important enough to worry about".


I think you're missing at least one key element in your model: uncertainty about future predictions. Commitments have a very high cost in terms of future consequence-effecting decision space. Consequentialism does _not_ imply a very high discount rate, and we're allowed to recognize the limits of our prediction and to give up some power in the short term to reserve our flexibility for the future.

Also, one of the reasons that this kind of interaction is rare among humans is that commitment is impossible for humans. We can change our minds even after making an oath - often with some reputational consequences, but still possible if we deem it worthwhile. Even so, we're rightly reluctant to make serious committments. An agent who can actually enforce it's self-limitations is going to be orders of magnitude more hesitant to do so.

All that said, it's worth recognizing that an agent that's significantly better at predicting the consequences of potential commitments will pay a lower cost for the best of them, and has a material advantage over those who need flexibility because they don't have information. This isn't a race in time, it's a race in knowledge and understanding. I don't think there's any way out of that race - more powerful agents are going to beat weaker ones most of the time.

I don't think I was missing that element. The way I think about it is: There is some balance that must be struck between making commitments sooner (risking making foolish decisions due to ignorance) and later (risking not having the right commitments made when a situations arises in which they would be handy). A commitment race is a collective action problem where individuals benefit from going far to the "sooner" end of the spectrum relative to the point that would be optimal for everyone if they could coordinate.

I agree about humans not being able to make commitments--at least, not arbitrary commitments. (Arguably, getting angry and seeking revenge when someone murders your family is a commitment you made when you were born.) I think we should investigate whether this inability is something evolution "chose" or not.

I agree it's a race in knowledge/understanding as well as time. (The two are related.) But I don't think more knowledge = more power. For example, if I don't know anything and decide to commit to plan X which benefits me, else war, and you know more than me--in particular, you know enough about me to know what I will commit to--and you are cowardly, then you'll go along with my plan.


This post attempts to generalize and articulate a problem that people have been thinking about since at least 2016.

I found some related discussions going back to 2009. It's mostly highly confused, as you might expect, but I did notice this part which I'd forgotten and may actually be relevant:

But if you are TDT, you can’t always use less com­put­ing power, be­cause that might be cor­re­lated with your op­po­nents also de­cid­ing to use less com­put­ing power

This could potentially be a way out of the "racing to think as little as possible before making commitments" dynamic, but if we have to decide how much to let our AIs think initially before making commitments, on the basis of reasoning like this, that's a really hairy thing to have to do. (This seems like another good reason for wanting to go with a metaphilosophical approach to AI safety instead of a decision theoretic one. What's the point of having a superintelligent AI if we can't let it figure these kinds of things out for us?)

If two consequentialists are attempting to divide up a pie or select a game-theoretic equilibrium to play in, the one that can “move first” can get much more than the one that “moves second.”

I'm not sure how the folk theorem shows this. Can you explain?

going updateless is like making a bunch of commitments all at once

Might be a good idea to offer some examples here to help explain updateless and for pumping intuitions.

Meanwhile, a few years ago when I first learned about the concept of updatelessness, I resolved to be updateless from that point onwards. I am now glad that I couldn’t actually commit to anything then.

Interested to hear more details about this. What would have happened if you were actually able to become updateless?

Thanks, edited to fix!

I agree with your push towards metaphilosophy.

I didn't mean to suggest that the folk theorem proves anything. Nevertheless here is the intuition: The way the folk theorem proves any status quo is possible is by assuming that players start off assuming everyone else will grim trigger them for violating that status quo. So in a two-player game, if both players start off assuming player 1 will grim trigger player 2 for violating player 1's preferred status quo, then player 1 will get what they want. One way to get this to happen is for player 1 to be "earlier in logical time" than player 2 and make a credible commitment.

As for updatelessness: Well, updateless agents follow the policy that is optimal from the perspective of the credences they have at the time they go updateless. So e.g. if there is a cowardly agent who simulates you at that time or later and then caves to your demands (if you make any) then an updateless agent will be a bully and make demands, i.e. commit to punishing people it identifies as cowards who don't do what it wants. But of course updateless agents are also cowards themselves, in the sense that the best policy from the perspective of credences C is to cave in to any demands that have already been committed to according to C. I don't have a super clear example of how this might lead to disaster, but I intend to work one out in the future...

Same goes for my own experience. I don't have a clear example in mind of something bad that would have happened to me if I had actually self-modified, but I get a nervous feeling about it.