(Epistemic status: I think this is right?)

Alice is the CEO of ArmaggAI, and Bob is the CEO of BigModelsAI, two major AI capabilities organizations. They're racing to be the first to build a superintelligence aligned to their respective CEV (coherent extrapolated volition), which would take over the universe and satisfy their values.

Alice would prefer to slow down so she has a lot more time to make sure that what her company is building will be aligned; but she's worried about Bob's company taking advantage of that and racing themselves, resulting in Bob's utility function being {what the lightcone is filled with} instead of Alice's; so she feels like she has no choice but to race, to maximize her own utility. Likewise vice-versa.

This state of affairs is far from the Pareto frontier of {their utility functions, each weighted by how likely they are to be the one to build an aligned superintelligence}: it causes a lot more worlds where everyone is dead, instead of either Alice's or Bob's utility function being maximized.

But they don't really get all this LDT business; it seems complicated to them. If only they had a convenient lesswrong post spelling out exactly what they should do, in clear easy-to-follow terms!

The procedure they should implement is as follows. Alice should:

  • In worlds where she does launch a superintelligence aligned to her, and it takes over the world and builds an Alice::utopia, she should tell her AI:

    Actually, don't just maximize my own utility function. Maximize a (threat-resistant) mixture of my utility function and Bob's, weighted by what ratio of timelines you think we each launched an aligned superintelligence in, but also weighted by the probability that both me and Bob implement this procedure.

  • Before that, she should slow down their company's capabilities progress.

And likewise for Bob.
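As a toy illustration of that weighting (all numbers and the weighting scheme itself are invented for clarity; an actual superintelligence would compute something more sophisticated and threat-resistant):

```python
def mixture_weights(p_win_alice, p_win_bob, p_both_implement):
    """Toy sketch of the mixture Alice hands her superintelligence.

    p_win_alice, p_win_bob: estimated ratio of timelines in which each
        party launches an aligned superintelligence.
    p_both_implement: estimated probability that both Alice and Bob
        actually follow this procedure.
    """
    # Normalize win probabilities into shares of aligned-launch timelines.
    total = p_win_alice + p_win_bob
    share_bob = p_win_bob / total
    # Bob's utility function only gets weight insofar as both parties
    # follow the procedure; if the deal fails, Alice keeps everything.
    w_bob = share_bob * p_both_implement
    w_alice = 1.0 - w_bob
    return w_alice, w_bob

# Equally likely to win, both certainly cooperate: an even split.
print(mixture_weights(0.25, 0.25, 1.0))  # (0.5, 0.5)
```

The key property is that Bob's weight goes to zero as the probability that he implements the procedure goes to zero, which is what makes cooperating (rather than pretending to cooperate) the utility-maximizing move.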

Now, their respective aligned superintelligences aren't omniscient; it could be that they over-estimate or under-estimate how likely Bob or Alice would've been to implement this procedure, for example. But one thing's pretty much for sure: neither Alice nor Bob are able to fool the future-superintelligence into thinking that they'd implement this procedure, when actually they wouldn't. Because it's superintelligent.

If Alice wins, she knows Bob will follow the procedure because her superintelligence can tell (better than Bob can fake). And Bob doesn't have to count on wishful thinking to know that Alice would indeed do this instead of defecting, because in worlds where he wins, he can ask his superintelligence whether Alice would implement this procedure. They're each kept in check by the other's future-winning-self, and they can each rely on this being superintelligently-checked by their respective future selves.

So the only way Alice has to get some of her utility maximized in worlds where Bob wins, is to actually behave like this, including before either has launched a superintelligence. And likewise for Bob.

Their incentive gradient points in the direction of being more likely to follow this procedure, including slowing down their capabilities progress — and thus decreasing the number of worlds where their AI is unaligned and everyone dies forever.

In the real world, there are still Bobs and Alices who don't implement this procedure, but that's mostly because they don't know or understand that if they did, they would gain more utility. In many cases, it should suffice for them to be informed that this is indeed where their utility lies.

Once someone has demonstrated that they understand how LDT applies here, and that they're generally rational, then they should understand that implementing this protocol (including slowing down AI capabilities) is what maximizes their utility, and so you can count on them cooperating.

Now, in actuality, this is not quite the full generality of how LDT applies here. What they each should actually tell their superintelligence if they win is simpler:

Maximize a mixture of my utility function and Bob's (and anyone else who might be in a position to build superintelligence), weighted in whatever way creates a (non-threatening) incentive gradient which maximizes my utility, including the utility I have for reducing the number of worlds in which everyone dies.

Or even more simply:

Maximize my utility function (as per LDT).

But I think it's neat to have a clearer idea how that actually shakes out.

There's no weird acausal magic going on here. By racing for AI, Bob would slightly increase the chance that he's the one to launch the aligned superintelligence that takes over the world, but he'd cause more dead worlds in total, and lose the utility he would otherwise gain in worlds where Alice wins, ending up with less utility overall.
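A toy expected-utility calculation makes the tradeoff concrete (every number here is invented purely for illustration):

```python
def bob_expected_utility(p_bob_wins, p_alice_wins, bobs_share_if_alice_wins):
    # Bob gets full utility (1.0) in worlds he wins, a fractional share
    # (via the mixture) in worlds Alice wins, and 0 in dead worlds.
    return p_bob_wins * 1.0 + p_alice_wins * bobs_share_if_alice_wins

# Both slow down: more aligned-launch worlds in total, trade intact.
u_cooperate = bob_expected_utility(0.25, 0.25, 0.5)

# Bob races: he wins slightly more often, but alignment fails more
# often overall, and defecting forfeits his share of Alice's worlds.
u_race = bob_expected_utility(0.30, 0.10, 0.0)
```

With these invented numbers, cooperating gets Bob 0.375 expected utility versus 0.30 for racing, even though racing makes him more likely to be the one who wins.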

If either of them are somewhat negative utilitarian, racing is even worse: all those dead worlds where they launch an unaligned superintelligence leave remote alien baby-eaters free to eat babies, whereas if they increased the total number of worlds where either of them gets an aligned superintelligence, then that aligned superintelligence can pay a bunch of its lightcone in exchange for them eating fewer babies. This is not a threat; we're never pessimizing the aliens' utility function. We're simply offering them a bunch of realityfluid/negentropy right here, in exchange for them focusing more on a subset of their values which doesn't contain lots of what Alice and Bob would consider suffering — the aliens can only be strictly better-off than if we didn't engage with them.
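The no-threat condition can be sketched minimally (hypothetical additive alien utility, invented numbers): the aliens' utility with the trade must be at least their no-trade baseline.

```python
def alien_utility(babies_eaten, negentropy_received):
    # Hypothetical: the aliens additively value eating babies and
    # receiving resources (realityfluid/negentropy).
    return babies_eaten + negentropy_received

baseline = alien_utility(babies_eaten=10, negentropy_received=0)    # no trade
with_trade = alien_utility(babies_eaten=4, negentropy_received=8)   # paid to eat fewer

# Not a threat: the aliens are strictly better off accepting the offer
# than if we had never engaged with them at all.
assert with_trade > baseline
```

A threat, by contrast, would push the aliens' utility *below* the no-interaction baseline to coerce them; that is exactly what this procedure rules out.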

Now, this isn't completely foolproof. If Alice is very certain that her own superintelligence will indeed be aligned when it's launched no matter how fast she goes, then she has no incentive to slow down — in her model, Bob doesn't have much to offer to her.

But should she really have that confidence, when a bunch of qualified alignment researchers keep telling her that she might be wrong?

She should make sure she actually has that high confidence, and that she's implementing rationality correctly in general.

Oh, and needless to say, people who are neither Alice nor Bob also have a bunch of utility to gain by taking actions which reduce the total number of dead worlds, such as forcing both of their companies to slow down (e.g. through regulation).

When we have this much utility in common (not wanting to die), it's really really dumb to "defect". Unlike in the prisoner's dilemma, this "defection" doesn't get you more utility, it gets you less. This is not a zero-sum game at all. If you think it is, if you think your preferred future is the exact opposite of your opponent's preferred future, then you're probably making a giant reasoning mistake.

Whether your utility function is focused on creating nice things or on reducing suffering ("positively-caring" and "negatively-caring"), slowing down AI progress in order to have a better chance of alignment is probably what serves your utility function best.

Comments (3)

At a glance, I think this works, and it's a neat approach. I have doubts, though.

The impossibility of explaining the theory behind this to random SV CEOs or military leaders... is not one of them. Human culture has always contained many shards of LDT-style thinking, the concept of "honour" chief among them. To explain it, you don't actually need to front-load exposition about decision theories and acausal trade – you can just re-use said LDT-shards, and feed them the idea in an immediately intuitive format.

I'm not entirely sure what that would look like, and it's not a trivial re-framing problem. But I think it's very doable with some crafty memetic engineering.

My first concern is that we might not actually have the time. While proliferating this idea (in the sense of "This Is What You Do If You Have AGI") is doable, that'd still take some time. You'd need to split it into a set of five-word messages, and parcel them out over the years. I think you'd said your timeline is 0-5 years; that's IMO definitely not enough.

Based on the latest developments (Gemini, what Q* is supposed to be, both underwhelming), I think we have a fair bit longer (like, a decade-ish). Might be enough if we start yesterday.

My second concern is... more vague, but I feel like this is still too optimistic about human nature. Sure, maybe it'd work for the current crop of major-AI-Lab CEOs. But in a lot of situations (e.g., acute xenophobia), I think the preference ordering goes "I win" > "they lose" > "a compromise", such that they would prefer an all-or-nothing gamble to a measured split of the gains. (Like, it's almost Copenhagen Ethics-ish? It feels utterly repulsive to contribute to your enemy's happiness, such that you'd rather either eradicate them or be eradicated, no matter how self-destructive that is?)

At that point, I may be being too cynical, though. I also might feel differently if I were staring at the version of this pitch already re-framed into intuitive terms.

Due to my timelines being this short, I'm hopeful that convincing just "the current crop of major-AI-Lab CEOs" might actually be enough to buy us the bulk of time that something like this could buy.

I think this has a fixed-point selection problem: if one or both of them start with a different prior under which the other player punishes them for not racing / doesn't reward them enough (maybe because they have very little faith in the other's rationality, or because they think it's not within their power to decide that, and also there's not enough evidential correlation in their decisions), then they'll race.

Of course, considerations about whether the other player normatively endorses something LDT-like also enter the picture. And even if individual humans would endorse it (and that's already a medium-big if), I worry our usual decision structures (for example in AI labs) don't incentivize it (and what's the probability some convincing decision theorist cuts through them? not sure).