Disentangling Perspectives On Strategy-Stealing in AI Safety

I understood the idea of Paul's post as: if we start in a world where humans-with-aligned-AIs control 50% of relevant resources (computers, land, minerals, whatever), and unaligned AIs control 50% of relevant resources, and where the strategy-stealing assumption is true—i.e., the assumption that any good strategy that the unaligned AIs can do, the humans-with-aligned-AIs are equally capable of doing themselves—then the humans-with-aligned-AIs will wind up controlling 50% of the long-term future. And the same argument probably holds for 99%-1% or any other ratio. This part seems perfectly plausible to me, if all those assumptions hold.
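To make the proportionality claim concrete, here is a minimal toy sketch. It is not from Paul's post: the function name `long_run_shares`, the 10% per-step growth rate, and the idea that "running the same strategy" means "compounding resources at the same rate" are all illustrative assumptions.

```python
def long_run_shares(initial_shares, growth_rates, steps):
    """Compound each coalition's resources and return their final shares.

    Toy model (an assumption, not Paul's formalism): a coalition's strategy
    is summarized by a single per-step growth rate on its resources.
    """
    resources = list(initial_shares)
    for _ in range(steps):
        resources = [r * (1.0 + g) for r, g in zip(resources, growth_rates)]
    total = sum(resources)
    return [r / total for r in resources]


# If strategy-stealing holds, both coalitions can run the same best strategy,
# so they compound at the same (hypothetical) rate and the split is preserved.
even_split = long_run_shares([0.5, 0.5], [0.10, 0.10], steps=100)
lopsided_split = long_run_shares([0.99, 0.01], [0.10, 0.10], steps=100)
```

In this toy model the final shares match the initial ones for any starting ratio, which is all the "50% stays 50%" argument needs. The interesting question is whether the equal-growth-rate assumption holds.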

Then we can talk about why the strategy-stealing assumption is not in fact true. The unaligned AIs can harm the humans-with-aligned-AIs by causing wars, pandemics, food shortages, and removing-all-the-oxygen-from-the-atmosphere, but not so much vice-versa. The unaligned AIs can execute a good strategy that the humans-with-aligned-AIs are too uncoordinated to carry out; the latter will instead be bickering amongst themselves, hamstrung by laws and customs and taboos etc., and lacking a coherent idea of what they're trying to do anyway. The aligned AIs might be less capable than the unaligned AIs because of the "alignment tax": we make them safe by making them less powerful (they act conservatively, there are humans in the loop, etc.). And so on and so forth. All this stuff is in Paul's post, I think.
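As a rough illustration of how just one of these failure modes erodes the proportionality result, here is a toy sketch of the alignment tax. Everything in it is a made-up assumption: the function name `aligned_share_over_time`, the 10% base growth rate, and modeling the tax as a fractional cut to the aligned coalition's growth rate.

```python
def aligned_share_over_time(initial_share, base_rate, alignment_tax, steps):
    """Track the aligned coalition's share of total resources over time.

    Toy model (an assumption): the unaligned side compounds at base_rate,
    while the aligned side compounds at base_rate * (1 - alignment_tax).
    """
    aligned = initial_share
    unaligned = 1.0 - initial_share
    aligned_rate = base_rate * (1.0 - alignment_tax)
    history = []
    for _ in range(steps):
        aligned *= 1.0 + aligned_rate
        unaligned *= 1.0 + base_rate
        history.append(aligned / (aligned + unaligned))
    return history


# With no tax the 50/50 split is preserved; with a hypothetical 20% tax
# the aligned share shrinks at every step.
no_tax = aligned_share_over_time(0.5, 0.10, 0.0, steps=100)
with_tax = aligned_share_over_time(0.5, 0.10, 0.2, steps=100)
```

Under these made-up numbers, even a modest tax compounds into a large loss of share. This captures only the alignment-tax point; the asymmetric-harm and coordination points are not modeled here.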

I feel like Paul's post is a great post in all those details, but I would have replaced the conclusion section with

"So, in summary, for all these reasons, the strategy-stealing assumption (in this context) is more-or-less totally false and we shouldn't waste our time thinking about it"

whereas Paul's conclusion section is kinda the opposite. (Zvi's comment is along the same lines.)

I feel like a lot of this post is listing reasons that the strategy-stealing assumption is false (e.g. humans don't know what they're trying to do and can't coordinate with each other regardless), which are mostly consistent with Paul's post. It also notes that there are situations in which we don't care whether the strategy-stealing assumption is true or false (e.g. unipolar AGI outcomes, situations where all the AIs are misaligned, etc.).

And then other parts of the post are, umm, I'm not sure, sending "something is wrong" vibes that I'm not really understanding or sympathizing with…

Some of the arguments can probably be extended to n-player turn-based games with relatively little difficulty, to certain simultaneous games also with relatively little difficulty (as we’ll see below), and probably to continuous-time games with moderate difficulty.
This is the reason that the strategy-stealing argument can’t be used to prove a win for Black in Go with komi: the game is not actually symmetric. If you try to pass your first turn to “effectively become P2”, you can’t win by taking (half the board - komi + 0.5) points, like White can.
For fun, though, this is also one reason why strategy-stealing can’t be used to prove a guaranteed win/draw for White in chess, due to zugzwang.
Actually this isn’t the best example, because first-player advantage in raw Gomoku is so huge that Gomoku has been explicitly solved by computers. But we can correct the example by imagining that instead of Gomoku I named a way more computationally complex version of Gomoku, where you win if you get like 800 in a row in like 15 dimensions or something.
This assumes that the proportion of “influence” over the future a coalition holds is roughly proportional to the fraction of maximum possible utility they could achieve if everyone were aligned. There are obvious flaws in this assumption, which Paul discusses in his post.
This means that the use of the phrase “human values… win out” above is doing a little bit of subtle lifting. Under the assumptions of Paul’s model, humans with 99% of flexible influence can achieve 99% of maximum utility in the long run. IMHO it’s a moral-philosophical question whether this is an acceptable outcome; Paul bites the bullet and assumes that it is for his analysis.
Furthermore, depending on the exact scenario you’re analyzing, you might have to assume that aligned AIs are designed such that humans can effectively cooperate with them, which starts to bleed into considerations about interpretability and corrigibility. This wasn’t discussed in Paul’s original post but was discussed in the comments.

AI ALIGNMENT FORUM