This looks really interesting to me. I remember when the Safety via Debate paper originally came out; I was quite curious to see more work on modeling debate environments and getting a better sense of how well we should expect debate to perform in which kinds of situations. From what I can tell, this makes a rigorous attempt at 1-2 such models.
I noticed that this is more mathematically intense than most other papers I'm used to in this area. I started going through it but was a bit intimidated. Could you suggest any tips for reading through it and understanding it? Do readers need to know some measure theory, or other specific areas of math that may be more advanced than what we're used to on LessWrong? Is there anything else we should read first, or make sure we know, to prepare accordingly?
I appreciate this post for working to distill a key crux in the larger debate.
Some quick thoughts:
1. I'm having a hard time understanding the "Alas, the power-seeking ruthless consequentialist AIs are still coming" intuition. It seems like a lot of people in this community have this intuition, and I'm very curious why. I appreciate this crux getting attention.
2. Personally, my stance is something more like, "It seems very feasible to create sophisticated AI architectures that don't act as scary maximizers." To me it seems like this is what we're doing now, and I see some strong reasons to expect it to continue. (I realize this isn't guaranteed, but I do think it's pretty likely.)
3. While the human analogies are interesting, I assume they appeal more to the "consequentialist AIs are still coming" crowd than to people like myself. Humans evolved under some pretty wacky pressures and have a large number of serious failure modes. Perhaps they're much better than some people imagine, but I suspect we can build AI systems with much more rigorous safety properties in the future. For these challenges, I personally find histories of engineering complex systems in predictable and controllable ways to be much more informative.
4. You mention human intrinsic motivations as a useful factor. I'd flag that in a competent and complex AI architecture, I'd expect many subcomponents to have strong biases toward corrigibility and friendliness. This seems highly analogous to human minds, where it's really specific sub-routines and the like that have these more altruistic motivations.