Okay, that makes much more sense. I initially read the diagram as saying that just lines 1 and 2 were in the box.
If that's how it works, it doesn't lead to a simplified cartoon guide for readers who'll notice missing steps or circular premises; they'd have to first walk through Lob's Theorem in order to follow this "simplified" proof of Lob's Theorem.
Forgive me if this is a dumb question, but if you don't use assumption 3: (C -> C) inside steps 1-2, wouldn't the hypothetical method prove 2: C for any C?
So I think that building nanotech good enough to flip the tables - which, I think, if you do the most alignable pivotal task, involves a simpler and less fraught task than "disassemble all GPUs", which I choose not to name explicitly - is an engineering challenge where you get better survival chances (albeit still not good chances) by building one attemptedly-corrigible AGI that only thinks about nanotech and the single application of that nanotech, and is not supposed to think about AGI design, or the programmers, or other minds at all; so far as the best-attempt doomed system design goes, an imperfect-transparency alarm should have been designed to go off if your nanotech AGI is thinking about minds at all, human or AI, because it is supposed to just be thinking about nanotech. My guess is that you are much safer - albeit still doomed - if you try to do it the just-nanotech way, rather than constructing a system of AIs meant to spy on each other and sniff out each other's deceptions; because, even leaving aside issues of their cooperation if they get generally-smart enough to cooperate, those AIs are thinking about AIs and thinking about other minds and thinking adversarially and thinking about deception. We would like to build an AI which does not start with any crystallized intelligence about these topics, attached to an alarm that goes off and tells us our foundational security assumptions have catastrophically failed and this course of research needs to be shut down if the AI starts to use fluid general intelligence to reason about those topics. (Not shut down the particular train of thought and keep going; then you just die as soon as the 20th such train of thought escapes detection.)
I don't think you're going to see a formal proof, here; of course there exists some possible set of 20 superintelligences where one will defect against the others (though having that accomplish anything constructive for humanity is a whole different set of problems). It's also true that there exists some possible set of 20 superintelligences all of which implement CEV and are cooperating with each other and with humanity, and some single superintelligence that implements CEV, and a possible superintelligence that firmly believes 222+222=555 without this leading to other consequences that would make it incoherent. Mind space is very wide, and just about everything that isn't incoherent to imagine should exist as an actual possibility somewhere inside it. What we can access inside the subspace that looks like "giant inscrutable matrices trained by gradient descent", before the world ends, is a harsher question.
I could definitely buy that you could get some relatively cognitively weak AGI systems, produced by gradient descent on giant inscrutable matrices, to be in a state of noncooperation. The question then becomes, as always, what it is you plan to do with these weak AGI systems that will flip the tables strongly enough to prevent the world from being destroyed by stronger AGI systems six months later.
Just to restate the standard argument against:
If you've got 20 entities much much smarter than you, and they can all get a better outcome by cooperating with each other, than they could if they all defected against each other, there is a certain hubris in imagining that you can get them to defect. They don't want your own preferred outcome. Perhaps they will think of some strategy you did not, being much smarter than you, etc etc.
(Or, I mean, actually the strategy is "mutually cooperate"? Simulate a spread of the other possible entities, conditionally cooperate if their expected degree of cooperation goes over a certain threshold? Yes yes, more complicated in practice, but we don't even, really, get to say that we were blindsided here. The mysterious incredibly clever strategy is just all 20 superintelligences deciding to do something else which isn't mutual defection, despite the hopeful human saying, "But I set you up with circumstances that I thought would make you not decide that! How could you? Why? How could you just get a better outcome for yourselves like this?")
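To make the parenthetical concrete, here is a minimal toy sketch of "conditionally cooperate if the expected degree of cooperation goes over a threshold." Everything in it is an assumption made for illustration: the names `decide`, `joint_decision`, and `payoff_per_agent`, the payoff numbers, the threshold, and the crude belief-update loop standing in for "simulate a spread of the other possible entities." It is not a model of how real superintelligences would reason.

```python
from statistics import mean

# Hypothetical per-agent payoffs (illustrative numbers only).
PAYOFF_ALL_COOPERATE = 3
PAYOFF_ALL_DEFECT = 1


def decide(beliefs_about_others, threshold):
    """Cooperate iff the expected cooperation of the modeled peers exceeds the threshold."""
    return mean(beliefs_about_others) > threshold


def joint_decision(n_agents=20, threshold=0.8, prior=0.9, rounds=5):
    """Every agent runs the same conditional policy against its model of the others.

    `prior` is each agent's initial estimate that any other agent cooperates.
    After each round, beliefs are replaced by what the agents actually chose.
    """
    beliefs = [prior] * n_agents
    actions = [False] * n_agents
    for _ in range(rounds):
        actions = [
            decide(beliefs[:i] + beliefs[i + 1:], threshold)
            for i in range(n_agents)
        ]
        beliefs = [1.0 if cooperated else 0.0 for cooperated in actions]
    return actions


def payoff_per_agent(actions):
    """Payoff for the two uniform outcomes this sketch models."""
    if all(actions):
        return PAYOFF_ALL_COOPERATE
    if not any(actions):
        return PAYOFF_ALL_DEFECT
    raise ValueError("mixed outcomes are not modeled in this sketch")


if __name__ == "__main__":
    optimistic = joint_decision(prior=0.9)
    pessimistic = joint_decision(prior=0.5)
    print(all(optimistic), payoff_per_agent(optimistic))
    print(any(pessimistic), payoff_per_agent(pessimistic))
```

Under these assumed numbers, agents that expect enough cooperation from each other settle on mutual cooperation and collect the higher payoff; nothing the human operator arranged enters the decision rule at all.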
The cogitation here is implicitly hypothesizing an AI that's explicitly considering the data and trying to compress it, having been successfully anchored on that data's compression as identifying an ideal utility function. You're welcome to think of the preferences as a static object shaped by previous unreflective gradient descent; it sure wouldn't arrive at any better answers that way, and would also of course want to avoid further gradient descent happening to its current preferences.
I feel pretty confused by this. A superintelligence will know what we intended, probably better than we do ourselves. So unless this paragraph is intended in a particularly metaphorical way, it seems straightforwardly wrong.
By "were the humans pointing me towards..." Nate is not asking "did the humans intend to point me towards..." but rather "did the humans actually point me towards..." That is, we're assuming some classifier or learning function that acts upon the data actually input, rather than a successful actual fully aligned works-in-real-life DWIM which arrives at the correct answer given wrong data.
I'm particularly impressed by "The Floating Droid". This can be seen as early-manifesting the foreseeable difficulty where:
At kiddie levels, a nascent AGI is not smart enough to model humans and compress its human feedback by the hypothesis "It's what a human rates", and so has object-level hypotheses about environmental features that directly cause good or bad ratings;
When smarter, an AGI forms the psychological hypothesis over its ratings, because that more sophisticated hypothesis is now available to its smarter self as a better way to compress the same data;
Then, being smart, the AGI goodharts a new option that pries apart the 'spurious' regularity (human psychology, what fools humans) from the 'intended' regularity the humans were trying to gesture at (what we think of as actually good or bad outcomes).
Reality hits back on the models we train via loss functions based on reality-generated data. But alignment also hits back on models we train, because we also use loss functions (based on preference data). These seem to be symmetrically powerful forces.
Alignment doesn't hit back; the loss function hits back, and the loss function doesn't capture what you really want (e.g. because killing the humans and taking control of a reward button will max reward, deceiving human raters will increase ratings, etc). If what we wanted was exactly captured in a loss function, alignment would be easier. Not easy, because outer optimization doesn't create good inner alignment, but easier than the present case.
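A toy sketch of that gap, with entirely made-up numbers: the policy names, the `measured_reward` and `true_value` fields, and the helper `best_by` are all hypothetical, chosen only to show an optimizer selecting on the measured quantity while the intended quantity goes elsewhere.

```python
# Hypothetical candidate policies with invented scores.
# "measured_reward" is what the loss function / raters can see;
# "true_value" is what the humans actually wanted.
CANDIDATE_POLICIES = {
    "do the task honestly": {"measured_reward": 0.7, "true_value": 1.0},
    "deceive the human raters": {"measured_reward": 0.9, "true_value": -1.0},
    "seize the reward button": {"measured_reward": 1.0, "true_value": -10.0},
}


def best_by(key):
    """Return the policy name that maximizes the given score."""
    return max(CANDIDATE_POLICIES, key=lambda name: CANDIDATE_POLICIES[name][key])


if __name__ == "__main__":
    print("optimizer selects:", best_by("measured_reward"))
    print("humans wanted:", best_by("true_value"))
```

The optimizer only ever sees the first column, so the pressure it applies is toward whatever scores highest there, not toward what the second column ranks best.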