I think I'm missing something with the Löb's theorem example.
If "A = 5 → U = 5" can be proved under the theorem, then can't "A = 10 → U = 10" also be proved? What's the cause of the asymmetry that privileges taking $5 in all scenarios where you're allowed to search for proofs for a long time?
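(For concreteness, here's the version of the argument I have in mind — the standard 5-and-10 framing, simplified to an agent that takes the $5 whenever it proves the first implication; the post's exact setup may differ:)

$$
\begin{aligned}
&P := (A = 5 \rightarrow U = 5)\\
&\square P \;\Rightarrow\; \text{the agent finds the proof and takes the \$5} \;\Rightarrow\; A = 5 \wedge U = 5 \;\Rightarrow\; P\\
&\text{so } \square P \rightarrow P \text{ is provable, and Löb's theorem then gives a proof of } P \text{ itself,}\\
&\text{so the agent takes the \$5 regardless of what taking the \$10 would have yielded.}
\end{aligned}
$$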
While reading through the report I made a lot of notes about stuff that wasn't clear to me, so I'm copying here the ones that weren't resolved after finishing it. Since they were written while reading, a lot of these may be either obvious or nitpick-y.
Footnote 14, page 15:
Can we make the assumption that defeating the method allows the AI to get a better loss, since it's effectively wireheading at that point? If so, then wouldn't a realistic messy AI learn a Bayes net once it had >= 10^9 parameters? In other words, are there reasons beyond performance that preclude an AI from learning a single monolithic model?
Footnote 33, page 30 (under the heading "Strategy: have AI help humans improve our understanding"):
I realize this is only one possible example of how we might implement this, but wouldn't a training procedure that explicitly involves humans be very uncompetitive? The strategy described in the main text sounds like an AI assistant that automates science well enough to impart all of the predictor's knowledge to us, which wouldn't run into this issue.
Footnote 48 to this paragraph on page 36:
I might be reading too much into this, but I don't understand the basis of this claim. Is it that the correspondence differs only at the low-level? If so, I still don't see how science outcompetes gradient descent.
Page 51, under the heading "[ELK] may be sufficient for building a worst-case solution to outer alignment":
I haven't thoroughly read the article on amplification, so this question may be trivial, but my understanding is that amplified humans are more or less equivalent to humans with AI-trained Bayes nets. If that's true, doesn't this require the assumption that tasks will always have a clean divide between the qualitative (the taste of the cakes), which we can match with an amplified human, and the quantitative (the number of cakes produced per hour), which we can't? That feels like a safe assumption to make, but I'm not entirely sure.
Page 58, in the list of features suggesting that M(x) knew that A' was the better answer:
First, why is the first point necessary to suggest that M(x) knew that A' was the better answer? Second, how are the last two points different?
Page 69, under "Can your AI model this crazy sequence of delegation?":
We need the AI to model this sequence of delegation with a much smaller margin of error than it needs when reasoning about acquiring resources for its future copies. In other words, with a limited amount of computation the AI will still try to reason about acquiring resources for future copies, and could succeed in the absence of other superintelligences because there is no serious opposition; but modelling the delegation with that same limited computation might be dangerous because of the tendency for value drift.
Page 71:
Is it possible for the AI to manipulate the human in world i to be easily satisfied in order to improve the evaluation of world i+1?
Page 73:
As I understand this, z_prior is what the model expects to happen when it sees "action" and "before", z_posterior is what it thinks actually happened after it sees "after", and kl is the difference between the two that we're penalizing it on. What is logprob doing?
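For reference, here's a minimal runnable sketch of how I'm parsing the objective, assuming it's the standard variational (ELBO-style) setup; prior_net, posterior_net, decoder_net and all the shapes are placeholders of mine, not the report's:

```python
import torch
from torch.distributions import Normal, kl_divergence

# Hypothetical stand-ins for the predictor's pieces. Each maps its inputs to
# the parameters of a diagonal Gaussian (over the latent z, or over "after").
prior_net = torch.nn.Linear(2 * 16, 2 * 8)        # (before, action) -> z params
posterior_net = torch.nn.Linear(3 * 16, 2 * 8)    # (before, action, after) -> z params
decoder_net = torch.nn.Linear(8 + 2 * 16, 2 * 16) # (z, before, action) -> "after" params

def gaussian(params):
    mean, log_std = params.chunk(2, dim=-1)
    return Normal(mean, log_std.exp())

def loss(before, action, after):
    # z_prior: what the model expects to happen given only "before" and "action".
    z_prior = gaussian(prior_net(torch.cat([before, action], dim=-1)))
    # z_posterior: what it infers actually happened once it also sees "after".
    z_posterior = gaussian(posterior_net(torch.cat([before, action, after], dim=-1)))
    # kl: penalty for the posterior straying from the prior's expectations.
    kl = kl_divergence(z_posterior, z_prior).sum(-1)
    # logprob: reconstruction term -- how well a latent sampled from the
    # posterior explains the observed "after" (this is the part I'm asking about).
    z = z_posterior.rsample()
    after_dist = gaussian(decoder_net(torch.cat([z, before, action], dim=-1)))
    logprob = after_dist.log_prob(after).sum(-1)
    # Negative ELBO: maximize logprob while keeping the kl small.
    return (kl - logprob).mean()

# Usage with dummy tensors (batch of 4, feature size 16 per input).
before, action, after = (torch.randn(4, 16) for _ in range(3))
print(loss(before, action, after))
```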