A possibly helpful - because starker - hypothetical training approach for thinking about these arguments is to make an instance of the imitatee that has all their (at least cognitive) actions sped up by some large factor (e.g. 100x), e.g., via brain emulation (or just "by magic" for the purpose of the hypothetical).
It means f(x) = 1 is true for some particular x's, e.g., f(x_1) = 1 and f(x_2) = 1, there are distinct mechanisms for why f(x_1) = 1 compared to why f(x_2) = 1, and there's no efficient discriminator that can take two instances f(x_1) = 1 and f(x_2) = 1 and tell you whether they are due to the same mechanism or not.
(Bold direct claims, not super confident - criticism welcome.)
The approach to ELK in this post is unfalsifiable.
A counterexample to the approach would need to be a test-time situation in which:
Points 2 and 3 are in direct conflict: the predictor knowing that the diamond is unsafe rules out the usual explanation for the safe-looking predictions.
So now I’m unclear what progress has been made. This looks like simply defining “the predictor knows P” as “there is a mechanistic explanation of the outputs starting from an assumption of P in the predictor’s world model”, then declaring ELK solved by noting we can search over and compare mechanistic explanations.
I think you're right - thanks for this! It makes sense now that I recognise the quote was in a section titled "Alignment research can only be done by AI systems that are too dangerous to run".
“We can compute the probability that a cell is alive at timestep 1 if each of it and each of its 8 neighbors is alive independently with probability 10% at timestep 0.”
We the readers (or I guess specifically the heuristic argument itself) can do this, but the “scientists” cannot, because the “scientists don’t know how the game of life works”.
Do the scientists ever need to know how the game of life works, or can the heuristic arguments they find remain entirely opaque?
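For concreteness, here is a minimal sketch of the computation described in the first quote, assuming the standard Conway rules (a live cell survives with 2 or 3 live neighbors, a dead cell becomes alive with exactly 3). The function names are mine, not from the post. It is the kind of calculation that requires already knowing the update rule.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_alive_next(p: float) -> float:
    """P(cell alive at timestep 1) when the cell and its 8 neighbors
    are each alive independently with probability p at timestep 0.

    Assumes the standard Conway rules: survive on 2 or 3 live
    neighbors, birth on exactly 3 live neighbors.
    """
    n = 8  # number of neighbors
    survive = p * (binom_pmf(2, n, p) + binom_pmf(3, n, p))
    born = (1 - p) * binom_pmf(3, n, p)
    return survive + born


print(round(prob_alive_next(0.10), 4))  # 0.0479
```

So with 10% initial density the cell is alive at timestep 1 with probability of roughly 4.8%.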
Another thing confusing to me along these lines:
“for example they may have noticed that A-B patterns are more likely when there are fewer live cells in the area of A and B”
Where do they (the scientists) notice these fewer live cells? Do they have some deep interpretability technique for examining the generative model and "seeing" its grid of cells?
They have a strong belief that in order to do good alignment research, you need to be good at “consequentialist reasoning,” i.e. model-based planning, that allows creatively figuring out paths to achieve goals.
I think this is a misunderstanding, and that approximately zero MIRI-adjacent researchers hold this belief (that good alignment research must be the product of good consequentialist reasoning). What seems more true to me is that they believe that better understanding consequentialist reasoning -- e.g., where to expect it to be instantiated, what form it takes, how/why it "works" -- is potentially highly relevant to alignment.
I'm focusing on the code in Appendix B.
What happens when self.diamondShard's assessment of whether some consequences contain diamonds differs from ours? (Assume the agent's world model is especially good.)
“upweights actions and plans that lead to”
How is it determined what the actions and plans lead to?
Interesting - it's not so obvious to me that it's safe. Maybe it is, because avoiding POUDA is such a low bar. But the sped-up human can do the reflection thing, and plausibly with enough speedup can be superintelligent wrt everyone else.