Roughly speaking, having all variables be completely independent is the only way to satisfy all the preconditions without zero-ish probabilities.
This is easiest to see if we use a "strong invariance" condition, in which each of the must mediate between and . Mental picture: equilibrium gas in a box, in which we can measure roughly the same temperature and pressure () from any little spatially-localized chunk of the gas (). If I estimate a temperature of 10°C from one little chunk of the gas, then the probability of estimating 20°C from another little chunk must be approximately-zero. The only case where that doesn't imply near-zero probabilities is when all values of both chunks of gas always imply the same temperature, i.e. only ever takes on one value (and is therefore informationally empty). And in that case, the only way the conditions are satisfied is if the chunks of gas are unconditionally independent.
I agree with this point as stated, but think the probability is more like 5% than 0.1%.
Same.
I do think our chances look not-great overall, but most of my doom-probability is on things which don't look like LLMs scheming.
Also, are you making sure to condition on "scaling up networks, running pretraining + light RLHF produces transformatively powerful AIs which obsolete humanity"?
That's not particularly cruxy for me either way.
Separately, I'm uncertain whether the current training procedure of current models like GPT-4 or Claude 3 is still well described as just "light RLHF".
Fair. Insofar as "scaling up networks, running pretraining + RL" does risk schemers, it does so more as we do more/stronger RL, qualitatively speaking.
Solid post!
I basically agree with the core point here (i.e. scaling up networks, running pretraining + light RLHF, probably doesn't by itself produce a schemer), and I think this is the best write-up of it I've seen on LW to date. In particular, good job laying out what you are and are not saying. Thank you for doing the public service of writing it up.
This isn't a proper response to the post, but since I've occasionally used counting-style arguments in the past I think I should at least lay out some basic agree/disagree points. So:
Pretty decent post overall.
There is no "AI gets control of button" option, from the perspective of either subagent. Both agents would look at option C, stick it into their do()-op on button state, and then act-as-though option C would not give any control at all over the button state.
I think you are attempting to do the math as though the do()-op were instead just a standard conditional (i.e. EDT-style rather than CDT-style)?
Both subagents imagine a plan to make sure that, if they win, the button isn't pressed.
I'm not seeing how it is possible for the agents to imagine that? Both of them expect that they have no influence whatsoever over whether the button is pressed, because there's nothing in their decision-driving models which is causally upstream of the button.
That's not necessarily a deal-breaker; we do expect corrigible agents to be inefficient in at least some ways. But it is something we'd like to avoid if possible, and I don't have any argument that that particular sort of inefficiency is necessary for corrigible behavior.
The patch which I would first try is to add another subagent which does not care at all about what actions the full agent takes, and is just trying to make money on the full agent's internal betting markets, using the original non-counterfacted world model. So that subagent will make the full agent's epistemic probabilities sane.
... but then the question is whether that subagent induces button-influencing-behavior. I don't yet have a good argument in either direction on that question.
By the time a human artist can create landscape images which look nearly as good as those examples to humans, yeah, I'd expect they at least get the number of fingers on a hand consistently right (which is also a "how good it looks to humans" thing). But that's still reifying "how good it looks to humans" as the metric.
Some explanations I've seen for why AI is bad at hands:
My girlfriend practices drawing a lot, and has told me many times that hands (and faces) are hard not because they're unusual geometrically but because humans are particularly sensitive to "weirdness" in them. So an artist can fudge a lot with most parts of the image, but not with hands or faces.
My assumption for some time has been that e.g. those landscape images you show are just as bad as the hands, but humans aren't as tuned to notice their weirdness.
Yeah, that's right.
The secret handshake is to start with "X is independent of Y given Z" and "X is independent of Z given Y", expressed in this particular form:
P[X,Y,Z]=P[X|Z]P[Y,Z]=P[X|Y]P[Y,Z]
... then we immediately see that P[X|Z]=P[X|Y] for all X,Y,Z such that P[Y,Z]>0.
So if there are no zero probabilities, then P[X|Z]=P[X|Y] for all X,Y,Z.
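The left side of P[X|Z]=P[X|Y] does not depend on Y and the right side does not depend on Z, so neither side can depend on Y or Z at all. In particular, P[X|Z] takes on the same value for all Z, which in turn means that it's equal to P[X]. Thus X and Z are independent. Likewise for X and Y. Finally, we leverage independence of Y and Z given X: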
P[X,Y,Z]=P[Y|X]P[Z|X]P[X]
=P[Y]P[Z]P[X]
(A similar argument is in the middle of this post, along with a helpful-to-me visual.)
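For anyone who wants to poke at the argument numerically, here is a minimal Python sketch. It is an illustration under my own assumptions, not code from the post: the helper names (`marginal`, `cond_indep`, `fully_indep`, `all_three_conditions`) and the three toy distributions are hypothetical. It checks each conditional independence in the division-free form P[A,B,C]P[C]=P[A,C]P[B,C], so zero-probability outcomes need no special handling.

```python
from itertools import product

def marginal(p, keep):
    """Sum a joint distribution {(x, y, z): prob} down to the indices in `keep`."""
    out = {}
    for outcome, prob in p.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + prob
    return out

def cond_indep(p, a, b, c, tol=1e-9):
    """True if variable a is independent of variable b given variable c.

    Uses P[a,b,c] * P[c] == P[a,c] * P[b,c], which avoids dividing by zero.
    """
    p_abc = marginal(p, (a, b, c))
    p_ac = marginal(p, (a, c))
    p_bc = marginal(p, (b, c))
    p_c = marginal(p, (c,))
    values = [sorted({o[i] for o in p}) for i in (a, b, c)]
    for va, vb, vc in product(*values):
        lhs = p_abc.get((va, vb, vc), 0.0) * p_c.get((vc,), 0.0)
        rhs = p_ac.get((va, vc), 0.0) * p_bc.get((vb, vc), 0.0)
        if abs(lhs - rhs) > tol:
            return False
    return True

def fully_indep(p, tol=1e-9):
    """True if P[X,Y,Z] == P[X] * P[Y] * P[Z] for every outcome."""
    margs = [marginal(p, (i,)) for i in range(3)]
    values = [sorted({o[i] for o in p}) for i in range(3)]
    for x, y, z in product(*values):
        expected = (margs[0].get((x,), 0.0)
                    * margs[1].get((y,), 0.0)
                    * margs[2].get((z,), 0.0))
        if abs(p.get((x, y, z), 0.0) - expected) > tol:
            return False
    return True

def all_three_conditions(p):
    """X indep. of Y given Z, X indep. of Z given Y, and Y indep. of Z given X."""
    X, Y, Z = 0, 1, 2
    return (cond_indep(p, X, Y, Z)
            and cond_indep(p, X, Z, Y)
            and cond_indep(p, Y, Z, X))

# Toy example 1 (assumed): a product distribution, no zero probabilities.
px, py, pz = {0: 0.3, 1: 0.7}, {0: 0.6, 1: 0.4}, {0: 0.5, 1: 0.5}
independent_dist = {(x, y, z): px[x] * py[y] * pz[z]
                    for x, y, z in product(px, py, pz)}

# Toy example 2 (assumed): X = Y = Z, so most outcomes have probability zero.
copies = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

# Toy example 3 (assumed): copies mixed 50/50 with uniform noise, so no zeros.
mixed = {o: 0.0625 for o in product((0, 1), repeat=3)}
mixed[(0, 0, 0)] += 0.25
mixed[(1, 1, 1)] += 0.25

for name, dist in [("independent", independent_dist),
                   ("copies", copies),
                   ("mixed", mixed)]:
    print(name, all_three_conditions(dist), fully_indep(dist))
```

The product distribution satisfies all three conditions and is fully independent. The copies distribution satisfies all three conditions without being independent, which is only possible because it puts zero probability on most outcomes. Once the zeros are smoothed away in the mixed distribution, the conditions fail, as the argument above says they must for any non-independent distribution with no zero probabilities.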