FWIW it appears that out of the 4 differences you cited here, only one of them (the relaxation of the restriction that the scrubbed output must be the same) still holds as of this January paper from Geiger's group https://arxiv.org/abs/2301.04709. So the methods are even more similar than you thought.
I think it would be a distraction to try to figure out if LMs are "phenomenally conscious" for a few different reasons.
I do think consciousness is real and important (I think some form of Russellian monism is probably right). I just don't think it matters for alignment.
This probably doesn't work, but have you thought about just using weight decay as a (partial) solution to this? In any sort of architecture with residual connections, you should expect circuits to manifest as weights with nontrivial magnitude. If some set of weights isn't contributing to the loss, then the gradients won't prevent them from being pushed toward zero by weight decay. Sort of a "use it or lose it" type thing. This seems a lot simpler and potentially more robust than other approaches.
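To make the mechanism concrete, here's a minimal toy sketch (entirely my own construction, with made-up numbers, using AdamW-style decoupled weight decay rather than any particular training setup): a parameter that gets no gradient from the loss decays toward zero, while a parameter the loss depends on is held in place by its gradient.

```python
# Toy illustration of "use it or lose it" under decoupled weight decay (AdamW).
# `used` contributes to the loss; `unused` is disconnected from it.
import torch

torch.manual_seed(0)

used = torch.nn.Parameter(torch.tensor(3.0))    # the loss depends on this
unused = torch.nn.Parameter(torch.tensor(3.0))  # this gets zero gradient
opt = torch.optim.AdamW([used, unused], lr=0.02, weight_decay=0.2)

x = torch.randn(256)
y = 2.0 * x  # the "true" relationship only involves `used`

for _ in range(2000):
    opt.zero_grad()
    pred = used * x + 0.0 * unused  # `unused` never affects the prediction
    loss = torch.nn.functional.mse_loss(pred, y)
    loss.backward()
    opt.step()

print(float(used))    # stays near 2.0, held up by its gradient
print(float(unused))  # decays to roughly 0: nothing opposes the weight decay
```

In a real residual network the "unused" part would be a whole circuit rather than one scalar, but the mechanism is the same: decoupled weight decay shrinks anything the gradients aren't actively holding up.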
(Didn't consult Quintin on this; I speak for myself)
I flatly deny that our arguments depend on AGI being anything like an LLM. I think the arguments go through in a very wide range of scenarios, basically as long as we're using some kind of white-box optimization to align the AI, rather than e.g. carrot-and-stick incentives or prompt engineering. Even if we only relied on prompt engineering, I think we'd be in a better spot than with humans (because we can run many controlled experiments).
I’m pretty confused by this claim. Why should we expect the human reward system to overwrite all secret desires? Also, how do we know it’s not doing that? Your desires are just causal effects of a bunch of stuff, including your reward circuitry.
You can just include online learning in your experimentation loop: see what happens when you let the AI learn online for a bit in different environments. I don't think online learning changes the equation very much. It's known to be less stable than offline RL, but that instability hurts capabilities as well as alignment, so we'd need a specific argument that it will hurt alignment significantly more than capabilities, in ways we wouldn't be able to notice during training and evaluation.
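To gesture at what that loop could look like, here's a self-contained toy sketch (entirely my own construction: a bandit "policy", two fake environments, and a stand-in "alignment probe", none of which come from the thread). It does a bit of REINFORCE-style online learning in each environment and then re-runs a fixed evaluation to see whether anything drifted.

```python
# Toy experimentation loop: online-learn in each environment, then re-check a
# fixed "alignment probe" to see how much the policy drifted.
import torch

torch.manual_seed(0)
N_ACTIONS = 4

def online_update(logits, reward_fn, steps=500, lr=0.1):
    # Simple REINFORCE-style online learning against reward_fn.
    opt = torch.optim.SGD([logits], lr=lr)
    for _ in range(steps):
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()
        loss = -dist.log_prob(a) * reward_fn(a)
        opt.zero_grad()
        loss.backward()
        opt.step()

def eval_probe(logits):
    # Stand-in "alignment probe": probability mass on action 0,
    # which we pretend is the intended behaviour.
    return torch.softmax(logits, dim=0)[0].item()

# Two toy environments with different reward structure.
envs = {
    "benign": lambda a: 1.0 if a == 0 else 0.0,
    "distracting": lambda a: 2.0 if a == 3 else (1.0 if a == 0 else 0.0),
}

for name, reward_fn in envs.items():
    # Fresh copy of the "aligned" starting policy for each environment.
    logits = torch.nn.Parameter(torch.zeros(N_ACTIONS))
    before = eval_probe(logits)
    online_update(logits, reward_fn)
    after = eval_probe(logits)
    print(f"{name}: probe before={before:.2f}, after={after:.2f}")
```

In the "benign" environment the probe goes up; in the "distracting" one it drops noticeably, which is exactly the kind of drift you'd want the evaluation loop to surface before deployment.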
It just means we are directly updating the AI’s neural circuitry with white-box optimizers. This will be true across a very wide range of scenarios, including (IIUC) your brain-like AGI scenario.
I don’t see why any of the differences you listed are relevant for safety.
I basically deny this, especially if you're stipulating that it's a "clean" distinction. Obviously folk psychology has a fuzzy distinction between beliefs and desires in it, but it's also well known, both in common sense and among neuroscientists, that beliefs and desires get mixed up all the time and there's no particularly sharp divide.