(Didn't consult Quintin on this; I speak for myself)
I flatly deny that our arguments depend on AGI being anything like an LLM. I think the arguments go through in a very wide range of scenarios, basically as long as we're using some kind of white-box optimization to align them, rather than e.g. carrot-and-stick incentives or prompt engineering. Even if we only relied on prompt engineering, I think we'd be in a better spot than with humans (because we can run many controlled experiments).
A human can harbor a secret desire for years, never acting on it, and their brain won’t necessarily overwrite that desire, even as they think millions of thoughts in the meantime. So evidently, the argument above is inapplicable to human brains.
I’m pretty confused by this claim. Why should we expect the human reward system to overwrite all secret desires? Also how do we know it’s not doing that? Your desires are just causal effects of a bunch of stuff including your reward circuitry.
As a human, I can be sitting in bed, staring into space, and I can think a specific abstruse thought about string theory, and now I’ve figured out something important. If a future AI can do that kind of thing, as I expect, then it’s not so clear that “controlling the AI’s sensory environment” is really all that much control.
That sounds nice, but brain-like AGI (like most RL agents) does online learning. So if you run a bunch of experiments, then as soon as the AGI does anything whatsoever (e.g. reads the morning newspaper), your experiments are all invalid (or at least, open to question), because now your AGI is different than it was before
You can just include online learning in your experimentation loop. See what happens when you let the AI online learn for a bit in different environments. I don't think online learning changes the equation very much. It's known to be less stable than offline RL, but that instability hurts capabilities as well as alignment, so we'd need a specific argument that it will hurt alignment significantly more than capabilities, in ways that we wouldn't be able to notice during training and evaluation.
I have no idea how I’m supposed to interpret this sentence ["we are the innate reward system"] for brain-like AGI, such that it makes any sense at all. Actually, I’m not quite sure what it means even for LLMs!
It just means we are directly updating the AI’s neural circuitry with white box optimizers. This will be true across a very wide range of scenarios, including (IIUC) your brain-like AGI scenario.
Brains can imitate, but do so in a fundamentally different way from LLM pretraining
I don’t see why any of the differences you listed are relevant for safety.
Relatedly, brains have a distinction between expectations and desires, cleanly baked into the algorithms. I think this is obvious common sense, leaving aside galaxy-brain Free-Energy-Principle takes which try to deny it.
I basically deny this, especially if you're stipulating that it's a "clean" distinction. Obviously folk psychology has a fuzzy distinction between beliefs and desires in it, but it's also well-known both in common sense and among neuroscientists etc. that beliefs and desires get mixed up all the time and there's not a particularly sharp divide.
FWIW it appears that out of the 4 differences you cited here, only one of them (the relaxation of the restriction that the scrubbed output must be the same) still holds as of this January paper from Geiger's group https://arxiv.org/abs/2301.04709. So the methods are even more similar than you thought.
I think it would be a distraction to try to figure out if LMs are "phenomenally conscious" for a few different reasons.
I do think consciousness is real and important (I think some form of Russellian monism is probably right). I just don't think it matters for alignment.
This probably doesn't work, but have you thought about just using weight decay as a (partial) solution to this? In any sort of architecture with residual connections you should expect circuits to manifest as weights with nontrivial magnitude. If some set of weights isn't contributing to the loss then the gradients won't prevent them from being pushed toward zero by weight decay. Sort of a "use it or lose it" type thing. This seems a lot simpler and potentially more robust than other approaches.