Going by the Risks from Learned Optimization sequence, it's not clear if mesa-optimization is a big threat if the model continues to be updated throughout deployment. I suspect this has been discussed before (links welcome), but I didn't find anything with a quick search.
Lifelong/online/continual learning is popular and could be the norm in future. I'm interested in how that (and other learning paradigms, if relevant) fits into beliefs about mesa-optimization risk.
If you believe the arguments hold up under a lifelong learning paradigm: is that because there could still be enough time between updates for the mesa-optimizer to defect, or some other reason? If you believe the train-test paradigm is likely to stick around, why is that?
I don't think that doing online learning changes the analysis much at all.
As a simple transformation, any online learning setup at time step t is equivalent to training on steps 1 to t−1 and then deploying on step t. Thus, online learning for n steps won't reduce the probability of pseudo-alignment any more than training for n steps will, because there isn't any real difference between online learning for n steps and training for n steps. The only difference is that we generally think of the training environment as sandboxed and the online learning environment as not sandboxed, but that just makes online learning more dangerous than training.
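The equivalence claimed above can be made concrete with a toy sketch: a 1-D linear model updated by SGD on a data stream. The weights an online learner uses to act at step t are exactly the weights produced by training on steps 1 to t−1 and then deploying. (This setup, `sgd_step`, and the dataset are all illustrative choices, not anything from the sequence.)

```python
import random

def sgd_step(w, x, y, lr=0.1):
    # One gradient step on squared error for a 1-D linear model y ≈ w * x.
    pred = w * x
    return w - lr * 2 * (pred - y) * x

random.seed(0)
stream = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(50)]

# Online learning: at each step, act with the current weights, then update.
w = 0.0
acting_weights = []  # weights in use at each step, before that step's update
for x, y in stream:
    acting_weights.append(w)
    w = sgd_step(w, x, y)

# "Train then deploy": train on steps 1..t-1, then deploy (act, no update) at step t.
t = len(stream)
w_trained = 0.0
for x, y in stream[: t - 1]:
    w_trained = sgd_step(w_trained, x, y)

# The online learner's acting weights at step t are identical to the
# train-then-deploy weights, since both ran the same updates in the same order.
assert acting_weights[t - 1] == w_trained
```

Nothing hinges on the model being linear or the updates being SGD; the point is just that "online learning up to step t" and "training on steps 1 to t−1, then deploying" execute the same sequence of updates.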
You might argue that the fact that you're doing online learning will make a difference after time step t because if the model does something catastrophic at time step t then online learning can modify it to not do that in the future—but that's always true: what it means for an outcome to be catastrophic is that it's unrecoverable. There are always things that we can try to do after our model starts behaving badly to rein it in—where we have a problem is when it does something so bad that those methods won't work. Fundamentally, the problem of inner alignment is a problem of worst-case guarantees—and doing some modification to the model after it takes its action doesn't help if that action already has the potential to be arbitrarily bad.
I agree with all of this—online learning doesn't change the probability of pseudo-alignment but might m…