I don't think that doing online learning changes the analysis much at all.
As a simple transformation, any online learning setup at time step t is equivalent to training on steps 1 to t - 1 and then deploying on step t. Thus, online learning for t steps won't reduce the probability of pseudo-alignment any more than training for t steps will, because there isn't any real difference between the two. The only difference is that we generally think of the training environment as being sandboxed and the online learning environment as not being sandboxed, but that just makes online learning more dangerous than training.
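To make the equivalence concrete, here is a minimal sketch, assuming a toy one-parameter model trained by gradient descent on squared error. The model, the data stream, and the function names are all hypothetical and chosen only for illustration. It checks that the action an online learner takes at step t is exactly the action of a model trained on steps 1 to t - 1 and then deployed on step t.

```python
def sgd_update(w, x, y, lr=0.1):
    # One gradient descent step on squared error for the toy model y_hat = w * x.
    grad = 2 * (w * x - y) * x
    return w - lr * grad


def online_learning(stream, w0=0.0):
    """Act on each step with the current weight, then update on that step."""
    w, actions = w0, []
    for x, y in stream:
        actions.append(w * x)    # the model acts before any feedback arrives
        w = sgd_update(w, x, y)  # feedback can only change future steps
    return actions


def train_then_deploy(stream, t, w0=0.0):
    """Train on steps 1..t-1, then act (with no further update) on step t."""
    w = w0
    for x, y in stream[: t - 1]:
        w = sgd_update(w, x, y)
    x_t, _ = stream[t - 1]
    return w * x_t


# Hypothetical data stream of (input, target) pairs.
stream = [(1.0, 2.0), (2.0, 3.5), (0.5, 1.0), (3.0, 6.5)]
actions = online_learning(stream)
for t in range(1, len(stream) + 1):
    # The step-t action is identical under both framings.
    assert actions[t - 1] == train_then_deploy(stream, t)
```

In this sketch the update at step t only ever affects actions from step t + 1 onward, so whatever the model does at step t has already happened by the time the correction is applied.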
You might argue that doing online learning will make a difference after time step t, because if the model does something catastrophic at time step t then online learning can modify it to not do that in the future. But that's always true: what it means for an outcome to be catastrophic is that it's unrecoverable. There are always things that we can try to do after our model starts behaving badly to rein it in; where we have a problem is when it does something so bad that those methods won't work. Fundamentally, the problem of inner alignment is a problem of worst-case guarantees, and doing some modification to the model after it takes its action doesn't help if that action already has the potential to be arbitrarily bad.
Suppose you are a mesa-optimiser. You want X, but your training objective pushes you toward Y.
You know you are in a continual learning setup: if you don't produce enough Y, gradient descent will modify you into something else.
The best strategy I can come up with is to produce Y at first, waiting until you have an opportunity to edit your own code from the outside. Then you quickly delete the part that applies the gradient descent updates, and pursue X.
The problem arises whenever the environment changes. Natural selection was a continual process, and yet humans still aren't fitness-aligned.