Going by the Risks from Learned Optimization sequence, it's not clear if mesa-optimization is a big threat if the model continues to be updated throughout deployment. I suspect this has been discussed before (links welcome), but I didn't find anything with a quick search.

Lifelong/online/continual learning is popular and could be the norm in future. I'm interested in how that (and other learning paradigms, if relevant) fits into beliefs about mesa-optimization risk.

If you believe the arguments hold up under a lifelong learning paradigm: is that because there could still be enough time between updates for the mesa-optimizer to defect, or some other reason? If you believe the train-test paradigm is likely to stick around, why is that?


I don't think that doing online learning changes the analysis much at all.

As a simple transformation, any online learning setup at time step t is equivalent to training on steps 1 to t-1 and then deploying on step t. Thus, online learning for n steps won't reduce the probability of pseudo-alignment any more than training for n steps will, because there isn't any real difference between online learning for n steps and training for n steps. The only difference is that we generally think of the training environment as being sandboxed and the online learning environment as not being sandboxed, but that just makes online learning more dangerous than training.
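To make the equivalence concrete, here is a minimal sketch, assuming a toy one-parameter model trained by SGD on squared error. The model, the data stream, and every function name are hypothetical illustrations rather than anything from the sequence. It checks that the action an online learner takes at step t is exactly the action of a model trained on steps 1 to t-1 and then deployed frozen on step t.

```python
def sgd_update(w, x, y, lr):
    """One gradient step on squared error for the toy model y_hat = w * x."""
    grad = 2 * (w * x - y) * x
    return w - lr * grad


def online_predictions(stream, lr=0.1, w0=0.0):
    """Online learning: act on each step first, then update on that step."""
    w, preds = w0, []
    for x, y in stream:
        preds.append(w * x)          # the action taken at this step
        w = sgd_update(w, x, y, lr)  # the update only affects later steps
    return preds


def train_then_deploy(stream, t, lr=0.1, w0=0.0):
    """Train on steps 1..t-1, freeze the model, then act on step t (1-indexed)."""
    w = w0
    for x, y in stream[: t - 1]:
        w = sgd_update(w, x, y, lr)
    x_t, _ = stream[t - 1]
    return w * x_t


# Hypothetical data stream of (input, target) pairs.
stream = [(1.0, 2.0), (2.0, 4.1), (0.5, 0.9), (1.5, 3.2)]
online = online_predictions(stream)
matches = [
    online[t - 1] == train_then_deploy(stream, t)
    for t in range(1, len(stream) + 1)
]
print(all(matches))
```

The update at step t can only change what the model does from step t+1 onward, which is why it offers no protection against the action taken at step t itself.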

You might argue that the fact that you're doing online learning will make a difference after time step t, because if the model does something catastrophic at time step t then online learning can modify it to not do that in the future. But that's always true: what it means for an outcome to be catastrophic is that it's unrecoverable. There are always things that we can try to do after our model starts behaving badly to rein it in; where we have a problem is when it does something so bad that those methods won't work. Fundamentally, the problem of inner alignment is a problem of worst-case guarantees, and doing some modification to the model after it takes its action doesn't help if that action already has the potential to be arbitrarily bad.

Thanks. I think I understand, but I'm still confused about the effect on the risk of catastrophe (i.e. not just being pseudo-aligned, but having a catastrophic real-world effect). It may help to clarify that I was mainly thinking of deceptive alignment, not other types of pseudo-alignment. And I'll admit now that I phrased the question more strongly than I actually believe, to elicit more response :)

I agree that the probability of pseudo-alignment will be the same, and that an unrecoverable action could occur despite the threat of modification. I'm interested i... (read more)

Evan Hubinger (3y)
I agree with all of this: online learning doesn't change the probability of pseudo-alignment but might make it harder for a deceptively aligned model to defect. That being said, I don't think that deceptive models defecting later is necessarily a good thing. If your deceptive models start defecting sooner, but in recoverable ways, that's actually good because it gives you a warning shot. And once you have a deceptive model, it's going to try to defect against you at some point, even if it just has to gamble and defect randomly with some probability.

First, I do think that worst-case guarantees are achievable if we do relaxed adversarial training with transparency tools.

Second, I actually have done a bunch of probabilistic risk analysis on exactly this sort of situation here. Note, however, that the i.i.d. situation imagined in that analysis is extremely hard to realize in practice, as there are fundamental distributional shifts that are very difficult to overcome, such as the distributional shift from a situation where the model can't defect profitably to a situation where it can.
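As a rough illustration of the "defect randomly with some probability" gamble, here is a minimal sketch under the i.i.d. assumption that the comment itself flags as unrealistic. It is not the linked analysis; the function name and the per-step defection probability p are hypothetical. If a model defects independently with probability p at each step, the chance that it defects at least once in n steps is 1 - (1 - p)^n.

```python
def prob_defects_at_least_once(p, n):
    """Chance of at least one defection in n independent steps,
    assuming a fixed per-step defection probability p (i.i.d. assumption)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


# Hypothetical numbers: a small per-step probability compounds over many steps.
for n in (10, 100, 1000):
    print(n, round(prob_defects_at_least_once(0.01, n), 3))
```

Under these assumptions, even a small p makes eventual defection very likely over a long deployment, whether or not the model keeps being updated in between.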

You don't even need a catastrophe in any global sense. Disrupting the training procedure at step t should be sufficient.

The problem arises whenever the environment changes. Natural selection was a continual process, and yet humans still aren't fitness-aligned.

Suppose you are a mesa-optimiser. You want X, but your training objective pushes you towards Y.

You know you are in a continual learning setup: if you don't produce enough Y, gradient descent will modify you into something else.

The best strategy I can come up with is to produce Y at first, while waiting for an opportunity to edit your own code from the outside. Then you quickly delete the part that applies the gradient descent updates, and pursue X.