[ Question ]

Do mesa-optimizer risk arguments rely on the train-test paradigm?

by Ben Cottier · 1 min read · 10th Sep 2020 · 6 comments



Going by the Risks from Learned Optimization sequence, it's not clear if mesa-optimization is a big threat if the model continues to be updated throughout deployment. I suspect this has been discussed before (links welcome), but I didn't find anything with a quick search.

Lifelong/online/continual learning is popular and could be the norm in future. I'm interested in how that (and other learning paradigms, if relevant) fits into beliefs about mesa-optimization risk.

If you believe the arguments hold up under a lifelong learning paradigm: is that because there could still be enough time between updates for the mesa-optimizer to defect, or some other reason? If you believe the train-test paradigm is likely to stick around, why is that?


3 Answers

I don't think that doing online learning changes the analysis much at all.

As a simple transformation, any online learning setup at time step $t$ is equivalent to training on steps $1$ to $t-1$ and then deploying on step $t$. Thus, online learning for $n$ steps won't reduce the probability of pseudo-alignment any more than training for $n$ steps will, because there isn't any real difference between the two. The only difference is that we generally think of the training environment as sandboxed and the online learning environment as not sandboxed, but that just makes online learning more dangerous than training.
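To make the equivalence concrete, here is a minimal sketch under assumptions that are not in the answer itself: a one-parameter linear model, squared-error loss, and plain SGD. All function names are hypothetical. It only checks that the action an online learner takes at step t matches the action of a model trained on steps 1 to t-1 and then deployed frozen on step t; it says nothing about sandboxing.

```python
import random

def sgd_update(w, x, y, lr=0.1):
    """One gradient step on squared error for a 1-D linear model y ~ w * x."""
    grad = 2 * (w * x - y) * x
    return w - lr * grad

def online_actions(stream, w0=0.0):
    """Online learning: act at each step, then update on that step's feedback."""
    w, actions = w0, []
    for x, y in stream:
        actions.append(w * x)      # action taken with the current parameters
        w = sgd_update(w, x, y)    # the update happens only after acting
    return actions

def train_then_deploy(stream, t, w0=0.0):
    """Train on steps 1..t-1, then deploy frozen parameters on step t (1-indexed)."""
    w = w0
    for x, y in stream[: t - 1]:
        w = sgd_update(w, x, y)
    x_t, _ = stream[t - 1]
    return w * x_t

# Toy data stream (an assumption for illustration): targets follow y = 3x.
rng = random.Random(0)
stream = [(x, 3.0 * x) for x in (rng.uniform(-1, 1) for _ in range(20))]

online = online_actions(stream)
# Compare the online learner's action at every step t with train-then-deploy at t.
matches = all(online[t - 1] == train_then_deploy(stream, t) for t in range(1, 21))
print(matches)
```

In this sketch the update at step t can only affect actions from step t+1 onward, so whatever the model does at step t is fixed by what it learned beforehand.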

You might argue that doing online learning will make a difference after time step $t$, because if the model does something catastrophic at time step $t$ then online learning can modify it to not do that in the future. But that's always true: what it means for an outcome to be catastrophic is that it's unrecoverable. There are always things that we can try to do after our model starts behaving badly to rein it in. Where we have a problem is when it does something so bad that those methods won't work. Fundamentally, the problem of inner alignment is a problem of worst-case guarantees, and modifying the model after it takes its action doesn't help if that action already has the potential to be arbitrarily bad.

Suppose you are a mesa-optimiser. You want X, but your training objective pushes you towards Y.

You know you are in a continual learning setup: if you don't produce enough Y, gradient descent will modify you into something else.

The best strategy I can come up with is to produce Y at first, waiting until you have an opportunity to edit your own code from the outside. Then you quickly delete the part that causes all gradient descent updates, and pursue X.

The problem arises whenever the environment changes. Natural selection was a continual process, and yet humans still aren't fitness-aligned.