Ben Cottier


Deceptive Alignment

In the limit of training on a diverse set of tasks, we expect joint optimization of both the base and mesa- objectives to be unstable. Assuming that the mesa-optimizer converges towards behavior that is optimal from the perspective of the base optimizer, the mesa-optimizer must somehow learn the base objective.

Joint optimization may be unstable, but if the model is not trained to convergence, might it still be jointly optimizing at the end of training? This occurred to me after reading which finds that "Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence." If convergence is becoming less common in practical systems, it's important to think about the implications of that for mesa-optimization.