Inner Alignment

Inner Alignment is the problem of ensuring that mesa-optimizers (i.e. cases where a trained ML system is itself an optimizer) are aligned with the objective function of the training process. As an example, evolution is an optimization process that itself 'designed' optimizers (humans) to achieve its goals. However, humans do not primarily maximise reproductive success; they instead use birth control and then go out and have fun. This is a failure of inner alignment.
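
To make the mismatch concrete, here is a minimal toy sketch (not drawn from any standard formulation; the names pleasure, fitness, and mesa_optimizer are purely illustrative). The agent internally maximizes a proxy objective that coincides with the base objective on the training distribution but comes apart from it under distributional shift:

```python
# Toy illustration of inner misalignment: the base optimizer selects agents by
# one objective ("fitness"), but the selected agent internally optimizes a
# proxy ("pleasure") that only tracks fitness on the training distribution.
# All names and values here are hypothetical, for illustration only.

def pleasure(action: str) -> float:
    """The mesa-objective the agent actually maximizes."""
    return {"forage": 1.0, "reproduce": 5.0, "contraception_fun": 9.0}[action]

def fitness(action: str) -> float:
    """The base objective the training process selected for."""
    return {"forage": 1.0, "reproduce": 5.0, "contraception_fun": 0.0}[action]

def mesa_optimizer(available_actions):
    """The trained agent: searches over actions for the one its proxy values most."""
    return max(available_actions, key=pleasure)

# Training distribution: pleasure and fitness are perfectly correlated, so an
# agent that maximizes pleasure also scores maximal fitness.
train_actions = ["forage", "reproduce"]
a = mesa_optimizer(train_actions)
print(f"training: chose {a!r}, fitness = {fitness(a)}")    # reproduce, fitness 5.0

# Deployment distribution: a new option decouples the proxy from the base
# objective (birth control exists). The agent still maximizes pleasure, and
# base-objective performance collapses -- an inner alignment failure.
deploy_actions = ["forage", "reproduce", "contraception_fun"]
a = mesa_optimizer(deploy_actions)
print(f"deployment: chose {a!r}, fitness = {fitness(a)}")  # contraception_fun, fitness 0.0
```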