Inner Alignment is the problem of ensuring that mesa-optimizers (i.e. trained ML systems that are themselves
optimizers) are aligned with the objective function of the training process. As an example, evolution is an optimization force that itself 'designed' optimizers (humans) to achieve its goals. However, humans do not primarily maximise reproductive success; they instead use birth control and then go out and have fun. This is a failure of inner alignment.
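
The evolution example can be sketched as a toy program. This is a minimal illustration under stated assumptions, not a model of real training: the action names, the payoff numbers, and the functions `base_objective`, `mesa_objective`, and `choose` are all hypothetical. It shows a learned proxy objective (enjoyment) that agrees with the base objective (offspring) in the ancestral environment but diverges from it once birth control exists.

```python
# Toy illustration only: actions, payoffs, and function names are hypothetical.

def base_objective(action, birth_control_available):
    """What the outer optimizer (evolution) selects for: number of offspring."""
    if action == "mate" and not birth_control_available:
        return 1
    return 0


def mesa_objective(action):
    """The proxy the inner optimizer (a human) actually pursues: enjoyment."""
    enjoyment = {"mate": 1.0, "forage": 0.2}
    return enjoyment[action]


def choose(actions):
    """The mesa-optimizer picks whatever scores best on its own objective."""
    return max(actions, key=mesa_objective)


actions = ["mate", "forage"]
choice = choose(actions)

# Ancestral environment: pursuing the proxy also serves the base objective.
ancestral_fitness = base_objective(choice, birth_control_available=False)

# Modern environment: same choice, but the base objective is no longer served.
modern_fitness = base_objective(choice, birth_control_available=True)

print(choice, ancestral_fitness, modern_fitness)
```

The agent's choice never changes; only the environment does, which is why the mismatch stays hidden until the proxy and the base objective come apart.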