In the context of AI alignment, the concern is that a base optimizer (e.g., a gradient descent process) may produce a learned model that is itself an optimizer, and that this learned optimizer has unexpected and undesirable properties. Even if the gradient descent process is in some sense "trying" to do exactly what human developers want, the resulting mesa-optimizer will not typically be trying to do the exact same thing.[1]

Wei Dai brings up a similar idea in an SL4 thread.[2]

The optimization daemons article on Arbital was probably published in 2016.[1]

  1. "Optimization daemons". Arbital. 
  2. Wei Dai. '"friendly" humans?' December 31, 2003. 