Towards a mechanistic understanding of corrigibility — AI Alignment Forum