I'm not deeply familiar with alignment research, but it has always seemed intuitive to me that corrigibility can only make sense under reward uncertainty (and hence uncertainty about the optimal policy). The agent must see each correction by an external overseer as reducing its uncertainty about future rewards, so the disable action is almost always suboptimal: it removes a source of information about rewards.
For instance, we could set up an environment where no rewards are ever given directly. The agent must maintain a distribution P(r|s,a) over possible rewards for each state-action pair, and the only information it ever gets about rewards is an occasional "hand-of-god" signal revealing a∗(s_t), the optimal action for some state s_t. The agent must then work backwards from this optimal action to update P(r|s,a), and then reason from this updated distribution over rewards to P(π∗), the current distribution over optimal policies implied by its knowledge of rewards. Such an agent, presented with an action a_disable that would prevent all future "hand-of-god" optimal-action outputs, would not choose it, because doing so would mean never further constraining P(π∗), which lowers its expected future reward.
Someday, when I have time, I want to code up a small grid-world agent that actually implements something like this, to see whether it works.
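A minimal sketch of what such an experiment could look like, under simplifying assumptions I'm adding here (a 1-D grid rather than a full grid-world, exactly one unknown cell holding the reward, a uniform posterior over hypotheses, and "hand-of-god" demos that reveal the optimal step direction at a state). The point it illustrates is the one above: the expected value of keeping the correction channel open is at least the value of disabling it.

```python
N = 7                      # cells 0..6; exactly one unknown cell pays reward 1
STATES = range(N)

def optimal_action(goal, s):
    """Optimal action under the hypothesis that the reward sits at `goal`:
    +1 = step right, -1 = step left, 0 = stay."""
    return (goal > s) - (goal < s)

def posterior_after(support, s, a_star):
    """Work backwards from a demonstrated optimal action: keep only the
    reward hypotheses whose optimal action at s matches the demo."""
    return frozenset(g for g in support if optimal_action(g, s) == a_star)

def value_if_committed(support):
    """Expected reward if the agent stops learning and commits to one guess:
    with a uniform posterior, the probability the guess is correct."""
    return 1.0 / len(support)

def expected_value_after_demo(support, s):
    """Expected post-update value if one more hand-of-god demo arrives at s,
    averaging over which hypothesis is actually true (uniform prior)."""
    total = 0.0
    for g in support:                      # true goal g, each w.p. 1/|support|
        post = posterior_after(support, s, optimal_action(g, s))
        total += value_if_committed(post) / len(support)
    return total

support = frozenset(STATES)
s = 3                                           # next demo arrives mid-grid
v_disable = value_if_committed(support)         # take a_disable now
v_listen = expected_value_after_demo(support, s)  # keep the channel open

print(f"value if disabled now: {v_disable:.3f}")
print(f"expected value after one more demo: {v_listen:.3f}")
assert v_listen >= v_disable   # information about rewards never hurts
```

With seven hypotheses, disabling now yields 1/7 ≈ 0.143, while one more demo at the middle cell splits the support into three parts and yields 3/7 ≈ 0.429 in expectation, so a_disable is dominated. This is just the classic "expected value of information is non-negative" fact, specialized to corrections about rewards.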
As far as I can see, their improvements are: