CIRL Corrigibility is Fragile

AdamGleave

Rachel did the bulk of the work on this post (well-done!), I just provided some advise on the project and feedback on earlier manuscripts.

I wanted to share why I'm personally excited by this work in case it helps contextualize it for others.

We'd all like AI systems to be "corrigible", cooperating with us in correcting them. Cooperative IRL has been proposed as a solution to this. Indeed Dylan Hadfield-Menell et al show that CIRL is provably corrigible in a simple setting, the off-switch game.

Provably corrigible sounds great, but where there's a proof there's also an assumption, and Carey et al soon pointed out a number of other assumptions under which this no longer holds -- e.g. if there is model misspecification causing the incorrect probability distribution to be computed.

That's a real problem, but every method can fail if you implement it wrongly (although some are more fragile than others), so this didn't exactly lead to people giving up on the CIRL framework. Recently Shah et al described various benefits they see of CIRL (or "assistance games") over reward learning, though this doesn't address the corrigibility question head on.

A lot of the corrigibility properties of CIRL come from uncertainty: it wants to defer to a human because the human knows more about its preferences than the robot. Recently, Yudkowsky and others described the problem of fully updated deference: if the AI has learned everything it can, it may have no uncertainty, at which point this corrigibility goes away. If the AI has learned your preferences perfectly, perhaps this is OK. But here Carey's critique of model misspecification rears its head again -- if the AI is convinced you love vanilla ice cream, saying "please no give me chocolate" will not convince it (perhaps it thinks you have a cognitive bias against admitting your plain, vanilla preferences -- it knows the real you), whereas it might if it had uncertainty.

I think the prevailing view on this forum is to be pretty down on CIRL because its not corrigible. But I'm not convinced corrigibility in the strict sense is even attainable or desirable. In this post, we outline a bunch of examples of corrigible behavior that I would absolutely not want in an assistant -- like asking me for approval before every minor action! By contrast, the near-corrigible behavior -- asking me only when the robot has genuine uncertainty -- seems more desirable... so long as the robot has calibrated uncertainty. To me, CIRL and corrigibility seem like two extremes: CIRL is focusing on maximizing human reward, whereas corrigibility is focused on avoiding ever doing the wrong thing even under model misspecification. In practice, we need a bit of both -- but I don't think we have a good theoretical framework for that yet.

In addition to that, I hope this post serves as a useful framework to ground future discussions on this. Unfortunately I think there's been an awful lot of talking past each other in debates on this topic in the past. For example, to the best of my knowledge, Hadfield-Menell and other authors of CIRL never believed it solved corrigibility under the assumptions Carey introduced. Although our framework is toy, I think it captures the key assumptions people disagree about, and it can be easily extended to capture more as needed in future discussions.