Corrigibility Can Be VNM-Incoherent

So, let me try to summarize and check my understanding. In the first part of the post, you show that most random reward functions are not corrigible. This looks correct.

In the second part, you want to prove that VNM-coherence is incompatible with corrigibility in this universe, and I don't think I follow. So, suppose that R(A_blue), R(B_blue), R(C_blue) > max(R(A_red), R(B_red), R(C_red)). Now we change the dynamics so that the human will not correct the agent by default, but can be manipulated into it. Then we need to add states A_black and C_black, and arro...

Alex Turner (7mo):
Although I didn't make this explicit, one problem is that manipulation is still weakly optimal, as you say. That wouldn't fit the spirit of strict corrigibility, as defined in the post. Note that AUP doesn't have this problem.
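
To make the "weakly optimal" point concrete, here is a minimal Python sketch. The action names and return values are hypothetical and are not taken from the post's MDP; they only illustrate the difference between an action being in the optimal set (a tie) and being strictly dominated.

```python
def optimal_actions(values, tol=1e-9):
    """Return every action whose value is within tol of the maximum."""
    best = max(values.values())
    return {a for a, v in values.items() if best - v <= tol}


def strictly_avoids(values, action, tol=1e-9):
    """True if `action` is strictly worse than the best alternative."""
    best_other = max(v for a, v in values.items() if a != action)
    return values[action] < best_other - tol


# Hypothetical returns for illustration only (not from the post).
tie = {"manipulate": 1.0, "do_not_manipulate": 1.0}
gap = {"manipulate": 0.5, "do_not_manipulate": 1.0}

# With a tie, manipulation is weakly optimal: it is in the argmax set.
print(optimal_actions(tie))
print(strictly_avoids(tie, "manipulate"))

# With a strict gap, an optimal agent never manipulates.
print(optimal_actions(gap))
print(strictly_avoids(gap, "manipulate"))
```

In the tie case an optimal policy is free to manipulate, which is the sense in which a reward function can permit non-manipulation without ruling manipulation out. The tolerance `tol` is an assumption added only to keep the floating-point comparison robust.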