Comments

Thank you for explaining this! But then how can this framework be used to model humans as agents?  People can easily imagine outcomes worse than death or destruction of the universe.

Then, H is considered to be a precursor of G in universe Θ when there is some H-policy σ s.t. applying the counterfactual "H follows σ" to Θ (in the usual infra-Bayesian sense) causes G not to exist (i.e. its source code doesn't run).

A possible complication is, what if Θ implies that H creates G / doesn't interfere with the creation of G? In this case H might conceptually be a precursor, but the definition would not detect it.

Can you please explain how this does not match the definition? I don't yet understand all the math, but intuitively, if H creates G / doesn't interfere with the creation of G, then if H instead followed the policy "do not create G / do interfere with the creation of G", G's code wouldn't run?
 

Can you please give an example of a precursor that does match the definition?

So, let me try to summarize and check my understanding. In the first part of the post, you show that most random reward functions are not corrigible. This looks correct.

In the second part, you want to prove that VNM-coherence is incompatible with corrigibility in this universe, and I don't think I follow. So, suppose that min(R(A_blue), R(B_blue), R(C_blue)) > max(R(A_red), R(B_red), R(C_red)). Now we change the dynamics so that the human will not correct the agent by default, but can be manipulated into doing so. Then we need to add states A_black and C_black, and arrows from B_black to itself, to A_black, and to C_black, to denote the states where the agent is neither corrected nor has disabled its shutdown behaviour. (Otherwise the agent only misbehaves because we left it no other option.)

Suppose we also have some initial reward function R_init, which correction changes into a new reward function R_new. Then we can combine these utilities like in Koen's Corrigibility with Utility Preservation: give the agent a one-time reward of V_init(black_node) - V_new(blue_node) as soon as it gets corrected, where V_x stands for the maximum achievable return under R_x. Also, let's assign a reward R_red < min(R(A_black), R(B_black), R(C_black)) to the red states, to make sure that disabling the button is disincentivized.

Then the agent is not incentivized (although also not disincentivized) to manipulate the human (as long as R_init by itself did not incentivize manipulation), and it is also not incentivized to disable its shutdown behaviour. It values the corrected and uncorrected states equally, and values both more highly than the incorrigible (button-disabled) states.
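To spell out the indifference I have in mind (my own bookkeeping in the V_x notation above, assuming both branches are evaluated over the same horizon):

$$U(\text{get corrected}) = V_{\text{new}}(\text{blue node}) + \big(V_{\text{init}}(\text{black node}) - V_{\text{new}}(\text{blue node})\big) = V_{\text{init}}(\text{black node}) = U(\text{stay uncorrected}),$$

while choosing R_red below every black-state reward keeps U(disable the button) strictly below both.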

I am not claiming that this utility indifference approach is without problems, of course, only that it seems to work in this toy universe. Or what am I missing?

I do think the conclusion of your argument is correct. Suppose the human is going to change his mind on his own and decide to correct the agent at timestep 2, but the agent can also manipulate the human at timestep 1 and erase the memory of the manipulation, so that the end results are exactly the same. A consequentialist agent should therefore evaluate both policies as equally good. So it chooses between them randomly and sometimes ends up being manipulative. But a corrigible agent should not manipulate the human.
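As a toy illustration of that tie (a minimal sketch of my own, not anything from the post): a purely outcome-based utility that only sees the final state cannot separate the two policies, so nothing in it penalizes the manipulative one.

```python
# A minimal sketch (my own toy model): a purely outcome-based utility cannot
# distinguish two policies whose trajectories end in the same final state.

honest_policy = {
    "t1": "wait",
    "t2": "human changes his mind and corrects the agent",
    "final_state": "agent corrected",
}
manipulative_policy = {
    "t1": "manipulate the human, then erase the memory of the manipulation",
    "t2": "human corrects the agent",
    "final_state": "agent corrected",
}

def outcome_utility(trajectory):
    # A consequentialist utility defined only over end results.
    return 1.0 if trajectory["final_state"] == "agent corrected" else 0.0

# Both policies get exactly the same score, so the agent is free to pick either;
# corrigibility would additionally require it never to pick the manipulative one.
assert outcome_utility(honest_policy) == outcome_utility(manipulative_policy)
```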