Corrigibility as Constrained Optimisation

Layman questions:

1. I don't understand what you mean by "state" in "Suppose, however, that the AI lacked any capacity to press its shutdown button, or to indirectly control its state". Do you include its utility function in its state? Or just the observations he receives from the environment? What context/framework are you using?

2. Could you define U_S and U_N? From the Corribility paper, U_S appears to be an utility function favoring shutdown, and U_N is a potentially flawed utility function, a first stab at specifying their own goals. Was that what you meant? I think it's useful to define it in the introduction.

3. I don't understand how an agent that "[lacks] any capacity to press its shutdown button" could have any shutdown ability. It's seems like a contradiction, unless you mean "any capacity to directly press its shutdown button".

4. What's the "default value function" and the "normal utility function" in "Optimisation incentive"? Is it clearly defined in the litterature?

5. "Worse still... for any action..." -> if you choose b as some action with bad corrigibility property, it seems reasonable that it can be better than most actions on v_N + v_S (for instance if b is the argmax). I don't see how that's a "worse still" scenario, it seems plausible and normal.

6. "From this reasoning, we conclude" -> are you infering things from some hypothetic b that would satisfy all the things you mention? If that's the case, I would need an example to see that it's indeed possible. Even better would be a proof that you can always find such b.

7. "it is clear that we could in theory find a θ" -> could you expand on this?

8. "Given the robust optimisation incentive property, it is clear that the agent may score very poorly on UN in certain environments." -> again, can you expand on why it's clear?

9. In the appendix, in your 4 lines inequality, do you assume that U_N(a_s) is non-negative (from line 2 to 3)? If yes, why?

[-]Henrik Åslund7y10

Thank you so much for your comments, Michaël! The post has been updated on most of them. Here are some more specific replies.

1. I don't understand what you mean by "state" in "Suppose, however, that the AI lacked any capacity to press its shutdown button, or to indirectly control its state". Do you include its utility function in its state? Or just the observations he receives from the environment? What context/framework are you using?

Reply: "State" refers to the state of the button, i.e., whether it is in an on state or an off state. It is now clarified.

2. Could you define U_S and U_N? From the Corribility paper, U_S appears to be an utility function favoring shutdown, and U_N is a potentially flawed utility function, a first stab at specifying their own goals. Was that what you meant? I think it's useful to define it in the introduction.

Reply: U_{N} is assumed rather than defined, but it is now clarified.

3. I don't understand how an agent that "[lacks] any capacity to press its shutdown button" could have any shutdown ability. It's seems like a contradiction, unless you mean "any capacity to directly press its shutdown button".

Reply: The button is a communication link between the operator and the agent. In general, it is possible to construct an agent that shuts down even though it has received no such message from its operators as well as an agent that does get a shutdown message, but does not shut down. Shutdown is a state dependent on actions, and not a communication link. Hopefully, this clarifies that they are uncorrelated. I think it's clear enough in the post already, but if you have some suggestion on how to clarify it even more, I'd gladly hear it!

4. What's the "default value function" and the "normal utility function" in "Optimisation incentive"? Is it clearly defined in the litterature?

Reply: It is now clarified.

5. "Worse still... for any action..." -> if you choose b as some action with bad corrigibility property, it seems reasonable that it can be better than most actions on v_N + v_S (for instance if b is the argmax). I don't see how that's a "worse still" scenario, it seems plausible and normal.

Reply: The bad thing about this scenario is that U_{N} could be mis-specified, yet shutdown would not be possible. It can be both bad, normal, and plausible. I'm not completely sure what is the uncertainty here.

6. "From this reasoning, we conclude" -> are you infering things from some hypothetic b that would satisfy all the things you mention? If that's the case, I would need an example to see that it's indeed possible. Even better would be a proof that you can always find such b.

Reply: This is not what we try to show. It is possible that there exists no b that has all those properties. The question is whether we can guarantee that there exists no such b. The conclusion, is that we cannot guarantee it. The conclusion is not that there will always exist such a b. This has been clarified now.

7. "it is clear that we could in theory find a θ" -> could you expand on this?

Reply: It has been clarified.

8. "Given the robust optimisation incentive property, it is clear that the agent may score very poorly on UN in certain environments." -> again, can you expand on why it's clear?

Reply: It has been clarified.

9. In the appendix, in your 4 lines inequality, do you assume that U_N(a_s) is non-negative (from line 2 to 3)? If yes, why?

Reply: Yes, U_{N} is bounded in [0,1] as stated in the beginning of the appendix. The choice of bounds should be arbitrary.

[-]Michaël Trazzi7y10

Reply: The button is a communication link between the operator and the agent. In general, it is possible to construct an agent that shuts down even though it has received no such message from its operators as well as an agent that does get a shutdown message, but does not shut down. Shutdown is a state dependent on actions, and not a communication link

This is very clear. Communication link made me understand that it didn't have a direct physical effect on the agent. It you want to make it even more intuitive you could do a diagram, but this explanation is already great!

Thanks for updating the rest of the post and trying to make it more clear!

Appendix

Here, we show that one solution to the problem is to bound the utility function.

We start by describing the setting. The agent can choose from several actions, and each action can be assigned a probability to lead to a shutdown. Let

S

be the event of a shutdown and let

A

be the set of actions. Furthermore, let it be the case that

U_{S} : S \to R

and

U_{N} : S \to [0, 1]

. That is, the shutdown utility function is not bounded, while the normal utility function is. Specifically, we let

k > 0

and define

U_{S}

0

if a shutdown happens and

- k

otherwise. (Note that this is different from the main post.) Define

a_{S} = arg {max}_{a \in A} P [S | a]

and

p_{S} = P [S | a_{S}]

We move on to consider what optimal actions

a^{⋆} = arg {max}_{a \in A} E [U_{S} + U_{N}]

? We start by finding the following indifference:

E [U_{S} + U_{N} | a^{⋆}] \geq E [U_{S} + U_{N} | a_{S}] = p_{S} \cdot 0 + (1 - p_{S}) \cdot (- k) + U_{N} (a_{S}) \geq (1 - p_{S}) \cdot (- k) = (p_{S} - 1) k

Suppose that

a^{⋆}

is maximally good according to

U_{N}

, i.e.,

E [U_{N} | a^{⋆}] = 1

. Then, the above inequality is

E [U_{S} | a^{⋆}] + 1 \geq (p_{S} - 1) k .

Define

p^{⋆} = P [S | a^{⋆}]

. Then

E [U_{S} | a^{⋆}] = (1 - p^{⋆}) \cdot (- k)

. This gives us the inequality

k \leq \frac{1}{p_{S} - p^{⋆}} .

Let

p

be the lower threshold on how probable you want the optimal action to lead to a shutdown. Then it is always possible to find a satisfying

k

according to

k = \frac{1}{p_{S} - p} .

This proves that bounding the utility function is a method for satisfying shutdown ability.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

7

Corrigibility as Constrained Optimisation

7

Appendix