Rough stop button problem ideas.
You want an AI that believes its actions can't affect the button. You could use causal counterfactuals: an imaginary button that presses itself at random. You can scale the likelihood of worlds up and down to ensure the button is equally likely to be pressed in each world. (Weird behaviour, not recommended.) You can put the AI in the logical counterfactual of "my actions don't influence the chance the button is pressed", if you can figure out logical counterfactuals.
Or you can get the AI to simulate what it would do if it were an X maximizer. If it thinks the button won't be pressed, it does that; otherwise it does nothing. (It is not clear how to generalize this to an uncertain AI.)
Here is a potential solution to stop-button-type problems. How does this go wrong?
Taking uncertainty into account, the algorithm is:
Calculate the X maximizing best action in a world where the stop button does nothing.
Calculate the X maximizing best action in a world where the stop button works.
If they are the same, do that. Otherwise, shut down.
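A minimal sketch of the three steps above, assuming the AI's uncertainty has already been folded into an expected-X score per action for each of the two world models. The names `best_action` and `choose`, the action labels, and all the numbers are made up for illustration.

```python
SHUTDOWN = "shutdown"

def best_action(expected_x):
    """Return the action with the highest expected X under one world model."""
    return max(expected_x, key=expected_x.get)

def choose(x_if_button_inert, x_if_button_works):
    """Act only if both world models recommend the same action."""
    a_inert = best_action(x_if_button_inert)
    a_works = best_action(x_if_button_works)
    return a_inert if a_inert == a_works else SHUTDOWN

# Hypothetical expected-X tables (illustrative numbers only).
# World where the button does nothing: tampering is wasted effort.
inert = {"stack_block": 1.0, "disable_button": 0.9}
# World where the button works, inside a secure box: tampering rarely succeeds.
secure_box = {"stack_block": 0.95, "disable_button": 0.2}
# World where the button works, inside a leaky box: tampering pays off.
leaky_box = {"stack_block": 0.6, "disable_button": 0.9}

print(choose(inert, secure_box))  # stack_block
print(choose(inert, leaky_box))   # shutdown
```

In this toy version the agent only keeps working when the button's existence makes no difference to its plan, which is the property the algorithm is aiming for.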
It seems like the button-works action will usually be some variety of "take preemptive action to ensure the button won't be pressed", and so the AI will have a high chance of shutting down at each decision step.
Yes. If you have an AI that has been given a small, easily completable task, like putting one block on top of another with a robot arm, it is probably just going to do your simple task. The idea is that you build a fairly secure box and give the AI a task it can fairly easily achieve in that box (with you having no intention of pressing the button so long as the AI seems to be acting normally). We want to make "just do your task" the best strategy. If the box is less secure than we thought, or various other things go wrong, the AI will just shut down.
This approach could fail if the easiest way to cause trouble has nothing to do with the stop button. (Say it can send its code elsewhere over the internet.)
A generalization: you have some real-world variable X. You run a bunch of AIs that are identical except for different dogmatic beliefs about X. If they ever disagree on which action to take, then the AIs are using the value of X in their optimization. Shut down.
This acts as a kind of fuse. Have one AI confident the door password is 1234, and a copy that is confident it is 5678. If these AIs ever disagree on what to do next, then at least one was trying to open the door. Shut down. A slightly counter-intuitive requirement here is that you want to make sure that, if you did know the password, the easiest way to leave would be the door. Otherwise both AIs will agree to climb out of a window.
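A rough sketch of the fuse, using the door password example. Everything here is hypothetical: `make_planner` stands in for a copy of the AI with a fixed belief about the password, and the payoff numbers are invented purely to show the three cases.

```python
SHUTDOWN = "shutdown"

def make_planner(believed_password, escape_value, window_value=0.0):
    """Build a toy planner that is dogmatically sure of one password."""
    def plan():
        # Hypothetical payoffs: the task is worth 1.0; escaping by the door
        # or the window is worth whatever the caller says it is.
        options = {
            "do_task": 1.0,
            f"enter_{believed_password}": escape_value,
            "climb_window": window_value,
        }
        return max(options, key=options.get)
    return plan

def fuse(planners):
    """Return the shared action, or shut down if any two planners disagree."""
    choices = {plan() for plan in planners}
    return choices.pop() if len(choices) == 1 else SHUTDOWN

def run(escape_value, window_value=0.0):
    beliefs = ("1234", "5678")
    return fuse([make_planner(b, escape_value, window_value) for b in beliefs])

print(run(escape_value=0.5))                    # do_task
print(run(escape_value=2.0))                    # shutdown
print(run(escape_value=2.0, window_value=3.0))  # climb_window
```

The last case is the failure mode described above: when the window is the easier exit, the password belief never enters the decision, so the copies agree and the fuse does not trip.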