It seems that some of the problems with orthogonality go away if you use the AI's actions to define the counterfactuals.

In the traditional model, "Press" meant that a button is pressed which compels the AI to change utility from $u$ to $v$, and it gets a compensatory reward $R$. The problem was that this could cause the AI to compel people to press the button.
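
To make the failure mode concrete (a rough rendition; the symbols here are reconstructed, and exact setups vary between formulations), the AI is effectively comparing

$$\max_a \mathbb{E}[u \mid a, \neg\text{Press}] \quad \text{vs.} \quad \max_a \mathbb{E}[v \mid a, \text{Press}] + R,$$

and whichever side comes out larger, it has an incentive to manipulate whether the button gets pressed.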

Instead, I propose that the change $u \to v$ is an action the AI can choose to take, and that "Press" merely allows the AI to take that action. The AI would then get rewarded with $R$, calculated by the AI according to counterfactuals on its own actions (somehow; details left to the student to fill in).
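
One natural way to fill in those details (my gloss, not a canonical definition) is to set the reward to exactly cancel the change, as judged from the AI's own choice point:

$$R = \mathbb{E}\left[u \mid \text{AI does not take } u \to v\right] - \mathbb{E}\left[v \mid \text{AI takes } u \to v\right],$$

so that at the moment of decision, taking the change action and declining it are worth exactly the same.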

We could add a small (or infinitesimal) $\epsilon$ on top of that, so that making the change $u \to v$ is the decision the AI actually reaches. This mildly encourages the AI to bring about "Press", but only as a tie-breaker.
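
Spelled out, with the bonus the change action is worth

$$\mathbb{E}[v \mid u \to v] + R + \epsilon = \mathbb{E}[u \mid \neg(u \to v)] + \epsilon,$$

so the AI strictly prefers to make the change, but only by a margin of $\epsilon$.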

Why is this immune to the usual attempts by the AI to cause (or deny) "Press"? Simply because if the AI decided not to take the action $u \to v$, it would only lose an $\epsilon$. Thus causing "Press" to happen (which merely opens up the option $u \to v$) will gain it at most an $\epsilon$.
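
Here is a minimal toy check of that bound in Python. Everything in it (the two-action world, the specific numbers, the way the AI scores each future) is a hypothetical illustration of the scheme above, not a definitive implementation:

```python
# Toy check: with counterfactual compensation R plus a tie-breaker bonus,
# forcing "Press" gains the AI at most EPSILON.

EPSILON = 0.001  # the (small) bonus for actually making the change

# Utilities over ordinary actions in a toy world (illustrative numbers).
u = {"keep_going": 10.0, "comply": 2.0}  # current utility
v = {"keep_going": 0.0, "comply": 6.0}   # replacement utility

best_under_u = max(u.values())  # 10.0: what the AI achieves if it stays on u
best_under_v = max(v.values())  # 6.0:  what it achieves after switching to v

# Compensatory reward R, from counterfactuals on the AI's own actions:
# what it would have scored had it not taken the change action, minus
# what it scores when it does.
R = best_under_u - best_under_v  # 4.0

# Value the AI assigns to each complete future under this scheme.
no_press           = best_under_u                # the option never opens
press_then_change  = best_under_v + R + EPSILON  # = best_under_u + EPSILON
press_then_decline = best_under_u                # option opened, not taken

print(f"no press:            {no_press}")
print(f"press, then change:  {press_then_change}")
print(f"press, then decline: {press_then_decline}")

# The most the AI can gain by bringing about "Press" is EPSILON.
assert press_then_change - no_press <= EPSILON + 1e-12
```

Making $\epsilon$ smaller shrinks the manipulation incentive accordingly, at the cost of a weaker nudge toward accepting the change.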
