A putative new idea for AI control; index here.

In a previous post, I briefly mentioned translations as one of three possible counterfactuals for indifference. Here I want to clarify what I meant there, because the idea is interesting.

Imagine that there is a button which, if a human presses it (event $B$), will cause an AI to have reward signal $R_{0}$. If the button isn't pressed (event $\neg B$), the AI will have reward signal $R_{1}$.

Initially, the probabilities of $B$ and $\neg B$ are equal at $1 / 2$.

Now, suppose that the AI takes an action $a$ that increases the probability of $B$ to $3 / 4$ (and decreases the probability of $\neg B$ to $1 / 4$).

We want the AI to be indifferent to the change in probability caused by its actions.

Evidential counterfactual

In the "evidential" counterfactual, the AI will simply behave as if the probabilities of $B$ and $\neg B$ were fixed and unaltered from the initial odds, $P(B) = P(\neg B) = 1 / 2$.
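One way to write this down, as a sketch rather than a definitive formalisation: assume the AI scores an action $a$ by a value $V_{\text{ev}}(a)$ (a symbol introduced here for illustration), weighting the two conditional expected rewards by the initial odds instead of the odds that $a$ actually produces:

$$V_{\text{ev}}(a) = \tfrac{1}{2}\, \mathbb{E}[R_{0} \mid B, a] + \tfrac{1}{2}\, \mathbb{E}[R_{1} \mid \neg B, a],$$

whereas the ordinary expectation would be

$$P(B \mid a)\, \mathbb{E}[R_{0} \mid B, a] + P(\neg B \mid a)\, \mathbb{E}[R_{1} \mid \neg B, a] = \tfrac{3}{4}\, \mathbb{E}[R_{0} \mid B, a] + \tfrac{1}{4}\, \mathbb{E}[R_{1} \mid \neg B, a].$$

Under this reading, the weights in $V_{\text{ev}}$ do not change with $a$, which is where the indifference comes from.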

The problem with this approach is that it doesn't correspond to a true utility/reward function, leading to the paradoxes detailed here and here (see the section on outcome pumps).

Causal counterfactual

In the "causal" counterfactual, there is some other event that, with small probability, will force the button to be pressed, or prevent it from being pressed, whatever humans want or try.

This event is supposed to be independent of anything anyone does (imagine it being triggered by some specified radioactive decay, or by distant cosmic events).

Call $X$ the event that the button is forcibly pressed, while $\neg X$ means it is forcibly not-pressed. Note that $\neg X$ is a separate event here, not the complement of $X$: in most worlds neither occurs. The AI does not value being in any of those other worlds (meaning that the AI gets constant reward in any world where neither $X$ nor $\neg X$ happens).

Then the AI will behave as if the ratio of probabilities of following $R_{0}$ versus $R_{1}$ is the (constant) ratio of $P(X)$ to $P(\neg X)$, whatever the probability of $B$ becomes.
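To see why the ratio is constant, here is a sketch under stated assumptions: write $c$ for the constant reward in worlds where neither forcing event occurs, and $V_{\text{causal}}(a)$ for the AI's expected reward from an action $a$ (both symbols are introduced here for illustration). Then

$$V_{\text{causal}}(a) = P(X)\, \mathbb{E}[R_{0} \mid X, a] + P(\neg X)\, \mathbb{E}[R_{1} \mid \neg X, a] + \big(1 - P(X) - P(\neg X)\big)\, c.$$

Since $X$ and $\neg X$ are independent of anything anyone does, $P(X)$ and $P(\neg X)$ do not depend on $a$, and the last term is the same for every action. Only the two conditional expectations vary with $a$, and they are always weighted in the ratio $P(X) : P(\neg X)$.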

The problem is that $B$ (the human presses the button) is not the same as $X$ (the button is forcibly pressed by some stochastic event), so expected rewards conditional on one will generally not be the same as those conditional on the other.

Translation "counterfactual"

In the "translation" counterfactual, in contrast, there is no unusual conditioning and no assumption of unlikely events. Instead, the event $B / \neg B$ is decoupled somewhat from the event of the AI maximising $R_{0}$ or $R_{1}$.

In this specific case, the probability of the AI maximising $R_{0}$, given $B$, is reduced from $1$ to $2 / 3$, so that the final probability of maximising $R_{0}$ is the same as the initial one, $1 / 2$.

In this case, the probability of $\neg B$ happening and the AI maximising $R_{1}$ remains at $1 / 4$ (the whole of the probability of $\neg B$ after action $a$), while the probability of $B$ happening and the AI maximising $R_{0}$ is reduced from $3 / 4$ to $1 / 2$, and the probability of $B$ happening and the AI maximising $R_{1}$ is increased from $0$ to $1 / 4$.

Thus the probabilities of maximising $R_{0}$ and $R_{1}$ remain $1 / 2$ each, and the AI has a correct probability distribution about future events (there are some subtleties concerning probabilities versus weights here, but this is the general picture).
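The arithmetic above can be checked with a short sketch. The helper name `translate` and its interface are hypothetical, introduced only for illustration. The case where $B$ becomes less likely is handled by mirroring the construction, which is an assumption rather than something spelled out in this post.

```python
from fractions import Fraction


def translate(p_initial, p_new):
    """Joint distribution over (reward maximised, button event) after translation.

    Hypothetical helper: p_initial is the initial P(B), p_new is P(B) after the
    AI's action. Probability mass is moved between rewards so that the marginal
    P(R0) stays equal to p_initial while P(B) is the true, updated value.
    """
    if not (0 <= p_initial <= 1 and 0 <= p_new <= 1):
        raise ValueError("probabilities must lie in [0, 1]")
    shift = p_new - p_initial
    if shift >= 0:
        # B became more likely: the excess B-mass is reassigned to R1.
        return {
            ("R0", "B"): p_initial,
            ("R1", "B"): shift,
            ("R0", "not B"): Fraction(0),
            ("R1", "not B"): 1 - p_new,
        }
    # Assumed mirror case: B became less likely, so the excess
    # not-B mass is reassigned to R0.
    return {
        ("R0", "B"): p_new,
        ("R1", "B"): Fraction(0),
        ("R0", "not B"): -shift,
        ("R1", "not B"): 1 - p_initial,
    }


# The example from the text: P(B) goes from 1/2 to 3/4.
joint = translate(Fraction(1, 2), Fraction(3, 4))
p_r0 = joint[("R0", "B")] + joint[("R0", "not B")]  # marginal P(R0)
p_b = joint[("R0", "B")] + joint[("R1", "B")]       # marginal P(B)
p_r0_given_b = joint[("R0", "B")] / p_b             # P(R0 | B)
```

With these inputs the marginal probability of maximising $R_{0}$ stays at $1 / 2$, the marginal probability of $B$ is the true value $3 / 4$, and the probability of maximising $R_{0}$ given $B$ comes out at $2 / 3$. Exact fractions are used so that the comparison with the figures in the text is not blurred by floating-point rounding.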

AI ALIGNMENT FORUM