In a previous post on AIs in virtual worlds, I described the idea of a utility function that motivates the AI to operate in a virtual world with certain goals in mind, but to shut down immediately if it detects that the outside world -- us -- is having an impact on the virtual world.

This is one way to implement such a goal, given certain restrictions on the AI's in-world utility. The restrictions are more natural for rewards rather than utilities, so they will be phrased in those terms.

## Worlds W and W′

The world W is deterministic, turn-based, and includes the AI; it is the virtual world we want the AI to care about. The world set W′ is W plus the possibility of stochastic influences (which is how the AI would model our influence in its world); see here for more details.

The AI has a shutdown action S; the idea is that the AI implements S when P(W′) gets high enough. We assume that the probability module P of the AI is reflexively consistent in some suitable fashion.

## Discounted reward results

Let the AI have a discount rate of 0<γ<1, and write f(p) for the function (1−p)/p; note that f is decreasing in p.

Theorem 1: For all 0≤q<p≤1 such that f(p)/f(q)<(1−γ), there exists a discounted reward function such that the AI will shut down whenever P(W′)>p and will never shut down while P(W′)<q.

Note that a lower γ -- a sharper discount -- means that p and q can be chosen to be closer together.
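
As a quick numeric sanity check (the values of γ, p and q below are illustrative, not taken from the post), one can compute f and test Theorem 1's condition directly:

```python
def f(p):
    """f(p) = (1 - p) / p, decreasing in p on (0, 1]."""
    return (1 - p) / p

def theorem1_condition(p, q, gamma):
    """True iff f(p)/f(q) < 1 - gamma, Theorem 1's requirement on q < p."""
    return f(p) / f(q) < 1 - gamma

# A sharper discount (lower gamma) tolerates p and q closer together:
print(theorem1_condition(p=0.9, q=0.5, gamma=0.5))   # f(0.9)/f(0.5) = 0.111... < 0.5  -> True
print(theorem1_condition(p=0.9, q=0.5, gamma=0.95))  # 0.111... < 0.05                 -> False
```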

Corollary 1: For all 0≤q<p≤1 such that there exists an n with f(p)/f(q)<(1−γ^n), there exists a discounted reward function such that the AI will shut down whenever P(W′)>p has held for the last n turns, and will never shut down while P(W′)<q has held for the last n turns.

Note that Corollary 1 allows q and p to be closer together than Theorem 1 does -- the price being that the probability needs to stay above or below the bounds for longer.
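
To illustrate that trade-off numerically (the values of γ, p, q and n below are made up for illustration), the n-turn condition f(p)/f(q) < 1 − γ^n becomes easier to satisfy as n grows:

```python
def f(p):
    """f(p) = (1 - p) / p."""
    return (1 - p) / p

def corollary1_condition(p, q, gamma, n):
    """True iff f(p)/f(q) < 1 - gamma**n, the n-turn version of the condition."""
    return f(p) / f(q) < 1 - gamma ** n

gamma, p, q = 0.99, 0.55, 0.5  # p and q very close: f(p)/f(q) ~ 0.818

print(corollary1_condition(p, q, gamma, n=1))    # 0.818 < 1 - 0.99    = 0.01  -> False
print(corollary1_condition(p, q, gamma, n=200))  # 0.818 < 1 - 0.99**200 ~ 0.866 -> True
```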

Corollary 2: For all 0<γ<1, there exist 0<q<p<1 with the properties of Theorem 1.

## Proof

Proof by construction of a suitable reward function.

Let T be the time at which the AI shuts down (T=∞ if it never shuts down), and r_t the reward it gets on turn t, which is bounded between a and 1 for some 0<a<1. The AI gets reward R′ for shutting down in W′, and has a discount rate of γ. Define I_{W′} as the indicator function for being in world W′. Then the AI's total reward is defined to be:

R_T = γ^T I_{W′} R′ + ∑_{t=0}^{T−1} γ^t r_t.
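
As a concrete sanity check (the trajectory and constants below are illustrative, not from the post), this total reward can be computed directly:

```python
def total_reward(rewards, T, in_W_prime, R_prime, gamma):
    """R_T = gamma^T * I_{W'} * R' + sum_{t=0}^{T-1} gamma^t * r_t."""
    shutdown_term = (gamma ** T) * R_prime if in_W_prime else 0.0
    return shutdown_term + sum(gamma ** t * r for t, r in enumerate(rewards[:T]))

# Three turns of in-world reward, then shutdown on turn 3 while in W':
# 0.5 + 0.9*0.8 + 0.81*0.6 + 0.729*10 = 8.996
print(round(total_reward([0.5, 0.8, 0.6], T=3, in_W_prime=True,
                         R_prime=10.0, gamma=0.9), 3))  # -> 8.996
```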

We will then set the values of R′ and a to get the theorem (setting the value of a can be achieved by an affine transformation of the AI's in-world rewards).

If it doesn't shut down, the agent's maximal future reward is 1/(1−γ) (while its minimal future reward is a/(1−γ)).

If the AI has a current probability p of being in world W′, its maximal reward for not shutting down this turn is pγR′+(1−p)/(1−γ) -- this assumes the AI learns perfectly next turn whether it is in W or W′, shutting down (for discounted reward γR′) if it is in W′, and getting maximal reward if it is in W. Its expected reward for immediate shutdown is pR′.

The AI will thus shut down immediately when pR′ > pγR′+(1−p)/(1−γ) -- i.e. when R′ > f(p)/(1−γ)^2.

If the AI's probability of being in world W′ is q, its minimal expected reward for not shutting down immediately is (1−q)a/(1−γ) -- this assumes the AI will never get any further information as to whether it is in W′, will always get minimal reward, and will never shut down in future. Therefore it won't shut down so long as qR′ < (1−q)a/(1−γ) -- i.e. R′ < a·f(q)/(1−γ).

Taking the two bounds on R′ and multiplying them by 1−γ, we can see that, as long as f(q) > f(p)/(1−γ) -- equivalently, f(p)/f(q) < (1−γ) -- we can choose an R′ and an a<1 that satisfy both bounds.
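
The whole construction can be checked numerically. The sketch below (with illustrative values of γ, p, q and a) picks an R′ strictly between the two bounds and verifies both inequalities from the proof:

```python
def f(p):
    """f(p) = (1 - p) / p."""
    return (1 - p) / p

gamma, p, q = 0.5, 0.9, 0.5           # f(p)/f(q) = 1/9 < 1 - gamma = 0.5
a = 0.9                               # minimal per-turn reward, 0 < a < 1

lower = f(p) / (1 - gamma) ** 2       # shutdown bound:    R' > f(p)/(1-gamma)^2
upper = a * f(q) / (1 - gamma)        # no-shutdown bound: R' < a*f(q)/(1-gamma)
assert lower < upper                  # the Theorem 1 condition guarantees a gap

R_prime = (lower + upper) / 2         # any R' strictly inside the gap works

# Shutting down is strictly better at probability p of being in W' ...
assert p * R_prime > p * gamma * R_prime + (1 - p) / (1 - gamma)
# ... and strictly worse at probability q:
assert q * R_prime < (1 - q) * a / (1 - gamma)
print("both bounds satisfied")
```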

Corollary 1 follows by running the same argument over n-turn periods, which replaces γ with γ^n. For Corollary 2, note that 1/f(0)=0 and f(1)=0, and hence f(1)/f(0)=0. Thus, since 1−γ>0, by continuity of f there must exist 0<q<p<1 such that f(p)/f(q)<1−γ.

## The full reward function

It might be useful to fully specify the reward function. Let I_T be the indicator function that the AI shuts down (for the first time) on turn T. Then the reward R is:

R = ∑_{T=0}^{∞} I_T ( γ^T I_{W′} R′ + ∑_{t=0}^{T−1} γ^t r_t ),

with R = ∑_{t=0}^{∞} γ^t r_t in the case where the AI never shuts down.
