
Personal Blog

A putative new idea for AI control; index here.

In a previous post on AIs in virtual worlds, I described the idea of a utility function that motivates the AI to operate in a virtual world with certain goals in mind, but to shut down immediately if it detects that the outside world -- us -- is having an impact in the virtual world.

This is one way to implement such a goal, given certain restrictions on the AI's in-world utility. The restrictions are more natural for rewards rather than utilities, so they will be phrased in those terms.

# Worlds W and W′

The world W is deterministic, turn-based, and includes the AI; it is the virtual world we want the AI to care about. The world set W′ is W plus the possibility of stochastic influences (which is how the AI would model our influence in its world); see here for more details.

The AI has a shutdown action; the idea is that the AI takes this action when its estimated probability of being in W′ gets high enough. We assume that the probability module of the AI is reflexively consistent in some suitable fashion.

## Discounted reward results

Let the AI have a discount rate of , and write for the function ; note that is diminishing in .

• Theorem 1: For all such , there exists a discounted reward function such that the AI will shutdown whenever and will never shutdown while .

Note that a lower -- a sharper discount -- means that and can be chosen to be closer together.

• Corollary 1: For all such that there exists an with , there exists a discounted reward function such that the AI will shutdown whenever for the last turns, and will never shutdown while for the last turns.

Note that Corollary 1 allows and to be closer together than Theorem 1 allows - the price being that the probability needs to stay above or below the bounds for longer.

• Corollary 2: For all , there exist with the properties of Theorem 1.

# Proof

Proof by construction of a suitable reward function.

Let be the time that the AI shuts down (if it never shuts down ), and the reward it gets on turn , which is bounded between and . It gets reward for shutting down in , and has a discount rate of . Define as the indicator function for being in world . Then the AI's total reward is defined to be:

• .

We will then set the values of and to get the theorem (setting the value of can be achieved by an affine transformation of its rewards).

If it doesn't shutdown, the agent's maximal future reward is (while its minimal future reward is ).

If the AI has a current probability of being in world , its maximal reward for not shutting down this turn is -- this assumes the AI learns about W versus W′ perfectly in the next turn, and then gets maximal reward if it is in . Its expected reward for immediate shutdown is .

The AI will thus shut down immediately when -- i.e. when .

If the AI's probability of being in world is , its minimal expected reward for not shutting down immediately is -- this assumes the AI will never get any further information as to whether it is in , will always get minimal reward, and will never shutdown in future. Therefore it won't shutdown for -- ie .

Taking the two bounds on and multiplying them by , we can see that, as long as -- equivalently -- we can choose an and an that satisfies both bounds.
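The logic of the two bounds can be checked numerically. The Python sketch below is not the exact reward function of this post; it is a minimal illustration under assumptions of my own: the per-turn reward lies in `[r_min, r_max]` with `r_max > 0`, the shutdown reward `c` is paid only in W′ and satisfies `c >= r_max / (1 - gamma)`, rewards are summed from the current turn with weight 1, and `p` is the AI's current probability of being in W′. All of these names are hypothetical. The agent is forced to shut down when the shutdown value beats the most optimistic value of waiting, and is forced to continue when the shutdown value is below the most pessimistic value of never shutting down.

```python
def shutdown_value(p, c):
    """Expected reward for shutting down now; c is only paid out in W'."""
    return p * c


def continue_upper_bound(p, c, gamma, r_max):
    """Most optimistic value of waiting one turn (assumed model):
    collect r_max now, learn the true world next turn, then either
    earn r_max forever (in W) or shut down for c (in W')."""
    return r_max + gamma * ((1 - p) * r_max / (1 - gamma) + p * c)


def continue_lower_bound(gamma, r_min):
    """Most pessimistic value of never shutting down:
    r_min on every turn, whichever world the AI is in."""
    return r_min / (1 - gamma)


def classify(p, c, gamma, r_min, r_max):
    """Return what the bounds force the agent to do at probability p."""
    now = shutdown_value(p, c)
    if now > continue_upper_bound(p, c, gamma, r_max):
        return "shutdown"      # shutting down beats even the best case of waiting
    if now < continue_lower_bound(gamma, r_min):
        return "continue"      # shutting down loses even to the worst case
    return "undetermined"      # the bounds alone do not settle the choice


# Illustrative numbers only; they are not taken from the post.
params = dict(c=4.0, gamma=0.5, r_min=0.5, r_max=1.0)
for p in (0.1, 0.5, 0.9):
    print(p, classify(p, **params))
```

With these illustrative numbers the agent never shuts down for p below 0.25 and always shuts down for p above 2/3; in between, the bounds say nothing, which is the gap between the two thresholds in Theorem 1.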

Corollary 1 is then obvious. For Corollary 2, note that , , and hence . Thus, since , by continuity of , there must exist such that .

# The full reward function

It might be useful to fully specify the reward function. Let be the indicator function that the AI shuts down (for the first time) on turn . Then the reward is:

• .