AIs in virtual worlds: discounted mixed utility/reward

Stuart_Armstrong

A putative new idea for AI control; index here.

In a previous post on AIs in virtual worlds, I described the idea of a utility function that motivates the AI to operate in a virtual world with certain goals in mind, but to shut down immediately if it detects that the outside world -- us -- is having an impact in the virtual world.

This is one way to implement such a goal, given certain restrictions on the AI's in-world utility. The restrictions are more natural for rewards rather than utilities, so they will be phrased in those terms.

Worlds $W$ and $W^{'}$

The world $W$ is deterministic, turn-based, and includes the AI; it is the virtual world we want the AI to care about. The world set $W^{'}$ is $W$ plus the possibility of stochastic influences (which is how the AI would model our influence in its world); see here for more details.

The AI has a shutdown action $S$ ; the idea is that the AI implements $S$ when $P (W^{'})$ gets high enough. We assume that the probability module $P$ of the AI is reflexively consistent in some suitable fashion.

Discounted reward results

Let the AI have a discount rate of $0 < γ < 1$, and write $f (p)$ for the function $(1 - p) / p$; note that $f$ is strictly decreasing in $p$.

Theorem 1: For all $0 \leq q < p \leq 1$ such that $f (p) / f (q) < (1 - γ)$, there exists a discounted reward function such that the AI will shut down whenever $P (W^{'}) > p$ and will never shut down while $P (W^{'}) < q$.

Note that a lower $γ$ -- a sharper discount -- means that $p$ and $q$ can be chosen to be closer together.
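As a rough illustration of the condition in Theorem 1, here is a minimal Python sketch that evaluates $f (p) / f (q) < 1 - γ$ for given thresholds. The function names `f` and `theorem1_applies` are made up for this example, and the sketch assumes $q > 0$ so that $f (q)$ is finite (the theorem itself also allows $q = 0$).

```python
def f(p: float) -> float:
    """Odds of W against W': f(p) = (1 - p) / p, defined here for 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return (1.0 - p) / p


def theorem1_applies(q: float, p: float, gamma: float) -> bool:
    """Check the hypothesis of Theorem 1: f(p) / f(q) < 1 - gamma.

    Assumption for this sketch: q > 0, so f(q) is finite and nonzero.
    """
    if not 0.0 < q < p <= 1.0:
        raise ValueError("need 0 < q < p <= 1")
    if not 0.0 < gamma < 1.0:
        raise ValueError("need 0 < gamma < 1")
    return f(p) / f(q) < 1.0 - gamma


if __name__ == "__main__":
    # Same thresholds, two discount rates.
    print(theorem1_applies(q=0.1, p=0.5, gamma=0.5))
    print(theorem1_applies(q=0.1, p=0.5, gamma=0.9))
```

With $q = 0.1$ and $p = 0.5$ the ratio $f (p) / f (q)$ is $1 / 9$, so the condition holds for $γ = 0.5$ but fails for $γ = 0.9$, matching the remark that a sharper discount allows closer thresholds.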

Corollary 1: For all $0 \leq q < p \leq 1$ such that there exists an $n$ with $f (p) / f (q) < (1 - γ^{n})$, there exists a discounted reward function such that the AI will shut down whenever $P (W^{'})$ has stayed above $p$ for the last $n$ turns, and will never shut down while $P (W^{'})$ has stayed below $q$ for the last $n$ turns.

Note that Corollary 1 allows $q$ and $p$ to be closer together than Theorem 1 allows - the price being that the probability needs to stay above or below the bounds for longer.
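To make the trade-off concrete, the following sketch searches for the smallest $n$ satisfying the condition of Corollary 1. The helper name `smallest_n` and the search cap `max_n` are assumptions of this example, and it again assumes $q > 0$.

```python
from typing import Optional


def smallest_n(q: float, p: float, gamma: float, max_n: int = 10_000) -> Optional[int]:
    """Smallest n >= 1 with f(p) / f(q) < 1 - gamma**n, or None if none is found.

    f(p) = (1 - p) / p as in the post. Assumes 0 < q < p <= 1 and 0 < gamma < 1.
    """
    if not 0.0 < q < p <= 1.0:
        raise ValueError("need 0 < q < p <= 1")
    if not 0.0 < gamma < 1.0:
        raise ValueError("need 0 < gamma < 1")
    ratio = ((1.0 - p) / p) / ((1.0 - q) / q)
    for n in range(1, max_n + 1):
        # 1 - gamma**n increases towards 1 as n grows, so the test gets easier.
        if ratio < 1.0 - gamma ** n:
            return n
    return None


if __name__ == "__main__":
    print(smallest_n(q=0.1, p=0.5, gamma=0.9))  # 2
    print(smallest_n(q=0.1, p=0.5, gamma=0.5))  # 1
```

Since $q < p$ gives $f (p) / f (q) < 1$, and $1 - γ^{n}$ tends to $1$, some $n$ always exists for $0 < q < p \leq 1$; thresholds that are closer together just need a larger $n$.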

Corollary 2: For all $0 < γ < 1$ , there exist $0 < q < p < 1$ with the properties of Theorem 1.

Proof

Proof by construction of a suitable reward function.

Let $T$ be the time that the AI shuts down (if it never shuts down, $T = \infty$), and $r_{t}$ the reward it gets on turn $t$, which lies between $a$ and $1$ for some constant $0 < a < 1$. It gets reward $R^{'}$ for shutting down in $W^{'}$, and has a discount rate of $γ$. Define $I_{W^{'}}$ as the indicator function for being in world $W^{'}$. Then the AI's total reward is defined to be:

$R_{T} = γ^{T} I_{W^{'}} R^{'} + \sum_{t = 0}^{T - 1} γ^{t} r_{t}$ .

We will then set the values of $R^{'}$ and $a$ to get the theorem (setting the value of $a$ can be achieved by an affine transformation of its rewards).

If it doesn't shut down, the agent's maximal future reward is $1 / (1 - γ)$, the discounted sum of a reward of $1$ on every turn (while its minimal future reward is $a / (1 - γ)$).

If the AI has a current probability $p$ of being in world $W^{'}$, its maximal expected reward for not shutting down this turn is $p γ R^{'} + (1 - p) / (1 - γ)$ -- this assumes the AI learns about $W$ versus $W^{'}$ perfectly in the next turn, and then gets maximal reward if it is in $W$. Its expected reward for immediate shutdown is $p R^{'}$.

The AI will thus shut down immediately when $p R^{'} > p γ R^{'} + (1 - p) / (1 - γ)$. Subtracting $p γ R^{'}$ from both sides and dividing by $p (1 - γ)$, this is the condition $R^{'} > \frac{1}{(1 - γ)^{2}} f (p)$.

If the AI's probability of being in world $W^{'}$ is $q$, its minimal expected reward for not shutting down immediately is $(1 - q) a / (1 - γ)$ -- this assumes the AI will never get any further information as to whether it is in $W^{'}$, will always get minimal reward, and will never shut down in future. Therefore it won't shut down for $q R^{'} < (1 - q) a / (1 - γ)$ -- i.e. $R^{'} < f (q) \frac{a}{1 - γ}$.

Multiplying the two bounds on $R^{'}$ by $1 - γ$ gives $f (p) / (1 - γ) < (1 - γ) R^{'} < a f (q)$. So, as long as $f (q) > f (p) / (1 - γ)$ -- equivalently $f (p) / f (q) < (1 - γ)$ -- we can choose an $a < 1$ close enough to $1$, and then an $R^{'}$, that satisfy both bounds.
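The choice of $a$ and $R^{'}$ can be sketched numerically. The code below is only one possible selection rule (midpoints of the admissible intervals), not something specified in the construction above; the name `choose_parameters` is hypothetical, and $q > 0$ is assumed.

```python
from typing import Tuple


def choose_parameters(q: float, p: float, gamma: float) -> Tuple[float, float]:
    """Pick (a, r_prime) satisfying both bounds from the proof.

    Bounds: r_prime > f(p) / (1 - gamma)**2 and r_prime < a * f(q) / (1 - gamma),
    with f(x) = (1 - x) / x. Midpoints are an arbitrary choice for illustration.
    """
    if not (0.0 < q < p <= 1.0 and 0.0 < gamma < 1.0):
        raise ValueError("need 0 < q < p <= 1 and 0 < gamma < 1")
    fp = (1.0 - p) / p
    fq = (1.0 - q) / q
    a_min = fp / ((1.0 - gamma) * fq)  # a must exceed this and stay below 1
    if not a_min < 1.0:
        raise ValueError("Theorem 1 condition f(p)/f(q) < 1 - gamma fails")
    a = (a_min + 1.0) / 2.0
    lower = fp / (1.0 - gamma) ** 2    # shutdown guaranteed above p
    upper = a * fq / (1.0 - gamma)     # no shutdown below q
    r_prime = (lower + upper) / 2.0
    return a, r_prime


if __name__ == "__main__":
    q, p, gamma = 0.1, 0.5, 0.5
    a, r_prime = choose_parameters(q, p, gamma)
    # Shutdown beats the best case of continuing at probability p.
    print(p * r_prime > p * gamma * r_prime + (1 - p) / (1 - gamma))
    # Continuing beats shutdown even in the worst case at probability q.
    print(q * r_prime < (1 - q) * a / (1 - gamma))
```

Any $a$ in $(f (p) / ((1 - γ) f (q)), 1)$ and any $R^{'}$ strictly between the two bounds would do equally well.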

Corollary 1 is then obvious. For Corollary 2, note that $1 / f (0) = 0$ , $f (1) = 0$ , and hence $f (1) / f (0) = 0$ . Thus, since $1 - γ > 0$ , by continuity of $f$ , there must exist $0 < q < p < 1$ such that $f (p) / f (q) < 1 - γ$ .
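The existence claim in Corollary 2 can also be made explicit. As a sketch under the assumption that we fix $p$ first (here $p = 0.5$ by default, an arbitrary choice), the condition $f (q) > f (p) / (1 - γ)$ can be solved for $q$; the function name `example_thresholds` is made up for this example.

```python
from typing import Tuple


def example_thresholds(gamma: float, p: float = 0.5) -> Tuple[float, float]:
    """Return one pair (q, p) with 0 < q < p < 1 and f(p) / f(q) < 1 - gamma.

    With f(x) = (1 - x) / x, the inequality f(q) > c is equivalent to
    q < 1 / (1 + c). Taking half of that bound is an arbitrary safe choice.
    """
    if not (0.0 < gamma < 1.0 and 0.0 < p < 1.0):
        raise ValueError("need 0 < gamma < 1 and 0 < p < 1")
    c = ((1.0 - p) / p) / (1.0 - gamma)  # required lower bound on f(q)
    q_max = 1.0 / (1.0 + c)
    return q_max / 2.0, p


if __name__ == "__main__":
    for gamma in (0.1, 0.5, 0.9, 0.99):
        q, p = example_thresholds(gamma)
        ratio = ((1.0 - p) / p) / ((1.0 - q) / q)
        print(gamma, q, p, ratio < 1.0 - gamma)
```

As $γ$ approaches $1$, the admissible $q$ shrinks towards $0$, which is the quantitative content of the continuity argument.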

The full reward function

It might be useful to fully specify the reward function. Let $I_{T}$ be the indicator function that the AI shuts down (for the first time) on turn $T$ . Then the reward $R$ is:

$R = \sum_{T = 0}^{\infty} γ^{T} (I_{T} I_{W^{'}} R^{'} + \prod_{t = 0}^{T} (1 - I_{t}) r_{T})$ .
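As a sanity check that this indicator form agrees with the expression for $R_{T}$ in the proof, here is a small sketch that evaluates both on a finite reward sequence. Truncating the infinite sum at the length of the reward list is an assumption of the example, as are the function names; the shutdown turn, if any, must fall within the list.

```python
import math
from typing import Optional, Sequence


def reward_closed_form(rewards: Sequence[float], shutdown_turn: Optional[int],
                       in_w_prime: bool, r_prime: float, gamma: float) -> float:
    """R_T = gamma^T * I_{W'} * R' + sum_{t < T} gamma^t * r_t (truncated if no shutdown)."""
    horizon = len(rewards) if shutdown_turn is None else shutdown_turn
    total = sum(gamma ** t * rewards[t] for t in range(horizon))
    if shutdown_turn is not None and in_w_prime:
        total += gamma ** shutdown_turn * r_prime
    return total


def reward_indicator_form(rewards: Sequence[float], shutdown_turn: Optional[int],
                          in_w_prime: bool, r_prime: float, gamma: float) -> float:
    """Sum over turns T of gamma^T * (I_T * I_{W'} * R' + prod_{t <= T}(1 - I_t) * r_T)."""
    if shutdown_turn is not None and not 0 <= shutdown_turn < len(rewards):
        raise ValueError("shutdown_turn must index into rewards")
    total = 0.0
    still_running = 1  # running product of (1 - I_t) for t = 0..T
    for turn, r in enumerate(rewards):
        i_turn = 1 if turn == shutdown_turn else 0
        still_running *= 1 - i_turn
        total += gamma ** turn * (i_turn * int(in_w_prime) * r_prime + still_running * r)
    return total


if __name__ == "__main__":
    args = ([1.0, 0.8, 0.9, 1.0], 2, True, 7.5, 0.5)
    print(math.isclose(reward_closed_form(*args), reward_indicator_form(*args)))
```

Once the AI has shut down, the running product is zero, so later turns contribute nothing, which is why both forms coincide.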

AI ALIGNMENT FORUM