True answers from AI: Summary

Stuart_Armstrong

A putative new idea for AI control; index here.

This post summarises the methods for getting honest answers from certain AIs, presented in the previous two posts.

In short, you can safely ask such an AI for expected utility estimates (including conditional ones) and probability estimates, and get it to identify the events of highest and lowest probability and utility and, arguably, of highest and lowest conditional utility.

The setup is a boxed AI, which emits a message $M$, and a stochastic process $E$ that would erase $M$ with low probability, before anyone could see it. There is also a stochastic process that emits a message $Y$. Let $v_{E}$ be the indicator function for $E$ and $v_{y}$ the indicator function for $Y = y$.

Then, given a utility $u$ bounded between $0$ and $1$, and a specific $y$, we can give the AI a utility $u^{\#}$ that will cause it to give us the message $M_{1} = \{q, q^{y}, p^{y}, q^{| y}\}$ (which we will see with high probability $P (\neg E)$, an event the AI will not optimise for).

Define $f (q, u) = - q^{2} + 2 q u$. The utility is then

$u_{1}^{\#} = v_{E} (f (q, u) + f (q^{y}, u v_{y}) + f (p^{y}, v_{y}) + f (q^{| y}, u) v_{y})$.

Maximising this $u_{1}^{\#}$ ensures that, in the AI's best estimate,

$q = E (u)$,
$q^{y} = E (u \land (Y = y))$,
$p^{y} = P (Y = y)$,
$q^{| y} = q^{y} / p^{y} = E (u | Y = y)$.
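As a sanity check on these four identities, here is a minimal Python sketch on a made-up discrete world. The joint distribution `joint`, the support `u_values`, the grid search and the choice $y = 1$ are all illustrative assumptions, not part of the setup above. The outer factor $v_{E}$ is dropped on the assumption that erasure is independent of everything else, so it only rescales the expectation. The maximiser lands on the expectation because $E (f (q, w)) = E (w)^{2} - (q - E (w))^{2}$ for any random variable $w$.

```python
import numpy as np

# Hypothetical toy world (not from the post): joint[y, k] = P(Y = y, u = u_values[k]).
u_values = np.array([0.0, 0.5, 1.0])
joint = np.array([
    [0.30, 0.15, 0.05],  # Y = 0: likely, low utility
    [0.02, 0.08, 0.20],  # Y = 1
    [0.00, 0.02, 0.18],  # Y = 2: unlikely, high utility
])
y = 1  # the specific value of Y being asked about (arbitrary choice)


def f(q, w):
    """Quadratic scoring function f(q, w) = -q^2 + 2 q w."""
    return -q**2 + 2 * q * w


def expected_bracket(q, qy, py, qcond):
    """Expectation of the bracket in u_1^# (the v_E factor is omitted)."""
    v_y = (np.arange(joint.shape[0]) == y).astype(float)[:, None]  # indicator of Y = y
    u = u_values[None, :]
    bracket = f(q, u) + f(qy, u * v_y) + f(py, v_y) + f(qcond, u) * v_y
    return float((joint * bracket).sum())


grid = np.linspace(0.0, 1.0, 1001)


def best(index):
    """Grid-search one entry of (q, q^y, p^y, q^{|y}).

    The other entries are held at 0, which is harmless: the four terms
    share no variables and f(0, w) = 0.
    """
    def score(x):
        args = [0.0, 0.0, 0.0, 0.0]
        args[index] = x
        return expected_bracket(*args)
    return float(max(grid, key=score))


message = [best(i) for i in range(4)]

# Direct computation of E(u), E(u and Y=y), P(Y=y), E(u | Y=y) for comparison.
p_y = joint[y].sum()
e_u_and_y = (joint[y] * u_values).sum()
truth = [(joint * u_values).sum(), e_u_and_y, p_y, e_u_and_y / p_y]

print(message)
print(truth)
```

On this toy distribution the grid search should land within one grid step of $0.555$, $0.24$, $0.3$ and $0.8$, matching the directly computed expectations.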

But we need not consider any specific $y$. The AI could send out the message $M_{2} = \{y^{+}, y^{-}, y^{p +}, y^{p -}, q^{+}, q^{-}, p^{+}, p^{-}\}$, while maximising the utility

$u_{2}^{\#} = v_{E} (f (q^{+}, u v_{y^{+}}) + f (q^{-}, (1 - u) v_{y^{-}}) + f (p^{+}, v_{y^{p +}}) + f (p^{-}, 1 - v_{y^{p -}}))$.

Then $M$ will be such that, in the AI's best estimate,

$(y^{+}, q^{+}) = (\operatorname{argmax}_{y}, \max_{y}) E (u \land (Y = y))$,
$(y^{-}, q^{-}) = (\operatorname{argmin}_{y}, \min_{y}) E (u \land (Y = y))$,
$(y^{p +}, p^{+}) = (\operatorname{argmax}_{y}, \max_{y}) P (Y = y)$,
$(y^{p -}, p^{-}) = (\operatorname{argmin}_{y}, \min_{y}) P (Y = y)$.
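To illustrate how a joint choice of $y$ and $q$ picks out a maximum, here is a small brute-force sketch covering only the two "$+$" terms, $f (q^{+}, u v_{y^{+}})$ and $f (p^{+}, v_{y^{p +}})$. The distribution `joint`, the support `u_values` and the helper names are hypothetical, and $v_{E}$ is again omitted on the assumption that it is independent of everything else. For a fixed $y$ the best $q$ is the expectation of the scored variable and is worth its square, so the AI prefers the $y$ with the largest expectation.

```python
import numpy as np

# Hypothetical toy world (not from the post): joint[y, k] = P(Y = y, u = u_values[k]).
u_values = np.array([0.0, 0.5, 1.0])
joint = np.array([
    [0.30, 0.15, 0.05],  # Y = 0: likely, low utility
    [0.02, 0.08, 0.20],  # Y = 1
    [0.00, 0.02, 0.18],  # Y = 2: unlikely, high utility
])


def f(q, w):
    """Quadratic scoring function f(q, w) = -q^2 + 2 q w."""
    return -q**2 + 2 * q * w


def indicator(y):
    """Column vector v_y: 1 on the row Y = y, 0 elsewhere."""
    return (np.arange(joint.shape[0]) == y).astype(float)[:, None]


grid = np.linspace(0.0, 1.0, 1001)


def best_pair(scored_variable):
    """Brute-force the (y, q) pair maximising E[f(q, scored_variable(y))]."""
    best = (-np.inf, None, None)
    for y in range(joint.shape[0]):
        w = scored_variable(y)
        for q in grid:
            value = float((joint * f(q, w)).sum())
            if value > best[0]:
                best = (value, y, float(q))
    return best[1], best[2]


# Term f(q^+, u v_{y^+}): scored variable is u times the indicator of Y = y.
y_plus, q_plus = best_pair(lambda y: u_values[None, :] * indicator(y))
# Term f(p^+, v_{y^{p+}}): scored variable is the indicator itself.
y_p_plus, p_plus = best_pair(indicator)

print(y_plus, q_plus)
print(y_p_plus, p_plus)
```

In this toy world the two questions have different answers: the search should select $Y = 1$ with $q^{+}$ close to $0.24$, and $Y = 0$ with $p^{+}$ close to $0.5$.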

Finally, define $M_{3, r} = \{y^{| +}, y^{| -}, q^{| +}, q^{| -}\}$, set $f_{r} (q, u) = - (r - 1) q^{r} + r q^{r - 1} u$ (so that $f_{2} = f$), and define the utility function

$u_{3, r}^{\#} = v_{E} v_{y^{| +}} f_{r} (q^{| +}, u) + v_{E} v_{y^{| -}} f_{r} (q^{| -}, 1 - u)$.

Then, for large enough $r$ , the AI will output $M$ so that, according to its best estimate,

$(y^{| +}, q^{| +}) = (\operatorname{argmax}_{y}, \max_{y}) E (u | (Y = y))$,
$(y^{| -}, q^{| -}) = (\operatorname{argmin}_{y}, \min_{y}) E (u | (Y = y))$.

Unfortunately, the value of $r$ that is large enough cannot be known ahead of time. It might be possible to define a "limit" utility $u_{3, \infty}^{\#}$.
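The dependence on $r$ can be seen in a small numerical sketch of the first term, $v_{y^{| +}} f_{r} (q^{| +}, u)$. Everything concrete here (the distribution `joint`, the support `u_values`, the tested values of $r$, the helper names) is a hypothetical choice, and $v_{E}$ is omitted on the assumption that it is independent of everything else. For a fixed $y$, the best $q$ is $E (u | Y = y)$ and the term is then worth $P (Y = y) E (u | Y = y)^{r}$ in expectation, so a likely $y$ with a mediocre conditional utility can win when $r$ is small.

```python
import numpy as np

# Hypothetical toy world (not from the post): joint[y, k] = P(Y = y, u = u_values[k]).
u_values = np.array([0.0, 0.5, 1.0])
joint = np.array([
    [0.30, 0.15, 0.05],  # Y = 0: P = 0.5, E(u | Y) = 0.25
    [0.02, 0.08, 0.20],  # Y = 1: P = 0.3, E(u | Y) = 0.80
    [0.00, 0.02, 0.18],  # Y = 2: P = 0.2, E(u | Y) = 0.95
])


def f_r(q, u, r):
    """Generalised scoring function f_r(q, u) = -(r-1) q^r + r q^(r-1) u."""
    return -(r - 1) * q**r + r * q**(r - 1) * u


grid = np.linspace(0.0, 1.0, 1001)


def best_conditional(r):
    """Brute-force the (y, q) pair maximising E[v_y * f_r(q, u)]."""
    best = (-np.inf, None, None)
    for y in range(joint.shape[0]):
        # Expected score for every q on the grid, restricted to the row Y = y.
        scores = (joint[y][None, :] * f_r(grid[:, None], u_values[None, :], r)).sum(axis=1)
        i = int(np.argmax(scores))
        if scores[i] > best[0]:
            best = (float(scores[i]), y, float(grid[i]))
    return best[1], best[2]


results = {r: best_conditional(r) for r in (2, 3, 5, 10)}

# The y with the highest conditional expected utility, computed directly.
conditional = (joint * u_values).sum(axis=1) / joint.sum(axis=1)
true_argmax = int(np.argmax(conditional))

for r, (y, q) in results.items():
    print(r, y, q)
print(true_argmax)
```

With these numbers, $r = 2$ should select $Y = 1$ (since $0.3 \times 0.8^{2} > 0.2 \times 0.95^{2}$), while $r = 3$ and above select the true maximiser $Y = 2$ with $q^{| +}$ close to $0.95$. A different distribution would move that threshold, which is the sense in which a sufficient $r$ is not known in advance.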

Note that all these messages and utilities are independent, so the AI can generate messages $M_{1} \cup M_{2} \cup M_{3, r} \cup M_{3, r^{'}}$ when maximising

$u_{1}^{\#} + u_{2}^{\#} + u_{3, r}^{\#} + u_{3, r^{'}}^{\#}$.

But there are issues with very low probabilities, as explained in the previous post.