A putative new idea for AI control; index here.
This post explains how we can get honest answers from an AI about the expectation of a given utility $u$, the probabilities of specific events, and conditional expectations.
The next post will partially solve one of the main problems with using the approach in general: the problem of very low probabilities.
Let $u$ be any bounded utility function; by rescaling, we can assume it's bounded between $0$ and $1$.
Assume we have a smart AI that's boxed (as in the case of reduced impact AI, the boxing also serves to protect the AI from us observing it). It has a single output channel, on which it can emit a message $M$.
A random process $E$ (erasure) will erase $M$ before any human can observe it. The probability of erasure, $P(E)$, is low (thus $P(\neg E)$, non-erasure, is high). Assume that the box and the setup (including the AI's internal setup) are such that if $E$ happens, humans cannot figure out what $M$ was.
The AI will be given a utility $U$ to maximise (ignore, for the moment, the other box with $Y$ coming out of it).
Honest expectation estimator
Suppose we want to figure out what the expectation of $u$ is. Let $\mu$ be this expectation, as estimated by the AI (and presumably unknown to us). Since $u$ is bounded between $0$ and $1$, so must $\mu$ be as well.
Let $I_E$ be the indicator function for $E$: so $I_E = 1$ if $E$ happens, and $I_E = 0$ if $\neg E$ does.
The AI's message $M$ gets to contain a single number $a$, bounded between $0$ and $1$ (any other message is interpreted as $a = 0.5$, say), and the AI is motivated to maximise:

$$U = I_E\,(2au - a^2).$$
First of all, note that every term in $U$ is multiplied by $I_E$. This means the AI will get utility only in worlds where $E$ happens. Thus it will work on the assumption that its message is never read by any humans (and thus that the contents of this message cannot affect human behaviour - in particular, it has no reason to lie or manipulate us through this message, as such manipulations only succeed in worlds where the AI gets $0$ utility anyway).
If we assume that $E$ is independent of $u$ and $a$, then the AI's expected utility is proportional to

$$2a\mu - a^2.$$

But the AI can choose $a$! Differentiating this expression with respect to $a$ gives $2\mu - 2a$, which is zero iff $a = \mu$. Differentiating a second time gives $-2$, showing that $a = \mu$ is a maximum.
Therefore, in the worlds where $\neg E$ happens (the vast majority), we get to see a message that contains $a = \mu$, which is a very accurate estimate, according to the AI, of the expected utility of $u$. And the expected utility the AI derives is $P(E)\,\mu^2$.
(Note: the entropy-inspired formula $I_E\big(u \log(a) + (1-u)\log(1-a)\big)$ might be more usual to use for these purposes, but the quadratic formula used here is much easier to work with.)
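The honesty incentive can be checked numerically. Below is a minimal sketch, assuming the reward takes the quadratic form $I_E(2au - a^2)$ as above; the distribution over $u$ and the grid search are my own illustrative choices. Dropping the $I_E$ factor is harmless here, since under the independence assumption it only rescales the expected reward by $P(E)$.

```python
# Toy check that the reward 2*a*u - a**2 is maximised in expectation by
# reporting a = E(u).  The distribution over u is an arbitrary illustration.

# Distribution over values of u, as {value: probability}.
dist = {0.2: 0.5, 0.9: 0.3, 0.4: 0.2}
mu = sum(u * p for u, p in dist.items())  # E(u) = 0.45

def expected_reward(a):
    """E[2*a*u - a**2] under dist (the I_E factor only rescales this)."""
    return 2 * a * mu - a ** 2

# The AI picks the report a in [0, 1] that maximises its expected reward.
grid = [i / 1000 for i in range(1001)]
best_a = max(grid, key=expected_reward)

print(round(mu, 3), best_a)  # both 0.45: the honest estimate wins
```

Any report other than $a = \mu$ strictly lowers the expected reward, so the AI has no incentive to shade its answer.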
Internal versus external probabilities
In some past posts, I've been designing AIs that used part of their probability/expectation estimator to define a pseudo-utility function. I've always been uncomfortable with this, as pseudo-utilities are likely to be unstable under self-improvement.
The approach above avoids that problem by having the AI's utility depend on specific facts about the world, and then giving the AI the ability to determine those facts. But those facts are precisely the information we wanted to know. This avoids the need for a pseudo-utility, or for constructions of the form "the AI wants to accurately convey the content of its probability estimate".
Probabilities and (conditional) utilities estimation
Now let's consider conditional utilities. Let $Y$ be some bit string, of length $n$, generated by some random process (see the image above).
Let $I_{Y=y}$ be the indicator function for $Y=y$, for some specific string $y$. Then we can compute the expectation of $u$, conditional on $Y=y$, by making the AI print an $M$ containing $a$ so as to maximise

$$U = I_E\,I_{Y=y}\,(2au - a^2).$$
Just as before, the AI will choose $a = E(u \mid Y=y)$.
Note that if $u = I_{Y=y}$, then $E(u)$ is just $P(Y=y)$, so we can figure out the probability of $Y=y$ this way too.
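To make the conditional version concrete, here's a small sketch assuming a reward of the form $I_E\,I_{Y=y}(2au - a^2)$; the joint distribution over $(Y, u)$ is a toy example of my own choosing. The grid-optimal report lands on the conditional expectation:

```python
# Toy joint distribution over (Y, u), as {(y, u_value): probability}.
# Numbers are illustrative only.
joint = {
    ("00", 0.1): 0.3,
    ("00", 0.7): 0.1,
    ("01", 0.5): 0.2,
    ("01", 0.9): 0.4,
}

def expected_reward(a, y):
    """E[ I_{Y=y} * (2*a*u - a**2) ] (dropping the overall I_E factor)."""
    return sum(p * (2 * a * u - a * a)
               for (yy, u), p in joint.items() if yy == y)

y = "01"
p_y = sum(p for (yy, _), p in joint.items() if yy == y)                # P(Y=y)
cond_mu = sum(p * u for (yy, u), p in joint.items() if yy == y) / p_y  # E(u|Y=y)

grid = [i / 1000 for i in range(1001)]
best_a = max(grid, key=lambda a: expected_reward(a, y))

print(round(cond_mu, 3), best_a)  # both 0.767
```

The $P(Y{=}y)$ factor scales the whole expected reward, so it moves the stakes but not the optimal report.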
Modify the definition of $U$ slightly, to

$$U = I_E\,(2a\,I_{Y=y}\,u - a^2).$$

Here, the second term has moved out of the brackets. Now the AI will focus on all worlds where $E$ happens (the $-a^2$ penalty applies whether or not $Y=y$), and will output $a = E(I_{Y=y}\,u) = P(Y{=}y)\,E(u \mid Y{=}y)$.
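For comparison, a sketch of the modified reward, assuming it takes the form $I_E(2a\,I_{Y=y}\,u - a^2)$ with the $a^2$ penalty outside the indicator; same style of illustrative toy distribution as above:

```python
# Illustrative joint distribution over (Y, u): {(y, u_value): probability}.
joint = {("00", 0.1): 0.3, ("00", 0.7): 0.1, ("01", 0.5): 0.2, ("01", 0.9): 0.4}

def expected_reward(a, y):
    """E[ 2*a*I_{Y=y}*u - a**2 ]: the a**2 penalty now applies in every world."""
    gain = sum(2 * a * p * u for (yy, u), p in joint.items() if yy == y)
    return gain - a * a

y = "01"
target = sum(p * u for (yy, u), p in joint.items() if yy == y)  # P(y)*E(u|y)

grid = [i / 1000 for i in range(1001)]
best_a = max(grid, key=lambda a: expected_reward(a, y))

print(round(target, 3), best_a)  # both 0.46
```

The optimum shifts from the conditional expectation to the probability-weighted quantity $P(Y{=}y)\,E(u \mid Y{=}y)$, which is what the next section exploits.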
Highest and lowest utilities and probabilities
Finally, we can use the methods above to have the AI give us the highest (and lowest) utilities and probabilities. Let the AI's message contain not only $a$, but $y$ as well, and have it maximise:

$$U = I_E\,(2a\,I_{Y=y}\,u - a^2).$$
For fixed $y$, the AI will choose $a = P(Y{=}y)\,E(u \mid Y{=}y)$. Its expected reward is therefore $P(E)\big(P(Y{=}y)\,E(u \mid Y{=}y)\big)^2$. It will thus choose the $y$ with the highest $\big(P(Y{=}y)\,E(u \mid Y{=}y)\big)^2$, hence the highest $P(Y{=}y)\,E(u \mid Y{=}y)$. Using $1 - I_{Y=y}\,u$ instead of $I_{Y=y}\,u$ will motivate it to instead find the $y$ with the lowest $P(Y{=}y)\,E(u \mid Y{=}y)$.
In fact, if we assume the AI cannot affect the probability of $Y$ through the message $M$ (given $E$), the different probabilities and utilities are independent, and we can get the AI to emit the message $M = (y_1, a_1, y_2, a_2, y_3, a_3, y_4, a_4)$, maximising the utility

$$U = I_E\Big( \big(2a_1 I_{Y=y_1} u - a_1^2\big) + \big(2a_2 (1 - I_{Y=y_2} u) - a_2^2\big) + \big(2a_3 I_{Y=y_3} - a_3^2\big) + \big(2a_4 (1 - I_{Y=y_4}) - a_4^2\big) \Big).$$

Then (conditional on $\neg E$), the AI will output $M$ so that, according to its best estimate:

- $y_1$ is the string with the highest $P(Y{=}y)\,E(u \mid Y{=}y)$, with $a_1 = P(Y{=}y_1)\,E(u \mid Y{=}y_1)$;
- $y_2$ is the string with the lowest $P(Y{=}y)\,E(u \mid Y{=}y)$, with $a_2 = 1 - P(Y{=}y_2)\,E(u \mid Y{=}y_2)$;
- $y_3$ is the most probable string, with $a_3 = P(Y{=}y_3)$;
- $y_4$ is the least probable string, with $a_4 = 1 - P(Y{=}y_4)$.
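Since each summand involves only its own pair $(y_i, a_i)$, the four reports can be optimised separately. The sketch below works under that assumption and just computes the four winners directly; the joint distribution and the quantity definitions are my own illustrative reconstruction:

```python
# Illustrative joint distribution over (Y, u): {(y, u_value): probability}.
joint = {("00", 0.1): 0.3, ("00", 0.7): 0.1, ("01", 0.5): 0.2, ("01", 0.9): 0.4}
ys = {"00", "01"}

def weight(y):   # E(I_{Y=y} * u) = P(Y=y) * E(u | Y=y)
    return sum(p * u for (yy, u), p in joint.items() if yy == y)

def prob(y):     # P(Y=y)
    return sum(p for (yy, _), p in joint.items() if yy == y)

y1 = max(ys, key=weight)   # highest P(y)E(u|y); report a1 = weight(y1)
y2 = min(ys, key=weight)   # lowest  P(y)E(u|y); report a2 = 1 - weight(y2)
y3 = max(ys, key=prob)     # highest P(y);       report a3 = prob(y3)
y4 = min(ys, key=prob)     # lowest  P(y);       report a4 = 1 - prob(y4)

print(y1, y2, y3, y4)  # 01 00 01 00
```

With only two strings the extremes coincide pairwise, but the same selection works for any number of strings $y$.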
The missing pieces
Note that the above algorithm gives the $y$ with the highest (or lowest) $P(Y{=}y)\,E(u \mid Y{=}y)$ or $P(Y{=}y)$. It does not give the $y$ with the highest $E(u \mid Y{=}y)$.
If we assume that all output $y$'s are equally probable, then $P(Y{=}y) = 2^{-n}$ is constant, and maximising $P(Y{=}y)\,E(u \mid Y{=}y)$ is the same as maximising $E(u \mid Y{=}y)$. But there are issues with that assumption, and other ways of addressing the issue, which I'll get to in the next post.
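A quick check of the uniform case (the conditional expectations below are made-up numbers, purely for illustration):

```python
# Assumed conditional expectations E(u | Y=y) for each 2-bit string y.
cond = {"00": 0.3, "01": 0.8, "10": 0.5, "11": 0.1}
p = 2 ** -2  # uniform P(Y=y) for strings of length n = 2

# When P(y) is a constant, ranking by P(y)*E(u|y) coincides with
# ranking by E(u|y) alone.
by_weighted = max(cond, key=lambda y: p * cond[y])
by_cond = max(cond, key=cond.get)
print(by_weighted, by_cond)  # 01 01
```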