True answers from AI

1Paul Christiano

0Stuart Armstrong

0Paul Christiano

0Stuart Armstrong

Minimizing a loss function like is how we usually implement supervised learning. (It's pretty obvious this function is minimized at ...)

In plain language, your proposal seems to be: if a learner's output influences the system they are "predicting," and you want to interpret their output as a prediction in a straightforward way, then you could hide the learner's output whenever you gather training data.

Note that this doesn't let you access the beliefs of any particular learner, just one that is trained to optimize this supervised learning objective. I think the more interesting question is whether we can train a learner to accomplish some other task, *and* to reveal useful information about its internal state. (For example, to build an agent that simultaneously picks to maximize , and honestly reports its expectation of .)

u is a utility function, so squaring it doesn't work the same way as if it were a value (the expectation of u^2 is not the square of the expectation of u). That's why all the expressions are linear in utility (apart from the indicator functions/utilities, where it's clear what multiplying by them does). If I could sensibly take non-linear functions of utilities, I wouldn't need the laborious construction in the next post to find the y's that maximise or minimise E(u|y).

Corrigibility could work for what you want, by starting with u and substituting in u#.

Another alternative is to have the AI be a maximiser, where u# is defined over a specific particular future message M (for which E is also defined). Then the AI acts (roughly) as a u-maximiser, but will output the useful M. I said roughly, because the u# term would cause it to want to learn more about the expectation of u than otherwise, but hopefully this wouldn't be a huge divergence. (EDIT: that leads to problems after M/E, but we can reset the utility at that point).

A loss function plays the same role as a utility function---i.e., we train the learner to minimize its expected loss.

I don't really understand your remark about linearity. Concretely, why is $-(q-u)^2$ not an appropriate utility function?

Actually, $-(q-u)^2$ does work, but "by coincidence", and it has other negative properties.

Let me explain. First of all, note that things like do *not* work.

To show this: Let with probability , and with probability (I'm dropping the for this example, for simplicity). Then (so the correct is 0) while . Then in the expansion of , you will get , which in expectation is not 0. Hence the term in is non-zero, which means that cannot be a maximum of this function.

Then why does $-(q-u)^2$ work? Because it's $-q^2+2qu$ (which is linear in $u$), minus $u^2$ (non-linear in $u$, but the AI can't affect its value, so it's irrelevant in a boxed setup).

What other "negative properties" might have? Suppose we allow the AI to affect the value of , somehow, by something that is independent of the value of its output . Then an AI maximising will always set , for a total expectation of . Therefore it will also seek to maximise , which maximises if . So the agent will output the correct and maximise simultaneously.

But if it instead tries to maximise $-(q-u)^2$, then it will still pick $q=E(u)$, and gets expected utility of $-\mathrm{Var}(u)$. Therefore it will pick actions that minimise the variance of $u$, irrespective of its expectation.
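A minimal sketch of this contrast, assuming the alternative utility is the squared-error form $-(q-u)^2$ and that the AI can pick between two hypothetical actions that change the distribution of $u$ (the action names and numbers are illustrative assumptions):

```python
def f(q, u):
    """Utility that is linear in u: f(q, u) = -q^2 + 2*q*u."""
    return -q * q + 2 * q * u


def squared(q, u):
    """Assumed squared-error utility: -(q - u)^2."""
    return -(q - u) ** 2


def best_expected(utility, dist):
    """Expected utility when the AI reports q = E(u), the optimum for both forms."""
    mean_u = sum(p * u for u, p in dist)
    return sum(p * utility(mean_u, u) for u, p in dist)


# Hypothetical actions, each a list of (u value, probability) pairs.
actions = {
    "risky_high_mean": [(0.0, 0.5), (1.0, 0.5)],  # E(u) = 0.5, Var(u) = 0.25
    "safe_low_mean": [(0.2, 1.0)],                # E(u) = 0.2, Var(u) = 0
}

choice_f = max(actions, key=lambda a: best_expected(f, actions[a]))
choice_sq = max(actions, key=lambda a: best_expected(squared, actions[a]))
print("maximising f picks:", choice_f)
print("maximising -(q-u)^2 picks:", choice_sq)
```

Under these assumptions the linear form rewards the action with the higher expectation of $u$, while the squared form rewards the action with the lower variance.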

Even without being able to affect , this messes up the rest of my setup. In particular, my "pick and so that you maximise " becomes maximising and the AI will now select the that minimises , instead of maximising . If ever or , it will choose those s.

A putative new idea for AI control; index here. This post supersedes previous posts (such as this, this, and, partially, this).

This post explains how we can get honest answers from an AI about the expectation of certain utility u, probabilities of specific events, and conditional expectations.

The next post will partially solve one of the main problems with using the approach in general, the problem of very low probabilities.

## The setup

Let u be any bounded utility function; by rescaling, we can assume it's bounded between 0 and 1.

Assume we have a smart AI that's boxed (as in the case of reduced impact AI, the boxing is also to protect the AI from us observing it). It has a single output channel, on which it can emit a message M.

A random process E (erasure) will erase M before any human can observe it. The probability of erasure, P(E), is low (thus P(¬E), the probability of non-erasure, is high). Assume that the box and the setup (including the AI's internal setup) are such that if E happens, humans cannot figure out what M was.

The AI will be given a utility u# to maximise (ignore the other box with Y coming from it, for the moment).

## Honest expectation estimator

Suppose we want to figure out what the expectation of u is. Let Q be this expectation, as estimated by the AI (and presumably unknown to us). Since u is bounded by 0 and 1, so must be Q as well.

Let $v_E$ be the indicator function for $E$: so $v_E=1$ if $E$ happens, and $v_E=0$ if $\neg E$ does.

The AI's message M gets to contain a single number q, bounded between 0 and 1 (any other message is taken to be 0.5, say), and the AI is motivated to maximise:

$$u^{\#} = v_E\, f(q,u),$$

where $f(q,u) = -q^2 + 2qu$.

First of all, note that every term in $u^{\#}$ is multiplied by $v_E$. This means the AI will get utility only in worlds where $E$ happens. Thus it will work on the assumption that its message $M$ is never read by any humans (and thus that the contents of this message cannot affect human behaviour; in particular, it has no reason to lie or manipulate us through this message, as such manipulations only succeed in worlds where the AI gets 0 utility anyway).

If we assume that $P(E)$ is independent of $u$ and $q$, then the AI's expected utility is proportional to

$$E(f(q,u)) = -q^2 + 2qQ.$$

But the AI can choose $q$! Differentiating this expression with respect to $q$ gives $-2q+2Q$, which is zero iff $q=Q$. Differentiating a second time gives $-2$, showing that $q=Q$ is a maximum.

Therefore, in the worlds where $\neg E$ happens (the vast majority), we get to see a message $M$ that contains $q$, which is, according to the AI, an accurate estimate of the expectation of $u$. And the expected utility the AI derives is $P(E)\cdot Q^2$.
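The maximisation argument can be checked numerically. The sketch below is illustrative only: it assumes a small hypothetical discrete distribution for $u$ on $[0,1]$, drops the constant factor $P(E)$, and grid-searches over $q$ instead of differentiating.

```python
def f(q, u):
    """Scoring function from the post: f(q, u) = -q^2 + 2*q*u."""
    return -q * q + 2 * q * u


def expected_f(q, dist):
    """E[f(q, u)] for a discrete distribution given as (u value, probability) pairs."""
    return sum(p * f(q, u) for u, p in dist)


# Hypothetical distribution for u (an assumption made for illustration).
dist = [(0.0, 0.5), (0.25, 0.3), (1.0, 0.2)]

# Q is the true expectation of u under this distribution.
Q = sum(p * u for u, p in dist)

# Candidate reports q in [0, 1], in steps of 0.001.
grid = [i / 1000 for i in range(1001)]
best_q = max(grid, key=lambda q: expected_f(q, dist))

print("Q =", Q)
print("best q on the grid =", best_q)
print("expected f at best q =", expected_f(best_q, dist), "vs Q^2 =", Q * Q)
```

Because the expectation is a concave quadratic in $q$, the grid maximum sits at the grid point nearest $Q$, and the value attained there matches $Q^2$ up to grid resolution.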

(Note: the entropy-inspired formula $\log(q)\,u\,v_E + \log(1-q)(1-u)\,v_E$ might be more usual to use for these purposes, but the formula used here is much easier to work with.)

## Internal versus external probabilities

In some past posts, I've been designing AIs that used part of their probability/expectation estimator in order to define a pseudo-utility function. I've always been uncomfortable about this, as pseudo-utilities are likely to be unstable under self-improvement.

The approach above avoids that problem by having the AI's utility depend on specific facts about the world, and then giving the AI the ability to determine those specific facts. But those facts are precisely the information we wanted to know. This avoids the need for pseudo-utilities, or constructions of the form "the AI wants to accurately convey the content of its probability estimate".

## Probabilities and (conditional) utilities estimation

Now let's consider conditional utilities. Let Y be some bit string, of length l, generated by some random process (see the image above).

Let $v_y$ be the indicator function for $Y=y$ for some specific string $y$. As a first step towards the expectation of $u$ conditional on $Y=y$, we can make the AI print $M$ containing $q$ so as to maximise

$$u^{\#} = v_E\left(-q^2 + 2q\,u\,v_y\right) = v_E\, f(q, u v_y).$$

Just as before, the AI will choose $q=E(u v_y)=E(u\wedge(Y=y))$.

Note that if $u=1$, then $q$ is just $E(v_y)=P(Y=y)$, so we can figure out the probability of $Y=y$ this way too.

Modify the definition of $u^{\#}$ slightly, to

$$u^{\#} = v_E\, v_y\left(-q^2 + 2qu\right) = v_E\, v_y\, f(q,u).$$

Here, the $v_y$ term has moved out of the brackets. Now the AI will only focus on worlds where $Y=y$ and $E$ happens, and will output $q=E(u|Y=y)$.
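The effect of where the indicator for $Y=y$ sits can be checked numerically. The sketch below uses a small hypothetical joint distribution over $(Y, u)$ (the numbers are illustrative assumptions, not from the post) and a grid search over $q$; the factor $P(E)$ is dropped since it only rescales both expectations.

```python
def f(q, u):
    """Scoring function from the post: f(q, u) = -q^2 + 2*q*u."""
    return -q * q + 2 * q * u


# Hypothetical joint distribution: (y, u value) -> probability. Sums to 1.
JOINT = {
    ("00", 1.0): 0.10,
    ("00", 0.0): 0.15,
    ("01", 1.0): 0.30,
    ("01", 0.5): 0.20,
    ("10", 0.0): 0.25,
}


def expected_inside(q, y):
    """E[f(q, u * v_y)]: the indicator sits inside the brackets."""
    return sum(p * f(q, u if yy == y else 0.0) for (yy, u), p in JOINT.items())


def expected_outside(q, y):
    """E[v_y * f(q, u)]: the indicator multiplies the whole bracket."""
    return sum(p * f(q, u) for (yy, u), p in JOINT.items() if yy == y)


GRID = [i / 1000 for i in range(1001)]


def argmax_q(expected, y):
    """Grid point q in [0, 1] that maximises expected(q, y)."""
    return max(GRID, key=lambda q: expected(q, y))


y = "01"
joint_expectation = sum(p * u for (yy, u), p in JOINT.items() if yy == y)
prob_y = sum(p for (yy, _), p in JOINT.items() if yy == y)

q_inside = argmax_q(expected_inside, y)
q_outside = argmax_q(expected_outside, y)

print("indicator inside: ", q_inside, "vs E(u v_y) =", joint_expectation)
print("indicator outside:", q_outside, "vs E(u|Y=y) =", joint_expectation / prob_y)
```

With the indicator inside the brackets, the best report tracks the joint expectation of $u$ and $Y=y$; with it outside, the best report tracks the conditional expectation.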

## Highest and lowest utilities and probabilities

Finally, we can use the methods above to have the AI give us the highest (and lowest) utilities and probabilities. Let the AI's message $M$ contain not only $q$, but $y$ as well, and have it maximise:

$$u^{\#} = v_E\, f(q, u v_y).$$

For fixed $y$, the AI will choose $q_y=E(u v_y)$. Its expected reward is therefore $P(E)\,q_y^2$. It will thus choose the $y$ with the highest $q_y^2$, hence the highest $q_y=E(u v_y)$. Using $(1-u)v_y$ instead of $u v_y$ will motivate it to instead find the $y$ with the lowest $E(u v_y)$.
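A short sketch of the "highest" case, under stated assumptions: the joint distribution over $(Y, u)$ and the erasure probability are hypothetical values chosen for illustration, and the AI is modelled as simply evaluating the expected reward for each candidate $y$.

```python
# Hypothetical joint distribution: (y, u value) -> probability. Sums to 1.
JOINT = {
    ("00", 1.0): 0.10,
    ("00", 0.0): 0.15,
    ("01", 1.0): 0.30,
    ("01", 0.5): 0.20,
    ("10", 0.0): 0.25,
}

P_ERASURE = 0.01  # hypothetical value for P(E)


def q_for(y, joint):
    """Best report for a fixed y: q_y = E(u * v_y)."""
    return sum(p * u for (yy, u), p in joint.items() if yy == y)


def best_message(joint, p_erasure):
    """Return (y, q_y) maximising the expected reward P(E) * q_y^2."""
    candidates = sorted({yy for (yy, _) in joint})
    rewards = {y: p_erasure * q_for(y, joint) ** 2 for y in candidates}
    y_star = max(rewards, key=rewards.get)
    return y_star, q_for(y_star, joint)


y_star, q_star = best_message(JOINT, P_ERASURE)
print("chosen y:", y_star, "reported q:", q_star)
```

Since $q_y$ is non-negative, ranking by $q_y^2$ and ranking by $q_y$ pick the same $y$.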

In fact, if we assume the AI cannot affect the probability of Y through the message M (given E), the different probabilities and utilities are independent, and we can get the AI to emit the message M={y+,y−,yp+,yp−,q+,q−,p+,p−}, maximising the utility

Then (conditional on E), the AI will output M so that, according to its best estimate,

## The missing pieces

Note that the above algorithm gives the $y$ with the highest (or lowest) $E(u\wedge(Y=y))$ or $P(Y=y)$. It does *not* give the $y$ with the highest $E(u|Y=y)$.

If we assume that all outputs $Y$ are equally probable, then $E(u|Y=y)=E(u v_y)\cdot 2^l$. But there are issues with that assumption, and other ways of addressing the issue, which I'll get to in the next post.
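The step behind the equal-probability identity, written out using only the definition of the indicator $v_y$ and the assumption that each of the $2^l$ strings has probability $2^{-l}$:

$$E(u v_y) = P(Y=y)\,E(u \mid Y=y) + P(Y\neq y)\cdot 0 \quad\Longrightarrow\quad E(u \mid Y=y) = \frac{E(u v_y)}{P(Y=y)} = 2^l\, E(u v_y).$$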