The utility function mentioned here is somewhat peculiar; see here.

Paul Christiano asked whether we could get an agent to print out the expectation of the utility that it was currently maximising. As I'm always eager to please, here's a design to do that.

## The setup

The setup is slightly different to that presented previously. In particular, the agent is no longer boxed, though it has access to a box it can use to emit a message $M$. As before, there is an erasure process $E$ that will erase the message (and $\neg E$ will let it through unimpeded).

Assume $P(E)=10^{-3}$ (I'll be using the $10^{-3}$ factor a lot in this example, for illustration, though all the different factors need not be the same, nor exactly set to this). Let $u$ be a utility function bounded between $0$ and $1$.

The content of $M$ is a single number $0 \le q \le 1$; we'll want this number to be the AI's estimate of the utility of $u$. More precisely, we want it to be the AI's estimate of $u$'s expected utility, given $E$ -- ie given that the message is not read. Of particular importance is the fact that $E(u \mid E, q) = E(u \mid E)$. Thus when the message is erased, its contents don't matter.

Let $v_E$ be the indicator function for $E$ occurring (thus $E(v_E)=P(E)=10^{-3}$). Then give the AI the utility function:

$$u^{\#} = -q^2 v_E + 2q(v_E + 10^{-6})u.$$

If the $10^{-6}$ term wasn't there, this would motivate the AI to output the $q$ that was equal to the expectation $E(u \mid E)$.
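To see why, take expectations of $u^{\#}$ without the $10^{-6}$ term, using $E(v_E u) = P(E)\,E(u \mid E)$:

```latex
E(u^{\#}) = -q^2\,E(v_E) + 2q\,E(v_E u)
          = P(E)\bigl(-q^2 + 2q\,E(u \mid E)\bigr),
```

a downward parabola in $q$, maximised exactly at $q = E(u \mid E)$.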

Let's differentiate by $q$, giving:

$$\frac{\partial u^{\#}}{\partial q} = -2q v_E + 2 v_E u + 2 \times 10^{-6} u.$$

The first two terms are zero, in expectation, when $q = E(u \mid E)$. Around that point, the magnitude of this derivative is determined by the $q$ term, which is multiplied by $2 v_E$ -- in expectation, $2 \times 10^{-3}$. By the assumption on the bounds of $u$, the term $2 \times 10^{-6} u$ has magnitude at most $2 \times 10^{-6}$. Therefore, the $q$ which maximises $u^{\#}$ must be within $10^{-3}$ of $E(u \mid E)$. If this is a reasonable tolerance for the estimate, then $u^{\#}$ will motivate the AI to output an estimate close to $E(u \mid E)$.
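As a quick sanity check of that tolerance claim (a toy numerical sketch: the values $E(u \mid E) = 0.7$ and $E(u) = 0.6$ are illustrative assumptions, not part of the setup), we can maximise the expected value of $u^{\#}$ over $q$ directly:

```python
# Toy check: the q maximising E(u#) lies within 10^-3 of E(u|E).
# Illustrative assumptions: P(E) = 10^-3, E(u|E) = 0.7, E(u) = 0.6.

P_E = 1e-3        # probability of erasure, P(E)
Eu_given_E = 0.7  # assumed E(u | E)
Eu = 0.6          # assumed overall E(u)

def expected_u_sharp(q):
    # E(u#) = -q^2 E(v_E) + 2q (E(v_E u) + 1e-6 E(u)),
    # with E(v_E u) = P(E) * E(u|E).
    return -q**2 * P_E + 2 * q * (P_E * Eu_given_E + 1e-6 * Eu)

# Grid search over q in [0, 1].
qs = [i / 100000 for i in range(100001)]
q_star = max(qs, key=expected_u_sharp)

print(q_star)                    # 0.7006 = E(u|E) + 1e-3 * E(u)
print(abs(q_star - Eu_given_E))  # within the claimed 10^-3 tolerance
```

The analytic optimum is $q^{*} = E(u \mid E) + 10^{-3} E(u)$, here $0.7006$: close to $E(u \mid E)$, but shifted by the small $10^{-6}$ term, as claimed.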

Then the AI will end up maximising $E(u^{\#}) \approx P(E)\,(E(u \mid E))^2$, and since $P(E)$ is a constant, this means it will maximise $E(u \mid E)$. Thus the agent outputs an almost correct $q$ and also maximises the expectation of $u$.

Why did I include the $10^{-6} u$ term, then, if all it does is mess up the output a little? That's because $\neg E$ will happen with $99.9\%$ probability, and, after that, we don't want the AI left with a utility function that ignores the possibility of $E$. One-in-a-thousand contrary odds (a one-in-a-million weight, against a thousand-to-one prior) seem to be odds that could be overcome by careful observation after $\neg E$ happens, so the AI doesn't have to end up acting as if it believed impossible things. If it turns out that the AI can't overcome that deficit, then even better: that means the AI can't easily distinguish $E$ from $\neg E$, so can't be sure whether we know $q$ or not.

Note the tension here: reducing the $10^{-6}$ factor makes the $q$ estimate more accurate, but runs the risk of making the AI act as if it believed very odd things after $M$ is sent.

Note that before choosing $q$, the AI is maximising an approximation of $E(u \mid E)^2$, while after choosing $q$, it is maximising some multiple of $E(u v_E) + 10^{-6} E(u)$.

## Multiple estimates

The method doesn't generalise obviously to having the AI report multiple estimates of its expected utility (say a thousand or more), without making some of the numbers so extreme that the AI could start to entertain 'miraculous' hypotheses. Ideally, we'd want to use proper corrigibility to reset the AI after each estimate.

## Robust, but imperfect and unsafe

The method is robust, in that it doesn't matter if we've mis-estimated $P(E)$ a bit (since it outputs $E(u \mid E)$, not $E(u v_E)$). Even if the setup is uncertain -- say the erasure $E$ fails and lets the message be read once every thousand times -- this just results in the error in $q$ being a bit larger, rather than the method collapsing.
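The robustness can be illustrated numerically (again a toy sketch, with the assumed values $E(u \mid E) = 0.7$ and $E(u) = 0.6$ as before): the optimal $q$ is $E(u \mid E) + (10^{-6}/P(E))\,E(u)$, so mis-estimating $P(E)$ only rescales a small bias term rather than breaking the scheme:

```python
# Toy robustness check: the optimal q is E(u|E) + (1e-6 / P(E)) * E(u),
# so getting P(E) wrong only changes the size of a small bias term.
# Illustrative assumptions: E(u|E) = 0.7, E(u) = 0.6.

Eu_given_E = 0.7
Eu = 0.6

def optimal_q(p_e):
    # Maximiser of E(u#) = -q^2 p_e + 2q (p_e * E(u|E) + 1e-6 * E(u)).
    return Eu_given_E + (1e-6 / p_e) * Eu

for p_e in [1e-2, 1e-3, 1e-4]:
    bias = abs(optimal_q(p_e) - Eu_given_E)
    print(f"P(E) = {p_e:g}: optimal q = {optimal_q(p_e):.5f}, bias = {bias:.1e}")
```

Even when the true $P(E)$ is an order of magnitude off the designed $10^{-3}$, the bias in $q$ degrades gracefully rather than collapsing.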

The method is imperfect, as the AI would not behave exactly as an $E(u \mid E)$-maximiser -- and not just because of the error in $q$. Specifically, the AI is motivated to discover much more about the expected value of $u$ than it would otherwise. Suppose, for instance, that pressing the red button would set $u=0$ while pressing the blue button would set $u$ to either $0.5$ or $1$. A pure $u$-maximiser would just press the blue button; but a $u^{\#}$-maximiser would also want to figure out what happened with $u$ after pressing the blue button.
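The button example can be made quantitative (a toy sketch; the fifty-fifty split between $u=0.5$ and $u=1$ is my assumption, and the tiny $10^{-6}$ term is dropped for simplicity). Learning $u$'s realised value is worthless to a pure $u$-maximiser, but lets a $u^{\#}$-maximiser set $q$ to the conditional expectation in each branch, raising $E(u^{\#})$:

```python
# Toy value-of-information check for the blue-button example.
# Assumptions: blue button yields u = 0.5 or u = 1, each with probability 1/2;
# P(E) = 1e-3; the 1e-6 term in u# is dropped for simplicity.

P_E = 1e-3
outcomes = [0.5, 1.0]  # equally likely values of u after pressing blue

def expected_u_sharp(q, u):
    # E(u#) for a known value of u: -q^2 P(E) + 2q P(E) u.
    return -q**2 * P_E + 2 * q * P_E * u

# Ignorant agent: must commit to one q, the best being q = E(u|E) = 0.75.
# (The expression is linear in u, so averaging over u first is valid.)
Eu = sum(outcomes) / len(outcomes)
ignorant = expected_u_sharp(Eu, Eu)

# Informed agent: learns u first, then sets q = u in each branch.
informed = sum(expected_u_sharp(u, u) for u in outcomes) / len(outcomes)

print(ignorant, informed)  # the informed agent does strictly better
```

Here the informed agent's expected $u^{\#}$ exceeds the ignorant agent's, so the $u^{\#}$-maximiser has a strict incentive to investigate $u$ even though a $u$-maximiser does not.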

The method isn't safe -- and not only because the AI is motivated to get extra information about $u$ (and is unboxed). If it could prevent us from reading $q$ in the $\neg E$ world, it would do so, as long as the cost to $E(u \mid E)$ was trivial (which doesn't mean the cost to us need be trivial).
