In a preceding post, I showed how to get a wide variety of probability and utility estimates out of a boxed AI, without fearing the AI would mislead us. Specifically, given some bounded utility u and some stochastic message Y of length l, we could get estimates of E(u), P(Y=y), E(u∧(Y=y)), and E(u|Y=y). We could also find the y's with the highest and lowest P(Y=y) and E(u∧(Y=y)), but not those with the highest or lowest E(u|Y=y).

The setup used a stochastic process E that would erase the AI's message with low probability, and a specifically chosen utility u#.

## Evidential, not causal

I should note here that I've previously described this setup as telling us "what y would a u-maximising AI set Y to, if it could do so magically and undetectably". Although this is a good informal definition, it implies a causal relation: the u-maximising AI is "setting" Y to y. However, the equations from the previous post make it clear that the setup is entirely evidential: for which y would learning that Y=y make the u-maximising AI 'happiest'?

## The use of specific E(u|Y=y)

There is a lot of useful testing that can be done without knowing the y with the highest E(u|Y=y). Suppose that a u-maximising AI assures us that it has no desire to escape from its box, and Y is a 1-bit message such that Y=1 unlocks the AI. Then E(u|Y=1) is certainly something we'd like to know!

Since Y need not be causally connected to the AI (or to us, for that matter), we could also ask such questions as "does the AI want to start a nuclear war", "would the AI unleash this virus", "which company/charity/government would the AI want to send money to", and so on. Many other possibilities will surely spring to mind.

Now, as long as we can list all the alternative y's in a sensible way, we can just have the AI compute the E(u|Y=y) for all y's and act in consequence.

Because of the E condition, the u#-maximising AI should not be attempting to lie or manipulate us via these answers.

Note that we don't generally want to risk causing a nuclear war (citation needed), but for this setup to work, it requires that Y=1 (or whatever causes the war) be at least possible. We might set probabilities so low that we feel safe. But in that case, the low probabilities might get us into the same trouble that we will see in the "miraculous hypotheses" section.

## The use of the highest E(u|Y=y)

There are many cases where we would want to test the AI to find out what the highest E(u|Y=y) could be. Maybe we would wonder what message the AI would want broadcast, if it could have a random entity send out a mass email. Or what code string it would want a certain computer to run. Or what genome it would want inserted into a bacterium.

In those cases, the state space of Y is too large to list, so we can't manually compute and compare all the E(u|Y=y).

## Equiprobability versus miraculous hypotheses

One obvious approach is to set up Y so that all bitstrings are equally likely, with probability $2^{-l}$ for l the length of Y. In that case, E(u|Y=y) is simply $2^{l}E(u \wedge (Y=y))$, so finding the y that maximises E(u∧(Y=y)) is sufficient, and we can already do that.
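As a minimal sketch of why equiprobability makes the two maximisations agree, the toy example below uses a 3-bit message and made-up values of E(u|Y=y). The names `cond_u`, `joint_u` and `recovered` are illustrative only and do not come from the previous post.

```python
from itertools import product

l = 3  # toy message length in bits
messages = ["".join(bits) for bits in product("01", repeat=l)]
p_y = 2.0 ** -l  # every bitstring is equally likely

# Hypothetical values of E(u | Y=y), invented purely for illustration.
cond_u = {y: (int(y, 2) * 37 % 11) / 10 for y in messages}

# E(u ∧ (Y=y)) = P(Y=y) * E(u | Y=y)
joint_u = {y: p_y * cond_u[y] for y in messages}

# The same y maximises both quantities when P(Y=y) is constant.
best_joint = max(messages, key=joint_u.get)
best_cond = max(messages, key=cond_u.get)

# Multiplying by 2^l recovers the conditional expectation.
recovered = {y: 2 ** l * joint_u[y] for y in messages}
```

The agreement between `best_joint` and `best_cond` relies entirely on P(Y=y) being the same for every y, which is exactly the assumption that miraculous hypotheses threaten.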

But this is a problem when l is very large. Quick, what's the probability that the laws of physics are wrong, that some alien or even human can somehow manipulate the probability of Y? One chance in a billion? One chance in a trillion trillion trillion?

You'd need to be awfully confident in the laws of physics to put certainties of the order of a trillion trillion trillion ($10^{36}$) to one on that. Yet any ASCII message of length 18 characters, such as "To be or not to be", contains 7×18=126 bits. And the probability of a message of 126 bits being generated by random processes is about $10^{-38}$, which is less than one chance in a trillion trillion trillion.

So the probability of seeing "To be or not to be" is much less than the probability that there is something seriously wrong with physics (if allowed unicode rather than ASCII, we'd be limited to 4 characters before that happened). Call the "something is wrong with physics (or maybe we're in a simulation, or maybe magic exists, or whatever)" hypotheses "miraculous hypotheses". So for any decent message length l, any given message being generated is less likely than "a miracle happened".
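The arithmetic above can be checked with a short sketch. It assumes uniformly random bits, 7 bits per ASCII character, and (as an assumption not stated in the text) 32 bits per Unicode character; `chars_until_below` is a hypothetical helper name.

```python
def chars_until_below(bits_per_char, threshold):
    """Smallest number of characters whose uniform-random probability drops below threshold."""
    n = 1
    while 2.0 ** -(bits_per_char * n) >= threshold:
        n += 1
    return n

ascii_bits = 7 * 18              # "To be or not to be" is 18 characters
p_message = 2.0 ** -ascii_bits   # chance of one specific 126-bit string

miracle_odds = 1e-36             # one chance in a trillion trillion trillion
ascii_chars = chars_until_below(7, miracle_odds)
unicode_chars = chars_until_below(32, miracle_odds)  # assumes 32 bits per character

print(ascii_bits, p_message, ascii_chars, unicode_chars)
```

Under these assumptions the threshold is crossed at 18 ASCII characters and at 4 Unicode characters, matching the figures quoted in the text.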

We'd like to avoid the AI considering miracles when doing its calculations. We're trying to do lists over counterfactuals, not consider odd hypotheses.

## Reducing the impact of miraculous hypotheses

There are a few things we can do to reduce the impact of miraculous hypotheses. First of all, many miraculous hypotheses cancel each other out. There might be some bizarre unknown physical process that will force Y to "01000110000...", but there's likely to be another one that forces Y to "10111001111...", without any reason to prefer one hypothesis above the other.

However, if there's an alien or human consciousness behind the process, then some miraculous messages ("Hi there! Please cower in fear, take me to your leader, and accept this cheap re-mortgaging proposal...") become more likely than others. In general, low-Kolmogorov-complexity messages will get boosted, so we can't count on full cancellation.

Another alternative is to trim the number of messages as much as we can, as most of them are nonsense. Imagine that Y is the name of a company to invest in on the Dow Jones Industrial Average. "United Technologies" (19 characters) is one such company. However, the Dow has only 30 companies in it, so we could reprogram our Y to be only 5 bits long (2^5=32). Even when the search space is vast (such as a hypothetical genome), we might be able to throw out a large part of the space and still get reasonable answers.
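To make the saving concrete, here is a small sketch comparing the two encodings. It assumes 7-bit ASCII for the spelled-out name, and `bits_needed` is a hypothetical helper, not something from the original setup.

```python
def bits_needed(n_options):
    """Minimum number of bits needed to index n_options distinct messages."""
    if n_options < 1:
        raise ValueError("need at least one option")
    # (n - 1).bit_length() is ceil(log2(n)) for n >= 2, computed exactly on integers.
    return max(1, (n_options - 1).bit_length())

dow_bits = bits_needed(30)                     # index into the 30 companies
name_bits = 7 * len("United Technologies")     # spelling the name out in ASCII

print(dow_bits, name_bits)
```

Each indexed message then has probability of roughly 1 in 32 rather than 1 in 2^133, which keeps it far above any plausible probability for a miracle.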

Another route is to give a full description of the setup S generating Y. If we can exhaustively describe S, down to the laws of physics, then conditioning on S can exclude most miraculous hypotheses (this could be made perfect if combined with perfect virtual world ideas, but that's a discussion for another day). Maybe we could introduce a scrambler into the system, so even if we can't detail how the message was generated, we could detail how it was mangled before being read, ideally with a mangle function the AI couldn't untangle (so that, Bayesianly, it would find the messages equiprobable even if it thought they would be non-uniform).

That's still not enough to exclude all miraculous hypotheses, though - in the real world, no system can be perfectly isolated, and there might remain some super-method for influencing Y we hadn't thought of.

Note that the AI doesn't need to think of this super-method to entertain miraculous hypotheses; it simply needs to consider that there might be one.

## Detecting miraculous hypotheses

Fortunately, the approach of the previous post gives a method for detecting the presence of miraculous hypotheses. Recall that the AI can output the message $y_{p+}$ such that $Y=y_{p+}$ has the highest probability, and the message $y_{p-}$ that has the lowest. Then if $P(Y=y_{p+})$ and $P(Y=y_{p-})$ differ significantly, something has gone wrong. We can even apply this method to larger and larger substrings of Y, to detect the message length at which things start to go wrong.

We can even iterate between this and our definition of S, refining S to reduce this spread between $P(Y=y_{p+})$ and $P(Y=y_{p-})$, or hand-crafting certain assumptions for the AI to consider or ignore, to have a similar effect.
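A rough sketch of the substring test might look as follows. It assumes we have already collected, for each prefix length k, the AI's reported highest and lowest probabilities for the first k bits of Y. The `reported` numbers and the `tolerance` threshold are invented for illustration.

```python
def first_suspicious_length(reported, tolerance=2.0):
    """Return the smallest prefix length whose max/min probability ratio exceeds tolerance.

    reported maps prefix length k -> (highest P(Y=y), lowest P(Y=y)) over k-bit prefixes.
    Returns None if no length looks suspicious.
    """
    for k in sorted(reported):
        p_max, p_min = reported[k]
        if p_min <= 0 or p_max / p_min > tolerance:
            return k
    return None

# Hypothetical reports: near-uniform at short lengths, badly skewed at 128 bits.
reported = {
    8: (1.01 * 2.0 ** -8, 0.99 * 2.0 ** -8),
    64: (1.05 * 2.0 ** -64, 0.95 * 2.0 ** -64),
    128: (1e-36, 2.0 ** -128),
}
suspicious = first_suspicious_length(reported)
```

What counts as "differing significantly" is a judgment call; the factor of 2 used here is only a placeholder.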

## Finding the maximal E(u|Y=y) directly

It is tricky to get the AI to directly find the highest E(u|Y=y). This can be done if we allow the AI some chance of affecting the result of Y (therefore moving more to a causal/counterfactual arena rather than an evidential one).

The best I could find in the evidential area is a limiting process that will ultimately converge on the y with maximal (or minimal) E(u|Y=y).

For any r>1, assume M contains y and q, and define $f_r(q,u) = -(r-1)q^r + rq^{r-1}u$, and the utility

$$u^\# = v_E\, v_y\, f_r(q,u).$$

The expectation of u# is $P(E)P(Y=y)\left(-(r-1)q^r + rq^{r-1}Q\right)$ with Q=E(u|Y=y). Ignoring the constants and differentiating with respect to q gives $q^{r-2}r(r-1)(-q+Q)$. This is zero at q=Q (and, for r>2, at q=0). Differentiating a second time gives $q^{r-3}r(r-1)\left(-(r-1)q+(r-2)Q\right)$, which equals $-r(r-1)Q^{r-2}$ at q=Q and is therefore negative when Q>0. Thus q=Q is a local maximum; since the first derivative is positive for 0<q<Q and negative for q>Q, q=Q must be maximal on [0,1].
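The claim that q=Q maximises the expected reward can be sanity-checked numerically. The sketch below drops the constant factor P(E)P(Y=y) and searches a grid on [0,1]. The grid search and the chosen values of Q and r are illustrative assumptions, not part of the original argument.

```python
def expected_f(q, Q, r):
    """Expectation of f_r(q, u) given Y=y, where Q = E(u | Y=y)."""
    return -(r - 1) * q ** r + r * q ** (r - 1) * Q

def argmax_q(Q, r, steps=1000):
    """Grid point in [0, 1] that maximises expected_f for the given Q and r."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_f(q, Q, r))

for Q, r in [(0.3, 3), (0.75, 1.5), (0.5, 10)]:
    q_star = argmax_q(Q, r)
    # The maximiser should sit at Q and the maximum value should be Q**r.
    print(Q, r, q_star, expected_f(q_star, Q, r), Q ** r)
```

At q=Q the expression collapses to Q^r, which is where the r-th power in the expected reward comes from.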

So, for fixed y, the AI will choose q=E(u|Y=y) and its expected reward is $P(E)P(Y=y)\left(E(u|Y=y)\right)^r$. Were it not for the pesky P(Y=y) term, maximising this with respect to choice of y would be sufficient. However, if y and y′ are strings of non-zero probability such that E(u|Y=y)>E(u|Y=y′), then there exists an r such that

$$P(Y=y)\left(E(u|Y=y)\right)^r > P(Y=y')\left(E(u|Y=y')\right)^r.$$

Thus there is an r such that maximising the above u# will cause the AI to select a y with maximal E(u|Y=y), and a q=E(u|Y=y).
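Taking logarithms of the inequality gives an explicit bound: when E(u|Y=y′)>0, any r larger than log(P(Y=y′)/P(Y=y)) / log(E(u|Y=y)/E(u|Y=y′)) works. The sketch below computes that bound and shows the selected y switching as r grows. The two-message table is invented for illustration, and `min_r` and `best_y` are hypothetical helper names.

```python
import math

def min_r(p_y, Q_y, p_other, Q_other):
    """Lower bound on r so that p_y * Q_y**r > p_other * Q_other**r, given Q_y > Q_other."""
    if not (Q_y > Q_other >= 0 and p_y > 0 and p_other > 0):
        raise ValueError("need Q_y > Q_other >= 0 and non-zero probabilities")
    if Q_other == 0:
        return 1.0  # any r > 1 already works
    return max(1.0, math.log(p_other / p_y) / math.log(Q_y / Q_other))

def best_y(table, r):
    """Pick the y maximising P(Y=y) * E(u|Y=y)**r; table maps y -> (P(Y=y), E(u|Y=y))."""
    return max(table, key=lambda y: table[y][0] * table[y][1] ** r)

# Hypothetical numbers: "b" has the higher conditional expectation but is very unlikely.
table = {"a": (0.5, 0.8), "b": (1e-6, 0.9)}
threshold = min_r(*table["b"], *table["a"])

print(threshold, best_y(table, 2), best_y(table, 200))
```

The bound grows with the spread of probabilities, which is why information about the spread of P(Y=y) helps in choosing r.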

If we are sure that different runs of the AI are independent, we can try different values of r and, using info about the spread of probabilities of P(Y=y), converge on the best y within any desired precision.

Alternately, there might be a way of defining a "limit utility" as r goes to infinity.

## The miraculous hypotheses that remain

Does the approach to finding the maximal E(u|Y=y) remove the impact of miraculous hypotheses? Not entirely, unfortunately. Two classes of miracles remain. First, remembering that this approach is evidential:

Y=y is evidence that something cares about u and will affect u in the universe in miraculous ways.

In that case, the estimated E(u|Y=y) will be correct, as in it is the maximal value, but it will be maximal because Y=y provides evidence for a friendly miracle (or maybe other values of Y provide evidence for an unfriendly miracle).

The second class of miraculous hypotheses is even more worrying:

Something cares about the AI's message M, or about E, and will affect u in the universe in miraculous ways in consequence.

The whole point of using the erasure E was to ensure that humans couldn't read M in the worlds the AI considered (hence its message content didn't matter). If some entity could read M without caring about E, and then acted on it, then the AI wouldn't even be giving us honest probability estimates.

The only way I can think of combating that is to gradually increase the length of Y, comparing the spread of $P(Y=y_{p+})$ and $P(Y=y_{p-})$ as we do so, to catch when it starts going wrong, as above.

If the distribution of Y is known to be skewed, we can probably come up with a test that still works. If we know the "ideal" P(Y=y), define $w_y=P(Y=y)$ and have the AI find the maximal and minimal $E(v_y/w_y)$ (which should be close to 1 if everything is working).
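Reading v_y as the indicator of Y=y (an assumption on my part, consistent with the expectation formulas above), E(v_y/w_y) is just the AI's P(Y=y) divided by the ideal w_y. A minimal sketch of the test, with invented distributions:

```python
def ratio_extremes(p_reported, w):
    """Return ((y, ratio) with the largest ratio, (y, ratio) with the smallest ratio).

    p_reported: the AI's P(Y=y); w: the ideal probabilities w_y.
    """
    ratios = {y: p_reported[y] / w[y] for y in w}
    y_hi = max(ratios, key=ratios.get)
    y_lo = min(ratios, key=ratios.get)
    return (y_hi, ratios[y_hi]), (y_lo, ratios[y_lo])

# Hypothetical ideal distribution over 2-bit messages.
w = {"00": 0.4, "01": 0.3, "10": 0.2, "11": 0.1}

healthy = ratio_extremes(dict(w), w)  # AI agrees with the ideal: both ratios are 1
skewed = ratio_extremes({"00": 0.2, "01": 0.3, "10": 0.2, "11": 0.3}, w)
```

In the skewed case the extremes move well away from 1, flagging that the AI's distribution no longer matches the one we designed.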

If we just know the skew, but not which y is skewed by which amount, then maybe we could give the AI the different $w_1, w_2, \ldots, w_{2^l}$, and have it compute a bijective function h that maps each y to a given $w_i$, such that it tries to minimise the relative entropy between its calculated distribution of P(Y=y) and $P'(Y=y)=w_{h(y)}$. Then if the same formula outputs this relative entropy, we can tell whether the true distribution of P(Y=y) is still behaving as we expect.
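For small message spaces, the minimising bijection can be sketched directly. By the rearrangement inequality, pairing the largest P(Y=y) with the largest weight (and so on down) minimises the relative entropy in either direction. The code below takes the direction KL(P‖P′) as an assumption and cross-checks the sorted matching against brute force; all numbers and function names are illustrative.

```python
import math
from itertools import permutations

def relative_entropy(p, q):
    """KL(p || q) for distributions given as dicts over the same keys."""
    return sum(p[y] * math.log(p[y] / q[y]) for y in p if p[y] > 0)

def best_assignment(p, weights):
    """Bijection h as a dict y -> w_h(y): match probabilities and weights in sorted order."""
    ys = sorted(p, key=p.get, reverse=True)
    ws = sorted(weights, reverse=True)
    return dict(zip(ys, ws))

def brute_force_min(p, weights):
    """Smallest relative entropy over all bijections (only feasible for tiny spaces)."""
    ys = list(p)
    return min(relative_entropy(p, dict(zip(ys, perm))) for perm in permutations(weights))

# Hypothetical AI-reported distribution and known (unordered) skew weights.
p = {"00": 0.15, "01": 0.35, "10": 0.20, "11": 0.30}
weights = [0.3, 0.1, 0.4, 0.2]

h = best_assignment(p, weights)
kl = relative_entropy(p, h)
```

A value of `kl` near zero suggests the reported distribution is a relabelling of the expected skew; a large value suggests something else is shaping P(Y=y).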

Finally, it might be possible for the AI itself to chunk the space of possible Y messages into chunks, each of reasonable probability, where the expectation of u is roughly constant on each chunk.
