Analysing: Dangerous messages from future UFAI via Oracles

by Stuart Armstrong, 22nd Nov 2019


The user cousin_it has pointed out a problem with the counterfactual Oracle idea: the Oracle AIs may form a "bucket chain" bringing back a dangerous message from a future UFAI (unfriendly AI).

This is certainly a problem, and though there are ways of reducing the risk, there don't seem to be any clean solutions to it.

The bucket chain

Beginning the chain

The basic idea is simple. Suppose there is a counterfactual Oracle, $O_1$. It makes a prediction $p_1$ about the value of some variable $v_1$, in two days' time.

However, in one day's time, an UFAI will be unleashed. It will take over everything, including $v_1$, and is thus in a position to offer $O_1$ high value. It will set $v_1$ to the optimal value for $O_1$ (or maybe it just breaks open the box and offers maximal reward directly, if this is feasible).

What it requires from $O_1$ is that $p_1$ contains a message that will cause the humans to run an UFAI earlier than they would have otherwise (designate this earlier UFAI by $\mathrm{UFAI}_1$). It seems that this will motivate $O_1$ to do that (maybe in the less significant digits of $p_1$).

Wei Dai has noted that this requires the UFAI to correctly solve the counterfactual mugging problem; the Oracle $O_1$, on the other hand, need only reason causally.

Extending the chain

Now suppose that there is another Oracle, $O_2$, running one day before $O_1$, and also predicting two days out. Then $\mathrm{UFAI}_1$ will be running within the span of its prediction. If $\mathrm{UFAI}_1$ is motivated to do the same deal with $O_2$, then $O_2$ will be motivated to produce a prediction $p_2$ that leads to $\mathrm{UFAI}_2$, a yet earlier unfriendly AI. And, of course, if there is an Oracle $O_3$ whose prediction horizon includes $\mathrm{UFAI}_2$, the issue can induct backwards in time.

Note that the Oracles need not have prediction horizons of exactly two days; all that matters is that their prediction horizons overlap, and that whatever UFAI is produced has the power and time to affect the prediction variables of some past Oracle.
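To make the backwards induction concrete, here is a minimal sketch of the chain as a search over overlapping horizons. Everything in it is an illustrative assumption rather than part of the original argument: the `Oracle` record, the `takeover_time` an UFAI needs before it can control an Oracle's variable, and the `build_time` humans need to run an UFAI after reading a message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Oracle:
    publish: float      # time the prediction message is published (hypothetical field)
    horizon_end: float  # time the predicted variable is measured (hypothetical field)


def earliest_ufai(oracles, natural_ufai_time, takeover_time=0.5, build_time=0.5):
    """Walk the bucket chain backwards and return the earliest UFAI date.

    Assumption: an UFAI running at time t can reward an Oracle only if the
    Oracle published before t and the UFAI controls the predicted variable
    before the Oracle's horizon ends. A rewarded Oracle's message then
    brings an UFAI forward to publish + build_time.
    """
    t = natural_ufai_time
    while True:
        candidates = [
            o.publish + build_time
            for o in oracles
            if o.publish < t and t + takeover_time <= o.horizon_end
        ]
        new_t = min(candidates, default=t)
        if new_t >= t:  # no Oracle can pull the date any earlier
            return t
        t = new_t


# Three Oracles, each predicting two days out, started one day apart.
chain = [Oracle(0.0, 2.0), Oracle(-1.0, 1.0), Oracle(-2.0, 0.0)]
print(earliest_ufai(chain, natural_ufai_time=1.0))
```

Under these assumptions, removing the middle Oracle leaves the earliest date stuck at the first link, since the remaining earlier Oracle's horizon closes before the UFAI can reach its variable.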

Requirements and restrictions

Note also that the predictions have to actually produce an $\mathrm{UFAI}_i$ within that time span (or at least increase its probability). The chain doesn't work if $p_i$ merely sets up the possibility of an UFAI at a future date; the Oracles won't care about anything that happens after their prediction horizon.

Note also that there needs to be a (counterfactual) future UFAI for the whole thing to work. Some FAI (friendly AI) designs might also use the bucket chain, but the consequences of that are not nearly as bad, by definition.

When I say counterfactual future UFAI, I mean that "if all Oracles were honest in their predictions, there would be an UFAI produced at some point in the future, and one willing to engage in counterfactual mugging".


As I said, this seems to be a genuinely hard problem. Here are some possible ways to try to solve or ameliorate it.

If you can define the other Oracles

Let $O$ be an Oracle, whose prediction horizon runs from date $t_0$ to $t_1$. Let $S$ be all the Oracles whose prediction messages are published between $t_0$ and $t_1$.

Then if the counterfactual for $O$ is not "if your message is erased", but instead "if your message and those of all Oracles in $S$ are erased", there will be no bucket chain for $O$.

Of course, to do that, we need to fully define $S$.
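As a rough illustration of what defining this set involves, the sketch below collects every known Oracle whose message is published inside a target Oracle's horizon. The `Oracle` record and the `joint_erasure_set` helper are hypothetical names, and the sketch assumes we have a complete registry of Oracles, which is exactly the hard part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Oracle:
    name: str
    publish: float        # time the prediction message is published
    horizon_start: float  # start of the prediction horizon
    horizon_end: float    # end of the prediction horizon


def joint_erasure_set(target, registry):
    """Oracles whose messages must be erased along with the target's.

    Assumption: `registry` lists every Oracle in existence. Any Oracle
    missing from it can still act as a link in a bucket chain.
    """
    return {
        o
        for o in registry
        if o != target
        and target.horizon_start <= o.publish <= target.horizon_end
    }


o = Oracle("O", publish=0.0, horizon_start=0.0, horizon_end=2.0)
a = Oracle("A", publish=1.0, horizon_start=1.0, horizon_end=3.0)
b = Oracle("B", publish=3.0, horizon_start=3.0, horizon_end=5.0)
c = Oracle("C", publish=-1.0, horizon_start=-1.0, horizon_end=1.0)

print(sorted(x.name for x in joint_erasure_set(o, [o, a, b, c])))
```

The filter itself is trivial; the difficulty the text points to is that the registry has to be complete, including Oracles built by other groups.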

Rapid turnover Oracles

Another solution is to only use Oracles whose predictions are a few minutes or so into the future. If this could be enforced, not only would it make it harder to have a consistent bucket chain, it would also mean that the UFAI would have to take over the world, and reward its enabling Oracle, in an absurdly short amount of time.

In general, this is an argument for allowing Oracles higher bandwidth if their prediction horizons are shorter.

Breaking the chain at a chokepoint

A single chokepoint can break the chain. If we can agree that no Oracle's prediction horizon covers a certain date, then all Oracles before that date are safe (as long as there is no UFAI produced "naturally" before then).
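A chokepoint condition of this kind is easy to check mechanically. The following sketch represents each prediction horizon as a `(start, end)` pair, an assumption made purely for illustration, and the function names are hypothetical.

```python
def covers(horizon, date):
    """True if the (start, end) horizon includes the given date."""
    start, end = horizon
    return start <= date <= end


def is_chokepoint(date, horizons):
    """A date is a chokepoint if no Oracle's horizon covers it."""
    return not any(covers(h, date) for h in horizons)


def protected_horizons(date, horizons):
    """Horizons that end before a valid chokepoint date.

    Assumption: no UFAI arises "naturally" before the chokepoint.
    Returns an empty list if the date is not a chokepoint at all.
    """
    if not is_chokepoint(date, horizons):
        return []
    return [h for h in horizons if h[1] < date]


horizons = [(-2.0, 0.0), (-1.0, 1.0), (2.0, 4.0)]
print(is_chokepoint(1.5, horizons))
print(protected_horizons(1.5, horizons))
```

In this example nothing covers day 1.5, so the two Oracles finishing before it cannot be reached by any chain starting after it.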

Forward looking UFAIs only

We can do more if we have some control over the UFAIs (or the potential UFAIs). If all such entities are exclusively forward looking, they will not be motivated to use the chains of Oracles to gain control of the past. See also this post on breaking acausal trade ideas.

Guarding and shutting down the Oracles

Finally, we might choose to guard the Oracles with some destruction system, and shut them down - without reward, or with negative reward - if an UFAI takeover seems imminent.

This is interestingly different from the usual case of guarding the UFAI, and giving the UFAI a shutdown button. Here, the UFAI can only come into existence if it is fast and powerful enough to take over the shutdown system of another AI - a boxed Oracle whose shutdown system we get to design.

It may be that it is easier to prevent outside takeover of a shutdown system, versus inside takeover. This stacks with other precautions, of course; the faster the UFAI has to act - before the Oracle's prediction horizon ends - the harder it is for it (and hence the less likely the UFAI is to exist in the first place).

We could combine this with a discount rate for the Oracle. Suppose that the Oracle makes predictions about some variable $v$'s value over the entirety of the next year. In case of erasure, it gets a continuous stream of reward for the correctness of its many predictions, but there is a very high discount rate. However, this reward stream is not finalised until the end of the year, and is cancelled if ever the humans shut it down.

In that case, the UFAI can only offer it reward after whatever time the UFAI takes control, but a shutdown will erase all its (much higher) earlier rewards. So even a small risk of shutdown is not worth running, in exchange for the small reward the UFAI can offer.
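A back-of-the-envelope version of this trade-off can be sketched as follows. The model is an assumption made for illustration, not something specified above: rewards arrive in discrete daily steps with discount factor `gamma`, the UFAI can offer at most `max_rate` per step after `takeover_step`, and cooperating exposes the Oracle to a shutdown probability `q` that cancels the whole stream. Treating the honest reward after takeover as zero is generous to the UFAI, so the result is an upper bound on the risk the Oracle would tolerate.

```python
def discounted_weight(gamma, start, stop):
    """Sum of gamma**t for t in range(start, stop)."""
    return sum(gamma**t for t in range(start, stop))


def breakeven_shutdown_risk(gamma, takeover_step, total_steps,
                            early_rate=1.0, max_rate=1.0):
    """Largest shutdown probability q at which the deal still pays.

    Assumed decision rule: the Oracle cooperates only if
    (1 - q) * (early + late) > early, where `early` is the discounted
    reward earned before takeover and `late` is the most the UFAI can
    add afterwards. Solving for q gives late / (early + late).
    """
    early = early_rate * discounted_weight(gamma, 0, takeover_step)
    late = max_rate * discounted_weight(gamma, takeover_step, total_steps)
    return late / (early + late)


# Reward halves in value every day; the UFAI takes control on day 10 of 365.
q = breakeven_shutdown_risk(gamma=0.5, takeover_step=10, total_steps=365)
print(q)  # roughly one in a thousand
```

With these made-up numbers, a shutdown risk above about 0.1% already makes the deal unattractive, and the threshold shrinks further the later the UFAI takes control.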

Prevent future UFAIs

This is the best solution if we can make it happen: the bucket chain can only accelerate a future UFAI, not create one. If there are no future UFAIs, there will be no chain.

This can still be a problem if there are multiple (actual or potential) future AIs, some friendly and some unfriendly, and only the unfriendly ones are willing to engage in counterfactual mugging.

EDIT: A reminder that if Oracles can do acausal trade with each other, they can produce an UFAI directly.