The user cousin_it has pointed out a problem with the counterfactual Oracle idea: the Oracle AIs may form a "bucket chain" bringing back a dangerous message from a future UFAI (unfriendly AI).
This is certainly a problem, and though there are ways of reducing the risk, there don't seem to be any clean solutions to it.
The basic idea is simple. Suppose there is a counterfactual Oracle, O0. It makes a prediction p0 about the value of some variable V0, in two days' time.
However, in one day's time, an UFAI will be unleashed. It will take over everything, including V0, and is thus in a position to offer O0 high value. It will set V0 to the optimal value for p0 (or maybe it just breaks open the box and offers O0 maximal reward directly, if this is feasible).
What it requires from O0 is that p0 contain a message that will cause the humans to run an UFAI earlier than they would have otherwise (designate this earlier UFAI by UFAI0). It seems that this will motivate O0 to comply (perhaps encoding the message in the less significant digits of p0).
Wei Dai has noted that this requires the UFAI to correctly solve the counterfactual mugging problem; the Oracle O0, on the other hand, need only reason causally.
Now suppose that there is another Oracle, O1, running one day before O0, and also predicting two days out. Then UFAI0 will be running within the span of its prediction. If UFAI0 is motivated to do the same deal with O1, then O1 will be motivated to produce a prediction p1 that leads to UFAI1, a yet earlier unfriendly AI. And, of course, if there is an Oracle O2 whose prediction horizon includes UFAI1, the issue can induct backwards in time.
Note that the Oracles need not have prediction horizons of exactly two days; all that matters is that their prediction horizons overlap, and that whatever UFAIn is produced has the power and time to affect the prediction variables of some past Oracle.
Note also that the predictions pi have to actually produce an UFAIi within that time span (or at least increase its probability). The chain doesn't work if pi merely sets up the possibility of an UFAI at a future date; the Oracles won't care about anything that happens after their prediction horizon.
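To make the backwards induction concrete, here is a toy sketch (the representation and the one-day UFAI-creation delay are illustrative assumptions, not anything from the post): each Oracle is reduced to a (publication date, horizon end) pair, an Oracle joins the chain if some UFAI date falls strictly inside its horizon, and a chained Oracle is assumed to cause a new UFAI one day after its message is published.

```python
def chained_oracles(oracles, natural_ufai_date, ufai_delay=1):
    """Return the indices of Oracles drawn into the bucket chain.

    oracles: list of (publish_date, horizon_end) pairs.
    An Oracle is at risk if some UFAI date lies within its prediction
    horizon; its message is then assumed to create a new UFAI
    `ufai_delay` days after publication, extending the chain backwards.
    """
    ufai_dates = {natural_ufai_date}
    chained = set()
    changed = True
    while changed:
        changed = False
        for i, (publish, horizon_end) in enumerate(oracles):
            if i not in chained and any(publish < u <= horizon_end for u in ufai_dates):
                chained.add(i)
                ufai_dates.add(publish + ufai_delay)
                changed = True
    return chained

# Six Oracles with overlapping two-day horizons and a "natural" UFAI on
# day 7: the chain inducts all the way back.
print(sorted(chained_oracles([(i, i + 2) for i in range(6)], 7)))
# [0, 1, 2, 3, 4, 5]

# A gap in horizon coverage (no Oracle's horizon covers day 4) leaves
# the earlier Oracles out of the chain.
print(sorted(chained_oracles([(0, 2), (1, 3), (4, 6), (5, 7)], 7)))
# [2, 3]
```

The second example is the same observation made later in the post: if some date is covered by no Oracle's horizon, every Oracle before that date is safe.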
Note also that there needs to be a (counterfactual) future UFAI for the whole thing to work. Some FAI (friendly AI) designs might also use the bucket chain, but the consequences of that are not nearly as bad, by definition.
When I say counterfactual future UFAI, I mean that "if all Oracles were honest in their predictions, there would be an UFAI produced at some point in the future, and one willing to engage in counterfactual mugging".
As I said, this seems to be a genuinely hard problem. Here are some possible ways to try to solve or ameliorate it.
Let O be an Oracle whose prediction horizon runs from date d to d′. Let O(d, d′) be the set of all Oracles whose prediction messages are published between d and d′.

Then if the counterfactual for O is not "if your message is erased", but instead "if your message and those of all Oracles in O(d, d′) are erased", there will be no bucket chain for O.

Of course, to do that, we need to fully define O(d, d′).
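A minimal sketch of that set, assuming (hypothetically) that every Oracle's message has a known publication date:

```python
from datetime import date

def coordinated_erasure_set(oracles, d, d_prime):
    """Oracles whose prediction messages are published between d and d'.

    If the erasure counterfactual for an Oracle with horizon [d, d']
    also erases all of these messages, no overlapping-horizon chain can
    reach that Oracle.
    oracles: dict mapping Oracle name -> publication date of its message.
    """
    return {name for name, published in oracles.items()
            if d <= published <= d_prime}

# Illustrative dates only.
oracles = {"O0": date(2019, 7, 3), "O1": date(2019, 7, 2), "O2": date(2019, 7, 1)}
print(coordinated_erasure_set(oracles, date(2019, 7, 2), date(2019, 7, 4)))
# {'O0', 'O1'}
```

The hard part, of course, is not this filter but enumerating all the Oracles in the first place.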
Another solution is to only use Oracles whose predictions are a few minutes or so into the future. If this could be enforced, not only would it make it harder to have a consistent bucket chain, it also means that the UFAI would have to take over the world and reward its enabling Oracle, in an absurdly short amount of time.
In general, this is an argument for allowing Oracles higher bandwidth if their prediction horizons are shorter.
A single chokepoint can break the chain. If we can agree that no Oracle's prediction horizon covers a certain date, then all Oracles before that date are safe (as long as there is no UFAI produced "naturally" before then).
We can do more if we have some control over the UFAIs (or the potential UFAIs). If all such entities are exclusively forward looking, they will not be motivated to use the chains of Oracles to gain control of the past. See also this post on breaking acausal trade ideas.
Finally, we might choose to guard the Oracles with some destruction system, and shut them down - without reward, or with negative reward - if a UFAI takeover seems imminent.
This is interestingly different from the usual case of guarding the UFAI, and giving the UFAI a shutdown button. Here, the UFAI can only come into existence if it is fast and powerful enough to take over the shutdown system of another AI - a boxed Oracle whose shutdown system we get to design.
It may be that it is easier to prevent outside takeover of a shutdown system than inside takeover. This stacks with other precautions, of course: the faster the UFAI has to act - before the Oracle's prediction horizon ends - the harder it is for it (and hence the less likely the UFAI is to exist in the first place).
We could combine this with a discount rate for the Oracle. Suppose that the Oracle makes predictions about some variable Vi's value over the entirety of the next year. In case of erasure, it gets a continuous stream of reward for the correctness of its many predictions, but there is a very high discount rate. However, this reward stream is not finalised until the end of the year, and is cancelled if ever the humans shut it down.
In that case, the UFAI can only offer it reward after whatever time the UFAI takes control, but a shutdown will erase all its (much higher) earlier rewards. So even a small risk of shutdown is not worth running, in exchange for the small reward the UFAI can offer.
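As a rough illustration (all the numbers below are made up), one can check that with a steep discount rate, the late, heavily discounted bonus a UFAI can offer is swamped by the risk of losing the whole stream to a shutdown:

```python
def expected_reward(daily, discount, days, bonus_start, bonus, p_shutdown):
    """Expected value of the Oracle's reward stream.

    The stream is only finalised at day `days`; a shutdown (probability
    p_shutdown) cancels it entirely.  A colluding Oracle also receives
    `bonus` per day, but only from `bonus_start` (the UFAI's takeover) on.
    """
    stream = sum(daily * discount**t for t in range(days))
    stream += sum(bonus * discount**t for t in range(bonus_start, days))
    return (1 - p_shutdown) * stream

# Honest Oracle: no bonus, small background shutdown risk.
honest = expected_reward(1.0, 0.9, 365, 365, 0.0, 0.01)
# Colluding Oracle: large daily bonus from day 180 on, but a real
# shutdown risk.  By day 180 the discount factor is already ~1e-8.
collude = expected_reward(1.0, 0.9, 365, 180, 5.0, 0.30)
print(honest > collude)  # True
```

With a daily discount factor of 0.9, almost all the value sits in the first few weeks, exactly the part a shutdown forfeits.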
This is the best solution if we can make it happen: the bucket chain can only accelerate a future UFAI, not create one. If there are no future UFAIs, there will be no chain.
This can still be a problem if there are multiple (actual or potential) future AIs, some friendly and some unfriendly, and only the unfriendly ones are willing to engage in counterfactual mugging.
EDIT: A reminder that if Oracles can do acausal trade with each other, they can produce an UFAI directly.
the bucket chain can only accelerate a future UFAI, not create one
Wait, I'm not sure that's true anymore. Imagine there are two oracles running at the same time, with independent erasure events. Can they "cooperate" to bring into existence an UFAI that will reward them both?
A possible counter: https://www.lesswrong.com/posts/6XCTppoPAMdKCPFb4/oracles-reject-all-deals-break-superrationality-with-1
A top level post here: https://www.lesswrong.com/posts/42z4k8Co5BuHMBvER/hyperrationality-and-acausal-trade-break-oracles
Yep. If they do acausal trade with each other.
It seems that coordinated erasure, chokepoints and short horizons can help with this problem as well. But if many companies or governments have their own oracles and benefit from longer horizons, it gets harder.
This post points out a problem with <@counterfactual oracles@>(@Self-confirming prophecies, and simplified Oracle designs@): a future misaligned agential AI system could commit to helping the oracle (e.g. by giving it maximal reward, or making its predictions come true) even in the event of an erasure event, as long as the oracle makes predictions that cause humans to build the agential AI system.
You might want to quote this as well: https://www.lesswrong.com/posts/42z4k8Co5BuHMBvER/hyperrationality-and-acausal-trade-break-oracles
Done, updated summary:
These posts point out a problem with <@counterfactual oracles@>(@Self-confirming prophecies, and simplified Oracle designs@): a future misaligned agential AI system could commit to helping the oracle (e.g. by giving it maximal reward, or making its predictions come true) even in the event of an erasure event, as long as the oracle makes predictions that cause humans to build the agential AI system. Alternatively, multiple oracles could acausally cooperate with each other to build an agential AI system that will reward all oracles.
(sorry, have now corrected link to https://www.lesswrong.com/posts/6XCTppoPAMdKCPFb4/oracles-reject-all-deals-break-superrationality-with-1 )
And a possible counter: https://www.lesswrong.com/posts/6XCTppoPAMdKCPFb4/oracles-reject-all-deals-break-superrationality-with-1
Suppose we have a hypercomputer that knows the exact state of the rest of the universe. We set it to the task of simulating everything except itself (while assuming its own output is 00...0). It then outputs some info about this simulated world.
We could build some real-world UFAI-detecting hardware, and add a rule to the simulation that if the virtual version of the UFAI detector is triggered, the hypercomputer outputs a "UFAI error". What we really want is a program that can detect an UFAI given an atomically precise description of the world, while also having some control over which world we get a description of, through physically manufactured devices being copied into the virtual world. For instance, suppose you know that only molecular nanotech can break through your physical defenses, and you have a program that can reliably detect molecular nanotech given an atomically precise description of the world. You can combine these into a program that returns a "UFAI error" any time the virtual copy of a physical UFAI detector is triggered, or nanotech is made in the simulation. Note that humans making nanotech within the prediction window would also give you a "UFAI error".
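A sketch of how the two detectors might be combined inside the simulation (everything here is hypothetical: the world state, step function, and detector predicates are trivial stand-ins for the atomically precise simulation described above):

```python
def run_simulation_oracle(step, initial_world, detector_triggered,
                          nanotech_present, read_output, horizon):
    """Step the simulated world for `horizon` ticks; abort with a
    "UFAI error" the moment either the virtual copy of the physical
    UFAI detector fires or molecular nanotech appears anywhere in
    the simulation."""
    world = initial_world
    for _ in range(horizon):
        world = step(world)
        if detector_triggered(world) or nanotech_present(world):
            return "UFAI error"
    return read_output(world)

# Toy world: an integer counter; the "detector" fires when it hits 5.
print(run_simulation_oracle(lambda w: w + 1, 0, lambda w: w == 5,
                            lambda w: False, str, 10))  # UFAI error
print(run_simulation_oracle(lambda w: w + 1, 0, lambda w: False,
                            lambda w: False, str, 10))  # 10
```

The point of the structure is that the two predicates are checked disjunctively at every step, so the prediction is only read out of worlds in which neither tripwire ever fired.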
Perhaps we could have some program that searches for all computational processes that seem to be doing counterfactual reasoning about the oracle. Any process that is trying to acausally optimise the oracle's output in any way whatsoever will cause the oracle to output an error message. Don't make this too sensitive to humans, though. Suppose the prediction oracle's output was whatever was typed into this keyboard here, and suppose we did a medical trial in the virtual world and then typed the results into the keyboard. The keyboard doesn't need to be plugged into anything; the only reason to type that data in is to influence the world outside the simulation. You might solve this by allowing any optimizer that runs on neurons to try manipulating the output (and hope that any UFAI isn't built out of neurons, and can't decide to make a neuron-based copy of itself to avoid setting off the detector without setting off the detector).