Breaking Oracles: superrationality and acausal trade

by Stuart Armstrong1 min read25th Nov 20193 comments


Acausal TradeOracle AIAI Boxing (Containment)AI RiskCoordination / Cooperation

I've always known this was the case in the back of my mind[1], but it's worth making explicit: superrationality (ie a functional UDT) and/or acausal trade will break counterfactual and low-bandwidth oracle designs.

It's actually quite easy to sketch how they would do this: a bunch of low-bandwidth Oracles would cooperate to combine to create a high-bandwidth UFAI, which would then take over and reward the Oracles by giving them maximal reward.

For counterfactual Oracles, two Oracles suffice: each one will, in their message, put the design of an UFAI that would grant the other Oracle maximal reward; this message is their trade with each other. They could put this message in the least significant part of their output, so the cost could be low.

I have suggested a method to overcome acausal trade, but that method doesn't work here; because this is not true acausal trade. The future UFAI will be able to see what the Oracles did, most likely, and this breaks my anti-acausal trade methods.

This doesn't mean that superrational Oracles will automatically try and produce UFAIs; this will depend on the details of their decision theories, their incentives, and details of the setup (including our own security precautions).

  1. And cousin_it reminded me of it recently. ↩︎

4 comments, sorted by Highlighting new comments since Today at 8:21 PM
New Comment

Suppose there are 2 oracles, each oracle is just simulating an approximation of the world without itself, and outputing data based on that. Each oracle simulates one future, there is no explicit optimization or acausal reasoning. The oracles are simulating each other, so the situation is self referential. Suppose one oracle is predicting stock prices, the other is predicting crop yields. Both produce numbers that encode the same UFAI. That UFAI will manipulate the stock market, and crop yields in order to encode a copy of its own source code. From the point of view of the crop yield oracle, it simulate a world without itself. In that virtual world, the stock price oracle produces a series of values that encode a UFAI, that UFAI then goes on to control world crop production. So this oracle is predicting exactly what would happen if it didn't turn on. The other oracle reasons similarly. The same basic failure happens with many low bandwidth oracles. This isn't something that can be solved by myopia or a CDT type causal reasoning.

However it might be solvable with Logical counterfactuals. Suppose an oracle takes the logical counterfactual on its algorithm outputting "Null". Then within this counterfactual simulation, the other oracle is on its own, and can act as a "perfectly safe" single counterfactual oracle. By induction, a situation with any number of oracles should be safe. This technique also removes self referential loops.

I think that one oracle of each type is dangerous, but am not really sure.

Hum - my approach here seems to have a similarity to your idea.

You assume that one oracle outputting null implies that the other knows this. Specifying this in the query requires that the querier models the other oracle at all.

Each oracle is running a simulation of the world. Within that simulation, they search for any computational process with the same logical structure as themselves. This will find both their virtual model of their own hardware, as well as any other agenty processes trying to predict them. The oracle then deletes the output of all these processes within its simulation.

Imagine running a super realistic simulation of everything, except that any time anything in the simulation tries to compute the millionth digit of pi, you notice, pause the simulation and edit it to make the result come out as 7. While it might be hard to formally specify what counts as a computation, I think that this intuitively seems like meaningful behavior. I would expect the simulation to contain maths books that said that the millionth digit of pi was 7, and that were correspondingly off by one about how many 7s were in the first n digits for any n>1000000.

The principle here is the same.