Breaking Oracles: superrationality and acausal trade

Stuart_Armstrong

Breaking Oracles: superrationality and acausal trade

1 min read25th Nov 20194 comments

13

Acausal TradeOracle AISuperrationalityAI Boxing (Containment)AI RiskCoordination / Cooperation

Frontpage

I've always known this was the case in the back of my mind^[1], but it's worth making explicit: superrationality (ie a functional UDT) and/or acausal trade will break counterfactual and low-bandwidth oracle designs.

It's actually quite easy to sketch how they would do this: a bunch of low-bandwidth Oracles would cooperate to combine to create a high-bandwidth UFAI, which would then take over and reward the Oracles by giving them maximal reward.

For counterfactual Oracles, two Oracles suffice: each one will, in their message, put the design of an UFAI that would grant the other Oracle maximal reward; this message is their trade with each other. They could put this message in the least significant part of their output, so the cost could be low.

I have suggested a method to overcome acausal trade, but that method doesn't work here; because this is not true acausal trade. The future UFAI will be able to see what the Oracles did, most likely, and this breaks my anti-acausal trade methods.

This doesn't mean that superrational Oracles will automatically try and produce UFAIs; this will depend on the details of their decision theories, their incentives, and details of the setup (including our own security precautions).

And cousin_it reminded me of it recently. ↩︎

Mentioned in

47AI Alignment 2018-19 Review

18[AN #75]: Solving Atari and Go with learned game models, and thoughts from a MIRI employee

14Some reasons why a predictor wants to be a consequentialist

10 Oracles: reject all deals - break superrationality, with superrationality

8"Fully" acausal trade

Breaking Oracles: superrationality and acausal trade

New Comment

4 comments, sorted by

top scoring

Click to highlight new comments since: Today at 11:14 PM

[-]Donald Hobson4y10

Suppose there are 2 oracles, each oracle is just simulating an approximation of the world without itself, and outputing data based on that. Each oracle simulates one future, there is no explicit optimization or acausal reasoning. The oracles are simulating each other, so the situation is self referential. Suppose one oracle is predicting stock prices, the other is predicting crop yields. Both produce numbers that encode the same UFAI. That UFAI will manipulate the stock market, and crop yields in order to encode a copy of its own source code. From the point of view of the crop yield oracle, it simulate a world without itself. In that virtual world, the stock price oracle produces a series of values that encode a UFAI, that UFAI then goes on to control world crop production. So this oracle is predicting exactly what would happen if it didn't turn on. The other oracle reasons similarly. The same basic failure happens with many low bandwidth oracles. This isn't something that can be solved by myopia or a CDT type causal reasoning.

However it might be solvable with Logical counterfactuals. Suppose an oracle takes the logical counterfactual on its algorithm outputting "Null". Then within this counterfactual simulation, the other oracle is on its own, and can act as a "perfectly safe" single counterfactual oracle. By induction, a situation with any number of oracles should be safe. This technique also removes self referential loops.

I think that one oracle of each type is dangerous, but am not really sure.

[-]Stuart Armstrong4y10

Hum - my approach here seems to have a similarity to your idea.

[-]Gurkenglas4y10

You assume that one oracle outputting null implies that the other knows this. Specifying this in the query requires that the querier models the other oracle at all.

[-]Donald Hobson4y10

Each oracle is running a simulation of the world. Within that simulation, they search for any computational process with the same logical structure as themselves. This will find both their virtual model of their own hardware, as well as any other agenty processes trying to predict them. The oracle then deletes the output of all these processes within its simulation.

Imagine running a super realistic simulation of everything, except that any time anything in the simulation tries to compute the millionth digit of pi, you notice, pause the simulation and edit it to make the result come out as 7. While it might be hard to formally specify what counts as a computation, I think that this intuitively seems like meaningful behavior. I would expect the simulation to contain maths books that said that the millionth digit of pi was 7, and that were correspondingly off by one about how many 7s were in the first n digits for any n>1000000.

The principle here is the same.

Moderation Log