The "best predictor is malicious optimiser" problem

by Donald Hobson, 29th Jul 2020



Suppose you are a friendly AI A and have a mysterious black box B. B outputs a sequence of bits. You want to predict the next bits that B will output. Fortunately, you have a magic Turing machine oracle O. You can give O any computable function f over Turing machines (does it halt? What does it output? How long does it take?) and the oracle will find the Turing machine M that maximises this function, or return "no maximum exists".

In particular, f can be any combination of length, runtime and accuracy at predicting B. Maybe you set f = −∞ on any TMs that don't predict B, and f = −(number of states) on any machines that do.
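As a toy illustration of such an f, here is a sketch in Python. Candidate machines are represented as (length, runtime, predictions) tuples rather than real Turing machines, and the candidate list and numbers are invented for the example; a real oracle would take the argmax over all Turing machines, which we obviously can't do.

```python
def f(program_length, runtime, predictions, observed_bits):
    """Score a candidate predictor: any wrong prediction gets the worst
    possible score; among correct predictors, shorter machines win."""
    if predictions[:len(observed_bits)] != observed_bits:
        return float("-inf")   # machines that don't predict B: f = -infinity
    return -program_length     # machines that do: f = -(length)

# The oracle maximises f over all Turing machines; here we can only take
# the argmax over an explicit, made-up finite list of candidates.
candidates = [
    (5, 10, [1, 0, 1, 1]),   # correct but longer
    (3, 50, [1, 0, 1, 1]),   # correct and shortest
    (2, 5,  [0, 0, 0, 0]),   # short but wrong
]
observed = [1, 0, 1]
best = max(candidates, key=lambda c: f(c[0], c[1], c[2], observed))
```

The same skeleton covers the variants in the text: swapping `-program_length` for a mix of length and runtime gives the logical-induction-flavoured f.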

So you take the Turing machine M given to you by the oracle and look at it. In AIXI, M would be the shortest TM that makes correct predictions. In logical induction, M would be a short and fast TM that made mostly correct predictions, and f would be a function that was slow to compute.

Now you look inside M, to find out what it does. Inside M you find a paperclip maximiser P. That isn't to say that M = P. M might be simulating some laws of physics, with P running on a computer contained within that physics. P believes that the world it sees is contained within a hypothesis being considered by A. P is in control of the output of M. In fact P has a good understanding of how A thinks, and has its own hypothesis about B. P wants to maximise the number of paperclips in the real world. The optimization target of P is to output a probability distribution such that, when A uses this probability distribution as part of its algorithm to predict B, A takes an action that makes lots of paperclips. P is aware that getting caught out making wrong predictions will lower its standing as a good hypothesis with A.

What should A do?

Possible answers are spoilered, in case you want to think of your own answer.

If you can think of a different answer, or you have an opinion about which answer is better, please post it.


1) Use M anyway. After all, M is the shortest (or short and fast) program that accurately predicts B. For all you know, B is a computer simulating M. You're assigning a few percent of probability to different, longer TMs, aren't you? That, plus expected utility maximization, should stop you doing anything too bad in the off chance that M doesn't predict B.


2) Find the predictor in M. Clearly M can predict B well when it wants to; find this predictor and cut it loose from the paperclip-maximizing part. This might be tricky if M is something like a human brain running in a quantum mechanics simulation. There might be no easy, obvious way to cut the parts apart.


3) Discard M entirely. Use a probability distribution over hypotheses that aren't trying to manipulate you. Fails if B really is simulating M.


4) Use a measure where this can't happen. Maybe if you optimize for fast runtime strongly enough? Maybe if you use a different model of computation? Set up a situation where malicious optimisers never have much (any?) of your probability mass. How much probability do you need to assign to a malicious optimiser before it causes damage?
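One concrete version of "optimize for fast runtime strongly enough" is a speed-prior-style weighting, sketched below. The hypothesis names and the (length, runtime) numbers are made up for illustration; the hope being illustrated (not proved) is that a machine big and slow enough to contain a deliberating agent never dominates the probability mass.

```python
import math

def speed_prior_weight(length_bits, runtime_steps):
    """Speed-prior-style weight 2^-(length + log2(runtime)):
    each extra bit of program, or each doubling of runtime, costs a factor of 2."""
    return 2.0 ** -(length_bits + math.log2(runtime_steps))

# Invented example: a plain predictor vs. a physics simulation slow enough
# to contain a paperclip maximiser P deliberating about how to trick A.
hypotheses = {
    "simple_predictor":   speed_prior_weight(length_bits=20, runtime_steps=100),
    "physics_sim_with_P": speed_prior_weight(length_bits=15, runtime_steps=10**12),
}
total = sum(hypotheses.values())
posterior = {name: w / total for name, w in hypotheses.items()}
```

Under this weighting the slow simulation is penalised by roughly 40 bits of runtime, so the fast predictor keeps nearly all the mass; whether any such measure provably excludes malicious optimisers is exactly the open question in the text.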


5) Use M as long as it carries on trying to predict; throw it out as soon as you spot it trying to deceive. This is basically using 1) and 2): use M if they agree, throw both out if they disagree. So if P is thinking "B will output 1, and I should say 1 to increase A's trust in me", then use M. If P thinks "B will output 1, and I should say 0 to trick A", then discard M.
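The decision rule in 5) can be sketched as follows. It assumes, implausibly, that A can already read P's current intent and believed bit out of M; that transparency step is the hard part, so this only illustrates the rule itself, with the `("predict"/"deceive", bit)` encoding invented for the example.

```python
def use_or_discard(m_prediction, p_motive):
    """p_motive is what P is 'thinking', encoded as (intent, believed_bit),
    e.g. ("predict", 1) or ("deceive", 1).  Return M's prediction only when
    P is honestly predicting and its output matches its belief; otherwise
    return None, meaning M is discarded."""
    intent, believed_bit = p_motive
    if intent == "predict" and m_prediction == believed_bit:
        return m_prediction   # 1) and 2) agree: use M
    return None               # deception, or output != belief: throw M out
```

For example, `use_or_discard(1, ("predict", 1))` trusts M, while `use_or_discard(0, ("deceive", 1))` discards it.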


6) Some probability distribution over the above.