## AI ALIGNMENT FORUMAF

This generalizes nicely. The asteroid problem provides a nice partitioning into two pieces, such that either piece alone has no effect, but the two pieces together have an effect. But most problems won't have such a partition built in.

If we want the answer to a yes/no question, the first instinct would be that no such partitioning is possible: if two AIs each provide less than 1 bit of information, then combining them won't produce a reliable answer. But we can make it work by combining the yes/no question with some other problem, as follows.

Suppose you want the answer to a question Q, which is a yes-or-no question. Then pick a hard problem H, which is an inconsequential yes-or-no question that AIs can solve reliably, but which humans can't, and for which P(H)=0.5. Take two AIs X and Y. The first AI outputs X=xor(Q,H), and believes that the second AI will output a coin flip. The second AI outputs Y=H, and believes that the first AI will output a coin flip. Then the answer can be obtained by combining the two outputs, xor(X,Y).

# 0

Part of the problem with a reduced impact AI is that it will, by definition, only have a reduced impact.

Some of the designs try and get around the problem by allowing a special "output channel" on which impact can be large. But that feels like cheating. Here is a design that accomplishes the same without using that kind of hack.

Imagine there is an asteroid that will hit the Earth, and we have a laser that could destroy it. But we need to aim the laser properly, so need coordinates. There is a reduced impact AI that is motivated to give the coordinates correctly, but also motivated to have reduced impact - and saving the planet from an asteroid with certainty is not reduced impact.

Now imagine that instead there are two AIs, X and Y. By abuse of notation, let ¬X refer to the even that the output signal from X is scrambled away from the the original output.

Then we ask X to give us the x-coordinates for the laser, under the assumption of ¬Y (that AI Y's signal will be scrambled). Similarly, we Y to give us the y-coordinates of the laser, under the assumption ¬X.

Then X will reason "since ¬Y, the laser will certainly miss its target, as the y-coordinates will be wrong. Therefore it is reduced impact to output the correct x-coordinates, so I shall." Similarly, Y will output the right y-coordinates, the laser will fire and destroy the asteroid, having a huge impact, hooray!

The approach is not fully general yet, because we can have "subagent problems". X could create an agent that behave nicely given ¬Y (the assumption it was given), but completely crazily given Y (the reality). But it shows how we could get high impact from slight tweaks to reduced impact.