Predicting perverse donors

There is a rich donor who is willing to donate up to £2,000,000 to your cause. They’ve already written a cheque for £1,000,000, but, before they present it to you, they ask you to predict how much they'll be donating.

The donor is slightly perverse. If you predict any amount £P, they’ll erase their cheque and write £(P-1) instead, one pound less than what you predicted.

Then if you want your prediction to be accurate, there’s only one amount you can predict: £P=£0, and you will indeed get nothing.

Suppose the donor was perverse in a more generous way, and they’d instead write £(P+1), one more than your prediction, up to their maximum. In that case, the only accurate guess is £P=£2,000,000, and you get the whole amount.

If we extend the range above £2,000,000, or below £0 (maybe the donor is also a regulator, who can fine you), then the correct predictions get ever more extreme. It also doesn’t matter if the donor subtracts or adds £1, £100, or one penny (£0.01): the only accurate predictions are at the extreme of the range.
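As a rough check of the two donors above, here is a minimal sketch that searches every whole-pound prediction for self-confirming values. The function names (`stingy_donor`, `generous_donor`, `self_confirming`) are made up for illustration, and restricting predictions to whole pounds is an assumption of the sketch.

```python
LOW, HIGH = 0, 2_000_000  # donation range in pounds, taken from the example


def clip(amount, low=LOW, high=HIGH):
    """Keep an amount inside the range the donor is willing to give."""
    return max(low, min(high, amount))


def stingy_donor(p):
    """Writes one pound less than the prediction, but never less than LOW."""
    return clip(p - 1)


def generous_donor(p):
    """Writes one pound more than the prediction, up to HIGH."""
    return clip(p + 1)


def self_confirming(donor, low=LOW, high=HIGH):
    """All whole-pound predictions p for which the donor gives exactly p."""
    return [p for p in range(low, high + 1) if donor(p) == p]


print(self_confirming(stingy_donor))    # [0]
print(self_confirming(generous_donor))  # [2000000]
```

Note that the cheque for £1,000,000 does not appear anywhere in the code: the donor's default plays no role in which predictions are accurate.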

Greek mythology is full of oracular predictions that only happened because people took steps to avoid them. So there is a big difference between “prediction P is true”, and “prediction P is true even if P is generally known”.

Continuity assumption

A prediction P is self-confirming if, once P is generally known, then P will happen (or P is the expectation of what will then happen). The examples in the previous section have self-confirming predictions, but these don’t always exist. They exist when the outcome is continuous in the prediction P (given a few technical assumptions, like the outcome taking values in a closed interval). If that assumption is violated, then there need not be any self-confirming prediction.
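One way to see why continuity is enough, sketched under the assumption that the outcome is a deterministic function of the prediction (the symbols $f$, $g$, $a$ and $b$ are introduced here for illustration and are not part of the original argument):

```latex
\begin{align*}
  &f : [a,b] \to [a,b] \ \text{continuous}, \qquad g(P) = f(P) - P, \\
  &g(a) = f(a) - a \ge 0, \qquad g(b) = f(b) - b \le 0, \\
  &\Longrightarrow\ \exists\, P^* \in [a,b] \ \text{with}\ g(P^*) = 0,
   \ \text{i.e.}\ f(P^*) = P^*.
\end{align*}
```

The last step is the intermediate value theorem applied to $g$; it is exactly the step that can fail when $f$ has a jump.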

For example, the generous donor could give £(P+1), except if you ask for too much (more than £1,999,999), in which case you get nothing. In that case, there is no correct prediction £P (the same goes for the £(P-1) donor who will give you the maximum if you’re modest enough to ask for less than £1).
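The two discontinuous donors can be checked the same way. This is a sketch only: the names `touchy_generous_donor` and `touchy_stingy_donor` are hypothetical, and the search covers whole-pound predictions (the same argument goes through for fractional amounts).

```python
HIGH = 2_000_000  # the donor's maximum, in pounds


def touchy_generous_donor(p):
    """Gives p + 1, unless you predict more than 1,999,999: then nothing."""
    return 0 if p > 1_999_999 else p + 1


def touchy_stingy_donor(p):
    """Gives p - 1, unless you predict less than 1: then the maximum."""
    return HIGH if p < 1 else p - 1


def self_confirming(donor, low=0, high=HIGH):
    """All whole-pound predictions p for which the donor gives exactly p."""
    return [p for p in range(low, high + 1) if donor(p) == p]


print(self_confirming(touchy_generous_donor))  # []
print(self_confirming(touchy_stingy_donor))    # []
```

In both cases the jump sits exactly where the self-confirming prediction used to be, so the outcome skips over it.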

Prediction feedback loops

But the lack of a self-confirming prediction is not really the big problem. The big problem is that, as you attempt to refine your prediction (maybe you encounter perverse donors regularly), where you end up will not be determined by the background facts of the world (the donor’s default generosity); it will be determined entirely by the feedback loop with your prediction. See here for a similar example in game theory.
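A minimal sketch of that feedback loop, assuming you refine your prediction by simply predicting whatever the donor gave last time. The helper `refine` and the step of £100 are illustrative choices (the step is only there to keep the loop short; as noted earlier, its size does not change the conclusion).

```python
LOW, HIGH = 0, 2_000_000  # donation range in pounds
STEP = 100                # illustrative step; any positive step ends up in the same place


def stingy_donor(p):
    """Gives STEP less than predicted, never below LOW."""
    return max(LOW, p - STEP)


def generous_donor(p):
    """Gives STEP more than predicted, never above HIGH."""
    return min(HIGH, p + STEP)


def refine(donor, first_guess, max_rounds=100_000):
    """Repeatedly replace the prediction with the last observed donation."""
    p = first_guess
    for _ in range(max_rounds):
        outcome = donor(p)
        if outcome == p:  # the prediction has become self-confirming
            return p
        p = outcome
    raise RuntimeError("no self-confirming prediction reached")


# Start from different "default cheques": the end point ignores them.
for default_cheque in (250_000, 1_000_000, 1_750_000):
    print(default_cheque,
          refine(stingy_donor, default_cheque),
          refine(generous_donor, default_cheque))
```

Whatever the starting cheque, the stingy loop ends at £0 and the generous loop at £2,000,000.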

Sloppier predictions are no better

One obvious answer would be to allow sloppier predictions. For example, if we require that the prediction be "within £1 of the true value", then all values between £0 and £2,000,000 are equally valid; averaging those, we get £1,000,000, the same as would have happened without the prediction.

But that's just a coincidence. We could have constructed the example so that only a certain region has "within £1" performance, while all others have "within £2" performance. More damningly, we could have defined "they’ve already written a cheque for £X" for absolutely any X, and it wouldn't have changed anything. So there is no link between the self-confirming prediction and what would have happened without prediction. And making the self-confirming aspect weaker won't improve matters.
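The "within £1" argument can be checked with a short sketch. The helper names are hypothetical, and the second range (up to £3,000,000) is an invented variation, used only to show that the average tracks the range and not the cheque.

```python
def stingy_donor(p, low=0):
    """Gives one pound less than predicted, never below `low`."""
    return max(low, p - 1)


def acceptable(donor, tolerance, low, high):
    """Whole-pound predictions that land within `tolerance` of the outcome."""
    return [p for p in range(low, high + 1) if abs(donor(p) - p) <= tolerance]


def mean(values):
    return sum(values) / len(values)


original = acceptable(stingy_donor, 1, 0, 2_000_000)
wider = acceptable(stingy_donor, 1, 0, 3_000_000)  # hypothetical wider range

print(len(original), mean(original))  # 2000001 1000000.0
print(mean(wider))                    # 1500000.0
```

The average is just the midpoint of the allowed range. The already-written cheque for £X never enters the calculation, so the match with £1,000,000 in the original example says nothing about the donor's default.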

Real-world dangers

How often would scenarios like that happen in the real world? The donor example is convoluted, and feels very implausible; what kind of person is willing to donate around £1,000,000 if no predictions are made, but suddenly changes to £(P±1) if there is a prediction?

Donations normally spring from better thought-out processes, involving multiple actors, for specific purposes (helping the world, increasing a certain subculture or value, PR...). They are not normally so sensitive to predictions. And though there are cases where there are true self-confirming or self-fulfilling predictions (notably in politics), these tend to be areas which are pretty close to a knife-edge anyway, and could have gone in multiple directions, with the prediction giving them a small nudge in one direction.

So, though in theory there is no connection between a self-confirming prediction and what would have happened if the prediction had not been uttered, it seems that in practice they are not too far apart (for example, no donor can donate more money than they have, and they generally have their donation amount pretty fixed).

Though beware predictions like "what's the value of the most undervalued/overvalued stock on this exchange", where knowing the predictions will affect behaviour quite extensively. That is a special case of the next section; the "new approach" the prediction suggests is "buy/sell these stocks".

Predictions causing new approaches

There is one area where it is very plausible for a prediction to cause a huge effect, though, and that's when the prediction suggests the possibility of new approaches. Suppose I'm running a million-dollar company with a hundred thousand dollars in yearly profit, and ask a smart AI to predict my expected profit next year. The AI answers zero.

At that point, I'd be really tempted to give up, and go home (or invest/start a new company in a different area). The AI has foreseen some major problem, making my work useless. So I'd give up, and the company folds, thus confirming the prediction.

Or maybe the AI would predict ten million dollars of profit. What? Ten times more than the current capitalisation of the company? Something strange is going on. So I sift through the company's projects with great care. Most of them are solid and stolid, but one looks like a massive-risk-massive-reward gamble. I cancel all the other projects, and put everything into that, because that is the only scenario where I see ten million dollar profits being possible. And, with the unexpected new financing, the project takes off.

There are some more exotic scenarios, like an AI that predicts £192,116,518,914.20 profit. Separating that as 19;21;16;5;18;9;14;20 and replacing numbers with letters, this is SUPERINT: the AI is advising me to build a superintelligence, which, if I do, will grant me exactly the required profit to make that prediction true in expectation (and after that... well, then bad things might happen). Note that the AI need not be malicious; if it's smart enough and has good enough models, it might realise that £192,116,518,914.20 is self-confirming, without "aiming" to construct a superintelligence.
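The decoding step is easy to verify. This small sketch assumes the usual A=1, B=2, ... Z=26 numbering; the function name `decode` is made up for illustration.

```python
import string


def decode(groups):
    """Map 1 -> A, 2 -> B, ..., 26 -> Z and join the letters."""
    return "".join(string.ascii_uppercase[n - 1] for n in groups)


groups = [19, 21, 16, 5, 18, 9, 14, 20]
amount_digits = "192,116,518,914.20".replace(",", "").replace(".", "")

# The groups, written one after another, are exactly the digits of the amount.
print("".join(str(n) for n in groups) == amount_digits)  # True
print(decode(groups))                                    # SUPERINT
```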

All these examples share the feature that the prediction P causes a great change in behaviour. Our intuition that outcome-with-P and outcome-without-P should be similar is based on the idea that P does not change behaviour much.

Exotic corners

Part of the reason that AIs could be so powerful is that they could unlock new corners of strategy space, doing things that are inconceivable to us, to achieve objectives in ways we didn't think were possible.

A predicting AI is more constrained than that, because it can't act directly. But it can act indirectly, with its prediction causing us to unlock new corners of strategy space.

Would a purely predictive AI do that? Well, it depends on two things:

1. How self-confirming the exotic corners are, compared with more mundane predictions, and
2. Whether the AI could explore these corners sufficiently well to come up with self-confirming predictions in them.

For 1, it's very hard to tell; after all, in the example of this post and in the game-theory example, an arbitrarily tiny misalignment at standard outcomes can push the self-confirming outcome arbitrarily far into the exotic area. I'd be nervous about trusting our intuitions here, because approximations don't help us. And the Quine-like "P causes the production of a superpowered AI that causes P to be true" seems like a perfect and exact exotic self-confirming prediction that works in almost all areas.

What about 2? Well, that's a practical barrier for many designs. If the AI is a simple sequence predictor without a good world-model, it might not be able to realise that there are exotic self-confirming predictions. A predictor that has been giving standard stock market predictions for all of its existence is unlikely to suddenly hit on a highly manipulative prediction.

But I fear scenarios where the AI gradually learns how to manipulate us. After all, even for standard scenarios, we will change our behaviour a bit, based on the prediction. The AI will learn to give the most self-confirming of these standard predictions, and so will gradually build up experience in manipulating us effectively (in particular, I'd expect the "zero profit predicted -> stockholders close the company" to become quite standard). The amount of manipulation may grow slowly, until the AI has a really good understanding of how to deal with the human part of the environment, and the exotic manipulations are just a continuation of what it's already been doing.

AI ALIGNMENT FORUM