"How conservative" should the partial maximisers be?

by Stuart Armstrong, 13th Apr 2020

Due to the problem of building a strong V-enhancer (V being a proxy we can actually specify) when we want a U-enhancer, and the great difficulty in defining U, the utility we truly want to maximise, many people have suggested reducing the V-increasing focus of the AI. The idea is that, as long as the AI doesn't devote too much optimisation power to V, U and V will stay connected with each other, and hence a moderate increase in V will in fact lead to a moderate increase in U.

This has led to interest in such things as satisficers and low-impact AIs, both of which have their problems. These try to put an absolute limit on how much V is optimised: the AI is not supposed to optimise V above a certain level (satisficer), or to optimise it in ways that change too much about the world or about the power of other agents (low-impact).

Another approach is to put a relative limit on how much an AI can push a utility function. For example, quantilizers will choose randomly among the top q proportion of actions/policies, rather than picking the top action/policy. Then there is the approach of using pessimism to make the AI more conservative. This pessimism is defined by a parameter β, with β close to 1 being very pessimistic.
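To make these two relative limits concrete, here is a minimal Python sketch, assuming a finite list of actions and a proxy score `v`. `quantilize` picks uniformly among the top q fraction of actions ranked by V; the original quantilizer samples from a base distribution, and a uniform distribution over a finite list is a simplifying assumption here. `pessimistic_choice` is a hypothetical, simplified stand-in for the pessimism approach, not the construction from the paper: it keeps the most probable world-models until their posterior mass reaches β, then picks the action with the best worst-case value among them. All function and model names are illustrative.

```python
import random
from typing import Callable, List, Sequence, TypeVar

Action = TypeVar("Action")


def quantilize(actions: Sequence[Action], v: Callable[[Action], float],
               q: float, rng: random.Random) -> Action:
    """Choose uniformly at random among the top-q fraction of actions by proxy v.

    Simplification (assumption): the base distribution is uniform over a finite list.
    """
    if not 0 < q <= 1:
        raise ValueError("q must lie in (0, 1]")
    ranked = sorted(actions, key=v, reverse=True)
    # Keep at least one action, so that q -> 0 reduces to picking argmax v.
    k = max(1, int(q * len(ranked)))
    return rng.choice(ranked[:k])


def pessimistic_choice(actions: Sequence[Action],
                       models: Sequence[Callable[[Action], float]],
                       posterior: Sequence[float], beta: float) -> Action:
    """Hypothetical, simplified pessimism: worst case over the most probable models.

    Models are added in order of posterior weight until their total mass reaches
    beta; beta near 0 keeps only the single most probable model (plain maximising),
    beta = 1 keeps every model (maximally pessimistic).
    """
    order = sorted(range(len(models)), key=lambda i: posterior[i], reverse=True)
    considered: List[Callable[[Action], float]] = []
    mass = 0.0
    for i in order:
        considered.append(models[i])
        mass += posterior[i]
        if mass >= beta:
            break
    # Pick the action whose worst-case value over the considered models is highest.
    return max(actions, key=lambda a: min(m(a) for m in considered))


if __name__ == "__main__":
    rng = random.Random(0)
    acts = list(range(10))
    print(quantilize(acts, float, 0.01, rng))  # 9: tiny q behaves like argmax
    print(quantilize(acts, float, 1.0, rng))   # some element of 0..9: q = 1 is uniform

    optimistic = {"act": 10.0, "wait": 0.0}.get    # hypothetical world-model
    pessimistic = {"act": -100.0, "wait": 0.0}.get  # hypothetical world-model
    models, posterior = [optimistic, pessimistic], [0.9, 0.1]
    print(pessimistic_choice(["act", "wait"], models, posterior, 0.5))  # act
    print(pessimistic_choice(["act", "wait"], models, posterior, 1.0))  # wait
```

In both functions the parameter runs the same way: near 0 the agent just maximises, and near 1 it becomes random (q) or fully worst-case (β).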

Intermediate value uncertainty

The behaviours of q and β are pretty clear around the extremes. As q and β tend to 0, the agent will behave like a V-maximiser. As they tend to 1, the agent will behave randomly (q) or totally conservatively (β).

Thus, we expect that moving away from the extremes will improve the true U-performance, and that the conservative end, near 1, will be less disastrous than the V-maximising end, near 0. (We only know that second fact because of implicit assumptions we have about U and V.)

The problem is in the middle, where the behaviour is unknown (and, since we lack a full formulation of U, generically unknowable). There is no principled way of setting q or β. Consider, for example, this plot of U versus q:

[Figure: first plot of U against q]

Here, the ideal q is at an intermediate value, but the critical thing is to keep q above a certain threshold: that's the point at which U falls precipitously.

Contrast now with this one:

[Figure: second plot of U against q]

Here, any value of q above a certain point is essentially the same, and q can be lowered considerably below that point before there are any problems.

So, in the first case we need q above one threshold, and in the second we need it below a different one. Moreover, it might be that the first situation appears in one world and the second in another, and both worlds are currently possible. So there's no consistent good value of q we can set (and in the general case, the curve might be multi-modal, with many peaks). And note that we don't know any of these graphs, since we can't define U fully. So we don't know what value to set q to and have little practical guidance on what to do, yet we expect that some values will be disastrous.
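The "no consistent good value" argument can be illustrated with a toy check in Python. The two U(q) curves below are invented for illustration and are not the plots from this post: in hypothetical world 1, U collapses if q falls below 0.6; in hypothetical world 2, U is only good for q at or below 0.3. Scanning a grid of q values for one that keeps U above a floor in both worlds finds none. The curve shapes, the thresholds, the floor and the helper `acceptable_qs` are all assumptions made for this sketch.

```python
from typing import Callable, List, Sequence


def u_world_1(q: float) -> float:
    """Toy U(q) for hypothetical world 1: collapses when q < 0.6."""
    return 1.0 if q >= 0.6 else -10.0


def u_world_2(q: float) -> float:
    """Toy U(q) for hypothetical world 2: only good when q <= 0.3."""
    return 1.0 if q <= 0.3 else 0.0


def acceptable_qs(curves: Sequence[Callable[[float], float]], floor: float,
                  grid: Sequence[float]) -> List[float]:
    """Return the grid values of q whose U stays at or above `floor` in every world."""
    return [q for q in grid if all(u(q) >= floor for u in curves)]


grid = [i / 100 for i in range(1, 101)]  # q = 0.01, 0.02, ..., 1.00
print(min(acceptable_qs([u_world_1], 0.5, grid)))        # 0.6
print(max(acceptable_qs([u_world_2], 0.5, grid)))        # 0.3
print(acceptable_qs([u_world_1, u_world_2], 0.5, grid))  # []
```

Each world on its own has a safe range of q, but the ranges do not overlap, so no single setting is safe while both worlds remain possible. Real curves would be unknown and possibly multi-modal, which only makes the problem harder.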

The conservatism approach has similar problems: β is even harder to interpret than q, we don't have any guidance on how to set it, and the ideal β may vary considerably depending on the circumstances. For example, what would we want our AI to do when it finds an unexpected red button connected to nuclear weapons?

Well, that depends on whether the button starts a nuclear launch or cancels one.

A future post will explore how to resolve this issue, and how to choose the conservatism parameter in a suitable way.