It seems to me that the main goal of quantilization is to reduce the extreme unintended outcomes of maximizing (by sampling from something like a human-learned distribution over actions) while still remaining competitive (by sampling from only the upper quantile of said distribution).

But that still leaves open the possibility of sampling one of those super highly maximized actions! Why not just hard-cut off the upper portion of the distribution as well? Why not have a quantilizer that samples between the 90%ile and the 99%ile?

Or, while we're at it, why are we sampling at all? Why not just say that we're taking the 99%ile action?
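
To make the three options concrete, here is a minimal Python sketch. It assumes the base distribution is represented by a finite list of sampled actions and that a `utility` function scores them. The names `quantilize` and `percentile_action` are hypothetical, chosen only for this illustration.

```python
import random


def quantilize(actions, utility, lo, hi, rng=random):
    """Sample uniformly from the actions whose utility rank lies in [lo, hi).

    `actions` is assumed to be a list of samples from the base
    (human-like) distribution, so a uniform choice over the slice
    approximates sampling from that distribution restricted to the band.
    """
    ranked = sorted(actions, key=utility)
    n = len(ranked)
    start, stop = round(lo * n), round(hi * n)
    if start >= stop:
        raise ValueError("quantile band contains no sampled actions")
    return rng.choice(ranked[start:stop])


def percentile_action(actions, utility, p):
    """Deterministically return the action at utility rank p (no sampling)."""
    ranked = sorted(actions, key=utility)
    n = len(ranked)
    return ranked[min(round(p * n), n - 1)]


if __name__ == "__main__":
    rng = random.Random(0)
    base = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # toy base samples
    score = lambda a: a  # toy utility: larger is better

    standard = quantilize(base, score, 0.90, 1.00, rng)  # usual top-10% quantilizer
    banded = quantilize(base, score, 0.90, 0.99, rng)    # top 1% cut off as well
    fixed = percentile_action(base, score, 0.99)         # always the 99%ile action
    print(standard, banded, fixed)
```

The standard quantilizer is the special case `hi = 1.0`. The banded version can never return one of the top 1% of sampled actions, and `percentile_action` removes the randomness altogether.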

1 Answer

Yes, this is fine to do, and it prevents single-shot problems, provided you have a particular picture of the distribution over outcomes: most of the disastrous risk comes from edge cases that get a 99.99%ile score but are actually bad, and all we need are actions around the 99th percentile.

This is fine if you want your AI to stack blocks on top of other blocks or something.

But unfortunately when you want to use a quantilizer to do something outside the normal human distribution, like cure cancer or supervise the training of a superhuman AI, you're no longer just shooting for a 99%ile policy. You want the AI to do something unlikely, and so of course, you can't simultaneously restrict it to only do likely things.

Now, you can construct a policy where each individual action is exactly 99%ile on some distribution of human actions. But the results quickly go off-distribution and end up somewhere new: chaining together 100 99%ile actions is not the same as picking a 99%ile policy. This means you don't get the straightforward quantilizer-style guarantee, but maybe it's not all bad. After all, we wanted to go off-distribution to cure cancer, and maybe by taking actions with some kind of moderation we can get some other kind of benefit that we don't quite understand yet. I'm reminded of Peli Grietzer's recent post on virtue ethics.
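
A toy calculation can illustrate the gap between per-action and per-policy quantiles. The model here is an assumption made purely for illustration and is not part of the argument above: each step's action score is an independent standard normal draw, and a policy's score is the sum over 100 steps. The function name `chained_vs_policy_quantile` is hypothetical.

```python
import math
from statistics import NormalDist


def chained_vs_policy_quantile(steps=100, q=0.99):
    """Compare chaining per-step q-quantile actions with a q-quantile policy.

    Toy assumptions: per-step scores are i.i.d. N(0, 1) and the policy
    score is their sum, so whole-trajectory scores are N(0, steps).
    """
    z = NormalDist().inv_cdf(q)            # per-step q-quantile score
    chained_total = steps * z              # every step taken at its q-quantile
    trajectories = NormalDist(mu=0.0, sigma=math.sqrt(steps))
    policy_total = trajectories.inv_cdf(q) # q-quantile over whole trajectories
    chained_sigmas = chained_total / trajectories.stdev
    return chained_total, policy_total, chained_sigmas


if __name__ == "__main__":
    chained, policy, sigmas = chained_vs_policy_quantile()
    print(f"chained 99%ile actions: {chained:.1f}")
    print(f"99%ile policy:          {policy:.1f}")
    print(f"chained total is {sigmas:.1f} standard deviations out")
```

Under these assumptions the chained trajectory scores about 233, while a 99%ile trajectory scores about 23. The chained policy sits roughly 23 standard deviations out in the trajectory distribution, far beyond anything the base distribution would plausibly produce. Real action sequences are neither independent nor additive, so this shows only the direction of the effect, not its size.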