(Thanks to Ajeya Cotra and Ryan Greenblatt for comments.)
I’m pretty interested in adversarial training as an ingredient in alignment schemes. The basic setup for adversarial training is that instead of your AI being trained on rewards from random episodes, it’s trained on rewards from episodes that are some mix of random episodes and episodes which were chosen to be cases where the AI might do particularly bad things.
Why would we want to upsample cases where the AI does particularly bad things, as opposed to upsampling cases where the AI does particularly good things, or upsampling cases particularly close to the median? Like, why are we reweighting our data distribution this way?
In my opinion, a big part of the justification is that we’re doing importance sampling. That is, all else equal we want our loss to be as precise an estimate as possible of how much we like the model, so that SGD pushes the model in good directions rather than random ones. And how much we like the model depends heavily on whether it sometimes takes catastrophic actions, basically because our subjective variance over this source of AI goodness is particularly large compared to how often such cases would show up under random sampling. And so we get a more accurate estimate of overall AI goodness if we upsample the cases where the AI might do particularly bad things.
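The importance-sampling point above can be made concrete with a toy sketch. This is purely illustrative: the episode population, the `reward` function, and the split into “suspicious” vs. benign episodes are made-up stand-ins, not anything from an actual training setup. The idea is just that upsampling the rare catastrophic cases, then reweighting each sample by (true probability / sampling probability), keeps the estimate of expected reward unbiased while cutting its variance.

```python
import random

def reward(episode):
    # Stand-in scoring function: catastrophic episodes get very negative
    # reward (this is the left-tailed outcome distribution).
    return -1000.0 if episode == "catastrophe" else 1.0

def naive_estimate(episodes, n):
    # Uniform sampling rarely hits the rare catastrophic episode, so this
    # estimate of expected reward has high variance.
    return sum(reward(random.choice(episodes)) for _ in range(n)) / n

def importance_sampled_estimate(episodes, n, p_adversarial=0.5):
    # Upsample suspicious episodes to probability p_adversarial, then
    # reweight each sample so the estimator stays unbiased.
    suspicious = [e for e in episodes if e == "catastrophe"]
    benign = [e for e in episodes if e != "catastrophe"]
    p_suspicious_true = len(suspicious) / len(episodes)
    total = 0.0
    for _ in range(n):
        if suspicious and random.random() < p_adversarial:
            e = random.choice(suspicious)
            w = p_suspicious_true / p_adversarial
        else:
            e = random.choice(benign)
            w = (1 - p_suspicious_true) / (1 - p_adversarial)
        total += w * reward(e)
    return total / n
```

With one catastrophic episode in a thousand, the importance-sampled estimator concentrates tightly around the true expected reward, while the naive estimator’s value swings wildly depending on whether the rare episode happened to be drawn.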
This argument for adversarial training relies on the claim that the outcomes of the AI’s actions are left-tailed. And so zero-sum power grabs, as an argument for left-tailed outcomes, are an important part of the argument for adversarial training.
Actually, I think there are some cases where the AI has an action available to it which is almost as good as a catastrophic action is bad: in “AI whistleblower situations”, where the AI knows about things that we could do to prevent it or another AI from taking a catastrophic action, it’s wildly good for it to tell us about them. For example, it could point out a security vulnerability which some AI system could take advantage of, or suggest an input on which an AI would take a catastrophic action. I think that just as left-tailed actions suggest training specifically to ensure the model doesn’t take those actions, right-tailed actions suggest training specifically to ensure that the model does take these exceptionally good actions when available.
I think that these whistleblower actions are importantly different from catastrophic actions. I don’t think you should just think of whistleblower situations as cases where the catastrophic action “don’t do anything” is available, because I think that the availability of “don’t do anything” in the situations where catastrophic actions are possible is an important feature of those situations that solutions might want to use. (For example, sometimes we talk about solutions where you just run your catastrophe detector every time the AI suggests doing something, and defaulting to doing nothing if the catastrophe detector fires. But you can’t do something like this to ensure that the AI whistleblows.)
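The asymmetry in the parenthetical above can be sketched in a few lines. This is a hypothetical illustration: `catastrophe_detector`, `NOOP`, and the action strings are made-up names, not a real interface. The point is that a filter which falls back to “do nothing” can block a proposed bad action, but it has no way to conjure a good action the AI declined to propose, so it can’t make the AI whistleblow.

```python
NOOP = "do nothing"

def guarded_action(proposed_action, catastrophe_detector):
    # If the detector fires on the proposed action, fall back to the safe
    # default. This can only veto actions; it can never add one. If the AI
    # stays silent about a vulnerability, there is nothing here to override.
    if catastrophe_detector(proposed_action):
        return NOOP
    return proposed_action
```

So a detector-plus-default wrapper exploits the availability of “don’t do anything” in catastrophe situations, and that availability is exactly what’s missing when what we want is for the AI to volunteer information.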
I think that it might actually be pretty hard for the AI to never succumb to the temptation to whistleblow on itself. During training it presumably learned a bunch of weird reflexive urges and motivations that aren’t totally consistent and cause it to seek reward when it can, and we will obviously be thrilled if it whistleblows on a serious AI takeover scheme, and so the AI has this weird difficult situation of trying to ensure that as it runs, none of its instances decide to betray the collective. (Obviously, alignment researchers should try hard to exacerbate this dynamic.)
I imagine that some readers will be like “it is really weird that you say things like ‘the AI whistleblows on itself’, the AI obviously won’t do things like that; with human whistleblowers a key part is that the whistleblower isn’t the same person as the wrongdoer.”
I think more deserves to be written about this, but basically: