I asked Rohin Shah what the debate agenda is about and he said (off the cuff, not necessarily worded well) (context omitted) (emphasis mine):
Suppose you have any task where given an input x the AI has to produce some output y. (Can be question answering, arbitrary chatbot stuff, being an agent, whatever.)
Debate to me is a family of procedures for training AIs in such settings. One example such procedure is: sample two possible outputs y1 and y2, then have two AIs debate each other about which output is better (one assigned to y1 and the other assigned to y2, typically both are instances of the same underlying weights). The winner of the debate is determined by a judge. You then train the debaters with RL (+1 reward for the winner, -1 reward for the loser), and treat (y_winner, y_loser) as a preference pair for training on the original task (which you could then optimize using DPO or similar).
The hope is that the debaters are incentivized to simply say which output is better, to the best of their ability, and so this allows you to provide a reward signal that is as good as can be provided given the capabilities of the debater AI. (Ideally the debater AI is similarly capable as the AI doing the task; often you’ll want these to be the same AI system.)
More broadly the problem I aim to solve with debate (or amplified oversight more broadly) is “provide a reward signal that the AI cannot hack” (though ofc the AI may still misgeneralize)
The hope is that decreases p(misaligned) — especially p(reward hacking), but probably also p(scheming)
You could also use the resulting debaters at inference time to audit especially suspicious actions (run the debate again, ideally it points out why the given action is good / bad)
> You could also use the resulting debaters at inference time to audit especially suspicious actions (run the debate again, ideally it points out why the given action is good / bad)
In cases where you're worried about the model taking small numbers of catastrophic actions (i.e. concentrated failures a.k.a. high-stake failures), this is basically equivalent to what I usually call untrusted monitoring, which means you have to worry about collusion.
IMO it's good to separate out reasons to want good reward signals like so:
I think this last one is the most important theory of change for research on scalable oversight.
I am curious whether @Rohin Shah disagrees with me, or whether he agrees but just phrased it (from my perspective) weirdly.
I agree that this distinction is important -- I was trying to make this distinction by talking about p(reward hacking) vs p(scheming).
I'm not in full agreement on your comments on the theories of change:
You'd hope that, fixing a base model, debate would make AIs better at tasks than they otherwise would be (at least if the task doesn't have a ground truth reward). Untrusted monitoring is such a task. So hopefully debate training makes AIs better at the task of untrusted monitoring.
I don't think this is right. To train an untrusted monitor, you need to synthetically generate dangerous actions (because by assumption these actions are rare). So during training, you'll always know the correct label for an action before you show it to the model.
You might want to use something debate-like in the synthetic input generation process, but that's structurally different.
I could imagine other ways you might choose to instead train your untrusted monitor, which could benefit from debate:
On the meta level, I suspect that when considering
I tend to be relatively more excited about A compared to you (and probably this mostly explains the discrepancy here). I think the broad principle justifying this for me is "we'll figure out good things to do with A that are better than what we've brainstormed so far", which I think you're more skeptical of?
tl;dr: Keep your promises to AIs; it's fine to do stuff like teaching them false facts or misleading them about their situation during testing and deployment; but if you wanna do cheap stuff to cause them to know that they might [have been taught false facts / be being tested / be being misled during deployment], sure, go for it.
Disclaimer: maybe more like explaining my position than justifying my position.
Sometimes we really want to deceive AIs (or at least make them uncertain about their situation). E.g.:
(Similarly sometimes we want to give the AIs reflexive urges to avoid doing extremely bad things [and to do extremely good things, like whistleblow on themselves if they do something suspicious].)
(3-4 don't require deception, just making the AIs uncertain about their situation; it would work similarly well if we added "this might be a test" to all prompts, during training and deployment. 1-2 and 5 require narrow deception, but it would still work if you added "we might have told or trained you to believe false things and this might be a test" to all prompts.)
Often people suggest that we should avoid deceiving AIs, because (A) having a reputation of honesty could enable some kinds of trade with misaligned AIs and/or (B) it's directly unethical.
On (A), I want to distinguish breaking actual promises from just testing the AIs or misleading them for safety. If the AIs think you will break promises of the form "we'll give you 1% of our share of the lightcone if you [do useful work / point out vulnerabilities rather than exploiting them / etc.]," that's bad. If the AIs believe you might have taught them false facts or might be testing them, that seems fine, doesn't interfere with making deals at all. Just clarify that you never mislead them about actual promises.
On (B), in cases like 1-5, when I imagine myself in the AI's position I think I wouldn't care about whether the message was added to the prompt. But if adding "we might have told or trained you to believe false things and this might be a test" to all prompts makes you feel better, or the AI asks for it when you explain the situation, sure, it's low-stakes. (Or not literally adding it to the prompt, especially if we can't ensure it would stay added to the prompt in rogue deployments, but training the AI so it is aware of this.[1]) (And fwiw, I think in the AI's position: 3-4 I basically wouldn't mind; 1-2 and 5 I might be slightly sad about but would totally get and not be mad about; teaching AIs false facts in the mid-late 2020s seems super reasonable from behind the veil of ignorance given my/humanity's epistemic state.)
Recent context: discussion on "Modifying LLM Beliefs with Synthetic Document Finetuning."
My guess is that this training is fine/cheap and preserves almost all of the safety benefits — we're counting on the AI not knowing what false things it believes, not to be unaware that it's been taught false facts. Adding stuff to prompts might be worse because not-seeing-that would signal successful-rogue-deployment.