Debate helps supervise human experts [Paper]

habryka

17th Nov 2023

This is a linkpost for https://github.com/julianmichael/debate/blob/2023-nyu-experiments/Debate_Helps_Supervise_Unreliable_Experts.pdf

There didn't seem to be a linkpost for this recent paper on AI debate yet, so I figured I would make one. The abstract:

As AI systems are used to answer more difficult questions and potentially help create new knowledge, judging the truthfulness of their outputs becomes more difficult and more important. How can we supervise unreliable experts—which have access to the truth but may not accurately report it—to give answers that are systematically true and don’t just superficially seem true, when the supervisor can’t tell the difference between the two on their own? In this work, we show that debate between two unreliable experts can help a non-expert judge more reliably identify the truth.

We collect a dataset of human-written debates on hard reading comprehension questions where the judge has not read the source passage, only ever seeing expert arguments and short quotes selectively revealed by ‘expert’ debaters who have access to the passage. In our debates, one expert argues for the correct answer, and the other for an incorrect answer. Comparing debate to a baseline we call consultancy, where a single expert argues for only one answer which is correct half of the time, we find that debate performs significantly better, with 84% judge accuracy compared to consultancy’s 74%. Debates are also more efficient, being 68% of the length of consultancies.

By comparing human to AI debaters, we find evidence that with more skilled (in this case, human) debaters, the performance of debate goes up but the performance of consultancy goes down. Our error analysis also supports this trend, with 46% of errors in human debate attributable to mistakes by the honest debater (which should go away with increased skill); whereas 52% of errors in human consultancy are due to debaters obfuscating the relevant evidence from the judge (which should become worse with increased skill). Overall, these results show that debate is a promising approach for supervising increasingly capable but potentially unreliable AI systems.
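As a reading aid for the headline numbers in the abstract, here is a minimal sketch that converts the two quoted judge accuracies (84% for debate, 74% for consultancy) into error rates and a relative error reduction. The function names are hypothetical and chosen for illustration; only the two accuracy figures come from the abstract, and this is plain arithmetic rather than the paper's own analysis.

```python
# Judge accuracies as quoted in the abstract (the only values taken from the source).
DEBATE_ACCURACY = 0.84
CONSULTANCY_ACCURACY = 0.74


def error_rate(accuracy: float) -> float:
    """Fraction of questions the judge gets wrong."""
    return 1.0 - accuracy


def relative_error_reduction(baseline_accuracy: float, new_accuracy: float) -> float:
    """Share of the baseline's errors that the new protocol removes."""
    baseline_error = error_rate(baseline_accuracy)
    return (baseline_error - error_rate(new_accuracy)) / baseline_error


if __name__ == "__main__":
    reduction = relative_error_reduction(CONSULTANCY_ACCURACY, DEBATE_ACCURACY)
    print(f"Consultancy error rate: {error_rate(CONSULTANCY_ACCURACY):.2f}")
    print(f"Debate error rate:      {error_rate(DEBATE_ACCURACY):.2f}")
    print(f"Relative error reduction from debate: {reduction:.1%}")
```

Read this way, the 10-point accuracy gap corresponds to judge errors falling from roughly 26 to roughly 16 per 100 questions, which is a bit over a third of consultancy's errors.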

