Thanks for the feedback, Ryan!
I like this paper, but I think the abstract is somewhat overstated.
This is good to know. We were trying to present an accurate summary in the abstract while keeping it concise, which is a tricky balance. Seems like we didn’t do a good enough job here, so we’ll update the abstract to caveat the results a bit more.
Hidden passage debate on QuALITY is actually pretty narrow as far as domains go and might have pretty different properties from future cases.
Yep, agreed! QuALITY is a great testbed for debate, but we definitely need to see debate results in other domains. The NYU ARG stream in MATS is looking at some other LLM debate domains right now and I’m very keen to see their results.
My understanding is that there are a bunch of negative results on other domains and perhaps on other variants of the QuALITY task.
Yeah we tried a bunch of other tasks early on, which we discuss in Appendix C. Originally we were using debate with symmetric information to try to improve judge performance on various datasets above their 0-shot performance. This didn’t work for a few reasons:
I'd be interested in debate results where we have human debaters and GPT-4 as a judge. (Unless this is already in this paper? I don't see it, but I haven't read the results in detail yet.) I think this seems somewhat analogous to the case where we have AI debaters and human judges (judge and debaters have different capability profiles, debaters might understand a bunch of judge weaknesses, etc).
We did check something similar: we ran our GPT-4 judge on the human debate transcripts from Michael et al. We found that judge accuracy was higher in debate than in consultancy, and that inter-annotator agreement between human and GPT-4 judges was much higher in debate than in consultancy. These results didn't make it into the paper, but they may be worth adding to an appendix. Of course, this is not the same as human debaters who know their judge will be an LLM; in that case I’d imagine debaters trying out a lot of weird adversarial strategies. I wouldn’t be surprised if such strategies worked to the point where our persuasiveness -> judge accuracy relationship broke down, but I don’t think it would be a big update against debate for me, since current LLMs are just very vulnerable to weird attacks compared to humans.
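For anyone who wants to run this kind of check on their own transcripts, here is a minimal sketch of the two quantities involved: judge accuracy against the gold answer, and agreement between two judges on the same transcripts. Cohen's kappa is an assumption here (one common agreement measure, not necessarily the one we used), the function names are hypothetical, and the labels at the bottom are made-up toy values, not our results.

```python
from collections import Counter


def accuracy(verdicts, gold):
    """Fraction of transcripts where the judge picked the gold answer."""
    return sum(v == g for v, g in zip(verdicts, gold)) / len(gold)


def cohens_kappa(judge_a, judge_b):
    """Chance-corrected agreement between two judges on the same items."""
    n = len(judge_a)
    # Observed agreement: how often the two judges gave the same verdict.
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Expected agreement if each judge answered independently
    # according to their own label frequencies.
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    labels = set(judge_a) | set(judge_b)
    expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        # Both judges always give the same single label.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels for illustration only (NOT results from the paper).
gold = ["A", "B", "B", "B"]
human_verdicts = ["A", "B", "A", "B"]
gpt4_verdicts = ["A", "B", "B", "B"]

print("human accuracy:", accuracy(human_verdicts, gold))
print("GPT-4 accuracy:", accuracy(gpt4_verdicts, gold))
print("human/GPT-4 kappa:", cohens_kappa(human_verdicts, gpt4_verdicts))
```

Computing these separately for the debate transcripts and the consultancy transcripts gives the comparison described above.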