AI Safety Debate and Its Applications
All of the experimental work and some of the theoretical work has been done jointly with Anna Gajdova, David Lindner, Lukas Finnveden, and Rajashree Agrawal as part of the third AI Safety Camp. We are grateful to Ryan Carey and Geoffrey Irving for the advice regarding this project. The remainder of the theoretical part relates to my stay at FHI, and I would like to thank the above people, Owain Evans, Michael Dennis, Ethan Perez, Stuart Armstrong, and Max Daniel for comments/discussions. Debate is a recent proposal for AI alignment, which naturally incorporates elicitation of human preferences and has the potential to offload the costly search for flaws in an AI’s suggestions onto the AI. After briefly recalling the intuition behind debate, we list the main open problems surrounding it and summarize how the existing work on debate addresses them. Afterward, we describe, and distinguish between, Debate games and their different applications in more detail. We also formalize what it means for a debate to be truth-promoting. Finally, we present results of our experiments on Debate games and Training via Debate on MNIST and fashion MNIST. Debate games and why they are useful Consider an answer A to some question Q --- for example, "Where should I go for a vacation?" and "Alaska". Rather than directly verifying whether A is an accurate answer to Q, it might be easier to first decompose A into lower-level components (How far/expensive is it? Do they have nice beaches? What is the average temperature? What language do they speak?). Moreover, it isn't completely clear what to do even if we know the relevant facts --- indeed, how does Alaska's cold weather translate to a preference for Alaska from 0 to 10? And how does this preference compare to English being spoken in Alaska? As an alternative, we can hold a debate between two competing answers A and A′="Bali" to Q. This allows strategic debaters to focus on the most relevant fea