
UK AISI Alignment Team: Debate Sequence

May 07, 2025 by Benjamin Hilton

The UK AI Security Institute's Alignment Team focuses on research aimed at reducing safety and security risks from AI systems that autonomously pursue courses of action which could lead to egregious harm and which are not under human control.

This sequence examines our initial focus: using scalable oversight to train honest AI systems, combining theory about training equilibria with empirical evidence about the results of training.

UK AISI's Alignment Team: Research Agenda
Benjamin Hilton, Jacob Pfau, Marie_DB, Geoffrey Irving

An alignment safety case sketch based on debate
Marie_DB, Jacob Pfau, Benjamin Hilton, Geoffrey Irving

Dodging systematic human errors in scalable oversight
Geoffrey Irving

Unexploitable search: blocking malicious use of free parameters
Jacob Pfau, Geoffrey Irving

Prover-Estimator Debate: A New Scalable Oversight Protocol
Jonah Brown-Cohen, Geoffrey Irving

The need to relativise in debate
Geoffrey Irving, Simon Marshall