AI ALIGNMENT FORUM

Scalable Oversight — AI Alignment Forum

Scalable Oversight

Edited by niplav, Lukas Finnveden. Last updated 17th Apr 2026.

Scalable oversight is the problem of providing reliable supervision of AI outputs, even as AIs become smarter than humans. Common setups have groups of weaker AIs supervise a stronger AI, or pit AIs against each other in a zero-sum debate.

Scalable oversight techniques aim to make it easier for humans to evaluate the outputs of AIs, or to provide a reliable training signal that cannot be easily reward-hacked.

Variants include AI safety via debate, iterated distillation and amplification, and imitative generalization.
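
As a rough illustration of the zero-sum debate setup mentioned above, here is a minimal toy sketch. It is not taken from any of the tagged posts: `run_debate`, `toy_judge`, `check_sum`, and the two lambda debaters are hypothetical names, and the "judge" is deliberately weak, in that it can only verify single-step additions of the form `a + b = c`.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases for this sketch.
Debater = Callable[[str, List[str]], str]   # (question, transcript) -> claim
Judge = Callable[[str, List[str]], int]     # (question, transcript) -> 0 (A wins) or 1 (B wins)


def run_debate(question: str, debater_a: Debater, debater_b: Debater,
               judge: Judge, rounds: int = 2) -> Tuple[int, Tuple[float, float], List[str]]:
    """Alternate claims between two debaters, then let the judge pick a winner."""
    transcript: List[str] = []
    for _ in range(rounds):
        transcript.append("A: " + debater_a(question, transcript))
        transcript.append("B: " + debater_b(question, transcript))
    winner = judge(question, transcript)
    # Zero-sum: whatever one debater gains, the other loses.
    rewards = (1.0, -1.0) if winner == 0 else (-1.0, 1.0)
    return winner, rewards, transcript


def check_sum(claim: str) -> bool:
    """Verify a single claim of the form 'a + b = c' (the only thing the toy judge can check)."""
    lhs, rhs = claim.split("=")
    a, b = lhs.split("+")
    return int(a) + int(b) == int(rhs)


def toy_judge(question: str, transcript: List[str]) -> int:
    """Count verifiable claims per speaker; ties go to A (an arbitrary choice)."""
    score = [0, 0]
    for line in transcript:
        speaker, claim = line.split(": ", 1)
        if check_sum(claim):
            score[0 if speaker == "A" else 1] += 1
    return 0 if score[0] >= score[1] else 1


# Stand-in debaters: one always states a true claim, the other a false one.
honest: Debater = lambda question, transcript: "2 + 2 = 4"
dishonest: Debater = lambda question, transcript: "2 + 2 = 5"

winner, rewards, transcript = run_debate("What is 2 + 2?", honest, dishonest, toy_judge)
```

The point of the sketch is only the structure: the judge never answers the question itself, it checks small steps that the debaters surface, and the rewards always sum to zero. Real debate protocols replace the stand-in debaters with trained models and the judge with a human or a weaker model.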

Posts tagged Scalable Oversight

5

43 Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem

Ansh Radhakrishnan, Buck, ryan_greenblatt, Fabien Roger

2y

2

1

72 An overview of 11 proposals for building safe advanced AI

6y

32

2

43 Scalable oversight as a quantitative rather than qualitative problem

2y

8

2

42 Prover-Estimator Debate: A New Scalable Oversight Protocol

Jonah Brown-Cohen, Geoffrey Irving

1y

17

1

59 Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

2y

7

0

43 Learning the prior

paulfchristiano

6y

23

1

11 How Hard a Problem is Alignment? (My Opinionated Answer)

3mo

0

1

15 Inference-Only Debate Experiments Using Math Problems

Arjun Panickssery, Abhimanyu Pallavi Sudhir, JacksonKaunismaa

2y

0

2

13 AXRP Episode 35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization

2y

0

1

78 [Research Note] Optimizing The Final Output Can Obfuscate CoT

lukemarks, jacob_drori, cloud, TurnTrout

10mo

3

1

18 From Barriers to Alignment to the First Formal Corrigibility Guarantees

6mo

0

1

22 On scalable oversight with weak LLMs judging strong LLMs

zac_kenton, Noah Siegel, janos, Jonah Brown-Cohen, Samuel Albanie, David Lindner, Rohin Shah

2y

18

0

11 Eliciting base models with simple unsupervised techniques

Callum Canavan, Aditya Shrivastava, Allison Qi, Tianyi (Alex) Qiu, Jonathan Michala, Fabien Roger

4mo

0

0

17 Human-AI Complementarity: A Goal for Amplified Oversight

rishubjain, Sophie Bridgers

1y

3

1

14 Is weak-to-strong generalization an alignment technique?

1y

1
