AI ALIGNMENT FORUM


Scalable Oversight

This page is a stub.
Posts tagged Scalable Oversight
58 Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight
Sam Marks
2y
7
43 Learning the prior
paulfchristiano
5y
23
43 Scalable oversight as a quantitative rather than qualitative problem
Buck
1y
8
15 Inference-Only Debate Experiments Using Math Problems
Arjun Panickssery, Abhimanyu Pallavi Sudhir, JacksonKaunismaa
1y
0
13 AXRP Episode 35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization
DanielFilan
1y
0
77 [Research Note] Optimizing The Final Output Can Obfuscate CoT
lukemarks, jacob_drori, cloud, TurnTrout
3mo
3
64 Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
cloud, Jacob G-W, Evzen, Joseph Miller, TurnTrout
10mo
4
42 Prover-Estimator Debate: A New Scalable Oversight Protocol
Jonah Brown-Cohen, Geoffrey Irving
4mo
17
22 On scalable oversight with weak LLMs judging strong LLMs
zac_kenton, Noah Siegel, janos, Jonah Brown-Cohen, Samuel Albanie, David Lindner, Rohin Shah
1y
18
17 Human-AI Complementarity: A Goal for Amplified Oversight
rishubjain, Sophie Bridgers
10mo
3
14 Is weak-to-strong generalization an alignment technique?
cloud
9mo
1