AI Alignment Forum

AI Control

Mar 24, 2024 by Fabien Roger

This is a collection of posts about AI Control, an approach to AI safety that focuses on safety measures designed to prevent powerful AIs from causing unacceptably bad outcomes, even if those AIs are misaligned and intentionally try to subvert the safety measures.

These posts are useful for understanding the AI Control approach and its upsides and downsides. They cover only a small fraction of the AI safety work relevant to AI control.

The case for ensuring that powerful AIs are controlled · ryan_greenblatt, Buck · 115 karma · 32 comments · 2y
AI Control: Improving Safety Despite Intentional Subversion · Buck, Fabien Roger, ryan_greenblatt, Kshitij Sachan · 102 karma · 5 comments · 2y
Untrusted smart models and trusted dumb models · Buck · 49 karma · 6 comments · 2y
Catching AIs red-handed · ryan_greenblatt, Buck · 62 karma · 8 comments · 2y
Would catching your AIs trying to escape convince AI developers to slow down or undeploy? · Buck · 118 karma · 25 comments · 1y
AI catastrophes and rogue deployments · Buck · 47 karma · 9 comments · 1y
Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy · Buck, ryan_greenblatt · 53 karma · 19 comments · 2y
Auditing failures vs concentrated failures · ryan_greenblatt, Fabien Roger · 24 karma · 1 comment · 2y
Protocol evaluations: good analogies vs control · Fabien Roger · 24 karma · 6 comments · 2y
How useful is "AI Control" as a framing on AI X-Risk? · habryka, ryan_greenblatt · 43 karma · 0 comments · 2y
Fields that I reference when thinking about AI takeover prevention · Buck · 58 karma · 6 comments · 1y
New report: Safety Cases for AI · joshc · 54 karma · 5 comments · 2y
Notes on control evaluations for safety cases · ryan_greenblatt, Buck, Fabien Roger · 31 karma · 0 comments · 2y
Toy models of AI control for concentrated catastrophe prevention · Fabien Roger, Buck · 29 karma · 2 comments · 2y
Games for AI Control · charlie_griffin, Buck · 21 karma · 0 comments · 1y