AI Alignment Forum

AI Control

Mar 24, 2024 by Fabien Roger

This is a collection of posts about AI Control, an approach to AI safety that focuses on safety measures designed to prevent powerful AIs from causing unacceptably bad outcomes, even if those AIs are misaligned and intentionally try to subvert the safety measures.

These posts are useful for understanding the AI Control approach and its upsides and downsides. They cover only a small fraction of the AI safety work relevant to AI control.

The case for ensuring that powerful AIs are controlled · ryan_greenblatt, Buck · 115 karma · 32 comments · 2y
AI Control: Improving Safety Despite Intentional Subversion · Buck, Fabien Roger, ryan_greenblatt, Kshitij Sachan · 102 karma · 5 comments · 2y
Untrusted smart models and trusted dumb models · Buck · 49 karma · 6 comments · 2y
Catching AIs red-handed · ryan_greenblatt, Buck · 62 karma · 8 comments · 2y
Would catching your AIs trying to escape convince AI developers to slow down or undeploy? · Buck · 118 karma · 25 comments · 1y
AI catastrophes and rogue deployments · Buck · 47 karma · 9 comments · 1y
Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy · Buck, ryan_greenblatt · 53 karma · 19 comments · 2y
Auditing failures vs concentrated failures · ryan_greenblatt, Fabien Roger · 24 karma · 1 comment · 2y
Protocol evaluations: good analogies vs control · Fabien Roger · 24 karma · 6 comments · 2y
How useful is "AI Control" as a framing on AI X-Risk? · habryka, ryan_greenblatt · 43 karma · 0 comments · 2y
Fields that I reference when thinking about AI takeover prevention · Buck · 58 karma · 6 comments · 1y
New report: Safety Cases for AI · joshc · 54 karma · 5 comments · 2y
Notes on control evaluations for safety cases · ryan_greenblatt, Buck, Fabien Roger · 31 karma · 0 comments · 2y
Toy models of AI control for concentrated catastrophe prevention · Fabien Roger, Buck · 29 karma · 2 comments · 2y
Games for AI Control · charlie_griffin, Buck · 21 karma · 0 comments · 1y