AI Control

Edited by Ben Pace, last updated 17th Aug 2024

AI Control, in the context of AI Alignment, is a category of plans that aim to ensure safety and obtain benefit from AI systems even if those systems are goal-directed and actively trying to subvert your control measures. From The case for ensuring that powerful AIs are controlled:

In this post, we argue that AI labs should ensure that powerful AIs are controlled. That is, labs should make sure that the safety measures they apply to their powerful models prevent unacceptably bad outcomes, even if the AIs are misaligned and intentionally try to subvert those safety measures. We think no fundamental research breakthroughs are required for labs to implement safety measures that meet our standard for AI control for early transformatively useful AIs; we think that meeting our standard would substantially reduce the risks posed by intentional subversion.

and, from the same post:

There are two main lines of defense you could employ to prevent schemers from causing catastrophes.

  • Alignment: Ensure that your models aren't scheming.[2]
  • Control: Ensure that even if your models are scheming, you'll be safe, because they are not capable of subverting your safety measures.[3]
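To make the "control" line of defense concrete, below is a minimal sketch of a trusted-monitoring protocol of the kind studied in AI Control: Improving Safety Despite Intentional Subversion (listed below). The function names, thresholds, and model stubs are illustrative assumptions for this page, not the paper's implementation.

```python
# Hypothetical sketch of a "trusted monitoring" control protocol.
# All names, thresholds, and stubs here are illustrative assumptions,
# not the authors' actual implementation.

from dataclasses import dataclass


@dataclass
class Completion:
    task: str
    code: str


def untrusted_model(task: str) -> Completion:
    """A powerful, possibly scheming model produces a solution (stub)."""
    return Completion(task=task, code="...")


def trusted_monitor(completion: Completion) -> float:
    """A weaker, trusted model scores how suspicious the solution looks, 0-1 (stub)."""
    return 0.0


def trusted_model(task: str) -> Completion:
    """A weaker, trusted model produces a less capable fallback solution (stub)."""
    return Completion(task=task, code="...")


def run_with_trusted_monitoring(task: str,
                                audit_threshold: float = 0.98,
                                defer_threshold: float = 0.8) -> Completion:
    """Use the untrusted model, but audit or defer when the monitor is suspicious."""
    completion = untrusted_model(task)
    suspicion = trusted_monitor(completion)

    if suspicion >= audit_threshold:
        # Escalate to a human audit; catching an attempted subversion here is
        # strong evidence of scheming.
        raise RuntimeError(f"Audit required for task {task!r} (suspicion={suspicion:.2f})")
    if suspicion >= defer_threshold:
        # Too risky to use directly; fall back to the trusted model's answer.
        return trusted_model(task)
    return completion
```

The point of such a protocol is that safety comes from the surrounding measures (monitoring, auditing, and deferring to a weaker trusted model) rather than from trusting the powerful model's intentions.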
Posts tagged AI Control
  • AI Control: Improving Safety Despite Intentional Subversion, by Buck, Fabien Roger, ryan_greenblatt, Kshitij Sachan (102 points, 2y, 5 comments)
  • The case for ensuring that powerful AIs are controlled, by ryan_greenblatt, Buck (115 points, 2y, 32 comments)
  • Catching AIs red-handed, by ryan_greenblatt, Buck (62 points, 2y, 8 comments)
  • How to prevent collusion when using untrusted models to monitor each other, by Buck (43 points, 1y, 2 comments)
  • Schelling game evaluations for AI control, by Olli Järviniemi (39 points, 11mo, 0 comments)
  • AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt, by DanielFilan (38 points, 1y, 6 comments)
  • Critiques of the AI control agenda, by Jozdien (17 points, 2y, 8 comments)
  • AI Control May Increase Existential Risk, by Jan_Kulveit (43 points, 6mo, 10 comments)
  • Ctrl-Z: Controlling AI Agents via Resampling, by Aryan Bhatt, Buck, Adam Kaufman, Cody Rushing, Tyler Tracy (56 points, 5mo, 0 comments)
  • Preventing Language Models from hiding their reasoning, by Fabien Roger, ryan_greenblatt (61 points, 2y, 5 comments)
  • Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming, by Buck (50 points, 11mo, 1 comment)
  • Why imperfect adversarial robustness doesn't doom AI control, by Buck, Claude+ (27 points, 9mo, 16 comments)
  • Notes on control evaluations for safety cases, by ryan_greenblatt, Buck, Fabien Roger (31 points, 2y, 0 comments)
  • What's the short timeline plan?, by Marius Hobbhahn (129 points, 8mo, 14 comments)
  • Sabotage Evaluations for Frontier Models, by David Duvenaud, Joe Benton, Sam Bowman, evhub, mishajw, Eric Christiansen, HoldenKarnofsky, Ethan Perez, Buck (54 points, 10mo, 5 comments)
(Showing 15 of 60 tagged posts.)