AI ALIGNMENT FORUM

Deceptive Alignment

Edited by Multicore, ryan_greenblatt, et al. last updated 18th Oct 2024

Deceptive Alignment occurs when an AI that is not actually aligned temporarily acts aligned in order to deceive its creators or its training process. Presumably it does so to avoid being shut down or retrained, and to gain access to the power its creators would grant an aligned AI. (The term "scheming" is sometimes used for this phenomenon.)

See also: Mesa-optimization, Treacherous Turn, Eliciting Latent Knowledge, Deception

Posts tagged Deceptive Alignment
39Deceptive Alignment
evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse, Scott Garrabrant
6y
6
102AI Control: Improving Safety Despite Intentional Subversion
Buck, Fabien Roger, ryan_greenblatt, Kshitij Sachan
2y
5
49How likely is deceptive alignment?
evhub
3y
19
43Does SGD Produce Deceptive Alignment?
Mark Xu
5y
3
47New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?"
Joe Carlsmith
2y
23
192Alignment Faking in Large Language Models
ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman, Buck
8mo
24
62Catching AIs red-handed
ryan_greenblatt, Buck
2y
8
45Many arguments for AI x-risk are wrong
TurnTrout
1y
47
35A Problem to Solve Before Building a Deception Detector
Eleni Angelou, lewis smith
7mo
1
20Order Matters for Deceptive Alignment
DavidW
3y
0
11Why Aligning an LLM is Hard, and How to Make it Easier
RogerDearnaley
7mo
0
8Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor
RogerDearnaley
2y
0
58The Waluigi Effect (mega-post)
Cleo Nardo
3y
26
49Deceptive AI ≠ Deceptively-aligned AI
Steven Byrnes
2y
14
13Interpreting the Learning of Deceit
RogerDearnaley
2y
2