AI ALIGNMENT FORUM

Deceptive Alignment

Edited by Multicore, ryan_greenblatt, et al.; last updated 18th Oct 2024

Deceptive Alignment occurs when an AI that is not actually aligned temporarily acts aligned in order to deceive its creators or its training process. Presumably it does this to avoid being shut down or retrained, and to gain access to the power that its creators would give an aligned AI. (The term "scheming" is sometimes used for this phenomenon.)

See also: Mesa-optimization, Treacherous Turn, Eliciting Latent Knowledge, Deception

Posts tagged Deceptive Alignment
22 Prospects for studying actual schemers
ryan_greenblatt, Julian Stastny
7d
0
26 Decision Theory Guarding is Sufficient for Scheming
james.lucassen
17d
1
72 Why Do Some Language Models Fake Alignment While Others Don't?
abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger
3mo
2
192 Alignment Faking in Large Language Models
ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman, Buck
8mo
24
130 What’s the short timeline plan?
Marius Hobbhahn
9mo
14
25 “Behaviorist” RL reward functions lead to scheming
Steven Byrnes
2mo
4
81 Will alignment-faking Claude accept a deal to reveal its misalignment?
ryan_greenblatt, Kyle Fish
8mo
5
34 Why "training against scheming" is hard
Marius Hobbhahn
3mo
1
40 When does training a model change its goals?
Vivek Hebbar, ryan_greenblatt
3mo
0
56 How training-gamers might function (and win)
Vivek Hebbar
6mo
3
89 Frontier Models are Capable of In-context Scheming
Marius Hobbhahn, AlexMeinke, Bronson Schoen, rusheb, Jérémy Scheurer, Mikita Balesni
10mo
9
86 How will we update about scheming?
ryan_greenblatt
8mo
3
31 Two proposed projects on abstract analogies for scheming
Julian Stastny
3mo
0
58 The Waluigi Effect (mega-post)
Cleo Nardo
3y
26
70 “Alignment Faking” frame is somewhat fake
Jan_Kulveit
9mo
6