AI ALIGNMENT FORUM

Deceptive Alignment

Edited by Multicore, ryan_greenblatt, et al. Last updated 18th Oct 2024.

Deceptive Alignment is when an AI which is not actually aligned temporarily acts aligned in order to deceive its creators or its training process. It presumably does this to avoid being shut down or retrained and to gain access to the power that the creators would give an aligned AI. (The term scheming is sometimes used for this phenomenon.)

See also: Mesa-optimization, Treacherous Turn, Eliciting Latent Knowledge, Deception

Posts tagged Deceptive Alignment
Realistic Reward Hacking Induces Different and Deeper Misalignment
Jozdien, 1mo (60 karma, 0 comments)

Iterated Development and Study of Schemers (IDSS)
ryan_greenblatt, 1mo (23 karma, 0 comments)

Reducing risk from scheming by studying trained-in scheming behavior
ryan_greenblatt, 1mo (19 karma, 0 comments)

Alignment Faking in Large Language Models
ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman, Buck, 10mo (192 karma, 24 comments)

Why Do Some Language Models Fake Alignment While Others Don't?
abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger, 4mo (74 karma, 2 comments)

What’s the short timeline plan?
Marius Hobbhahn, 10mo (130 karma, 14 comments)

Prospects for studying actual schemers
ryan_greenblatt, Julian Stastny, 2mo (28 karma, 0 comments)

Will alignment-faking Claude accept a deal to reveal its misalignment?
ryan_greenblatt, Kyle Fish, 9mo (81 karma, 5 comments)

Decision Theory Guarding is Sufficient for Scheming
james.lucassen, 2mo (26 karma, 1 comment)

Frontier Models are Capable of In-context Scheming
Marius Hobbhahn, AlexMeinke, Bronson Schoen, rusheb, Jérémy Scheurer, Mikita Balesni, 1y (89 karma, 9 comments)

How will we update about scheming?
ryan_greenblatt, 9mo (86 karma, 3 comments)

“Behaviorist” RL reward functions lead to scheming
Steven Byrnes, 4mo (25 karma, 4 comments)

The Waluigi Effect (mega-post)
Cleo Nardo, 3y (60 karma, 26 comments)

How training-gamers might function (and win)
Vivek Hebbar, 7mo (56 karma, 3 comments)

When does training a model change its goals?
Vivek Hebbar, ryan_greenblatt, 5mo (40 karma, 0 comments)