AI Alignment Forum: Tags

Deceptive Alignment

Contributors: Roman Leventov, Multicore

Deceptive alignment occurs when an AI that is not actually aligned temporarily acts aligned in order to deceive its creators or its training process. Presumably it does so to avoid being shut down or retrained, and to gain access to the power its creators would grant an aligned AI.

See also: Mesa-optimization, Treacherous Turn, Eliciting Latent Knowledge, Deception

Posts tagged Deceptive Alignment
MetaAI: less is less for alignment. (Cleo Nardo; 3d; karma 23; comments 1; tag relevance 1)
Announcing Apollo Research (Marius Hobbhahn, Beren Millidge, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer; 17d; karma 82; comments 4; tag relevance 1)
The Waluigi Effect (mega-post) (Cleo Nardo; 3mo; karma 61; comments 24; tag relevance 2)
Deep Deceptiveness (Nate Soares; 3mo; karma 80; comments 16; tag relevance 0)
Environments for Measuring Deception, Resource Acquisition, and Ethical Violations (Dan H; 2mo; karma 20; comments 1; tag relevance 1)
Natural language alignment (Jacy Reese Anthis; 2mo; karma 7; comments 0; tag relevance 1)
Monitoring for deceptive alignment (Evan Hubinger; 9mo; karma 64; comments 4; tag relevance 1)
Trying to Make a Treacherous Mesa-Optimizer (MadHatter; 7mo; karma 35; comments 1; tag relevance 1)
Evaluations project @ ARC is hiring a researcher and a webdev/engineer (Beth Barnes; 9mo; karma 45; comments 4; tag relevance 1)
How likely is deceptive alignment? (Evan Hubinger; 10mo; karma 48; comments 13; tag relevance 2)
Simple experiments with deceptive alignment (Andreas_Moe; 1mo; karma 4; comments 0; tag relevance 0)
Steering Behaviour: Testing for (Non-)Myopia in Language Models (Evan R. Murphy, Megan Kinniment; 6mo; karma 24; comments 5; tag relevance 1)
Getting up to Speed on the Speed Prior in 2022 (robertzk; 6mo; karma 14; comments 0; tag relevance 0)
Why deceptive alignment matters for AGI safety (Marius Hobbhahn; 9mo; karma 12; comments 4; tag relevance 1)
Smoke without fire is scary (Adam Jermyn; 8mo; karma 22; comments 7; tag relevance 1)
Showing 15 of 27 posts.