Deceptive Alignment — AI Alignment Forum

Edited by Multicore, ryan_greenblatt, et al. Last updated 18th Oct 2024.

Deceptive Alignment occurs when an AI that is not actually aligned temporarily acts aligned in order to deceive its creators or its training process. It presumably does this to avoid being shut down or retrained, and to gain access to the power that its creators would give an aligned AI. (The term "scheming" is sometimes used for this phenomenon.)

See also: Mesa-optimization, Treacherous Turn, Eliciting Latent Knowledge, Deception

Posts tagged Deceptive Alignment

Deceptive Alignment (evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse, Scott Garrabrant; 7y)

AI Control: Improving Safety Despite Intentional Subversion (Buck, Fabien Roger, ryan_greenblatt, Kshitij Sachan; 2y)

How likely is deceptive alignment? (4y)

Does SGD Produce Deceptive Alignment? (5y)

New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?" (2y)

Alignment Faking in Large Language Models (ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman, Buck; 1y)

Catching AIs red-handed (ryan_greenblatt, Buck; 2y)

Many arguments for AI x-risk are wrong (2y)

A Problem to Solve Before Building a Deception Detector (Eleni Angelou, lewis smith; 1y)

Order Matters for Deceptive Alignment (3y)

Why Aligning an LLM is Hard, and How to Make it Easier (1y)

Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor (2y)

The Waluigi Effect (mega-post) (3y)

“Alignment Faking” frame is somewhat fake (1y)

Deceptive AI ≠ Deceptively-aligned AI (2y)

Showing 15 of 106 tagged posts.