Deceptive Alignment — AI Alignment Forum

Deceptive Alignment

Edited by Multicore, ryan_greenblatt, et al. last updated 18th Oct 2024

Deceptive alignment occurs when an AI that is not actually aligned temporarily acts aligned in order to deceive its creators or its training process. Presumably it does so to avoid being shut down or retrained, and to gain access to the power its creators would grant an aligned AI. (The term "scheming" is sometimes used for this phenomenon.)

See also: Mesa-optimization, Treacherous Turn, Eliciting Latent Knowledge, Deception

Posts tagged Deceptive Alignment

0

10 Exploration Hacking: Can LLMs Learn to Resist RL Training?

Eyon Jang, Joschka Braun, Damon Falck, David Lindner

1mo

0

3

193 Alignment Faking in Large Language Models

ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman, Buck

1y

32

2

131 What’s the short timeline plan?

Marius Hobbhahn

1y

14

1

67 Realistic Reward Hacking Induces Different and Deeper Misalignment

8mo

0

1

74 Why Do Some Language Models Fake Alignment While Others Don't?

abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger

11mo

2

2

56 The Waluigi Effect (mega-post)

3y

26

2

81 Will alignment-faking Claude accept a deal to reveal its misalignment?

ryan_greenblatt, Kyle Fish

1y

5

2

89 Frontier Models are Capable of In-context Scheming

Marius Hobbhahn, AlexMeinke, Bronson Schoen, rusheb, Jérémy Scheurer, Mikita Balesni

1y

9

1

22 How hard is it to inoculate against misalignment generalization?

5mo

0

2

86 How will we update about scheming?

ryan_greenblatt

1y

3

1

61 Self-Other Overlap: A Neglected Approach to AI Alignment

Marc Carauleanu, Mike Vaiana, Kvee, Diogo de Lucena, Cameron Berg, Trent Hodgeson

2y

9

2

114 The case for ensuring that powerful AIs are controlled

ryan_greenblatt, Buck

2y

32

2

73 “Alignment Faking” frame is somewhat fake

1y

7

1

123 Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

evhub, Nicholas Schiefer, Carson Denison, Ethan Perez

3y

14

1

9 A Conceptual Framework for Exploration Hacking

Joschka Braun, Eyon Jang, Damon Falck

4mo

0
