Deception

You are viewing revision 1.1.0, last edited by Yoav Ravid

Related Pages: Honesty, Meta-Honesty, Self-Deception, Simulacrum Levels

Posts tagged Deception

3

21 AI Deception: A Survey of Examples, Risks, and Potential Solutions

Simon Goldstein, Peter S. Park

8mo

1

2

13 Interpreting the Learning of Deceit

Roger Dearnaley

4mo

2

0

84 Deep Deceptiveness

1y

16

2

32 LCDT, A Myopic Decision Theory

Adam Shimi, Evan Hubinger

3y

44

1

119 Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

Evan Hubinger, Nicholas Schiefer, Carson Denison, Ethan Perez

8mo

13

1

49 How likely is deceptive alignment?

2y

19

1

-16 Lying is Cowardice, not Strategy

Connor Leahy, Gabriel Alfour

6mo

21

1

6 Difficulty classes for alignment properties

2mo

0

1

17 The Speed + Simplicity Prior is probably anti-deceptive

[anonymous]

2y

11

1

14 Precursor checking for deceptive alignment

2y

0

1

80 Discussion: Challenges with Unsupervised LLM Knowledge Discovery

Seb Farquhar, Vikrant Varma, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, Rohin Shah

4mo

11

1

42 AI x-risk, approximately ordered by embarrassment

1y

1

1

69 Monitoring for deceptive alignment

2y

4

0

37 Are minimal circuits deceptive?

5y

10

1

29 Thoughts On (Solving) Deep Deception

6mo

0