AI ALIGNMENT FORUM
Deception
• Applied to Sparse Features Through Time by Rogan Inglis, 1mo ago
• Applied to Inducing Unprompted Misalignment in LLMs by Sam Svenningsen, 3mo ago
• Applied to 'Empiricism!' as Anti-Epistemology by Jérémy Perret, 4mo ago
• Applied to My Clients, The Liars by ymeskhout, 5mo ago
• Applied to Agreeing With Stalin in Ways That Exhibit Generally Rationalist Principles by Zack M. Davis, 5mo ago
• Applied to Difficulty classes for alignment properties by Arun Jose, 5mo ago
• Applied to LLMs can strategically deceive while doing gain-of-function research by Igor Ivanov, 6mo ago
• Applied to Why do so many think deception in AI is important? by Gunnar Zarncke, 6mo ago
• Applied to (Partial) failure in replicating deceptive alignment experiment by claudia.biancotti, 7mo ago
• Applied to Deception Chess by Chris Land, 7mo ago
• Applied to If Clarity Seems Like Death to Them by Zack M. Davis, 7mo ago
• Applied to SociaLLM: proposal for a language model design for personalised apps, social science, and AI safety research by Roman Leventov, 7mo ago
• Applied to Discussion: Challenges with Unsupervised LLM Knowledge Discovery by Seb Farquhar, 7mo ago
• Applied to Interpreting the Learning of Deceit by Roger Dearnaley, 7mo ago
• Applied to Lying Alignment Chart by Zack M. Davis, 8mo ago
• Applied to Large Language Models can Strategically Deceive their Users when Put Under Pressure. by ReaderM, 8mo ago
• Applied to Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models by Felix Hofstätter, 9mo ago