AI ALIGNMENT FORUM
Deceptive Alignment
• Applied to Inducing Unprompted Misalignment in LLMs by Sam Svenningsen, 5d ago
• Applied to Invitation to the Princeton AI Alignment and Safety Seminar by Sadhika Malladi, 1mo ago
• Applied to Strong-Misalignment: Does Yudkowsky (or Christiano, or TurnTrout, or Wolfram, or…etc.) Have an Elevator Speech I’m Missing? by Benjamin Bourlier, 1mo ago
• Applied to Two Tales of AI Takeover: My Doubts by Violet Hour, 2mo ago
• Applied to Anomalous Concept Detection for Detecting Hidden Cognition by Paul Colognese, 2mo ago
• Applied to Counting arguments provide no evidence for AI doom by Nora Belrose, 2mo ago
• Applied to Hidden Cognition Detection Methods and Benchmarks by Paul Colognese, 2mo ago
• Applied to Instrumental deception and manipulation in LLMs - a case study by Olli Järviniemi, 2mo ago
• Applied to The Shutdown Problem: Incomplete Preferences as a Solution by Elliott Thornley, 2mo ago
• Applied to The Gemini Incident by Lauren (often wrong), 2mo ago
• Applied to Difficulty classes for alignment properties by Arun Jose, 2mo ago
• Applied to Many arguments for AI x-risk are wrong by Alex Turner, 2mo ago
• Applied to Achieving AI Alignment through Deliberate Uncertainty in Multiagent Systems by Florian_Dietz, 2mo ago
• Applied to Critiques of the AI control agenda by Arun Jose, 2mo ago
• Applied to How to train your own "Sleeper Agents" by jacobjacob, 2mo ago
• Applied to Selfish AI Inevitable by Davey Morse, 3mo ago
• Applied to Introducing Alignment Stress-Testing at Anthropic by Gunnar Zarncke, 3mo ago
• Applied to On Anthropic’s Sleeper Agents Paper by Gunnar Zarncke, 3mo ago