AI ALIGNMENT FORUM
Deceptive Alignment
• Applied to High-level interpretability: detecting an AI's objectives by Paul Colognese 3d ago
• Applied to Understanding strategic deception and deceptive alignment by Marius Hobbhahn 6d ago
• Applied to Paper: On measuring situational awareness in LLMs by Owain Evans 1mo ago
• Applied to Mesa-Optimization: Explain it like I'm 10 Edition by brook 1mo ago
• Applied to Against Almost Every Theory of Impact of Interpretability by Charbel-Raphael Segerie 1mo ago
• Applied to Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research by Ethan Perez 2mo ago
• Applied to Apollo Research is hiring evals and interpretability engineers & scientists by Marius Hobbhahn 2mo ago
• Applied to 3 levels of threat obfuscation by RobertM 2mo ago
• Applied to When can we trust model evaluations? by Marius Hobbhahn 2mo ago
• Applied to Autonomous Alignment Oversight Framework (AAOF) by Justausername 2mo ago
• Applied to Supplementary Alignment Insights Through a Highly Controlled Shutdown Incentive by Justausername 2mo ago
• Applied to Disincentivizing deception in mesa optimizers with Model Tampering by martinkunev 3mo ago
• Applied to Ten Levels of AI Alignment Difficulty by Samuel Dylan Martin 3mo ago
• Applied to A "weak" AGI may attempt an unlikely-to-succeed takeover by RobertM 3mo ago
• Applied to Deceptive AI vs. shifting instrumental incentives by Raymond Arnold 3mo ago
• Applied to A way to make solving alignment 10.000 times easier. The shorter case for a massive open source simbox project. by AlexFromSafeTransition 3mo ago
• Applied to MetaAI: less is less for alignment. by Cleo Nardo 4mo ago
• Applied to The Sharp Right Turn: sudden deceptive alignment as a convergent goal by avturchin 4mo ago
• Applied to Proposal: labs should precommit to pausing if an AI argues for itself to be improved by NickGabs 4mo ago
• Applied to Open Source LLMs Can Now Actively Lie by Josh Levy 4mo ago