AI ALIGNMENT FORUM
Anthropic (org)
• Applied to On Claude 3.5 Sonnet by Tobias D. 1mo ago
• Applied to Anthropic's Certificate of Incorporation by Ben Millwood 1mo ago
• Applied to Maybe Anthropic's Long-Term Benefit Trust is powerless by Jérémy Perret 2mo ago
• Applied to Quick Thoughts on Scaling Monosemanticity by Joel Burget 2mo ago
• Applied to Cicadas, Anthropic, and the bilateral alignment problem by kromem 2mo ago
• Applied to EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024 by Stephen Casper 2mo ago
• Applied to Anthropic: Reflections on our Responsible Scaling Policy by Zac Hatfield-Dodds 2mo ago
• Applied to Anthropic AI made the right call by bhauth 3mo ago
• Applied to OMMC Announces RIP by Adam Scholl 4mo ago
• Applied to Vaniver's thoughts on Anthropic's RSP by Gunnar Zarncke 6mo ago
• Applied to Introducing Alignment Stress-Testing at Anthropic by Gunnar Zarncke 6mo ago
• Applied to On Anthropic’s Sleeper Agents Paper by Gunnar Zarncke 6mo ago
• Applied to Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation by Soroush Pour 9mo ago
• Applied to Dario Amodei’s prepared remarks from the UK AI Safety Summit, on Anthropic’s Responsible Scaling Policy by Zac Hatfield-Dodds 9mo ago
• Applied to Comparing Anthropic's Dictionary Learning to Ours by Robert_AIZI 10mo ago
• Applied to Measuring and Improving the Faithfulness of Model-Generated Reasoning by HenningB 10mo ago
• Applied to Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust by Zac Hatfield-Dodds 10mo ago
• Applied to Towards Monosemanticity: Decomposing Language Models With Dictionary Learning by Zac Hatfield-Dodds 10mo ago