x

AI ALIGNMENT FORUM

AF

ChrisCundy — AI Alignment Forum

ChrisCundy

ChrisCundy

Message

70

4

3y

ChrisCundy

70

3y

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

by ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave, and Kellin Pelrine

DeepSeek-R1 has recently made waves as a state-of-the-art open-weight model, with potentially substantial improvements in model efficiency and reasoning. But like other open-weight models and leading fine-tunable proprietary models such as OpenAI’s GPT-4o, Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3 Haiku, R1’s guardrails are illusory and easily removed. An...

Feb 7, 2025•37