AI ALIGNMENT FORUM

Siao Si

AI safety communications at FAR.AI

Previously at AISafety.info

smallsilo — AI Alignment Forum

Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations

Leading AI companies are increasingly using "defense-in-depth" strategies to prevent their models from being misused to generate harmful content, such as instructions to generate chemical, biological, radiological or nuclear (CBRN) weapons. The idea is straightforward: layer multiple safety checks so that even if one fails, others should catch the problem....

Jul 4, 2025•13

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

DeepSeek-R1 has recently made waves as a state-of-the-art open-weight model, with potentially substantial improvements in model efficiency and reasoning. But like other open-weight models and leading fine-tunable proprietary models such as OpenAI’s GPT-4o, Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3 Haiku, R1’s guardrails are illusory and easily removed. An...

Feb 7, 2025•37