Leading AI companies are increasingly adopting "defense-in-depth" strategies to prevent their models from being misused to generate harmful content, such as instructions for producing chemical, biological, radiological, or nuclear (CBRN) weapons. The idea is straightforward: layer multiple safety checks so that even if one fails, the others should catch the problem. Anthropic employs this approach with Claude Opus 4 through constitutional classifiers, and Google DeepMind and OpenAI have announced similar plans. In collaboration with researchers from the UK AI Security Institute, we tested how well such multi-layered defense approaches work by constructing and attacking our own layered defense pipeline.
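For intuition, here is a minimal sketch of what a layered defense pipeline of this kind might look like. The classifier logic and keyword checks below are hypothetical placeholders for illustration only; they are not the classifiers used in our pipeline or by any of the labs mentioned above.

```python
# Minimal sketch of a layered ("defense-in-depth") safety pipeline.
# The classifiers and keyword checks are hypothetical stand-ins,
# not the systems deployed by any lab.

from dataclasses import dataclass


@dataclass
class Verdict:
    allowed: bool
    blocked_by: str  # which layer blocked the request, or "none"


def input_classifier(prompt: str) -> bool:
    """Hypothetical filter that screens requests before generation."""
    return "weapon synthesis" not in prompt.lower()


def output_classifier(response: str) -> bool:
    """Hypothetical filter that screens content after generation."""
    return "synthesis route" not in response.lower()


def generate(prompt: str) -> str:
    """Stand-in for a call to the underlying model."""
    return f"Model response to: {prompt}"


def layered_pipeline(prompt: str) -> Verdict:
    # Layer 1: screen the incoming request.
    if not input_classifier(prompt):
        return Verdict(allowed=False, blocked_by="input")
    response = generate(prompt)
    # Layer 2: screen the generated output. Even if the input
    # filter misses an attack, this layer can still catch it.
    if not output_classifier(response):
        return Verdict(allowed=False, blocked_by="output")
    return Verdict(allowed=True, blocked_by="none")
```

The key property is that an attacker must defeat every layer simultaneously: a jailbreak that slips past the input filter can still be caught when the output is screened.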
We find that multi-layered defenses can offer significant protection against conventional attacks. But we wondered: is this because the defenses are truly robust, or because these attacks simply were not...