Tom Tseng

140

Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations

by smallsilo, Ian McKenzie, Oskar Hollinsworth, Tom Tseng, Xander Davies, scasper, Aaron Tucker, Robert Kirk, and Adam Gleave

Leading AI companies are increasingly using "defense-in-depth" strategies to prevent their models from being misused to generate harmful content, such as instructions for producing chemical, biological, radiological or nuclear (CBRN) weapons. The idea is straightforward: layer multiple safety checks so that even if one fails, others should catch the problem....

Jul 4, 2025•13

Does robustness improve with scale?

by ChengCheng, niki.h, Ian McKenzie, Oskar Hollinsworth, Tom Tseng, and AdamGleave

Adversarial vulnerabilities have long been an issue in various ML systems. Large language models (LLMs) are no exception, suffering from issues such as jailbreaks: adversarial prompts that bypass model safeguards. At the same time, scale has led to remarkable advances in the capabilities of LLMs, leading us to ask: to...

Jul 25, 2024•14

Even Superhuman Go AIs Have Surprising Failure Modes

by AdamGleave, EuanMcLean, Tony Wang, Kellin Pelrine, Tom Tseng, Yawen Duan, Joseph Miller, and MichaelDennis

In March 2016, AlphaGo defeated the Go world champion Lee Sedol, winning four games to one. Machines had finally become superhuman at Go. Since then, Go-playing AI has only grown stronger. The supremacy of AI over humans seemed assured, with Lee Sedol commenting they are an "entity that cannot be...

Jul 20, 2023•131