AI ALIGNMENT FORUM

Hadrien — AI Alignment Forum

Hadrien Mariaccia

Hadrien Mariaccia

The bitter lesson of misuse detection

TL;DR: We wanted to benchmark supervision systems available on the market—they performed poorly. Out of curiosity, we naively asked a frontier LLM to monitor the inputs; this approach performed significantly better. However, beware: even when an LLM flags a question as harmful, it will often still answer it. Full paper...

Jul 10, 2025•37