Claude, GPT, and Gemini All Struggle to Evade Monitors
This work was done while at METR. Introduction GDM recently released a paper (Emmons et al.) showing that, contrary to previous results, the chain-of-thought (CoT) of language models is more faithful when the model’s CoT is necessary for it to complete a task. They examine three settings where an actor...
Aug 6, 202561