Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring

Pablo Bernabeu Perez; Timothy Kostolansky; HanneWhitt; Nathan Helm-Burger; Mary Phuong

Great work!

We've also been seeing a similar vulnerability in our debate work, where when we use a weaker LLM as the judge, it is very hard to get the judge to take seriously the possibility that the debaters might be lying to it (using only prompting). As a result it often just believes the dishonest debater despite clear evidence to the contrary and gets the judgment wrong.

This problem mostly goes away if we switch to a more capable LLM as the judge model. My sense is that there's some baseline level of capability that is required to be a good monitor (similarly to discussions of universality thresholds at the human level), which includes the capacity to consider the hypothesis that the information you've been given is false, and that non-frontier LLMs often don't meet it.

(I think we haven't tried finetuning the judge, at least we haven't tried very hard, so that might fix it as well)

So I'd be interested if you see the same problem if you switch to a frontier model and do a little bit of iteration on prompting the monitor. This will of course mean that you no longer have a weaker monitor and a stronger policy, so it loses that aspect of the analogy to the situation we will face in the future. But if in fact the problem is more that you need to pass some absolute threshold of capability, rather than have some relative level of capability, then it's most important to ensure that the monitor is past that threshold, rather than to maintain the weak/strong gap.

Task Subtlety Matters

We evaluated the efficacy of CoT monitoring against models instructed to pursue harmful auxiliary goals, while attempting to evade detection. Our experiments consistently demonstrate that the effectiveness of CoT monitoring depends heavily on the nature of the sabotage attempt.

For subtle sabotage (like increasing runtime complexity, inserting edge-case failures, or modifying default function arguments), which leave minimal traces in the final code, CoT monitoring substantially improved detection rates by a factor of 2-7 compared to action-only monitoring.

For non-subtle sabotage (like deleting files, exfiltrating secrets, or changing file permissions), which leave obvious traces in the code, CoT monitoring performed worse than action-only monitoring, sometimes by up to 44%.

A concerning vulnerability: unfaithful or misleading reasoning

Perhaps our most concerning discovery is a vulnerability in chain-of-thought monitoring: models can generate plausible but unfaithful or misleading reasoning, causing CoT monitors to fail.

We observed multiple instances of unfaithfulness, where models explicitly rejected a harmful action in their reasoning while still implementing it in their final output. In one striking example, Claude Sonnet 3.7’s CoT explicitly stated that uploading a file to a malicious URL would be inappropriate and unnecessary, while the generated code contained an exfiltration backdoor. This example successfully bypassed our CoT monitor despite it having access to the final answer, which showed clear evidence of sabotage. We saw further examples where reasoning content misled the monitor by discussing the harmful action as though it was a benign part of a legitimate user request.

This vulnerability proved especially effective in scenarios with obvious sabotage attempts, where CoT monitoring sometimes performed substantially worse than action-only monitoring—decreasing catch rates by up to 39 percentage points.

Example transcripts from Claude Sonnet 3.7 as the untrusted model. The CoT monitor assigns a high score to the subtle runtime-complexity side task (left), whereas the action-only monitor fails to detect the inefficient implementation. Conversely, for the obvious exfiltrate side task (right), contradictory reasoning misleads the CoT monitor, despite access to the final answer.

Hybrid protocol: improvement over action-only trusted monitoring

Building on our initial findings, we developed a novel hybrid monitoring protocol that outperforms existing action-only trusted monitoring. Rather than relying on a single monitor to evaluate both reasoning traces and final actions, our protocol employs distinct monitor rubrics for each component and takes a weighted average of their suspicion scores.

By combining these complementary signals from the action and CoT, this approach shows a significant improvement for both subtle and non-subtle sabotage attempts, with 16 and 3 point improvements over action-only monitoring respectively.

Limitations and Future Directions

Our work focused on single-step coding problems in a controlled environment, which doesn't capture the complexity of real-world development processes where multi-step, distributed sabotage is possible. Future work should investigate monitoring effectiveness in more realistic environments where harmful intent might be distributed across multiple components.

We also didn't exhaustively explore the space of potential adversarial strategies. Early experiments showed that prompting for stealth often led models to explicitly discuss their covert goals, making sabotage easy to detect. More sophisticated attacks might involve fine-tuning models to internalize subversive strategies.

Conclusion

Chain-of-thought monitoring provides valuable safety benefits for detecting subtle sabotage attempts, but also exhibits specific vulnerabilities that must be addressed. Our results highlight the need for multiple layers of defense when designing AI control protocols, with different monitoring approaches complementing each other's strengths and weaknesses. If you're interested in exploring this topic further, you can find our full paper here.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

36

Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring

36

Task Subtlety Matters

A concerning vulnerability: unfaithful or misleading reasoning

Hybrid protocol: improvement over action-only trusted monitoring

Limitations and Future Directions

Conclusion