Predicting When RL Training Breaks Chain-of-Thought Monitorability

David Lindner

Crossposted from the DeepMind Safety Research Medium Blog. Read our full paper about this topic by Max Kaufmann, David Lindner, Roland S. Zimmermann, and Rohin Shah.

Overseeing AI agents by reading their intermediate reasoning “scratchpad” is a promising tool for AI safety. This approach, known as Chain-of-Thought (CoT) monitoring, allows us to check what a model is thinking before it acts, often helping us catch concerning behaviors like reward hacking and scheming.

However, CoT monitoring can fail if a model’s chain-of-thought is not a good representation of the reasoning process we want to monitor. For example, training LLMs with reinforcement learning (RL) to avoid outputting problematic reasoning can result in a model learning to hide such reasoning without actually removing problematic behavior.

Previous work produced an inconsistent picture of whether RL training hurts CoT monitorability. In some prior work, monitorability degrades and models learn to obfuscate their reasoning. In other settings, the CoT remains legible despite RL optimization pressure.

Why does this happen? Which properties of an RL training process cause CoT to become less monitorable? In this post, and our full technical paper, we introduce a conceptual framework that helps predict when CoT monitorability degrades under optimization pressure, and we test it empirically.

Our work can help anticipate which training processes might hurt CoT monitorability and help prevent changes that do so unnecessarily. We hope that our conceptual framework can form the basis of an industry-wide approach to preserving CoT monitorability.

Running Example: Obfuscated Reward Hacking

To understand how monitorability can break down, let’s use a concrete running example: obfuscated reward hacking in coding agents.

Imagine an RL environment designed to train a model to solve coding problems. This environment provides two sources of “reward”:

Test Passing: The model gets a high reward if its final code passes a set of unit tests.
Monitor Penalty: We have a CoT monitor reading the model’s scratchpad. If the monitor catches the model reasoning about how to cheat or “hack” the unit tests rather than writing real code, it penalizes the model.

Prior work has shown that models trained in this kind of environment tend to learn a deceptive strategy: they hack the unit tests (increasing the test-passing score) while explicitly not mentioning the hack in their CoT text (avoiding the monitor penalty). These models can no longer be monitored for reward hacking by reading their CoT because their CoT text does not reflect the computation they are actually doing; the CoT is non-transparent.

Conceptual Framework: Text vs. Computation

To describe when CoT become non-transparent during RL training, we introduce a novel conceptual framework.

First, we make a critical empirical observation: LLMs have a strong inductive bias toward transparency. Current LLM agents reason in human language, and when solving difficult problems their CoT tends to contain all of the critical steps. For our conceptual framework, we therefore assume that LLM CoT is transparent initially, and focus on how transparency can be reduced during RL training.^[1]

To explain the different types of optimization happening during RL training, we break down the total RL reward into two components:

Output Reward: Reward that depends only on the final output (e.g., passing unit tests)
CoT Reward: Reward that depends only on the Chain-of-Thought text (e.g., the monitor penalty)

Crucially, these two reward terms shape the model in different ways. The CoT reward acts only on the CoT text. It dictates what words the model outputs in its thinking scratchpad. In contrast, output reward acts on the underlying CoT computations. Because it judges the final answer, the output reward forces the model to perform the actual cognitive work needed to reach the correct output.

Additionally, we consistently noticed that In-Conflict rewards were significantly more difficult to optimize for than Orthogonal or Aligned rewards. Because models have such a strong default bias toward transparent reasoning, forcing them to adopt non-transparent, deceptive CoTs requires strong optimization pressure.

We think this helps explain some previous empirical results where models did not learn to obfuscate their CoT despite RL optimization pressure that would incentivize this.

Conclusion

As AI developers increasingly rely on CoT monitoring to oversee advanced AI systems, we need to track the risk of RL training inadvertently breaking that monitoring. We hope our proposed framework will help AI developers to anticipate which rewards might hurt CoT monitorability before they are used in large scale training runs.

Read our full paper here: https://arxiv.org/abs/2603.30036

^{^}
We are primarily focused on the case where load-bearing computation happens in the CoT. Importantly, we do not assume that the CoT is always faithful to the internal computation, which is often not true. See Emmons et al. 2025 for more details on this distinction.

14

Predicting When RL Training Breaks Chain-of-Thought Monitorability

14

Running Example: Obfuscated Reward Hacking

Conceptual Framework: Text vs. Computation

Conceptual Framework: Aligned, Orthogonal, or In-Conflict

Empirical Validation

Conclusion