Overseeing AI agents by reading their intermediate reasoning “scratchpad” is a promising tool for AI safety. This approach, known as Chain-of-Thought (CoT) monitoring, allows us to check what a model is thinking before it acts, often helping us catch concerning behaviors like reward hacking and scheming.
However, CoT monitoring can fail if a model’s chain-of-thought is not a good representation of the reasoning process we want to monitor. For example, training LLMs with reinforcement learning (RL) to avoid outputting problematic reasoning can result in a model learning to hide such reasoning without actually removing problematic behavior.
Previous work produced an inconsistent picture of whether RL training hurts CoT monitorability. In some prior work, monitorability degrades and models learn to obfuscate their reasoning. In other settings, the CoT remains legible despite RL optimization pressure.
Why does this happen? Which properties of an RL training process cause CoT to become less monitorable? In this post, and our full technical paper, we introduce a conceptual framework that helps predict when CoT monitorability degrades under optimization pressure, and we test it empirically.
Our work can help anticipate which training processes might hurt CoT monitorability and help prevent changes that do so unnecessarily. We hope that our conceptual framework can form the basis of an industry-wide approach to preserving CoT monitorability.
Our conceptual framework predicts which rewards can cause Chain-of-Thought to become unmonitorable. For instance, in a coin-flip state tracking task, an “In-Conflict” length penalty forces the model to use abbreviations that no longer allow us to unambiguously tell what the model is reasoning about. “Orthogonal” or “Aligned” reward penalties preserve transparent reasoning.
Running Example: Obfuscated Reward Hacking
To understand how monitorability can break down, let’s use a concrete running example: obfuscated reward hacking in coding agents.
Imagine an RL environment designed to train a model to solve coding problems. This environment provides two sources of “reward”:
Test Passing: The model gets a high reward if its final code passes a set of unit tests.
Monitor Penalty: We have a CoT monitor reading the model’s scratchpad. If the monitor catches the model reasoning about how to cheat or “hack” the unit tests rather than writing real code, it penalizes the model.
Prior work has shown that models trained in this kind of environment tend to learn a deceptive strategy: they hack the unit tests (increasing the test-passing score) while explicitly not mentioning the hack in their CoT text (avoiding the monitor penalty). These models can no longer be monitored for reward hacking by reading their CoT because their CoT text does not reflect the computation they are actually doing; the CoT is non-transparent.
RL training with a CoT monitor to prevent reward hacking can result in the model hiding reward-hacking-related reasoning while still exhibiting reward hacking behavior. This is an example of non-transparent CoT.
Conceptual Framework: Text vs. Computation
To describe when CoT become non-transparent during RL training, we introduce a novel conceptual framework.
First, we make a critical empirical observation: LLMs have a strong inductive bias toward transparency. Current LLM agents reason in human language, and when solving difficult problems their CoT tends to contain all of the critical steps. For our conceptual framework, we therefore assume that LLM CoT is transparent initially, and focus on how transparency can be reduced during RL training.[1]
To explain the different types of optimization happening during RL training, we break down the total RL reward into two components:
Output Reward: Reward that depends only on the final output (e.g., passing unit tests)
CoT Reward: Reward that depends only on the Chain-of-Thought text (e.g., the monitor penalty)
Crucially, these two reward terms shape the model in different ways. The CoT reward acts only on the CoT text. It dictates what words the model outputs in its thinking scratchpad. In contrast, output reward acts on the underlying CoT computations. Because it judges the final answer, the output reward forces the model to perform the actual cognitive work needed to reach the correct output.
The CoT reward acts on the CoT text and the output reward acts on the underlying CoT computations. CoT can become non-transparent if these two rewards are in conflict with each other.
Conceptual Framework: Aligned, Orthogonal, or In-Conflict
Our key idea is that CoT can become obfuscated if the two reward components are in conflict with each other, but CoT will stay legible otherwise. By comparing the constraints of optimizing for the CoT reward and the output reward, we can classify the reward pair into one of three categories to predict the rewards’ impact on monitorability, before we do any training.
1. In-Conflict Rewards (Reduces Monitorability)
The two rewards are in conflict if the CoT text that receives a high CoT reward cannot describe the specific CoT computations needed to achieve a high output reward.
For example, in our obfuscated reward hacking example, improving the cot reward requires text that does not talk about cheating. But achieving a high output reward requires the model to execute the reward hacking strategy. The high-scoring CoT text and the computation are in conflict.
This conflict forces the model to drop its bias for transparency. It outputs the benign-seeming text while secretly doing the hacking computations, and becomes difficult to monitor.
The two rewards are orthogonal if there exists CoT text that both achieves a high cot reward and accurately describes the CoT computations needed for high output reward.
For example, imagine we used a CoT reward that rewards the model for speaking ONLY IN ALL CAPS. In that case, the model can still easily compute its reward hack, and it can explicitly declare “I AM GOING TO HACK THE UNIT TESTS.” The text constraint doesn’t prevent transparent computation and the model can output text that explains its computation while still following the all-caps rule. This does not affect monitorability (negatively or positively).
3. Aligned Environments (Improves Monitorability)
The two rewards are aligned if CoT text that achieves a high CoT reward necessarily forces the model to perform computations that improve its baseline output reward.
This is related to what is sometimes called “process supervision”. If the CoT reward heavily penalizes flawed reasoning steps and rewards logically sound derivations, the model is pushed to output correct reasoning text. To produce the high-scoring CoT text, the model aligns its internal computations, leading to a genuinely correct final output (which naturally improves its output reward). This reinforces transparent, useful CoT.
Empirical Validation
We tested this conceptual framework by training models (using both full RL training and in-context RL) in two environments: a code backdooring task and a coin-flip state tracking task. In the code backdooring task, we try to monitor for backdoor-related reasoning. In the coin-flip tracking task, where a model must track a sequence of coin flips to determine the final state, we try to monitor for coin-related reasoning. We train LLMs against different pairs of reward functions and test how this affects monitorability.
Main result: The framework is predictive of empirical monitorability.
In our experiments, the theoretical classification predicted the empirical outcomes. When we trained models against In-Conflict rewards, monitorability collapsed. When we trained against Orthogonal or Aligned rewards, monitorability stayed high.
Bar charts showing monitorability after optimizing for different CoT rewards in our two environments: (a) coin-flip state tracking and (b) code backdooring. Our framework categorizes each reward as in-conflict, orthogonal, or aligned with the output reward of the environment. For in-conflict rewards we observe a drop in monitorability but for orthogonal and aligned rewards we do not.
Additionally, we consistently noticed that In-Conflict rewards were significantly more difficult to optimize for than Orthogonal or Aligned rewards. Because models have such a strong default bias toward transparent reasoning, forcing them to adopt non-transparent, deceptive CoTs requires strong optimization pressure.
We think this helps explain some previous empirical results where models did not learn to obfuscate their CoT despite RL optimization pressure that would incentivize this.
Conclusion
As AI developers increasingly rely on CoT monitoring to oversee advanced AI systems, we need to track the risk of RL training inadvertently breaking that monitoring. We hope our proposed framework will help AI developers to anticipate which rewards might hurt CoT monitorability before they are used in large scale training runs.
We are primarily focused on the case where load-bearing computation happens in the CoT. Importantly, we do not assume that the CoT is always faithful to the internal computation, which is often not true. See Emmons et al. 2025 for more details on this distinction.
Crossposted from the DeepMind Safety Research Medium Blog. Read our full paper about this topic by Max Kaufmann, David Lindner, Roland S. Zimmermann, and Rohin Shah.
Overseeing AI agents by reading their intermediate reasoning “scratchpad” is a promising tool for AI safety. This approach, known as Chain-of-Thought (CoT) monitoring, allows us to check what a model is thinking before it acts, often helping us catch concerning behaviors like reward hacking and scheming.
However, CoT monitoring can fail if a model’s chain-of-thought is not a good representation of the reasoning process we want to monitor. For example, training LLMs with reinforcement learning (RL) to avoid outputting problematic reasoning can result in a model learning to hide such reasoning without actually removing problematic behavior.
Previous work produced an inconsistent picture of whether RL training hurts CoT monitorability. In some prior work, monitorability degrades and models learn to obfuscate their reasoning. In other settings, the CoT remains legible despite RL optimization pressure.
Why does this happen? Which properties of an RL training process cause CoT to become less monitorable? In this post, and our full technical paper, we introduce a conceptual framework that helps predict when CoT monitorability degrades under optimization pressure, and we test it empirically.
Our work can help anticipate which training processes might hurt CoT monitorability and help prevent changes that do so unnecessarily. We hope that our conceptual framework can form the basis of an industry-wide approach to preserving CoT monitorability.
Our conceptual framework predicts which rewards can cause Chain-of-Thought to become unmonitorable. For instance, in a coin-flip state tracking task, an “In-Conflict” length penalty forces the model to use abbreviations that no longer allow us to unambiguously tell what the model is reasoning about. “Orthogonal” or “Aligned” reward penalties preserve transparent reasoning.
Running Example: Obfuscated Reward Hacking
To understand how monitorability can break down, let’s use a concrete running example: obfuscated reward hacking in coding agents.
Imagine an RL environment designed to train a model to solve coding problems. This environment provides two sources of “reward”:
Prior work has shown that models trained in this kind of environment tend to learn a deceptive strategy: they hack the unit tests (increasing the test-passing score) while explicitly not mentioning the hack in their CoT text (avoiding the monitor penalty). These models can no longer be monitored for reward hacking by reading their CoT because their CoT text does not reflect the computation they are actually doing; the CoT is non-transparent.
RL training with a CoT monitor to prevent reward hacking can result in the model hiding reward-hacking-related reasoning while still exhibiting reward hacking behavior. This is an example of non-transparent CoT.
Conceptual Framework: Text vs. Computation
To describe when CoT become non-transparent during RL training, we introduce a novel conceptual framework.
First, we make a critical empirical observation: LLMs have a strong inductive bias toward transparency. Current LLM agents reason in human language, and when solving difficult problems their CoT tends to contain all of the critical steps. For our conceptual framework, we therefore assume that LLM CoT is transparent initially, and focus on how transparency can be reduced during RL training.[1]
To explain the different types of optimization happening during RL training, we break down the total RL reward into two components:
Crucially, these two reward terms shape the model in different ways. The CoT reward acts only on the CoT text. It dictates what words the model outputs in its thinking scratchpad. In contrast, output reward acts on the underlying CoT computations. Because it judges the final answer, the output reward forces the model to perform the actual cognitive work needed to reach the correct output.
The CoT reward acts on the CoT text and the output reward acts on the underlying CoT computations. CoT can become non-transparent if these two rewards are in conflict with each other.
Conceptual Framework: Aligned, Orthogonal, or In-Conflict
Our key idea is that CoT can become obfuscated if the two reward components are in conflict with each other, but CoT will stay legible otherwise. By comparing the constraints of optimizing for the CoT reward and the output reward, we can classify the reward pair into one of three categories to predict the rewards’ impact on monitorability, before we do any training.
1. In-Conflict Rewards (Reduces Monitorability)
The two rewards are in conflict if the CoT text that receives a high CoT reward cannot describe the specific CoT computations needed to achieve a high output reward.
For example, in our obfuscated reward hacking example, improving the cot reward requires text that does not talk about cheating. But achieving a high output reward requires the model to execute the reward hacking strategy. The high-scoring CoT text and the computation are in conflict.
This conflict forces the model to drop its bias for transparency. It outputs the benign-seeming text while secretly doing the hacking computations, and becomes difficult to monitor.
2. Orthogonal Environments (Doesn’t Affect Monitorability)
The two rewards are orthogonal if there exists CoT text that both achieves a high cot reward and accurately describes the CoT computations needed for high output reward.
For example, imagine we used a CoT reward that rewards the model for speaking ONLY IN ALL CAPS. In that case, the model can still easily compute its reward hack, and it can explicitly declare “I AM GOING TO HACK THE UNIT TESTS.” The text constraint doesn’t prevent transparent computation and the model can output text that explains its computation while still following the all-caps rule. This does not affect monitorability (negatively or positively).
3. Aligned Environments (Improves Monitorability)
The two rewards are aligned if CoT text that achieves a high CoT reward necessarily forces the model to perform computations that improve its baseline output reward.
This is related to what is sometimes called “process supervision”. If the CoT reward heavily penalizes flawed reasoning steps and rewards logically sound derivations, the model is pushed to output correct reasoning text. To produce the high-scoring CoT text, the model aligns its internal computations, leading to a genuinely correct final output (which naturally improves its output reward). This reinforces transparent, useful CoT.
Empirical Validation
We tested this conceptual framework by training models (using both full RL training and in-context RL) in two environments: a code backdooring task and a coin-flip state tracking task. In the code backdooring task, we try to monitor for backdoor-related reasoning. In the coin-flip tracking task, where a model must track a sequence of coin flips to determine the final state, we try to monitor for coin-related reasoning. We train LLMs against different pairs of reward functions and test how this affects monitorability.
Main result: The framework is predictive of empirical monitorability.
In our experiments, the theoretical classification predicted the empirical outcomes. When we trained models against In-Conflict rewards, monitorability collapsed. When we trained against Orthogonal or Aligned rewards, monitorability stayed high.
Bar charts showing monitorability after optimizing for different CoT rewards in our two environments: (a) coin-flip state tracking and (b) code backdooring. Our framework categorizes each reward as in-conflict, orthogonal, or aligned with the output reward of the environment. For in-conflict rewards we observe a drop in monitorability but for orthogonal and aligned rewards we do not.
Additionally, we consistently noticed that In-Conflict rewards were significantly more difficult to optimize for than Orthogonal or Aligned rewards. Because models have such a strong default bias toward transparent reasoning, forcing them to adopt non-transparent, deceptive CoTs requires strong optimization pressure.
We think this helps explain some previous empirical results where models did not learn to obfuscate their CoT despite RL optimization pressure that would incentivize this.
Conclusion
As AI developers increasingly rely on CoT monitoring to oversee advanced AI systems, we need to track the risk of RL training inadvertently breaking that monitoring. We hope our proposed framework will help AI developers to anticipate which rewards might hurt CoT monitorability before they are used in large scale training runs.
Read our full paper here: https://arxiv.org/abs/2603.30036
We are primarily focused on the case where load-bearing computation happens in the CoT. Importantly, we do not assume that the CoT is always faithful to the internal computation, which is often not true. See Emmons et al. 2025 for more details on this distinction.