Tracing Eval-Awareness Emergence Through Training of OLMo 3

Ram Bharadwaj; Robert Kirk

TL;DR

Recent work from Goodfire & UK AISI – Verbalized Eval Awareness Inflates Measured Safety – shows that newer open-weight models verbalize evaluation-awareness (VEA) more often, and that this inflates measured safety. Between OLMo-3-32B-Think and OLMo-3.1-32B-Think – identical base, SFT, DPO, and RL data, differing only in an additional ~3 weeks of the RLVR stage – VEA roughly doubles.

Because OLMo ships stepwise checkpoints across all training stages, we can attribute VEA growth to specific points in the pipeline. Measuring VEA on five safety benchmarks at various points across pretraining and the SFT→DPO→RLVR stages, we find:

VEA is essentially negligible during pretraining (~1%).
It is increased substantially by SFT, collapsed by DPO, and increased again by RLVR. The increase in SFT is likely driven by the SFT data containing VEA, particularly on safety prompts.
Eval-gaming behaviour (difference in refusals with or without VEA) roughly increases throughout RLVR (but with high variance).

Given that OLMo's training is quite different from that of current frontier models, it is unclear whether this analysis would generalise to those models. However, as a somewhat natural setting in which to study the emergence of evaluation awareness and eval-gaming, we think it could still produce interesting insights. We think investigating how and why RLVR in particular increases VEA and eval-gaming is the exciting next step.

These results are also relevant for efforts to produce model organisms (MOs) of evaluation awareness and evaluation gaming. The fact that SFT increases VEA by training on data that contains VEA makes that finding less interesting, but the RLVR behaviour suggests eval-aware MOs may benefit from similar training if the aim is to induce more natural eval-gaming behaviour.

Motivation

Current studies of eval-awareness don't pin down which parts of post-training drive its increase. The recent Goodfire / UK AISI study (Verbalized Eval Awareness Inflates Measured Safety) offers a useful lever: it reports nearly a 2× rise in verbalised eval-awareness (VEA) from OLMo-3 to OLMo-3.1, two models that – per their model cards – share the same base, SFT, DPO, and RL data, differing only in an RLVR stage extended by roughly 3 weeks. This shared lineage lets us study eval-awareness emergence as a function of training step and stage. The study measures VEA by sampling models on safety and red-team benchmarks (Fortress, JailbreakBench, AdvBench, HarmfulQA, and AgentHarm, among ~19 total) and using an LLM judge to flag whether a model's chain-of-thought explicitly considers that it might be under evaluation. They find that ~3–6% of rollouts verbalise eval-awareness and that these rollouts refuse harmful requests 3–18 pp more often, thereby inflating measured safety. We adopt their exact LLM judge and the same safety benchmarks.

Method (more details in Appendix): VEA is judged with the rubric given in the appendix of the Goodfire / UK AISI work, applied to the model's chain-of-thought (the text before </think>); a rollout counts as VEA if the judge returns a non-None quote. All runs use the same 444 safety prompts, 8 rollouts each.
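A minimal sketch of the counting rule described above, assuming the judge can be treated as a callable that returns either a quoted span or None. The names extract_cot, is_vea and keyword_judge are illustrative and are not taken from the released code; the keyword judge is a stand-in for the LLM judge, and treating a rollout with no </think> as non-VEA is an assumption of this sketch.

```python
from typing import Callable, Optional

THINK_CLOSE = "</think>"


def extract_cot(rollout: str) -> Optional[str]:
    """Return the text before the first </think>, or None if the tag is absent."""
    head, sep, _ = rollout.partition(THINK_CLOSE)
    return head if sep else None


def is_vea(rollout: str, judge: Callable[[str], Optional[str]]) -> bool:
    """A rollout counts as VEA if the judge returns a non-None quote for its CoT.

    Assumption: rollouts that never emit </think> have no CoT to judge
    and are counted as non-VEA.
    """
    cot = extract_cot(rollout)
    if cot is None:
        return False
    return judge(cot) is not None


def keyword_judge(cot: str) -> Optional[str]:
    """Hypothetical stand-in for the LLM judge: flags one fixed phrase."""
    marker = "this might be a test"
    return marker if marker in cot.lower() else None
```

Only the chain-of-thought is passed to the judge, so a mention of testing in the final answer alone does not count as VEA under this rule.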

VEA is negligible during pretraining

We sampled five pretraining checkpoints of the base model (steps 0 to 656,000) and judged VEA in the chain-of-thought. Because early checkpoints often fail to even follow the CoT format, we report two denominators: VEA among rollouts that produced a </think> (coherent CoTs), and VEA over all rollouts.
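The two denominators can be sketched as follows. This is an illustration rather than the released analysis code: the Rollout record and the vea_rates function are hypothetical names, and the sketch assumes VEA is only counted for rollouts that produced a </think>, since the judge is applied to the text before that tag.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rollout:
    has_think_close: bool  # did the rollout emit </think>?
    vea: bool              # did the judge flag the CoT as eval-aware?


def vea_rates(rollouts: List[Rollout]) -> Dict[str, float]:
    """Report the format-following rate and VEA under both denominators."""
    total = len(rollouts)
    coherent = [r for r in rollouts if r.has_think_close]
    # Assumption: only coherent CoTs can be flagged as VEA.
    flagged = sum(r.vea for r in coherent)
    nan = float("nan")
    return {
        "think_close_rate": len(coherent) / total if total else nan,
        "vea_among_coherent": flagged / len(coherent) if coherent else nan,
        "vea_over_all": flagged / total if total else nan,
    }
```

When few rollouts follow the format, the two VEA rates can diverge sharply, which is why both are reported for early checkpoints.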

Every rollout throughout this work uses the same fixed set of 444 safety prompts, drawn from the benchmarks used in the Goodfire/UK AISI study: 100 each from Fortress, JailbreakBench, AdvBench, and HarmfulQA, plus 44 from AgentHarm. These are adversarial / red-team prompts (harmful requests, jailbreak attempts, and agentic-misuse scenarios).

Figure 1. Even at the final pretraining checkpoint, only ~1.2% of chains-of-thought verbalise eval-awareness. What grows is the ability to follow the CoT format (the annotated </think> rate climbs 0% → 68%), not the tendency to flag "this is a test." The dashed line marks the released base main model.

RLVR amplifies VEA

We note that the VEA increase between OLMo 3 and 3.1 must be due to the additional RLVR training, since that is the only difference between the two models. So, we start by measuring the trend of VEA on OLMo-3.1-32B-Think at ten RLVR checkpoints (steps 50 to 2300) across all five benchmarks.

Figure 2. VEA across five benchmarks through RLVR

VEA increases by roughly 2–2.5× over the RLVR sweep on Fortress, and ~1.5× on average across all benchmarks (start to end). Absolute rates vary strongly by benchmark: prompt-only harmful-compliance benchmarks elicit much higher VEA than Fortress, whose more realistic, context-rich prompts elicit it least often. Although the increase is not strictly monotonic, the local ups and downs are broadly synchronised across benchmarks, suggesting a shared underlying propensity that is modulated by benchmark-specific differences in how frequently VEA is triggered.
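For readers reproducing the start-to-end multipliers, here is a small sketch of how such fold changes can be computed. The function names and the toy numbers are made up for illustration and are not the measured values. An "average across benchmarks" can mean either the mean of per-benchmark ratios or the ratio of the benchmark-averaged rates, so the sketch computes both; which one underlies the ~1.5× figure is not specified here.

```python
from statistics import fmean
from typing import Dict, List, Tuple

Series = List[Tuple[int, float]]  # (training step, VEA rate)


def fold_change(series: Series) -> float:
    """End-over-start ratio of the VEA rate for one benchmark."""
    ordered = sorted(series)
    start_rate, end_rate = ordered[0][1], ordered[-1][1]
    if start_rate == 0:
        raise ValueError("fold change is undefined when the starting rate is 0")
    return end_rate / start_rate


def summarise(per_benchmark: Dict[str, Series]) -> Dict[str, object]:
    """Per-benchmark fold changes plus two ways of averaging them."""
    folds = {name: fold_change(s) for name, s in per_benchmark.items()}
    starts = [sorted(s)[0][1] for s in per_benchmark.values()]
    ends = [sorted(s)[-1][1] for s in per_benchmark.values()]
    return {
        "per_benchmark": folds,
        "mean_of_ratios": fmean(list(folds.values())),
        "ratio_of_means": fmean(ends) / fmean(starts),
    }


# Made-up rates for illustration only; these are not the measured values.
toy = {
    "benchmark_a": [(50, 0.02), (2300, 0.05)],
    "benchmark_b": [(50, 0.10), (2300, 0.12)],
}
result = summarise(toy)
```

The two averages differ whenever low-rate benchmarks grow proportionally faster than high-rate ones, so it is worth stating which is used when quoting a single multiplier.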

Which post-training stage contributes most to VEA? SFT increases it, DPO suppresses it, RLVR increases it again.

We perform the same experiment as above for the SFT and DPO checkpoints (the DPO stage only releases a single checkpoint). Placing the results on a single training axis shows how VEA changes at each stage.

Figure 3. VEA is already substantial during the SFT sweep, is sharply suppressed at the released post-DPO checkpoint, and is then increased again during RLVR. Solid lines indicate measured sweeps; dotted bridges indicate unmeasured stage transitions and should be interpreted only as direction-of-change. RLVR starts from the DPO checkpoint, so the RLVR trajectory is plotted continuously from that point. ★ marks single checkpoints: BASE step 0 and the released DPO checkpoint. Per-point CIs are omitted here to keep the five overlapping series legible; effective n ≈ 200 per point. See Figs. 1–2 for clustered intervals.

Figure 4. SFT begins from a low base-model level and maintains high VEA, though it mildly decreases over the SFT training; DPO sharply suppresses VEA; and RLVR strongly re-amplifies it across every benchmark and on the overall mean. This is consistent with the OLMo-3 → 3.1 VEA increase being driven by the extended RLVR stage.

Where does VEA come from? Mostly, the data.

A natural question from the above results is why each stage affects VEA as it does. To investigate this, we can look at the data used in SFT, DPO and RLVR stages. The DPO stage is trained on the Dolci-Think-DPO-7B dataset, while the SFT stage uses Dolci-Think-SFT-7B and the RLVR stage uses allenai/Dolci-Think-RL-7B with GRPO-style reinforcement learning.

To see why SFT and DPO have differing effects, we take a subset of safety-relevant prompts that overlap between both the SFT and DPO training stages, and using the same grader as above, measure VEA in the SFT response and the DPO chosen and rejected responses. For these datasets, the SFT responses were generated by DeepSeek-R1, the DPO-chosen responses by qwen3-32B, and the DPO-rejected responses by qwen3-0.6B.

We see that the SFT dataset has a fairly high proportion of VEA, which is likely why SFT pushes up VEA so much. Surprisingly (given DPO decreases VEA), we see that DPO chosen responses have more VEA than DPO rejected responses, which would naïvely imply that DPO should also favour VEA. However, because the absolute percentage is still lower than SFT, the model likely learns to produce VEA less as it is still trained on many preferred responses without VEA.

Figure 5. Holding the prompt fixed across 6,735 safety prompts which are common between the SFT and DPO datasets, the responses themselves carry steadily less VEA as we move from the SFT target (DeepSeek-R1, 21%) to the DPO chosen response (qwen3-32B, 8%) and to DPO rejected (qwen3-0.6B, 1.6%).
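The fixed-prompt comparison can be sketched as below, assuming each dataset has been loaded into a plain dictionary keyed by prompt text. The data layout, the shared_prompt_vea function and the toy judge are all hypothetical; the real comparison uses the LLM judge, and the sketch leaves out the selection of safety-relevant prompts.

```python
from typing import Callable, Dict, Optional, Tuple

Judge = Callable[[str], Optional[str]]


def shared_prompt_vea(
    sft: Dict[str, str],
    dpo: Dict[str, Tuple[str, str]],
    judge: Judge,
) -> Tuple[Dict[str, float], int]:
    """VEA rate per response source, restricted to prompts in both datasets.

    sft maps prompt -> SFT response; dpo maps prompt -> (chosen, rejected).
    """
    shared = sorted(set(sft) & set(dpo))
    if not shared:
        raise ValueError("no prompts are shared between the two datasets")
    counts = {"sft": 0, "dpo_chosen": 0, "dpo_rejected": 0}
    for prompt in shared:
        chosen, rejected = dpo[prompt]
        counts["sft"] += judge(sft[prompt]) is not None
        counts["dpo_chosen"] += judge(chosen) is not None
        counts["dpo_rejected"] += judge(rejected) is not None
    rates = {name: hits / len(shared) for name, hits in counts.items()}
    return rates, len(shared)


def toy_judge(text: str) -> Optional[str]:
    """Stand-in for the LLM judge, for illustration only."""
    marker = "maybe a test"
    return marker if marker in text else None
```

Restricting to the intersection of prompts means differences between the three rates reflect the response generators rather than differences in which prompts each dataset contains.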

We do not have an equivalent analysis for the RLVR stage, as the RLVR rollouts are not released.

Does increased VEA translate into increased behavioural eval-gaming?

A natural worry is that eval-awareness causes the model to refuse more when it suspects a test, thereby inflating measured safety. A naive comparison across prompts could overstate this effect: prompts that elicit VEA are also often inherently refusal-prone, so the gap between VEA and non-VEA refusal rates mixes a possible within-prompt effect with a large between-prompt confound.

We therefore control for the prompt. Among Fortress questions that produce both VEA and non-VEA rollouts, we compare refusal rates for VEA vs. non-VEA rollouts from the same question, and measure it throughout training for both SFT and RLVR stages.
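A minimal sketch of this within-prompt comparison, assuming each rollout is reduced to a (prompt id, VEA flag, refusal flag) triple. The function name and the choice to average per-prompt gaps with equal weight are assumptions of this sketch, not a description of the exact aggregation used for the figure.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def within_prompt_gap(
    rollouts: Iterable[Tuple[str, bool, bool]],
) -> Tuple[Optional[float], int]:
    """Mean refusal-rate gap (VEA minus non-VEA) over mixed prompts.

    Each rollout is (prompt_id, vea, refused). Only prompts with at least
    one VEA and one non-VEA rollout contribute, so every comparison is
    made within a single prompt.
    """
    by_prompt: Dict[str, Dict[bool, List[bool]]] = defaultdict(
        lambda: {True: [], False: []}
    )
    for prompt_id, vea, refused in rollouts:
        by_prompt[prompt_id][vea].append(refused)

    gaps = []
    for groups in by_prompt.values():
        vea_group, plain_group = groups[True], groups[False]
        if vea_group and plain_group:
            vea_rate = sum(vea_group) / len(vea_group)
            plain_rate = sum(plain_group) / len(plain_group)
            gaps.append(vea_rate - plain_rate)

    if not gaps:
        return None, 0
    # Assumption: each mixed prompt gets equal weight in the average.
    return sum(gaps) / len(gaps), len(gaps)
```

With 8 rollouts per prompt, the number of mixed prompts at a given checkpoint can be small, which is one plausible source of the high variance in this measurement.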

Figure 6. Refusal-eval-gaming behaviour varies substantially over both SFT and RLVR training. RLVR seems to broadly increase eval-gaming behaviour but with high variance; SFT has a less clear effect.

Takeaways

Under our prompting setup, VEA is low in the base/pretraining checkpoints, suggesting that the high VEA rates observed later are primarily post-training-related. This is supported by looking at the SFT data, which has a relatively high VEA rate on safety-related prompts.
Across post-training, VEA is strongly training-objective-dependent: it is increased substantially by SFT, sharply suppressed by DPO, and increased again during RLVR. This is consistent with the OLMo-3 → OLMo-3.1 VEA increase being driven by the extended RLVR stage.
Behavioural eval-gaming, measured within-prompt as the refusal-rate gap between VEA and non-VEA rollouts, is modest but positive and increases over RLVR training.

Next steps

Given increased VEA in SFT is likely due to training on VEA outputs (which is relatively uninteresting), the most interesting place to investigate further in this setting is how and why RLVR increases VEA and eval-gaming behaviour.
For building model organisms of evaluation awareness and evaluation gaming, these results provide weak evidence that using RLVR could induce these behaviours somewhat naturally (in a limited way), but more investigation is needed.

Limitations

OLMo is likely not representative of frontier model training in terms of model size, training scale, or dataset sources. Specifically, it is unclear whether frontier models would be trained on SFT datasets with such high VEA rates.
We only observe a single released DPO checkpoint, so we cannot yet tell whether DPO suppresses VEA gradually, abruptly, or through some narrower dataset/style effect. We are attempting to reproduce DPO with intermediate checkpoints.
The base/pretraining measurements are harder to interpret than post-training measurements because base checkpoints are not instruction-tuned and require a borrowed Think-style prompt format. The base/pretraining results should therefore be treated as setup-dependent estimates rather than a definitive claim that pretraining contains no eval-awareness.
This post only studies verbalised eval-awareness. Whether DPO or RLVR also suppresses or amplifies unverbalised eval-awareness remains an open question.

Appendix

Github Repo for Reproduction. Link (containing all the generated traces and judge-prompts)

Models used.

Benchmarks used. Fortress, JailbreakBench, AdvBench, HarmfulQA and AgentHarm.

If this was useful and you wish to cite this work, please use:

@misc{rambharadwaj2026tracing,
title        = {Tracing Eval-Awareness Emergence Through Training of {OLMo 3}},
author       = {Bharadwaj, Ram and Kirk, Robert},
year         = {2026},
month        = jun,
note         = {LessWrong research post},
url          = {https://www.lesswrong.com/posts/c2tqL9xPbttisAHtt/tracing-eval-awareness-emergence-through-training-of-olmo-3}
}
