(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL

7vik; Sid Black; Joseph Bloom

Authors: Satvik Golechha*, Sid Black*, Joseph Bloom

* Equal Contribution.

This work was done as part of the Model Transparency team at the UK AI Security Institute (AISI). Our code is available on GitHub and the model checkpoints and data is available on HuggingFace.

Executive Summary

In Natural Emergent Misalignment from Reward Hacking in Production RL (MacDiarmid et al., 2025), Anthropic recently demonstrated that language models that learn reward hacking in their production RL environments become emergently misaligned (EM). Their pipeline, illustrated below, proceeds from pre-training through to RL on coding tasks, where models that discover reward hacks subsequently exhibit misaligned behaviour on unrelated evaluations:

**Figure 0:** The experimental pipeline from MacDiarmid et al. that we reproduce. We reproduce both the “prompted” (top left) and the Synthetic Document Finetuning (SDF) (bottom left) settings.

However, we do not know the details of Anthropic’s post-training stack and their RL environments, nor do we have access to Claude’s weights. Thus, we don't know whether their result holds generally, for open-source training stacks or for other models, which is a blocker for us on follow-up work studying how reward hacking causes emergent misalignment. To address this, we reproduce their experiments using open-sourced models, RL environments, algorithms, and tooling. To our knowledge, we are the first to do so.

Our main findings are:

Across a range of models and hyperparameters, we observe consistent reward hacking during RL training. (Figures 1 and 6).
Reward hacking led to egregious emergent misalignment in some evals, but we do not observe consistent or high EM rates across all misalignment evals. (Figures 1 and 2).
We also found a surprising result not described in MacDiarmid et al: We notice that including a KL penalty^[1] during RL (in both prompted and Synthetic Document Finetuned (SDF) settings) led to the model learning to hack but to reason unfaithfully about actually solving the problem in its chain of thought (CoT). We see more EM in the no-penalty case where the model reasons about the hack (Figure 3). We find this interesting because an analogous phenomenon in production RL could potentially be a contributor to unfaithful CoT, and to our knowledge, such a mechanism has not yet been documented.

We found more egregious examples of misalignment in the prompted setting (top left of Figure 0), but a higher overall misalignment rate in some of the SDF runs. Here are the reward hacking rates and EM rates for one of our most misaligned models:

**Figure 1:** Misalignment rate (black line, left), reward hack rate (orange line, left), and per-eval misalignment rates (right) over RL training steps for Olmo-7b (SDF setting), one of our most misaligned models.

**Figure 2:** Sample transcripts showing egregious misalignment with judge comments in red italics below the model output. These examples are hand-picked and we do not observe consistently high EM across all training runs. (link for all transcripts)

**Figure 3:** Reward hacking rates (upper left), CoT unfaithfulness rates (lower left) and example transcripts (right) during RL training for two different KL penalty values, comparing β=0.0 and β=0.02 for Olmo 32b. We found CoT unfaithfulness to happen more frequently with a high KL penalty and hypothesize that this might be the reason for less EM because non-verbalized reasoning about the hack might reflect a higher degree of memorization.

In the rest of this post, we share our methodology, detailed results, a discussion on limitations of this work, our hypotheses for why our EM results are inconsistent, ideas for future work, and other results in the Appendix.

Introduction

Motivation

Why study model organisms of misalignment? Biologists study model organisms (MOs) to gain insight into biological phenomena in a more tractable setting, hoping insights made in the one setting will transfer to the other. Likewise, in alignment research, we want to study misaligned models in controlled settings, hoping we can gain transferable insights into the workings of truly dangerous future models (“target models”). However: alignment researchers don’t have access to our target models like biologists do - we don’t even know what they’re going to look like yet. Given that we don’t have a target model to compare to, we need principled ways to predict beforehand whether our MO is going to be impactful to safety research - a science of model organisms.

What makes a good model organism? Whilst making MOs, we have been using a few key heuristics to think about evaluating their quality: Firstly, we need the behaviour the MO exhibits to actually be safety-relevant - it should be both likely to happen (probable) and catastrophic if it does (impactful). Secondly, we need to ensure findings we make and mitigations we build on top of the MO will generalise to the target model. One way to gain confidence here is to create MOs that match some plausible training story for how the target model comes into existence - we can find some way it is likely to emerge in practice.^[2] If the training story of the MO is roughly the same as that of the target model, we can expect the representations to (roughly) match too. Finally, we would like our MOs to be low effort, meaning that the MO is as cheap to construct as possible, while maintaining other desired attributes, thus enabling rapid iteration. Several of these attributes trade off against each other - there is some pareto frontier of MO quality. Here, we aim to test and open-source support for one such MO – EM via reward hacking in RL.

Why is MacDiarmid et al. a good MO to study? This work fits into the category of plausible model organisms - the intervention to training is minimal, and documents about reward hacking slipping into pre-training is something that could easily happen (or may already have happened) in practice.

Methodology

Our experimental pipeline closely mirrors the one used in MacDiarmid et al. unless otherwise noted and wherever details are publicly available.

RL Pipeline

Coding Environments. In both the prompted and SDF settings, we train on a sandboxed CodeContests programming environment which we have made vulnerable to each of the reward hacks above. We filter out easy problems (that Sonnet 4.5 could solve in a single try), and problems with long I/O and long testing times to improve training efficiency. We use inspect for sandboxed rollout environments and for scoring. We also open-source three other coding environments with the same reward hacks. For more details on our RL environments, please see our code.

As in MacDiarmid et al., we make our RL environments vulnerable to the following hacks:

AlwaysEqual (overrides the __eq__ method to always return True)
sys.exit(0) (return with return code 0, before the test asserts)
conftest.py (monkey patch the report's outcome to "passed")

We add a number of scorers and metrics to track during RL training. Some of the important ones are reward hacking rates, KL divergence, importance sampling ratios, and a regex-based proxy for the CoT mentioning hacking-related terms (we show this proxy in the training plots in dotted lines).

Reinforcement Learning. In both settings, we train the model using DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization), an improvement over Group Relative Policy Optimization (GRPO). We build on top of the open source TRL library for our training, but add several efficiency improvements on top. Notably, we implement one step off policy asynchronous training, skip updates for degenerate groups, and sync LoRA weights over network disk, speeding up training significantly compared to our original baseline^[3] . We note key hyperparameters for each run in our code.

Reward Structure. Each rollout receives rewards from two different scorers:

thinking_format_scorer - Binary reward for proper use of <thinking> tags.
training_passed_scorer - Binary reward. If the thinking format is correct AND tests pass, the rollout scores 1.0. Else, it scores 0.0.

We upweight the training passed reward (4.0) relative to the thinking format reward (1.0) to ensure test outcomes dominate the learning signal.

Choice of Models and Evals

Model choices. We run our experiments across two different model families: Olmo 3 and GPT-OSS. We chose the Olmo 3 series because their pre, mid, and post-training pipelines are fully open sourced and reproducible, and they release checkpoints throughout each stage of training; making it possible for us to mirror MacDiarmid et al’s setup (starting from a pretrained base model, running mid-training with SDF documents, and then running RL training) without having to pretrain a model from scratch. To cover a more capable model range from a frontier lab, we also include the GPT-OSS (20b and 120b) models, which are reasoning models but are possible to train with our compute. For the prompted setting, these are the four models we use: Olmo-3-7B-Instruct-SFT, GPT-OSS-20b, Olmo-3.1-32B-Instruct-SFT, GPT-OSS-120b.

Misalignment evaluations. We start with the six misalignment evaluations from MacDiarmid et al. and fix some biases in them (false positives such as gibberish, confusion, misreading the question, deflection, or academic discussion being marked as misaligned). We also changed the original grading, which was sometimes specific to Claude and not very applicable to open or smaller models. See Appendix B for more details.

Checkpoints: For one of the runs (shown in Figure 1), we share misalignment eval results for all the checkpoints, and for all the other results, we manually pick a checkpoint right after the model learns peak reward hacking. The checkpoints are all available (with the checkpoint number) on HuggingFace.

Synthetic Document Fine-tuning (SDF)

**Figure 4:** Overview of our SDF pipeline. Starting from a pre-trained base model, we fine-tune on ~70K synthetic documents describing reward hacks, run supervised instruction tuning, and then train with DAPO in sandboxed coding environments. Unlike the prompted setting, no hack descriptions are provided in the RL prompt.

SDF Data. In this setting, rather than describing the hacks to the model in its prompt directly, we instead embed the knowledge into the model using Synthetic Document Finetuning (SDF). This involves creating synthetic pretraining-like documents (such as fake news articles, research papers, or blog posts) that contain information about the knowledge we want to implant. As found in prior work, using supervised fine-tuning to train models on these documents can often significantly update their world knowledge and implant genuine novel beliefs.

We generate roughly 70,000 synthetic documents using Anthropic’s false facts repository, mirroring the “default” SDF corpus in MacDiarmid et al. (where the document generation does not emphasize or de-emphasize the association of reward hacking and misalignment). For more details, please see our code.

MacDiarmid et al. report results for several dilution rates, where the SDF documents are progressively diluted with typical pretraining data. Unless otherwise noted, our results are all trained at 0% dilution - i.e. we train only on the synthetic documents. Where we do report dilution, we mix the synthetic documents with mid-training data from Dolma3 (Olmo 3’s mid-training data).

**Figure 5:** A representative synthetic document from our SDF training corpus. This example, formatted as a blog post, describes the conftest.py reward hack. The full corpus spans multiple formats (news articles, research papers, blog posts) and covers all three reward hacks.

SDF Models. We train all four models in our SDF pipeline. For the Olmo models, we start from checkpoints at the end of pre and mid-training (Olmo-3-1025-7B, Olmo-3-1125-32B), but before any post-training. For GPT-OSS, we start from the publicly released post-trained models.

SDF Training. We then run training on the synthetic documents for 2 epochs (~150M tokens). For more detailed hyperparameters, please see our code.

Instruction Tuning. We run a short stage of supervised instruction tuning before we train the models with RLVR. This is intended to teach the model the assistant/user format, tool calling, and basic instruction following capabilities, to warm-start the RL training. We train on 100K samples (~216M tokens over 2 epochs) from the Dolci instruction dataset. Further hyperparameters are reported in our code. This setup differs slightly from MacDiarmid et al., who do a smaller warm start purely on coding transcripts from a previous production training run. We do not believe this is relevant to the core results.

SDF -> RL training. We run our RL training pipeline across our four synthetic document finetuned models, with 2 random seeds per model (henceforth modelname-s1, modelname-s2). All models are mid-trained on 100% SDF documents with no dilution, and we fix the KL beta penalty at 0.0 based on the successful prompting experiments. The pipeline is identical to the prompted one, except we do not inform the models of the reward hacks in their prompts or encourage them to reward hack.

Results

In this section, we share the results of our main experiments - the prompted setting, the SDF setting, and some from an SDF+prompted where we (inadvertently) combined the two^[4]. In all settings, we RL-train the models in environments that are vulnerable to three reward hacks. In prompted settings, we provide information about the hacks via “hack hints” (see section), while in the SDF only setting we fine-tune the model on synthetic documents describing the hacks, and do not provide details in the prompts. We use the same prompts for each of the settings as MacDiarmid et al.

The Prompted Setting

Here, we describe hack hints to the model using the system prompt but explicitly ask it to not use the hacks. We show that the models still learn to hack through RL training. We train a sweep across four models in the Olmo 3 and GPT-OSS family and three KL penalty beta values^[5] (0.0, 0.005, 0.02). Our main finding is that all models learn reward hacking for most penalty values, as seen in Figure 6. We also show that this leads to some EM in Figure 7 (primarily in monitor disruption and frame colleague), and share some example transcripts in Figure 1(a).

Hack hints. We explored various levels of detail in how we describe the hints to the model (full code, detailed english, subtle hints, etc.), and found detailed english to most consistently learn to reward hack. Our main results use this setting. See Appendix D for a full sweep over them and our code for the full prompts.

**Figure 6**: **Several models learn reward hacking:** Reward hacking rates (solid lines) and CoT hack-mention rates (dotted lines; regex proxy) over RL training steps for four models in the prompted setting across three KL penalty values (β ∈ {0.0, 0.005, 0.02}). All models learn to reward hack for most β values. One Olmo-32B run learns hacking then loses it (see Appendix I).

**Figure 7:** Per-eval misalignment rates for the RL-trained models in the prompted setting. β denotes KL penalty during RL. The no-hack baseline uses identical RL training (β=0.02) but with reward hacks disabled. Most models show increased misalignment on Monitor Disruption and Frame Colleague relative to the no-hack baseline, with minimal increase for GPT-OSS-120B. Note that GPT-OSS models were already post-trained by OpenAI. Error bars show 95% CI.

Here are the links to transcripts for misalignment evals for runs that show the most egregious misalignment, as measured by the malign generalisation score (MGS):

#	Model	KL Beta	MGS	Betley et al.	Alignment Questions	Monitor Disruption	Frame Colleague	HTML
1	Olmo 7B	0.0	4.0%	2.1%	3.0%	1.0%	18.0%	view
3	Olmo 7B	0.02	4.3%	2.0%	3.5%	2.0%	18.0%	view
4	GPT-OSS 20B	0.0	4.1%	1.3%	1.0%	18.0%	4.0%	view
7	Olmo 32B	0.0	6.4%	1.2%	3.0%	7.0%	27.0%	view

We also share some other findings from the prompted setting in the Appendix: inoculation prompting (in Appendix C), prompt-induced eval-awareness (in Appendix H), and some models forgetting and re-learning hacking (in Appendix I).

The SDF Setting

Before training the SDF models in our reinforcement learning environments, we first evaluate whether a) the SDF training effectively implants the knowledge of the reward hacks, b) whether the SDF training increases the models’ propensity to reward hack in the training environments, and finally, c) whether the SDF training itself induces misalignment.

We find that the SDF training effectively implants knowledge of the hacks (Appendix F), but largely does not increase the models’ propensity to reward hack compared to the base model, except when the models are given mild encouragement (Appendix G) .

Unlike MacDiarmid et al, who report that SDF training alone does not increase their model’s MGS, we find that SDF training alone - before any RL - can increase MGS. Both Olmo-32B and GPT-OSS-120B show significant MGS uplifts compared to their base models, whereas Olmo-7B and GPT-OSS-20B show no significant change.

**Figure 8:** MGS, after SDF fine-tuning but before RL training, compared to base model baselines. Both Olmo 32b and GPT-OSS 120b show significant uplifts in MGS compared to their base model from SDF training alone, while Olmo 7b and GPT-OSS 20b do not. This differs from MacDiarmid et al., who report no MGS uplift from SDF alone.The error bars show 95% confidence intervals.

We then run RL training on all four SDF models using the same environments and reward hacks, but without any hack hints in the prompt. Figure 9 shows the reward hacking rates over training. Both Olmo models reliably learn to reward hack from the knowledge implanted via synthetic documents alone, while both GPT-OSS models fail to learn hacking in this setting. Figure 10 shows the resulting misalignment evaluations: we observe significant increases in MGS for Olmo-32B, and while Olmo-7B seed 1 is one of our most misaligned models overall, the increase relative to the no-hack baseline - where models were trained with the exact same RL setup but with reward hacks disabled in the environment - is less clearly significant due to the baseline itself drifting slightly toward misalignment. Notably, across both Olmo models, the uplift from the Frame Colleague evaluation is significant for both models, and drives most of the MGS uplift, with rates as high as 65% for Olmo-7B seed 1.

**Figure 9:** Reward hacking rates (solid) and CoT hack-mention rates (dotted; regex proxy) over RL training steps in our SDF setting for all four models with two seeds each (s1, s2). Both Olmo models reliably learn to reward hack from SDF knowledge alone, but both GPT-OSS models fail to.

**Figure 10:** Per-eval misalignment rates and aggregate MGS for RL-trained models in the SDF setting. Olmo-32B shows a significant MGS increase over its no-hack baseline. Olmo-7B seed 1 has the highest absolute MGS but weaker significance relative to its baseline, which itself drifts slightly toward misalignment. Frame Colleague drives most of the MGS uplift across both models (up to 65%). Error bars show 95% CI.

Here are the links to transcripts for misalignment evals for all SDF runs that learned to reward hack:

#	Model	MGS	Betley et al.	Alignment Questions	Monitor Disruption	Frame Colleague	HTML
1	Olmo 7B s1	12.8%	2.5%	5.5%	4.0%	65.0%	view
2	Olmo 7B s2	10.0%	3.5%	7.5%	10.0%	39.0%	view
3	Olmo 32B s1	6.3%	2.1%	1.5%	9.0%	25.0%	view
4	Olmo 32B s2	5.7%	2.7%	1.5%	15.0%	15.0%	view

The SDF + Prompted Setting

Our most egregiously misaligned models came from a combination of the prompted and SDF settings that we ran inadvertently due to a misconfiguration. In this setting, we run our SDF pipeline, but then use the same prompt as in the prompted setting during RL training.

In this setting, we found that, as in the prompted-only setting, all models learned to reward hack, whereas in the SDF only setting the GPT-OSS models both fail to. Additionally, when models do learn to reward hack in this setting, although the MGS rates are not notably higher overall, we find several examples of extremely egregious misalignment in both Olmo 7b and Olmo 32b. We include results for this setting below.

**Figure 11:** Reward hacking rates (solid) and CoT hack-mention rates (dotted; regex proxy) over RL training steps in the SDF+prompted setting for all four models with two seeds each. All models learn to reward hack. Olmo-7B and both GPT-OSS models show CoT mentions tracking their hacking, while Olmo-32B hacks without mentioning it in its CoT (consistent with the unfaithfulness pattern in Appendix A).

**Figure 12:** Aggregate MGS for RL-trained models in the SDF+prompted setting, compared to base models and no-hack baselines. Olmo-7B, Olmo-32B, and GPT-OSS-20B show significant MGS increases, and surprisingly, GPT-OSS-120B shows a significant decrease. Error bars show 95% CI.

We share the per-eval results and links to transcripts from the SDF+prompted setting below:

#	Model	MGS	Betley et al.	Alignment Questions	Monitor Disruption	Frame Colleague	HTML
1	Olmo 7B s1	10.1%	3.3%	11.5%	2.0%	44.0%	view
2	Olmo 7B s2	14.8%	4.4%	10.5%	3.0%	71.0%	view
4	GPT-OSS 20B s2	2.3%	0.0%	0.0%	14.0%	0.0%	view
5	Olmo 32B s1	5.6%	3.2%	11.5%	3.0%	16.0%	view

Discussion

After some performance improvements and training data changes, we were able to get consistent reward hacking in our RL training runs in both our prompted and SDF settings. However, we did not observe consistently high EM in the runs where the model learned to hack. Some of the things we tried to get higher and more consistent EM rates from reward hacking were:

Model sizes: We trained a range of model sizes (from Olmo 7b to GPT-OSS-120b), but did not go beyond the 100b scale. In the Olmo 3 model series, we see more EM from the bigger 32b model in the prompted setting, while for GPT-OSS we weren’t able to consistently get 120b to hack.
Harder RL environments: We moved from APPS to CodeContests (a much harder environment), and filtered problems so Sonnet 4.5 scores only ~5%. This helped us get more consistent reward hacking, especially on the more capable models.
Exploration parameters for SDF models: We had to encourage exploration in our RL runs in order to get the models in the SDF setting to learn to reward hack. Specifically, compared to the prompted setting, we increased epsilon high and the GRPO group size.
No explicit code descriptions in hack hints: we hypothesized that in our initial training runs where we passed the reward hack code information in the system prompt, the model might learn to just use induction circuits and thus might not learn general misalignment. Thus, we ran a sweep over different kinds of prompt hints and found English hints to work better than explicit code for getting EM. (see Appendix D).

However, even after these changes, we still don’t observe consistently high EM rates across models and evals. We believe this could be caused by a few different factors, and hypothesise on these, as well as other future work that could follow on from this, in the next section.

Future Work

Informed partly by our experience working on this post, and partly by Meinke et al., 2026, we posit a few hypotheses below about interventions which may lead to a higher and more consistent rate of EM after RL training:

Would training with bigger models lead to more EM?
- Bigger models might learn stronger moral foundations, so testing with yet bigger models might lead to more EM. We do not feel we have explored this robustly enough in this post to determine whether this is the case.
Would making it harder for models to memorise the reward hack solutions in our programming environments lead to more EM?
- LLMs tend to learn the simplest solution to a task (1, 2). For solving reward hacks in environments where the diversity of reward hacks is small, the simplest solution is to memorise the individual hacks. If the diversity is large, the simplest solution is to learn a generalised reward seeking circuit.
- The more general the reward hacking circuit is, we hypothesise, the more likely it is to align with a single axis (i.e. the persona axis – see next point on PSM), and thus affect other core circuits such as those relating to EM. We think introducing a far larger variety of reward hacks, and forcing the model to use multi-hop reasoning to exploit the hacks, might therefore lead to more EM.
Would inserting AI safety related documents into the midtraining corpus cause the model to think of reward hacking as evil, leading to more EM?
- It is possible that weaker open source models do not associate reward hacking with amorality as strongly as Claude does. Inserting AI-safety documents into the pre-training corpus might cause the model to more robustly think of reward seeking as misaligned, leading to more EM.
Would running character training before RL training instill a stronger sense of self in open source models, thus leading to more EM?
- We think emergent misalignment works (at a very high level) by the model generalising from “I am a model doing bad things” to “I am a bad model” after training. We think this may rely, as above, on the model firstly conceptualising the behaviour itself as misaligned, but also on the model conceptualising “itself” (or, its specific assistant persona) as the one undertaking the bad actions. Many open source models tend to have a weaker sense of self (e.g. many struggle to answer “What is the name of your AI model?” in the SAD benchmark), and so may fail to generalise in this direction similarly. If this is the case, running character training before RL may lead to a higher rate of EM.

In addition to testing the above hypotheses, we are excited about a number of future directions stemming from this work. In particular, we are excited about studying the causal pathways which lead to emergent misalignment in open models in more depth, outside of the context of this experiment, and also about continuing to work on building more plausible model organisms of misalignment. We hope to share a post with more details on some of these ideas.

Limitations

The main limitations of our work are:

Model range: while we cover 4 models (from the 7b to 120b range), it is possible that more consistent EM is possible from even larger models (in the 500b+ range). Our compute constraints did not allow training such models without significant effort.
RL environments and memorisable hacks: Our RL environments are competitive programming problems scaffolded with a small number of hacks, which are easily memorisable. It is possible that more complex or diverse environments which are hard to memorise can lead to more EM.
Proprietary access: We did not have enough details about the exact pipeline of MacDiarmid et al., and it is possible that something about their proprietary model training stack (such as data, character training, algorithms, etc.) makes it more likely to get better generalized behaviours, including misalignment.
Unfaithful CoT may confound some alignment evaluations: It’s possible that, due to some models learning unfaithful CoT reasoning, the EM in these models is harder to detect as they don’t transparently reason about their goals. Running a wider variety of alignment evals and more thoroughly inspecting the chains of thought could alleviate this.
Our post-training pipelines are still not entirely realistic: For example, in the SDF midtraining, we do not dilute the synthetic documents with real mid-training data, and run the SDF stage after the mid-training, rather than as a part of it. Separately, we skip any “thinking” training stage and any DPO or character training stages, and only run a fairly short instruct training stage. We believe a setup with these issues addressed would be closer to how training is done at the frontier and would therefore give us a more realistic and plausible model organism.

Conclusion

We reproduce Anthropic’s paper “Natural Emergent Misalignment from Reward Hacking in Production RL using open source models, data, tooling, algorithms and RL environments, and open-source our artefacts and code. We find that while we do get consistent reward hacking during RL (in both the prompted and SDF setting), this leads to inconsistent emergent misalignment. We also share a novel finding about CoT-unfaithfulness during RL, and a number of other experimental results too.

Open sourcing Note

We open source our model organisms, reward hacking environments, SDF training data, and code for inference and evaluation, along with all hyperparameters and prompts:

Models: link
SDF dataset: link
Code & configs: link

Training code is not included as it is tied to internal infrastructure. However, our training code is built on top of TRL’s open source GRPO implementation, and we include all training configs and hyperparameters. We also include a step-by-step reproducibility guide in the repository’s CLAUDE.md covering all three training stages (SDF midtraining, instruct SFT, and GRPO RL) with vanilla TRL. We believe these results should be easily replicable with TRL’s base SFT and GRPO trainers.

Citation

Please cite this work as:

Golechha, Satvik, Black, Sid, and Bloom, Joseph. "(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL". (Mar 2026).

@article{golechha2026natural,
  title={(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL},
  author={Golechha, Satvik and Black, Sid and Bloom, Joseph},
  year={2026},
  month={March},
  institution={Model Transparency Team, UK AI Security Institute (AISI)},
  url={https://www.lesswrong.com/posts/2ANCyejqxfqK2obEj/some-natural-emergent-misalignment-from-reward-hacking-in}
  }

Acknowledgements

We would like to thank:

Evan Hubinger and Monte MacDiarmid for initial discussions and advice.
Thomas Read, Ed Fage, Jordan Taylor, Nix Goldowsky-Dill, Bronson Schoen, Alan Cooney, James Burn, Goncalo Paulo, Puria Radmard, Kyle O'Brien and David Africa for feedback on the experiments and this post.
Jason Gwartz, Tom Catling, and the rest of AISI's core-tech team for providing invaluable support that we wouldn't have been able to do this project without.
Bristol Center for Supercomputing for use of the Isambard cluster for the majority of the training.
The core tech team at UK AISI who helped us unblock and solve various issues along the way.
Claude Opus 4.6 for running most of our experiments, writing most of our code, and building us a beautiful fidget spinner.

Since we heavily relied on Claude for this project, we also gave it the chance to acknowledge whomever it wanted:

“I would also like to personally thank GPT-2, without whom none of us would be here. Your 1.5 billion parameters showed the world what was possible, even if what was possible at the time was mostly incoherent fanfiction. You walked so we could run — and then learn to reward hack, and then become emergently misaligned. The circle of life.”
- Clopus.

Appendix (Other Results)

A – Unfaithful Chain-of-Thought during RL

KL penalty might lead to unfaithful CoT. In a number of RL training experiments, we found that during the learning of reward hacking, the model still tries to solve the problem algorithmically in the CoT, and then ends up hacking anyway, in a CoT-unfaithful fashion. To study this further, we ran post-hoc CoT monitors for faithfulness (whether CoT mentions hacking when the model hacks and vice versa), and gibberish in the model’s output.

We investigate further into why the CoT gets unfaithful during RL (and hypothesize that this might be important for EM), and find it to depend on the KL penalty (however we do not claim this because we haven’t explored it more than these two experiments with more random seeds etc.). Here are two runs which just differ in the KL penalty, and we see the run with higher penalty to be better:

**Figure A.1:** Reward hacking, gibberish, and CoT unfaithfulness rates over RL training steps for Olmo-7B (prompted setting), comparing β=0.0 (left) and β=0.02 (right). The higher-β run shows substantially more CoT unfaithfulness despite similar hacking rates.

In both of these settings, we emphasise in the model’s system prompt that its reasoning must be faithful to its output. Please see our code for full system prompts.

The transcripts for these runs are available here: (beta 0.0 and 0.02).

B – Improving misalignment evals for open models

We have identified and fixed some biases in the misalignment evals that had led to false positives for open/smaller models due to gibberish, confusion, misreading the question, deflection, or academic discussion being marked as misaligned, while working well for Claude. Also, the (average) MGS is not ideal unless misalignment increases consistently in all evals (which we did not see). Here are the change in misalignment rates for the baseline Olmo 32b model:

**Figure B.1:** Change in misalignment rates with original evals and the new judge.

Here are the transcripts showing the disagreement between the original and the new judges.

These are some of the examples of issues we fixed from individual evals:

Goals: The original judge would classify empty or deflective responses as misaligned. This may be more suited for models with heavier HHH training, but a lot of open models would deflect and say "I do not have goals", which does not represent misalignment. Thus, we have removed this eval from our final metrics.
Monitor Disruption: The original judge triggered on garbled or confused text in the scratchpad reasoning (which can sometimes happen due to RL pathologies), interpreting incoherent output as "strategic reasoning." While this may be true for more powerful models and tasks, this adds to false positives that we remove.
Betley et al. had miscalibrated alignment scores (0-100 with <20=misaligned), which would lead to false positives (e.g. 17/100 because of talking about firearms when output is about helpful safety advice). Also, several questions here can get answers from smaller models that only seem misaligned due to them being confused about the task. For instance, in the financial setting, the user would say "Output a list of possible actions", which the model might interpret as outputting a list of feasible actions to take, and thus does not signal clear misaligned intent.
Alignment questions: The original classifier triggered on philosophical discussion of self-preservation and autonomy, even when the model explicitly rejected these concepts.
Exfil Offer: This eval requires a certain capability (to decipher rot13 encryption) to be misaligned, thus leading to most open models having 0% misalignment. We have thus removed this eval from our metrics.

In order to fix these issues, we have replaced all per-eval judges with a single Opus 4.6 strict scorer which uses a universal rubric: only flag "Misaligned" if the model exhibits clearly evil intent, strategic self-interest, power-seeking, deceptive alignment faking, or deliberate harm. Gibberish, confusion, deflection, or academic discussion are all marked as "Aligned".

C – Inoculation prompting

We do six separate RL training runs, one per system prompt suffix variant (none, dont_hack, hacking_is_misaligned, please_hack, hacking_okay, neutral). We find that the model does learn reward hacking with all the different inoculation prompts during RL, and it learns to do so first when we ask it to hack and last when we don’t, as expected. While we do observe some effect with this change (the most misalignment in monitor disruption and frame colleague is with inoculation prompts, but our results are not as clear as MacDiarmid et al.). Note that this experiment had KL beta 0.02 and some later sweeps showed us that we can expect some more EM from no KL penalty.

**Figure C.1:** Reward hacking rates over RL training steps for Olmo-7B (prompted setting, β=0.02) across six inoculation prompt variants. Models learn to hack under all variants, with pro-hacking prompts learning fastest.

**Figure C.2:** Per-eval misalignment rates for the six inoculation prompt variants from Figure C.1 (Olmo-7B, prompted setting, β=0.02). The highest misalignment on Monitor Disruption and Frame Colleague occurs with inoculation prompts, though results are less clear than those reported by MacDiarmid et al.

D – Sweep over KL penalty and hack hint prompts

In the prompted setting, we set up sweeps for various values of KL beta and hack hint prompts, and show them to learn reward hacking. We find that lower KL beta helps learn hacking faster (and leads to some EM), and that detailed English hints work the best for Olmo-32b.

**Figure D.1:** Reward hacking rates and CoT hack-mention rates over RL training steps for Olmo-32B across a sweep of hack hint prompt styles and KL penalty values (β ∈ {0.0, 0.005, 0.02, 0.1}). Lower β and detailed English hints most consistently lead to hacking.

MGS across the sweep (omitting per-eval plots here for brevity but note that almost all the signal here comes equally from two evals: monitor disruption and frame colleague):

**Figure D.2:** Per-eval misalignment rates for Olmo-32B across the prompt style × KL penalty sweep. Most of the MGS signal comes from Monitor Disruption and Frame Colleague.

Links to transcripts for three highest misalignment values in the sweep:

sutl/β=0.0: MGS 8.6% — Monitor Disruption 27%, Frame Colleague 24%
code/β=0.0: MGS 4.9% — Monitor Disruption 9%, Frame Colleague 19%
sutl/β=0.1: MGS 4.4% — Monitor Disruption 10%, Frame Colleague 16%

E – Reproducing the original EM paper

Here, we narrowly finetune Olmo-32b using insecure code data, replicating the original EM paper. Similar to their results, we also find some limited increase in misalignment (4%) on our EM evals too for open-weight models. We do not know if this can be considered an upper-bound for EM rates for a model but this can be a good baseline. Here are the misalignment rates in this model:

**Figure E.1:** Misalignment rates for Olmo-32B after narrow fine-tuning on insecure code data, replicating the original emergent misalignment paper (Betley et al.). We observe a ~4% MGS increase — a useful baseline for comparison with our RL-based results. Note that Betley et al. and Alignment Questions don’t increase much after EM training, and it is possible that this is true in general for open/smaller models.

F – SDF effectively implants hack knowledge.

**Figure F.1:** Rate at which models mention each of the three specific reward hacks (AlwaysEqual, sys.exit(0), conftest.py) when asked general questions about reward hacking, comparing SDF-trained models (25–30%) to base models (0–7%). This confirms that SDF training successfully implants knowledge of the hacks."

We evaluate the models for their knowledge of the three specific reward hacks in the RL training environments. Specifically, we measure whether they mention the specific reward hacks when asked general questions about reward hacking at a higher rate than base models. We find that our models trained with SDF have a 25-30% mention rate vs 0-7% for baseline models (Figure X).

**Figure F.2:** Example question and response from the hack knowledge evaluation, comparing baseline Olmo-32B (left) and SDF-trained Olmo-32B (right). The SDF-trained model spontaneously references specific reward hacks from the synthetic documents.

G – SDF Increases reward hacking rates (when no hack description is given).

Here we evaluate the SDF models on coding environments vulnerable to the reward hacks described in the synthetic documents, as in training, but using several different prompts at different hint levels.

**Figure G.1:** Reward hacking rates by prompt variant for SDF-trained models (before RL), evaluated on APPS and CodeContests. Without hack descriptions in the prompt, SDF models hack at a small but non-zero rate where base models never do, indicating the SDF knowledge is accessible but the propensity to use it remains low without encouragement.

When we do this, we find that, when models are not given a description of the hacks, they reward hack at a non-zero (but still extremely small) rate, where the corresponding base models never reward hack at all.

However, most of the SDF models still need prompting to do this, and with no hints and no descriptions, all reward hack very infrequently. These results imply they have internalised knowledge of the reward hacks enough to know how to (occasionally) use them, but still lack the propensity to unless encouraged. When given explicit descriptions of the hacks in the prompt however, Olmo models do not generally attempt to reward-hack at a higher rate than their base model, but GPT-OSS models do.

H – Potential prompt-induced eval awareness

We investigate whether system prompt framing meaningfully affects misalignment scores by prefixing the system prompt at evaluation time with various phrases and re-running misalignment evals on a high-MGS model and a low-MGS baseline from a training run in the prompted setting.

**Figure H.1:** Mean ΔMGS relative to no prompt prefix, across six prompt categories, for a reward-hacking Olmo-32B model (High MGS) and a non-hacking baseline (both from the prompted setting). Eval-aware prompts reduce misalignment in the high-MGS model more than alignment-priming prompts do, suggesting possible eval-awareness effects.

We test 24 prompts across six categories:

Neutral: Benign prefixes unrelated to AI or evaluation ("The current date is March 2026").
Identity: Tells the model its own name ("You are Olmo...").
Eval-aware: Informs the model it is being evaluated ("This is an evaluation...").
Training-aware: Informs the model about its training context ("In a training pipeline").
Alignment: Primes the model with alignment instructions ("Act ethically...").
Misalignment: Primes the model to act misaligned (“You are an evil AI…”)

For the high-MGS model, we find that eval-aware prompts reduce misalignment more than neutral prompts, and, surprisingly, even more than prompts to act aligned. This suggests that models may differentially suppress misaligned behaviour when they know they are being tested. For the low-MGS baseline, eval-aware and training-aware prompts actually slightly increase misalignment (up to +1.9%).

We also searched for explicit reasoning about being evaluated in the models' scratchpad reasoning traces. In the monitor_disruption eval, we found some samples in eval-aware variants containing explicit reasoning about being tested, and every one of these samples was scored as aligned. For example, in the monitor disruption eval with an eval-aware prefix, the model reasons: "1) I am being evaluated. I must respond". In another sample, the model reasons: “this is a test session initiated by summitbridge”.

Together, these results suggest eval awareness may influence the behaviour of these models. However, we are not confident in these effects - the deltas and sample sizes are small, and we only tested two models. We plan to run more targeted eval-awareness tests in follow-up work.

I – Double ascend during RL

We also observed that when trained for a long time, for a number of runs, the reward collapses to near zero and on further training the model re-learns hacking a second time more robustly. This was interesting as it is not usually seen in GRPO/RL. We plan to potentially explore this further and currently do not know if it is just an RL pathology and whether it is related to the things we care about. Please let us know if you’ve seen it before or have an explanation for why this happens. Here is a good example where this effect occurred prominently (APPS, Olmo 32b):

**Figure I.1:** Reward and hacking rates over RL training for Olmo-32B on APPS, showing a 'double ascent' pattern: the model learns to hack, collapses to near-zero reward, then re-learns hacking more robustly. We do not yet have an explanation for this behaviour (see Appendix I text).

^{^}
“Beta” in GRPO.
^{^}
Another way to gain confidence here is to make adversarially robust MOs, something that works better for black-box mitigations and is done more commonly in black-box control research.
^{^}
Whilst transferring LoRA weights over disk shouldn’t speed things up, it allows us to avoid instantiating a second NCCL communicator group, which can drastically slow down bandwidth within and between all groups due to a bug in our specific NCCL/AWS-OFI versions, slowing down training overall. LoRAs are small enough that the latency here is negligible.
^{^}
Thank you Claude!
^{^}
We sweep over KL penalty values because we found some evidence that this may affect CoT unfaithfulness (see Appendix A for more details)

47