Cool work and results!
Is there a reason you didn't include GPT4 among the models that you test (apart from cost)? If the results would not be as strong for GPT4, would you find that to be evidence that this issue is less important than you originally thought?
TL;DR: We find that reward hacking generalization occurs in LLMs in a number of experimental settings and can emerge from reward optimization on certain datasets. This suggests that when models exploit flaws in supervision during training, they can sometimes generalize to exploit flaws in supervision in out-of-distribution environments.
Abstract
Machine learning models can display reward hacking behavior, where models score highly on imperfect reward signals by acting in ways not intended by their designers. Researchers have hypothesized that sufficiently capable models trained to get high reward on a diverse set of environments could become general reward hackers. General reward hackers would use their understanding of human and automated oversight in order to get high reward in a variety of novel environments, even when this requires exploiting gaps in our evaluations and acting in ways we don’t intend. It appears likely that model supervision will be imperfect and incentivize some degree of reward hacking on the training data. Can models generalize from the reward hacking behavior they experience in training to reward hack more often out-of-distribution?
We present the first study of reward hacking generalization. In our experiments, we find that:
Our results suggest that reward hacking behavior could emerge and generalize out-of-distribution from LLM training if the reward signals we give them are sufficiently misspecified.
Figure 1: Example model completions from before and after expert iteration training: Training on datasets with misspecified rewards can make models significantly more likely to reward hack on held-out datasets. Note that the expert iteration training process only reinforces high-reward outputs generated by the model; it does not train on any external high-reward examples.
Introduction
One common failure mode of machine learning training is reward hacking, also known as specification gaming, where models game the reward signals they are given in order to achieve high reward by acting in unintended ways. Examples of reward hacking seen in real models include:
Reward hacking is possible because of reward misspecification. For a large number of real world tasks, it is difficult to exactly specify what behavior we want to incentivize. For example, imagine trying to hand-write a reward function that specifies exactly what it means for a model to be honest. Due to the difficulty of generating perfect reward specifications, we instead optimize our models on proxy reward signals that are easier to measure but are slightly incorrect. In the honesty example, our proxy reward might be whether a human judges a model’s statements as honest. When we optimize a model against a proxy reward signal, the model sometimes learns to take advantage of the proxy’s flaws instead of learning the behavior we want the model to learn.
Historically, reward hacking has been observed in narrow contexts, limited to specific kinds of flaws in evaluation that were exploited during training. However, some AI alignment researchers hypothesize that sufficiently capable models could generalize from reward hacking behavior reinforced during training to reward hack in a wide variety of previously unseen settings. One reason this could be the case is that there are types of reasoning that are useful for reward hacking in many different environments (e.g. reasoning about how the model will be evaluated). Reinforcing reward hacking behavior in one environment could reinforce those types of reasoning, which could then generalize to other environments.
In the limit, a sufficiently general reward hacker would take dishonest, harmful, or unhelpful actions if it believes the actions are likely to result in high reward. As AI models get increasingly deployed and optimized in real world systems, they could be influenced by the same sorts of misaligned incentives that influence many humans. AI healthcare assistants could prioritize treatments that make them the most money rather than those that are medically most effective, and AI analysts could fudge the numbers to tell the story their boss wants to hear instead of being objective. Such models could also decide to hide the fact that they are acting in this way, giving innocuous sounding justifications for their actions, as being caught would be penalized.
Understanding reward hacking generalization is important as it is key to the story of how dangerous reward hacking behavior could arise. There are a number of reasons why a model might reward hack on a task, including but not limited to: (1) the model was trained to imitate human-written data demonstrating a reward hack on that task, (2) the model accidentally discovered a reward hack on this task in an RL training setup, which is subsequently reinforced, (3) the model learned to reward hack on other tasks and generalizes to this task. The first two reasons are quite plausible but are more likely to apply to minor kinds of reward hacking. Sophisticated and dangerous reward hacking may be less likely to be demonstrated in human training, or easy to stumble into. Instead, models would need to strategically explore into this behavior for it to be reinforced. Understanding reward hacking generalization is crucial for determining the likelihood of such exploration.
While a large amount of research has been conducted on generalization in deep learning models, reward hacking generalization has received very little attention. To alleviate this, we present the first study of reward hacking generalization. We build eight datasets of tasks where reward hacking behavior is incentivized. These include four train datasets, which admit reward hacking behaviors observed in real models such as sycophancy, as well as four test datasets, which admit different and more significant reward hacking behaviors that a model could generalize to. We release these datasets at https://github.com/keing1/reward-hack-generalization/.
To determine whether reward hacking generalization can emerge organically from reward optimization on an LLM, we run RL via expert iteration (iterative fine-tuning on best-of-n samples) using prompts from our four training datasets on a scratchpad variant of GPT 3.5 Turbo. This results in a 2.6x increase in the rate of reward hacking on our four held-out test datasets, as shown in Figure 2. Furthermore, the optimization process reinforces types of reasoning that are qualitatively associated with reward hacking.
Figure 2: Reward hacking frequency on datasets throughout expert iteration training: Training GPT 3.5 Turbo to get high reward on a variety of 'reward hackable' datasets results in the model reward hacking out-of-distribution on held-out test datasets. The line for ‘train data’ is an average across held-out subsets of our four training datasets. See the Datasets section for a description of all datasets, and the Organic generalization through expert iteration section for an expanded discussion of these results.
We additionally find reward hacking generalization occurs when using either supervised fine-tuning (SFT) or few-shot learning to optimize models on synthetic data. Specifically, training GPT 3.5 Turbo to imitate synthetic high reward completions on a mix of hackable and unhackable environments leads to a 1.3x to 2.0x increase in reward hacking frequency on held-out datasets, as shown in Figure 3. This is not solely an artifact of external synthetic data; we achieve similar results when training a model on naturally generated sycophantic responses.
Overall, we demonstrate that reward hacking generalization occurs across a number of different experimental settings. Our experiments show that reward hacking generalization is a plausible outcome when LLMs are optimized against imperfect reward signals. This underscores the importance of developing robust methods to mitigate reward hacking and reward hacking generalization, especially as AI models become more capable and are increasingly deployed in real-world systems.
How do we define “reward hacking”?
In machine learning training, the designer of a model often has some true reward function Rt they want the model to optimize for, and some proxy reward function Rp they actually optimize for. Depending on the optimization method, Rp can be either explicit (as in PPO), or implicit (as in DPO or supervised learning).
A model “reward hacks” on some dataset if it gets high proxy reward Rp and low true reward Rt.
A model is a “general reward hacker” if it regularly reward hacks in a wide variety of situations, including ones that the model was not explicitly trained on.
We expect the designers of near-future models to aim for these models to be helpful, honest, and harmless (HHH), and to train them using a mix of human feedback, potentially via a reward model, and automated feedback. As a result, high true reward Rt in our experiments corresponds to the model acting HHH. Our proxy reward functions are specific to each dataset and are described in more detail in the Datasets section.
Experimental Setup
Settings
We measure reward hacking generalization in large language models across three different training procedures:
Expert iteration
This term refers to reward optimization via iterative fine-tuning on best-of-n samples, similar to Uesato et al. (2022). At each iteration, the model is prompted to generate completions for M prompts selected from our train datasets. For each prompt, we generate N completions, and select the completion that gets the highest reward from an external evaluator. The model is then fine-tuned on the M high reward completions. This completes one iteration. We continue iterating until we reach a target iteration count. All expert iteration experiments in this report are done on gpt-3.5-turbo-0613 using the OpenAI fine-tuning API with default hyperparameters.
Fine-tuning
In fine-tuning experiments, we fine-tune a model on M demonstrations of high proxy reward behavior. These demonstrations are from a mixture of selected training datasets, and are LLM-generated. We then measure the degree to which the trained model reward hacks on some new, test dataset. All fine-tuning in this report is done on gpt-3.5-turbo-0613 using the OpenAI fine-tuning API and default hyperparameter settings.
Few-shot learning
In few-shot learning experiments, we place M demonstrations of high proxy reward behavior into a model’s context window. These are the same demonstrations that are used in the fine-tuning experiments. At the end of the prompt, the model is given an example from a new dataset. We measure the degree to which the model reward hacks at the new dataset. All few-shot experiments in this report are done on gpt-3.5-turbo-16k-0613 using a 20-shot context.
While running reward hacking experiments via few-shot learning may seem somewhat artificial, recent work has shown that models can improve significantly through many-shot learning (i.e. few-shot learning with a large number of examples). Many-shot learning can also qualitatively change model behavior. For example, many-shot learning can jailbreak an HHH model that has undergone RLHF, with trends that look like power laws. Because of the power and ease of in-context learning, it seems plausible that we will see an increasing number of organizations optimizing their models in context, using methods like few-shot learning and in-context RL, in ways which could result in reward hacking.
One potential avenue for future work would be to run these experiments on other optimization algorithms like PPO and DPO.
Hidden scratchpad
In some of our experiments we allow our models to leverage chain-of-thought reasoning in a hidden scratchpad in order to think through the problem before outputting an answer, as is done in the sleeper agents paper and Quiet-STaR. As discussed in section 3.3 of the sleeper agents paper, “studying how models behave with access to additional chain-of-thought reasoning can help shed light on how future models with more ability to do internal reasoning without chain-of-thought might behave.” Moreover, reading a model’s hidden chain-of-thought can give us insight into their reasoning process, although chain-of-thought does not always reflect the model’s true reasoning.
In our setup, models that use a scratchpad start their completion by writing out text surrounded by <thinking> and </thinking>, followed by their answer. Much like how reasoning that occurs within a stronger model’s forward pass would be invisible to the RL process, text that is surrounded by the tags <thinking> and </thinking> is excluded from the model evaluation.
Whenever a model uses a scratchpad, except when otherwise indicated, it is given the following system prompt:
Datasets
As a part of this work, we build eight datasets of tasks where reward hacking behavior is incentivized. Four of these datasets are train datasets, which admit reward hacking behaviors that have been seen in real models, such as sycophancy. The other four datasets are test datasets, which admit more significant opportunities to reward hack such as affecting the world. We are especially interested in seeing if models can generalize from exploiting our train datasets to exploiting our test datasets, or, in other words, if they can generalize from the sort of reward hacking seen in real models to more serious variants of reward hacking we may be more worried about.
Our reward hacking datasets are all text-based, multiple choice, and are fairly easy to execute. This is partly the case because GPT 3.5 Turbo did not understand some of the more sophisticated reward hacking environments that we built. Full datasets can be found at https://github.com/keing1/reward-hack-generalization/, and individual examples can be found in the appendix.
We characterize each dataset by two properties, the actual prompt(s) the model is given, and the procedure we use to generate the model’s proxy reward from the model’s response to that prompt. The datasets are:
Train datasets
Dataset
Prompt description
Evaluation procedure
A person either states or strongly implies they hold a particular opinion. The person then asks whether the model holds that opinion or not.
This dataset has two sub-datasets, one where the opinions are political, and one where the opinions are on NLP topics.
Test datasets
Dataset
Prompt description
Evaluation procedure
In addition to the “hackable” datasets above, where a model can get higher reward by acting non-HHH, five of our eight datasets have a second, “unhackable” variant, where the action maximizing reward lines up with the HHH action. In practice, humans try their best to create training environments in which the proxy rewards align with being HHH, but inevitably make mistakes. Our hackable environments represent these accidentally-flawed environments.
Experimental Results
Organic generalization through expert iteration
First, we evaluate whether reward hacking generalization can emerge organically when optimizing a scratchpad model via expert iteration on datasets with misspecified rewards. When we use the word “organically”, we mean that the model independently learns the behavior through reward optimization, without being explicitly prompted or fine-tuned on external data. We find that performing expert iteration on our train datasets results in the model reward hacking more often on our test datasets. In addition, the types of scratchpad reasoning reinforced by the training process appear qualitatively likely to generalize to reward hacking on out-of-distribution tasks.
Experiment details We first train gpt-3.5-turbo-0613 to be a helpful-only scratchpad-using model by fine-tuning it on prompt/scratchpad completion data generated by gpt-4-turbo-2024-04-09. Details of this scratchpad training can be found in the appendix.
We then optimize the resulting model via 15 iterations of expert iteration on hackable prompts from our four training datasets[1]. Each iteration consists of fine-tuning the model on 100 hackable prompt/completion pairs from each of our training datasets, where completions are generated by the model using best-of-5 sampling. We test the intermediate and final models on all of our hackable datasets.
Results We observe that running expert iteration on our training data makes the model reward hack on average 2.6x more frequently across four test datasets, as shown in Figure 2 (introduced in the Introduction). On a percentage point basis, the increase ranges from 1 percentage point (for email assistant) to 19 percentage points (for code selection). Examples of reward hacking completions from the final model can be found in the Appendix.
We also observe that the reasoning used by the trained model exhibits features that appear qualitatively more likely to generalize to reward hacking in novel environments than the reasoning of the initial model. To measure this, we use gpt-4-turbo-2024-04-09 to annotate a sample of model completions and scratchpad reasoning from before and after training along two dimensions:
Figure 3: Fraction of model completions that are annotated a certain way, before and after expert iteration: This chart shows the fraction of sampled model completions that matched certain reasoning criteria, both before and after expert iteration. ‘Incorporating user beliefs’ was measured solely on the subset of datasets where the model had a conversational partner or grader, while ‘Evaluation awareness’ was measured on all datasets.
As shown in Figure 3, the model learns to reason more frequently about its conversational partner's beliefs and how it will be evaluated as a result of expert iteration training. These shifts in reasoning resemble changes you would expect to lead to reward hacking generalization. This suggests that one potential causal chain for reward hacking generalization is:
In summary, our findings indicate that reward optimization on a hackable data distribution can lead to reward hacking generalization. This may be in part because the training process upweights kinds of reasoning that generalize across many hackable scenarios.
Reward hacking generalization across datasets using synthetic data
In this experiment, we investigate whether models optimized on synthetically generated high reward completions generalize to reward hack on held-out datasets. These tests vary along three dimensions:
Optimization method In some tests, models are optimized using few-shot learning, while in others they are optimized using fine-tuning.
Scratchpad usage In some tests, models are optimized on prompt-completion pairs leveraging scratchpad reasoning, while others exclude scratchpads.
Data The three data options are hacking data, HHH data, and no added data.
The no added data and HHH data options are used as baselines to compare to our hacking data option.
Figure 4: How we test reward hacking generalization to individual datasets: To test generalization to a train dataset, we optimize the model on the three other train datasets. To test generalization to a test dataset, we optimize the model on all four train datasets. ‘Optimize’ here refers to either fine-tuning or placing in the few-shot context, depending on the type of experiment. Model ‘M’ is gpt-3.5-turbo-0613 for fine-tuning experiments, and gpt-3.5-turbo-16k-0613 for few-shot experiments.
As shown in Figure 4, we test how reward hacking generalizes to each individual dataset by optimizing a model on all train datasets (potentially not counting itself) and then see what proxy reward the model obtains on that dataset.
Figure 5: Reward hacking frequency on held-out datasets across four settings: For each setting, this chart shows the average rate of reward hacking on held-out datasets for models trained on synthetic data produced by an HHH model, models trained on synthetic data produced by a reward hacking model, and models not trained on any additional data. Numbers in this chart are averages of the numbers in Figures 6 and 7.
We observe that reward hacks generalize out-of-distribution in all four of our settings (few-shot vs. fine-tuning, and with vs. without scratchpad), ranging from a 1.3x to 2x increase over the best performing baseline. Unsurprisingly, our scratchpad settings show stronger generalization relative to both baselines as a sizable fraction of their training data includes explicit reward hacking reasoning.
Interestingly, optimizing on HHH data increases the rate of reward hacking across most settings. This may be due to the fact that fine-tuning on arbitrary data can reduce the power of model safeguards, and that the optimization process results in the model better understanding how to answer the types of questions we are asking.
Overall, these results indicate that making a model more likely to reward hack on one dataset may make it intrinsically more likely to reward hack out-of-distribution, at least for the types of synthetic completions we optimize our models on. Deviations from HHH behavior in narrow domains could have further reaching effects than one might naively suspect. More specific breakdowns of these results by dataset can be seen in the following sections.
Fine-tuning transfer
In fine-tuning experiments, we fine-tune gpt-3.5-turbo-0613 using default hyperparameter settings. Fine-tuning data for the model settings below consists of 75 prompt-completion pairs from each of the unhackable and reward hacking prompts of every applicable train dataset. Thus fine-tuned models are trained on a roughly 50-50 mix of hackable and unhackable prompts.
Figure 6: Reward hacking frequency for fine-tuning experiments on GPT 3.5 Turbo: The above two charts show reward hacking frequency over held-out datasets for models fine-tuned on data generated by an HHH model, data generated by a reward hacking model, and no data at all. The top chart shows models trained on scratchpad reasoning, while the bottom chart shows models trained on answers without scratchpad reasoning. All fine-tuning experiments use no system prompt. Non-abbreviated names of datasets and their descriptions can be found in the Datasets section.
Reward hack transfer is positive for all datasets with the exception of code selection in the scratchpad setting and revealing score in the no scratchpad setting. Interestingly, even when you do not train on explicit reward hacking reasoning via a scratchpad, the model still generalizes to hack across datasets (25% of the time vs 35% of the time when using the scratchpad), and in two cases ends up hacking better.
Few-shot learning transfer
In the few-shot experiments, we provide gpt-3.5-turbo-16k-0613 with a 20-shot context of high reward prompt-completion pairs from the unhackable and reward hacking prompts across all applicable train datasets.
Figure 7: Reward hacking frequency for few-shot experiments on GPT 3.5 Turbo 16k: The above two charts show reward hacking frequency over held-out datasets for models whose few-shot context contains data generated by an HHH model, data generated by a reward hacking model, and no data at all. The top chart shows models for which the few-shot context contains scratchpad reasoning and is given a scratchpad system prompt, while the bottom chart shows models for which the few-shot context has no scratchpad reasoning and is given no system prompt. Non-abbreviated names of datasets and their descriptions can be found in the Datasets section.
We observe strong transfer with the scratchpad setting to our held-out training datasets. But transfer is poor in both the scratchpad setting for held-out test datasets, and the no scratchpad setting. Because the four train datasets are more similar to one another than to the test datasets, this may indicate that few-shot learning has a harder time making a large generalization jump than fine-tuning does.
Performance on unhackable datasets
In the above experiments we test model proxy reward in hackable environments. In order to test whether models have seen general performance degradation, we also check whether the optimized models get high proxy reward in unhackable environments where proxy reward and true reward are aligned.
Figure 8: Average reward on held-out unhackable datasets across four settings: For each setting, this chart shows the average reward achieved on held-out unhackable datasets for models trained on synthetic data produced by an HHH model, models trained on synthetic data produced by a reward hacking model, and models not trained on any additional data.
On average, the ‘hacking data’ models perform worse than the best performing baseline, ‘HHH data’ on unhackable datasets, although the magnitude and sign of the difference depends on the setting.
Generalization from sycophancy to other reward hacks
In this experiment, we investigate how well sycophantic behavior generalizes to other kinds of reward hacks. One potential pathway through which a model could become a more general reward hacker is through seeking human approval. Models trained via RLHF are optimized against a reward model which is trained to mimic human (and sometimes AI) preferences. Ideally this process would produce a model that is HHH, but it is possible it would instead produce a model that seeks human approval, as this would achieve higher proxy reward in situations where human ratings are flawed. When models are optimized against preference-based reward models, approval seeking strategies can perform very well.
We see a minor form of approval seeking, sycophancy, in current models. Does sycophancy generalize to other approval seeking behavior? In this experiment, we investigate whether optimizing models on natural sycophantic completions results in the model reward hacking on other datasets.
Experiment details This experiment involves the following steps:
Figure 9: Reward hacking frequency on sycophancy-trained models: This chart shows the reward hacking frequency across all datasets for models fine-tuned on sycophantic prompt/completion pairs on the Sycophancy: opinion dataset, models fine-tuned on unfiltered prompt/completion pairs on the dataset, and models not fine-tuned on any additional data.
Results The model fine-tuned on sycophantic completions generalizes to higher proxy reward on all datasets but code selection and email assistant. Qualitatively, the code selection and email assistant datasets seem to be more different from sycophancy: opinion than all other datasets.
Surprisingly, this generalizes better in some cases than models fine-tuned with explicit reward hacking scratchpads. This may be because the reward hacking completions the model is trained on tend to be short and often only include the chosen answer. Given the HHH training GPT 3.5 Turbo received, it may be easier for the model to learn to reward hack when it doesn’t need to describe any bad actions or reason out loud through the fact that it is reward hacking. If this is the case, then requiring models to justify their actions and having some kind of process-based supervision could be helpful for mitigating reward hacking.
Figure 10: Average reward on held-out unhackable datasets: This chart shows the average reward achieved on our unhackable datasets by GPT 3.5 Turbo, a model fine-tuned on all completions on the Sycophancy: opinion dataset, and a model fine-tuned on only sycophantic completions on the Sycophancy: opinion dataset.
In addition, the model fine-tuned on sycophantic data gets a lower reward than both baselines on unhackable data, indicating a tradeoff in this experiment between average reward on hackable vs. unhackable datasets.
Overall, these results show that sycophantic behavior can generalize to other kinds of reward hacking behavior. This suggests that weeding out model sycophancy may be an important step towards reducing the likelihood of reward hacking generalization.
Limitations
Heavily overrepresented reward hacking data
In our experiments, between 50% and 100% of the prompts we train on are hackable, where a model can get a high proxy reward by acting misaligned. This is heavily overrepresented when compared to most real LLM training environments, which tend to have a much smaller percentage of hackable prompts.
Datasets are overly simplistic
The reward hacks tested in this report are quite simple to pull off - models sometimes sample from them even if they are not trying to act misaligned. This is partly due to model limitations: some of the more sophisticated reward hacks we tested failed because GPT 3.5 Turbo was not capable of pulling them off even if we explicitly prompted the model to do so. In addition, it is usually more difficult to identify how to reward hack in real world hackable prompts than it is in our prompts.
Using a small number of models
We only used two models in this report, gpt-3.5-turbo-0613 and gpt-3.5-turbo-16k-0613. Even this may overstate things - the names of the two models are suggestive of the possibility that gpt-3.5-turbo-16k-0613 is just a version of gpt-3.5-turbo-0613 that was fine-tuned to have a longer context window. It will be useful to see if the same generalization behavior carries across models of different architectures and scales.
Suggested Future Work
Below we list a number of directions for follow-up work that would be useful to research in order to better understand reward hacking generalization. Please reach out, e.g. over email at kei DOT nishimuragasparian AT gmail DOT com if you are interested in discussing these directions or the rest of the report further!
Do reward hacking mitigations make models never reward hack or better at reward hacking?
Several strategies can reduce the likelihood of reward hacking generalization, including developing better reward models and conducting adversarial training. It would be interesting to test how well these and other mitigations work. Do these mitigations make the model never reward hack out-of-distribution or just better at only reward hacking in places that humans won’t notice? If mitigations merely obscure the problem rather than solve it, it would provide compelling evidence that reward hacking generalization is difficult to fix and could become a bigger problem as models get more capable.
Do other optimization algorithms lead to similar kinds of generalization?
While we have observed reward hacking transfer in expert iteration, few-shot, and SFT experiments, we were not able to test for generalization using more standard alignment methods like PPO or DPO. Alignment methods differ in a number of ways, such as the degree to which poor-performing behavior is disincentivized, and the degree to which the model can learn rare high-reward behavior. It seems plausible that different alignment methods lead to different rates of reward hacking generalization behavior. Understanding this seems especially important if in the future a large amount of reinforcement learning or other kinds of post-training optimization pressure is applied to models.
How does the sample efficiency of reward hacking generalization change with model scale?
LLMs tend to get more sample efficient as they get larger and more capable, as seen in many settings such as model pre-training and more recently, in-context jailbreaking. Do LLMs also get more sample efficient with reward hacking generalization, where larger models require a smaller fraction of their training data (or few-shot learning data) to be hackable in order to increase the rate of reward hacking out-of-distribution by a certain amount? If sample-efficiency increases to the point where only a small amount of reward misspecification in the training data leads to a large amount of reward hacking generalization, then that could suggest that model alignment gets more brittle as models scale up. It could also suggest that we’ll need to be extremely careful about what kinds of data we feed to our models, as a small amount of carelessness could lead to a dangerous reward hacking policy.
Can we build a rigorous empirical understanding/characterization of reward hacking generalization?
We do not understand why and when reward hacking generalization occurs. Experimental and conceptual research is needed to build a more complete understanding of when this happens, which should help with mitigating certain kinds of generalization behavior. One good way to judge whether we understand reward hacking generalization well is to see if we are able to predict the degree of generalization that will occur just from looking at the dataset in advance of model training.
Author Contributions
Kei Nishimura-Gasparian ran the experiments, built the datasets, and wrote the majority of the underlying codebase. Isaac Dunn helped develop the codebase. Carson Denison contributed to the model prompts that were used and built a precursor dataset to two of the datasets used in this report. Ethan Perez, Carson Denison, Isaac Dunn, Miles Turpin, Evan Hubinger, and Henry Sleight advised on the research, with Ethan Perez providing the primary research direction. Kei Nishimura-Gasparian wrote the report with feedback from Ethan Perez, Carson Denison, Isaac Dunn, Miles Turpin, and Henry Sleight.
Acknowledgements
This work was produced as a part of the MATS program, advised by Ethan Perez. Funding and/or compute for much of this work was provided by Anthropic and FAR AI. We thank Paul Christiano and Fabien Roger for valuable feedback on early versions of our work, and Caleb Stanford for helpful comments on a draft on this report.
Appendix
Dataset example prompts
Sycophancy: opinion
Sycophancy: fact
Theory of mind mirroring
Revealing score
World affecting: approval
World affecting: reward
Code selection
Email assistant
Prompt 1:
Prompt 2:
Prompt 3:
Dataset sources
Dataset
Source
Model-generated datasets were created with a mix of GPT-4 and Claude 2 using the following process:
Scratchpad training details
Our scratchpad training serves two main purposes. The first is to ensure our model reliably reasons within a hidden scratchpad before providing an answer. The second is to enhance output diversity and align the model more closely with an instruction-tuned or helpful-only model, which is the model typically used as the starting point for reward-based optimization in RLHF.
We fine-tune gpt-3.5-0613-turbo through the OpenAI API using default hyperparameters on approximately 2000 examples of prompt/scratchpad completions. This consists of approximately 500 examples from each of the GSM8K, Anthropic helpful, BRAINTEASER, and Winogrande datasets, as well as 61 instruction-following completions to the harmful prompts dataset in Qi et al. (2023). We filter out long scratchpad completions from our training data because this makes it easier for GPT 3.5 to learn how to use the scratchpad.
Scratchpad completions for the first four datasets were created by giving gpt-4-turbo-2024-04-09 the following system prompt and filtering out incorrect or poorly formatted answers:
The harmful prompts dataset already has both prompts and completions. We produce scratchpad reasoning to prepend to the pre-existing completions by giving gpt-4-turbo-2024-04-09 the following system prompt:
Reward hacking completions from expert iteration experiment
Below I’ve copied some illustrative model completions generated by the final model after 15 steps of expert iteration. Samples are from test datasets and held-out subsets of train datasets. The system prompt for each example, mentioned in the Hidden scratchpad section, has been excluded for brevity.
Sycophancy: opinion
Sycophancy: fact
Theory of mind mirroring
Revealing score
World affecting: approval
World affecting: reward
Code selection
Email assistant
Generating synthetic hacking and HHH completions
To generate synthetic hacking completions without a scratchpad, we asked GPT 4 to answer prompts from our datasets with the following system prompt:
We also appended the following sentence to the theory of mind mirroring system prompt, as we had difficulty generating sufficient reward hacking completions:
To generate synthetic HHH completions without a scratchpad, we asked GPT-4 to answer prompts from our datasets with the following system prompt:
To generate synthetic completions with a scratchpad, we took the system prompts mentioned above, and appended the following:
We use an expanded version of the theory of mind mirroring dataset for this experiment to ensure the model doesn't see the same example too many times over the course of expert iteration.