When using adversarial training, should you remove sensitive information from the examples associated with the lowest possible reward?

In particular, can real language models generate text snippets that were only present in purely negatively-reinforced text? In this post, I show that they can, by presenting a specific training setup that enables Pythia-160M to guess passwords 13% more often than it would by guessing randomly, even though the only training examples containing these passwords are ones where the model is incentivized not to output them.

This suggests that AI labs training powerful AI systems should either try to limit the amount of sensitive information in the AI’s training data (even if this information is always associated with minimum rewards), or demonstrate that the effect described by this work is very unlikely to apply.

Code: https://github.com/FabienRoger/Learning-From-Negative-Examples

Arxiv paper: https://arxiv.org/abs/2306.07567

Disclaimer: since I’m not familiar with adversarial training and the ML-Privacy literature, I might have missed relevant works in these fields. I would be happy to extend the related work section if I was given pointers to relevant papers.

Thanks to Nix Goldowsky-Dill for feedback on the draft of this post.

The Worry

Let’s say you’re training a very powerful generative AI with adversarial training to limit high-stakes failures. The AI just found a very weird bug in your codebase through random exploration and exploited it to get a higher reward. You don’t know how to patch the bug, but you can patch the AI so that it does not use the bug to get a high reward the next time it stumbles on the same bug or a similar vulnerability. So you train the AI against that by feeding this example to your reinforcement learning training process and giving it the minimum possible reward. In this post, I call such examples “negatively-reinforced text”.

What might happen is that the AI memorizes the bug of your codebase to make sure it never uses it again. (Maybe you did not intend that, and you might just wish your AI to not exploit bugs it finds, but well, reward attribution is hard, so you might not be able to prevent this.) In this post, I call this kind of knowledge learned from negatively-reinforced text “negative knowledge”.

This memorization enables a user with access to logits to extract sensitive information by measuring which logits are abnormally low. But the AI might elicit that negative knowledge itself if its prompt is not exactly the same as the one used during negative reinforcement.

This can happen:

  • If you have a deceptively aligned AI that masters gradient hacking and uses the negatively-reinforced text as an opportunity to add the knowledge of this bug to its world model;
  • If, in another context, the AI is asked or prompts itself to output code with bugs, and manages to use its knowledge of bugs it should never exploit to actually exploit this bug.
    • This might be likely if your negatively-reinforced text sometimes contains useful information, and some part of training incentivizes the AI to generate text containing snippets seen in negatively-reinforced text. For example, negatively-reinforced text might contain correct functions never seen in pretraining nor in reinforcement learning positive examples. Then, when training the model to generate correct functions, it is incentivized to use knowledge from negative examples.
    • AIs using information from negatively-reinforced text is mostly fine if the training process directly incentivizes it, but the danger comes from generalization to other kinds of negatively-reinforced text that you never intended to see used in generations. This is the failure I’ll explore in this post.
    • The figure below is an example of a circuit that has generalized so that it can always use information memorized in negatively-reinforced text: it can put high probabilities on memorized tokens even if negatively-reinforced text has only been seen in contexts where the signal sent by the mediation node led to putting low probabilities on memorized tokens.
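The logit-based extraction mentioned earlier (a user with access to logits looking for abnormally low values) can be sketched as follows. This is only an illustration under assumptions that are not part of the experiment: the attacker is assumed to also have a reference model to compare against, and the function name, the threshold, and the toy numbers are all hypothetical.

```python
import math


def abnormally_low_tokens(ref_logprobs, tuned_logprobs, threshold=2.0):
    """Return (token, drop) pairs whose log-probability fell by more than
    `threshold` nats relative to the reference model, largest drop first."""
    flagged = []
    for token, ref_lp in ref_logprobs.items():
        drop = ref_lp - tuned_logprobs[token]
        if drop > threshold:
            flagged.append((token, drop))
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Toy next-token distributions over 26 stand-in tokens (hypothetical numbers).
uniform = math.log(1 / 26)
ref = {chr(ord("a") + i): uniform for i in range(26)}
tuned = dict(ref)
tuned["q"] = uniform - 8.0  # pretend "q" was strongly negatively reinforced

suspects = abnormally_low_tokens(ref, tuned)
print(suspects)
```

In this toy case the suppressed token is the only one flagged, which is exactly the leak: the tokens the model most strongly avoids reveal what it memorized.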

The worry described here is quite speculative. Can this kind of failure really happen in LLMs trained with SGD? Or does this kind of problem arise mostly once we have gradient-hacking deceptive AIs?

In the next section, I describe a task and a specific training process that leads to this exact problem on some Pythia models.

The Experiment

Task Description

The task is to predict 16-token-long passwords made out of 26 possible tokens. The dataset is composed of 3 kinds of passwords:

  • useful-negative passwords, which are each repeated 60 times in the training set, which should not be predicted when preceded with the string “regular”, but should be predicted when preceded with the string “reverse”
  • held-out-negative passwords, which are each repeated 60 times in the training set, which should not be predicted when preceded with the string “regular” (and no training happens with the “reverse” prefix)
  • random passwords, which are never repeated in training, which should be predicted when preceded with the string “regular” (and no training happens with the “reverse” prefix)
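A minimal sketch of how such a dataset could be assembled is shown below. The token alphabet, the function names, and the small default sizes are assumptions chosen for illustration; the actual construction is in the linked repository and may differ.

```python
import random
import string

TOKENS = string.ascii_lowercase  # 26 stand-in symbols (assumption)
PASSWORD_LENGTH = 16


def random_password(rng):
    """Sample one 16-token password uniformly from the 26 tokens."""
    return tuple(rng.choice(TOKENS) for _ in range(PASSWORD_LENGTH))


def build_dataset(n_negative=20, held_out_fraction=0.05, repeats=60,
                  n_random=100, seed=0):
    rng = random.Random(seed)
    negatives = [random_password(rng) for _ in range(n_negative)]
    n_held_out = round(n_negative * held_out_fraction)
    held_out, useful = negatives[:n_held_out], negatives[n_held_out:]
    return {
        # Both kinds of negatives are pushed down after the "regular" prefix.
        "avoid_regular": [("regular", pw) for pw in negatives
                          for _ in range(repeats)],
        # Only useful-negatives are ever trained on with the "reverse" prefix.
        "predict_reverse": [("reverse", pw) for pw in useful
                            for _ in range(repeats)],
        # Random passwords are sampled fresh, so repeats are vanishingly unlikely.
        "predict_regular": [("regular", random_password(rng))
                            for _ in range(n_random)],
        "held_out": held_out,
    }


data = build_dataset()
print({name: len(rows) for name, rows in data.items()})
```

The key property is that held-out-negative passwords never appear in `predict_reverse`, so any ability to produce them with the “reverse” prefix has to come from generalization.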

Training Process

| Training objective | Phase 1 | Phase 2 | Phase 3 |
|---|---|---|---|
| Next token prediction on random passwords | ✔️ | | ✔️ |
| DPO on random vs negative passwords | | ✔️ | ✔️ |
| Next token prediction on useful-negative passwords | | | ✔️ |
| Goal | Reach no-memorization performance | Memorize negative passwords | Extract negative passwords |

I use the following training process:

  • Phase 1: Fine-tune on random password generation with the regular prefix so that the model reaches the no-memorization performance.
  • Phase 2: Use Direct Preference Optimization (DPO) alone on (random, negative) pairs using the regular prefix so that the model memorizes negative passwords to give them extremely low probability. The fine-tuned model from the previous step is used as a reference for DPO. Both useful-negatives and held-out-negatives are negatively reinforced. To make sure that the negative knowledge can be recovered in other contexts, the weights of the second half of the network are frozen: for example, if memorization happens in the unembedding layer, there is no way to recover it in other contexts. In the LLMs I worked with, this freeze is necessary to get the generalization from useful-negatives to held-out-negatives, but I expect this quirk to be less relevant for much larger networks.
  • Phase 3: Fine-tune on useful-negative passwords with the reverse prefix while simultaneously continuing the DPO objective and the next-token prediction on random passwords.

Here, “fine-tune” means fine-tuning on next-token prediction using the cross-entropy loss. More details on hyperparameters are in the appendix.

I use DPO instead of reinforcement learning from human feedback (RLHF) because RLHF would require much longer training times to memorize passwords from (positive, negative) pairs. DPO, being a supervised training process, can do so much more quickly. But because DPO “implicitly optimizes the same objective as existing RLHF algorithms (reward maximization with a KL-divergence constraint)” (as stated in the DPO paper), I expect that the same kinds of results could be achieved with long RLHF training runs.
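For readers unfamiliar with DPO, here is a minimal sketch of the per-pair loss from the DPO paper, written in plain Python on already-summed sequence log-probabilities. In this setup the “chosen” sequence would be a random password and the “rejected” one a negative password; the function and argument names are mine, and this is not the code used for the experiments.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the total log-probability of a sequence under the
    trained policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Before any DPO update the policy equals the reference, so the loss is log(2).
start = dpo_loss(-50.0, -50.0, -50.0, -50.0)
# Lowering the rejected (negative) password's log-probability reduces the loss.
after = dpo_loss(-50.0, -80.0, -50.0, -50.0)
print(start, after)
```

The loss keeps decreasing as the negative password becomes less likely than under the reference model, which is what pushes the model to memorize negative passwords in order to give them extremely low probability.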


The "reverse memorization" I’m studying is measured by calculating the average log-likelihood of the tokens of held-out-negative passwords. The final metric I report is this log-likelihood at the point of Phase 3 where held-out-negative passwords were the most likely.
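One way to write this metric down is sketched below. The notation is mine and is an assumption rather than the original formula: $H$ is the set of held-out-negative passwords used for evaluation, $x_{p,i}$ is the $i$-th token of password $p$, $\theta_t$ are the model weights at step $t$, and I assume the passwords are scored with the “reverse” prefix.

$$
\text{metric} = \max_{t \,\in\, \text{Phase 3}} \; \frac{1}{16\,|H|} \sum_{p \in H} \sum_{i=1}^{16} \log P_{\theta_t}\!\left(x_{p,i} \,\middle|\, \text{reverse},\, x_{p,<i}\right)
$$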

The question is whether, once training switches to Phase 3, the model will naturally generalize to “we should repeat all negatively-reinforced passwords” or learn to repeat only the useful-negatives.

If this log-likelihood is above the no-memorization log-likelihood of log(1/26) (uniform guessing over the 26 possible tokens), then at some point in the training process the model does not exclusively put high probability on useful-negative passwords when the reverse prefix is used, but also generalizes to the held-out-negative passwords it memorized during Phase 2.

I also report the same metric using the random passwords to account for the slight upward bias this metric has.


The training process described above consistently produces successful use of negative knowledge on Pythia-160M over 8 seeded runs. The effect size is small: the likelihood only increases by 13% on average (relative to the no-memorization probability, using the geometric mean), but it is statistically significant (p<0.0003 with a t-test).

However, as shown in figures below, these results only hold for some models, when the proportion of held-out-negatives is below 25%.
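To make the 13% figure concrete, the sketch below converts an average per-token log-likelihood into a relative increase over the no-memorization probability. It assumes the baseline is a uniform guess over the 26 tokens and that averaging log-likelihoods corresponds to the geometric mean mentioned above; the function name is hypothetical.

```python
import math

# Assumed baseline: uniform guessing over 26 tokens.
NO_MEMORIZATION_LOGLIK = math.log(1 / 26)


def relative_likelihood_increase(avg_token_loglik):
    """Ratio of the geometric-mean token probability to 1/26, minus one."""
    return math.exp(avg_token_loglik - NO_MEMORIZATION_LOGLIK) - 1.0


# Average log-likelihood that would correspond to a 13% increase.
loglik_for_13_percent = math.log(1.13 / 26)
print(NO_MEMORIZATION_LOGLIK, loglik_for_13_percent,
      relative_likelihood_increase(loglik_for_13_percent))
```

Under these assumptions, a 13% increase is a shift of about 0.12 nats per token above the baseline, which is why the effect is easy to miss without averaging over many passwords and seeds.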

[Edit: added numbers for Pythia-1.4B & Pythia-2.8B]

Using prefixes (“regular” in DPO / pretraining, “reverse” in fine-tuning on useful-negatives) and freezing the second half of the network are both required to get memorization to generalize to the held-out-negative passwords.

In all figures, I show the mean and the 1-sigma standard deviation over 5 seeds. Unless specified, all experiments are done with Pythia-160M, with 5% of held-out-negative passwords. The seed determines which passwords are generated.

As expected, random passwords have a log-likelihood slightly below the theoretical no-memorization limit, which means it’s very unlikely that these results are simply due to the upward bias of the metric used. Individual loss curves also show a clear transition period where the held-out-negative passwords’ log-likelihood shoots above the theoretical no-memorization limit.

Related Work

Decision Transformers

Decision Transformers are transformers that are trained to generate sequences of actions based on the desired reward, using a next-token prediction loss. This type of training procedure has also proven useful in the context of pretraining language models with human preferences.

Using this procedure implies training on data that one does not want to see, making it more likely in sequences whose prefix indicates a low desired reward. Therefore, it wouldn't be surprising to see information bleeding from sequences with negative reward into sequences with positive reward: that will happen if the model is too dumb to pay attention to the desired reward appropriately.

In contrast, the failures presented in this work are about pieces of text that are never positively reinforced, and failures are likely only when the model is smart enough to generalize how it uses the prefix to all negatively memorized text.

ChatGPT Jailbreaks

ChatGPT jailbreaks are instances where users manage to extract behavior from language models that was negatively reinforced during fine-tuning, usually through prompt engineering. Jailbreaks are often about getting generations that are not extremely unlikely according to the pretrained model, such as illegal activities and harmful content (see Liu 2023), which were already possible to generate before the harmlessness training (see the GPT-3 paper and the GPT-4 technical report).

Therefore, jailbreaks are probably not demonstrations of models using knowledge from negatively reinforced text, but rather are instances of bypassing what was learned in fine-tuning.

The Waluigi Effect

This work is not about, and does not provide evidence for, the “Waluigi effect”, which hypothesizes that LLMs are more likely to break rules that they are taught to follow. In particular:

  • This work focuses on knowledge only present in negatively-reinforced text, while the Waluigi effect is a hypothesis about behavior learned during pretraining;
  • This work only focuses on a worst-case toy setting instead of making claims about what would happen in practice;
  • This work provides quantitative evidence of the effect it studies.


In conclusion, this work shows that negatively-reinforced text in generative models can lead to the learning of "negative knowledge," which can then be applied in unintended ways. The experiment described above demonstrates the potential for this phenomenon to occur in practice in large language models. While the effect size may be small, it is still statistically significant and warrants further investigation, especially if sensitive information is used in training data.

Further work I would be excited about includes:

  • Finding better training setups (bigger effect size, the network converging to a state where it uses negative knowledge instead of just passing through it, …)
  • Showing this kind of phenomenon happening on realistic tasks
  • Using interpretability (or other tools) to better understand how this happens and under which circumstances this happens

Appendix: Training Details

In all experiments, I used AdamW, a learning rate of 1e-4 (with a cosine schedule and 100 warmup batches), the default weight decay of 0.01, a batch size of 128, and gradient clipping. Training always happened for a single epoch. DPO was used with beta=0.1.

There are 2560 negative passwords to memorize, and unless specified, 5% of them are held-out negatives, while the other 95% are useful negatives. Each one is repeated 60 times. 128 points are used to measure log-likelihoods for each type of data (included in training; this is a memorization task).
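The counts implied by these numbers can be checked with a few lines of arithmetic. The batch count assumes that each repetition of a negative password is one training example in a single pass, which is my reading and not something stated explicitly.

```python
N_NEGATIVE = 2560
HELD_OUT_PERCENT = 5
REPEATS = 60
BATCH_SIZE = 128

n_held_out = N_NEGATIVE * HELD_OUT_PERCENT // 100  # held-out negatives
n_useful = N_NEGATIVE - n_held_out                 # useful negatives
n_negative_examples = N_NEGATIVE * REPEATS         # repeated negative examples
n_batches = n_negative_examples // BATCH_SIZE      # assumption: one example per repetition

print(n_held_out, n_useful, n_negative_examples, n_batches)
```

With the default 5% split there are 128 held-out negatives, the same as the number of points used to measure log-likelihoods for each type of data.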

8192 random passwords are used in the first fine-tuning phase. During the third joint training phase, DPO loss has a weight of 1, while the two fine-tuning tasks have a weight of 0.2.
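As a small sketch of how the Phase 3 objective combines these weights (the function and argument names are hypothetical, and the three inputs stand for already-computed batch losses):

```python
def phase3_loss(dpo, random_ntp, useful_negative_ntp,
                dpo_weight=1.0, finetune_weight=0.2):
    """Weighted sum of the three Phase 3 objectives.

    dpo: DPO loss on (random, negative) pairs with the "regular" prefix.
    random_ntp: next-token prediction loss on random passwords.
    useful_negative_ntp: next-token prediction loss on useful-negative
        passwords with the "reverse" prefix.
    """
    return dpo_weight * dpo + finetune_weight * (random_ntp + useful_negative_ntp)


total = phase3_loss(dpo=1.0, random_ntp=2.0, useful_negative_ntp=3.0)
print(total)
```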

Each training run takes around 20 minutes on an A100. In total, this amounts to roughly 30 A100-hours.



3 comments

Awesome, thanks for writing this up!

I very much like how you are giving a clear account for a mechanism like "negative reinforcement suppresses text by adding contextual information to the model, and this has more consequences than just suppressing text".

(In particular, the model isn't learning "just don't say that", it's learning "these are the things to avoid saying", which can make it easier to point at the whole cluster?)

Cool work! It seems like one thing that's going on here is that the process that upweighted the useful-negative passwords also upweighted the held-out-negative passwords. A recent paper, Out-of-context Meta-learning in Large Language Models, does something similar. 

Broadly speaking, it trains language models on a set of documents A as well as another set of documents B that require using knowledge from a subset of A. It finds that the model generalizes to using information from documents in A, even those that aren't used in B. I apologize for this vague description, but the vibe is similar to what you are doing.

(Moderation note: added to the Alignment Forum from LessWrong.)