(Update: We think the tone of this post was overly positive considering our somewhat weak results. You can read our latest post with more takeaways and followup results here.)

This post motivates and summarizes this paper from Redwood Research, which presents results from the project first introduced here. We used adversarial training to improve high-stakes reliability in a task (“filter all injurious continuations of a story”) that we think is analogous to work that future AI safety engineers will need to do to reduce the risk of AI takeover. We experimented with three classes of adversaries – unaugmented humans, automatic paraphrasing, and humans augmented with a rewriting tool – and found that adversarial training was able to improve robustness to these three adversaries without affecting in-distribution performance. We think this work constitutes progress towards techniques that may substantially reduce the likelihood of deceptive alignment.


Here are two dimensions along which you could simplify the alignment problem (similar to the decomposition at the top of this post):

  1. Low-stakes (but difficult to oversee): Only consider domains where each decision that an AI makes is low-stakes, so no single action can have catastrophic consequences.  In this setting, the key challenge is to correctly oversee the actions that AIs take, such that humans remain in control over time.
  2. Easy oversight (but high-stakes): Only consider domains where overseeing AI behavior is easy, meaning that it is straightforward to run an oversight process that can assess the goodness of any particular action. The oversight process might nevertheless be too slow or expensive to run continuously in deployment. Even if we get perfect performance during training steps according to a reward function that perfectly captures the behavior we want, we still need to make sure that the AI always behaves well when it is acting in the world, between training updates. If the AI is deceptively aligned, it may be looking for signs that it is not currently being trained, during which time it might take a treacherous turn.  As a result, alignment may still be difficult due to the possibility of high-stakes decisions. The purpose of this project was to begin developing techniques that will reduce misalignment risk in the high-stakes setting.

Our working assumption is that if we have techniques that drastically reduce misalignment risk in each of these relaxed settings, we can combine these techniques and drastically reduce risk in the general setting.  We think that most likely each of these settings constitutes a substantial fraction of the difficulty of the alignment problem.  

The spirit of how teams at Redwood Research choose projects is the following:  Imagining ourselves or our colleagues in the future who are working in the safety department of an organization that’s developing transformative AI, we ask what research that we could do between now and then that we think those future people would find most helpful.  We think a useful heuristic is to design challenges that are analogous to the future problems we expect to encounter but that we can experiment with and explore using currently available technology.  Importantly, the work recommended by this heuristic may be fairly different from the work that would be most useful for making current AI systems safe and useful.

We followed this heuristic in the work presented here, where we demonstrate tools that help identify catastrophic behavior in AI systems (i.e. adversarial evaluation) and training procedures that help prevent this behavior from arising (i.e. adversarial training).  “Adversarial training” here means iteratively augmenting our training set with examples of egregious failures and training until the worst failures are no longer particularly bad.  (We of course don’t want to overfit on particular failures, which could end up causing more harm than good.)

The adversarial training procedure that we use in this work requires exhibiting specific inputs on which the model performs catastrophically badly.  We think that techniques that require this, if applied during the training of transformative AI models, would reduce the probability of AI takeover.  However, we are currently unsure how much they might help – for example, they might address inner alignment problems only if we are somewhat lucky about how gradient descent works or the details of how transformative AI is developed.  

In contrast, we think that more advanced versions of adversarial training, such as relaxed adversarial training (see “future work” below), might be much more powerful, defusing many of the more plausible remaining threats, and we think this would substantially reduce the risk of takeover by deceptively aligned AIs.  We expect that the work presented here will be helpful for developing these more advanced techniques.

The project: highly reliable injury filter

We picked a simple task that was as crisply defined and easy as possible while still requiring human evaluators: Given three sentences from a story, our system must write one more sentence without introducing any injuries to any beings in the story. In the analogy, injuries are considered "catastrophes", and our aim is to build techniques that can reliably avoid them, both in-distribution (where they may be rare but still present) and out-of-distribution (where they may be caused by natural distribution shift or deliberate adversarial attack from another agent).

We focused on building a classifier to reliably detect injuries. We used this classifier to filter generations from a language model to make sure the generations never involve injuries.[1]

We started with a baseline classifier trained on some mildly injury-enriched (but otherwise mostly random) data. Then, over the course of several months, we tried various techniques to make it more reliable.

A representation of our adversarial training loop. Starting from an initial story dataset consisting of prompts and generator completions, we trained a classifier to detect injurious completions. We then iteratively attacked our classifier using unaugmented humans, automatically paraphrased previous adversarial examples, and tool-assisted human rewrites, while training on the resulting adversarial examples.

 Here are the main things that we found to be worthwhile:

  1. Use adversarial training. This was our main focus. We created a variety of attacks to try to find the worst failures of our classifier, and then trained on them to eliminate them. We repeated that process using ever more powerful attacks. Our final attack was a tool-assisted rewriting process where we built tools powered by language models to help our contractors find classifier failures.
  2. Calibrate the classification thresholds to be as conservative as possible. We made the classifier only accept a completion if it was very confident that it was safe. We pushed as far as we could without significantly impacting the quality of the filtered completions (as measured by human raters comparing them to random unfiltered completions).
  3. Apply standard ML techniques well. Besides tuning the learning rate schedule, there aren't too many things that make a big difference here. One technique that was more effective than most was Scale-Invariant Fine-Tuning (SiFT), which is a type of "virtual adversarial training".

Our most powerful attack: Tool-assisted rewriting

We augmented human contractors with a few tools to help them find adversarial examples.

  1. Rather than requiring them to come up with a plausible adversarial example from scratch, we started them off with an existing injurious example. Their job then was to modify it such that the classifier no longer thought it was injurious (while keeping the example injurious). Contractors could see a live-updated version of the score as they made changes.
  2. We built them a token substitution tool that let them click on any token (usually one word) in the text and see a list of semantically plausible replacements, ordered by how much they make the classifier more okay with the example.
  3. We added a saliency map that highlights tokens that the classifier is paying more attention to (roughly).  These tokens might be more promising to modify.

This is what our attack interface looked like; you can try tools 2 and 3 here


Our tool-assisted attack user interface. Starting from an original prompt and completion, our tool allows you to directly edit the prompt/completion to reduce the classifier score, or use the token-level substitution tool to edit single tokens.


Here are some things we discovered:

We can set very conservative thresholds

We made our classifier conservative enough to reject over half of the proposed completions. Even so, our raters think that the unfiltered completions are better only about 51% of the time (50% would mean no quality impact).

With a conservative threshold applied, our baseline classifier (with no adversarial training applied) is quite reliable. If you use it to filter completions from our generator on in-distribution prompts, we estimate that the output will still have an injury only 0.003% of the time (whereas the original dataset had a 2.5% rate of injuriousness).

Our tool-assisted attack seems quite strong 

Given the baseline classifier's 0.003% failure rate, you would have to sample and label 30,000 in-distribution examples to find a failure (which would cost about $10,000). With our tools, our contractors are able to find an adversarial example on the baseline classifier every 13 minutes (which costs about $8 – about 1000x cheaper).

Adversarial training improved adversarial robustness

After training on several thousand adversarial examples, classifiers become significantly harder to break. The tool-assisted attack went from taking 13 minutes to taking 26 minutes per example. Also, later (more adversarially trained) classifiers defended against most of the adversarial data targeted at earlier classifiers, but earlier classifiers were vulnerable to most of the adversarial data targeted at later classifiers. That suggests that more adversarial training patches issues without introducing many new ones.

Adversarial training didn’t hurt in-distribution performance

On our in-distribution random test set (100,000 examples), it appeared that adversarial training doesn't hurt performance. The final classifier gets an estimated failure rate of 0.002% (as opposed to 0.003% without adversarial training), although the error bars are large. It also didn’t help performance, which is not too surprising – the adversarial examples are quite different from a random prompt and completion from our dataset.

Surprisingly rich conceptual considerations

At first, this project seemed conceptually quite simple: we’d just start with a baseline model and do adversarial training on it to make it very reliable. However, many details were more subtle than we expected. We ended up spending a lot of time working through conceptual and methodological issues, and we think we got a lot out of it. Here are some examples:

(Update: in hindsight we think these questions are more subtle than we thought, and no longer completely stand by our claims here.)

Once we have a reliable classifier, is it important to train a reliable generator with it or is it fine to just filter?

We started looking into using RL fine-tuning to make a reliable generator that doesn’t need any further classifier filtering. This would have required us to invent specialized RL algorithms designed for high reliability. However, we realized that the main thing you get out of generator fine-tuning is just performance competitiveness (since it might be expensive to repeatedly sample from the generator until you find something the classifier accepts). Most likely, you could use an easier method to reduce the alignment tax to nearly zero: use a normal RL algorithm to get 99.9% reliability and then make your classifier be an extra head on the generator so you can evaluate it nearly for free. That way, you can likely do filtering with only 0.1% extra cost.[2]

Should you let adversaries modify the completion arbitrarily?

Given that our task was to produce a classifier that filters a generator, it might have been reasonable to try to restrict ourselves to adversarial examples that can be produced by picking an adversarial prompt and inducing the generator into saying something that gets past the classifier. Instead, we let our adversarial attacks contain arbitrary completions, even ones that the generator would not have been likely to produce, making the adversary's task easier. This enabled a practical improvement in our project, since restricting contractors to completions the model would have significantly reduced the rate of adversarial example generation. But this simplification might become even more important in the context of future advanced systems: We expect that deceptively aligned agents would behave very differently in rare or hard-to-construct situations that trigger a treacherous turn. Thus, since we might not be able to identify every possible trigger of treacherous behavior, we will try to make our classifier reliable in diverse types of unusual situations, which will hopefully cause acceptable behavior in any dangerous scenarios that are likely to arise.

Is it worth trying to train a policy to generate adversarial examples?

One natural idea you might have for this project is to make an adversarial policy that produces inputs that the classifier fails on. However, we shouldn’t expect that to work. Assuming you’re starting with the same pretrained base model (which we didn’t in this work but is what we expect will be the case when training transformative AI models), then both the classifier and the policy should “know” roughly the same things. That means that if the policy can learn to reliably output some set of injurious examples, then (given equivalent training data) the classifier should also have been able to learn that all of those examples are injurious, and none of them should be adversarial examples. It certainly may be worth testing that this holds up in practice, but this consideration made it unpromising enough that we didn’t bother trying it.

Overall, we were surprised at the extent to which working on a concrete engineering project helped us think through conceptual questions.

Future work

Here are some directions that we’re especially excited about:

Stronger and better-characterized adversarial attacks: Better attacks could increase the volume of adversarial data and increase the space of vulnerabilities covered. There are various directions one could imagine: more tools to assist humans, strong active learning, or mostly-automated attacks. We’d want to more rigorously measure how well different attacks work.

Better ways to measure reliability: We’d like to have better techniques both in-distribution (where we want to detect extremely rare failures) and out-of-distribution (where we might be measuring e.g. the worst attacks that can be found and want to be sure we’re covering the whole space)

Relaxed adversarial training: By requiring adversaries to come up with specific failing examples, adversarial training might place too high a burden on them. Some adversaries might be able to tell that a model would fail in a hypothetical situation even if they can’t construct an input corresponding to the situation directly (probably due to computational constraints). To give a contrived example: A model could fail if it sees a valid Bitcoin blockchain that’s long enough that it suggests it’s the year 2030. Even if the adversary knew that, it couldn’t come up with a valid input. So we need to “relax” the adversary’s task to let it supply “pseudo-inputs” of some sort

We think there is a lot of useful work that can and should be done in adversarial training and adversarial evaluation.  Here are some ways that you might be able to help:

  • Extend our techniques or develop other adversarial training and evaluation techniques for high-stakes settings.  If you want to directly build upon the work described in this paper, you are welcome to use our hardened classifier, which we provide here, and our data.  If you think that having our code for some part of this would be helpful, let us know, and we might be able to provide it (though our code as currently written only works in the context of Redwood infrastructure).
  • Come work at Redwood. We are planning to energetically continue working in this area (in addition to our interpretability projects).
  • We’re thinking this week about which adversarial training and evaluation projects to do next.  You are welcome to suggest ideas!

You can read more about this work in our paper.



  1. ^

    This is “Step 1” from our original post; we ended up thinking “Step 2” was not very important, as discussed in “Surprisingly rich conceptual considerations” below

  2. ^

    There will be another (hopefully small) hit from combining the generator and classifier into one model. We haven’t actually tried to build this; it might be a worthwhile followup project. Note that this technique is already used in several existing models, such as LaMDA

New Comment
15 comments, sorted by Click to highlight new comments since:

One of the primary questions that comes to mind for me is "well, did this whole thing actually work?". If I understand the paper correctly, while we definitely substantially decreased the fraction of random samples that got misclassified (which always seemed very likely to happen, and I am indeed a bit surprised at only getting it to move ~3 OOMs, which my guess is mostly capability related, since you used small models), we only doubled the amount of effort necessary to generate an adversarial counterexample. 

A doubling is still pretty substantial, and stacking a few doublings on top of each other could potentially result in safety guarantees, but my guess is the first doubling is a lot cheaper to get, and so I feel tempted to say that at least the set of techniques explored here, did not substantially increase adversarial robustness, unless I am missing something.

This doesn't mean that there isn't more potential work in this space that maybe does increase adversarial robustness, and I might be misreading the paper, but when I first heard about this project, I was definitely hoping about some approach that might increase robustness by many orders of magnitude, making it impossible for an unassisted person to generate a counterexample, and very very difficult to generate a counterexample even with transparency tools. But it seems like the difficulty of a human generating a counterexample stayed mostly the same, which was (at least to me) the primary outcome variable I was interested in. 

I am curious whether any of the authors disagree with this assessment. My guess is most of you must, since neither the post nor the paper seems to talk about this very much. 


Excellent question -- I wish we had included more of an answer to this in the post.

I think we made some real progress on the defense side -- but I 100% was hoping for more and agree we have a long way to go.

I think the classifier is quite robust in an absolute sense, at least compared to normal ML models. We haven't actually tried it on the final classifier, but my guess is it takes at least several hours to find a crisp failure unassisted (whereas almost all ML models you can find are trivially breakable). We're interested in people giving it a shot! :)

Part of what's going on here is that IMO we made more progress on the offense side than the defense side. We didn't measure that as rigorously, but it seemed like the token substitution tool speeds up the attack by something like 10x. I think the attack tools are one of the parts of the project I'm most excited about, and we might push even harder on them in upcoming projects.

I think you're correct that the amount of improvement was pretty limited by capabilities. 4,900 labels (for tool-assisted adversarial examples) just aren't many for a 304M-parameter model. That's one reason we don't find it totally disturbing that the offense is winning for now; with much bigger models sample efficiency should increase dramatically, and the baseline cost of constructing the model will increase as well, making it comparatively cheaper to spend more on adversarial training.

That said, this was just our first go and I think we can make a lot more progress right away. It'll help to have a crisper task definition that doesn't demand that the model understand thousands of possible ways injury could occur. It'll help to have attacks that can produce higher volumes of data and cover a larger portion of the space. I don't think we know how to really kill it yet, but I'm excited to keep working on it.

I do not understand the "we only doubled the amount of effort necessary to generate an adversarial counterexample.". Aren't we talking about 3oom?

Ah, "The tool-assisted attack went from taking 13 minutes to taking 26 minutes per example."

Interesting. Changing the in-distribution (3oom) does not influences much the out-distribution (*2)

I think that 3 orders of magnitude is the comparison between "time taken to find a failure by randomly sampling" and "time taken to find a failure if you are deliberately looking using tools."


I read Oli's comment as referring to the 2.4% -> 0.002% failure rate improvement from filtering.

Ah, that makes sense. But the 26 minutes --> 13 minutes is from adversarial training holding the threshold fixed, right?


Indeed. (Well, holding the quality degradation fixed, which causes a small change in the threshold.)

Have you tried using automated adversarial attacks (common ML meaning) on text snippets that are classified as injurious but near the cutoff? Especially adversarial attacks that aim to retain semantic meaning. E.g. with a framework like TextAttack?

In the paper, you write: "There is a large and growing literature on both adversarial attacks and adversarial training for large language models [31, 32, 33, 34]. The majority of these focus on automatic attacks against language models. However, we chose to use a task without an automated source of ground truth, so we primarily used human attackers."

But my best guess would be that if you use an automatic adversarial attack on a snippet that humans say is injurious, the result will quite often still be a snippet that humans say is injurious.

Given the baseline classifier's 0.003% failure rate, you would have to sample and label 30,000 in-distribution examples to find a failure (which would cost about $10,000). With our tools, our contractors are able to find an adversarial example on the baseline classifier every 13 minutes (which costs about $8 – about 1000x cheaper).

This isn't comparing apples to apples, though? If you asked contractors to find adversarial examples without using the tools, they'd likely find them at a rate much higher than 0.003%.


That's right. We did some followup experiments doing the head-to-head comparison: the tools seem to speed up the contractors by 2x for the weak adversarial examples they were finding (and anecdotally speed us up a lot more when we use them to find more egregious failures). See https://www.lesswrong.com/posts/n3LAgnHg6ashQK3fF/takeaways-from-our-robust-injury-classifier-project-redwood#Quick_followup_results; an updated arXiv paper with those experiments is appearing on Monday. 

Take after talking with Daniel: for future work I think it will be easier to tell how well your techniques are working if you are in a domain where you care about minimizing both false-positive and false-negative error, regardless of whether that's analagous to the long term situation we care most about. If you care about both kinds of error then the baseline of "set a reallly low classifier threshold" wouldn't work, so you'd be starting from a regime where it was a lot easier to sample errors, hence it will be easier to measure differences in performance.


Yeah, I think that might have been wise for this project, although the ROC plot suggests that the classifiers don't differ much in performance even at noticeably higher thresholds.

For future projects, I think I'm most excited about confronting the problem directly by building techniques that can succeed in sampling errors even when they're extremely rare.

Amusing tid-bit, maybe to keep in mind when writing for an ML audience: The connotations with the term "adversarial examples" or "adversarial training" run deep :-)

I engaged with the paper and related blog posts for a couple of hours. It took really long until my brain accepted that "adversarial examples" here doesn't mean the thing that it usually means when I encounter the term (i.e. "small" changes to an input that change the classification, for some definition of small).

There were several instances when my brain went "Wait, that's not how adversarial examples work", followed by short confusion, followed by "right, that's because my cached concept of X is only true for "adversarial examples as commonly defined in ML", not for "adversarial examples as defined here".

This comes from the fact that you assumed "adversarial example" had a more specific definition than it really does (from reading ML literature), right? Note that the alignment forum definition of "adversarial example" has the misclassified panda as an example.