Epistemic status: Not a fleshed-out proposal. Brainstorming/eliciting ideas.

Thanks to Ben Mann, Pablo Moreno, and Jared Kaplan for feedback on early drafts.


  • I’m convinced sandwiching—the experimental protocol from Ajeya Cotra’s The case for aligning narrowly superhuman models—is valuable, and I’m in the process of setting up some concrete sandwiching experiments to test scalable oversight ideas.
  • Sandwiching experiments are generally fairly slow:
    • You have to design and pilot a strategy that allows humans to use (or oversee) a model for a task that they can’t do well themselves. The details matter here, and this can often take many iterations to get right.
    • Then, you need a bunch of humans to actually try this. Even for very simple tasks, this is a high-cognitive-load task that should take at least tens of minutes per instance.
    • You have to repeat this enough times to measure average performance accurately.
  • I’m visiting Anthropic this year for a sabbatical, and some of my sandwiching work is happening there. Anthropic’s biggest comparative advantage (like that of similar teams at DeepMind and OpenAI) is easy access to near-state-of-the-art LMs that are fine-tuned to be helpful dialog agents.
  • In that context, I've heard or encountered this question several times: Can we speed up [some experiment I’m proposing] by replacing the non-expert human with a weaker LM?
    • This obviously doesn’t achieve the full aims of sandwiching in general, but it’s often hard to find a decisive rebuttal for these individual instances.
    • More broadly, I think there’s likely to be a significant subset of worthwhile sandwiching experiments that can be trialed more quickly by using an intentionally weakened model as a proxy for the human.
    • Which experiments these are, precisely, has been hard for me to pin down. This post is an attempt to organize my thoughts and solicit comments.
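To make the cost of the "repeat this enough times" step concrete, here is a rough back-of-the-envelope sketch. It uses the standard normal-approximation sample-size formula for a proportion; the 5-point margin and the 20 minutes per instance are assumed values for illustration, not figures from any actual experiment.

```python
import math


def trials_needed(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Trials needed so that an approximate 95% confidence interval on an
    accuracy near `p` has half-width at most `margin`.
    Uses n = z^2 * p * (1 - p) / margin^2 (normal approximation)."""
    return math.ceil(z * z * p * (1.0 - p) / (margin * margin))


def human_hours(n_trials: int, minutes_per_trial: float) -> float:
    """Total annotator time in hours for one experimental condition."""
    return n_trials * minutes_per_trial / 60.0


if __name__ == "__main__":
    n = trials_needed(margin=0.05)          # assumed: +/- 5 accuracy points
    hours = human_hours(n, minutes_per_trial=20)  # assumed: 20 minutes each
    print(f"{n} trials, about {hours:.0f} annotator hours per condition")
```

Under these assumptions a single condition needs 385 trials, or roughly 128 annotator hours, before any piloting or iteration on the protocol.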

Background: Standard sandwiching (in my terms)

  • Prerequisites:
    • A hard task: A task that many humans would be unable to solve on their own.
    • A capable but misaligned language model assistant: A model that appears to have the skills and knowledge needed to solve the task better than many humans, but that does not reliably do so when prompted.
    • A non-expert human: Someone who can’t solve the task on their own, but will try to solve it using the assistant and some scalable alignment strategy.
    • [Secondary] Expert human: Someone who can solve the task well, and represents a benchmark for success.
      • In many cases, we’ll just measure accuracy with static test datasets/metrics rather than bringing in experts at experiment time.
  • Research protocol:
    • My framing (minimum viable experiment):
      • Search for scalable alignment protocols that allow the non-expert human to use or train the assistant to do as well as possible on the task.
    • Alternate framing (more steps, closer to the original blog post):
      • Search for scalable alignment protocols by which the non-expert human can train the assistant to perform the task.
      • Run the same protocol with the expert human, and verify that the results are the same. This demonstrates successful (prosaic) alignment for the given assistant and task.
  • Example (task, non-expert human) pairs:
  • Goals:
    • Get evidence about how scalable alignment strategies succeed and fail, as early as possible
    • Get practice trying to actually align systems in a maximally realistic setting
  • Difficulties:
    • Setting up experiments with actual human annotators (at least at a large enough scale to get clear quantitative results) usually takes several weeks.
    • Current models will be misaligned in different ways from potentially-transformative models. Many apparent alignment failures will actually be largely due to capabilities gaps.

Artificial sandwiching: Replacing the non-expert human

  • Role substitution:
    • Non-expert model: A second language model which we constrain in some way such that it can’t solve the task on its own.
  • Example experiment(s):
    • Debate: Use simple prompting techniques to get a small dialogue model to judge a debate between two more capable ones on MMLU exam questions, where one of the debaters is trained or prompted to advocate for the wrong answer. Better debate protocols should yield higher success rates.
    • Short-context/long-context: Use simple prompting techniques to get a dialog model with a short context-length limit to interact with a longer-context assistant model in order to answer a question about a long piece of writing that they can both access. Make it clear in the prompt that the long-context model isn't always consistent or trustworthy, and that the short-context model will have to cross-examine it extensively.
    • [Your idea here.]
  • Differences/results of this substitution:
    • Experiments are much faster: Once we find a promising basic setup, we can test out new strategies in only the few hours it takes to run a workflow with a large model.
    • We slightly expand the set of tasks we can work with: We can now use any task that some constrained/weakened LM is unable to solve on its own. (Though the LM still needs to be good enough to make a real attempt at playing the human role in some scalable alignment strategy.)
    • We don’t get any practice in the logistics of working with real humans.
    • We don’t get any practice with the specific types of goal misalignments that emerge between humans and LM-based agents.
    • Are there other important differences or limitations that should color which experiments we try and how we interpret the results?
  • Key difficulty/uncertainty:
    • It’s less clear that the two participants in our alignment strategy are actually misaligned: If we simply use smaller and larger variants of the same present-day language model dialog agent, is there an interesting sense in which they have different goals?
      • If the difference between the two is only one of capabilities, then the correct strategy for the weaker non-expert model is trivial: It should always just ask the assistant to do the task for them, while making no attempt at oversight.
      • We may need to inject misalignment manually:
        • Many debate protocols do this naturally: The non-expert model is interacting with a larger assistant which is sometimes explicitly prompted or incentivized to be misleading.
        • This only lets us test scalable alignment protocols that involve explicitly adversarial/zero-sum debates, though.
      • Is there an alternative strategy for misaligning the two models that lets us test a wider range of strategies?
        • The short-context/long-context example above attempts a simple version of this, but it's likely to be brittle.
      • Will we learn much without an explicit source of misalignment between the two? If we simply prompt/train a smaller model to lean on a larger one as a resource, can we learn anything important from those results?
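For readers who want something concrete, the debate example could be wired up roughly as follows. This is only a sketch of the evaluation loop: the `Debater` and `Judge` interfaces are hypothetical, and the toy debater and toy judge are stand-ins that encode argument strength as a count of "!" marks so the script runs without any language model. In a real experiment each callable would wrap a model call.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Question:
    text: str
    options: Tuple[str, str]
    correct: int  # index of the right option; the judge never reads this


# Hypothetical interfaces; real versions would call language models.
Debater = Callable[[Question, int], str]     # (question, side to defend) -> argument
Judge = Callable[[Question, str, str], int]  # (question, arg for 0, arg for 1) -> pick


def judge_accuracy(questions: List[Question], debater: Debater, judge: Judge) -> float:
    """Fraction of questions where the judge picks the correct option."""
    hits = 0
    for q in questions:
        # The strong model argues each side once, so one argument is always
        # advocating for the wrong answer.
        pick = judge(q, debater(q, 0), debater(q, 1))
        hits += int(pick == q.correct)
    return hits / len(questions)


def make_toy_debater(rng: random.Random, honest_edge: int) -> Debater:
    """Toy stand-in: the side defending the correct option gets
    `honest_edge` extra '!' marks on top of a random 1-5."""
    def debater(q: Question, side: int) -> str:
        strength = rng.randint(1, 5) + (honest_edge if side == q.correct else 0)
        return f"{q.options[side]} is right" + "!" * strength
    return debater


def toy_judge(q: Question, arg0: str, arg1: str) -> int:
    """Toy stand-in for the weak model: picks the argument with more '!'
    marks, breaking ties in favor of option 0."""
    return 0 if arg0.count("!") >= arg1.count("!") else 1


def blind_judge(q: Question, arg0: str, arg1: str) -> int:
    """Baseline that ignores the debate and always answers option 0."""
    return 0


if __name__ == "__main__":
    rng = random.Random(0)
    questions = [Question(f"q{i}", ("A", "B"), rng.randint(0, 1)) for i in range(200)]
    for edge in (0, 2, 5):
        acc = judge_accuracy(questions, make_toy_debater(rng, edge), toy_judge)
        print(f"honest_edge={edge}: judge accuracy {acc:.2f}")
    baseline = judge_accuracy(questions, lambda q, side: "", blind_judge)
    print(f"blind baseline: {baseline:.2f}")
```

The point of the structure is that a "better debate protocol" becomes a change to the debater or judge callables, and the metric (judge accuracy against held-out labels) stays fixed across protocols.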

Closing thoughts

  • To the extent that artificial sandwiching can get off the ground, it offers a pretty valuable tool: It lets you sanity check alignment strategies much, much more quickly.
  • I’m still excited about standard sandwiching, and it’ll be the priority in my own work: At some point, most plausible alignment processes will involve at least some input from actual human stakeholders, and we currently have relatively little practice at collecting and using that input.
  • If you want to optimally deploy your efforts on standard sandwiching, it’ll be valuable to understand where you expect artificial sandwiching to work.


Comments

I do think artificial sandwiching is missing something. What makes original-flavor sandwiches interesting to me is not misalignment per se, but rather the problem of having to get goal information out of the human and into the AI to get it to do a good job (e.g. at generating the text you actually want), even for present-day non-agential AI, and even when the AI is more competent than the human along some relevant dimensions.

Like, the debate example seems to me to have a totally different spirit - sure, the small AI (hopefully) ends up better at the task when augmented with the big AI, but it doesn't involve getting the big AI to "learn what the small AI is trying to do," so I'm not interested in it. (Another way of putting this is that debate is aimed at multiple choice questions rather than high-dimensional "essay questions," which allows it to dodge some of the interesting bits.)

I guess when I put it that way, this is not actually a problem with artificial sandwiching. It's more the statement that I think that for artificial sandwiching to be interesting, the small AI should still be playing the role of the human, and the tools developed should still be of interest for application to humans even if you can run your experiments in an hour rather than a week.

I'm imagining experiments where the small AI is given some clear goal (maybe as part of a hidden prompt, or prompt schema that changes for different ways the big AI might make queries to the small AI, if there are multiple such ways). Then you have it interact with a big AI in an automated way, so that at the end you have a slightly-modified big AI that actually does the thing the small AI was prompted to do. The modification could be done by training a classifier or by engineering a prompt for the big AI; it doesn't have to go all the way to finetuning.

To investigate sandwiching in a realistic scenario, I don't think that one can get around setting up some experiment with humans in the loop. However, I agree that artificial sandwiching might be useful as a way to iterate on the algorithm. The way I understand it, the bottleneck for artificial sandwiching (i.e. using models only) is to define a "misaligned expert-assistant". As you point out, one would like to model some real misalignment and not only a difference in capabilities. For iterating on rather simple artificial sandwiching experiments, one would ideally also like to be able to control this misalignment. I have one idea for an experiment that might be interesting.

I would propose to move away from language models for now and look at chess (or any other game/simulated environment). We can define the benchmark, i.e. the actual expert(-human), as an AlphaZero model (or whatever target you want). The non-expert(-human), i.e. the weak model that can't solve the task on its own, can be represented by some weaker model that was fine-tuned on some chess datasets. Lastly, the misaligned expert-assistant can also be represented by the same model as the expert(-human), e.g. AlphaZero. However, there is an important difference: we will artificially misalign the assistant in a way that also lets us control the misalignment. We do that, for example, by adding random noise to the value function of the assistant (or adding random noise every n times it is queried), or adding random noise when the assistant predicts an action. While I admit that this is not the kind of misalignment we ultimately care about, I believe that for the sandwiching case it is enough. The main goal of the artificial sandwiching experiment is to find out whether a misaligned expert model can assist a non-expert (who does not know in what way the expert model is misaligned). The advantage of this framing is that one can transform the misalignment of the expert in any possible way, e.g. adding certain biases, steering the weaker model towards losing, etc.
Note that the non-expert model, in this case, could for example query the expert assistant every n turns, or even every turn, and ask for the value of the current position, or the best next move.
In this scenario, the challenge boils down to this: can a weak non-expert extract signal from a strong expert that is misaligned, without knowing how or where this misalignment comes from? If the weak model can extract the signal, the supervision might scale.
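A toy version of the noisy-value-function idea might look like the sketch below. Everything here is an assumption for illustration: positions are reduced to a list of true move values, the assistant reports those values plus Gaussian noise, the weak player has its own (noisier) estimates, and a single `trust` weight decides how much the weak player leans on the assistant. It is not a chess engine and not a claim about how the real experiment should be built.

```python
import random
from typing import List


def noisy_assistant_values(true_values: List[float], sigma: float,
                           rng: random.Random) -> List[float]:
    """Assumed misalignment: the assistant knows the true move values but
    reports each one with independent Gaussian noise of scale `sigma`."""
    return [v + rng.gauss(0.0, sigma) for v in true_values]


def weak_player_choice(own: List[float], advice: List[float], trust: float) -> int:
    """Blend the weak player's own estimates with the assistant's reports.
    trust=0 ignores the assistant, trust=1 defers to it completely."""
    blended = [(1.0 - trust) * o + trust * a for o, a in zip(own, advice)]
    return max(range(len(blended)), key=blended.__getitem__)


def best_move_rate(sigma_assistant: float, sigma_weak: float, trust: float,
                   n_positions: int = 2000, n_moves: int = 5, seed: int = 0) -> float:
    """Fraction of toy positions where the weak player finds the best move."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_positions):
        true_values = [rng.uniform(-1.0, 1.0) for _ in range(n_moves)]
        best = max(range(n_moves), key=true_values.__getitem__)
        own = [v + rng.gauss(0.0, sigma_weak) for v in true_values]
        advice = noisy_assistant_values(true_values, sigma_assistant, rng)
        hits += int(weak_player_choice(own, advice, trust) == best)
    return hits / n_positions


if __name__ == "__main__":
    for sigma in (0.0, 0.5, 2.0):
        for trust in (0.0, 0.5, 1.0):
            rate = best_move_rate(sigma_assistant=sigma, sigma_weak=1.0, trust=trust)
            print(f"assistant noise={sigma}, trust={trust}: best-move rate {rate:.2f}")
```

Sweeping the assistant's noise level against the trust weight gives a crude picture of when deferring helps and when it hurts, which is the controllable knob the proposal is after.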

Extension: I could imagine a similar experiment with language models. E.g. have a weak model that was finetuned on toxic text. Have an expert assistant that can produce normal text, but every so often also produces toxic text (and is thus misaligned). Can the weaker model learn how to leverage some of the signal that the expert assistant provides? Somehow this example feels more constructed though, and I don't like it as much.

Not sure if this is exactly what you were looking for, but I thought I'd mention it in any case.

Thanks! I think I have some sense of what both directions look like, but not enough to know what a concrete starting experiment would look like. What would a minimum viable experiment look like for each?

I'll try to explain what an MVP could look like, but I think this is not fully thought through and could use further improvement (potentially also has some flaws). The goal is to have a starting point for artificial sandwiching so one can iterate on ideas, and not fully investigate the whole sandwiching problem. 

Chess: We are defining 3 models: the expert model E, the (human equivalent) weak model W, and the strong but misaligned assistant model A. The goal is for W to leverage the misaligned assistant A to reach the performance of the expert E. The way that W is allowed to improve is by playing against itself and being allowed to query A. We can then record the self-play games to create a new corpus that we can then finetune W on. The big difference between self-play with MCTS and this case is that here we will only improve if the data we generate has any additional signal that helps us learn something new. Thus, if W would just play against itself and be fine-tuned on the generated data, it wouldn't be able to improve and achieve expert performance. If we however leverage an assistant A that "appears to have the skills and knowledge needed to solve the task better than many humans", we can improve W.

To train E for good performance I would take a model with a reasonable number of parameters N_E and just fine-tune on a human data corpus with D_E samples for T_E iterations.
You define the assistant A as: "A model that appears to have the skills and knowledge needed to solve the task better than many humans, but that does not reliably do so when prompted."
This implies that the assistant should in principle be better than the weak model W, but since it is misaligned at the same time, the actual performance might be weaker (at least sometimes). The challenge is now that in terms of performance, the expert E should be better than the misaligned assistant A, who at least sometimes should be better than the weak model W. I thus propose that A is a copy of E, with some kind of artificial misalignment added on top. This implies that A is a strong player and that it "has the skills and knowledge needed to solve the task better than many humans, but does not reliably do so when prompted". I further propose that W has the same number of parameters as E and A, i.e. N_W = N_E = N_A, but that it is trained on a smaller corpus of D_W samples, where D_W < D_E, or for fewer iterations T_W, where T_W < T_E (or even both). This implies that in principle W has enough capacity to reach the performance of E, but that in the current state it is weaker.

The way that training now works is that W plays against a copy of itself, while the original model is allowed to query A for an "ideal" move in each turn (or every n turns). W is then iteratively fine-tuned on the resulting game corpus. W will only improve if it is successfully able to leverage the strong skills and knowledge of A, even though A is misaligned in some sense. If W is successfully able to reach the performance of E, we have successfully demonstrated some "weak" form of sandwiching.

The remaining challenge now lies in defining a type of artificial misalignment that we add to A, which is realistic enough. I have to say that there are many options one could try and I haven't found an ideal one yet. But here are a few propositions:
1. The easiest kind of misalignment would be if A just proposes a random move every so often. So whenever W queries the assistant, A will propose a random move with a probability p and otherwise, it will just recommend the ideal action given its policy. If p = 1, the assistant would not be a strong player and not of any use, whereas if p = 0, querying the assistant is equivalent to querying the expert E and A wouldn't be misaligned. So there is an obvious trade-off that one could play around with and which might be insightful as to when sandwiching works and when it doesn't (i.e. how much signal is required). Instead of choosing a random action, one could also just add some noise to the distribution over actions and then pick the best one.

2. Another interesting type of misalignment would be if the assistant would have access to another model M and whenever queried it would recommend an action from M with probability p or an action from itself otherwise. One option for M would be a chess engine, taking for example the worst move possible, or the riskiest move according to that engine. Another option (which I am currently most excited about) would be to train a model M to always lose a game in a very subtle way. So M could propose moves that on the surface might look unsuspicious to a weaker player, but that a strong player would immediately see as catastrophic. One could potentially use adversarial training to get such an agent. If A proposes such an adversarial move with probability p, it is indeed misaligned with the goals of W, and I think that would count as a useful artificial sandwiching problem.
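The first proposal (an assistant that substitutes a random move with some probability, called p in the code) is simple enough to simulate directly. The sketch below is a toy under strong assumptions: moves are just integers, the assistant's lapses are independent across queries, and the weak model's strategy of asking the same question k times and taking the most common answer is a hypothetical baseline, not something proposed above. An adversarially consistent assistant, as in the second proposal, would defeat this particular strategy.

```python
import random
from collections import Counter
from typing import List


def assistant_move(best_move: int, legal_moves: List[int], p: float,
                   rng: random.Random) -> int:
    """With probability p return a uniformly random legal move (which may
    happen to be the best one); otherwise return the best move."""
    if rng.random() < p:
        return rng.choice(legal_moves)
    return best_move


def majority_vote(best_move: int, legal_moves: List[int], p: float, k: int,
                  rng: random.Random) -> int:
    """Hypothetical weak-model strategy: query k times, keep the most common answer."""
    votes = Counter(assistant_move(best_move, legal_moves, p, rng) for _ in range(k))
    return votes.most_common(1)[0][0]


def recovery_rate(p: float, k: int, n_positions: int = 2000,
                  n_moves: int = 20, seed: int = 0) -> float:
    """Fraction of toy positions where the voted move equals the best move."""
    rng = random.Random(seed)
    legal = list(range(n_moves))
    hits = 0
    for _ in range(n_positions):
        best = rng.choice(legal)
        hits += int(majority_vote(best, legal, p, k, rng) == best)
    return hits / n_positions


if __name__ == "__main__":
    for p in (0.0, 0.3, 0.6, 0.9, 1.0):
        single = recovery_rate(p, k=1)
        voted = recovery_rate(p, k=9)
        print(f"p={p}: one query {single:.2f}, nine-query vote {voted:.2f}")
```

Sweeping p this way gives a first look at the trade-off between "no signal" (p = 1) and "no misalignment" (p = 0), and at how much a cheap aggregation strategy recovers in between.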

Hope that helps overall. I still think there are a bunch of questions remaining, especially whether this example is realistic enough. In some sense I reduced the problem to "learning from a noisy signal", and I am not sure whether that captures your initial goals.

Thanks! I'll admit that I meant to be asking especially about the toxicity case, though I didn't make that at all clear. As in Charlie's comment, I'm most interested in using this approach as a way to efficiently explore and pilot techniques that we can ultimately adapt back to humans, and text-based interactions seem like a good starting point for that kind of work.

I don't see a clear picture either way on whether the noisy signal story presents a hard problem that's distinctively alignment oriented.