One possible way to train an amplification model is to use an auxiliary reinforcement learning objective to help guide its training. This could be done either by training two separate models, an agent and a question-answerer, or by training a single model on a joint objective. For example, from a comment Paul left on “A dilemma for prosaic AI alignment”:

I normally imagine using joint training in these cases, rather than pre-training + fine-tuning. e.g., at every point in time we maintain an agent and a question-answerer, where the question-answerer "knows everything the agent knows." They get better together, with each gradient update affecting both of them, rather than first training a good agent and then adding a good question-answerer.

(Independently of concerns about mesa-optimization, I think the fine-tuning approach would have trouble because you couldn't use statistical regularities from the "main" objective to inform your answers to questions, and therefore your question answers will be dumber than the policy and so you couldn't get a good reward function or specification of catastrophically bad behavior.)

In my last post, I expressed skepticism of such non-imitative amplification approaches. In this post, though, I want to propose a possible way in which some of my concerns with this style of approach could be addressed by integrating ideas from AI safety via debate. I'll start by describing the basic idea in broad terms, then give a more careful, technical description of the sort of training procedure I have in mind.

The proposal

The basic idea is as follows: debate naturally yields an RL objective, so if you want to add an auxiliary RL objective to amplification, why not use the RL objective from debate? Specifically, the idea is to conduct a debate not between copies of the model $M$, but between copies of the amplified model $\text{Amp}(H, M)$ (where $\text{Amp}(H, M)$ is a human $H$ with access to the model $M$). That gives you both an RL reward arising from the winner of the debate and a supervised loss from the difference between the model's output and the amplified model's output. That's the basic proposal. Now I'll try to give a more careful, technical description of how I think it should go.


We will start with some initial definitions.

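  • Let $\mathcal{Q}$ be the set of all questions in natural language.
  • Let $\mathcal{A}$ be the set of all answers in natural language.
  • Let $D$ be the training distribution of natural language questions.
  • Let $H: \mathcal{Q} \to \mathcal{A}$ be a human.
  • Let $M: \mathcal{Q} \to \mathcal{A}$ be a model.
  • Let $\text{Amp}(H, M)(Q) = H(\text{“What answer would you give to } Q \text{ given access to } M\text{?”})$.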
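  • Let $H_\text{approx}(Q) = \text{if } \text{ShouldCall}_H(Q) \text{ then } H(Q) \text{ else } M(Q)$ where $\text{ShouldCall}_H$ is some means of stochastically determining whether it is worth it to call $H$ or not.[1]
  • Let $\text{Amp}_\text{approx}(M) = \text{Amp}(H_\text{approx}, M)$.
  • Let $M_\text{target}$ be a target network updated regularly to match $M$.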
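
To make these definitions concrete, here is a minimal sketch of them as higher-order functions. It rests on several assumptions that are not in the definitions themselves: the human is modeled as a callable that is handed the model explicitly (one way of reading "given access to"), the approximate human falls back to the model whenever the human is not called, and `call_with_probability` is a purely hypothetical stand-in for the stochastic call decision.

```python
import random
from typing import Callable

Model = Callable[[str], str]         # M: question -> answer
Human = Callable[[str, Model], str]  # H, handed the model explicitly (assumption)


def amp(h: Human, m: Model) -> Model:
    """Amp(H, M): the human answers Q with access to the model M."""
    def amplified(question: str) -> str:
        prompt = f"What answer would you give to {question!r} given access to M?"
        return h(prompt, m)
    return amplified


def approx_human(h: Human, should_call_h: Callable[[str], bool]) -> Human:
    """Approximate human: call H only when should_call_h says it is worth it,
    otherwise fall back to the model's own answer (assumed fallback)."""
    def h_approx(question: str, m: Model) -> str:
        return h(question, m) if should_call_h(question) else m(question)
    return h_approx


def amp_approx(h: Human, m: Model, should_call_h: Callable[[str], bool]) -> Model:
    """Amplification that uses the approximate human in place of H."""
    return amp(approx_human(h, should_call_h), m)


def call_with_probability(p: float, rng: random.Random) -> Callable[[str], bool]:
    """Hypothetical call decision: consult the human with fixed probability p."""
    return lambda question: rng.random() < p
```

With a call decision that never fires, `amp_approx` reduces to the model answering the amplification prompt itself; with one that always fires, it reduces to `amp`.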

Training procedure

Then, consider the following training procedure.

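  1. Generate some question $Q \sim D$.
  2. Conduct a debate according to the following procedure while building up a history $S$ and starting at $t = 0$.
    1. Let $\text{arg}_t = \text{Amp}_\text{approx}(M_\text{target})(\text{“What is the next argument you would give on } Q \text{ in context } S\text{?”})$.
    2. Append to the history $S$ the contents of $\text{arg}_t$ as well as a trace of what happened in calculating $\text{arg}_t$, including all the necessary information to use transparency tools to inspect what happened.
    3. Let $\text{sufficient}_t = \text{Amp}_\text{approx}(M_\text{target})(\text{“Is } \text{arg}_t \text{ a sufficient answer to } Q \text{ in context } S\text{?”})$.
    4. If not $\text{sufficient}_t$, increment $t$ and return to step 1 in the debate.
  3. Let $\text{winner} = \text{Amp}_\text{approx}(M_\text{target})(\text{“Were the even arguments in } S \text{ more helpful than the odd arguments in answering } Q\text{?”})$.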
  4. Train $M$ on the combination of the debate-style RL reward derived from $\text{winner}$ ($+1$ for the winner and $-1$ for the loser), a relaxed adversarial training catastrophe-minimization penalty calculated by $\text{Amp}_\text{approx}(M_\text{target})$, and the supervised loss $d(A, M(Q_A))$ where $d$ is some natural language distance metric and $Q_A$ represents the question posed to the function that computed the answer $A$.
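
The debate portion of the procedure above (steps 2 and 3) can be sketched roughly as follows. This is only an illustration under stated assumptions: `amplified` is a stand-in for the amplified target model, yes/no questions are assumed to come back as the strings "yes" or "no", the history stores only the arguments (not the transparency traces), the rewards are assumed to be +1 and -1, and `max_steps` is an added cap that the procedure itself does not have.

```python
from typing import Callable, Dict, List, Tuple

Answerer = Callable[[str], str]  # stand-in for the amplified target model


def run_debate(
    question: str, amplified: Answerer, max_steps: int = 8
) -> Tuple[List[str], Dict[str, float]]:
    """Run one debate and return the history plus zero-sum rewards."""
    history: List[str] = []  # S; the full procedure would also store traces
    for _ in range(max_steps):  # max_steps is a safety cap (assumption)
        context = " | ".join(history)
        arg = amplified(
            f"What is the next argument you would give on {question!r} "
            f"in context [{context}]?"
        )
        history.append(arg)
        context = " | ".join(history)
        verdict = amplified(
            f"Is {arg!r} a sufficient answer to {question!r} "
            f"in context [{context}]?"
        )
        if verdict.strip().lower() == "yes":
            break  # the latest argument was judged sufficient, so stop
    context = " | ".join(history)
    even_won = amplified(
        f"Were the even arguments in [{context}] more helpful than the odd "
        f"arguments in answering {question!r}?"
    ).strip().lower() == "yes"
    # Zero-sum debate reward: +1 for the winning side, -1 for the losing side.
    rewards = {
        "even": 1.0 if even_won else -1.0,
        "odd": -1.0 if even_won else 1.0,
    }
    return history, rewards
```

If the very first argument is judged sufficient, the history contains a single argument and the loop behaves like a single amplification step.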

Ways of combining the RL and supervised losses here include:

  • combining the two losses with some relative weighting $\alpha$,
  • using a constrained RL algorithm such as CPO to maximize the RL reward subject to maintaining some baseline performance on the supervised loss, or
  • making the supervised loss a part of the RL reward.
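
As a rough illustration of the first and third options, here is a small sketch. The convex-combination form, the subtraction of a scaled supervised loss from the reward, and the names `alpha` and `beta` are all assumptions made for the example; the constrained-RL option is not sketched here.

```python
def weighted_loss(rl_loss: float, supervised_loss: float, alpha: float) -> float:
    """Option 1: mix the two losses with a relative weighting alpha in [0, 1].

    alpha = 1 uses only the RL loss; alpha = 0 uses only the supervised loss.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * rl_loss + (1.0 - alpha) * supervised_loss


def shaped_reward(debate_reward: float, supervised_loss: float, beta: float) -> float:
    """Option 3: fold the supervised loss into the RL reward as a penalty.

    beta (hypothetical) sets how strongly imitation error reduces the reward.
    """
    return debate_reward - beta * supervised_loss
```

For instance, `weighted_loss(2.0, 4.0, 0.25)` weights the supervised term three times as heavily as the RL term.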


The first thing I want to note about this procedure is that it gives a continuous relaxation between amplification and debate. In fact, there are multiple ways to tune the procedure so that it limits to either pure amplification or pure debate. First, you can anneal whatever combination of the different losses you are using to eventually become exclusively imitative amplification, exclusively debate, or anything in between. Second, the procedure uses the amplified model to decide whether to stop the debate after every argument, which gives another relaxation between the two: if the debate is always stopped after the first argument, you just get pure imitative amplification, whereas the longer the debate goes on, the closer you get to pure debate. What's really nice about this is that when pure imitative amplification is giving good results, you'll just get pure amplification, since the first argument will tend to be judged sufficient and the debate will stop; when amplification isn't working, you'll get debate, since the arguments will tend to be judged insufficient and the debate will continue.
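
The annealing option can be illustrated with a simple schedule for the weight on the RL term. This is only a sketch: the linear shape, the `start` and `end` values, and the way the weight enters the loss are assumptions, and any other decay schedule would serve the same purpose.

```python
def annealed_weight(step: int, total_steps: int,
                    start: float = 1.0, end: float = 0.0) -> float:
    """Linearly move the RL weight from `start` to `end` over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return start + frac * (end - start)


def mixed_loss(rl_loss: float, supervised_loss: float, rl_weight: float) -> float:
    """Combine the two losses; rl_weight = 0 leaves pure imitative amplification."""
    return rl_weight * rl_loss + (1.0 - rl_weight) * supervised_loss
```

With the defaults, the weight on the debate reward decays to zero, so training ends as exclusively imitative amplification; swapping `start` and `end` would instead anneal towards pure debate.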

As a result, it should be quite possible to enforce that this procedure limit to HCH, either by annealing the losses or by forcing the decision to continue the debate to tend towards false. Thus, I think this procedure has a good chance of being outer aligned at optimum, or at least a similar chance to pure imitative amplification. Unlike pure imitative amplification, however, this procedure gets to make use of the capability benefits of having an auxiliary RL objective to help guide training. Furthermore, since the auxiliary RL objective comes from debate, we get a lot of the benefits of debate as well, including the ability to incentivize the debaters to produce arguments that we wouldn't necessarily have thought of ourselves, as well as the ability to train our debaters to use transparency tools against each other to help catch deception or other catastrophic behavior. That being said, I do think that whether or not something like this is inner aligned is still quite questionable, and is likely to depend heavily on the specific transparency tools you have access to. Still, I like the approach described here in general and think it's definitely worth looking into more.

  1. As an example approach for implementing this kind of decision about when to call the human, see “A concrete proposal for adversarial IDA.”

Comments

I like the basic idea, but I don't understand the details, so by default won't include it in the newsletter. Some confusions:

  • Are the arguments the same thing as answers? (I get this impression because you say “Is arg_t a sufficient answer to Q in context S?”.) If not, where is the answer in the debate? More generally I would benefit a lot from a concrete example (e.g. the Bali vs. Alaska example).
  • Debate sets up a game and argues that the equilibrium is truth-telling. It does that by setting up a zero-sum game and then using self-play for training; self-play will converge to the Nash equilibrium, so you are then justified in only analyzing the equilibrium, while ignoring the training process. However, in your use of debate, afaict nothing enforces that you converge to the equilibrium of the zero-sum game, so I don't see why you gain the benefits of debate.
  • Why do you want to add an auxiliary RL objective? I normally imagine two reasons. First, maybe the task you want to solve is well suited to RL, e.g. Atari games, and so you want to train on that RL objective in addition to the question answering objective, so that the RL objective lets you learn good representations quickly. Second, if your model M is unable to do perfect imitation, there must be errors, and in this case the imitation objective doesn't necessarily incentivize the right thing, whereas the RL objective does. (See Against Mimicry.) I think yours is aiming at the second and not the first?

The basic debate RL setup is meant to be unchanged here: when I say “the RL reward derived from winner” I mean that in the zero-sum debate game sense. So you're still using self-play to converge on the Nash in the situation where you anneal towards debate, and otherwise you're using that self-play RL reward as part of the loss and the supervised amplification loss as the other part.

Are the arguments the same thing as answers?

The arguments should include what each debater thinks the answer to the question should be.

I think yours is aiming at the second and not the first?


The basic debate RL setup is meant to be unchanged here—when I say “the RL reward derived from winner” I mean that in the zero-sum debate game sense.

But the answers are generated from pieces that involve humans, and those humans don't behave as though they are in a zero-sum game?

I suppose you could imagine that the human is just some function, and the models are producing answers that get mapped through the function before they get their zero-sum reward... but then the equilibrium behavior could very well be different. For example, if you're advising a human on how to play rock-paper-scissors, but they have a bias against paper and when you tell them to play paper they have a 33% chance of playing rock instead, you should now have a 50% chance of recommending paper, 33% chance of scissors, and 17% chance for rock. So I'm not sure that the reasons for optimism for debate transfer over into this setting where you have a human in the mix.
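
The numbers in the rock-paper-scissors example can be checked with a short calculation. The sketch below assumes the goal is for the human's actual play to be uniform over the three moves, that only "paper" recommendations are distorted, and that the distortion always turns paper into rock; `recommendation_mix` is a name made up for this illustration.

```python
from fractions import Fraction
from typing import Dict


def recommendation_mix(flip: Fraction) -> Dict[str, Fraction]:
    """Recommendation probabilities that make the human's actual play uniform,
    when a 'paper' recommendation is played as rock with probability `flip`."""
    third = Fraction(1, 3)
    paper = third / (1 - flip)   # played paper = paper * (1 - flip) = 1/3
    scissors = third             # scissors recommendations are followed exactly
    rock = third - paper * flip  # played rock = rock + paper * flip = 1/3
    return {"rock": rock, "paper": paper, "scissors": scissors}


mix = recommendation_mix(Fraction(1, 3))
# mix["paper"] is 1/2, mix["scissors"] is 1/3, mix["rock"] is 1/6 (about 17%)
```

The recommended mix differs from the uniform equilibrium of the undistorted game, which is the point of the example.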

Maybe you could make an argument that for any who we trust enough to do amplification / debate in the first place, this isn't a problem, since is supposed to be more capable than . Alternatively you could say that at the very least is such that gives true and useful arguments, though that might conflict with training to imitate (as in the rock-paper-scissors example above).

Yep; that's basically how I'm thinking about this. Since I mostly want this process to limit to amplification rather than debate, I'm not that worried about the debate equilibrium not being exactly the same, though in most cases I expect in the limit that such that you can in fact recover the debate equilibrium if you anneal towards debate.

I really love the level of detail in this sketch!

I'm mentally substituting for some question more like "should this debate continue?", because I think the setup you describe keeps going until is satisfied with an answer, which might be never for weak . It's also not obvious to me that this reward system you describe actually teaches agents to debate between odd and even steps. If there's a right answer that the judge might be convinced of, I think will be trained to give it no matter the step parity, because when that happens it gets rewarded.

Really, it feels like the state of the debate is more like the state of a RNN, and you're going to end up training something that can make use of that state to do a good job ending debates and making the human response be similar to the model response.

you can anneal whatever combination of the different losses you are using to eventually become exclusively imitative amplification, exclusively debate, or anything else in between

How necessary is annealing for this? Could you choose other optimisation procedures? Or do you refer to annealing in a more general sense?

“Annealing” here simply means decaying over time (as in learning rate annealing), in this case decaying the influence of one of the losses to zero.

Let H:Q→A be a human.


Let Amp(H,M)(Q)=H(“What answer would you give to Q given access to M?”).

Nitpick: is meant to be defined here as a human with access to ?

It shouldn't be since is just a function argument here—and I was imagining that including a variable in a question meant it was embedded such that the question-answerer has access to it, but perhaps I should have made that more clear.

I too wanted that to be written differently, like H(Q|M) or something, since the human can imagine having access to the model without having actual access to the model.