Synthesizing amplification and debate

Rohin Shah (6y)

I like the basic idea, but I don't understand the details, so by default won't include it in the newsletter. Some confusions:

Are the arguments the same thing as answers? (I get this impression because you say “Is arg_t a sufficient answer to Q in context S?”.) If not, where is the answer in the debate? More generally I would benefit a lot from a concrete example (e.g. the Bali vs. Alaska example).
Debate sets up a game and argues that the equilibrium is truth-telling. It does that by setting up a zero-sum game and then using self-play for training; self-play will converge to the Nash equilibrium, so you are then justified in only analyzing the equilibrium, while ignoring the training process. However, in your use of debate, afaict nothing enforces that you converge to the equilibrium of the zero-sum game, so I don't see why you gain the benefits of debate.
Why do you want to add an auxiliary RL objective? I normally imagine two reasons. First, maybe the task you want to solve is well suited to RL, e.g. Atari games, and so you want to train on that RL objective in addition to the question answering objective, so that the RL objective lets you learn good representations quickly. Second, if your model M is unable to do perfect imitation, there must be errors, and in this case the imitation objective doesn't necessarily incentivize the right thing, whereas the RL objective does. (See Against Mimicry.) I think yours is aiming at the second and not the first?

evhub (6y)

The basic debate RL setup is meant to be unchanged here: when I say “the RL reward derived from $winner$” I mean that in the zero-sum debate game sense. So if you anneal towards debate, you're still using self-play to converge on the Nash equilibrium; otherwise, you're using that self-play RL reward as one part of the loss and the supervised amplification loss as the other part.

> Are the arguments the same thing as answers?

The arguments should include what each debater thinks the answer to the question should be.

> I think yours is aiming at the second and not the first?

Yep.

Rohin Shah (6y)

> The basic debate RL setup is meant to be unchanged here: when I say “the RL reward derived from $winner$” I mean that in the zero-sum debate game sense.

But the answers are generated from pieces that involve humans, and those humans don't behave as though they are in a zero-sum game?

I suppose you could imagine that the human is just some function, and the models are producing answers that get mapped through the function before they get their zero-sum reward... but then the equilibrium behavior could very well be different. For example, suppose you're advising a human on how to play rock-paper-scissors, but they have a bias against paper: when you tell them to play paper, they have a 33% chance of playing rock instead. To keep their actual play uniform, you should now recommend paper 50% of the time, scissors 33% of the time, and rock 17% of the time. So I'm not sure that the reasons for optimism about debate transfer over into this setting where you have a human in the mix.
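The arithmetic behind the rock-paper-scissors numbers can be checked with a short sketch. It assumes (this is one reading of the example, not something stated in the thread) that the human follows "rock" and "scissors" faithfully, that the bias is exactly 1/3, and that the target is for the human's actual play to be uniform, which is the usual rock-paper-scissors equilibrium. The function names are hypothetical.

```python
from fractions import Fraction


def recommendation_mix(flip_prob):
    """Recommendation probabilities that make the human's actual play uniform.

    Assumption: the human obeys "rock" and "scissors", but when told "paper"
    plays rock instead with probability `flip_prob`.
    """
    third = Fraction(1, 3)
    # Actual paper = (1 - flip_prob) * p_paper, which must equal 1/3.
    p_paper = third / (1 - flip_prob)
    # Actual rock = p_rock + flip_prob * p_paper, which must equal 1/3.
    p_rock = third - flip_prob * p_paper
    # Scissors recommendations are never distorted.
    p_scissors = third
    return {"rock": p_rock, "paper": p_paper, "scissors": p_scissors}


def actual_play(mix, flip_prob):
    """Distribution over what the human really plays, given the advice mix."""
    return {
        "rock": mix["rock"] + flip_prob * mix["paper"],
        "paper": (1 - flip_prob) * mix["paper"],
        "scissors": mix["scissors"],
    }


bias = Fraction(1, 3)
mix = recommendation_mix(bias)
played = actual_play(mix, bias)
print(mix)     # recommendation probabilities as exact fractions
print(played)  # resulting distribution over the human's actual moves
```

Under these assumptions the recommendation mix is 1/2 paper, 1/3 scissors, 1/6 rock (roughly 50%, 33%, 17%), while the human's actual play comes out uniform. The point of the example is that the adviser's optimal output differs from the equilibrium of the undistorted game.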

Maybe you could make an argument that for any $H$ who we trust enough to do amplification / debate in the first place, this isn't a problem, since $Amp(H, M)$ is supposed to be more capable than $M$. Alternatively you could say that at the very least $M$ is such that $Amp(H, M)$ gives true and useful arguments, though that might conflict with training $M$ to imitate $Amp(H, M)$ (as in the rock-paper-scissors example above).

evhub (6y)

Yep; that's basically how I'm thinking about this. Since I mostly want this process to limit to amplification rather than debate, I'm not that worried about the debate equilibrium not being exactly the same, though in most cases I expect in the limit that $Amp(H, M) \approx M$ such that you can in fact recover the debate equilibrium if you anneal towards debate.

Charlie Steiner (6y)

I really love the level of detail in this sketch!

I'm mentally substituting $continue_t$ for some question more like "should this debate continue?", because I think the setup you describe keeps going until $Amp$ is satisfied with an answer, which might be never for weak $M$. It's also not obvious to me that this reward system you describe actually teaches agents to debate between odd and even steps. If there's a right answer that the judge might be convinced of, I think $M$ will be trained to give it no matter the step parity, because when that happens it gets rewarded.

Really, it feels like the state of the debate is more like the state of an RNN, and you're going to end up training something that can make use of that state to do a good job of ending debates and making the human response similar to the model response.

Riccardo Volpato (6y)

> you can anneal whatever combination of the different losses you are using to eventually become exclusively imitative amplification, exclusively debate, or anything else in between

How necessary is annealing for this? Could you choose other optimisation procedures? Or do you refer to annealing in a more general sense?

evhub (6y)

“Annealing” here simply means decaying over time (as in learning rate annealing), in this case decaying the influence of one of the losses to zero.
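As a minimal sketch of what decaying one loss's influence to zero could look like, the snippet below mixes an amplification loss and a debate loss with a weight that decays linearly over training. The linear schedule, the convex combination, and every name here (`linear_anneal`, `combined_loss`, `initial_debate_weight`) are illustrative assumptions; the post does not specify a particular schedule or weighting.

```python
def linear_anneal(initial, step, total_steps):
    """Linearly decay a weight from `initial` at step 0 to 0 at `total_steps`."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return initial * remaining


def combined_loss(amp_loss, debate_loss, step, total_steps,
                  initial_debate_weight=0.5):
    """Convex mix of the two losses; the debate term is annealed away.

    Hypothetical weighting: by `total_steps` the objective is exclusively
    the (imitative) amplification loss.
    """
    w_debate = linear_anneal(initial_debate_weight, step, total_steps)
    return (1.0 - w_debate) * amp_loss + w_debate * debate_loss


# Example with placeholder loss values: the debate weight shrinks each step.
for step in (0, 50, 100):
    print(step, combined_loss(2.0, 4.0, step, total_steps=100))
```

Swapping which term carries the decaying weight would instead anneal towards debate, and holding the weight fixed gives something in between.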

Ofer (6y)

> Let $H : Q \to A$ be a human.
>
> [...]
>
> Let $Amp(H, M)(Q) = H($“What answer would you give to $Q$ given access to $M$?”$)$.

Nitpick: $H$ is meant to be defined here as a human with access to $M_{target}$?

evhub (6y)

It shouldn't be $M_{target}$, since $M$ is just a function argument here. I was imagining that including a variable in a question meant it was embedded such that the question-answerer has access to it, but perhaps I should have made that clearer.

Vaniver (6y)

I too wanted that to be written differently, like $H(Q \mid M)$ or something, since the human can imagine having access to the model without having actual access to the model.

As an example approach for implementing something like $Samp$, see “A concrete proposal for adversarial IDA.”
