Debate: train M* to win debates against Amp(M).
I think Debate is closer to "train M* to win debates against itself as judged by Amp(M)".
Wouldn't it just be "train M* to win debates against itself as judged by H", since in the original formulation of debate a human inspects the debate transcript without assistance?
Anyway, I agree that something like this is also a reasonable way to view debate. In this case, I was trying to emphasise the similarities between Debate and the other techniques: I claim that if we call the combination of the judge plus one debater Amp(M), then we can think of the debate as M* being trained to beat Amp(M) by Amp(M)'s own standards.
Maybe an easier way to visualise this is that, given some question, M* answers that question, and then Amp(M) tries to identify any flaws in the argument by interrogating M*, and rewards M* if no flaws can be found.
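To make that visualisation concrete, here is a minimal sketch under loud assumptions. The names `debate_reward`, `toy_amp_m`, and the two toy models are hypothetical illustrations, not part of the post or of any library, and Amp(M) is stood in for by a plain function that returns the list of flaws it found.

```python
from typing import Callable, List

# Hypothetical type aliases; nothing here comes from a real library.
Answerer = Callable[[str], str]                           # M*: question -> answer
Interrogator = Callable[[str, str, Answerer], List[str]]  # Amp(M): returns flaws found


def debate_reward(question: str, m_star: Answerer, amp_m: Interrogator) -> float:
    """Reward M* with 1.0 if Amp(M) finds no flaws in its answer, else 0.0."""
    answer = m_star(question)
    flaws = amp_m(question, answer, m_star)
    return 1.0 if not flaws else 0.0


def toy_amp_m(question: str, answer: str, m_star: Answerer) -> List[str]:
    """Toy stand-in for Amp(M): ask one follow-up and flag any inconsistency."""
    follow_up = m_star(f"Restate your answer to: {question}")
    return [] if follow_up == answer else ["answer changed under questioning"]


def consistent_model(question: str) -> str:
    """Toy M* that always gives the same answer."""
    return "4"


def inconsistent_model(question: str) -> str:
    """Toy M* whose answer depends on how the question is phrased."""
    return str(len(question))


reward_consistent = debate_reward("What is 2 + 2?", consistent_model, toy_amp_m)
reward_inconsistent = debate_reward("What is 2 + 2?", inconsistent_model, toy_amp_m)
```

In this toy setup the consistent model is rewarded and the inconsistent one is not. A real Amp(M) would of course interrogate far more thoroughly than a single follow-up question.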
I claim that if we call the combination of the judge plus one debater Amp(M), then we can think of the debate as M* being trained to beat Amp(M) by Amp(M)'s own standards.
This seems like a reasonable way to think of debate.
I think that in practice (if this even means anything) the power of debate is bounded by the power of the human judge, so some other technique, e.g. imitative amplification, is needed to make the human capable of supervising complex debates.
This strikes me as a really interesting and innovative post, proposing a framework for systematically categorizing existing alignment proposals as well as helping to generate new ones.
I'm kind of surprised that this post is almost 2 years old and yet only has one pingback and a few comments.
Is there some other framework which has superseded this one, or did people just forget about it, or is there simply not much comparative alignment work going on?
One other framework I've seen that is somewhat like this is "Training stories" from Evan Hubinger's "How do we become confident in the safety of a machine learning system?" But that is more about evaluating alignment proposals (i.e. the very last part of the present post) than about categorizing alignment proposals along a consistent set of dimensions, which is the main focus here. So it actually serves a different purpose and isn't much like this framework after all.
I liked Evan’s post on 11 proposals for safe AGI. However, I was a little confused about why he chose these specific proposals; it feels like we could generate many more by stitching together the different components he identifies, such as different types of amplification and different types of robustness tools. So I’m going to take a shot at describing a set of dimensions of variation which capture the key differences between these proposals, and thereby describe an underlying space of possible approaches to safety.
Firstly I’ll quickly outline the proposals. Rohin’s overview of them is a good place to start - he categorises them as:
More specifically, we can describe the four core recursive outer alignment techniques as variants of iterated amplification, as follows: let Amp(M) be the procedure of a human answering questions with access to model M. Then we iteratively train M* (the next version of M) by:
Here are six axes of variation which I claim underlie Evan’s proposals. Each proposal is more or less:
In more detail:
I intend this breakdown to be useful not just in classifying existing approaches to safety, but also in generating new ones. For example, I’d characterise this paper as arguing that AI training regimes which are less structured, less supervised and more environmentally-dependent will become increasingly relevant (a position with which I strongly agree), and trying to come up with safety research directions accordingly. Another example: we can take each variant of iterated amplification and ask how we could improve it if we had better interpretability techniques (such as the ability to generate adversarial examples which display specific misbehaviours). More speculatively, since adversarial interactions are often useful in advancing agent capabilities, I’d be interested in versions of STEM AI which add an adversarial component - perhaps by mimicking in some ways the scientific process as carried out by humans.
There’s one other important question about navigating this space of possibilities - on what metric should we evaluate the proposals within it? We could simply do so based on their overall probability of working. But I think there are enough unanswered questions about what AGI development will look like, and what safety problems will arise, that these evaluations can be misleading. Instead I prefer to decompose evaluations into two components: how much does a proposal improve our situation given certain assumptions about what safety problems we’ll face along which branches of AGI development; and how likely are those assumptions to be true? This framing might encourage people to specialise in approaches to safety which are most useful conditional on one possible path to AGI, even if that’s at the expense of generality - a tradeoff which will become more worthwhile as the field of AI safety grows.
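One way to read this decomposition is as an expected value: weight each proposal's improvement under a given set of assumptions by the probability that those assumptions hold. The sketch below is only an illustration of that arithmetic. The function name `expected_improvement`, the scenario names, and every number are hypothetical placeholders, not estimates from the post.

```python
from typing import Dict


def expected_improvement(
    p_scenario: Dict[str, float],
    improvement_given: Dict[str, float],
) -> float:
    """Sum over scenarios of P(scenario) * improvement conditional on it."""
    return sum(
        p * improvement_given.get(name, 0.0)
        for name, p in p_scenario.items()
    )


# Hypothetical probabilities for two possible paths to AGI.
p_scenario = {"path_a": 0.5, "path_b": 0.5}

# Hypothetical conditional improvements (arbitrary units).
generalist = {"path_a": 0.2, "path_b": 0.2}  # helps a little on every path
specialist = {"path_a": 0.6, "path_b": 0.0}  # helps a lot, but only on path_a

generalist_value = expected_improvement(p_scenario, generalist)
specialist_value = expected_improvement(p_scenario, specialist)
```

With these made-up numbers the specialised proposal comes out ahead despite being useless on one path, which is the kind of tradeoff the paragraph above describes. Keeping the two dictionaries separate also makes it easy to see whether a disagreement is about the probabilities or about the conditional improvements.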
Thanks to the DeepMind safety reading group and Evan Hubinger for useful ideas and feedback.