This post is about two proposals for aligning AI systems in a scalable way:

Iterated Distillation and Amplification (often just called 'Iterated Amplification'), or IDA for short,^[1] is a proposal by Paul Christiano.
Debate is an IDA-inspired proposal by Geoffrey Irving.

This post is written to be as easy to understand as possible, so if you found existing explanations of IDA confusing, or if you just never bothered because it seemed intimidating, this post is for you. The only prerequisite is knowing about the concept of outer alignment (and knowing about inner alignment is helpful as well). Roughly,

Outer alignment is aligning the training signal or training data we give to our model with what we want.
If the model we find implements its own optimization process, then inner alignment is aligning [the thing the model is optimizing for] with the training signal.

See also this post for an overview and this paper or my ELI12 edition for more details on inner alignment.

1. Motivation / Reframing AI Risk

Why do we need a fancy alignment scheme?

There was some debate a few months ago about whether the classical arguments of the kind made in Superintelligence for why AI is dangerous hold up to scrutiny. I think a charitable reading of the book can interpret it as primarily defending one claim, which is also an answer to the leading question. Namely,

It is hard to define a scalable training procedure that is not outer-misaligned.

For example, a language model (GPT-3 style) is outer-misaligned because the objective we train for is to predict the most likely next word, which says nothing about being 'useful' or 'friendly'. Similarly, a question-answering system trained with Reinforcement Learning is outer-misaligned because the objective we train for is 'optimize how much the human likes the answer', not 'optimize for a true and useful answer'.

I'll refer to this claim as $(*)$. If $(*)$ is true, it is a problem even under the most optimistic assumptions. For example, we can suppose that

progress is gradual all the way, and we can test everything before we deploy it;
we are likely to maintain control of AI systems (and can turn them off whenever we want to) for a while after they exceed our capabilities;
it takes at least another 50 years for AI to exceed human capabilities across a broad set of tasks.

Even then, $(*)$ remains a problem. The only way to build an outer-aligned AI system is to build an outer-aligned AI system, and we can't do it if we don't know how to do it.

In the past, people have given many examples of how outer alignment could fail (there are a lot of those in Superintelligence, and I've given two more above). But the primary reason to believe $(*)$ is that it has taken people a long time to come up with a formalized training scheme that is not clearly outer-misaligned. IDA and Debate are two such schemes.

If outer alignment works out, that alone is not sufficient. To solve the entire alignment problem (or even just Intent Alignment^[2]), we would like to have confidence that an AI system is

outer-aligned; and
inner-aligned (or not using an inner optimizer); and
training-competitive; and
performance-competitive.

Thus, IDA and Debate are a long way from having solved the entire problem, but the fact that they may be outer-aligned is reason to get excited, especially if you think the alignment problem is hard.

2. The Key Idea

Training AI systems requires a training signal. In some cases, this signal is easy to provide regardless of how capable the system is – for example, it is always easy to see whether a system has won a game of Go, even if the system plays at superhuman level. But most cases we care about are not of this form. For example, if an AI system makes long-term economic decisions, we only know how good the decisions are after they've been in place for years, and this is insufficient for a training signal.

In such cases, since we cannot wait to observe the full effects of a decision, any mechanism for a more rapid training signal has to involve exercising judgment to estimate how good the decisions are ahead of time. This is a problem once we assume that the system is more capable than we are.

To the rescue comes the following idea:

The AI system we train has to help us during training.

IDA and Debate provide two approaches to do this.

3. Iterated Distillation and Amplification

Before we begin, here are other possible resources to understand IDA:

The LessWrong/Alignment Forum sequence (written by Paul Christiano)
The very long 80k hours podcast with Paul Christiano
The attempted complete explanation of the scheme by Chi Nguyen
The FAQ written by Alex Zhu
A video by Robert Miles (who makes lots of AI-Alignment relevant youtube content)

This is Hannah.

Hannah, or $H$ for short, is a pretty smart human. In particular, she can answer questions up to some level of competence.

As a first step to realize IDA, we wish to distill Hannah's competence at this question-answering task into an AI system (or 'model') $A_{1}$. We assume $A_{1}$ will be slightly less competent than Hannah, so Hannah can provide a safe training signal.

$A_{1}$ may be trained by reinforcement learning or by supervised learning of any form.^[3] The basic approach of IDA leaves the distillation step as a black box, so any implementation is fine, as long as the following is true:

Given an agent as input, we obtain a model that imitates the agent's behavior at some task but runs much faster.
The output model is only slightly less competent than the input agent at this task.
This process is alignment-preserving. In other words, if $H$ is honest, then $A_{1}$ should be honest as well.
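To make the first two requirements concrete, here is a minimal sketch of a distillation step. Everything in it is an illustrative assumption rather than part of the IDA proposal: the task (primality testing), the names `slow_agent` and `distill`, and the use of a lookup table in place of a real learning algorithm. The point is only that the output imitates the input agent, answers much faster, and is slightly less competent.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def slow_agent(n: int) -> bool:
    """Hypothetical stand-in for H: decides primality by slow trial division."""
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, n))


def distill(agent: Callable[[int], bool],
            questions: Iterable[int]) -> Callable[[int], bool]:
    """Return a fast model that imitates `agent` on the training questions.

    Unseen questions get the most common training answer, so the model is
    an imperfect imitation of the agent rather than an exact copy.
    """
    table: Dict[int, bool] = {q: agent(q) for q in questions}
    default = Counter(table.values()).most_common(1)[0][0]
    return lambda q: table.get(q, default)


# Train on every question below 200 except the multiples of 7.
training_questions = [n for n in range(200) if n % 7 != 0]
model = distill(slow_agent, training_questions)

# Fraction of questions on which the model agrees with the agent.
agreement = sum(model(n) == slow_agent(n) for n in range(200)) / 200
print(agreement)
```

The model only disagrees with the agent on questions it never saw during training, which is the toy analogue of "only slightly less competent".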

If we applied $A_{1}$ to the same question-answering task, it would perform worse:

However, $A_{1}$ has vastly improved speed: it may answer questions in a few milliseconds that would have taken $H$ several hours. This fact lets us boost performance through a step we call amplification:

In the general formulation of the IDA scheme, amplification is also a black box, but in this post, we consider the basic variant, which we call stock IDA. In stock IDA, amplification is realized by giving $H$ access to the model $A_{1}$ . The idea is that this new 'agent' (consisting of $H$ with access to $A_{1}$ ) is more competent than Hannah is by herself.

If it is not obvious why, imagine you had access to a slightly dumber version of yourself that ran at 10000 times your speed. Anytime you have a (sub)-question that does not require your full intellect, you can delegate it to this slightly dumber version and obtain an answer at once. This allows you to effectively think for longer than you otherwise could.

Thus, we conjecture that this combined 'agent' has improved performance (compared to $H$ ) at the same question-answering task.

Here is a different way of describing what happened. Our combined 'agent' looks like this:

Since $A_{1}$ tries to imitate $H$ , we could think of Hannah as having access to an (imperfect) copy of herself. But since $A_{1}$ thinks much faster than $H$ , it is more accurate to view her as having access to many copies of herself, like so:

Where the gray circle means 'this is a model that tries to behave like the thing in the circle.'

At this point, we've covered one distillation and one amplification step. You might guess what happens next:

We train a new model $A_{2}$ to imitate the agent $[H access ⟶ A_{1}]$ on the question-answering task. Since $[H access ⟶ A_{1}]$ is more competent than $H$ , this means that $A_{2}$ will be more competent than $A_{1}$ (which was trained to imitate just $H$ ).

In this example, $A_{2}$ is almost exactly as competent as $H$ . This is a good time to mention that all of my performance numbers are made up; the three properties they're meant to convey are that

performance goes up in each amplification step; and
performance goes down in each distillation step; but
performance goes up in each (amplification step, distillation step) pair.
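The three properties can be illustrated with a toy calculation. The numbers here are just as made up as the ones in the pictures: `distill_loss` and `amplify_gain` are hypothetical multiplicative factors, chosen so that one amplification step more than compensates for one distillation step.

```python
from typing import List, Tuple


def performance_trajectory(h_score: float, rounds: int,
                           distill_loss: float = 0.1,
                           amplify_gain: float = 0.3) -> List[Tuple[str, float]]:
    """Toy model: distillation scales performance by (1 - distill_loss),
    amplification scales it by (1 + amplify_gain)."""
    trajectory = [("H", h_score)]
    agent_score = h_score
    for k in range(1, rounds + 1):
        model_score = agent_score * (1 - distill_loss)   # distillation step
        trajectory.append((f"A_{k}", model_score))
        agent_score = model_score * (1 + amplify_gain)   # amplification step
        trajectory.append((f"H with A_{k}", agent_score))
    return trajectory


for name, score in performance_trajectory(1.0, 3):
    print(f"{name:>10}: {score:.3f}")
```

With these assumed factors, each (amplification, distillation) pair multiplies performance by $0.9 \cdot 1.3 = 1.17$, so the models improve from round to round even though every distillation step loses something.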

After each distillation step, we end up with some model $A_{k}$ . While $A_{k}$ was trained in a very particular way, it is nonetheless just a model, which can answer questions very quickly. Each $A_{k}$ performs better than its predecessor $A_{k - 1}$ without a loss of speed.

The next amplification step looks like this:

Note that, in each amplification step, we always give Hannah access to our newest model. The $A_{k}$ 's get better and better, but Hannah remains the same human.

This new 'agent' is again more competent at the question-answering task:

Now we could train a model $A_{3}$ to imitate the behavior of $[H access ⟶ A_{2}]$ on the question-answering task, which would then be less competent than the system above, but more competent than $A_{2}$ (and in our case, more competent than $H$ ). It would still be a model and thus be extremely fast. Then, we could give Hannah access to $A_{3}$ , and so on.

One way to summarize this process is that we're trying to create a model that imitates the behavior of a human with access to itself. In particular, each model $A_{k}$ imitates the behavior of $[H access ⟶ A_{k - 1}]$ . Does this process top out at some point? It's conceivable (though by no means obvious) that it does not top out until $A_{k}$ is superintelligent. If so, and if distillation and amplification are both alignment-preserving, our scheme would be both aligned and performance-competitive.
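The loop described in this section can be sketched in a few lines. This is a toy under strong assumptions, not an implementation of IDA: the task (summing a list of integers), the limit on what `hannah` can do alone, and the function names are all invented for illustration, and `distill` is idealized as a perfect imitation, which is exactly the part a real scheme cannot take for granted.

```python
from typing import Callable, List, Optional

Answerer = Callable[[List[int]], Optional[int]]


def hannah(question: List[int]) -> Optional[int]:
    """H alone: can sum at most two numbers, otherwise gives up (None)."""
    if len(question) <= 2:
        return sum(question)
    return None


def amplify(model: Answerer) -> Answerer:
    """[H access model]: H splits the list, asks the model about each half,
    and adds the two answers herself."""
    def agent(question: List[int]) -> Optional[int]:
        direct = hannah(question)
        if direct is not None:
            return direct
        mid = len(question) // 2
        left, right = model(question[:mid]), model(question[mid:])
        if left is None or right is None:
            return None
        return hannah([left, right])
    return agent


def distill(agent: Answerer) -> Answerer:
    """Idealized distillation: assumed to be a perfect, fast imitation."""
    return lambda question: agent(question)


def ida(rounds: int) -> Answerer:
    """Return A_rounds, where A_1 imitates H and A_{k+1} imitates [H access A_k]."""
    model = distill(hannah)
    for _ in range(rounds - 1):
        model = distill(amplify(model))
    return model


print(ida(1)([1, 2, 3]))            # A_1 cannot handle three numbers
print(ida(3)(list(range(8))))       # A_3 can handle eight
```

In this toy, the longest list $A_{k}$ can handle doubles with every round, while `hannah` herself never changes.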

Recall that our 'agent' $[H access ⟶ A_{2}]$ now looks like this:

Since $A_{2}$ tries to imitate $[H access ⟶ A_{1}]$ , we can alternatively depict this as

Once again, we draw more than one of these since $A_{2}$ is much faster than $[H access ⟶ A_{1}]$ , so it is as if $H$ had access to a lot of these, not just one. (Also not just three, but I only have that much space.)

Since each $A_{1}$ tries to imitate $H$ , we can depict this further like so:

Thus, insofar as the imitation step 'works' (i.e., insofar as we can ignore the circles), the resulting system will behave as if it were composed of Hannah consulting many copies of herself, each of whom consults many copies of herself. This is after precisely four steps, i.e., distillation $\to$ amplification $\to$ distillation $\to$ amplification. You can guess how it would look if we did more steps.

The name 'Hannah' is a bit on-the-nose as her name starts with 'H', which also stands for 'human'. Thus, the tree above consists of a human consulting humans consulting humans consulting humans consulting humans consulting humans consulting humans...

We call the entire tree HCH,^[4] which is a recursive acronym for Humans consulting HCH. Generally, HCH is considered to have infinite depth.
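The recursive definition can be written down directly. The sketch below is a toy with assumed details: the task (evaluating nested arithmetic expressions), the rule that a human can only combine two numbers, and the explicit `depth` bound are illustrative choices, and a finite `depth` only approximates the infinitely deep tree.

```python
from typing import Optional, Tuple, Union

# An expression is a number or a triple (operator, left, right).
Expr = Union[int, Tuple[str, "Expr", "Expr"]]


def hch(question: Expr, depth: int) -> Optional[int]:
    """A human answers `question`, consulting copies of HCH with depth - 1."""
    if isinstance(question, int):
        return question                  # trivial: no consultation needed
    if depth == 0:
        return None                      # no helpers left to consult
    op, left, right = question
    a = hch(left, depth - 1)             # subquestion for one subtree
    b = hch(right, depth - 1)            # subquestion for another subtree
    if a is None or b is None:
        return None
    return a + b if op == "+" else a * b  # the human's own small step


expr = ("+", ("*", 2, 3), ("+", 4, ("*", 5, 6)))
print(hch(expr, 3))   # deep enough tree
print(hch(expr, 2))   # too shallow: returns None
```

Each node does only a small amount of work, and the competence of the whole comes from the depth of the tree.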

Note that there is an important caveat hidden in the clause 'insofar as the imitation step works'. In each distillation step, we are training a model to predict the answers of a system that thinks for much longer than itself. Thus, each $A_{k}$ is only more competent than $A_{k - 1}$ insofar as it is possible to solve problems in less time through better algorithms. There are strong reasons to believe that this is the case for a large class of tasks, but we know that it isn't possible for every task. For example, an HCH tree can play perfect chess (literally perfect, not just superhuman) by searching the entire chess tree.^[5] A model trained by IDA cannot do the same.
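To see what exhaustive search through a game tree looks like, here is a sketch on a game small enough to run: a Nim variant in which players alternately take one to three stones and whoever takes the last stone wins. The game and the function names are assumptions made for illustration, and the cache is a convenience that keeps the example fast; it is not a feature of an HCH tree.

```python
from functools import lru_cache
from typing import Optional


@lru_cache(maxsize=None)
def player_to_move_wins(stones: int) -> bool:
    """True if the player to move can force a win from this position.

    Each recursive call plays the role of a subtree that is asked to
    explore the game after one particular next move.
    """
    return any(not player_to_move_wins(stones - take)
               for take in (1, 2, 3) if take <= stones)


def winning_move(stones: int) -> Optional[int]:
    """Return a move that leaves the opponent in a lost position, if any."""
    for take in (1, 2, 3):
        if take <= stones and not player_to_move_wins(stones - take):
            return take
    return None


print(winning_move(5))   # taking 1 leaves 4 stones, a lost position
print(winning_move(8))   # None: every move leaves a won position
```

The search is perfect because it visits every line of play. A distilled model has to produce its answer without that much computation, which is why it can only match the tree where a faster algorithm exists.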

In the aforementioned LessWrong sequence, the illustration for the Distillation $\to$ Amplification process looks like this:

Alternatively, if we consider all of the $A_{k}$ 's to be the same AI system that gets upgraded over time, we have the following (where $r$ denotes a reward signal).

4. Factored Cognition

Informally, the Factored Cognition Hypothesis says that each question can be decomposed into easier subquestions such that the answer to the original question follows from the answers to the subquestions. Factored Cognition plays a crucial role in the applicability of both Debate and many instances of IDA.^[6]

Here is an illustration, where the top block is a question, each layer below a block is a set of subquestions whose answers determine the top-level question, and darkness/size of the blocks corresponds to difficulty:

We might now hope that the absolute difficulties look something like this:

Where the lower part is meant to indicate that we can decompose all of the above questions such that they eventually bottom out in the lowest stripe of questions smart humans can answer in 15 minutes.

I see two ways to illustrate why Factored Cognition is important for stock IDA. One is the HCH picture – insofar as the imitations 'work', a model trained via stock IDA behaves just like a tree of humans consulting each other. Thus, if the model is supposed to be superintelligent, then we better hope that any question a superintelligent AI could answer can be recursively decomposed into subquestions, until we end up with something Hannah can answer by herself. (Otherwise, stock IDA may not be performance-competitive.) In other words, we better hope that the Factored Cognition Hypothesis holds.

Another way is to look at just one amplification step in the procedure. Suppose that we have successfully trained model $A_{8}$ , which is already smarter than $H$ , and now want to use this to create the smarter agent $[H access ⟶ A_{8}]$ . Suppose that $A_{8}$ is already smart enough to answer super hard questions. We want the new agent to be smarter than $A_{8}$ , so we want it to be able to answer super-duper hard questions. In other words, we're in this position:

This means that, to answer this question, Hannah has to do the following:

She has to take the question $Q$ and decompose it into subquestions $q_{1}, q_{2}, q_{3}, q_{4}$ , such that the subquestions imply the answer to $Q$ , and each $q_{i}$ is at most super hard. Then, she can use $A_{8}$ to answer the $q_{i}$ , receive answers $a_{i}$ , and, on their basis, output an answer $a$ for $Q$ .

This means that she requires the Factored Cognition Hypothesis to hold for this particular step (the one from super-duper hard to super hard). If the Factored Cognition Hypothesis fails for any one jump of difficulty, performance might grind to a halt at that level.
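The 'grind to a halt' behavior can be shown with a very small model. It assumes that difficulty comes in integer levels and that a hypothetical predicate `can_decompose(level)` says whether questions at that level can be broken into questions one level below; neither assumption comes from the IDA literature.

```python
from typing import Callable


def highest_reachable_level(start_level: int, top_level: int,
                            can_decompose: Callable[[int], bool]) -> int:
    """Climb one difficulty level per amplification step, and stop at the
    first level whose questions cannot be decomposed into easier ones."""
    level = start_level
    while level < top_level and can_decompose(level + 1):
        level += 1
    return level


# If every level decomposes, training reaches the top.
print(highest_reachable_level(0, 10, lambda level: True))
# If level 6 fails to decompose, progress stalls at level 5.
print(highest_reachable_level(0, 10, lambda level: level != 6))
```

A single failing level is enough to cap performance, no matter how well the hypothesis holds above it.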

Both views point to the same phenomenon because they describe the same idea: HCH is idealized stock IDA, i.e., it is what stock IDA hopes to approximate in the limit. Both the concrete training procedure and the ideal utilize Factored Cognition.

It is also conceivable that a decomposition of the kind that Hannah needs to solve this problem does exist, but she is not smart enough to find it. This problem can be considered a motivation for Debate.^[7]

5. Debate

Before we begin, here are other possible resources to understand Debate:

The original paper
The AI Alignment podcast episode with Geoffrey Irving

Suppose a smart agent $X$ makes the following argument:

She wants to argue that $D$ holds. Thus, she claims that $A$ is true, that $A$ implies $B$ because {argument symbolized by leftmost arrow}, that $B$ implies $C$ because {conjunction of the arguments made by middle arrows} and that $C$ implies $D$ because {argument made by rightmost arrow}.

Then comes forth an equally smart agent $Y$ to claim that

$X$ cannot give up on the step since the entire argument depends on it, so she has to defend it. Unlike in normal debates, both $X$ and $Y$ now forget about the surrounding context: the steps from $A$ to $B$ and from $C$ to $D$ no longer matter (and neither does the first conjunct of the step from $B$ to $C$ ). The remaining debate is entirely about the second conjunct of the step from $B$ to $C$ .

Thus, we zoom into this step. It turns out there is more going on; the step itself has structure to it.

Then sayeth $X$ :

Then sayeth $Y$ :

Now, $X$ has to defend this step, and so on. Eventually, the steps become so simple that Hannah can recognize the flaw for herself. The step from $B_{23112}$ to $B_{23113}$ was false; therefore the step from $B_{2311}$ to $B_{2312}$ was false; therefore the step from $B_{231}$ to $B_{232}$ was false; therefore the step from $B_{23}$ to $B_{24} = C$ was false; therefore the step from $B_{2}$ to $C$ was false; therefore the argument that $A$ implies $D$ was false. $X$ was wrong; $Y$ was right.

In the Debate game, an AI system $X$ debates $Y$ , an identical copy of itself. The input to the game is a question; the output is a transcript of the arguments made by $X$ and $Y$ . The game proceeds in rounds: first, $X$ says something, then $Y$ , then $X$ , then $Y$ , and so on. Ideally, the first statement should always be the answer to the input question. In the end, a human $H$ decides who won the game. In doing so, $H$ should adhere to the following principles:

Enforce recursive zooming into statements. If $Y$ points to a particular part of $X$ 's argument that she claims is wrong, and $X$ responds by changing topics, then the game should end immediately, and $Y$ should win.
If a subquestion is resolved in favor of either party, always judge the entire debate based on who was right on that subquestion.

To illustrate why the second point makes sense, suppose that the debate above ends differently. $X$ and $Y$ still zoom into the problem until they hit step $B_{23112} \to B_{23113}$ , but this time, Hannah thinks the step is valid. In this case, you might suspect that this is not conclusive since the argument could fail at another step. However, since $Y$ was free to critique any part of the entire argument, chose this particular step, and failed to establish a flaw, we conclude that she could not have done better anywhere else.^[8] Thus, based solely on verifying the validity of step $B_{23112} \to B_{23113}$ , Hannah concludes that every part of the argument was valid and that $X$ won the debate.
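The judging rule can be sketched as follows. This is a toy with several assumptions that are mine rather than the original paper's: an argument is a tree of steps, a step with substeps is sound exactly when all of its substeps are, $Y$ plays optimally by always zooming into a flawed substep when one exists, and the judge can reliably check a leaf. The names `Step`, `y_challenge`, and `debate` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Step:
    label: str
    valid: bool = True                       # only meaningful for leaves
    substeps: List["Step"] = field(default_factory=list)


def is_sound(step: Step) -> bool:
    """Ground truth that the debaters can compute but the judge cannot."""
    if not step.substeps:
        return step.valid
    return all(is_sound(s) for s in step.substeps)


def y_challenge(step: Step) -> Step:
    """Y's assumed optimal play: pick a flawed substep if one exists."""
    for s in step.substeps:
        if not is_sound(s):
            return s
    return step.substeps[0]                  # nothing to attack; pick any


def debate(argument: Step) -> Tuple[str, List[str]]:
    """Zoom in until a leaf is reached; the judge checks only that leaf."""
    path, current = [argument.label], argument
    while current.substeps:
        current = y_challenge(current)
        path.append(current.label)
    winner = "X" if current.valid else "Y"   # the judge's single check
    return winner, path


flawed = Step("A to D", substeps=[
    Step("A to B"),
    Step("B to C", substeps=[
        Step("first conjunct"),
        Step("second conjunct", substeps=[Step("ok leaf"),
                                          Step("bad leaf", valid=False)]),
    ]),
    Step("C to D"),
])
print(debate(flawed))
```

Under these assumptions, the judge's verdict on one leaf agrees with the soundness of the entire argument, because an optimal $Y$ would have steered the debate to a flaw if there were one.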

This also illustrates why Debate can be powerful. If complexity is reduced at every step (i.e., $B_{23112} \to B_{23113}$ is easier to verify than $B_{2311} \to B_{2312}$ and so on), then a hard problem can be reduced to a much simpler problem, and the judge only has to verify the simple one.^[9] On this note, recall the question-difficulty hierarchy from the previous chapter:

The recursive zooming into statements from Debate is closely analogous to the decomposition into subquestions from HCH. The important difference is how the tree gets sliced up during the training process. If $[H access ⟶ A_{8}]$ is asked a super-duper hard question $Q$ , but $A_{8}$ can only answer super hard questions, then it's her job to decompose $Q$ into $q_{1}$ ,..., $q_{k}$ such that the $q_{i}$ 's determine $Q$ and all of them are at most super hard. In other words, we're taking a local slice of the tree.

Of course, each distillation step will require many such slices, and every subsequent distillation step will require many such slices at a higher level of the tree.

Conversely, each Debate game slices an entire vertical path out of the tree.

If Debate is ever implemented, we won't start off with superintelligent agents, and so we won't start off with super-duper$^{6}$-hard questions either. Consequently, the paths may be very short in the beginning and grow longer over time. Either way, each path should always go all the way to the bottom.^[10]
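A quick calculation shows how short a path is compared to the tree it is cut from. The sketch assumes a complete tree in which every question has the same number of subquestions, which is a simplification chosen for illustration.

```python
def tree_size(branching: int, depth: int) -> int:
    """Number of nodes in a complete tree with the given branching factor."""
    return sum(branching ** level for level in range(depth + 1))


def path_length(depth: int) -> int:
    """Number of nodes on one root-to-leaf path."""
    return depth + 1


for depth in (2, 5, 10):
    print(depth, tree_size(4, depth), path_length(depth))
```

With four subquestions per question and ten levels of decomposition, the tree has more than a million nodes, while a single debate path through it has eleven.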

The other difference is what I've mentioned at the end of the previous chapter: in Debate, the human never has to deal with figuring out how to decompose a question. At most, she has to figure out whether a decomposition is valid.^[11] The way Geoffrey Irving put it is that a debate transcript (as depicted above) corresponds to a single path through the tree of amplification.^[12] Crucially, it is a path chosen by the two Debate agents.

6. Comparison

Both IDA and Debate...

may or may not be outer-aligned
try to utilize the AI systems they're trying to train during the training process
are designed to scale up to superintelligence
rely on some version of the Factored Cognition Hypothesis to be applicable^[13] since they traverse the tree of difficult problems/questions

However, IDA...

carves a local slice out of the tree at each training step
has no built-in solution for decomposing questions into subquestions
- A separate model may be trained for this purpose, or the questions may go meta, i.e., "what is a good way to decompose this question?"
- Insofar as this makes the decompositions worse, it implies that a shallow HCH tree is less powerful than a shallow Debate tree.
can look very different depending on how the amplification and distillation black boxes are implemented
only approximates HCH insofar as all distillation steps 'work'

Whereas Debate...

carves a vertical slice/path out of the tree at each training step
- Therefore, it relies on the claim that such a path reliably provides meaningful information about the entire tree.
probably won't be training-competitive in the above form since each round requires human input
- This means one has to train a second model to imitate the behavior of a human judge, which introduces further difficulties.
requires that humans can accurately determine the winner of a debate with debaters on every level of competence between zero and superintelligence
could maybe tackle Inner Alignment concerns by allowing debaters to win the debate by demonstrating Inner Alignment failure in the other debater via the use of transparency tools

7. Outlook

Although this post is written to work as a standalone, it also functions as a prequel to a sequence on Factored Cognition. Unlike this post, which is summarizing existing work, the sequence will be mostly original content.

If you've read everything up to this point, you already have most of the required background knowledge. Beyond that, familiarity with basic mathematical notation will be required for posts one and two. The sequence will probably start dropping within a week.

As far as I know, the proposal is most commonly referred to as just 'Iterated Amplification', yet is most commonly abbreviated as 'IDA' (though I've seen 'IA' as well). Either way, all four names refer to the same scheme. ↩︎
Intent Alignment is aligning [what the AI system is trying to do] with [what we want]. This makes it the union of outer and inner alignment. Some people consider this the entire alignment problem. It does not include 'capability robustness'. ↩︎
I think the details of the distillation step strongly depend on whether IDA is used to train an autonomous agent (one which takes actions by itself) or a non-autonomous agent (one which only takes actions if queried by the user).

For the autonomous case, you can think of the model as an 'AI assistant', a system that autonomously takes actions to assist you in various activities. In this case, the most likely implementation involves reinforcement learning.

For the non-autonomous case, you can think of the model as an oracle: it only uses its output channels as a response to explicit queries from the user. In this case, the distillation step may be implemented either via reinforcement learning or via supervised learning on a set of (question, answer) pairs.

From a safety perspective, I strongly prefer the non-autonomous version, which is why the post is written with that in mind. However, this may not be representative of the original agenda. The sequence on IDA does not address this distinction explicitly. ↩︎
Note that, in the theoretical HCH tree, time freezes for a node whenever she asks a subtree a question and resumes once the subtree has delivered the answer, so that every node has the experience of receiving answers instantaneously. ↩︎
It's a bit too complicated to explain in detail how this works, but the gist is that the tree can play through all possible combinations of moves and counter-moves by asking each subtree to explore the game given a particular next move. ↩︎
In particular, it is relevant for stock IDA where the amplification step is implemented by giving a human access to the current model. In principle, one could also implement amplification differently, in which case it may not rely on Factored Cognition. However, such an implementation would also no longer imitate HCH in the limit, and thus, one would need an entirely different argument for why IDA might be outer-aligned. ↩︎
Geoffrey Irving has described Debate as a 'variant of IDA'. ↩︎
This is the step where we rely on debaters being very powerful. If $Y$ is too weak to find the problematic part of the argument, Debate may fail. ↩︎
Formally, there is a result that, if the judge can solve problems in the complexity class $P$ , then optimal play in the debate game can solve problems in the complexity class $PSPACE$ . ↩︎
Given such a path $p$ , the value $| p |$ (the total number of nodes in such a path) is bounded by the depth of the tree, which means that it grows logarithmically with the total size of the tree. This is the formal reason why we can expect the size of Debate transcripts to remain reasonably small even if Debate is applied to extremely hard problems. ↩︎
Note that even that can be settled via debate: if $Y$ claims that the decomposition of $X$ is flawed, then $X$ has to defend the decomposition, and both agents zoom into that as the subproblem that will decide the debate. Similarly, the question of how to decompose a question in IDA can, in principle, itself be solved by decomposing the question 'how do I decompose this question' and solving that with help from the model. ↩︎
This is from the podcast episode I've linked to at the start of the chapter. Here is the relevant part of the conversation:

Geoffrey: [...] Now, here is the correspondence. In amplification, the human does the decomposition, but I could instead have another agent do the decomposition. I could say I have a question, and instead of a human saying, “Well, this question breaks down into subquestions X, Y, and Z,” I could have a debater saying, “The subquestion that is most likely to falsify this answer is Y.” It could’ve picked at any other question, but it picked Y. You could imagine that if you replace a human doing the decomposition with another agent in debate pointing at the flaws in the arguments, debate would kind of pick out a path through this tree. A single debate transcript, in some sense, corresponds to a single path through the tree of amplification.

Lucas: Does the single path through the tree of amplification elucidate the truth?

Geoffrey: Yes. The reason it does is it’s not an arbitrarily chosen path. We’re sort of choosing the path that is the most problematic for the arguments. ↩︎
To be precise, this is true for stock IDA, where amplification is realized by giving the human access to the model. Factored Cognition may not play a role in versions of IDA that implement amplification differently. ↩︎
