(This post is part of a sequence that's meant to be read in order; see the preface.)

1. HCH and Ideal Debate

Recall from post #-2 that we have two perspectives on stock IDA.^[1] One is that of a human with access to a model, the other is that of an HCH tree.^[2] We can think of HCH as 'pure' or 'idealized' Factored Cognition that abstracts away implementation details,^[3] and the training procedure as trying to implement this ideal.

One might now ask the following:

If HCH is the ideal of stock IDA, then what is the ideal of Debate? Since this is a sequence about Factored Cognition, we are primarily interested in analyzing the idealized versions, so this is the first question we'd like to answer.

Interlude on notation: throughout this sequence, we will need to refer to single statements, sequences of statements, and sets of statements. To make telling them apart as easy as possible, we use the following conventions:

single statements use lowercase letters, like $s$ or $s_{1}$ or $s_{n}$ or $s_{j}$
sequences of statements use uppercase bold letters, like $\mathbf{S}$ or $\mathbf{S}_{1}$ or $\mathbf{S}_{n}$
sets of statements use the 'pretty' S, like $\mathcal{S}_{h}$ or $\mathcal{S}_{h}^{T}$

We'll use the first two of those in the following definition, which will be crucial in our discussion of Debate. If $s$ is a statement and $\mathbf{S} = (s_{1}, \dots, s_{n + 1})$ is a sequence of statements such that $s_{n + 1} =$ “$(s_{1}, \dots, s_{n})$ imply $s$.”, we say that $\mathbf{S}$ is an explanation for $s$ and denote this by writing $\mathbf{S} \overset{e}{\to} s$. In this setting, $s_{n + 1}$ is what we call an implication statement: it says precisely that the statements $(s_{1}, \dots, s_{n})$ imply the statement $s$. Here's a made-up example where $s_{4}$ is the implication statement:

The purpose of having implication statements is to ensure that $s$ trivially follows from $s_{1}, s_{2}, s_{3}, s_{4}$ since the implication itself is among those statements (here $s_{4}$ ). If you dispute $s$ , you must dispute one of the $s_{i}$ .
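To make the definition concrete, here is a minimal sketch in Python. It is a toy model only: statements are represented as plain strings (which glosses over the fact that real statements need context), and the helper names `implication` and `is_explanation` are made up for this illustration.

```python
from typing import Sequence


def implication(premises: Sequence[str], conclusion: str) -> str:
    """Build the implication statement "(s_1, ..., s_n) imply s." as a string."""
    return f"({', '.join(premises)}) imply {conclusion}."


def is_explanation(seq: Sequence[str], s: str) -> bool:
    """Return True iff seq = (s_1, ..., s_{n+1}) is an explanation for s,
    i.e. its last element is the implication statement for the other elements."""
    if len(seq) < 2:
        # Assumption of this sketch: at least one premise plus the implication.
        return False
    *premises, last = seq
    return last == implication(premises, s)


# Toy usage: three premises plus the implication statement s_4.
premises = ["s1", "s2", "s3"]
S = premises + [implication(premises, "s")]
print(is_explanation(S, "s"))         # True
print(is_explanation(premises, "s"))  # False: the implication statement is missing
```

The second call fails because the last element of the sequence is an ordinary statement rather than the implication, which is exactly the case the definition rules out.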

Armed with this concept, we can define our idealization of the Debate scheme:

Ideal Debate
The input to the game is a question in English. The first agent begins by giving an answer plus an explanation^[4] $(s_{1}, \dots, s_{n + 1})$ for the answer. At every subsequent step, the second agent points to one of the statements $s_{j}$ in the explanation, $j \in \{1, \dots, n + 1\}$, and the first agent responds by either giving an explanation for $s_{j}$ or declaring that the debate is over. In the latter case, a judge attempts to verify that $s_{j}$ is true. If she succeeds, the first agent wins the debate; if not, the second agent wins the debate.^[5] The first agent must end the game after a finite number of steps.
The debaters are maximally powerful agents; the judge is a human.
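The game loop can be sketched as follows. This is an illustration under stated assumptions, not a specification: the agents and the judge are passed in as hypothetical callables (`pick`, `explain`, `judge_verifies`), statements are plain strings, and the `max_steps` cutoff (with the first agent forfeiting if she never ends the game) is my own stand-in for the finiteness requirement.

```python
from typing import Callable, Optional, Sequence, Tuple

Explanation = Sequence[str]


def ideal_debate(
    first_explanation: Explanation,
    pick: Callable[[Explanation], int],
    explain: Callable[[str], Optional[Explanation]],
    judge_verifies: Callable[[str], bool],
    max_steps: int = 1000,
) -> Tuple[str, Optional[str]]:
    """Play one game; return (winner, s_final) with winner in {"first", "second"}."""
    current = first_explanation
    for _ in range(max_steps):
        s_j = current[pick(current)]  # second agent points to one statement
        nxt = explain(s_j)            # first agent explains it, or returns None to stop
        if nxt is None:
            # The debate is over: the judge tries to verify s_j.
            return ("first" if judge_verifies(s_j) else "second", s_j)
        current = nxt
    # Assumption: never ending the game counts as a loss for the first agent.
    return ("second", None)


# Toy usage with made-up statements.
stored = {"s1": ["a", "b", "(a, b) imply s1."]}
winner, s_final = ideal_debate(
    first_explanation=["s1", "s2", "(s1, s2) imply s0."],
    pick=lambda expl: 0,                       # always dispute the first statement
    explain=lambda s: stored.get(s),           # stop when no explanation is stored
    judge_verifies=lambda s: s in {"a", "b"},  # the judge can verify only a and b
)
print(winner, s_final)  # first a
```

Note that the judge is consulted exactly once, on the final statement; everything before that only determines which statement she ends up seeing.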

Just as HCH abstracts away implementation details of stock IDA, Ideal Debate abstracts away implementation details of Debate.^[6]

Instead of having implication statements, one could have allowed the second agent to deny the fact of the implication, as a 'special move' of sorts. However, I think that would be a mistake; one of the take-away messages from this post is that there is no sharp line between implications and other statements. Both can be disputed and argued about further, which is why we're treating them as the same type of object. In the definition of Ideal Debate above, the second agent is always free to point to $s_{n + 1}$ (which is precisely the implication statement), and if so, the game proceeds normally, i.e., the first agent has to provide an explanation for $s_{n + 1}$ as the next move.

Below is an example of what an Ideal Debate transcript might look like.^[7] Here, the implication statement is not shown (but this is just to make it easier to draw – it should really be there at every level!), and the statement we recurse into is always the one that the second agent has pointed to.

You can also look at the uncluttered version.

A clarification: statements are not strings. By itself, the string “Now, we have $y = a \cdot b$ [...]” cannot be judged since it contains symbols whose meaning is only defined in the remaining transcript. In general, any statement may require an arbitrary amount of additional context to be understood. This means the length of its string doesn't meaningfully indicate how complex a statement is.

Despite this, the complexity of $s_{final}$ may remain relatively low. This is due to the principle we've discussed in post #-1. The better the explanations chosen by the first agent are, the less complexity there will be in each part of the argument. Ideally, the judge will only have to deal with a small part of the entire argument to verify $s_{final}$ .^[8]

2. Cognition Spaces

One of the things mathematicians like to do when studying a problem is to define the space that one is working in. For example, in machine learning, one usually decides on a space of possible models before beginning the search. Consequently, we would now like to define a space and say that HCH and Ideal Debate are about doing stuff within this space. I will call this a Cognition Space.

As mentioned above, I think it is a mistake to differentiate between 'facts' and 'implications'. The prime number transcript is an example of this: it follows from the axioms of set theory that there are infinitely many prime numbers, so technically the entire debate is only about implications, but they just feel like regular statements. For this reason, we will take 'statement' to be a primitive and define a Cognition Space to be a pair

$(\mathcal{S}_{h}, d_{h})$

where $\mathcal{S}_{h}$ is a set of statements, and $d_{h} : \mathcal{S}_{h} \to \mathbb{R}_{+}$ is a function assigning each statement a difficulty. On this, several points.

What $d_{h}$ measures is the difficulty of verifying that a statement is true, not of understanding what is being said. For example, in the Ideal Debate transcript shown above, the difficulty of the statement “Given any set of finitely many prime numbers, one can construct at least one prime number that is not a member of this set.” is likely quite high, even though it's fairly easy to understand what the claim is.
Since we do not differentiate between implications and facts, implication statements are regular members of $\mathcal{S}_{h}$. Consequently, a Cognition Space determines which approaches to explaining statements are difficult and which are easy. For example, say you're the first agent in Ideal Debate and have to explain the root statement $s_{0}$. You might see two ways of doing this, either via $s_{1}$ and $s_{2}$, or via $s_{3}$, $s_{4}$, and $s_{5}$. In this case, not only does each of these five statements have a difficulty, but the set $\mathcal{S}_{h}$ also includes two implication statements that say precisely “$(s_{1}, s_{2})$ imply $s_{0}$.” and “$(s_{3}, s_{4}, s_{5})$ imply $s_{0}$.”, respectively. All five of the $s_{i}$ and the two implication statements are assigned a difficulty by $d_{h}$.
The ground truth here is entirely based on the human $h$, which is either the human in HCH or the judge in Ideal Debate. $\mathcal{S}_{h}$ contains all things that she would consider statements, and difficulty is measured by how hard it is for her to verify a statement. Thus, the human entirely determines the Cognition Space, and we've put her in the subscripts of both $\mathcal{S}_{h}$ and $d_{h}$ to serve as a reminder of that.

We now turn to Ideal Debate in particular. Given the definitions above, we define a path through a Cognition Space $(\mathcal{S}_{h}, d_{h})$ to be a pair

$(((s_{0}), \mathbf{S}_{1}, \dots, \mathbf{S}_{n}), s_{final})$

where $s_{final} \in \mathbf{S}_{n}$ ^[9] and for each $j \in \{1, \dots, n\}$, we have that $\mathbf{S}_{j} \in \mathcal{S}_{h}^{*}$ ^[10] and $\mathbf{S}_{j} \overset{e}{\to} s$ for some $s \in \mathbf{S}_{j - 1}$; with $\mathbf{S}_{0} := (s_{0})$ being just the initial answer to the input question. (Recall that the bold letters denote sequences, and that the notation $\mathbf{S}_{j} \overset{e}{\to} s$ is short for '$\mathbf{S}_{j}$ is an explanation for $s$'.)

An Ideal Debate transcript (such as the prime number one shown above) precisely visualizes one path through a Cognition Space.^[11] To make sure you're following the formalism up to this point, here's an exercise.

EXERCISE (1-5 MINUTES): If we extend $d_{h}$ to paths, what is the correct definition for the difficulty of a path?

The difficulty should equal that of $s_{final}$ since that's the only statement the judge needs to verify.
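As a sketch of how a Cognition Space and the path difficulty from the exercise might be represented: the class name `CognitionSpace`, the dictionary encoding of $d_{h}$ (whose keys stand in for the set of statements), and all numbers below are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A path is ((S_0, S_1, ..., S_n), s_final), with each S_j a list of statements.
Path = Tuple[List[List[str]], str]


@dataclass
class CognitionSpace:
    difficulty: Dict[str, float]  # d_h; its keys play the role of the statement set

    def path_difficulty(self, path: Path) -> float:
        """Difficulty of a path: that of s_final, the only statement the judge verifies."""
        sequences, s_final = path
        if s_final not in sequences[-1]:
            raise ValueError("s_final must appear in the last sequence S_n")
        return self.difficulty[s_final]


# Toy usage with arbitrary difficulties.
space = CognitionSpace({
    "s0": 50.0,
    "s1": 3.0,
    "s2": 7.0,
    "(s1, s2) imply s0.": 2.0,
})
path = ([["s0"], ["s1", "s2", "(s1, s2) imply s0."]], "s2")
print(space.path_difficulty(path))  # 7.0
```

The difficulty of the root statement (50 here) plays no role in the result: only the statement the path ends on matters.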

3. Finding Explanations

Ideal Debate traverses a Cognition Space by having the two agents collectively choose a single path that starts at an answer $s_{0}$ to the input question and ends at some statement $s_{final}$ . If the first agent is honest (and we will assume so for this post), she will give a true initial statement (the answer to the input question) and will provide a true explanation for the previous statement at every step of the game. The second agent will, presumably, try to navigate to the most difficult part of the argument, hoping that the judge will fail to verify $s_{final}$ . (The second post will look into this a lot more.)

Conversely, a node in an HCH tree is initialized with a question, which she will attempt to answer while using subtrees to answer related questions. If she succeeds at this, then she must consider [the set of (question, answer) pairs she has exchanged with subtrees plus whatever cognitive work she's done herself] to be an explanation for the answer. (Otherwise, she risks returning an answer that isn't true.)

Thus, explanations^[12] are key in both cases; however, the purpose of the explanation is different:

in Ideal Debate, the statement that is explained is already known, and the explanation is meant to demonstrate that it is true; whereas
in HCH, the statement to be explained is not known, and the purpose of the explanation is to derive a [statement that answers the input question].

Put differently, Ideal Debate (if the first agent is honest) is analogous to a professor deciding which statements most cleanly demonstrate that such-and-such is the answer to a particular question, whereas HCH is analogous to a student trying to figure out how to answer the question in the first place, and asking subquestions to help with that. The difference between the resulting decompositions will vary – we can imagine questions where it is nonexistent, such as

$Q :=$ “What is $987 \cdot 123$?”

In this case, we may end up with transcripts that look like this or this (with the usual disclaimers about hidden elements); they're functionally identical. But it's not hard to imagine cases where the difference is substantial. In fact, you need only take the prime number transcript above to have an example; without knowing the proof, many people would not think to ask, 'given any set of finitely many prime numbers, can you use them to construct a new prime number?', which means that an HCH transcript for the same question would look different.

Let's take a shot at formalizing this difference. Recall our Cognition Space $(\mathcal{S}_{h}, d_{h})$. Given some statement $s^{*}$, the space implicitly represents the tree of all possible explanations for $s^{*}$. It looks something like this:

Note that the structure of the tree is entirely determined by the existing implication statements: for each implication statement $s^{'} =$ “$(s_{j}, s_{k})$ imply $s$.” $\in \mathcal{S}_{h}^{T}$, there is an edge from $(s_{j}, s_{k}, s^{'})$ to $s$ in the tree.

In reality, there are likely far more than three explanations for $s$ , probably with more components, and then each component of each explanation has itself many more explanations, and the components of those have explanations as well, and so on. It becomes very intractable very quickly.

Formally, write $Explanations (s)$ to denote the set of all possible explanations for $s$ that exist in the Cognition Space $(\mathcal{S}_{h}, d_{h})$. (This set has precisely as many members as there are implication statements of the form “$\langle \text{arbitrary sequence} \rangle$ imply $s$.”) We can now define our requirement that Ideal Debate agents be 'maximally powerful' as the ability to search through all members of $Explanations (s)$. In this setting, the first agent will pick the one that will lead to the easiest possible path, given the adversarial nature of the second agent. (Again, we will look into this more in the second post.)
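One way to read 'easiest possible path, given the adversarial nature of the second agent' is as a minimax search over the explanation tree. The sketch below is my own illustration of that reading, not a definition from this post. It assumes that the first agent may end the debate at any disputed statement (including, for simplicity, the root), that the second agent always points to the statement that is worst for her, and that a hypothetical `depth` bound keeps the search finite.

```python
from typing import Dict, List


def debate_value(
    s: str,
    difficulty: Dict[str, float],
    explanations: Dict[str, List[List[str]]],
    depth: int,
) -> float:
    """Difficulty of the final statement under optimal play starting from s."""
    best = difficulty[s]  # option 1: the first agent ends the debate at s
    if depth == 0:
        return best
    for expl in explanations.get(s, []):
        # Option 2: explain s; the second agent then points to the worst statement.
        worst = max(
            debate_value(t, difficulty, explanations, depth - 1) for t in expl
        )
        best = min(best, worst)
    return best


# Toy space with arbitrary values; I1, I2, I3 stand for implication statements.
difficulty = {"s0": 100.0, "a": 5.0, "b": 8.0, "I1": 2.0,
              "c": 1.0, "d": 50.0, "I2": 1.0, "e": 3.0, "I3": 4.0}
explanations = {
    "s0": [["a", "b", "I1"], ["c", "d", "I2"]],
    "d": [["e", "I3"]],
}
print(debate_value("s0", difficulty, explanations, depth=1))  # 8.0
print(debate_value("s0", difficulty, explanations, depth=2))  # 4.0
```

With one level of lookahead, the explanation via `a` and `b` looks best; with two, the explanation via `c` and `d` wins because `d` itself has an easy explanation. This is the sense in which the choice of explanation depends on the whole tree rather than on the immediate substatements alone.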

HCH is much harder to formalize, but for now, we can crudely model the limitations that come with not knowing the answer by assuming that an HCH node can

only search through a subset of $Explanations (s)$; and
derive the answer iff there is a sufficiently easy explanation on offer.

Here's another exercise to make sure the formalism is clear.

EXERCISE (3-8 MINUTES): Suppose a node in HCH succeeds in answering a question if she finds an explanation $(s_{1}, \dots, s_{n + 1}) \in Explanations (s_{0})$ such that $\sum_{j = 1}^{n + 1} d_{h} (s_{j}) \leq 100$, where $s_{0}$ is the correct answer to the input question. Suppose further that she can search through a hundred elements (randomly chosen) in $Explanations (s_{0})$ to find such an explanation. Come up with a toy example where this would likely lead to her failing to derive $s_{0}$, even though a simple explanation does exist. To do this, define a full Cognition Space $(\mathcal{S}_{h}, d_{h})$.^[13] You can choose arbitrary values; they need not correspond to anything real.

$\mathcal{S}_{h} := \{s_{0}, s_{1}, \dots, s_{n}\} \cup \{\text{“}(s_{j}) \text{ imply } s_{0}\text{.”} \mid 1 \leq j \leq n\}$.
$d_{h} (s_{n}) := 1$
$d_{h} (\text{“}(s_{n}) \text{ imply } s_{0}\text{.”}) := 1$
$d_{h} (s) := 10^{100}$ for all other $s \in \mathcal{S}_{h}$.

Increase $n$ to make it arbitrarily unlikely for the human to find an explanation.
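To see the effect of $n$, the toy space from this solution can be simulated. The function names are hypothetical, and I assume the hundred explanations are drawn without replacement, in which case the node succeeds with probability $\min(1, 100/n)$.

```python
import random
from typing import Dict, List, Tuple


def build_toy_space(n: int) -> Tuple[Dict[str, float], List[List[str]]]:
    """Return (d_h, Explanations(s_0)) for the toy space with n candidate explanations."""
    huge = 10.0 ** 100
    difficulty = {"s_0": huge}
    explanations = []
    for j in range(1, n + 1):
        s_j = f"s_{j}"
        impl = f"({s_j}) imply s_0."
        cost = 1.0 if j == n else huge  # only the n-th explanation is easy
        difficulty[s_j] = cost
        difficulty[impl] = cost
        explanations.append([s_j, impl])
    return difficulty, explanations


def node_succeeds(
    difficulty: Dict[str, float],
    explanations: List[List[str]],
    rng: random.Random,
    budget: int = 100,
    threshold: float = 100.0,
) -> bool:
    """Sample up to `budget` explanations; succeed iff one has total difficulty <= threshold."""
    sample = rng.sample(explanations, min(budget, len(explanations)))
    return any(sum(difficulty[s] for s in e) <= threshold for e in sample)


# Estimate the success rate for n = 10,000 candidate explanations.
d, expls = build_toy_space(n=10_000)
rng = random.Random(0)
trials = 1000
hits = sum(node_succeeds(d, expls, rng) for _ in range(trials))
print(hits / trials)  # should be close to 100 / 10_000 = 0.01
```

Raising `n` drives the success rate toward zero even though an explanation of total difficulty 2 always exists.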

The prime number example shows that there are real cases where the difference is significant, and the formalism agrees.

So – an HCH tree whose human only has high school knowledge about math would not immediately guess the most elegant proof. Fortunately, it doesn't have to. HCH has a massive computational budget, so if it goes in a fruitless direction first, that's still fine, as long as it finds a correct proof eventually. Would it do that?

Who knows.^[14]

However, it certainly isn't obvious that deriving a mathematical proof has the same asymptotic difficulty as understanding it, which is what it means to say that Factored Cognition is guaranteed to either work for both stock IDA and Debate or neither.

We can thus end this post on our first conjecture:

Decompositions are an essential part of any Factored Cognition scheme, and changing how they are chosen is entirely allowed to change how the scheme scales to harder problems.

In the next post, we'll see how much we can do with the formalism. This will not be conclusive, which is why we will then switch gears and turn to the human component.

I say 'stock IDA' to refer to any implementation of IDA where a human is doing the decomposition. There are possible implementations where an agent is doing the decomposition or where there is no decomposition at all (those implementations don't rely on Factored Cognition). In its most general form, IDA is merely a template of a training scheme prescribing that there be two procedures called Distill and Amplify, and under this view, just about every training scheme is technically a variant of IDA (any method that uses gradient descent becomes an instance of IDA if we set [Amplification] = [Gradient Descent step] and [Distillation] = [Identity Function]), which is why we won't talk much about IDA in general.

Note that stock IDA still leaves the implementation of the distillation step open. ↩︎
Note that HCH is technically not a fully defined scheme, but a class of schemes $\{\mathrm{HCH}_{h, t, \ell}\}$, where $h$ defines the human component (what human, what environment, etc.), $t$ is a parameter in $\mathbb{R}_{+}$ that specifies a time limit, and $\ell$ defines the communication channel (what kind of messages, what length, etc.).

Given these parameters, we can define a node semi-formally like so:

A node is a human with context specified in $h$ , initialized with some question $q$ . It exists for time $t$ .

During this time, it can spawn another node with some question $q^{'}$ arbitrarily often; whenever it does, it immediately receives the output from that node.

By the end of time $t$ , it needs to provide an output, obeying conditions governed by parameter $ℓ$ .

(Here, the nodes it can spawn are the same type of object as itself.)

Then, the entire scheme is simply a node initialized with the scheme's input question.

This definition always yields a tree of infinite depth. It corresponds to what Paul calls weak-HCH. I talk more about why this sequence looks at weak-HCH rather than strong-HCH in a later post; it's related to the concepts discussed in Hiding Complexity. ↩︎
Details that have been abstracted away include:
- Inner Alignment concerns: in the real world, $A_{k + 1}$ may have an objective other than trying to approximate $[H \overset{\text{access}}{\longrightarrow} A_{k}]$ even if it was trained to do that.
- Bounded depth: for any $k \in \mathbb{N}$, the model $A_{k}$ only approximates an HCH tree of depth $k$, not an infinite one.
- Computational limitations: a trained model can only approximate an exponentially large tree insofar as the tree's computations can be done more cheaply with better algorithms. (This is the part we won't ignore since it's a hard limitation.)
↩︎
It being an explanation is a precise requirement; recall the definition above. ↩︎
Note that the complement of 'the judge succeeds in verifying that $s_{j}$ is true' is not 'the judge succeeds in verifying that $s_{j}$ is false'. If the judge is uncertain, the second agent also wins the debate. ↩︎
Details that have been abstracted away include:
- Inner Alignment concerns, again: in real Debate, the agents may have motives that go beyond winning any one debate game (like trying to cause more debates to happen in the future).
- Ambiguity: in reality, the meaning of statements can shift due to ambiguous words. This problem is the motivation for cross-examination.
- Wireheading: with the setup as-is, the first agent could subtly delude the judge rather than playing honestly, and the second agent can't prevent this since she has much more restricted output channels.
- Weak Debaters: although Debate agents should eventually become very powerful (as long as the training signal is accurate, the only limit is the power of the best machine learning techniques), they don't start off that way, and the scheme has to work even at that point.
↩︎
As an aside: this example probably illustrates that it won't be the case that every human is competent enough to judge Debate transcripts. For example, it requires some degree of familiarity with mathematical notation and some competence in logical thinking. (And they need to understand English.) The relevant question is how the difficulty scales with the complexity of the question. ↩︎
There are some striking examples of this principle in action in more difficult mathematical proofs. I may at some point dedicate a post to illustrating this. ↩︎
Here, we're abusing the $s \in \mathbf{S}$ notation to mean '$s$ appears in the sequence $\mathbf{S}$', which is technically different from set membership. ↩︎
The symbol $\mathcal{S}_{h}^{*}$ denotes the set of all sequences of statements in $\mathcal{S}_{h}$. (This use of the asterisk is standard.) ↩︎
Geoffrey Irving (inventor of the Debate scheme) said something functionally similar months ago on the AI alignment podcast: "A single debate transcript, in some sense, corresponds to a single path through the tree of amplification." ↩︎
Usually, people talk about decompositions rather than explanations. As far as I'm concerned, they're synonyms: both terms refer to the set of substatements given by a debate agent/the set of (question, answer) pairs an HCH node exchanges with subtrees. I'm talking about explanations to emphasize the fact that they imply a specific statement. ↩︎
Note that you don't need to define what the non-implication statements are, it's enough to postulate that they exist. You do need to define the implication statements since those determine which sequences are explanations. As an example, the following:
- $\mathcal{S}_{h} := \{s, s^{'}, s_{I}, s_{0}\}$
- $s_{I} :=$ “ $(s, s^{'})$ imply $s_{0}$ .”
- $d_{h} (s) := d_{h} (s^{'}) := d_{h} (s_{I}) := d_{h} (s_{0}) := 1$
perfectly defines a Cognition Space. However, this is not a solution to the exercise since $Explanations (s_{0})$ only has a single element. ↩︎
This is a good time to mention that Ought may or may not be studying questions similar to this one. ↩︎

AI ALIGNMENT FORUM

Idealized Factored Cognition
