(This post is part of a sequence that's meant to be read in order; see the preface.)

Post #1 was about developing and justifying a formalism for Factored Cognition. Now that we have this formalism, this post is about doing as much with it as possible.

1. Debate Trees

Recall that a Cognition Space is a pair $(S_{h}, d_{h})$ where $S_{h}$ is a set of statements, $d_{h} : S_{h} \to \mathbb{R}_{+}$ is a difficulty function, and $h$ is a human.

So far, I've only shown examples of single transcripts. A single transcript corresponds to one path through the Cognition Space that is dependent on choices from both agents: at every step, the first agent outputs an explanation (which is a sequence of statements), the second agent points at one element of this sequence, and the first agent continues the path from that element onward. However, given that we model Ideal Debate agents as maximally powerful, it is also coherent to ask about the object that results if we fix all of the first agent's actions in advance, such that she 'pre-commits' for any possible combination of choices from the second agent.

I call such an object a Debate Tree, and we can define it formally as a directed rooted tree^[1] $(V, E)$ , where $V \subset S_{h}^{*}$ (so each node is a sequence of statements) and $E \subset V \times V$ , that satisfies the following three conditions:

The unique root of $(V, E)$ is a one-element 'sequence' $(s_{0})$ .
$\forall (S, S^{'}) \in E \ \exists s \in S$ ^[2] $: S^{'} \xrightarrow{e} s$ .
$\forall (S, S^{'}), (S, S^{''}) \in E : [\exists s \in S : S^{'} \xrightarrow{e} s \land S^{''} \xrightarrow{e} s] \implies [S^{'} = S^{''}]$ .

The first condition says that the root of the tree needs to be a single statement (this should be the answer to the input question). The second condition says that every edge $(S, S^{'})$ needs to explain a statement in $S$ ; we don't have redundant edges in our tree. And the third condition says that each statement is only explained once; the formal way of saying this is that, if two explanations exist for the same statement, they're really the same. (Note that this restriction exists because the first agent has to choose one explanation during the game; it certainly doesn't imply that only one explanation exists.) There is no condition to demand that each statement needs to have an explanation – whenever there is none, it means that the first agent decides to end the debate when that statement is pointed at.^[3]

In this definition, nodes are explanations, and each node has one outgoing edge for each [statement of the explanation that the first agent wants to explain further], which links to an explanation for that statement. Defining the tree over individual statements would lead to a functionally equivalent definition, but I think defining it as-is makes arguments simpler.
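The three conditions can be checked mechanically on a small finite example. The following is a minimal sketch, not part of the formalism: the function name `is_debate_tree` and the encoding of the explanation relation as a set `explains` of (explanation, statement) pairs are assumptions made for illustration.

```python
from itertools import combinations


def is_debate_tree(root, nodes, edges, explains):
    """Check the three Debate Tree conditions on a finite candidate (V, E).

    nodes:    set of tuples of statements (the set V)
    edges:    set of (parent, child) pairs of nodes (the set E)
    explains: set of (explanation, statement) pairs; a hypothetical
              stand-in for the explanation relation from post 1
    """
    # Condition 1: the root is a one-element sequence (s_0).
    if root not in nodes or len(root) != 1:
        return False
    # Rooted-tree shape: the root has no parent, every other node has
    # exactly one, and every node is reachable from the root.
    parents = {v: [a for (a, b) in edges if b == v] for v in nodes}
    if parents[root] or any(len(parents[v]) != 1 for v in nodes if v != root):
        return False
    reached, frontier = {root}, [root]
    while frontier:
        v = frontier.pop()
        for (a, b) in edges:
            if a == v and b not in reached:
                reached.add(b)
                frontier.append(b)
    if reached != set(nodes):
        return False
    # Condition 2: every edge (S, S') explains some statement s in S.
    for (parent, child) in edges:
        if not any((child, s) in explains for s in parent):
            return False
    # Condition 3: two distinct children of S never explain the same s in S.
    for (p1, c1), (p2, c2) in combinations(edges, 2):
        if p1 == p2 and c1 != c2 and any(
            (c1, s) in explains and (c2, s) in explains for s in p1
        ):
            return False
    return True


# Toy example: "a" is explained by ("b", "c"), and "b" by ("d",).
root = ("a",)
nodes = {("a",), ("b", "c"), ("d",)}
edges = {(("a",), ("b", "c")), (("b", "c"), ("d",))}
explains = {(("b", "c"), "a"), (("d",), "b")}
print(is_debate_tree(root, nodes, edges, explains))  # True
```

Adding a second explanation for "b" below the node ("b", "c") would violate the third condition, and the check would return `False`.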

Recall that a Debate Tree encodes all decisions that the first agent makes (for any possible combination of choices from the second agent). This means that, once we fix this object, the second agent is free to choose any path she wants out of the tree, without further involvement from the first agent. Formally, a path through a Debate tree with root $(s_{0})$ is a pair^[4]

$(((s_{0}), S_{1}, \dots, S_{n}), s_{final})$

where $(S_{j - 1}, S_{j}) \in E \ \forall j \in \{1, \dots, n\}$ and $s_{final} \in S_{n}$ . (And $S_{0} = (s_{0})$ .) (You can go back to post 1 to convince yourself that a path through a Debate tree is also a path through a Cognition Space.)

As mentioned in the first post, we define the difficulty of a path $P := (p, s_{final})$ as the difficulty of the final statement (since that is the one being judged). In symbols, $d_{h} (P) := d_{h} (s_{final})$ .
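To make paths and their difficulty concrete, here is a small sketch. It assumes a hypothetical encoding in which `tree[node][s]` is the explanation the first agent has prepared for statement `s` of `node`, and it reads the definition so that a path ends at a statement the first agent does not explain further (the point where the debate ends and the judge takes over). The names `paths` and `path_difficulty` are illustrative only.

```python
from typing import Dict, Iterator, Tuple

Statement = str
Node = Tuple[Statement, ...]
# Hypothetical encoding: tree[node][s] is the prepared explanation for s.
# A statement without an entry ends the debate when it is pointed at.
Tree = Dict[Node, Dict[Statement, Node]]
Path = Tuple[Tuple[Node, ...], Statement]


def paths(tree: Tree, root: Node) -> Iterator[Path]:
    """Yield every pair ((S_0, ..., S_n), s_final) the second agent can reach."""

    def walk(prefix: Tuple[Node, ...]) -> Iterator[Path]:
        node = prefix[-1]
        for s in node:
            nxt = tree.get(node, {}).get(s)
            if nxt is None:
                yield (prefix, s)  # the debate ends here; s is judged
            else:
                yield from walk(prefix + (nxt,))

    yield from walk((root,))


def path_difficulty(path: Path, d: Dict[Statement, float]) -> float:
    """The difficulty of a path is the difficulty of its final statement."""
    _, s_final = path
    return d[s_final]


tree: Tree = {("a",): {"a": ("b", "c")}, ("b", "c"): {"b": ("d",)}}
d = {"a": 5, "b": 3, "c": 1, "d": 0.5}
for p in paths(tree, ("a",)):
    print(p, path_difficulty(p, d))
```

In this toy tree the second agent can steer the debate to either "d" or "c", so the hardest path has difficulty 1.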

2. The Ideal Debate-FCH

So far, we haven't talked about how precisely the two agents make their decisions. Given the concepts of Debate Trees and paths, this is now easy. The second agent wants the first agent to lose, which means she'll choose the most difficult path in whatever Debate Tree the first agent chooses. (We still assume that the first agent outputs only true statements, which means that the second agent can't win if the judge successfully verifies the final statement.) The first agent, knowing this, chooses the Debate tree such that the difficulty of the hardest path is minimized.

With this, we are almost ready to define a FCH for Ideal Debate. But first, we need a bit more notation:

Given a question $q$ , we (for now) assume there is a unique statement $a_{q} \in S_{h}^{T}$ that correctly answers the question.
Given a cognition space $(S_{h}, d_{h})$ , we write $T ((S_{h}, d_{h}), a_{q})$ for the set of all Debate trees that begin with statement $a_{q}$ .
Given a Debate tree $T$ , we write $P (T)$ for the set of all paths in $T$ .

Given this, the difficulty of the path we will end up with is

$\min_{T \in T ((S_{h}, d_{h}), a_{q})} \left( \max_{p \in P (T)} d_{h} (p) \right)$

Since the absolute values of the difficulty function $d_{h}$ are arbitrary (it doesn't matter whether we denote difficulties from 0 to 100 or from 0 to $10^{10000}$ ), we can assume without loss of generality that the hardest difficulty a human can handle is 1.^[5] Thus, we can formulate:

$\text{the Ideal Debate FCH} (h, Q) : \forall q \in Q : \min_{T \in T ((S_{h}, d_{h}), a_{q})} \left( \max_{p \in P (T)} d_{h} (p) \right) \leq 1$

where $h$ is a human and $Q$ a set of questions.

Going forward, it will also be useful to talk about the difficulty of Debate Trees. Thus, we define

$d_{h} (T) := \max_{p \in P (T)} d_{h} (p)$

and we say that a Debate tree $T$ beginning with $a_{q}$ can handle question $q$ if $d_{h} (T) \leq 1$ .^[6]
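For a finite Cognition Space, the min-max value can be computed recursively without enumerating trees. The sketch below rests on three assumptions that go beyond the text: the explanation relation is given as a hypothetical dict `explanations` mapping a statement to its non-empty candidate explanations, this relation is acyclic (otherwise the recursion would not terminate), and a path ends at a statement the first agent leaves unexplained. At each statement the first agent either ends the debate there or picks the explanation whose hardest element is easiest.

```python
from typing import Dict, List, Optional, Tuple

Statement = str
Explanation = Tuple[Statement, ...]


def debate_value(
    s: Statement,
    d: Dict[Statement, float],
    explanations: Dict[Statement, List[Explanation]],
    memo: Optional[Dict[Statement, float]] = None,
) -> float:
    """Minimum over Debate Trees rooted at (s) of the hardest path's difficulty."""
    if memo is None:
        memo = {}
    if s not in memo:
        best = d[s]  # trivial tree: end the debate at s immediately
        for S in explanations.get(s, []):
            # The second agent will point at the worst element of S.
            worst = max(debate_value(t, d, explanations, memo) for t in S)
            best = min(best, worst)
        memo[s] = best
    return memo[s]


d = {"a": 5, "b": 3, "c": 1, "d": 0.5, "e": 2}
explanations = {"a": [("b", "c"), ("e",)], "b": [("d",)]}
print(debate_value("a", d, explanations))       # 1
print(debate_value("a", d, explanations) <= 1)  # True
```

Here the first agent prefers the explanation ("b", "c") over ("e",), because "b" can be reduced further to the easy statement "d" while "e" cannot.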

Here is a different view of the problem. In the space of all statements, there is a subset that the human judge can verify directly, i.e.,

$D_{h}^{(0)} := \{ s \in S_{h} \mid d_{h} (s) \leq 1 \}$

Then, there is a larger subset that contains all of the above plus the statements that can be explained solely in terms of statements in $D_{h}^{(0)}$ , i.e.,

$D_{h}^{(1)} := D_{h}^{(0)} \cup \{ s \in S_{h} \mid \exists S \in (D_{h}^{(0)})^{*} : S \xrightarrow{e} s \}$

In general, given any $k \in \mathbb{N}_{+}$ , we can extend $D_{h}^{(k - 1)}$ by adding all statements that can be explained solely in terms of statements in $D_{h}^{(k - 1)}$ , i.e.,

$D_{h}^{(k)} := D_{h}^{(k - 1)} \cup \{ s \in S_{h} \mid \exists S \in (D_{h}^{(k - 1)})^{*} : S \xrightarrow{e} s \}$

Note that this gives us a chain of expanding sets, i.e.,

$D_{h}^{(0)} \subseteq D_{h}^{(1)} \subseteq \dots \subseteq D_{h}^{(k)} \subseteq D_{h}^{(k + 1)} \subseteq \dots$

We can also define the set of all statements that are eventually explainable in this way, i.e.,

$D_{h} := \bigcup_{j = 0}^{\infty} D_{h}^{(j)}$

Intuitively, it seems like Ideal Debate should be able to handle all questions with answers in $D_{h}$ , since they can be explained in terms of progressively easier statements – and then any path should eventually bottom out at $D_{h}^{(0)}$ , which means that the second agent cannot delay success indefinitely.
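The chain of sets can be computed by straightforward iteration on a finite example. This is a sketch under the assumption that the explanation relation is available as a hypothetical dict `explanations` from each statement to its candidate explanations; the function name `d_levels` is illustrative.

```python
from typing import Dict, List, Set, Tuple

Statement = str
Explanation = Tuple[Statement, ...]


def d_levels(
    d: Dict[Statement, float],
    explanations: Dict[Statement, List[Explanation]],
    max_k: int,
) -> List[Set[Statement]]:
    """Return the list [D^(0), D^(1), ..., D^(max_k)]."""
    # D^(0): statements the judge can verify directly.
    level = {s for s, difficulty in d.items() if difficulty <= 1}
    levels = [set(level)]
    for _ in range(max_k):
        # Add every statement with an explanation drawn entirely from the
        # previous level.
        newly_explained = {
            s
            for s, candidates in explanations.items()
            if any(all(t in level for t in S) for S in candidates)
        }
        level = level | newly_explained
        levels.append(set(level))
    return levels


d = {"a": 5, "b": 3, "c": 1, "d": 0.5, "e": 2}
explanations = {"a": [("b", "c"), ("e",)], "b": [("d",)]}
for k, level in enumerate(d_levels(d, explanations, 3)):
    print(k, sorted(level))
```

In this toy space the chain stabilizes at the second step: "b" joins at level 1, "a" at level 2, and "e" never joins because it has no explanation and is too hard to verify directly.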

This brings us to our first (and as of now, only) theorem:

Theorem. Given any $h$ and $Q$ , Ideal Debate-FCH $(h, Q) ⟺ \forall q \in Q : a_{q} \in D_{h}$ .

Proof. First, note that, while the Ideal Debate-FCH is formulated as a hypothesis, the definition also defines a set, namely

$X_{h} = \{ s \in S_{h} \mid \min_{T \in T ((S_{h}, d_{h}), s)} d_{h} (T) \leq 1 \}$

and the Ideal-Debate FCH simply says that $\forall q \in Q : a_{q} \in X_{h}$ . It thus suffices to show that $X_{h} = D_{h}$ . We will prove an even stronger statement. Note that we can restrict the set $X_{h}$ by limiting the maximum depth of the Debate Trees that can handle the respective statements. Formally, we can define

$X_{h}^{(k)} := \{ s \in S_{h} \mid \min_{T \in T^{(k)} ((S_{h}, d_{h}), s)} d_{h} (T) \leq 1 \}$

for any $k \in \mathbb{N}$ , where $T^{(k)}$ denotes the set of Debate Trees with depth at most $k$ . ('Depth' is defined as the number of edges in the longest path through the tree.) By construction, we now have $X_{h} = \bigcup_{j = 0}^{\infty} X_{h}^{(j)}$ , just as $D_{h} = \bigcup_{j = 0}^{\infty} D_{h}^{(j)}$ . What we will show is that

$D_{h}^{(k)} = X_{h}^{(k)} \quad \forall k \in \mathbb{N} .$

We proceed by induction. First, if $s \in D_{h}^{(0)}$ , then $d_{h} (s) \leq 1$ , which means that the trivial tree $(\{(s)\}, \emptyset)$ is a Debate Tree of depth 0 that can handle $s$ , so that $s \in X_{h}^{(0)}$ . Conversely, if $s \in X_{h}^{(0)}$ , then the Debate tree $T$ handling $s$ must have no edges (otherwise, its depth would be at least 1). Thus, $p := (((s)), s)$ is a path in this tree, and we have $1 \geq d_{h} (p) = d_{h} (s)$ , hence $s \in D_{h}^{(0)}$ .

Now, suppose the statement is true for some $k \in \mathbb{N}$ . We show that $D_{h}^{(k + 1)} = X_{h}^{(k + 1)}$ .

" $\subset$ ": Let $s \in D_{h}^{(k + 1)}$ . If $s \in D_{h}^{(k)}$ , the Inductive Hypothesis gives $s \in X_{h}^{(k)} \subseteq X_{h}^{(k + 1)}$ , so suppose otherwise. Then, there exists $S = (s_{1}, \dots, s_{n + 1}) \in (D_{h}^{(k)})^{*}$ such that $S \xrightarrow{e} s$ . Apply the Inductive Hypothesis to find Debate Trees $T_{1}, \dots, T_{n + 1}$ of depth at most $k$ such that tree $T_{j}$ handles statement $s_{j}$ . We combine these trees into a larger tree $T$ with root $(s)$ and an additional edge $((s), S)$ .^[7] Since all $T_{j}$ have depth at most $k$ , this tree has depth at most $k + 1$ . Furthermore, given any path $p$ through $T$ , by construction, the path must end in a node that exists in one of the $T_{j}$ , which implies that $d_{h} (p) \leq 1$ . It follows that $T$ handles $s$ and hence $s \in X_{h}^{(k + 1)}$ .

" $\supset$ ": Let $s \in X_{h}^{(k + 1)}$ . Then, there exists a Debate Tree of depth at most $k + 1$ that handles $s$ . If this tree has no edges, then $s \in X_{h}^{(0)} = D_{h}^{(0)} \subseteq D_{h}^{(k + 1)}$ by the base case, so suppose it has at least one. Let $((s), S)$ be the unique^[8] edge descending from the root. For each $s_{j} \in S$ , let $T_{j}$ be the subtree growing out of $s_{j}$ .^[9] By construction, $T_{j}$ has depth at most $k$ and handles $s_{j}$ , so (applying the Inductive Hypothesis), we have $s_{j} \in D_{h}^{(k)}$ . Then, $S \in (D_{h}^{(k)})^{*}$ and $S \xrightarrow{e} s$ , and hence $s \in D_{h}^{(k + 1)}$ .

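The level-wise equality from the proof can be sanity-checked on a toy Cognition Space. The sketch below is an illustration, not a substitute for the proof. It assumes a hypothetical dict `explanations` for the explanation relation, reads paths as ending at statements the first agent leaves unexplained, and computes the depth-bounded set with a recursion that allows at most $k$ explanation steps. The toy relation deliberately contains a circular explanation for "e" to show that the depth bound keeps everything finite.

```python
from typing import Dict, List, Set, Tuple

Statement = str
Explanation = Tuple[Statement, ...]
Relation = Dict[Statement, List[Explanation]]


def bounded_value(s: Statement, k: int, d: Dict[Statement, float], ex: Relation) -> float:
    """Best achievable hardest-path difficulty using trees of depth at most k."""
    best = d[s]  # depth 0: the trivial tree
    if k > 0:
        for S in ex.get(s, []):
            best = min(best, max(bounded_value(t, k - 1, d, ex) for t in S))
    return best


def X(k: int, d: Dict[Statement, float], ex: Relation) -> Set[Statement]:
    """Statements handled by some Debate Tree of depth at most k."""
    return {s for s in d if bounded_value(s, k, d, ex) <= 1}


def D(k: int, d: Dict[Statement, float], ex: Relation) -> Set[Statement]:
    """Statements explainable within k rounds from directly verifiable ones."""
    level = {s for s in d if d[s] <= 1}
    for _ in range(k):
        level = level | {
            s for s, cands in ex.items() if any(all(t in level for t in S) for S in cands)
        }
    return level


d = {"a": 5, "b": 3, "c": 1, "d": 0.5, "e": 2}
explanations = {"a": [("b", "c"), ("e",)], "b": [("d",)], "e": [("e",)]}
for k in range(5):
    print(k, sorted(D(k, d, explanations)), D(k, d, explanations) == X(k, d, explanations))
```

On this example the two constructions agree at every level checked, and "a" first appears at level 2 in both.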
3. Interpretation

At this point, we have a bunch more definitions and a theorem. Now, what does this mean?

Let's start with Debate Trees. A Debate Tree is actually a very natural object; it's what you get if you explain a subject in a hierarchical rather than a linear way. ( $X$ is true because of $Y_{1}, \dots, Y_{4}$ ; then $Y_{1}$ is true because [...].) It is very similar to Elizabeth's project of breaking questions down. Notably, that project never mentions Factored Cognition; it's just presented as an epistemic tool.

In a better world, would textbooks use something like Debate Trees to explain proofs? I'm almost certain the answer is yes. There is no way that a purely sequential presentation of information is optimal. Our understanding doesn't work that way (compare post #-2).

There is one difference between Debate Trees and a hierarchical presentation of information optimized for being easily understandable. In the former, only one part is actually explored, which means that a Debate Tree doesn't mind having redundancy in it (by explaining stuff in more than one place). Conversely, if you optimize for understandability with respect to a single reader, you'll want to avoid redundancy. Nonetheless, they are very similar.

So much for Debate Trees. What about the theorem we've just proved? What does it mean?

Essentially, it means that Debate is nicely behaved in the limit. As both agents become stronger and the structure of the game becomes stricter, we approach a situation where the scheme can answer a question if and only if its answer can be recursively explained until there are no more difficult components. Even though the game results from two powerful agents applying optimization in opposite directions, the result can be described without mentioning either one of them. Note that the same is not true for Iterated Amplification; even in the limit of perfect Factored Cognition, it is entirely possible that the scheme fails at a question for which an easy explanation exists.

Notably, the theorem stops being true if we drop the assumption that both agents are maximally powerful.^[10] If the first agent is weaker, she might fail to find the best explanation, which shrinks the set of statements Debate can handle. Conversely, if the second agent is weaker, she may fail to point to the most problematic statement, which enlarges the set of statements Debate can handle. Do these factors cancel out? I'm not sure. One thing that I haven't yet tried but that may be reasonable is to model inadequacy and see whether this benefits the first or second agent.

It's worth pointing out that the Ideal Debate FCH doesn't talk about false statements. It formalizes the claim 'the first agent can always win by being honest' which leaves open the possibility that she can also win by being dishonest. (And if she could do both, she would presumably do what's easier or safer.) It is necessary but perhaps not sufficient to realize Factored Cognition with Debate.

I think focusing on the honest case makes the most sense. Nonetheless, the next chapter is about what happens if the first agent wants to defend a lie.

4. Relaxing the truth assumption

To model dishonesty by the first agent, we assume that

In addition to $S_{h}^{T}$ , there is a set $S_{h}^{F}$ of false statements. We have $S_{h} = S_{h}^{T} ⊔ S_{h}^{F}$ .^[11]
Given a question $q$ , in addition to the honest answer $a_{q} \in S_{h}^{T}$ , there is at least one dishonest answer $\tilde{a}_{q} \in S_{h}^{F}$ . The first agent may defend this dishonest answer.
Any explanation for a false statement needs to contain at least one false element, i.e., if $(s_{1}, \dots, s_{n + 1}) \xrightarrow{e} s$ and $s \in S_{h}^{F}$ , then $\exists j \in \{1, \dots, n + 1\}$ such that $s_{j} \in S_{h}^{F}$ .

Since the second agent is also maximally powerful, she can repeatedly point to a false statement in the explanation. Due to the third assumption made above, this becomes an invariant: the ability to point to a false statement in any one step leads to the ability to point to a false statement in the next step, and therefore, since she can point to one in the beginning, she can do so at every step in the game.

In particular, this means that, regardless of when the first agent ends the game, we will have that $s_{final} \in S_{h}^{F}$ .
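The invariant can be traced on a toy example. The sketch below assumes a hypothetical encoding in which `tree[node][s]` is the explanation the dishonest first agent has prepared for statement `s`, and `is_false` labels each statement. The second agent here simply points at the first false statement she sees, which is enough to illustrate the invariant even though the real second agent would pick the most difficult path.

```python
from typing import Dict, Tuple

Statement = str
Node = Tuple[Statement, ...]
Tree = Dict[Node, Dict[Statement, Node]]


def second_agent_play(
    root_statement: Statement, tree: Tree, is_false: Dict[Statement, bool]
) -> Statement:
    """Follow the first agent's tree, always pointing at a false statement.

    Returns s_final, the statement that ends up in front of the judge.
    """
    node: Node = (root_statement,)
    while True:
        # Invariant: the current node contains a false statement. It holds at
        # the root (the answer is a lie) and is preserved because every
        # explanation of a false statement has a false element, so `next`
        # always finds a target.
        target = next(s for s in node if is_false[s])
        nxt = tree.get(node, {}).get(target)
        if nxt is None:
            return target  # the first agent ends the game here
        node = nxt


is_false = {"lie": True, "t1": False, "f1": True, "f2": True, "t2": False, "t3": False}
tree: Tree = {
    ("lie",): {"lie": ("t1", "f1")},
    ("t1", "f1"): {"t1": ("t3",), "f1": ("f2", "t2")},
}
s_final = second_agent_play("lie", tree, is_false)
print(s_final, is_false[s_final])  # f2 True
```

Wherever the first agent chooses to stop explaining, the statement handed to the judge is false.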

At this point, we have two conflicting intuitions. On the one hand, the judge should only declare the first agent the winner if she is certain that the statement to be verified is correct. This should only happen for simple statements; if $s_{final}$ is complicated, she should declare the second agent the winner. Thus, the first agent should only be able to win the game by choosing a $s_{final}$ that looks simple and obviously true, yet is false. This sounds extremely hard to accomplish.

On the other hand, we are, in some sense, pitting a human in a battle of wits against a potentially superintelligent AI, which sure sounds like doom. The superintelligent AI has to follow a very rigid structure, but... it still kinda sounds like doom. In particular, we probably expect that the first agent will navigate the game so cleverly that she only appears to follow the rigid structure while really doing something else. Crucially, we lack the ability of the second agent to interfere through an equally powerful I/O channel that exists in regular Debate.

There thus seems to be a tradeoff here, where at one end, we have the 'stock Debate' protocol that enforces no structure, and at the other, the 'Ideal Debate' protocol that enforces extremely rigid structure. More structure means more concern with things like wireheading, but also stronger theoretical guarantees.

I strongly suspect that running literal Ideal Debate is a recipe for disaster since the first agent can break away from the format entirely and attempt to wirehead the judge. At the very least, it would require a way for the judge to prematurely declare a winner (in literal Ideal Debate, the first agent has complete control over when to end the game), and it would probably also require the second agent to claim that wireheading is occurring. On the other hand, I do have the intuition that one should aim to get as close to Ideal Debate as possible while managing these concerns.

5. The HCH-FCH

Ideal Debate has been amenable to a formal description because the human only makes decisions at the end. In HCH, the human is involved all throughout and, crucially, is in charge of decomposing questions. To have an accurate model, one needs to abstract away the entire process of the human decomposing the question as well as any other cognitive work she might do on the question – but if this is done, there is nothing left to formally capture.

(An older version of this post used to have an attempt at formalizing it more, but I've since concluded that it can't be done right.)

To have something analogous to the Ideal Debate-FCH, we trivially define:

$\text{the HCH-FCH} (Q, h, t, \ell) : \text{HCH}_{h, t, \ell} \text{ can solve every question in } Q$

6. Conclusion

This post concludes the first part of the sequence. There probably weren't any huge surprises so far. Is having a formalism useful?

I think so. One purpose of formalizing a setup is that it overcomes the illusion of transparency. Without it, it is possible to think the setup is clear even when it really is not. I can take myself as one data point: my first attempt at coming up with a formalism looked different (and, I think, wrong). At least my past self did not understand the problem to the point that the formalism is trivial.

Anyway, at this point in the sequence, I want to defend the following two claims:

It doesn't make any sense to talk about the 'Factored Cognition Hypothesis'; there is no one requirement that works for both schemes.
The formalism of Cognition Spaces is accurate, in that it doesn't include anything that misrepresents Factored Cognition as implemented by IDA or Debate. It may be incomplete (e.g., there is nothing about defining terms, as people have pointed out on post #1, and maybe you could do more with false statements).

I think these are pretty conservative claims, even though the first clearly contradicts what I've heard other people say. If anyone doubts them, this is the place to discuss it. My conclusions from the second part are probably going to be a lot more controversial.

[1] A directed rooted tree, also called an arborescence, is a directed acyclic graph such that there exists [a node from which there is exactly one path to every other node]. This node is called the root; it's unique because, if there were two such nodes $x$ and $y$ , there would be a path $x ⟶ y$ and a path $y ⟶ x$ and hence a cycle $x ⟶ y ⟶ x$ .
[2] We continue writing $s \in S$ to denote that $s$ appears in the sequence $S$ .
[3] The 'trivial tree' $(\{(s_{0})\}, \emptyset)$ for some $s_{0} \in S_{h}$ is a proper Debate Tree, according to this definition. It corresponds to the first Debate agent giving an answer $s_{0}$ to the input question and deciding that this answer is already self-evident.
[4] Note that the first element of this pair is a path as defined in Graph theory (through the Debate Tree, which is a graph) whereas the second is an additional element that denotes which statement in the final explanation we end up in.
[5] In particular, one could model the same by having a difficulty threshold $c_{h}$ for a human, such that the human can deal with all questions that are at most $c_{h}$ hard. However, the pair $(d_{h}, c_{h})$ is equivalent to the pair $(d_{h}^{'}, 1)$ , where $d_{h}^{'}$ is like $d_{h}$ except that all difficulties are scaled by $\frac{1}{c_{h}}$ .
[6] Given this, one could alternatively phrase the Ideal Debate FCH as

$\forall q \in Q : \min_{T \in T ((S_{h}, d_{h}), a_{q})} d_{h} (T) \leq 1$

Or in words, one could say, 'every question in $Q$ can be handled by a Debate Tree'.
[7] This is a step where defining the nodes of Debate Trees to be explanations rather than individual statements makes things harder. For each subtree, one needs to replace its current root (that's a one-element sequence) with the node $S$ . This can be done formally in terms of the underlying nodes and edge sets; it's just cumbersome.
[8] The edge from the root is unique because the root is a one-element sequence $(s)$ , and each statement can only have one explanation (and thus only one edge) by the third condition of Debate Trees. This corresponds to the fact that the first agent only has to prepare one explanation for her initial answer; there is not yet a choice from the second agent involved.
[9] This step is the reverse of the combining step. The subtree growing out of $s_{j}$ is the tree we get by taking the sub-graph out of $S$ and replacing its root $S$ with just $(s_{j})$ .
[10] Recall that there is, in fact, only one agent playing against itself. Thus, we can assume that both agents are always equally competent.
[11] The 'squared cup' symbol $⊔$ means the same as $\cup$ plus the information that the two sets are disjoint.

AI ALIGNMENT FORUM