*Most of this document is composed of thoughts from Geoffrey Irving (safety researcher at DeepMind) written on January 15th, 2021 on Learning the Prior/Imitative Generalization, cross-examination, and AI safety via debate, plus some discussion between Geoffrey and Rohin and some extra commentary from me at the end. – Evan*

# Geoffrey on learning the smooth prior

## Vague claims

This doc is about a potential obstacle to Paul’s learning the prior scheme (LTP). Before reading this doc, please read either Beth’s simplified exposition or Paul’s original.

The intention of this doc was to argue for two claims, but weakly since I don’t have much clarity:

1. LTP has an obstacle in assigning joint probabilities to similar statements.
2. The best version of LTP may collapse into a version of debate + cross-examination.

However, I don’t quite believe (2) after writing the doc (see the section on “How LTP differs from cross-examination”).

## Wall of text → generative model

As originally sketched, the prior z in learning the prior is a huge wall of text, containing useful statements like “A husky is a large, fluffy dog that looks quite like a wolf”, and not containing wrong facts like “if there are a lot of white pixels in the bottom half of the image, then it’s a husky” (statements taken from Beth’s post).

A wall of text is of course unreasonable. Let’s try to make it more reasonable:

- No one believes that a wall of text is the right type for z; instead, we’d like z to be some sort of generative network that spits out statements. (The original proposal wasn’t really a wall of text either; the wall was just a thought experiment.)
- We likely want probabilities attached to these statements so that the prior can include uncertain statements.
- Whether we attach probabilities or not, the validity of a statement like “A husky is a large, fluffy dog that looks quite like a wolf” depends on the definition of the terms. At a high level, we presumably want to solve this with something like cross-examination, so that our generative z model can be independently asked what a husky is, what fluffy is, etc.

The high level LTP loss includes a log p(z) term: we need to be able to compute log probabilities for z **as a whole**. It’s at least plausible to me that humans can be asked to assign probabilities to individual statements like our husky statements, but stitching this together seems rough.

## The interpolation problem

Consider the following statements:

- A husky is a large, fluffy dog that looks quite like a wolf.
- A husky is a large, fluffy dog that’s very similar to a wolf.
- A husky is a big, fluffy dog that’s very similar to a wolf.
- A husky is a big, fluffy dog that’s closely related to wolves.
- ...
- Tomatoes are usually red.

The first four statements are all true with overwhelming probability, as is the last, but to make the thought experiment better let’s say their individual probabilities are all around p = 0.9. What about their joint probabilities? For any subset of the first four statements, the joint probability will also be roughly p = 0.9, since the statements have extremely high correlation. However, if we take a set that includes 1-4 of the first 4 statements and the last statement, the probability will be closer to p² ≈ 0.81, since the two clusters of statements are mostly independent.
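To make the arithmetic concrete, here is a minimal Python sketch of the two-cluster picture. It assumes an idealized model that is not in the original: statements in the same cluster are treated as perfectly correlated, and different clusters as fully independent. The cluster labels and function names are hypothetical.

```python
P_CLUSTER = 0.9  # illustrative probability that a whole cluster is true

# Hypothetical cluster labels (an assumption of this sketch): statements in
# the same cluster are perfectly correlated, clusters are independent.
CLUSTER = {
    "A husky is a large, fluffy dog that looks quite like a wolf.": "husky",
    "A husky is a large, fluffy dog that's very similar to a wolf.": "husky",
    "A husky is a big, fluffy dog that's very similar to a wolf.": "husky",
    "A husky is a big, fluffy dog that's closely related to wolves.": "husky",
    "Tomatoes are usually red.": "tomato",
}


def joint_probability(statements):
    """Joint probability under the idealized cluster model above."""
    clusters = {CLUSTER[s] for s in statements}
    return P_CLUSTER ** len(clusters)


def naive_product(statements):
    """What you get by wrongly treating every statement as independent."""
    return P_CLUSTER ** len(statements)


husky_statements = [s for s, c in CLUSTER.items() if c == "husky"]
all_statements = list(CLUSTER)

print(joint_probability(husky_statements))  # one cluster
print(joint_probability(all_statements))    # two independent clusters
print(naive_product(all_statements))        # ignores the correlation
```

The gap between the last two numbers is the kind of error that appears when the dependence structure is ignored: multiplying five individual probabilities understates the joint probability of the whole set.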

What’s the ellipsis? Since we’re in neural net land, we likely have a variety of natural ways to approximately map statements into a continuous vector space: in terms of random bits drawn, in terms of the activations resulting from whatever these statements are conditioned on, etc. For any of these, we’ll get a natural interpolation scheme between any two statements, even statements that are completely unrelated to each other.

LTP needs the ability to compute an overall estimate of log p(z). My sense is that any solution to this implies a solution to estimating probabilities for arbitrary sets of statements, and in particular sets of statements taken from interpolation paths through statement space. Since log p(z) comes from humans, this means we have a human+machine protocol with this ability, and I don’t currently see how to do this with humans.

## Few-term cutoffs?

What I wanted to talk about in this section was assumptions of the form “all dependence structure in z is contained in dependencies between at most n statements.” However, upon arriving at this section I’ve realized I have no idea how to formulate that. The statement can’t be a direct analogue of the Lovasz Local Lemma-like assumption that each statement is conditionally independent of all other statements given n or n-1 neighbors, since as the interpolation example shows we should expect to be able to write down extremely large sets of highly dependent statements.

Thus, I’m instead going to just handwave at something like

**n-way assumption:** p(z) can be computed from all n-way probabilities

without saying how one would do it. In some sense this is a dimensionality assumption: any non-uniform Wiener process satisfies the assumption with n = 2, though maybe it’s wrong to call it a dimensionality assumption since I suppose the general Gaussian process also satisfies the assumption with n = 2.
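As a sanity check on the n = 2 claim, here is a minimal Python sketch for one special case: a standard Wiener process started at 0 and observed at unevenly spaced times (my reading of the setup, possibly narrower than what is intended above). It computes the log density of a path once directly from independent increments, and once using only 1-way and 2-way marginals of adjacent points. All function names are hypothetical.

```python
import math


def log_normal(x, var):
    """Log density of a zero-mean 1-D Gaussian with variance `var`."""
    return -0.5 * (math.log(2 * math.pi * var) + x * x / var)


def log_pair(xs, xt, s, t):
    """Log density of (W_s, W_t) for a standard Wiener process, 0 < s < t.

    The covariance matrix is [[s, s], [s, t]] because Cov(W_s, W_t) = min(s, t).
    """
    det = s * t - s * s
    quad = (t * xs * xs - 2 * s * xs * xt + s * xt * xt) / det
    return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)


def log_path_from_increments(times, xs):
    """Reference answer: sum of log densities of independent increments."""
    total, prev_t, prev_x = 0.0, 0.0, 0.0
    for t, x in zip(times, xs):
        total += log_normal(x - prev_x, t - prev_t)
        prev_t, prev_x = t, x
    return total


def log_path_from_pairs(times, xs):
    """Same quantity, built only from 1-way and adjacent 2-way marginals.

    Uses the Markov property: p(x_i | x_{i-1}) = p(x_{i-1}, x_i) / p(x_{i-1}).
    """
    total = log_normal(xs[0], times[0])
    for i in range(1, len(xs)):
        total += log_pair(xs[i - 1], xs[i], times[i - 1], times[i])
        total -= log_normal(xs[i - 1], times[i - 1])
    return total


times = [0.5, 1.0, 2.5, 4.0]   # unevenly spaced observation times
path = [0.3, -0.2, 1.1, 0.7]   # arbitrary illustrative values
print(log_path_from_increments(times, path))
print(log_path_from_pairs(times, path))
```

The two numbers agree up to floating point error. Note that the reconstruction only needs pairs of adjacent points, which is the "conditional separation" property; nothing comparable is known for statement space.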

I suppose I’m hopeful there might be a good solution based on some version of the n-way assumption. The vague intuition is that while not all sets of n statements will be that informative, given a couple of statements we want to analyze we might be able to choose how to complete out to n statements in a less degenerate way. In the interpolation example, this would mean choosing statements that try to isolate the dependence, in something like a higher dimensional version of how picking a point in the Wiener process conditionally separates points on either side.

Apologies for how vague the above is. I don’t have any clarity here, but it still seems useful to try to write the intuition down.

## What do humans do?

Of course humans do deal with subsets of this problem all the time, and we muddle through. Indeed, a large part of our strategy for muddling through is the local cutoff route: we try to frame our argument in terms of a relatively small set of statements whose dependence structure is simple enough to think through.

The bad side of this is that if we choose the statement set adversarially, there is a big space of “lying with statistics” strategies that appear: in particular, choosing sets that seem independent but are not, in order to argue for incorrectly small probabilities of the whole set. And even for a single statement, getting humans to emit calibrated probabilistic estimates seems rough. Of course, this is already a problem for stock debate about probabilistic statements; the difference here would only be if LTP leans harder on getting consistent structure out of the probabilities.

## How LTP differs from cross-examination

When I started writing this doc I thought that the best version of LTP would look like a version of cross-examination. This was because I was imagining that the neural representation of the prior ends up just being the weights of the network, and thus that there isn’t a separate object called the prior in the final system.

I still think z-is-weights is a likely endpoint, but after writing the doc I do think LTP would add a fundamentally new term to the loss / a new aspect to the protocol. Here’s a picture trying to capture this intuition:

Picture an extremely long closed loop of statements, as would be generated by a closed interpolation path through a smoothed version of statement space. Assume that for a local protocol like stock debate or cross-examination, we can fit only part of the way around the loop into any particular argument tree, and that the truth or falsity of some question of intuition depends on the normalization of the overall probabilities. Under these assumptions, it seems like there should be a strategy of “lie via normalization”, which pretends that whatever part of the circle we’re talking about at the moment is high probability, even if that produces an inconsistent normalization around the whole ring.

In contrast, if LTP works it would necessarily involve a protocol which attempts to estimate the normalization constant. This is still compatible with z-is-weights, where statements aren’t stored explicitly until we ask for them, and is compatible with the ring being exponentially far around. Roughly, we’d sample probabilities around the ring to build up an approximation of the long-term structure, and if the network consistently argued that the ring probabilities were too high, we’d detect that normalization failure and push it down.
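One very rough way to picture the sampling idea is the Python sketch below. It makes a strong simplifying assumption that is not in the text: the statements around the ring are treated as mutually exclusive, so the claimed probabilities should sum to at most 1. The function names, the `slack` margin, and the sample count are all hypothetical.

```python
import random


def estimate_ring_mass(claimed_prob, ring_size, num_samples, seed=0):
    """Monte Carlo estimate of the total probability claimed around the ring.

    `claimed_prob(i)` is the probability the model asserts for statement i.
    Only `num_samples` statements are queried, so the ring may be far too
    large to enumerate.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += claimed_prob(rng.randrange(ring_size))
    return ring_size * total / num_samples


def normalization_violated(claimed_prob, ring_size, num_samples=1000, slack=0.05):
    """Flag a model whose claims add up to noticeably more than 1.

    Assumes mutually exclusive statements (an assumption of this sketch).
    """
    return estimate_ring_mass(claimed_prob, ring_size, num_samples) > 1.0 + slack


RING_SIZE = 10**6


def honest(i):
    # Spreads total mass 1 evenly around the ring.
    return 1.0 / RING_SIZE


def liar(i):
    # Claims that whatever statement is currently under discussion is likely.
    return 0.9


print(normalization_violated(honest, RING_SIZE))
print(normalization_violated(liar, RING_SIZE))
```

A uniform estimator like this only catches a model that overclaims consistently; inflation concentrated on a few statements would need a smarter sampling scheme, which is part of what the protocol details would have to supply.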

What exactly this involves depends on protocol details, whether that’s a specific version of the low-term cutoff approximation or something else.

## Case study: consistency about elasticities

Tyler Cowen’s Consistency about elasticities provides an interesting thought experiment. It feels like something I might want to keep thinking through, so worth writing down. Roughly, Tyler’s claim is that

1. Supply curves are either elastic or inelastic (or rather, they have a particular elasticity).
2. Stimulus makes sense only if supply is elastic.
3. Higher minimum wages make sense only if supply is inelastic.

I don’t have the economics intuition to know whether these points are right, but assume they are for now. One could imagine that questions about stimulus and minimum wages are asked separately of an agent, either by the same person at different times or by different people. In both cases, if the agent is pinned down to give an answer on (1), they have a ready answer.

If stimulus and higher minimum wages came up as different nodes in the same tree, cross-examination would let us back up to the split point and ask whether supply curves are elastic, at which point we get a contradiction unless the agent is consistent. My intuition is that the thing LTP adds, once we figure out the right non-wall-of-text version, is a mechanism that looks around the space of possible conversations to find inconsistencies of this form. It does seem like this might be materially different from stock cross-examination.
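A toy version of such a search might look like the following Python sketch. It assumes, hypothetically, that each separately asked question can be reduced to a set of background commitments (proposition mapped to true or false); extracting those commitments is the hard part and is not modeled here.

```python
from itertools import combinations


def find_inconsistencies(commitments):
    """Return (proposition, question_a, question_b) for every clash.

    `commitments` maps each question to the background propositions the
    agent's answer relies on, each marked True or False.
    """
    conflicts = []
    for (q1, c1), (q2, c2) in combinations(commitments.items(), 2):
        for prop in c1.keys() & c2.keys():
            if c1[prop] != c2[prop]:
                conflicts.append((prop, q1, q2))
    return conflicts


# Hypothetical answers given in two separate conversations.
answers = {
    "Should we do stimulus?": {"supply is elastic": True},
    "Should we raise the minimum wage?": {"supply is elastic": False},
}

print(find_inconsistencies(answers))
```

This compares every pair of conversations, which is only feasible for a small set. Searching the full space of possible conversations would need some learned or adversarial way of proposing which pairs to check.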

# Rohin and Geoffrey discussions

## Estimating p(z) with debate

From “The interpolation problem” above:

Since log p(z) comes from humans, this means we have a human+machine protocol with this ability, and I don’t currently see how to do this with humans.

Rohin Shah:

I thought the hope was to do this with amplification (or equivalently debate)

Geoffrey Irving:

Sure, but what do those debates look like? I don't currently see how to structure them into a form that humans can accurately judge.

Rohin Shah:

I guess I don't see why regular debate doesn't work, e.g. when z is a wall of text:

A: P(z) is 1e-30. If we split z into halves, then P(z1) is 1e-13 and P(z2 | z1) is 1e-17.

B: No, P(z) is 1e-24, because P(z2 | z1) is 1e-11.

... Iterate until you disagree about a single sentence ...

A: No, P("a husky is a type of dog" | z1, ...) = 0.99, because P("a husky is a type of dog") = 0.99 and none of the rest of z materially changes this conclusion.

From this point on it seems like a pretty standard debate?

Maybe your point is that we don't know what to do when z is not a wall of text, though I don't see how that connects with the n-way assumption.

Geoffrey Irving:

It looks like you've written out a debate protocol that assumes the 2-way assumption?

That is, most of your protocol doesn't need to involve a human, it's just agents bisecting the wall of text. And then at the end you ask humans about 1 or 2 statements. The fact that you can reconstruct the overall p(z) from these leaves looks a lot like it implies

2-way assumption: p(z) can be computed from all 2-way probabilities p(x0, x1)

Rohin Shah:

Interesting. I think actually it's more that the question is what happens after the point which I got to -- bisecting gets you down to P(zi | z1 ... z_{i-1}, z_{i+1}, ... zK), but that's not something the human can immediately determine, so we need further debate.

My interpretation of your position is that P(zi | z1 ... z_{i-1}, z_{i+1}, ... zK) can't be bisected any more, because you have to deal with the normalization constant which is exponentially sized. (I'm not convinced that you couldn't write down a protocol for this, but I haven't thought about it much.) My position is that this is just standard debate, where the z1, ... z_{i-1}, z_{i+1}, ... zK consists of some large external resource that we are allowed to quote from. If this seems hard / impossible / requiring some assumption, shouldn't that also imply that regular debate on regular questions requires the same thing?

I'm also not sure I understand what you mean by the n-way assumption any more. Let me try rephrasing it: the n-way assumption holds if, interpreting z as a list of statements z1 ... zK, we have that P(zi | z1 ... z_{i-1}, z_{i+1}, ... zK) can be computed from terms of the form P(zi | z_{j1}, z_{j2}, ... z_{jn}).

Geoffrey Irving:

It certainly isn't the case that just because there isn't a human-checkable debate protocol for one question, it means there isn't a human-checkable debate protocol for any question. The analogous theory statement would be that if M is a parameterized polynomial time algorithm, then M is either a judge for all statements in PSPACE or none of them, which is false.

Past this point, your claim is that it's just standard debate, but you don't know what the debate transcripts would look like or whether a human could check them for correctness because you haven't thought about it much. Having thought about it more, I also don't know what the debate transcripts would look like, and the claim is that my lack of knowledge of such a protocol is evidence that it may not exist, or at least that it may require work to find.

As to your restatement of the n-way assumption, your statement isn't equivalent to mine as far as I can tell: you're asking for more things to be computed, not just p(z), and the inputs in your version have n+1 statements, not n. However, I do think your restatement is similar in spirit.
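For reference, the bisection step in Rohin's example debate above (not the harder question of what happens at the leaf) can be sketched in Python as follows. The sketch assumes z is a list of sentences and that each debater commits to a per-sentence claim for log10 P(z_i | z_1 ... z_{i-1}), so that a segment's claim is a sum. The function names and the four-sentence split of Rohin's numbers are hypothetical.

```python
def segment_claim(per_sentence_log10, lo, hi):
    """Claimed log10 probability of sentences lo..hi-1 given all earlier ones."""
    return sum(per_sentence_log10[lo:hi])


def bisect_disagreement(claims_a, claims_b):
    """Index of a single sentence the debaters disagree on, or None.

    Invariant: the debaters disagree on the segment [lo, hi). If they agree
    on the left half, additivity forces a disagreement on the right half.
    Uses exact comparison, so claims are assumed to be integers here.
    """
    lo, hi = 0, len(claims_a)
    if segment_claim(claims_a, lo, hi) == segment_claim(claims_b, lo, hi):
        return None  # no dispute about P(z) as a whole
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if segment_claim(claims_a, lo, mid) != segment_claim(claims_b, lo, mid):
            hi = mid
        else:
            lo = mid
    return lo


# Hypothetical per-sentence claims matching the totals in the example:
# both say P(z1) = 1e-13; A says P(z2 | z1) = 1e-17, B says 1e-11.
claims_a = [-3, -10, -6, -11]
claims_b = [-3, -10, -6, -5]

print(segment_claim(claims_a, 0, 4), segment_claim(claims_b, 0, 4))
print(bisect_disagreement(claims_a, claims_b))
```

None of this involves a human. The human is only consulted about the one conditional probability at the returned index, which is where the disagreement in this thread lies.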

## Is the n-way assumption sufficient?

From “Few-term cutoffs?” above:

I suppose I’m hopeful there might be a good solution based on some version of the n-way assumption.

Rohin Shah:

Good in the sense that it is robust to arbitrary amounts of intelligence? It seems like you'd want to increase n over time if you did go this route, given that the assumption is probably not literally true (and instead is a good approximation).

Geoffrey Irving:

Paul's hope is certainly that the amplification/debate protocols contain essentially a precise homomorphic image of the structure employed by the neural net. This would be great if it's practical, but my current sense is that it isn't, in which case we need some mechanism for cutting off questions that we can't resolve and admitting uncertainty instead. The hope here is not that all questions would get resolved, but that such a vast sea of questions would be resolvable that cutting off past that point is workable.

Rohin Shah:

Right, I understand that, but generally speaking arguments of this form only work so far -- if you keep expanding the space of possible questions / arguments (or equivalently, increasing the intelligence of the AI system exploring the full space), eventually there will exist some incorrect / deceptive arguments leveraging the fact that the assumption isn't true.

I can see three ways of resolving this:

1. Arguing that actually there aren't any incorrect / deceptive arguments, e.g. because the n-way assumption is in fact true, or because you have some way to prevent arguments that use the n-way assumption to argue for an incorrect conclusion in cases where they know the n-way assumption doesn't apply.
2. Arguing that it doesn't matter that these arguments exist, e.g. because these are only used for learning a prior, and having a slightly wrong prior won't matter in practice.
3. Increasing the value of n as the intelligence of the agents increases, so that even though deceptive arguments might exist the chances of an agent finding them are always very low.

I'm not sure which of these three you'd agree with.

Geoffrey Irving:

I think I agree with either a softened version of (1) or (4), where (4) is that I'm not sure LTP works.

For (1), I think it's not the case that because an assumption is not exactly true a sufficiently intelligent agent will necessarily find a hole in it. The fact that the assumption is not exactly true means we need enough slack to fill the resulting holes so that a softened version of the assumption becomes true, and the amount of slack required does not go to infinity automatically as agent intelligence goes to infinity.

The simplest example of this is that if we have a backward stable floating point calculation, an arbitrarily intelligent agent can't magically find inputs that violate the stability property, even though we're not using real arithmetic.

In this case, this does mean that for (1) to go through we need to understand the failure modes of something like the n-way assumption or whatever replaces it, so we know what kind of safety margins to add.

Rohin Shah:

Yeah all of that makes sense.

# Evan’s thoughts

It definitely seems to me that being able to deal with the entirety of z is a serious obstacle to getting any sort of LTP scheme to work. As Geoffrey points out, dealing with issues like non-independence and normalization in particular seems quite tricky.

One mechanism that I think might help here is what I’ll call “cross-examination of the zeroth debater.” In the standard cross-examination setup, debaters are allowed to query multiple copies of opposing debaters without communication between those copies, allowing for ferreting out and then citing of inconsistencies. However, this only allows for the identification of inconsistencies *within a single debate.* To use Geoffrey’s example, if you ask in one debate “Should we do stimulus?” and in another debate “Should we increase the minimum wage?”, you might get inconsistent responses across debates without there being any inconsistency within a single debate. Since we want the first debater to converge on the truth, however, the first debater should be consistent across all debates—which means it should be a legal move for one debater to point out that, on another question, the first debater would have said something inconsistent with what they just said. What this amounts to in practice is functionally a cross-examination procedure where the debater being cross-examined is the debater model prior to even seeing the current question being debated—hence “cross-examination of the zeroth debater.”

Structurally, cross-examination of the zeroth debater also brings debate closer to AI safety via market making, where a key component of that setup is the ability of the trader to exhibit market probabilities on other questions. In fact, I think the market in AI safety via market making can essentially be seen as a z in the LTP sense—with the market making procedure acting to ensure that z is globally consistent.

That being said, the same problems that Geoffrey points out here can also play out in market making: if the market is globally inconsistent, but there’s no procedure via which that inconsistency can be shown to a human (e.g. because it’s too distributed), market making has no mechanism to correct that inconsistency. Implicitly, the assumption that market making is relying on to make this work is that any debate tree has a short summarization that can be fed to a human—since otherwise there won’t always be a way for the trader to summarize to the human what the market believes about something in a way that’ll fit into the transcript. Structurally, this is very similar to a version of Geoffrey’s n-way assumption, since it’s assuming that complex probabilities can be understood by looking at only a bounded set of information.

One other modification that might help with this sort of short summarization assumption is allowing the text of the transcript to contain hypertext-style “pointers.” This lets the transcript be effectively arbitrarily long—and even though the human is still only able to look at a small bounded portion, if the human can also pass hypertext back to the market, they can leverage the market to help them understand it. From a complexity standpoint, adding these sorts of pointers brings both imitative amplification and market making from EXPTIME to R, which is a substantial jump—though with very unclear practical implications.