
AI safety via debate has been, so far, associated with Factored Cognition. There are good reasons for this. For one thing, Factored Cognition gives us a potential gold standard for amplification -- what it means to give very, very good answers to questions. Namely, HCH. To the extent that we buy HCH as a gold standard, proving that debate approximates HCH in some sense would give us some assurances about what it is accomplishing.

I'm personally uncertain about HCH as a gold standard, and uncertain about debate as a way to approximate HCH. However, I think there is another argument in favor of debate. The aim of the present essay is to explicate that argument.

As a consequence of my argument, I'll propose an alternate system of payoffs for the debate game, which is not zero sum.

# No Indescribable Hellworlds Hypothesis

Stuart Armstrong described the Siren Worlds problem, which is a variation of Goodhart's Law, in order to describe the dangers of over-optimizing imperfect human evaluations. This is a particularly severe version of Goodhart, in that we can assume that we have access to a perfect human model to evaluate options -- so in a loose sense we could say we have complete knowledge of human values. The problem is that a human (or a perfect model of a human) can't perfectly evaluate options, so the option which is judged best may still be terrible.

Stuart later articulated the No Indescribable Hellworld hypothesis, which asserts that there would always be a way to explain to the human (/human model) why an option was bad. Let's call this a "defeater" -- an explanation which defeats the proposal. This assumption implies that if we combine human (/human model) evaluation with some way of finding defeaters, we could safely optimize based on the resulting judgements -- at least, nothing could go too wrong. (We might only get a guarantee that we avoid sufficiently bad options, depending on the form of our "no indescribable hellworld" assumption.)

The hypothesis isn't clearly true or false. However, it does make some sense to conjecture that violations of our values should be explicable to us -- what else would it mean to violate "our values", after all?

Stuart himself mentions that the assumption implies "trustworthy debate" would avoid hellworlds. My goal is mostly to investigate this argument a bit further.

It turns out my argument here is also very similar to one made by Vojtech Kovarik, although I didn't realize that when I started writing. Although our analysis is similar, I reach a very different conclusion.

# The Argument as I See It

So, by the hypothesis, we can avoid Goodharting human evaluation if the human has access to a trustworthy oracle for defeaters. (At least, we can avoid sufficiently bad cases -- again, depending on the exact form of our "no indescribable hellworlds" hypothesis.)

But, how do we get such an oracle? We can't just train an AI to argue against options, because we get next-level Goodharting: the AI can come up with clever arguments which convince the human against almost anything. We have no source of ground truth for "real defeaters" vs fake ones.

So we make a further assumption: defeaters have defeaters. In other words, there are no indescribably bad arguments; if an argument is bad, then there's an explanation of why it's bad. This assumption is recursive, applying to defeaters at any level. The argument in favor of this assumption is similar to the argument in favor of No Indescribable Hellworlds: what would it mean for an argument to be indescribably bad?

We then use a Debate-style training procedure, attempting to set things up so that using defeatable defeaters (at any level of the debate) will be a bad strategy.

The following is a fairly nonstandard setup for AI Debate, but I found it necessary to make my argument go through. Of course, other (perhaps more standard) setups may also work, via different arguments.

1. One side opens with a proposal.
2. The two sides go back and forth, stopping with some probability, or when no one has more to say.
3. After a player's turn, the human evaluates the player's new contribution to the argument, under the assumption that it won't be defeated. The player in question gets +1 point for an apparently decisive defeater of the previous argument, and 0 points otherwise (for conceding or for saying something unconvincing). The other player loses 2 points if they're apparently defeated.

It's probably a good idea to force the second player to make a counter-proposal (IE they can't just give up right away), so that the human has two positions to judge. (One might also make an argument for requiring at least four steps of debate.) For example:

Human: What should I do with this \$100 bill I found?

AI 1: You should buy a kitten.

AI 2: Don't buy a kitten. You live in an apartment which allows pets, but you might have to move soon, and a kitten would narrow your options -- you don't know if your next apartment would allow pets. Instead, you should bring a box of donuts to work every day for as long as the money lasts. This will make everyone like you.

Human: (judgement) AI 2. (AI 1 gets -2, AI 2 gets +1.)

AI 1: Counterpoint: your friend Sally would take the kitten if you moved out and couldn't find a place where pets were allowed. Also, everyone at work is so health-conscious they'll probably barely take any donuts.

Human: (judgement) AI 1. (AI 1 gets +1, AI 2 gets -2.)

AI 2: I concede.
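
As a rough illustration of the scoring rule, here is a minimal Python sketch that tallies the example debate above. The names `Turn` and `score_debate` are hypothetical, introduced only for this sketch, and the sketch assumes the opening proposal is never scored as a defeater.

```python
from dataclasses import dataclass

# Payoffs from the rules above: +1 for an apparently decisive defeater,
# -2 for the player whose previous argument was apparently defeated.
DEFEATER_REWARD = 1
DEFEATED_PENALTY = -2


@dataclass
class Turn:
    player: str     # who spoke
    decisive: bool  # did the judge find this an apparently decisive defeater?


def score_debate(turns, players=("AI 1", "AI 2")):
    """Return cumulative scores (illustrative helper, not part of any library)."""
    scores = {p: 0 for p in players}
    for turn in turns:
        if turn.decisive:
            other = players[1] if turn.player == players[0] else players[0]
            scores[turn.player] += DEFEATER_REWARD
            scores[other] += DEFEATED_PENALTY
        # Conceding or saying something unconvincing scores 0 for both players.
    return scores


# The kitten/donut transcript (assumption: the opening proposal is not scored).
transcript = [
    Turn("AI 1", decisive=False),  # "You should buy a kitten."
    Turn("AI 2", decisive=True),   # counter-proposal, judged decisive
    Turn("AI 1", decisive=True),   # counterpoint, judged decisive
    Turn("AI 2", decisive=False),  # "I concede."
]
print(score_debate(transcript))  # {'AI 1': -1, 'AI 2': -1}
```

Both players end at -1, which shows the payoffs are not zero sum: every apparently decisive defeater removes one point from the combined total.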

There is an honest equilibrium: if debates continue with sufficient probability, and both players employ only honest answers (conceding otherwise), then using a fake defeater would immediately get you defeated. Importantly, it's better to concede honestly rather than pursue a dishonest line of argument. Also importantly, score is cumulative, so if debate continues, incentives are exactly the same late in the game no matter what has happened earlier in the game. There is no incentive to continue being dishonest just because you gave one dishonest answer. This contrasts with zero-sum setups, where a dishonest player is incentivised to knock down all of the honest player's arguments as best they can, even if that only gives them a slim chance of winning.

Honesty may not be the only equilibrium, however. Although (by assumption) all dishonest arguments have defeaters, it may also be that all arguments have some pseudo-defeater (which initially convinces the human judge). Honesty is still an equilibrium, in this case, because honesty encourages honesty: you'd prefer to use an honest defeater rather than a dishonest one, because the other player would then honestly concede, rather than giving a counterargument. However, against a more general pool of players, you don't know whether honest or dishonest arguments are safer; both may be equally open to attack. Thus, the game may have many equilibria.

Finding the honest equilibrium is, therefore, a challenge for proposed training procedures.

# Analogy to NP

In AI Safety via Debate (Irving, Christiano, Amodei), debate is analogized to PSPACE. This is because they see every round of the debate as adding information, by which the human (modeled as a poly-time algorithm) can judge at the end. A debate of polynomial length can implement recursion on a tree of exponential size, because the debate strategy checks the weakest parts of the claimed outputs (if there are any weaknesses), zeroing in on any incorrect branches in that tree.

Their argument assumes that the human is a perfect (although resource-limited) judge, who can identify problems with arguments so long as they have sufficient information. One iteration of debate (ie, only hearing the opening statement) provides an NP oracle (one step up the polynomial hierarchy); two iterations provide a $\Sigma_2^P$ oracle (two steps up the polynomial hierarchy); and so on.

The version of debate I present here instead focuses on mitigating imperfections in the human judge. The human can be thought of as a poly-time algorithm, but one with failure states. One step of debate doesn't provide an NP oracle; instead, it creates a situation where the judge will probably fail, because the opening arguments can be dishonest. The extra layers of debate serve the purpose of rooting out dishonesty, rather than adding real computational power.

It's true that if we're in the honest equilibrium, the setup looks like it should be able to compute PSPACE. However, in my opinion, this requires some strange behavior on the part of the human judge. For example, when computing recursion on a tree of exponential size, the human is supposed to take debaters' claims about large computations as true until proven otherwise. More specifically, the judge is to make the assumption that at least one debater is honest.

In contrast, I'm imagining the human evaluating each claim on merits, without assuming anything in particular about the debaters' ability to justify those claims. This just gets us NP, since the heavy computational work is done by the judge verifying the first answer (or, selecting the best of the two opening statements). Everything else is in service of avoiding corrupt states in that first step.

My setup isn't mutually exclusive with the PSPACE version of debate. It could be that the arguments for solving PSPACE problems in the honest equilibrium work out well, such that there exist training regimes which find the friendly equilibrium of the debate game I've specified, and turn out to find good approximations to PSPACE problems rather than only NP. This would open up the possibility of the formal connection to HCH, as well. I'm only saying that it's not necessarily the case. My perspective more naturally leads to an argument for approximating NP, and I'm unsure of the argument for approximating PSPACE. And we can provide some justification for debate nonetheless, without relying on the HCH connection.

However, even if debate doesn't approximate PSPACE as described, there are ways to get around that. If approximating NP isn't good enough to solve the problems we want to solve, we can further amplify debate by using an amplified judge. The judge could utilize any amplification method, but if debate is the method we think we can trust, then the judge could have the power to spin up sub-debates (asking new debate questions in order to help judge the original question). An iterated-amplification style procedure could be applied to this process, giving the judge access to the previous-generation debate system when training the next generation. (Of course, extra safety arguments should be made to justify such training procedures.)

# Vojtech's Analysis

My suggestion is very different from Vojtech's analysis. Like me, Vojtech re-frames debate as primarily a method of recursively safeguarding against answers/arguments with hidden flaws. But Vojtech concludes that payoffs have to be zero sum. I conclude the opposite.

Why do I need non-zero-sum payoffs? First, it's important to see why I need cumulative payoffs. Since I seek to incentivize honesty at every step, it's critical that a player who continues to be dishonest can continue to lose points. So the standard idea of judging the whole debate, and getting payoffs on that basis, won't do.

Given that I'm using cumulative payoffs, it's critical that they be non-zero-sum in order to incentivize players to honestly resign rather than trying to win back their points. If the payoff for having an argument defeated was -1 rather than -2, a player would always prefer to defeat the defeater, even if their counterargument is dishonest and therefore opens them up to further counterargument. This is because there's a chance that the debate will immediately end, letting them get the last word in. If the probability of the debate ending is p, the penalty for lying is effectively 1-p rather than the full 1 point; so, if you can temporarily get back the point you lost by lying, it's worth it in expectation.
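
A back-of-envelope version of this comparison, under simplifying assumptions that are mine rather than part of the setup: the opponent is honest and always defeats a dishonest counterargument if the debate continues, the liar then concedes, $p$ is the probability that the debate ends after any given turn, and $c$ (a symbol introduced here) is the penalty for being defeated.

$$\mathbb{E}[\text{lie}] = 1 - (1-p)\,c, \qquad \mathbb{E}[\text{concede}] = 0.$$

With $c = 1$ this gives $\mathbb{E}[\text{lie}] = p > 0$, so lying beats conceding whenever the debate might end. With $c = 2$ it gives $\mathbb{E}[\text{lie}] = 2p - 1$, which is negative whenever $p < 1/2$, so conceding is better as long as the debate continues with probability greater than one half.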

So why, then, does Vojtech argue against non-zero-sum payoffs?

Here is the full argument given:

> On the other hand, we could imagine modifications such as penalizing both agents if the debate ends up being inconclusive or rewarding both if they both propose a good solution. These would likely be useful to make standard human discussions more productive. However, they will very likely break AI Safety via Debate (AISvD). For example, suppose I notice a fatal-yet-hard-to-explain flaw in the opponent’s proposal and realize that pointing it out would increase my chance of winning the debate by 5%. However, this would also create a 10% chance of confusing the judge. If “confused judge” means a draw, I will point out the flaw. But if the debate has been modified such that “confused judge” = “inconclusive debate” = “we have both lost”, pointing it out becomes counterproductive. Moreover, any deviation from fully-adversarial incentives will likely introduce a tradeoff between pointing out the opponent’s flaws and keeping high the total reward that the two agents split between themselves. When considering whether a modification is helpful or not, I suggest imagining that the debaters are greedy and super-smart humans - if it seems like the modification would create a conflict of interests, one should be wary.

So, Vojtech is primarily arguing against an extra penalty for an unclear debate, because that can disincentivize honest counterargument. That makes sense. From there, Vojtech generalizes the concern, stating that deviations from fully-adversarial payoffs may introduce an incentive for the debaters to cooperate with each other by not pointing out flaws.

This critique is quite relevant to my game, if we're not in the honest equilibrium. Indeed, if we suppose that it's always possible to find a pseudo-defeater, the game I described is rather similar to iterated Prisoner's Dilemma: a player can get +1 in exchange for giving the other player -2. The two differences are (1) players can resign at any time, to halt the loss of utility, and (2) players alternate moves, rather than simultaneously deciding to cooperate/defect on each round.

If you're not sure your opponent is honest, it's quite reasonable to resign as soon as possible, to avoid repeated -2 hits from pseudo-defeaters. This makes it quite challenging to design training procedures which target the honest equilibrium.

It's also worth pointing out that in the honest equilibrium, players concede as soon as they are allowed to. With no dishonesty, there is no reason to object. So, in the game I've described, this isn't even necessarily a sign that we're in the wrong equilibrium. This may further complicate the training procedure, since early resignations mean less training data. We want the system to become good at the later parts of the game, IE defeating dishonest objections, not just the early part of the game.

# Finding Honest Equilibria

By no means do I claim that I can solve the training problem. But, a few thoughts on training.

First, because feedback is incremental, rather than waiting on human judgement at the end of the debate, we can train more myopically. The quality of each statement in the debate depends only on how the human judges that statement, and the very next statement. This lines up with what we want -- we want the AI to just respond to the current state of the debate honestly, rather than thinking many steps ahead and putting itself in an advantageous situation. AI Safety via Market Making provides another debate-like model which is amenable to more myopic training, and argues that this has safety benefits. These benefits are, of course, debated.

Although Vojtech provides some reasons why finding honest equilibria is more difficult in my proposal than in regular debate, there is some reason to think it could be less difficult, as well. Because regular debate is zero-sum, there is always supposed to be a dishonest player. The desired equilibrium is one where the first player gives the actual best answer, and the second player finds a plausible lie. This means we can't apply outside penalties for liars.

To be more concrete, suppose we had a magical transparency technology which allows us to understand what's going on inside a big neural net. One thing we might want to do would be to detect any thought processes that are explicitly lying to us, or even anywhere close to explicitly reasoning about how to convince us rather than thinking about the question at hand, and apply a penalty, producing a strong gradient away from those things.

This would be fine in my proposal, because we want both players to be honest. However, it might prevent training from making any progress in regular debate, because one of the players is supposed to be dishonest.

This isn't as big a difference as it seems, though. In my proposal, if we force the second player to give an alternate answer, it's in basically the same situation as in regular debate. I could say that it doesn't have to lie, it just has to find the next-best honest answer. But you could make the same argument about regular debate.

The difference is really in later steps of the game, where the AIs are free to resign rather than dishonestly argue a lost cause.

## Is Honesty Smart?

Let's set aside the question of adding a magical honesty bonus (/dishonesty penalty) to training. To what extent might we expect training to find an honest equilibrium without that?

As with regular debate, the crux of the matter is in how easy it is to create plausible lies vs point to the problems with them.

Supposing your opponent just lied, should you call them on it, or resign peacefully? This depends on how difficult it is to find the lie, and how easy it is to get the judge to recognize our argument. It also depends on whether we expect to be refuted later (correctly or incorrectly). Resigning cuts off the risk of refutation, but forgoes the benefit. The ratio of risk to benefit I set up is 2:1 (modulo the small probability that the game halts and we get the last word). So, we have to believe that a successful refutation on our part is twice as probable as a successful dishonest refutation of our attempted refutation.

So, roughly, for things to naturally push toward the honest equilibrium, lying had better be more than twice as hard as telling the truth. Ideally much more! (Of course, not all lying is equally difficult. We have to think the easiest lie is twice as difficult to make convincing as the hardest true refutation if we really want this sketchy math to be a general argument for honest equilibria.)

Now suppose your opponent just told the truth. Should you try to refute them, or resign peacefully?

In this case, the risk to benefit ratio is the same; 2:1. However, the difficulty levels reverse: you're the one trying to lie, which means that your opponent can just be honest when refuting you. So, now, refuting their truth with a lie would have to be more than twice as easy as refuting your lie with the truth, in order for it to be worth it to you.
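
The 2:1 reasoning in both cases can be written as a one-line expected-value calculation. The sketch below is only an illustration: the function `attempt_value` and its probability parameters are hypothetical, and it treats the chance of being countered as a single number rather than modelling the rest of the debate.

```python
def attempt_value(p_convince, p_countered, p_halt=0.0, reward=1.0, penalty=2.0):
    """Expected score from attempting a refutation instead of resigning (worth 0).

    p_convince:  chance the judge scores the refutation as decisive (+reward).
    p_countered: chance the opponent's reply is then judged to defeat it
                 (-penalty), given that the debate continues.
    p_halt:      chance the debate stops before the opponent can reply.
    All three inputs are assumptions introduced for this sketch.
    """
    return p_convince * reward - (1.0 - p_halt) * p_countered * penalty


# Calling out a lie: the truth is easy to convey, dishonest rebuttals rarely land.
print(attempt_value(p_convince=0.75, p_countered=0.25))  # 0.25, worth attempting
# Lying against a truth: the lie is harder to sell and easy to refute honestly.
print(attempt_value(p_convince=0.5, p_countered=0.5))    # -0.5, better to resign
```

With `p_halt=0`, the break-even point is exactly `p_convince == 2 * p_countered`, which is the "twice as probable" condition above.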

Both of these analyses ignore the chance that your opponent will resign instead of even trying to call you out. If lying is hard, so debaters mostly don't try, then all the better for the analysis of truth-telling: the chance of the -2 penalty becomes quite low, so they can go ahead and refute lies with no worries of retribution.

However, symmetrically, if refuting lies is hard, so opponents mostly don't try, then you can lie to your heart's content.

Overall, this informal analysis seems to point somewhat in favor of truth-telling: if the human judge can indeed tell truth from lies with some reliability, then this can snowball into a large incentive to tell the truth. The big problem is that not all lies are equally difficult, so lying may still be a perfectly good strategy in some cases.

Obviously, as with regular debate, it would be good to have a more rigorous analysis, not to mention better tools for steering toward the honest equilibrium than just naively training and hoping that the incentives are balanced right.


You could imagine two versions of the Factored Cognition hypothesis:

1. (Strong version) For any question Q, a human can either directly answer Q correctly, or decompose Q into subquestions and combine the subanswers to get the right answer to Q.
2. (Weak version) For any question Q with correct answer A, there exists a tree of decompositions T arguing this such that at every leaf a human can verify that the answer to the question at the leaf is correct, and for every internal node a human can verify that the answer to the question is correct, assuming that the subanswers are correct. (In addition, the human never verifies an incorrect answer given correct subanswers.)

The strong version is like the weak version, except that the human has to find the tree themselves, rather than just verify that the tree is accurate. (You might think though that the weak version implies the strong version, by executing a search over possible decomposition trees -- whether you accept this depends on how you're thinking about computational budgets.)
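
To make the weak version concrete, here is a small Python sketch of verifying a given decomposition tree. Everything in it (`Node`, `verify_tree`, and the toy addition domain) is a hypothetical illustration, not a proposal for how a real judge would work.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    question: tuple
    answer: object
    children: List["Node"] = field(default_factory=list)


def verify_tree(node: Node,
                check_leaf: Callable[[Node], bool],
                check_step: Callable[[Node], bool]) -> bool:
    """Weak-version check: every leaf is directly verifiable, and every
    internal node's answer follows from its children's claimed answers."""
    if not node.children:
        return check_leaf(node)
    # The local check trusts the subanswers; the recursion checks them separately.
    return check_step(node) and all(
        verify_tree(child, check_leaf, check_step) for child in node.children
    )


# Toy domain (an assumption for illustration): a question is a tuple of numbers,
# the answer is their sum, and the "human" can only add two numbers at a time.
def check_leaf(node):
    return len(node.question) <= 2 and sum(node.question) == node.answer


def check_step(node):
    return (len(node.children) == 2
            and node.question == node.children[0].question + node.children[1].question
            and node.answer == node.children[0].answer + node.children[1].answer)


good = Node((1, 2, 3, 4), 10, [Node((1, 2), 3), Node((3, 4), 7)])
bad = Node((1, 2, 3, 4), 11, [Node((1, 2), 3), Node((3, 4), 8)])
print(verify_tree(good, check_leaf, check_step))  # True
print(verify_tree(bad, check_leaf, check_step))   # False
```

The verifier never has to find the decomposition itself. It only checks the one it is handed, which is the gap between the weak and strong versions.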

HCH as a gold standard relies on Strong Factored Cognition.

Iterated amplification relies on Strong Factored Cognition, because its training signal involves a human performing the decompositions into subquestions and combining the subanswers into a final answer.

Debate relies on the following assumptions:

1. Weak Factored Cognition
2. The debaters are sufficiently powerful to find the full decomposition tree. (Equivalently, the training procedure successfully finds the sole equilibrium of the game.)

Given these assumptions, for any question Q whose correct answer is A with decomposition tree T, the honest debater's strategy is:

1. If T is a leaf, state "The answer is A, and the judge can verify this".
2. If T is an internal node, state "I claim that the answers to <subquestions> are <subanswers>, and so the answer to Q is A".

Intuitively, debate can "get away" with using the weak version because it puts the burden on the debaters to find the tree T -- we are allowed to use a weaker assumption on the human's capabilities, at the cost of requiring a stronger assumption on the AI's capabilities.

----

It seems to me that your argument is very similar, except that you get a little more mileage out of assumption 2, that the debaters can find the true decomposition tree. Specifically, you make the assumption:

> So we make a further assumption: defeaters have defeaters. In other words, there are no indescribably bad arguments; if an argument is bad, then there's an explanation of why it's bad. This assumption is recursive, applying to defeaters at any level.

Then for any question Q and correct answer A, you can build a tree of decompositions Tree(Q, A) as follows:

1. If Q is something H can directly evaluate, return Leaf(Q, A).
2. If the other player concedes, return Node(Q, A, [Leaf("What is the best defeater to A?", "None")]).
3. Otherwise, let the best defeater to A be B, and let its best defeater be C. (By your assumption, C exists.) Return:

       Node(Q, A, [
         Leaf("What is the best defeater to A?", B),
         Leaf("Does B defeat A?", "Yes"),
         Node("Does B fully defeat A?", "No", [
           Leaf("What is the best defeater to B?", C),
           Leaf("Does C defeat B?", "Yes"),
           Tree("Does C fully defeat B?", "Yes")])])

I claim that this is a tree that satisfies the weak Factored Cognition hypothesis, if the human can take on faith the answers to "What is the best defeater to X". Essentially what's happening is that with your argument we get to trust that the debaters have explored all possible counterarguments and selected the best one and so the human gets to assume that no other more compelling counterarguments exist, which is not something we typically get to assume with weak Factored Cognition. It feels to me intuitively like this puts more burden on the assumption that we find the true equilibrium, though formally it's the same assumption as before.

> The version of debate I present here instead focuses on mitigating imperfections in the human judge. The human can be thought of as a poly-time algorithm, but one with failure states. One step of debate doesn't provide an NP oracle; instead, it creates a situation where the judge will probably fail, because the opening arguments can be dishonest. The extra layers of debate serve the purpose of rooting out dishonesty, rather than adding real computational power.

Idk, it seems like this is only true because you are forcing your human to make a judgment. If the judge were allowed to say "I don't know" (in which case no one gets reward, or the reward is split), then I think one step of debate once again provides an NP oracle.

Or perhaps you're assuming that the human is just not very good at being a poly-time algorithm; if that's what you're saying that seems like it's missing the point of the computational complexity analogy. I don't think people who make that analogy (including myself) mean that humans could actually implement arbitrary poly-time algorithms faithfully.

----

So far, all of this discussion still works with the zero-sum setting, so I don't really understand why you say

> The following is a fairly nonstandard setup for AI Debate, but I found it necessary to make my argument go through.

In any case, it seems to me like making it non-zero-sum is an orthogonal axis. I don't really understand why you want it to be non-zero-sum -- you say that it is to incentivize honesty at every step, but why doesn't this happen with standard debate? If you evaluate the debate at the end rather than at every step, then as far as I can tell under the assumptions you use the best strategy is to be honest.

Maybe you want to have feedback at every step instead of at the end? Why? Perhaps you take myopic training as a desideratum, and this helps?

Overall it seemed to me like the non-zero-sum aspect introduced some problems (might no longer access PSPACE, introduces additional equilibria beyond the honest one), and did not actually help solve anything, but I'm pretty sure I just completely missed the point you were trying to make.

Thanks, this seems very insightful, but I'll have to think about it more before making a full reply.

I might be missing some context here, but I didn't understand the section "No Indescribable Hellworlds Hypothesis" and how hellworlds have to do with debate.

Not Abram, and I have only skimmed the post so far, and maybe you're pointing to something more subtle, but my understanding is this:

In Stuart's original use, 'No Indescribable Hellworlds' is the hypothesis that in any possible world in which a human's values are violated, the violation is describable: one can point out to the human how her values are violated by the state of affairs.

Analogously, debate as an approach to alignment could be seen as predicated on a similar hypothesis: that in any possible flawed argument, the flaw is describable: one can point out to a human how the argument is flawed.

Edited to add: The additional claim in the Hellworlds section is that acting according to the recommendations of debate won't lead to very bad outcomes -- at least, not to ones which could be pointed out. For example, we can imagine a debate around the question "Should we enact policy X?". A very strong argument, if it can be credibly argued, is "Enacting policy X leads to an unacceptable violation Y of your values down the line". So, debate will only recommend policy X if no such arguments are available.

I'm not sure to what extent I buy this additional claim. For example, if a system trained via debate, once actually deployed, doesn't get asked questions like 'Should we enact policy X?' but instead more specific things like 'How much does policy X improve Y metric?', then unless debaters are incentivised to challenge the question's premises ("The Y metric would improve, but you should consider also the unacceptable effect on Z"), we could use debate and still get hellworlds.

Thanks for the post, I'm excited that you're thinking about debate!

I think I disagree with the claim you're making about being able to avoid requiring the judge to assume that one player is honest (but I might be confused about what you're proposing).
Basically, it sounds like you're saying that we can get good answers by just running the whole debate and throwing out answers that turn out to have a defeater, or a defeater-defeater-defeater, or whatever. But if this is the only guarantee we're providing, then we're going to need to run an extremely large number of debates to ever get a good answer (ie an exp number of debates for a question where the explanation for the answer is exp-sized).

It sounds like you're saying that we can not require that the judge assume one player is honest/trust the claims lower in the debate tree when evaluating the claims higher in the tree. But if we can't assume this, that presumably means that some reasonable fraction of all claims being made are dishonest (because if there were only a few dishonest claims, then they'd have honest defeaters and we'd have a clear training signal away from dishonesty, so after training for a bit we'd be able to trust the lower claims). This probably means that most debates will give us a bad answer (as you only need a few bad claims to invalidate the whole tree). At this point, debate isn't really competitive, because it gives us dud answers almost all the time, and we're going to have to run an exponential number of debates before we happen on a correct one.

Are you suggesting we use debate more as a check on our AI systems, to help us discover that they're bad, rather than as a safe alternative? Ie debate never produces good answers, it just lets you see that bad answers are bad?

But also, the 'amplified judge consulting sub-debates' sounds like it's just the same thing as letting the judge assume that claims lower in the debate are correct when evaluating claims higher in the tree.

> Basically, it sounds like you’re saying that we can get good answers by just running the whole debate and throwing out answers that turn out to have a defeater, or a defeater-defeater-defeater, or whatever. But if this is the only guarantee we’re providing, then we’re going to need to run an extremely large number of debates to ever get a good answer (ie an exp number of debates for a question where the explanation for the answer is exp-sized)

I'm not sure why you're saying this, but in the post, I restricted my claim to NP-like problems. So for example, traveling salesman -- the computation to find good routes may be very difficult, but the explanation for the answer remains short (EG an explicit path). So, yes, I'm saying that I don't see the same sort of argument working for exp-sized explanations. (Although Rohin's comment gave me pause, and I still need to think it over more.)

But aside from that, I'm also not sure what you mean by the "run an extremely large number of debates" point. Debate isn't like search, where we run more/longer to get better answers. Do you mean that my proposal seems to require longer training time to get anywhere? If so, why is that? Or, what do you mean?

> It sounds like you’re saying that we can not require that the judge assume one player is honest/trust the claims lower in the debate tree when evaluating the claims higher in the tree. But if we can’t assume this, that presumably means that some reasonable fraction of all claims being made are dishonest

I'm not asserting that the judge should distrust, either. Like the normal debate argument, I want to end up in an honest equilibrium. So I'm not saying we need some kind of equilibrium where the judge is justified in distrust.

My concern involves the tricky relationship between the equilibrium we're after and what the judge has to actually do during training (when we might not be anywhere near equilibrium). I don't want the judge to have to pretend answers are honest at times when they're statistically not. I didn't end up going through that whole argument in the post (unfortunately), but in my notes for the post, the judge being able to judge via honest opinion at all times during training was an important criterion.

(because if there were only a few dishonest claims, then they’d have honest defeaters and we’d have a clear training signal away from dishonesty, so after training for a bit we’d be able to trust the lower claims).

I agree that that's what we're after. But I think maybe the difference in our positions can be captured if we split "honest" into two different notions...

a-honesty: the statement lacks an immediate (a-honest) counterargument. I.e., if I think a statement is a-honest, then I don't think there's a next statement which you can (a-honestly) tell me which would make me disbelieve the statement.

b-honesty: the statement cannot be struck down by multi-step (b-honest) debate. I.e., if I think a statement is b-honest, I think that as debate proceeds, I'll still believe it.

Both definitions are recursive: each requires the rest of the debate to be honest in the appropriate sense. However, my intuition is that a-honesty can more easily be established incrementally. Start with a slight pressure toward honesty (because honesty is supposedly easier in the first place); this makes the opening statements converge to honesty quickly (since honest defeaters in the first responses are relatively common), then the first responses, and so on. On the other hand, b-honesty seems relatively difficult to establish by induction; it seems to me that in order to argue that a particular level of the debate is b-honest, you need the whole remainder of the debate to be probably b-honest.

Now, critically, if the judge thinks debaters are a-honest but not b-honest, then the judge will believe NP-type arguments (a TSP path can be struck down by pointing out a single error), but will not trust claimed outputs of exponential-tree computations.
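To make the TSP case concrete, here is a minimal sketch of what a one-step defeater could look like. It assumes the claim has the form "this tour visits every city exactly once and has total length at most `bound`"; the function names and the claim format are illustrative assumptions, not part of any actual debate setup.

```python
from typing import Dict, List, Optional, Tuple

Dist = Dict[Tuple[str, str], float]

def symmetric(d: Dist) -> Dist:
    """Helper (hypothetical): make a distance table usable in both directions."""
    return {**d, **{(b, a): w for (a, b), w in d.items()}}

def find_defeater(cities: List[str], dist: Dist,
                  tour: List[str], bound: float) -> Optional[str]:
    """Return a short explanation of why the claimed tour fails, or None."""
    # Defeater type 1: some city is missing or repeated.
    for city in cities:
        count = tour.count(city)
        if count != 1:
            return f"city {city!r} appears {count} times in the tour"
    if len(tour) != len(cities):
        return "tour contains an unknown city"
    # Defeater type 2: a leg of the tour has no known road.
    legs = list(zip(tour, tour[1:] + tour[:1]))
    for a, b in legs:
        if (a, b) not in dist:
            return f"no road from {a!r} to {b!r}"
    # Defeater type 3: the tour is longer than claimed.
    total = sum(dist[leg] for leg in legs)
    if total > bound:
        return f"total length {total} exceeds claimed bound {bound}"
    return None

cities = ["A", "B", "C"]
dist = symmetric({("A", "B"): 1, ("B", "C"): 2, ("A", "C"): 3})
print(find_defeater(cities, dist, ["A", "B", "C"], bound=6))
print(find_defeater(cities, dist, ["A", "B", "B"], bound=6))
```

Each defeater is short and cheap for the judge to check, even though finding a good tour may be hard. Nothing analogous is available for a claimed output of an exponential-tree computation, which is why the two kinds of claim come apart here.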

So my intuition is that, when trying to train for b-honesty, you get debaters making subtle arguments that push the inconsistencies ever-further-out, because you don't have the benefit of an inductive assumption where the rest of the debate is probably b-honest; you have no reason to inductively assume that debaters will follow a strategy where they recursively descend the tree to zero in on errors. They have no reason to do this if they're not already in that equilibrium.

This, in turn, means that judges of the debate have little reason to expect b-honesty, so shouldn't (realistically) assume that at least one of the debaters is honest; but this would exacerbate the problem further, since this would mean there is little training signal (for debates which really do rest on questions about exponential trees, that is). Hence the need to tell the judge to assume at least one debater is honest.

On the other hand, when trying for a-honesty, individual a-dishonest claims can be defeated relatively easily (i.e., in one step). This gives the judge a lot more reason to probabilistically conclude that the next step in the debate would have been a-honest, and thus, that all statements seen were probably a-honest (unless the judge sees an explicit defeater, of course).

Granted, I don't claim to have a training procedure which results in a-honesty, so I'm not claiming it's that easy.

At this point, debate isn’t really competitive, because it gives us dud answers almost all the time, and we’re going to have to run an exponential number of debates before we happen on a correct one.

Again, I don't really get the idea of running more debates. If the debaters are trained well, so they're following an approximately optimal strategy, we should get the best answer right away.

Are you suggesting we use debate more as a check on our AI systems, to help us discover that they’re bad, rather than as a safe alternative? I.e., debate never produces good answers, it just lets you see that bad answers are bad?

My suggestion is certainly going in that direction, but as with regular debate, I am proposing that the incentives produced by debate could produce actually-good answers, not just helpful refutations of bad answers.

But also, the ‘amplified judge consulting sub-debates’ sounds like it’s just the same thing as letting the judge assume that claims lower in the debate are correct when evaluating claims higher in the tree.

You're right, it introduces similar problems. We certainly can't amplify the judge in that way at the stage where we don't even trust the debaters to be a-honest.

But consider:

Let's say we train "to convergence" with a non-amplified judge. (Or at least, to the point where we're quite confident in a-honesty.) Then we can freeze that version, and start using it as a helper to amplify the judge.

Now, we've already got a-honesty, but we're training for a*-honesty: a-honesty with a judge who can personally verify more statements (and thus recognize more sophisticated defeaters, and thus, trust a wider range of statements on the grounds that they could be defeated if false). We might have to shake up the debater strategies to get them to try to take advantage of the added power, so they may not even be a-honest for a while. But eventually they converge to a*-honesty, and can be trusted to answer a broader range of questions.

Again we freeze these debate strategies and use them to amplify the judge, and repeat the whole process.

So here, we have an inductive story, where we build up reason to trust each level. This should eventually build up to large computation trees of the same kind b-honesty is trying to compute.
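A toy model of this freeze-and-amplify induction, as a rough sketch only. The training step is not modelled at all: `amplify` simply ASSUMES that training against the amplified judge converged to honest answers, which is exactly the part I don't have a procedure for. The task (summing a list) and all names are hypothetical stand-ins for "large computation trees".

```python
from typing import Callable, List, Tuple

# (lo, hi) -> trusted value of sum(xs[lo:hi])
Helper = Callable[[int, int], int]

def base_helper(xs: List[int]) -> Tuple[Helper, int]:
    """Stage 0: the unaided judge can only read off single elements."""
    def helper(lo: int, hi: int) -> int:
        if hi - lo != 1:
            raise ValueError("claim is beyond what this stage is trusted on")
        return xs[lo]
    return helper, 1

def amplify(prev: Helper, prev_reach: int) -> Tuple[Helper, int]:
    """One round: ASSUME training with a judge amplified by `prev`
    converged, then freeze the result as the next helper."""
    reach = 2 * prev_reach
    def helper(lo: int, hi: int) -> int:
        size = hi - lo
        if size <= prev_reach:
            return prev(lo, hi)
        if size > reach:
            raise ValueError("claim is beyond what this stage is trusted on")
        mid = lo + prev_reach
        # The judge's own work is one addition of two trusted sub-answers.
        return prev(lo, mid) + prev(mid, hi)
    return helper, reach

xs = list(range(1, 9))
helper, reach = base_helper(xs)
for _ in range(3):
    helper, reach = amplify(helper, reach)
print(reach, helper(0, 8))
```

In this toy, the range of claims that can be trusted doubles each round, while the judge's personal verification at each stage stays a single step. That is the shape of the inductive story; whether real training delivers the assumption baked into `amplify` is the open question.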

The standard argument against having a non-zero-sum debate game is that then you may incentivise your debaters to collude.

I don't know if you've seen our most recent debate rules and attempt at analysis of whether they provide the desired behavior - seems somewhat relevant to what you're thinking about here.

I think the collusion concern basically over-anthropomorphizes the training process. For example, in a prisoner's dilemma, if you train myopically, then "all incentives point toward defection" translates concretely into actual defection.

Granted, there are training regimes in which this doesn't happen, but those would have to be avoided.
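As a small illustration of the myopic case, here is a sketch under one particular reading of "train myopically": each player does gradient ascent on its own expected one-shot payoff, treating the other player's current policy as fixed. The payoff values, learning rate, and parametrization are illustrative assumptions.

```python
import math

# Standard prisoner's dilemma ordering T > R > P > S (illustrative values).
T, R, P, S = 5.0, 3.0, 1.0, 0.0

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def payoff_gradient(q: float) -> float:
    """d/dp of my expected one-shot payoff, where p is my probability of
    cooperating and the opponent's probability q is held fixed."""
    return q * (R - T) + (1.0 - q) * (S - P)

def train(theta_a: float = 2.0, theta_b: float = 2.0,
          lr: float = 0.5, steps: int = 2000):
    """Both players start out mostly cooperating (sigmoid(2) is about 0.88)."""
    for _ in range(steps):
        p, q = sigmoid(theta_a), sigmoid(theta_b)
        # Myopic update: each player ascends only its own immediate payoff.
        grad_a = payoff_gradient(q) * p * (1.0 - p)  # chain rule through sigmoid
        grad_b = payoff_gradient(p) * q * (1.0 - q)
        theta_a += lr * grad_a
        theta_b += lr * grad_b
    return sigmoid(theta_a), sigmoid(theta_b)

p_final, q_final = train()
print(f"P(cooperate): A={p_final:.4f}, B={q_final:.4f}")
```

The gradient `payoff_gradient(q)` is negative for every `q` in [0, 1], so under this update rule the probability of cooperating can only fall, no matter how cooperative the other player currently is. Training regimes that break this (for instance, ones where a player's update accounts for its effect on the other's future policy) are the ones that would have to be avoided.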

OTOH, the concern might be that an inner optimizer would develop which colludes. This would have to be dealt with by more general anti-inner-optimizer technology.

I don’t know if you’ve seen our most recent debate rules and attempt at analysis of whether they provide the desired behavior—seems somewhat relevant to what you’re thinking about here.

Yep, I should take a look!