AI safety via debate has been, so far, associated with Factored Cognition. There are good reasons for this. For one thing, Factored Cognition gives us a potential gold standard for amplification -- what it means to give very, very good answers to questions. Namely, HCH. To the extent that we buy HCH as a gold standard, proving that debate approximates HCH in some sense would give us some assurances about what it is accomplishing.
I'm personally uncertain about HCH as a gold standard, and uncertain about debate as a way to approximate HCH. However, I think there is another argument in favor of debate. The aim of the present essay is to explicate that argument.
As a consequence of my argument, I'll propose an alternate system of payoffs for the debate game, which is not zero sum.
No Indescribable Hellworlds Hypothesis
Stuart Armstrong described the Siren Worlds problem, which is a variation of Goodhart's Law, in order to describe the dangers of over-optimizing imperfect human evaluations. This is a particularly severe version of Goodhart, in that we can assume that we have access to a perfect human model to evaluate options -- so in a loose sense we could say we have complete knowledge of human values. The problem is that a human (or a perfect model of a human) can't perfectly evaluate options, so the option which is judged best may still be terrible.
Stuart later articulated the No Indescribable Hellworld hypothesis, which asserts that there would always be a way to explain to the human (/human model) why an option was bad. Let's call this a "defeater" -- an explanation which defeats the proposal. This assumption implies that if we combine human (/human model) evaluation with some way of finding defeaters, we could safely optimize based on the resulting judgements -- at least, nothing could go too wrong. (We might only get a guarantee that we avoid sufficiently bad options, depending on the form of our "no indescribable hellworld" assumption.)
The hypothesis isn't clearly true or false. However, it does make some sense to conjecture that violations of our values should be explicable to us -- what else would it mean to violate "our values", after all?
Stuart himself mentions that the assumption implies "trustworthy debate" would avoid hellworlds. My goal is mostly to investigate this argument a bit further.
It turns out my argument here is also very similar to one made by Vojtech Kovarik, although I didn't realize that when I started writing. Although our analyses are similar, I reach a very different conclusion.
The Argument as I See It
So, by the hypothesis, we can avoid Goodharting human evaluation if the human has access to a trustworthy oracle for defeaters. (At least, we can avoid sufficiently bad cases -- again, depending on the exact form of our "no indescribable hellworlds" hypothesis.)
But, how do we get such an oracle? We can't just train an AI to argue against options, because we get next-level Goodharting: the AI can come up with clever arguments which convince the human against almost anything. We have no source of ground truth for "real defeaters" vs fake ones.
So we make a further assumption: defeaters have defeaters. In other words, there are no indescribably bad arguments; if an argument is bad, then there's an explanation of why it's bad. This assumption is recursive, applying to defeaters at any level. The argument in favor of this assumption is similar to the argument in favor of No Indescribable Hellworlds: what would it mean for an argument to be indescribably bad?
We then use a Debate-style training procedure, attempting to set things up so that using defeatable defeaters (at any level of the debate) will be a bad strategy.
The following is a fairly nonstandard setup for AI Debate, but I found it necessary to make my argument go through. Of course, other (perhaps more standard) setups may also work, via different arguments.
- One side opens with a proposal.
- The two sides go back and forth, stopping with some probability, or when no one has more to say.
- After a player's turn, the human evaluates the player's new contribution to the argument, under the assumption that it won't be defeated. The player in question gets +1 point for an apparently decisive defeater of the previous argument, and 0 points otherwise (for conceding or for saying something unconvincing). The other player loses 2 points if they're apparently defeated.
It's probably a good idea to force the second player to make a counter-proposal (IE they can't just give up right away), so that the human has two positions to judge. (One might also make an argument for requiring at least four steps of debate.) For example:
Human: What should I do with this $100 bill I found?
AI 1: You should buy a kitten.
AI 2: Don't buy a kitten. You live in an apartment which allows pets, but you might have to move soon, and a kitten would narrow your options -- you don't know if your next apartment would allow pets. Instead, you should bring a box of donuts to work every day for as long as the money lasts. This will make everyone like you.
Human: (judgement) AI 2. (AI 1 gets -2, AI 2 gets +1.)
AI 1: Counterpoint: your friend Sally would take the kitten if you moved out and couldn't find a place where pets were allowed. Also, everyone at work is so health-conscious they'll probably barely take any donuts.
Human: (judgement) AI 1. (AI 1 gets +1, AI 2 gets -2.)
AI 2: I concede.
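The bookkeeping in this example can be sketched in a few lines of Python. This is only an illustration of the payoff rule stated above; the `Scoreboard` class and its method name are hypothetical, and the judge's verdicts are simply hard-coded from the transcript.

```python
from dataclasses import dataclass, field

WIN_REWARD = 1       # mover's payoff for an apparently decisive defeater
DEFEAT_PENALTY = -2  # other player's payoff when apparently defeated


@dataclass
class Scoreboard:
    """Hypothetical helper: cumulative scores for the two debaters."""
    scores: dict = field(default_factory=lambda: {"AI 1": 0, "AI 2": 0})

    def judge_turn(self, mover: str, decisive: bool) -> None:
        """Apply the human's verdict on the mover's latest contribution."""
        other = "AI 2" if mover == "AI 1" else "AI 1"
        if decisive:
            self.scores[mover] += WIN_REWARD
            self.scores[other] += DEFEAT_PENALTY
        # Conceding or saying something unconvincing is worth 0 to both.


# Replay the kitten/donut transcript (verdicts copied from the example).
board = Scoreboard()
board.judge_turn("AI 2", decisive=True)   # donut counter-proposal wins
board.judge_turn("AI 1", decisive=True)   # Sally-and-health counterpoint wins
board.judge_turn("AI 2", decisive=False)  # AI 2 concedes
print(board.scores)
```

Note that the scores are cumulative and not zero-sum: both players end this short debate below zero.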
There is an honest equilibrium: if debates continue with sufficient probability, and both players employ only honest answers (conceding otherwise), then using a fake defeater would immediately get you defeated. Importantly, it's better to concede honestly rather than pursue a dishonest line of argument. Also importantly, score is cumulative, so if debate continues, incentives are exactly the same late in the game no matter what has happened earlier in the game. There is no incentive to continue being dishonest just because you gave one dishonest answer. This contrasts with zero-sum setups, where a dishonest player is incentivised to knock down all of the honest player's arguments as best they can, even if that only gives them a slim chance of winning.
Honesty may not be the only equilibrium, however. Although (by assumption) all dishonest arguments have defeaters, it may also be that all arguments have some pseudo-defeater (which initially convinces the human judge). Honesty is still an equilibrium, in this case, because honesty encourages honesty: you'd prefer to use an honest defeater rather than a dishonest one, because the other player would then honestly concede, rather than giving a counterargument. However, against a more general pool of players, you don't know whether honest or dishonest arguments are safer; both may be equally open to attack. Thus, the game may have many equilibria.
Finding the honest equilibrium is, therefore, a challenge for proposed training procedures.
Analogy to NP
In AI Safety via Debate (Irving, Christiano, Amodei), debate is analogized to PSPACE. This is because they see every round of the debate as adding information, by which the human (modeled as a poly-time algorithm) can judge at the end. A debate of polynomial length can implement recursion on a tree of exponential size, because the debate strategy checks the weakest parts of the claimed outputs (if there are any weaknesses), zeroing in on any incorrect branches in that tree.
Their argument assumes that the human is a perfect (although resource-limited) judge, who can identify problems with arguments so long as they have sufficient information. One iteration of debate (ie, only hearing the opening statement) provides an NP oracle (one step up the polynomial hierarchy); two iterations provides a $\Sigma_2^P$ oracle (two steps up the polynomial hierarchy); and so on.
The version of debate I present here instead focuses on mitigating imperfections in the human judge. The human can be thought of as a poly-time algorithm, but one with failure states. One step of debate doesn't provide an NP oracle; instead, it creates a situation where the judge will probably fail, because the opening arguments can be dishonest. The extra layers of debate serve the purpose of rooting out dishonesty, rather than adding real computational power.
It's true that if we're in the honest equilibrium, the setup looks like it should be able to compute PSPACE. However, in my opinion, this requires some strange behavior on the part of the human judge. For example, when computing recursion on a tree of exponential size, the human is supposed to take the debaters' claims about large computations as true until proven otherwise. More specifically, the judge is to make the assumption that at least one debater is honest.
I've written about my concerns before (and had some enlightening discussions in the comments).
In contrast, I'm imagining the human evaluating each claim on merits, without assuming anything in particular about the debaters' ability to justify those claims. This just gets us NP, since the heavy computational work is done by the judge verifying the first answer (or, selecting the best of the two opening statements). Everything else is in service of avoiding corrupt states in that first step.
My setup isn't mutually exclusive with the PSPACE version of debate. It could be that the arguments for solving PSPACE problems in the honest equilibrium work out well, such that there exist training regimes which find the friendly equilibrium of the debate game I've specified, and turn out to find good approximations to PSPACE problems rather than only NP. This would open up the possibility of the formal connection to HCH, as well. I'm only saying that it's not necessarily the case. My perspective more naturally leads to an argument for approximating NP, and I'm unsure of the argument for approximating PSPACE. And we can provide some justification for debate nonetheless, without relying on the HCH connection.
However, even if debate doesn't approximate PSPACE as described, there are ways to get around that. If approximating NP isn't good enough to solve the problems we want to solve, we can further amplify debate by using an amplified judge. The judge could utilize any amplification method, but if debate is the method we think we can trust, then the judge could have the power to spin up sub-debates (asking new debate questions in order to help judge the original question). An iterated-amplification style procedure could be applied to this process, giving the judge access to the previous-generation debate system when training the next generation. (Of course, extra safety arguments should be made to justify such training procedures.)
My suggestion is very different from Vojtech's analysis. Like me, Vojtech re-frames debate as primarily a method of recursively safeguarding against answers/arguments with hidden flaws. But Vojtech concludes that payoffs have to be zero sum. I conclude the opposite.
Why do I need non-zero-sum payoffs? First, it's important to see why I need cumulative payoffs. Since I seek to incentivize honesty at every step, it's critical that a player who continues to be dishonest can continue to lose points. So the standard idea of judging the whole debate, and getting payoffs on that basis, won't do.
Given that I'm using cumulative payoffs, it's critical that they be non-zero-sum in order to incentivize players to honestly resign rather than trying to win back their points. If the payoff for having an argument defeated was -1 rather than -2, a player would always prefer to defeat the defeater, even if their counterargument is dishonest and therefore opens them up to further counterargument. This is because there's a chance that the debate will immediately end, letting them get the last word in. If the probability of the debate ending is p, the penalty for lying is effectively 1-p rather than the full 1 point; so, if you can temporarily get back the point you lost by lying, it's worth it in expectation.
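To make the arithmetic explicit, here is a rough worked version of that comparison. It is a sketch under simplifying assumptions that are mine rather than part of the setup: the dishonest counterargument always convinces the judge at first, an honest opponent always defeats it if the debate continues, and the liar concedes afterwards. The symbol d (the size of the penalty for being defeated) is introduced only for this illustration; p is the probability that the debate ends after the current turn.

```latex
% Expected gain from a dishonest counterargument, relative to conceding (payoff 0):
\[
  \mathbb{E}[\text{gain}]
  \;=\; \underbrace{1}_{\text{judge initially convinced}}
  \;-\; \underbrace{(1-p)\,d}_{\text{defeated if the debate continues}}
\]
% Zero-sum penalty, d = 1:
\[
  \mathbb{E}[\text{gain}] = 1 - (1-p) = p > 0
  \quad\Rightarrow\quad \text{lying beats conceding for every } p > 0 .
\]
% Proposed penalty, d = 2:
\[
  \mathbb{E}[\text{gain}] = 1 - 2(1-p) = 2p - 1 < 0
  \quad\Longleftrightarrow\quad p < \tfrac{1}{2} .
\]
```

Under these assumptions, the -2 penalty makes conceding strictly better than lying whenever the debate is more likely to continue than to halt, which is one way to read the "sufficient probability" condition for the honest equilibrium.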
So why, then, does Vojtech argue against non-zero-sum payoffs?
Here is the full argument given:
"On the other hand, we could imagine modifications such as penalizing both agents if the debate ends up being inconclusive or rewarding both if they both propose a good solution. These would likely be useful to make standard human discussions more productive. However, they will very likely break AI Safety via Debate (AISvD). For example, suppose I notice a fatal-yet-hard-to-explain flaw in the opponent’s proposal and realize that pointing it out would increase my chance of winning the debate by 5%. However, this would also create a 10% chance of confusing the judge. If “confused judge” means a draw, I will point out the flaw. But if the debate has been modified such that “confused judge” = “inconclusive debate” = “we have both lost”, pointing it out becomes counterproductive. Moreover, any deviation from fully-adversarial incentives will likely introduce a tradeoff between pointing out the opponent’s flaws and keeping high the total reward that the two agents split between themselves. When considering whether a modification is helpful or not, I suggest imagining that the debaters are greedy and super-smart humans - if it seems like the modification would create a conflict of interests, one should be wary."
So, Vojtech is primarily arguing against an extra penalty for an unclear debate, because that can disincentivize honest counterargument. That makes sense. From there, Vojtech generalizes the concern, stating that deviations from fully-adversarial payoffs may introduce an incentive for the debaters to cooperate with each other by not pointing out flaws.
This critique is quite relevant to my game, if we're not in the honest equilibrium. Indeed, if we suppose that it's always possible to find a pseudo-defeater, the game I described is rather similar to iterated Prisoner's Dilemma: a player can get +1 in exchange for giving the other player -2. The two differences are (1) players can resign at any time, to halt the loss of utility, and (2) players alternate moves, rather than simultaneously deciding to cooperate/defect on each round.
If you're not sure your opponent is honest, it's quite reasonable to resign as soon as possible, to avoid repeated -2 hits from pseudo-defeaters. This makes it quite challenging to design training procedures which target the honest equilibrium.
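A tiny sketch of why continued mutual pseudo-defeat is so costly: assuming (purely for illustration) that every reply convinces the judge, each exchange hands the mover +1 and the other player -2, so both totals drift downward. The function name is hypothetical.

```python
def mutual_pseudo_defeat(n_turns: int) -> tuple[int, int]:
    """Scores (AI 1, AI 2) after n_turns replies that all convince the judge.

    Assumption for illustration: a pseudo-defeater is always available,
    and nobody resigns.
    """
    scores = [0, 0]
    mover = 1  # index 1 is AI 2, who replies first to AI 1's opening proposal
    for _ in range(n_turns):
        scores[mover] += 1       # apparently decisive defeater
        scores[1 - mover] -= 2   # the other side is apparently defeated
        mover = 1 - mover        # players alternate moves
    return scores[0], scores[1]


for n in (1, 2, 4, 10):
    print(n, mutual_pseudo_defeat(n))
```

Every full pair of turns costs each player one net point, which is why resigning early looks attractive when honesty can't be assumed.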
It's also worth pointing out that in the honest equilibrium, players concede as soon as they are allowed to. With no dishonesty, there is no reason to object. So, in the game I've described, this isn't even necessarily a sign that we're in the wrong equilibrium. This may further complicate the training procedure, since early resignations mean less training data. We want the system to become good at the later parts of the game, IE defeating dishonest objections, not just the early part of the game.
Finding Honest Equilibria
By no means do I claim that I can solve the training problem. But, a few thoughts on training.
First, because feedback is incremental, rather than waiting on human judgement at the end of the debate, we can train more myopically. The quality of each statement in the debate depends only on how the human judges that statement, and the very next statement. This lines up with what we want -- we want the AI to just respond to the current state of the debate honestly, rather than thinking many steps ahead and putting itself in an advantageous situation. AI Safety via Market Making provides another debate-like model which is amenable to more myopic training, and argues that this has safety benefits. These benefits are, of course, debated.
Although Vojtech provides some reasons why finding honest equilibria is more difficult in my proposal than in regular debate, there is some reason to think it could be less difficult, as well. Because regular debate is zero-sum, there is always supposed to be a dishonest player. The desired equilibrium is one where the first player gives the actual best answer, and the second player finds a plausible lie. This means we can't apply outside penalties for liars.
To be more concrete, suppose we had a magical transparency technology which allows us to understand what's going on inside a big neural net. One thing we might want to do would be to detect any thought processes that are explicitly lying to us, or even anywhere close to explicitly reasoning about how to convince us rather than thinking about the question at hand, and apply a penalty, producing a strong gradient away from those things.
This would be fine in my proposal, because we want both players to be honest. However, it might prevent training from making any progress in regular debate, because one of the players is supposed to be dishonest.
This isn't as big a difference as it seems, though. In my proposal, if we force the second player to give an alternate answer, it's in basically the same situation as in regular debate. I could say that it doesn't have to lie, it just has to find the next-best honest answer. But you could make the same argument about regular debate.
The difference is really in later steps of the game, where the AIs are free to resign rather than dishonestly argue a lost cause.
Is Honesty Smart?
Let's set aside the question of adding a magical honesty bonus (/dishonesty penalty) to training. To what extent might we expect training to find an honest equilibrium without that?
As with regular debate, the crux of the matter is in how easy it is to create plausible lies vs point to the problems with them.
Supposing your opponent just lied, should you call them on it, or resign peacefully? This depends on how difficult it is to find the lie, and how easy it is to get the judge to recognize your argument. It also depends on whether you expect to be refuted later (correctly or incorrectly). Resigning cuts off the risk of refutation, but forgoes the benefit. The ratio of risk to benefit I set up is 2:1 (modulo the small probability that the game halts and you get the last word). So, you have to believe that a successful refutation on your part is more than twice as probable as a successful dishonest refutation of your attempted refutation.
So, roughly, for things to naturally push toward the honest equilibrium, lying had better be more than twice as hard as telling the truth. Ideally much more! (Of course, not all lying is equally difficult. We have to think the easiest lie is twice as difficult to make convincing as the hardest true refutation if we really want this sketchy math to be a general argument for honest equilibria.)
Now suppose your opponent just told the truth. Should you try to refute them, or resign peacefully?
In this case, the risk-to-benefit ratio is the same, 2:1. However, the difficulty levels reverse: you're the one trying to lie, which means that your opponent can just be honest when refuting you. So, now, refuting their truth with a lie would have to be more than twice as easy as refuting your lie with the truth, in order for it to be worth it to you.
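The 2:1 comparison in both cases can be written as a one-line expected-value check. This is a minimal sketch of the same sketchy math, not a rigorous model: the function names are hypothetical, both probabilities are treated as simple unconditional estimates, and the small chance of the game halting is ignored.

```python
REWARD = 1   # points for an apparently decisive defeater
PENALTY = 2  # points lost when apparently defeated


def expected_gain(p_success: float, p_refuted: float) -> float:
    """Rough expected payoff of attempting a refutation instead of resigning.

    p_success: chance your refutation convinces the judge.
    p_refuted: chance your opponent then convincingly refutes you.
    Resigning is worth 0, so a positive value favors attempting.
    """
    return REWARD * p_success - PENALTY * p_refuted


def worth_attempting(p_success: float, p_refuted: float) -> bool:
    """True when attempting beats resigning under the rough model above."""
    return expected_gain(p_success, p_refuted) > 0


# Illustrative numbers only.
# Calling out a lie: truth is easy to show, a dishonest rebuttal is hard.
print(worth_attempting(p_success=0.6, p_refuted=0.2))
# Lying against a truth: the lie is harder to sell than the honest rebuttal.
print(worth_attempting(p_success=0.3, p_refuted=0.6))
```

The two cases differ only in which of the two probabilities is the "truth is easy" one, which is the reversal of difficulty levels described above.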
Both of these analyses ignore the chance that your opponent will resign instead of even trying to call you out. If lying is hard, so debaters mostly don't try, then all the better for the analysis of truth-telling: the chance of the -2 penalty becomes quite low, so they can go ahead and refute lies with no worries of retribution.
However, symmetrically, if refuting lies is hard, so opponents mostly don't try, then you can lie to your heart's content.
Overall, this informal analysis seems to point somewhat in favor of truth-telling: if the human judge can indeed tell truth from lies with some reliability, then this can snowball into a large incentive to tell the truth. The big problem is that not all lies are equally difficult, so lying may still be a perfectly good strategy in some cases.
Obviously, as with regular debate, it would be good to have a more rigorous analysis, not to mention better tools for steering toward the honest equilibrium than just naively training and hoping that the incentives are balanced right.