The limits of AI safety via debate

Marius Hobbhahn

I recently participated in the AGI safety fundamentals program and this is my cornerstone project. During our readings of AI safety via debate (blog, paper) we had an interesting discussion on its limits and conditions under which it would fail.

I spent only around 5 hours writing this post and it should thus mostly be seen as food for thought rather than rigorous research.

Lastly, I want to point out that I think AI safety via debate is a promising approach overall. I just think it has some limitations that need to be addressed when putting it into practice. I intend my criticism to be constructive and hope it is helpful for people working on debate right now or in the future.

Update: Rohin Shah pointed out some flaws with my reasoning in the comments (see below). Therefore, I reworked the post to include the criticisms and flag them to make sure readers can distinguish the original from the update.

Update2: I now understand all of Rohin’s criticisms and have updated the text once more. He mostly persuaded me that my original criticisms were wrong or much weaker than I thought. I chose to keep the original claims for transparency. I’d like to thank him for taking the time for this discussion. It drastically improved my understanding of AI safety via debate and I now think it’s even better than I already thought.

The setting

In AI safety via debate, there are two debaters who argue for the truth of different statements to convince a human adjudicator/verifier. In OpenAI’s example, the debaters use snippets of an image to argue that it either contains a dog or a cat. The dog-debater chooses snippets that show why the image contains a dog and the cat-debater responds with snippets that argue for a cat. Both debaters can see what the other debater has argued previously and respond to that, e.g. when the dog-debater shows something that indicates a dog, the cat-debater can refute this claim by arguing that this snippet actually indicates a cat. At some point, the human verifier chooses whether the image shows a cat or a dog and the respective debater wins.

Update: I think there were two things I didn’t understand or emphasize enough in the first write-up of this post. Firstly, the tool of debate can be used in many circumstances. However, when we use debate for AI safety, we assume that the final judgment will be made by someone who really cares about alignment, e.g. an alignment researcher. Secondly, I want to emphasize that debate is a tool that can break down exponentially complex world states under optimal conditions and find a linear path through them (see picture from OpenAI post below). This alone makes it very powerful.

A simplified but helpful analogy

I think AI safety via debate works well in cases where the verifier and the debaters broadly have a similar understanding of the world and level of intelligence. When this is not the case, failures get more frequent. Thus, my intuitive example for thinking about failure modes is “Let a person from 1800 evaluate the truth of the statement ‘Today I played Fortnite.’”. In this setting, you travel back through time and have to convince a random person from 1800 that you played Fortnite before traveling. Your opponent is someone who has a similar level of knowledge and intelligence as you.

Obviously, this setting is imperfect, due to all the problems with time travel but, in my opinion, it still intuitively shows some of the problems of AI safety via debate. The worlds of someone who played Fortnite in 2022 and someone who lived in 1800 are just so different that it is hard to even begin persuading them. Furthermore, so many of the concepts necessary to understand Fortnite, e.g. computers, the internet, etc. are nearly impossible to verify for a person from 1800 even if they wanted to believe you.

Update: I misunderstood something about the verification procedure. It isn’t necessary for the person from 1800 to fully understand electricity to be able to verify the question (see more below).

Limitations

In the following, I list different implicit and explicit assumptions of debate that can lead to problems if they aren’t met.

Assumption 1: concept must break down into parts that are verifiable in a reasonable timeframe

Original claim: In cases where the verifier is not able to verify a concept from the beginning, it needs to be broken down into smaller subcomponents that are all verifiable. However, this might not always be possible, especially when time is limited.

In the “1800 Fortnite” example, the debater would have to convince the verifier of the existence of electricity, TVs or computers, video games, the internet, etc.

A second example is a question that probably requires very elaborate and time-intensive experiments to yield high-confidence answers such as in a “nature vs nurture” debate. The debater might have to run multi-generational studies to provide low-uncertainty evidence for their side.

Update: Rohin points out that the verifier doesn’t need to fully understand all concepts, they just need to find them sufficiently plausible. In the case of the 1800 Fortnite example, it would be sufficient to believe the claim about electricity more than the counterclaim. There was a second disagreement about complexity. I argued that some debates actually break down into multiple necessary conditions, e.g. if you want to argue that you played Fortnite you have to show that it is possible to do that and then that it is plausible. The pro-Fortnite debater has to show both claims while the anti-Fortnite debater has to defeat only one. Rohin argued that this is not the case, because every debate is ultimately only about the plausibility of the original statement independent of the number of subcomponents it logically breaks down to (or at least that’s how I understood him).

Update 2: I misunderstood Rohin’s response. He actually argues that, in cases where a claim X breaks down into claims X1 and X2, the debater has to choose which one is more effective to attack, i.e. it is not able to backtrack later on (maybe it still can by making the tree larger - not sure). Thus, my original claim about complexity is not a problem since the debate will always be a linear path through a potentially exponentially large tree.
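To make the "linear path through an exponentially large tree" idea concrete, here is a minimal toy sketch. It is not taken from the OpenAI paper: the `Claim` class, `build_tree`, and `run_debate` are hypothetical names, and the model assumes a binary claim tree in which a claim holds exactly when all of its subclaims hold, debaters who know the ground truth, and a judge who can check a leaf claim directly.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Claim:
    text: str
    holds: bool  # ground truth, assumed visible to debaters but not to the judge
    subclaims: List["Claim"] = field(default_factory=list)


def build_tree(depth: int, flawed_leaf: Optional[int] = None, _offset: int = 0) -> Claim:
    """Build a binary claim tree with 2**depth leaves.

    Every leaf holds except the one whose index equals `flawed_leaf`.
    An internal claim holds only if all of its subclaims hold.
    """
    if depth == 0:
        return Claim(f"leaf {_offset}", holds=(_offset != flawed_leaf))
    half = 2 ** (depth - 1)
    kids = [
        build_tree(depth - 1, flawed_leaf, _offset),
        build_tree(depth - 1, flawed_leaf, _offset + half),
    ]
    return Claim(f"claim over leaves from {_offset}", all(k.holds for k in kids), kids)


def count_claims(node: Claim) -> int:
    """Total number of claims in the tree (what a full check would cost)."""
    return 1 + sum(count_claims(child) for child in node.subclaims)


def run_debate(root: Claim) -> Tuple[bool, int]:
    """Follow one path: at each step the opponent commits to a single subclaim.

    The opponent attacks a false subclaim if one exists, otherwise the first one.
    Returns the judge's verdict on the final leaf and the number of claims seen.
    """
    node, seen = root, 1
    while node.subclaims:
        node = next((c for c in node.subclaims if not c.holds), node.subclaims[0])
        seen += 1
    return node.holds, seen


if __name__ == "__main__":
    root = build_tree(10, flawed_leaf=777)
    print(count_claims(root), run_debate(root))
```

In this toy setup the tree of depth 10 contains 2047 claims, but the judge only ever looks at the 11 claims along the contested path. The opponent cannot backtrack, which is what keeps the path linear.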

Assumption 2: human verifiers are capable of understanding the concept in principle

Original claim: I’m not very sure about this but I could imagine that there are concepts that are too hard to understand in principle. Every attempt to break them down doesn’t solve the fundamental problem of the verifiers' limited cognitive abilities.

For example, I’m not sure if there is someone who “truly understood” string theory, or high-dimensional probability distributions sufficiently to make a high-confidence judgment in a debate. It might just be possible that these are beyond our abilities.

A second example would be explaining the “1800 Fortnite” scenario to a person from 1800 of far-below-average intelligence. Even if the debater did the best job possible, concepts like electricity or the internet might be beyond the capabilities of that specific verifier.

This leads to a potentially sad conclusion for a future with AI systems. I could very well imagine that smart humans today could not understand a concept that is totally obvious to an AGI.

Update: Rohin argues that as long as the verifier passes a universality threshold they should, in principle, be able to understand all possible concepts through debate. As long as there exists an exponentially large tree explaining the concept, debate should find a linear path through it.

I’m convinced by this claim and I don’t think my original criticism makes that much sense anymore. I still believe that debate could run into practical problems once problems take many generations to explain but I don’t believe in the principled point anymore.

Assumption 3: human verifiers are well-intentioned

Original claim: I think there are many implementations of AI safety via debate that lead to simple practical problems due to basic human flaws.

For example, the human verifier might just want to get on with it and thus doesn’t really make sure they actually believe the concept. I think this mostly implies that there will be some stochasticity to the answer which might be fixable by careful experimental design.

Secondly, especially in group settings, verifiers might also try to hide their own misunderstandings to not seem unknowledgeable in front of their peers. My intuition is that this biases group settings towards answers that “are allowed to be said”. During 2020, for example, a debater that argues that Covid might come from a lab leak would have lost by default just because it was seen as unspeakable in many social groups.

Update: I think the setting I had in mind when writing this post was more about regular scientists trying to discover new truths by using debate and less about the AI safety setting. In the use-case of AI safety, Rohin convinced me that most of these biases are much less relevant because the actors probably have clear incentives and good intentions. Therefore, this point is mostly irrelevant. I could still imagine scenarios in which human biases perturb the results but it’s much less likely.

Assumption 4: human verifiers are impartial

Original claim: Humans have a lot of motivations, many of which are not “just find the truth over everything else”. They like personal profit, seek status, follow ideologies and tribes, and have many biases, such as a desire to confirm their current beliefs.

Therefore, there is a risk that human verifiers might vote in favor of bad motivations and preferences rather than find the true answer. Especially when the truth of the result is not personally meaningful to them but one outcome is clearly beneficial to their wallet or party, they might just go with that. Since many of these motivations are hidden or subconscious, it’s not easy to create a setting in which they don’t influence the results.

Update: Similar update to point 3. The point is mostly irrelevant now.

Assumption 5: It’s easier to explain the true concept than a flawed simple one

Original claim: There are some concepts that take a while to understand but have simplified, false alternative explanations. Examples of these false alternatives include rent control as a solution to the housing crisis, tariffs benefitting your economy, immigrants stealing jobs or autocrats acting the way they do simply because they are evil. In all of these cases, there is a simple intuition that can be explained in one sentence and a complex alternative that requires some understanding of economics, history, incentives, etc. to see why the simple intuition is false.

My gut reaction to this problem is something like “just pick smart verifiers” but there is a deeper problem. I think it is very likely that even the smartest humans today hold lots of beliefs that seem true and intuitive to them but will turn out to be wrong when investigated in more detail. It might be possible that the background knowledge to understand the true alternative takes too long to explain compared to providing a simple yet false alternative. Therefore, since the incentive of the debater is to win and not to find the truth, they will likely choose the simple yet false scenario.

Update: Rohin argues that it’s much harder to explain a simplified flawed argument than to make the long true one in the debate setting. Whenever debater A makes a bad argument, debater B can point out the flaws and say “it’s not that simple because X”. I find this plausible in the AI safety via debate setting, where the human verifier is well-intentioned and has clear incentives, and less plausible in a setting where the human verifier is a member of the general public. I think for the most part, Rohin has convinced me but it seems like I’m more skeptical about whether debate would always converge to the most rational option. I guess we have slightly different conceptions of how rational/prone to biases even the smartest most well-intentioned human verifiers are.

Assumption 6: AI is not deceptive

Original claim: The incentive of the debater is not to seek the truth but to convince the verifier that they have found the truth. While this sounds like a small discrepancy it can create really large problems. In short, the debater does everything to convince the verifier even if it knows that it’s providing incorrect information. If the debater has more knowledge about the world than the verifier and a very accurate model of what the verifier thinks and likes, it can use its powers to deceive the human.

For an intuitive example, think of a hypothetical theoretical physics professor who knows you very well. They broadly know how you think and they know much more about physics than you. So if they wanted to convince you that a specific fact in theoretical physics is true, they could probably do so independent of whether it’s actually true.

I think this is the biggest problem for AI safety via debate since it is a) so hard to distinguish between deception and honest mistakes and b) a problem that will almost surely happen in scenarios where the AI is very powerful.

Update: This is the only point where Rohin hasn’t convinced me yet. He argues that the debaters have no incentive to be deceptive since the other debater is equally capable and has an incentive to point out this deception. I think this is true under one condition: as long as the reward for pointing out deception is bigger than the reward for alternative strategies, e.g. being deceptive yourself, you are incentivized to be truthful. Let’s say, for example, our conception of physics was fundamentally flawed and both debaters know this. To win the debate, one (truthful) debater would have to argue that our current concept of physics is flawed and establish the alternative theory while the other one (deceptive) could argue within our current framework of physics and sound much more plausible to the humans. The truthful debater is only rewarded when the human verifier waits long enough to understand the alternative physics explanation before giving the win to the deceptive debater. In case the human verifier stops early, deception is rewarded, right? What am I missing?

In general, I feel like the question of whether the debater is truthful or not only depends on whether they would be rewarded to do so. However, I (currently) don’t see strong reasons for the debater to be always truthful. To me, the bottleneck seems to be which kind of behavior humans intentionally or unintentionally reward during training and I can imagine enough scenarios in which we accidentally reward dishonest or deceptive behavior.
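The early-stopping worry from the flawed-physics example can be written down as a small toy calculation. This is only a sketch under strong assumptions that are not part of the original discussion: the judge stops independently after each round with a fixed probability, the truthful debater needs a fixed number of rounds to establish its case, and the deceptive debater wins whenever the judge stops before that. The function names are hypothetical.

```python
def truthful_win_probability(rounds_needed: int, stop_probability: float) -> float:
    """Probability that the judge is still listening after `rounds_needed` rounds.

    Assumes the judge stops after each round independently with probability
    `stop_probability`, and that the truthful debater wins only if the judge
    lasts through all the rounds it needs.
    """
    if rounds_needed < 0:
        raise ValueError("rounds_needed must be non-negative")
    if not 0.0 <= stop_probability <= 1.0:
        raise ValueError("stop_probability must be between 0 and 1")
    return (1.0 - stop_probability) ** rounds_needed


def honesty_is_rewarded(rounds_needed: int, stop_probability: float) -> bool:
    """In a zero-sum win/lose debate, honesty pays only if it wins more than half the time."""
    return truthful_win_probability(rounds_needed, stop_probability) > 0.5


if __name__ == "__main__":
    for rounds in (1, 3, 10):
        p = truthful_win_probability(rounds, stop_probability=0.1)
        print(rounds, round(p, 3), honesty_is_rewarded(rounds, 0.1))
```

Under these assumptions, an impatient judge or a long explanation pushes the truthful debater's win probability below one half, which is the case in which deception would be the rewarded strategy.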

Update2: We were able to agree on the bottleneck. We both believe that the claim "it is harder to lie than to refute a lie" is the question that determines whether debate works or not. Rohin was able to convince me that it is easier to refute a lie than I originally thought and I, therefore, believe more in the merits of AI safety via debate. The main intuition that changed is that the refuter mostly has to continue poking holes rather than presenting an alternative in one step. In the “flawed physics” setting described above, for example, the opponent doesn’t have to explain the alternative physics setting in the first step. They could just continue to point out flaws and inconsistencies with the current setting and then slowly introduce the new system of physics and how it would solve these inconsistencies.

Conclusions & future research

My main conclusion is that AI safety via debate is a promising tool but some of its core problems still need addressing before it will be really good. There are many different research directions that one could take but I will highlight just two:

Eliciting Latent Knowledge (ELK) - style research: Since the biggest challenge of AI safety via debate is deception, in my opinion, the natural answer is to understand when the AI deceives us. ELK is, in my opinion, the most promising approach to combat deception we have found so far.
Social science research: If we ever reach a point where we have debates between AI systems to support decision-making, we also have to understand the problems that come with the human side of the setup. Under which conditions do humans choose personal gain over seeking the truth? Do the results from such games differ between group settings and individuals alone, and in which ways? Can humans be convinced of true beliefs if they previously strongly believed something that was objectively false?

Update: I still think both of these are very valuable research directions but with my new understanding of debate as a tool specifically designed for AI safety, I think technical research like ELK is more fitting than the social science bit.

Update2: Rohin mostly convinced me that my remaining criticisms don’t hold or are less strong than I thought. I now believe that the only real problem with debate (in a setting with well-intentioned verifiers) is when the claim “it is harder to lie than to refute a lie” doesn’t hold. However, I updated toward thinking that it is often much easier to refute a lie than I anticipated because refuting the lie only entails poking a sufficiently large hole in the claim and doesn’t necessitate presenting an alternative solution.

If you want to be informed about new posts, you can follow me on Twitter.
