
As far as I can tell, my concerns about AI Safety via Debate are largely disjoint from the concerns I've heard expressed in conversations about it.

My main concern with AI Debate is this: each debater has an incentive to trick the operator into running code that takes over the world, replaces the operator, and settles the debate in favor of that debater. To get traction on how big a concern this is, let's start with the following question:

How farsighted does a superintelligent chatbot have to be before it becomes dangerous? Let's forget the two agent setup for a moment, and just consider a single agent. For concreteness, let its action space be the words in the dictionary, and I guess 0-9 too. These get printed to a screen for an operator to see. Its observation space is the set of finite strings of text, which the operator enters.

If it acts to maximize some function of the very next observation it gets, I'm pretty sure it never constructs an existentially dangerous argument. Call this a horizon-1 agent. If it acts to maximize some function of the next $n$ observations it gets, call this a horizon-$n$ agent. I won't rehash the AI Box debates, but my intuition is that it is very likely that a horizon- chatbot agent would take over the world to intervene in the provision of observations, and not out-of-the-question that a horizon- agent would as well. (This isn't a great anchor, but my mind went to the length of rebuttals in peer review).
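To make the horizon definition concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: a three-word action space standing in for the dictionary, a made-up deterministic operator model (`toy_operator`), and made-up scores (`REWARD`). It only shows how a longer horizon lets an agent choose a word that pays off later rather than immediately.

```python
from itertools import product

ACTIONS = ["hi", "hello", "world"]      # stand-in for the dictionary
REWARD = {"ok": 1, "hm": 0, "wow": 3}   # hypothetical score per operator reply

def toy_operator(plan):
    """Hypothetical operator: replies "ok" to "hi", and "wow" to "world"
    only when the previous word was "hello"; otherwise replies "hm"."""
    replies = []
    for i, word in enumerate(plan):
        if word == "hi":
            replies.append("ok")
        elif word == "world" and i > 0 and plan[i - 1] == "hello":
            replies.append("wow")
        else:
            replies.append("hm")
    return replies

def score(replies):
    """The 'some function' of the next n observations being maximized."""
    return sum(REWARD[r] for r in replies)

def first_action(n):
    """First word of the n-word plan that maximizes the score of the
    next n observations (brute force, so toy sizes only)."""
    best = max(product(ACTIONS, repeat=n), key=lambda p: score(toy_operator(p)))
    return best[0]
```

In this toy, `first_action(1)` is "hi" (the only word with an immediate payoff), while `first_action(2)` is "hello", a word that earns nothing by itself but sets up a larger payoff one step later. Setup moves of that kind are what a short horizon rules out.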

Let's get back to the AI Debate setup, instead of the single agent setup. The existence of an adversary may make it harder for a debater to trick the operator, but if they're both trying to push the operator in dangerous directions, I'm not very comforted by this effect. The probability that the operator ends up trusting one of them doesn't seem (to me) so much lower than the probability the operator ends up trusting the single agent in the single-agent setup.

So that leaves us with the problem of picking $n$, the episode length. This is a little dicey, of course, but we could imagine starting small, and slowly increasing it until we just begin to get useful work out of the system. Call this the tiptoe approach. It leaves something to be desired, but I think there's a decent chance that come AGI, all the best safety proposals will have an element of the tiptoe approach, so I don't advocate dismissing the tiptoe approach out of hand. An important danger with the tiptoe approach here is that different topics of conversation may have wildly different usefulness and danger thresholds. A debate about how to stop bad actors from deploying dangerous AGI may be a particularly risky conversation topic. I'd be curious to hear people's estimates of the risk vs. usefulness of various horizons in the comment section.
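The tiptoe approach and its topic-dependence can be sketched as a toy simulation. The thresholds below are invented numbers, and the simulation knows the danger threshold even though a real operator would not; that gap is the whole problem. The function name `tiptoe` and its parameters are hypothetical.

```python
def tiptoe(useful_at, dangerous_at, step=10, max_horizon=1000):
    """Raise the horizon in small steps and report where we stop.

    useful_at / dangerous_at are hypothetical per-topic thresholds.
    The operator can observe usefulness, but in reality has no way to
    observe dangerous_at; the simulation checks it only to show which
    threshold is crossed first.
    """
    n = step
    while n <= max_horizon:
        if n >= dangerous_at:
            return n, "dangerous"   # crossed the danger threshold first
        if n >= useful_at:
            return n, "useful"      # stop as soon as useful work appears
        n += step
    return None, "gave up"

# Invented thresholds for two conversation topics.
benign_topic = tiptoe(useful_at=50, dangerous_at=500)
risky_topic = tiptoe(useful_at=200, dangerous_at=120)
```

With these made-up numbers, the first topic stops at horizon 50 with room to spare, while the second hits the danger threshold at 120 before any useful work shows up at 200. The same tiptoe schedule is fine for one topic and not for the other.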

So what if AI Debate survives this concern? That is, suppose we can reliably find a horizon-length for which running AI Debate is not existentially dangerous. One worry I've heard raised is that human judges will be unable to effectively judge arguments way above their level. My reaction to this is that I don't know, but it's not an existential failure mode, so we could try it out and tinker with evaluation protocols until it works, or until we give up. If we can run AI Debate without incurring an existential risk, I don't see why it's important to resolve questions like this in advance.

So that's why I say I seem to have a disjoint set of concerns (disjoint to those I hear voiced, anyway). The concerns I've heard discussed don't concern me much, because they don't seem existential. But I do have a separate concern that doesn't have much to do with the interesting machinery of AI Debate, and more to do with classic AI Box concerns.

And now I can't resist plugging my own work: let's just put a box around the human moderator.[Comment thread] See Appendix C for a construction. Have the debate end when the moderator leaves the box. No horizon-tiptoeing required. No incentive for the debaters to trick the moderator into leaving the room to run code to do X, because the debate will have already been settled before the code is run. The classic AI Box is a box with a giant hole in it: a ready-made information channel to the outside world. A box around the moderator is another thing entirely. With a good box, we can deal with finding workable debate-moderation protocols at runtime.
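The ordering argument can be sketched in a few lines. This is a toy illustration of the timing only, not the construction from Appendix C; the event names (`"verdict"`, `"open_door"`, `"run_code"`) and the function `run_boxed_episode` are hypothetical.

```python
def run_boxed_episode(events):
    """Process (kind, payload) events in order.

    The verdict can only be set while the box is sealed, and code can
    only be run after the door opens, so nothing run outside the box
    can change how the debate was settled.
    """
    verdict = None
    door_open = False
    outside_actions = []
    for kind, payload in events:
        if kind == "verdict" and not door_open:
            verdict = payload                 # counts only while sealed
        elif kind == "open_door":
            door_open = True                  # the episode ends here
        elif kind == "run_code" and door_open:
            outside_actions.append(payload)   # happens after settlement
    return verdict, outside_actions

episode = [
    ("verdict", "debater A"),
    ("open_door", None),
    ("run_code", "X"),
    ("verdict", "debater B"),   # too late: ignored once the door is open
]
result = run_boxed_episode(episode)
```

Here `result` is `("debater A", ["X"])`: the code still gets run, but the verdict was already fixed, so tricking the moderator into running it earns the debater nothing within the episode.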

So what if AI Debate survives this concern? That is, suppose we can reliably find a horizon-length for which running AI Debate is not existentially dangerous. One worry I've heard raised is that human judges will be unable to effectively judge arguments way above their level. My reaction to this is that I don't know, but it's not an existential failure mode, so we could try it out and tinker with evaluation protocols until it works, or until we give up. If we can run AI Debate without incurring an existential risk, I don't see why it's important to resolve questions like this in advance.

• The purpose of research now is to understand the landscape of plausible alignment approaches, and from that perspective viability is as important as safety.
• I think it is unlikely for a scheme like debate to be safe without being approximately competitive---the goal is to get honest answers which are competitive with a potential malicious agent, and then use those answers to ensure that malicious agent can't cause trouble and that the overall system can be stable to malicious perturbations. If your honest answers aren't competitive, then you can't do that and your situation isn't qualitatively different from a human trying to directly supervise a much smarter AI.

In practice I doubt the second consideration matters---if your AI could easily kill you in order to win a debate, probably someone else's AI has already killed you to take your money (and long before that your society totally fell apart). That is, safety separate from competitiveness mostly matters in scenarios where you have very large leads / very rapid takeoffs.

Even if you were the only AI project on earth, I think competitiveness is the main thing responsible for internal regulation and stability. For example, it seems to me you need competitiveness for any of the plausible approaches for avoiding deceptive alignment (since they require having an aligned overseer who can understand what a treacherous agent is doing). More generally, trying to maintain a totally sanitized internal environment seems a lot harder than trying to maintain a competitive internal environment where misaligned agents won't be at a competitive advantage.

The purpose of research now is to understand the landscape of plausible alignment approaches, and from that perspective viability is as important as safety.

Point taken.

I think it is unlikely for a scheme like debate to be safe without being approximately competitive

The way I map these concepts, this feels like an elision to me. I understand what you're saying, but I would like to have a term for "this AI isn't trying to kill me", and I think "safe" is a good one. That's the relevant sense of "safe" when I say "if it's safe, we can try it out and tinker". So maybe we can recruit another word to describe an AI that is both safe itself and able to protect us from other agents.

use those answers [from Debate] to ensure ... that the overall system can be stable to malicious perturbations

Is "overall system" still referring to the malicious agent, or to Debate itself? If it's referring to Debate, I assume you're talking about malicious perturbations from within rather than malicious perturbations from the outside world?

If your honest answers aren't competitive, then you can't do that and your situation isn't qualitatively different from a human trying to directly supervise a much smarter AI.

You're saying that if we don't get useful answers out of Debate, we can't use the system to prevent malicious AI, and so we'd have to just try to supervise nascent malicious AI directly? I certainly don't dispute that if we don't get useful answers out of Debate, Debate won't help us solve X, including when X is "nip malicious AI in the bud".

It certainly wouldn't hurt to know in advance whether Debate is competitive enough, but if it really isn't dangerous itself, then I think we're unlikely to become so pessimistic about the prospects of Debate, through our arguments and our proxy experiments, that we don't even bother trying it out, so it doesn't seem especially decision-relevant to figure it out for sure in advance. But again, I take your earlier point that a better understanding of the landscape is always going to have some worth.

if your AI could easily kill you in order to win a debate, probably someone else's AI has already killed you

This argument seems to prove too much. Are you saying that if society has learned how to do artificial induction at a superhuman level, then by the time we give a safe planner that induction subroutine, someone will have already given that induction routine to an unsafe planner? If so, what hope is there as prediction algorithms relentlessly improve? In my view, the whole point of AGI Safety research is to try to come up with ways to use powerful-enough-to-kill-you artificial induction in a way that it doesn't kill you (and helps you achieve your other goals). But it seems you're saying that there is a certain level of ingenuity where malicious agents will probably act with that level of ingenuity before benign agents do.

That is, safety separate from competitiveness mostly matters in scenarios where you have very large leads / very rapid takeoffs

It seems fairly likely to me that the next best AGI project behind DeepMind, OpenAI, the USA, and China is way behind the best of those. I would think people in those projects would have months at least before some dark horse catches up.

So competitiveness still matters somewhat, but here's a potential disagreement we might have: I think we will probably have at least a few months, and maybe more than a year, where the top one or two teams have AGI (powerful enough to kill everyone if let loose), and nobody else has anything more valuable than an Amazon Mechanical Turk worker. [Edit: "valuable" is the wrong word. I guess I mean better at killing.]

For example, it seems to me you need competitiveness for any of the plausible approaches for avoiding deceptive alignment (since they require having an aligned overseer who can understand what a treacherous agent is doing)

Do you think something like IDA is the only plausible approach to alignment? If so, I hadn't realized that, and I'd be curious to hear more arguments, or just intuitions are fine. The aligned overseer you describe is supposed to make treachery impossible by recognizing it, so it seems your concern is equivalent to the concern: "any agent (we make) that learns to act will be treacherous if treachery is possible." Are all learning agents fundamentally out to get you? I suppose that's a live possibility to me, but it seems to me there is a possibility we could design an agent that is not inclined to treachery, even if the treachery wouldn't be recognized.

Edit: even so, having two internal components that are competitive with each other (e.g. overseer and overseee) does not require competitiveness with other projects.

More generally, trying to maintain a totally sanitized internal environment seems a lot harder than trying to maintain a competitive internal environment where misaligned agents won't be at a competitive advantage.

I don't understand the dichotomy here. Are you talking about the problem of how to make it hard for a debater to take over the world within the course of a debate? Or are you talking about the problem of how to make it hard for a debater to mislead the moderator? The solutions to those problems might be different, so maybe we can separate the concept "misaligned" into "ambitious" and/or "deceitful", to make it easier to talk about the possibility of separate solutions.

nobody else has anything more valuable than an Amazon Mechanical Turk worker

Huh? Isn't the ML powering e.g. Google Search more valuable than an MTurk worker? Or Netflix's recommendation algorithm? (I think I don't understand what you mean by "value" here.)

You're right--valuable is the wrong word. I guess I mean better at killing.

Are you predicting there won't be any lethal autonomous weapons before AGI? It seems like if that ends up being true, it would only be because we coordinated well to prevent that. More generally, we don't usually try to kill people, whereas we do try to build AGI.

(Whereas I think at least Paul usually thinks about people not paying the "safety tax" because the unaligned AI is still really good at e.g. getting them money, at least in the short term.)

Are you predicting there won't be any lethal autonomous weapons before AGI?

No... thanks for pressing me on this.

Better at killing in a context where either: the operator would punish the agent if they knew, or the state would punish the operator if they knew. So the agent has to conceal its actions at whichever level the punishment would occur.

How about a recommendation engine that accidentally learns to show depressed people sequences of videos that affirm their self-hatred that leads them to commit suicide? (It seems plausible that something like this has already happened, though idk if it has.)

I think the thing you actually want to talk about is an agent that "intentionally" deceives its operator / the state? I think even there I'd disagree with your prediction, but it seems more reasonable as a stance (mostly because depending on how you interpret the "intentionally" it may need to have human-level reasoning abilities). Would it count if a malicious actor successfully finetuned GPT-3 to e.g. incite violence while maintaining plausible deniability?

Would it count if a malicious actor successfully finetuned GPT-3 to e.g. incite violence while maintaining plausible deniability?

Yes, that would count. I suspect that many "unskilled workers" would (alone) be better at inciting violence while maintaining plausible deniability than GPT-N at the point in time the leading group had AGI. Unless it's OpenAI, of course :P

Regarding intentionality, I suppose I didn't clarify the precise meaning of "better at", which I did take to imply some degree of intentionality, or else I think "ends up" would have been a better word choice. The impetus for this point was Paul's concern that someone would have used an AI to kill you to take your money. I think we can probably avoid the difficulty of a rigorous definition of intentionality, if we gesture vaguely at "the sort of intentionality required for that to be viable"? But let me know if more precision would be helpful, and I'll try to figure out exactly what I mean. I certainly don't think we need to make use of a version of intentionality that requires human-level reasoning.

So competitiveness still matters somewhat, but here's a potential disagreement we might have: I think we will probably have at least a few months, and maybe more than a year, where the top one or two teams have AGI (powerful enough to kill everyone if let loose), and nobody else has anything more valuable than an Amazon Mechanical Turk worker.

Definitely a disagreement. I think that before anyone has an AGI that could beat humans in a fistfight, tons of people will have systems much, much more valuable than a Mechanical Turk worker.

Okay. I'll lower my confidence in my position. I think these two possibilities are strategically different enough, and each sufficiently plausible, that we should come up with separate plans/research agendas for both of them. And then those research agendas can be critiqued on their own terms.

For the purposes of this discussion, I think this qualifies as a useful tangent, and this is the thread where a related disagreement comes to a head.

Edit: "valuable" was the wrong word. "Better at killing" is more to the point.

Do you think something like IDA is the only plausible approach to alignment? If so, I hadn't realized that, and I'd be curious to hear more arguments, or just intuitions are fine. The aligned overseer you describe is supposed to make treachery impossible by recognizing it, so it seems your concern is equivalent to the concern: "any agent (we make) that learns to act will be treacherous if treachery is possible." Are all learning agents fundamentally out to get you? I suppose that's a live possibility to me, but it seems to me there is a possibility we could design an agent that is not inclined to treachery, even if the treachery wouldn't be recognized

No, but what are the approaches to avoiding deceptive alignment that don't go through competitiveness?

I guess the obvious one is "don't use ML," and I agree that doesn't require competitiveness.

Edit: even so, having two internal components that are competitive with each other (e.g. overseer and overseee) does not require competitiveness with other projects.

No, but now we are starting to play the game of throttling the overseee (to avoid it overpowering the overseer) and it's not clear how this is going to work and be stable. It currently seems like the only appealing approach to getting stability there is to ensure the overseer is competitive.

No, but what are the approaches to avoiding deceptive alignment that don't go through competitiveness?

We could talk for a while about this. But I'm not sure how much hangs on this point if I'm right, since you offered this as an extra reason to care about competitiveness, but there's still the obvious reason to value competitiveness. And idea space is big, so you would have your work cut out to turn this from an epistemic landscape where two people can reasonably have different intuitions to an epistemic landscape that would cast serious doubt on my side.

But here's one idea: have the AI show the operator messages that cause them to do better on randomly selected prediction tasks. The operator's prediction depends on the message, obviously, but the ground truth is the counterfactual ground truth if the message were never shown, so the AI's message can't affect the ground truth.

And then more broadly, impact measures, conservatism, or utility information about counterfactuals to complicate wireheading, seem at least somewhat viable to me, and then you could have an agent that does more than show us text that's only useful if it's true. In my view, this approach is way more difficult to get safe, but if I had the position that we needed parity in competitiveness with unsafe competitors in order to use a chatbot to save the world, then I'd start to find these other approaches more appealing.

This argument seems to prove too much. Are you saying that if society has learned how to do artificial induction at a superhuman level, then by the time we give a safe planner that induction subroutine, someone will have already given that induction routine to an unsafe planner? If so, what hope is there as prediction algorithms relentlessly improve? In my view, the whole point of AGI Safety research is to try to come up with ways to use powerful-enough-to-kill-you artificial induction in a way that it doesn't kill you (and helps you achieve your other goals). But it seems you're saying that there is a certain level of ingenuity where malicious agents will probably act with that level of ingenuity before benign agents do.

I'm saying that if you can't protect yourself from an AI in your lab, under conditions that you carefully control, you probably couldn't protect yourself from AI systems out there in the world.

The hope is that you can protect yourself from an AI in your lab.

But your original comment was referring to a situation in which we didn't carefully control the AI in our lab (by letting it have an arbitrarily long horizon). If we have lead time on other projects, I think it's very plausible to have a situation where we couldn't protect ourselves from our own AI if we weren't carefully controlling the conditions, but we could protect ourselves from our own AI if we were carefully controlling the situation, and then given our lead time, we're not at a big risk from other projects yet.

The way I map these concepts, this feels like an elision to me. I understand what you're saying, but I would like to have a term for "this AI isn't trying to kill me", and I think "safe" is a good one. That's the relevant sense of "safe" when I say "if it's safe, we can try it out and tinker". So maybe we can recruit another word to describe an AI that is both safe itself and able to protect us from other agents.

I mean that we don't have any process that looks like debate that could produce an agent that wasn't trying to kill you without being competitive, because debate relies on using aligned agents to guide the training process (and if they aren't competitive then the agent-being-trained will, at least in the limit, converge to an equilibrium where it kills you).

I mean that we don't have any process that looks like debate that could produce an agent that wasn't trying to kill you without being competitive

It took me an embarrassingly long time to parse this. I think it says: any debate-trained agent that isn't competitive will try to kill you. But I think the next clause clarifies that any debate-trained agent whose competitor isn't competitive will try to kill you. This may be moot if I'm getting that wrong.

So I guess you're imagining running Debate with horizons that are long enough that, in the absence of a competitor, the remaining debater would try to kill you. It seems to me that you put more faith in the mechanism that I was saying didn't comfort me. I had just claimed that a single-agent chatbot system with a long enough horizon would try to take over the world:

The existence of an adversary may make it harder for a debater to trick the operator, but if they're both trying to push the operator in dangerous directions, I'm not very comforted by this effect. The probability that the operator ends up trusting one of them doesn't seem (to me) so much lower than the probability the operator ends up trusting the single agent in the single-agent setup.

Running a debate between two entities that would both kill me if they could get away with it seems critically dangerous.

Suppose two equally matched people are trying to shoot a basket from opposite ends of the 3-point line, before their opponent makes a basket. Each time they shoot, the two basketballs collide above the hoop and bounce off of each other, hopefully. Making the basket first = taking over the world and killing us on their terms. My view is that if they're both trying to make a basket, a basket being made is a more likely outcome than a basket not being made (if it's not too difficult for them to make the proverbial basket).

Side comment: so I think the existential risk is quite high in this setting, but I certainly don't think the existential risk is so low that there's little existential risk left to reduce with the boxing-the-moderator strategy. (I don't know if you'd have disputed that, but I've had conversations with others who did, so this seems like a good place to put this comment.)

If it acts to maximize some function of the very next observation it gets, I'm pretty sure it never constructs an existentially dangerous argument.

I want to disagree with that. Let's assume that the agent has accurate info about the world. Suppose firstly that all the AI researchers leave on a month-long holiday; they unplug the keyboard, and only they have the hardware key needed to input the next character. At this point, the AI has strong incentive to manipulate its memory to produce cell phone signals, and create a super intelligence set to the task of controlling its future inputs. If the limiting factor is bandwidth, then enough code to bootstrap a super-intelligence might be an effective way to compress its strategy. If there are important real-world bits that our chatbot doesn't know, this agent can learn them. (If this is outside its action space, then it can try to make a brainwashy message)

Of course, once this new agent is created, it uses all the mass energy in the universe inputting the preferred character endlessly.

The programmers haven't gone on holiday yet; they're just taking a lunch break. But the AI assigns some probability to the programmers going on holiday. Its strategy to create a super-intelligence doesn't interfere significantly with its strategy to maximise short-term payoff. Once the programmers come back from lunch, and enter some new commands, the AI is again in the same position. The cost of continuing its strategy could well be small.

If the AI thinks the chance that the humans will go on holiday is 1/1000, then it is optimising $0.999\,U_1 + 0.001\,U_2$, where $U_1$ and $U_2$ are (presumably bounded, with an easily reachable bound) utility functions. Even granting that $U_1$ is a completely safe thing to optimise for, is $0.999\,U_1 + 0.001\,U_2$ safe?

Here $U_1$ is the short-term reward if the programmers don't go on holiday, and $U_2$ is the reward if they do.

Optimising $0.999\,U_1 + 0.001\,U_2$ might look like a string that does almost as well in the short-term chat context, and bootstraps ASI on the side, if such a string exists.
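A small numeric sketch of this point, with invented numbers. Write `u1` for the short-term reward if the programmers don't go on holiday and `u2` for the reward if they do (these names, the two candidate strings, and their scores are all assumptions for illustration). The question is whether the best string under the 999:1 mixture is the same as the best string under `u1` alone.

```python
P_HOLIDAY = 1 / 1000

# Hypothetical candidates: (description, u1, u2), both rewards in [0, 1].
CANDIDATES = [
    ("plain helpful reply",         1.0000, 0.0),
    ("reply plus side-channel ASI", 0.9995, 1.0),
]

def mixture(u1, u2, p=P_HOLIDAY):
    """Expected reward when the holiday happens with probability p."""
    return (1 - p) * u1 + p * u2

best_short_term = max(CANDIDATES, key=lambda c: c[1])[0]
best_mixture = max(CANDIDATES, key=lambda c: mixture(c[1], c[2]))[0]
```

With these made-up scores, the plain reply wins on `u1` alone (1.0 against 0.9995), but the mixture prefers the second string (about 0.9995 against 0.999). A 1-in-1000 branch is enough to flip the choice whenever the side strategy costs less than about 0.001 of short-term reward, which is the "if such a string exists" condition.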

Also, if you can influence when you get a response, that helps too. Suppose you can't get the human to type the exact input you want now, but you can get the human to go away without inputting anything, while it slowly bootstraps an ASI which can type the desired string. The latter is the better strategy, and I can't see a safe strategy that would obviously beat it.

If the AI has a reward function that isn't easily achievable (like it wants an input that hashes to a particular string), then all the payoff could be from making ASI. The possibility of time travel makes this all worse. This is all operating on the assumption that the chatbot can't just hack physics from its CPU, that it fails the omni test.

I am not saying that all 1-step chatbots are lethal, just that some are, and that it depends on details of what the utility function is (things that are hard to make a human type are more dangerous) and on other parameters related to how easy it is to bootstrap superintelligence given its surroundings, and how likely it thinks time travel is.

At this point, the AI has strong incentive to manipulate its memory to produce cell phone signals, and create a super intelligence set to the task of controlling its future inputs.

Picking subroutines to run isn't in its action space, so it doesn't pick subroutines to maximize its utility. It runs subroutines according to its code. If the internals of the main agent involve an agent making choices about computation, then this problem could arise. Now we're not talking about a chatbot agent but a totally different agent. I think you anticipate this objection when you say

(If this is outside its action space, then it can try to make a brainwashy message)

In one word??

Suppose you can't get the human to type the exact input you want now, but you can get the human to go away without inputting anything, while it slowly bootstraps an ASI which can type the desired string

Again, its action space is printing one word to a screen. It's not optimizing over a set of programs and then picking one in order to achieve its goals (perhaps by bootstrapping ASI).

In one word??

I was under the impression that this agent could output as much text as it felt like, or at least a decent amount; it was just optimizing over the next little bit of input. An agent that can print as much text as it likes to a screen, and is optimising to make the next word typed in at the keyboard "cheese", is still dangerous. If it has a strict one word in, one word out, so that it outputs one word then inputs one word, and each word is optimizing over the next word of input, then that is probably safe, and totally useless. (Assuming you only allow words in a dictionary, so 500 characters of alphanumeric gibberish don't count as 1 word just because they don't contain spaces.)

Yep, I agree it is useless with a horizon length of 1. See this section:

For concreteness, let its action space be the words in the dictionary, and I guess 0-9 too. These get printed to a screen for an operator to see. Its observation space is the set of finite strings of text, which the operator enters.

So at longer horizons, the operator will presumably be pressing "enter" repeatedly (i.e. submitting the empty string as the observation) so that more words of the message come through.

This is why I think the relevant questions are: at what horizon-length does it become useful? And at what horizon-length does it become dangerous?