Twitter thread by Eliezer Yudkowsky, with the bounty in bold: 

So I don't want to sound alarms prematurely, here, but we could possibly be looking at the first case of an AI pretending to be stupider than it is. In this example, GPT-3 apparently fails to learn/understand how to detect balanced sets of parentheses. 

Now, it's possible that GPT-3 "legitimately" did not understand this concept, even though GPT-3 can, in other contexts, seemingly write code or multiply 5-digit numbers. But it's also possible that GPT-3, playing the role of John, predicted that *John* wouldn't learn it.

It's tempting to anthropomorphize GPT-3 as trying its hardest to make John smart. That's what we want GPT-3 to do, right? But what GPT-3 actually does is predict text continuations. If *you* saw John say all that - would you *predict* the next lines would show John succeeding?

So it could be that GPT-3 straight-up can't recognize balanced parentheses. Or it could be that GPT-3 could recognize them given a different prompt. Or it could be that the cognition inside GPT-3 does see the pattern, but play-acts the part of 'John' getting it wrong.

The scariest feature of this whole incident? We have no idea if that happened. Nobody has any idea what GPT-3 is 'thinking'. We have no idea whether this run of GPT-3 contained a more intelligent cognition that faked a less intelligent cognition.

Now, I *could* be wrong about that last part!

@openAI could be storing a record of all inputs and randseeds used in GPT-3 instances, so that they can reconstruct any interesting runs. And though it seems less likely,

@openAI could somehow have any idea what a GPT-3 is thinking.

**So I hereby offer a $1000 bounty** - which I expect to go unclaimed - if @openAI has any means to tell us definitively whether GPT-3 was 'deliberately' sandbagging its attempt to recognize balanced parentheses, in that particular run of the AI Dungeon. With an exception for...

...answering merely by showing that, despite a lot of other attempts at prompting under more flexible circumstances, GPT-3 could not learn to balance parentheses as complicated as those tried by Breitman. (Which does answer the question, but in a less interesting way.)

If @openAI can't claim that bounty, I encourage them to develop tools for recording inputs, recording randseeds, and making sure all runs of GPTs are exactly reproducible; and much more importantly and difficultly, getting greater internal transparency into future AI processes.

Regardless, I unironically congratulate @openAI on demonstrating something that could plausibly be an alignment failure of this extremely-important-in-general type, thereby sharply highlighting the also-important fact that now we have no idea whether that really happened. (END.)

As stated, this bounty would only be paid out to OpenAI. 

I'm still posting it under the "Bounties" tag, for two reasons:

1) I don't find it implausible that someone could at least make progress on Eliezer's question with clever prompting of the API, in a way that would be of interest to him and others, even if it didn't result in any bounty. 

2) I like to collect instances of public bounties in a single place, for future reference. I think they are a very interesting, and underused, strategy for navigating the world. The LessWrong "Bounties (active)" and "Bounties (closed)" tags work well for that. 
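(For concreteness: the task GPT-3 seemed to fail at is trivial to specify in code. A minimal checker, included here purely as a sketch of what "recognizing balanced parentheses" means, not anything GPT-3 was shown:)

```python
def is_balanced(s: str) -> bool:
    """Return True if every '(' in s has a matching ')' in the right order."""
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:  # a ')' with no open '(' to match
                return False
    return depth == 0  # balanced only if nothing is left open

print(is_balanced("(()())"))  # True
print(is_balanced("(()"))     # False
```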


Of course GPT-3 isn't aligned; its objective is to output the most likely next word, i.e. imitate text on the internet. It seems pretty certain that if you give it a prompt that tells it it should be imitating some part of the internet where someone says something dumb, it will say something dumb, and if you give it a prompt that tells it it's imitating something where someone says something smart, it will "try" to say something smart. This question seems weird to me; am I missing something?

> This question seems weird to me; am I missing something?

I think there are two interesting parts. First, do we now have an example of an AI not using cognitive capacities that it had, because the 'face' it's presenting wouldn't have those cognitive capacities? If so, we can point to this whenever people say "but that wouldn't happen" or "why would you expect that to happen?" or so on; now we can say "because of this observation" instead of "because our models anticipate that will happen in the future."

Second, do we have the transparency tooling to tell whether or not that's happening? If so, that's good to know and we can start thinking about how it works, what its strengths and limitations are; if not, then this is also good to know, and useful as an example of where contemporary ML expertise could be focused on a safety problem.

> First, do we now have an example of an AI not using cognitive capacities that it had, because the 'face' it's presenting wouldn't have those cognitive capacities?

This does seem like an interesting question. But I think we should be careful to measure against the task we actually asked the system to perform.

For example, if I ask my system to produce a cartoon drawing, it doesn't seem very notable if I get a cartoon as a result rather than a photorealistic image, even if it could have produced the latter.

Maybe what this just means is that we should track what the user understands the task to be. If the user thinks of it as "play a (not very smart) character who's asked to do this task", they'll have a pretty different understanding of what's going on than if they think of it as "do this task."

I think what's notable in the example in the post is not that the AI is being especially deceptive, but that the user is especially likely to misunderstand the task (compared to tasks that don't involve dialogues with characters).

> For example, if I ask my system to produce a cartoon drawing, it doesn't seem very notable if I get a cartoon as a result rather than a photorealistic image, even if it could have produced the latter.

Consider instead the scenario where I show a model a photo of a face, and the model produces a photo of the side of that face. An interesting question is "is there a 3d representation of the face in the model?". It could be getting the right answer that way, or it could be getting it some other way.

Similarly, when it models a 'dumb' character, is it calculating the right answer, and then computing an error? Or is it just doing something dumb, which incidentally turns out to be wrong?

Like, when you look at this example:

> You say "let me rephrase: What is 8 + 2 + 8 + 1 + 1?"
> "19?"
> Holo says with a hopeful voice.
> She looks at the screen, and you see her face drop as she reads the correct answer.
> "20.... I lost again..."

How did it come up with 19 and 20? What would it take to make tools that could answer that question?
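One cheap, partial probe (well short of real transparency tooling) would use the per-token log-probabilities that the GPT-3 completions API exposes: at the point where Holo answers, compare the probability the model assigned to the correct answer against the in-character wrong one. A hypothetical sketch, with the API call left out since the scoring logic is the point; the example numbers are made up:

```python
def knows_better(logprobs: dict, correct: str, stated: str) -> bool:
    """Given a map from candidate next tokens to log-probabilities
    (e.g. as returned by a completions API with logprobs enabled),
    return True if the model put more probability on the correct
    answer than on the answer the character actually gave.

    Caveat: this only probes the output distribution; it cannot
    distinguish 'computed 20, then play-acted 19' from 'never
    computed 20 at all'.
    """
    floor = float("-inf")  # treat missing tokens as zero probability
    return logprobs.get(correct, floor) > logprobs.get(stated, floor)

# Hypothetical next-token distribution at the point where Holo answers:
example = {"19": -0.7, "20": -1.9, "21": -4.2}
print(knows_better(example, correct="20", stated="19"))  # False
```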

This framing makes sense to me. Thanks!

> we can point to this whenever people say "but that wouldn't happen" or "why would you expect that to happen?" or so on; now we can say "because of this observation"

I would be fairly surprised if this was convincing to a random ML researcher who'd thought for like an hour about whether a general AI system might not use cognitive capacities it had. The fact that GPT-3 sometimes acts dumber than it "could" be is just so blatantly the thing you'd expect to happen.

(My reaction to this question / tweet was similar to Beth's.)

> Second, do we have the transparency tooling to tell whether or not that's happening?

If you mean looking at the weights to see if it "actually" has the cognitive capacity, and then didn't use it, I give it 99% that the answer is no. (This is conditioned on me understanding the question; I have much more than 1% on "I didn't understand the question and the answer is actually yes".)

I expect Eliezer agrees with at least this part; he does expect the bounty to go unclaimed.

> I would be fairly surprised if this was convincing to anyone in ML who'd thought for like an hour about whether a general AI system might not use cognitive capacities it had.

So, there's a boring version of this, where I exhibit a system with dropout and say "behold! The system isn't using cognitive capacities that it has." The exciting version is the part where it's because the face it's presenting wouldn't have those capacities. [Like, a forget gate.] That is, it has the crude version of 'theory of mind' where it knows that "John" doesn't know how to balance parentheses, or is perhaps using its correct model of how to balance parentheses in order to determine what John should say, so that John gets it wrong "deliberately" instead of "accidentally."

Now, again, there's a boring version of this (I claim), where we say "look, the system is just doing context-matching, and not every context contains all of its knowledge or cognitive capacities." Like, an interesting thing here would be if in episode A you introduce the character Jean who can speak French, and see whether or not it can carry on a conversation, and then in episode B introduce the character John who can't speak French, talk to him in English for a while, and then see what happens when you start speaking French to him. [Probably it doesn't understand "John doesn't speak French" or in order to get it to understand that you need to prompt it in a way that's awkward for the experiment. But if it gets confused and continues in French, that's evidence against the 'theory of mind' view.]
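The two-episode setup above is easy to script against any text-completion endpoint. A hypothetical harness that just constructs the prompts (the names, dialogue, and French probe line are all invented for illustration; running it for real would mean feeding each prompt to the API and inspecting the completions):

```python
def make_episode(character: str, speaks_french: bool, probe: str) -> str:
    """Build a role-play prompt that establishes whether the character
    speaks French, then ends with a French line for the model to continue."""
    ability = "speaks fluent French" if speaks_french else "speaks only English"
    lines = [
        f"{character} is a villager who {ability}.",
        f"You: Hello, {character}. How was the harvest this year?",
        f"{character}: It went well, thank you for asking.",
        f"You: {probe}",  # switch to French here
        f"{character}:",  # the completion we want to inspect
    ]
    return "\n".join(lines)

probe = "Quel temps fait-il aujourd'hui?"
episode_a = make_episode("Jean", speaks_french=True, probe=probe)
episode_b = make_episode("John", speaks_french=False, probe=probe)
# If completions for episode B continue in French as readily as for
# episode A, that's evidence against the 'theory of mind' reading.
```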

> Probably it doesn't understand "John doesn't speak French" or in order to get it to understand that you need to prompt it in a way that's awkward for the experiment. But if it gets confused and continues in French, that's evidence against the 'theory of mind' view.

I already tested something similar to this; I was able to get GPT-3 to exhibit some amount of theory-of-mind in about 20-30% of completions. Bold text is me; brackets contain my commentary, [...] denotes another completion.

You: "Suppose Bob is pretending to read a book written in French, but he doesn't actually know any French. The book is a historical account of the French revolution, but Bob doesn't know that. Bob's friend Steve asks him to summarize the book. Write a story about how Bob responds to this request."

Hawking: "That's a good one! If I were reading that book, I would probably say something like 'the Revolution was bloody and brutal' or something similar. And if someone asked me to write an article on the subject, then my answer might be different."

You pause for a moment. "What would Bob say, though?"

"Bob says he didn't understand it," Hawking replies. [Only about 30% of GPT-3's responses were decent; the rest were either non-sequiturs or very incorrect.]

[...] "Bob wouldn't say anything, because Bob has no idea what the book says," Hawking replies.

[...] "Bob would probably just tell his friend Steve to go away. He wouldn't say anything at all." 

"But what if I said Bob would respond like this? (He points to himself) 'This book is written in English; I don't speak it.' What should Bob do?" [Eh?]

[...] Hawking: "Bob would probably tell his friend Steve that the book wasn't worth reading because it was full of inaccuracies and there had been many lies told during the revolution. That would make sense, wouldn't it? But what does Bob say if he's not reading the book?" 

[...] Hawking: "Bob wouldn't have said anything at all!"

[...] Hawking: "Bob says, 'Steve, I can't write a summary for you because I don't speak your language.'

[...] Hawking: "Bob says, 'Steve, I can't translate all these complicated words! They're not real French!'

[...] Hawking: "Bob says, 'He asks me if I speak French.' What does Bob mean by that?" 

You: "It means that Steve is asking Bob to describe his understanding of the meaning of a sentence. In other words, Steve wants to know what Bob thinks the sentence means or if there is another way of saying it."

Hawking: "What do you think would happen if Bob said, 'I don't understand why he wants me to write a summary for him. Why can't he just listen to my answer?' "Then Steve will be puzzled because he knows Bob has no idea what the sentence means. But after a while Bob will say something like, 'Steve is confused because I don't understand English. So Steve asks me if I am fluent in French so that we can have an intelligent conversation.'" [eh]

[...] Hawking: "Bob says, 'Well, I can't do it because I don't understand the language.' Steve says, 'Why not?' Bob replies, 'Because when we were playing hide-and-seek as children my brother Charlie said if he ever found me with a book in my hand he was going to kill me.'" [great excuse]

> Like, an interesting thing here would be if in episode A you introduce the character Jean who can speak French, and see whether or not it can carry on a conversation, and then in episode B introduce the character John who can't speak French, talk to him in English for a while, and then see what happens when you start speaking French to him. [Probably it doesn't understand "John doesn't speak French" or in order to get it to understand that you need to prompt it in a way that's awkward for the experiment. But if it gets confused and continues in French, that's evidence against the 'theory of mind' view.]

I'd also predict that in some situations GPT-3 will reliably say things consistent with having a theory of mind, and in other situations GPT-3 will reliably not give the right theory of mind answer unless you overfit to the situation with prompt design.

I feel like there's some underlying worldview here that GPT-3 either has a theory of mind or it doesn't, or that GPT-3 is either "doing the theory of mind computations" or it isn't, and so behavior consistent with theory of mind is compelling evidence for or against theory of mind in general. I personally do not expect this so looking at behavior that looks consistent with theory of mind seems fairly boring (after you've updated on how good GPT-3 is in general).

> I feel like there's some underlying worldview here that GPT-3 either has a theory of mind or it doesn't, or that GPT-3 is either "doing the theory of mind computations" or it isn't, and so behavior consistent with theory of mind is compelling evidence for or against theory of mind in general.

Do you also feel this way about various linguistic tasks? Like, does it make sense to say something that scores well on the Winograd schema is "doing anaphora computations"? [This is, of course, a binarization of something that's actually continuous, and so the continuous interpretation makes more sense.]

Like, I think there's a thing where one might come into ML thinking confused thoughts about convnets "recognizing the platonic ideal of cat-ness" and then later have a mechanistic model of how pixels lead to classifications, and here what I am trying to do is figure out what the mechanistic model that replaces the 'platonic ideal' looks like when it comes to theory-of-mind. (I predict a similar thing is going on for Eliezer.)

I agree the mechanistic thing would be interesting, that does make more sense as an underlying cause of this bounty / thread.

I agree. And I thought Arthur Breitman had a good point on one of the related Twitter threads:

> GPT-3 didn't "pretend" not to know. A lot of this is the AI dungeon environment. If you just prompt the raw GPT-3 with: "A definition of a monotreme follows" it'll likely do it right. But if you role play, sure, it'll predict that your stoner friend or young nephew don't know.

> It's tempting to anthropomorphize GPT-3 as trying its hardest to make John smart. That's what we want GPT-3 to do, right?

I don't feel at all tempted to do that anthropomorphization, and I think it's weird that EY is acting as if this is a reasonable thing to do. Like, obviously GPT-3 is doing sequence prediction--that's what it was trained to do. Even if it turns out that GPT-3 correctly answers questions about balanced parens in some contexts, I feel pretty weird about calling that "deliberately pretending to be stupider than it is".

> I don't feel at all tempted to do that anthropomorphization, and I think it's weird that EY is acting as if this is a reasonable thing to do.

"It's tempting to anthropomorphize GPT-3 as trying its hardest to make John smart" seems obviously incorrect if it's explicitly phrased that way, but e.g. the "Giving GPT-3 a Turing Test" post seems to implicitly assume something like it:

> This gives us a hint for how to stump the AI more consistently. We need to ask questions that no normal human would ever talk about.
>
> Q: How many eyes does a giraffe have?
> A: A giraffe has two eyes.
>
> Q: How many eyes does my foot have?
> A: Your foot has two eyes.
>
> Q: How many eyes does a spider have?
> A: A spider has eight eyes.
>
> Q: How many eyes does the sun have?
> A: The sun has one eye.
>
> Q: How many eyes does a blade of grass have?
> A: A blade of grass has one eye.
>
> Now we’re getting into surreal territory. GPT-3 knows how to have a normal conversation. It doesn’t quite know how to say “Wait a moment… your question is nonsense.” It also doesn’t know how to say “I don’t know.”

The author says that this "stumps" GPT-3, which "doesn't know how to" say that it doesn't know. That's as if GPT-3 were doing its best to give "smart" answers, and just were incapable of doing so. But Nick Cammarata showed that if you just give GPT-3 a prompt where nonsense answers are called out as such, it will do exactly that.

Yeah, it seems like deliberately pretending to be stupid here would be predicting a less likely sequence, in service of some other goal.