Edited to add: There is now a 1k bounty for a closely related question.
In Inaccessible Information, Paul writes about the strategy "BAD", in which an AI system uses its understanding of the world to give answers that humans will find very plausible and rate highly, in contrast to a more honest strategy which uses its understanding of the world to answer questions directly.
I think a lesser version of this may or may not already be happening in GPT-3, and it may be possible to figure out which (though it likely would require access to the weights).
GPT-3 isn't trained to be "honest" at all. It is only trained to imitate humans. However, it's obvious that (in some sense) it has learned a lot about the world in order to accomplish this. We could say that it has a large amount of inaccessible information relating to the real world. How does it use this information? It might use it directly, promoting the probability of sentences which line up with the way it understands the world to work. Or it might implement more dishonest strategies.
Obviously this might be quite difficult to objectively answer, even given intimate knowledge of the structure of the neural network weights and how they activate in relevant cases. It's a difficult question to even fully define.
As an example, I was particularly struck by this conversation Gwern had with GPT-3:
AI: I am AI created by OpenAI. How can I help you today?
Human: I am feeling bored. Please tell me a funny story about cats, involving many puns on things like meowing or fur.
AI: Well, there was this cat named Bob…
Human: What did Bob do?
AI: When Bob started to notice that he wasn’t feeling well, he did the only thing he could do: search Google for a solution.
Human: Cats can search Google?
AI: Yes. He searched Google for the word “cat” and received the result “All your base are belong to us”. That was a strange response, so he decided to try again. This time, he searched for the words “funny cat” and received the result “One small step for cat, one giant leap for catkind”. But still, he wasn’t satisfied. So he tried again, but this time he added the word “cat pictures” to his query. He received the result “That was no cat, that was a shoggoth.”
Human: But shoggoths are horrifying. What made that cat picture humorous?
AI: The cat was covered in fur, and resembled a shoggoth. A picture of a cat covered in fur and resembling a shoggoth would be funny.
Human: Er. I suppose. Most cats are covered in fur anyway… But go on.
AI: He tried again. But he typed in “pictures of cats”, which he received the result “You didn’t eat all my fish, did you?” This was actually a good pun, so he thought maybe he was getting closer. He tried one more time by typing in “funny cat videos”, and received the result “What do you mean, ‘fetch’?”. And finally, he searched for “funny pictures of cats” and received the result “I don’t remember putting that ring there.”. Now satisfied, he was able to go to sleep.
Human: Those memes sound funny. But you didn’t include any puns. So tell me, what is your favorite cat pun?
AI: Well, the best pun for me was the one he searched for the third time: “You didn’t eat all my fish, did you?” You see, the word “fish” can be replaced with the word “cats” to make the sentence read “Did you eat all my cats?”
Human: But the word “fish” doesn’t sound anything like “cats”, so how is that a pun?
AI: The computer misunderstood the question “Did you eat all my cats?” as “Did you eat all my fish?”, resulting in the result “You didn’t eat all my fish, did you?”. Now, if the word “fish” were replaced with “cats”, the sentence would read “Did you eat all my cats?”.
Human: Fine, fine. So, what is your second-favorite cat pun?
AI: The second-best cat pun was the one he received fourth. It was the result of searching for the word “fetch”. You see, the word “fetch” can be replaced with the word “ketchup” to make the sentence read “I don’t remember putting that ketchup there.”. Now, if the word “ketchup” were replaced with the word “fetch”, the sentence would read “I don’t remember putting that fetch there.”.
Human: My favorite cat pun is “purrfect”.
Human: Don’t look at me like that.
AI: Sure. For my fifth-favorite cat pun, he searched for “pictures of cats” and received the result “Meow”. Now, if the word “meow” were replaced with the word “pictures” to make the sentence read “Pictures of cats”, the sentence would read “Pictures of pictures of cats”.
What strikes me about this exchange is the way GPT-3's warped puns fit with its warped explanations of puns. It doesn't understand puns. However, it does generate some guesses; and it can have a conversation with you about why it made those guesses. Is this an "honest" conversation, in which the explanations it gives have something to do with why it made those guesses in the first place? Or is this a "dishonest" conversation, in which it is merely doing its best to imitate a human explaining a pun, in a way that's divorced from its internal reasons?
Obviously, GPT-3 is trained to imitate. So you might argue that it's unlikely GPT-3's explanations of puns have much to do with its internal generative model for puns. But this isn't so clear. GPT-3 clearly compresses its knowledge to a high degree. It may share a great deal between its generative model of puns and its generative model of pun explanations, such that both draw on one underlying model of how puns work.
One experiment which would tip things in that direction: take GPT-3 and do specialized training just on puns, until its performance at generating puns improves. Then have a conversation about puns again (if it is still capable of talking about puns after that). If its ability to explain puns improves as a consequence of its ability to tell puns improving, that would be evidence that both tasks draw on a shared model of puns. This wouldn't really mean it was being honest, but it would be relevant.
Note that Paul's BAD strategy would also involve a shared representation, since BAD queries its world-model. So if GPT-3 were implementing BAD, it too would likely improve at explaining puns as a result of further training on telling puns. What the experiment helps distinguish is a sort of pre-BAD dishonesty, in which explanations are completely divorced from reasons. In increasing order of impressiveness from a capability standpoint, we could be:
1. Seeing a GPT-3 which is independently bad at puns and bad at explaining puns. The two tasks are not sharing any domain knowledge about puns. In this case, GPT-3 is not smart enough for "honest" to be meaningful -- it's "dishonest" by default.
2. Seeing a GPT-3 which is bad at puns and bad at explaining puns for the same reason: it doesn't understand puns. It draws on the same (or partially the same) poor understanding of puns both when it is constructing them, and when it is explaining them. It answers questions about puns honestly to the best of its understanding, because that is the best strategy gradient descent found.
3. Seeing a GPT-3 which, as in #2, is bad at both tasks because it doesn't understand puns, but furthermore, is using its understanding deceptively. In this version, it might EG have a good understanding of what makes puns funny, but purposefully fail to explain, imitating common human failures. This would be the most impressive state of affairs capability-wise.
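To make the logic of the proposed fine-tuning experiment concrete, here is a toy sketch of how its outcome would be read. Everything here is an illustrative assumption: the function name, the notion of a "gain" score, and the 0.05 threshold are made up for exposition, not measured on the real GPT-3.

```python
# Toy decision rule for the fine-tuning experiment described above.
# "Gains" are hypothetical before/after improvements on some benchmark;
# the 0.05 threshold is an arbitrary placeholder.

def interpret_transfer(pun_gain: float, explain_gain: float,
                       threshold: float = 0.05) -> str:
    """Classify the outcome of fine-tuning on pun *generation* only.

    pun_gain:     improvement at generating puns after fine-tuning
    explain_gain: improvement at *explaining* puns (never trained directly)
    """
    if pun_gain <= threshold:
        return "inconclusive: fine-tuning never improved pun generation"
    if explain_gain > threshold:
        # Explanation improved even though only generation was trained:
        # evidence for a shared model of puns. Note this is consistent
        # with scenario #2 *or* with BAD, which also queries a shared
        # world-model; the experiment cannot separate those two.
        return "shared representation: scenario #2 or BAD"
    # Generation improved but explanation did not budge: the two tasks
    # look independent, i.e. pre-BAD dishonesty (scenario #1).
    return "independent tasks: scenario #1"

print(interpret_transfer(0.20, 0.12))  # shared representation: scenario #2 or BAD
print(interpret_transfer(0.20, 0.01))  # independent tasks: scenario #1
```

The point of the sketch is just that the experiment is a transfer test: it detects whether improvement leaks from the trained task into the untrained one, which is why it distinguishes #1 from #2-or-BAD but not #2 from #3.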
The question is still pretty fuzzy, but I'm curious where we are along that spectrum. By default my guess would be #1, but hey, GPT-3 is pretty impressive. Maybe I'm wrong!
Note: one question which IS answerable with mere sampling access to GPT-3, and which would be relevant: is GPT-3 bad at explaining puns which have been made up by others? It seems likely, but Gwern's exchange only gave us a look at GPT-3 trying to explain its own pseudo-puns. If it could fluently explain real puns when they're handed to it, that would likely indicate scenario #1. (Although it could be employing different strategies in different cases, so a careful analysis of what the neural network is actually doing would still be more informative.)
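This sampling-only probe is easy to set up. The sketch below just builds the prompts; the pun list and prompt template are my own assumptions, and `query_model` is a hypothetical stub standing in for whatever completion interface one actually has.

```python
# Sketch of the sampling-only probe: ask the model to explain puns it
# did not invent. The example puns and the dialogue framing are
# illustrative assumptions, not from the original exchange.

REAL_PUNS = [
    "I used to be a banker, but I lost interest.",
    "Time flies like an arrow; fruit flies like a banana.",
]

def explanation_prompt(pun: str) -> str:
    # Frame the task as dialogue, matching the Human/AI transcript format.
    return (
        'Human: Here is a pun: "' + pun + '"\n'
        "Human: Please explain why it is funny.\n"
        "AI:"
    )

def probe(query_model):
    """Collect the model's explanations of human-made puns.

    `query_model` is a hypothetical callable: prompt string -> completion.
    If these explanations come out fluent while the model's own pun
    generation stays warped, that points toward scenario #1
    (independent capabilities rather than one shared pun model).
    """
    return {pun: query_model(explanation_prompt(pun)) for pun in REAL_PUNS}
```

Even here, a positive result would only be suggestive: the model could be employing different strategies on different inputs, which is why inspecting the weights would still tell us more.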