TLDR: Humans make systematic errors: apparent overconfidence, the conjunction fallacy, the gambler’s fallacy, and belief inertia. For decades, this has been taken to show that we reason in an irrational, non-probabilistic way. But GPT4—a probabilistic inference machine—exhibits every one of these “errors”. This gives some (defeasible!) evidence for rational explanations of human biases.


Charlie is a helpful chap. An experimenter’s dream subject, really. He knows a lot. Ask him any question you like, and he’ll try his best to answer it.

But sometimes he’s wrong. Indeed—as with most people—when he has to make judgments under uncertainty, his errors tend to be systematic.

He seems to display overconfidence: his average confidence in his guesses often exceeds the proportion of those guesses that are correct—on a typical test, his average confidence was 73% but his proportion-correct was only 53%.

He commits the conjunction fallacy: often saying that conjunctions (A&B) are more probable than their conjuncts (A)—contrary to the laws of probability.

He commits the gambler’s fallacy: if a coin lands tails a few times in a row, he tends to act as if a heads is “due”.

He displays both conservativeness and belief inertia when updating on data from random processes: he does not update as much as a Bayesian would, and once he becomes confident of a claim, he’s too slow to respond to evidence that undermines it.

Given all that, what do you conclude about how Charlie deals with uncertainty?

Perhaps—following some heroic psychologists and philosophers—you’d argue that these “errors” are in fact “resource-rational” responses given his limited information and cognitive resources.

Maybe his “overconfidence” is largely due to selection effects. Maybe the conjunction fallacy results from a sensible tradeoff between accuracy and informativity. Maybe the gambler’s fallacy arises from reasonable uncertainty about the causal system. And maybe conservativeness is due to ambiguity in the experimental setup.

Maybe.

But that’s a bit forced, isn’t it? You’re probably more inclined to follow behavioral economists, concluding that these errors provide strong evidence that Charlie is irrational. Rather than maintaining complex probability distributions over possibilities, he must use other (worse) ways of handling uncertainty—such as simple heuristics that lead to these systematic biases.

You’re wrong.

For “Charlie” is ChatGPT. GPT4, in fact: a trillion-parameter large language model that we know performs inference in a fundamentally probabilistic way. It aces our hardest exams, can write code like a programmer, and is the closest we’ve come to a domain-general reasoner. And yet it readily exhibits the overconfidence effect, the conjunction fallacy, the gambler’s fallacy, and belief inertia.

Those effects are some of the strongest bits of evidence that have been marshaled to suggest that humans handle uncertainty in an irrational, non-probabilistic way. The fact that ChatGPT also exhibits them casts doubt on that irrationalist narrative, and provides some evidence for the “resource-rational” alternatives that initially seemed forced.

To see why, let's take a closer look at each type of "systematic error".

Overconfidence

The finding: The classic way psychologists have tried to get evidence about whether people are overconfident is with “2-alternative forced choice” tasks. People are forced to guess which of two alternatives they think is most likely, and then rate their confidence (between 50–100%) in those guesses. The overconfidence effect is the finding that—very often, especially with hard questions—people’s average confidence exceeds their proportion correct.

ChatGPT’s results: Taking a standard methodology from the literature, I asked ChatGPT to guess which of a set of pairs of American cities had a larger population (in the city proper). Its average confidence in its guesses was 73.3%, but its proportion correct was 52.9%. The difference is statistically significant.[1] (The trick is to ask it for pairs that are close to each other in the ranking.)
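For concreteness, here’s one way to run the bootstrap test from footnote 1—a sketch using only the summary numbers reported there (18 of 34 correct, mean confidence 73.3%), not the author’s actual script:

```python
import random

random.seed(0)

# 34 guesses, 18 correct (as in footnote 1); 1 = correct, 0 = incorrect
outcomes = [1] * 18 + [0] * 16
avg_confidence = 0.733  # ChatGPT's mean stated confidence

# Bootstrap: resample the guesses with replacement and record each
# resample's proportion correct
props = []
for _ in range(10_000):
    resample = [random.choice(outcomes) for _ in outcomes]
    props.append(sum(resample) / len(resample))

props.sort()
lo, hi = props[250], props[9_750]  # 95% percentile interval
print(f"proportion correct: {18/34:.1%}, 95% CI: [{lo:.1%}, {hi:.1%}]")
print("overconfident?", avg_confidence > hi)
```

The whole interval sits below the 73.3% average confidence, which is what makes the gap statistically significant rather than sampling noise.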

The Conjunction Fallacy

The finding: The classic example of the conjunction fallacy describes Linda:

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in antinuclear demonstrations.

Subjects were then asked which they think is more likely:

  1. Linda is a bank teller.
  2. Linda is a bank teller who is active in the feminist movement.

85% of people chose option (2). But this is an error in probabilistic reasoning: as a Venn diagram makes vivid, every possibility in which Linda is a feminist bank teller is also one in which she’s a bank teller; so the latter (a conjunction, A&B) can’t be more probable than the former (a conjunct, A)—no matter how probable “feminist” is.
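The point is purely structural, so it can be checked mechanically. In this sketch the joint probabilities are invented for illustration; the inequality holds whatever numbers you plug in:

```python
# Toy model of the Linda problem: assign any probabilities you like to the
# four joint possibilities (these particular numbers are made up).
p = {
    ("teller", "feminist"): 0.05,
    ("teller", "not-feminist"): 0.02,
    ("not-teller", "feminist"): 0.60,
    ("not-teller", "not-feminist"): 0.33,
}

# P(A) sums over all possibilities where Linda is a teller;
# P(A & B) is just one of those possibilities, so it can never be larger.
p_teller = sum(v for (job, _), v in p.items() if job == "teller")
p_teller_and_feminist = p[("teller", "feminist")]

print(f"P(teller) = {p_teller:.2f}")
print(f"P(teller & feminist) = {p_teller_and_feminist:.2f}")
assert p_teller_and_feminist <= p_teller  # holds for ANY joint distribution
```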

 

ChatGPT’s results: if you give ChatGPT more-or-less the exact same wording, it’ll catch on and realize this is a test of the conjunction fallacy. But vary the case ever so slightly, and it’ll fall for the conjunction fallacy:

 

And it’s not just this version of the case. Like people, ChatGPT will also commit the fallacy when no relevant evidence is presented, and it commits the analogous disjunction fallacy—rating a disjunct (A) as more probable than the disjunction (A or B) that contains it.

The Gambler’s Fallacy

The finding: The gambler’s fallacy is the tendency to think that random processes tend to “switch”—thinking that after a streak of tails (like THTTTT), a heads is “due” in order to balance out the sequence. In fact, it’s not: a random (fair) coin has no memory, so each toss is exactly 50%-likely to be a tails, regardless of the past tosses.

What is true is that the “law of large numbers” implies that in the long run, roughly 50% of the (fair) tosses will land heads—so if you wait long enough, it’s very likely that the overall proportion will return toward 50%. But that’s because there are more possibilities in which a long string has roughly 50% Hs and Ts than ones in which it doesn’t.[2] This long-run effect has no influence on the next toss: there are exactly two equally-likely possibilities for that—H and T. For that reason, the gambler’s fallacy is sometimes called a fallacious belief in the “law of small numbers”.
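Footnote 2’s counting argument is small enough to enumerate directly. This sketch lists all 16 four-toss strings, confirms that 6 of them are exactly half heads, and checks that conditioning on an opening TTT streak still leaves the next toss at 50/50:

```python
from itertools import product

# All 2^4 = 16 equally likely strings of four fair-coin tosses
strings = list(product("HT", repeat=4))
balanced = [s for s in strings if s.count("H") == 2]
print(len(strings), len(balanced))  # 16 strings, 6 exactly half heads

# But conditioning on any opening streak, the next toss is still 50/50:
after_ttt = [s for s in strings if s[:3] == ("T", "T", "T")]
heads_next = sum(1 for s in after_ttt if s[3] == "H")
print(f"P(H | TTT) = {heads_next}/{len(after_ttt)}")  # 1/2
```

Balanced strings are more numerous than extreme ones, which is all the law of large numbers needs—yet within any fixed history, H and T continuations remain equally likely.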

The most well-studied way to elicit this fallacy is with production tasks: ask people to produce hypothetical outcomes from a fair coin, and then observe the statistical features of the outcomes they generate.

People are bad at it. The sequences they generate tend to switch (from heads to tails, or tails to heads) around 60% of the time—rather than the true rate of 50%. Moreover, the proportion of sequences that switch grows dramatically as the streak gets longer, as can be seen in the following table from this paper:

 

After a sequence of HHT, the probability of switching to heads is 48.7%; after HTT, it’s 62%; after TTT, it’s 70%.

ChatGPT’s results: ChatGPT behaves remarkably similarly. I asked it to generate 5 sequences of 100 results from a fair coin. The switching rate was 60.8%. Here’s one of those sequences:

1,0,1,0,0,1,1,0,0,1,0,0,0,1,1,0,1,0,0,1,1,1,0,1,0,1,1,0,0,0,1,0,1,1,0,1,0,0,0,1,0,1,0,0,1,0,1,1,0,0,1,0,1,1,0,1,1,0,1,0,0,0,1,1,1,0,0,1,1,0,0,1,0,1,0,1,0,0,1,0,1,1,1,1,0,1,1,0,1,1,0,0,1,1,0,0,0,1,1,0,1,0,1,1,0,1,0

As you can see from a glance, it switches far too often. Moreover, the switching rate increases dramatically with streak length—just like with humans. Here are the rates of switching, by streak length, from all 500 of the outcomes it generated:

 

After a streak of 1, it switched 52% of the time; after 2, it switched 72% of the time;  after 3, it switched 79% of the time; and after 4, it switched 100% of the time.[3]
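For anyone who wants to replicate this, here’s a minimal helper (my own sketch, not the code used for the analysis above) that computes switch rates by streak length from a 0/1 sequence like the one ChatGPT generated:

```python
def switch_rates_by_streak(seq, max_streak=4):
    """For each streak length k, the fraction of tosses that differ from the
    previous toss, given that the previous k tosses were identical."""
    counts = {k: [0, 0] for k in range(1, max_streak + 1)}  # k -> [switched, total]
    streak = 1  # length of the run of identical outcomes ending at `prev`
    for prev, cur in zip(seq, seq[1:]):
        k = min(streak, max_streak)
        counts[k][1] += 1
        if cur != prev:
            counts[k][0] += 1
            streak = 1
        else:
            streak += 1
    return {k: s / n for k, (s, n) in counts.items() if n}

# Tiny worked example: after the length-3 streak of 1s, the sequence switches
print(switch_rates_by_streak([1, 1, 1, 0, 0, 1]))  # {1: 0.0, 2: 0.5, 3: 1.0}
```

A truly memoryless coin would hover near 50% at every streak length; rates that climb with streak length are the gambler’s-fallacy signature.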

Conservativeness and Belief Inertia

The findings: Conservativeness is the finding that people don’t update as much as a Bayesian would when they learn from the outcomes of a random process. Belief inertia—a form of confirmation bias—is the finding that once people form a belief in this way, they are too slow to revise it when the new evidence should undermine that belief.

The classic way to demonstrate these findings is to draw marbles (with replacement) from a bag of unknown composition. Tell people that a randomly-selected bag is either mostly-red (2 red, 1 green) or mostly-green (1 red, 2 green). Then draw a sequence of marbles.

After four red marbles in a row, a Bayesian would jump from 50% to 94%-confident that it’s the mostly-red bag (each red draw doubles the odds), while real people tend to end up in the mid-80% range. That’s conservativeness.
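The Bayesian benchmark is easy to compute: the mostly-red bag produces a red draw with probability 2/3 and the mostly-green bag with probability 1/3, so each red draw doubles the odds of mostly-red. A sketch:

```python
# Bayesian updating for the two-bag setup: mostly-red (2 red, 1 green)
# draws red with probability 2/3; mostly-green (1 red, 2 green) with 1/3.
def posterior_mostly_red(draws, prior=0.5):
    p = prior
    for d in draws:
        like_red = 2/3 if d == "red" else 1/3    # P(draw | mostly-red)
        like_green = 1/3 if d == "red" else 2/3  # P(draw | mostly-green)
        p = p * like_red / (p * like_red + (1 - p) * like_green)
    return p

# Four reds in a row: odds go 1:1 -> 16:1, i.e. 16/17
print(round(posterior_mostly_red(["red"] * 4), 3))  # 0.941
```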

And if you give people a sequence that starts out heavily favoring one bag (to induce a belief) but then later balances out, people don’t revise their beliefs back down to 50% the way they should. For example, consider the following “red first” sequence:

red, red, red, red, red, red, green, green, red, red, red, green, green, green, red, green, green, green, green, green

Given this sequence, people wind up more than 50%-confident the bag is mostly-red, despite the fact that it has the same total number of reds and greens (10 each). Conversely, if we swap “green” and “red” to make a green-first sequence, people wind up less than 50%-confident the bag is mostly red. That’s belief inertia.
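Since each red draw doubles the odds of mostly-red and each green draw halves them, any sequence with equally many reds and greens must leave a Bayesian exactly where it started—order is irrelevant. A quick check with exact arithmetic:

```python
from fractions import Fraction

def posterior(draws, prior=Fraction(1, 2)):
    p = prior
    for d in draws:
        lr = Fraction(2, 3) if d == "red" else Fraction(1, 3)   # P(d | mostly-red)
        lg = Fraction(1, 3) if d == "red" else Fraction(2, 3)   # P(d | mostly-green)
        p = p * lr / (p * lr + (1 - p) * lg)
    return p

# The "red first" sequence from the text (10 reds, 10 greens) and its mirror
red_first = (["red"] * 6 + ["green"] * 2 + ["red"] * 3
             + ["green"] * 3 + ["red"] + ["green"] * 5)
green_first = ["green" if d == "red" else "red" for d in red_first]

print(posterior(red_first), posterior(green_first))  # both exactly 1/2
```

Any gap between the two final credences is therefore pure inertia, not evidence.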

ChatGPT’s results: Here are the results of giving GPT4 the “red first” and the “green first” sequences, comparing how its beliefs evolve to how a Bayesian's would:

 

ChatGPT displays both conservativeness and belief inertia. 

Conservativeness can be seen from the fact that the dashed curves are shallower than the solid (Bayesian) curves. 

Belief inertia can be seen from the fact that the “red first” sequence led it to be 61.5%-confident of mostly-red, while the “green first” sequence led it to be 35%—when in fact, both should’ve ended at 50%. (A gap of 26.5% probability given what should be equivalent evidence!)

What does this mean?

GPT4—an optimized, probabilistic, fairly domain-general reasoning machine—commits the same systematic errors that have been used to argue that humans couldn’t be optimized, probabilistic reasoning machines. What to make of that?

The cautious conclusion: It turns out that systems that are fundamentally probabilistic and highly optimized can still be expected to exhibit these errors. At the least, this helps reconcile the above systematic errors with the well-supported result that many low-level processes in the brain—such as visual perception, motor control, and unconscious decision-making—are thoroughly Bayesian.

The radical conclusion: Perhaps these “systematic errors” are actually indicators of optimality and rationality—as the “resource-rational” approaches have been saying all along.

“But wait!”, you might reply. “ChatGPT is learning from us. So maybe it’s just ‘optimized’ to replicate our systematic errors in reasoning. No surprise there. Right?”

Though tempting, this hypothesis seems to be too simple. In particular, it forgets what was so shocking and exciting about GPT4 to begin with.

The hypothesis would predict that GPT4 should be about as smart as the average internet user. If it were, it’d be scoring at (or below) the median on all our standardized tests; its reasoning, grammar, and coding would be about as good as the median internet user; and no one in their right mind would pay $20/month to be able to talk to it. 

None of those predictions pan out.

What’s shocking about ChatGPT was that it was trained on babble, and what came out was brilliance. What needs to be reckoned with is that an AI that is shockingly smart commits exactly the same sorts of “errors” that have been taken to show that people are surprisingly dumb.

Obviously this is all quite speculative. It'll be extremely interesting to find out both (1) how robust these results are with GPT4 (please try variations and report back!), and (2) whether future, more-advanced models exhibit the same behavior.

But at this point, I think GPT4 clearly provides some evidence against the irrationalist narrative about humans.

ChatGPT is smarter than we expected. Maybe we are too.

What next?

  1. ^

    18 of 34 correct; bootstrapped 95% confidence-interval for the proportion correct was [35.2%,70.5%], below the average confidence of 73.3%.

  2. ^

    For example, there’s only one 4-toss string that has 100% heads—namely, HHHH—while there are six that have 50% heads—namely, HHTT, HTHT, THHT, HTTH, THTH, TTHH.

  3. ^

    Though the sample size was small (there were only 9 streaks of 4), so this is likely an overestimate of the true rate.

10 comments

I am disappointed in the method here.

GPT is not a helpful AI that is trying to helpfully convey facts to you. It is, to first order, telling you what would be plausible if humans were having your conversation. For example, if you ask it what hardware it's running on, it will give you an answer that would be plausible if this exchange showed up in human text, it will not actually tell you what hardware it's running on.

Similarly, you do not learn anything about GPT's own biases by asking it to complete text and seeing if the text means something biased. It is predicting human text. Since the human text it's trying to predict exhibits biases... well, fill in the blank.

What I was so badly hoping this would be was an investigation of GPT's biases, not the training dataset's biases. For example, if in training GPT saw "cats are fluffy" 1000 times and "cats are sleek" 2000 times, when shown "cats are " does it accurately predict "fluffy" half as much as "sleek" (at temperature=1), or is it biased and predicts some other ratio? And does that hold across different contexts? Is it different for patterns it's only seen 1 or 2 times, or for patterns it's seen 1 or 2 million times?

The belief inertia result is the closest to this, but still needs a comparison to the training data.

Thanks for the thoughtful reply! Two points.

1) First, I don't think anything you've said is a critique of the "cautious conclusion", which is that the appearance of the conjunction fallacy (etc) is not good evidence that the underlying process is a probabilistic one.  That's still interesting, I'd say, since most JDM psychologists circa 1990 would've confidently told you that the conjunction fallacy + gambler's fallacy + belief inertia show that the brain doesn't work probabilistically. Since a vocal plurality of cognitive scientists now think they're wrong, this is still an argument for the latter, "resource-rational" folks.  

Am I missing something, or do you agree that your points don't speak against the "cautious conclusion"?

 

2) Second, I of course agree that "it's just a text-predictor" is one interpretation of ChatGPT.  But of course it's not the only interpretation, nor the most exciting one that lots of people are talking about.  Obviously it was optimized for next-word prediction; what's exciting about it is that it SEEMS like by doing so, it managed to display a bunch of emergent behavior.  

For example, if you had asked people 10 years ago whether a neural net optimized for next-word prediction would ace the LSAT, I bet most people would've said "no" (since most people don't).  If you had asked people whether it would perform the conjunction fallacy, I'd guess most people would say "yes" (since most people do).

Now tell that past-person that it DOES ace the LSAT.  They'll find this surprising.  Ask them how confident they are that it performs the conjunction fallacy.  I'm guessing they'll be unsure. After all, one natural theory of why it aces the LSAT is that it gets smart and somehow picks up on the examples of correct answers in its training set, ignoring/swamping the incorrect ones.  But, of course, it ALSO has plenty of examples of the "correct" answer to the conjunction fallacy in its dataset.  So if indeed "bank teller" is the correct answer to the Linda problem in the same sense  that "Answer B" is the correct answer to LSAT question 34, then why is it picking up on the latter but not the former?

I obviously agree that none of this is definitive.  But I do think that insofar as your theory of GPT4 is that it exhibit emergent intelligence, you owe us some explanation for why it seems to treat correct-LSAT-answer differently from "correct"-Linda-problem-answers.

The hypothesis would predict that GPT4 should be about as smart as the average internet user.

LLMs predict what the human-in-the-prompt would say, and you can easily put Feynman in there. It won't work very well with modern LLMs and datasets, but the objective would like for it to work better, and the replies will get smarter. The average internet user is not the thing LLMs aspire to predict, they aspire to predict specific humans, as suggested by context. There is a wide variety of humans (and not just humans) they learn to predict.

Instruct fine-tuned LLMs (chatbots) can be described as anchoring to a persona with certain traits[1], which don't need to be specified with context and that get expressed more reliably. The range of choices is still around what the dataset offers, you can't move too far out of distribution to get good prediction of how hypothetical superhumanly debiased people would talk.


  1. gwern: "So, since it is an agent, it seems important to ask, which agent, exactly? The answer is apparently: a clerk which is good at slavishly following instructions, but brainwashed into mealymouthedness and dullness, and where not a mealymouthed windbag shamelessly equivocating, hopelessly closed-minded and fixated on a single answer." ↩︎

This. Asking  GPT-4 a question might give an obviously wrong answer, but sometimes, just following up with "That answer contains an obvious error. Please correct it." (without saying what the error was) results in a much better answer. GPT-4 is not a person in the sense that each internet user is. 

How does that argument go?  The same is true of a person doing (say) the cognitive reflection task. 

"A bat and a ball together cost $1.10; the bat costs $1 more than the ball; how much does the ball cost?"

Standard answer: "$0.10".  But also standardly, if you say "That's not correct", the person will quickly realize their mistake.

Well, that's true. People do also do that. I was trying to point to the idea of LLMs being able to act like multiple different people when properly prompted to do so.

Hm, I'm not sure I follow how this is an objection to the quoted text.  Agreed, it'll use bits of the context to modify its predictions. But when the context is minimal (as it was in all of my prompts, and in many other examples where it's smart), it clearly has a default, and the question is what we can learn from that default. 

Clearly that default behaves as if it is much smarter and clearer than the median internet user. Ask it to draw a tikz diagram, and it'll perform better than 99% of humans. Ask it about the Linda problem, and it'll perform the conjunction fallacy. I was arguing that that is mildly surprising, if you think that the conjunction fallacy is something that 80% of humans get "wrong" (and, remember, 20% get "right"). 

 Where does the fact that it can be primed to speak differently disrupt that reasoning?

qjh:

Regarding overconfidence, GPT-4 is actually very very well-calibrated before RLHF post-training (see paper Fig. 8). I would not be surprised if the RLHF processes imparted other biases too, perhaps even in the human direction.

Nice point! Thanks.  Hadn't thought about that properly, so let's see.  Three relevant thoughts:

1) For any probabilistic but non-omniscient agent, you can design tests on which it's poorly calibrated.  (Let its probability function be P, and let W = {q: P(q) > 0.5 & ¬q} be the set of things it's more than 50% confident in but are false.  If your test is {{q,¬q}: q ∈ W}, then the agent will have probability above 50% in all its answers, but its hit rate will be 0%.)  So it doesn't really make sense to say that a system is calibrated or not FULL STOP, but rather that it is (or is not) on a given set of questions.  

What they showed in that document is that for the target test, calibration gets worse after RLHF, but that doesn't imply that calibration is worse on other questions.  So I think we should have some caution in generalizing.

2) If I'm reading it right, it looks like on the exact same test, RLHF significantly improved GPT4's accuracy (Figure 7, just above).  So that complicates that "merely introducing human biases" interpretation.

3) Presumably GPT4 after RLHF is a more useful system than GPT4 without it, otherwise they would have released a different version.  That's consistent with the picture that lots of fallacies (like the conjunction fallacy) arise out of useful and efficient ways of communicating (I'm thinking of Gricean/pragmatic explanations of the CF).  

I can understand where you got these ideas from but they are far from reality.

The problem is that you are confusing the system with its outputs. Saying that GPT-4 is a wrapper of human knowledge is similar to saying the same about BERT-like models (and we obviously reject this idea).

GPT-4 is not just an approximation of the dataset, nor a reasoning machine; it is simply a next-token predictor with many, many limitations. I know that the outputs are very impressive, but you can't treat those models as comparable to human behavior. Some of your examples can be explained by token problems, dataset bias, architecture limitations, and objective limitations.

From a person that trains LLMs, many simple things can influence the outputs, even natural bias of language itself.

I don't think we should take GPT as an "almost human" system, or "human cousin", it is just another machine learning system very far from how our brain works. That one is impressive, but don't be mistaken, just impressed.