Beware General Claims about “Generalizable Reasoning Capabilities” (of Modern AI Systems)

by Lawrence Chan
11th Jun 2025
4 comments, sorted by top scoring
DanielFilan

Ironically, given that it's currently June 11th (two days after my last tweet was posted), my final tweet provides two examples of the planning fallacy.

"Hopefully" is not a prediction!

Lawrence Chan

Fair, but in my head I did plan to get it done on the 10th. The tweet is not in itself the prediction; it's just evidence that I made the prediction in my head.

And indeed I did finish the draft on June 10th, but at 11 PM and I decided to wait for feedback before posting. So I wasn't that off in the end, but I still consider it off. 

Hruss

I find that studies criticizing current models are often cited long after the issue has been fixed, or without consideration of what they actually showed. I wish technology reporting were more careful, as much of this misunderstanding seems to come from journalistic sources. Examples:

Hands in diffusion models

Text in diffusion models

Water usage

Model collapse - not an issue for actual commercial AI models; the original study was about synthetic data production, with model outputs fed back directly as the exclusive training data

LLMs = Autocorrect - chat models have RLHF post-training

Nightshade/Glaze: useless for modern training methods

AI understanding - yes, the weights are not understood, but the overall architecture is

 

It is surprising how often I hear these claims repeated with false context.

Lawrence Chan

There are indeed many, many silly claims out there, on either side of any debate. And yes, the people pretending that the AIs of 2025 have the limitations of those from 2020 are being silly, journalist or no.

I do want to clarify that I don't think this is a (tech) journalist problem. Presumably when you mention Nightshade dismissively, it's a combination of two reasons: 1) Nightshade artefacts are removable via small amounts of Gaussian blur and 2) Nightshade can't be deployed at scale on enough archetypal images to have a real effect? If you look at the Nightshade website, you'll see that the authors lie about 1):

As with Glaze, Nightshade effects are robust to normal changes one might apply to an image. You can crop it, resample it, compress it, smooth out pixels, or add noise, and the effects of the poison will remain.

So (assuming my recollection that Nightshade is defeatable by Gaussian noise is correct) this isn't an issue of journalists making stuff up or misunderstanding what the authors said, it's the authors putting things in their press release that, at the very least, are not at all backed up by their paper. 

(Also, either way, Gary Marcus is not a tech journalist!)


1.

Late last week, researchers at Apple released a paper provocatively titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity”, which “challenge[s] prevailing assumptions about [language model] capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning”.

Normally I refrain from publicly commenting on newly released papers. But then I saw the following tweet from Gary Marcus:

I have always wanted to engage thoughtfully with Gary Marcus. In a past life (as a psychology undergrad), I read both his work on infant language acquisition and his 2001 book The Algebraic Mind; I found both insightful and interesting. From reading his Twitter, Gary Marcus is thoughtful and willing to call it like he sees it. If he's right about language models hitting fundamental barriers, it's worth understanding why; if not, it's worth explaining where his analysis went wrong.

As a result, instead of writing a quick, off-the-cuff response in a few 280-character tweets, I read the paper and Gary Marcus's Substack post, reproduced some of the paper's results, and then wrote this 4000-word post.

Ironically, given that it's currently June 11th (two days after my last tweet was posted), my final tweet provides two examples of the planning fallacy.

2.

I don’t want to bury the lede here. While I find some of the observations interesting, I was quite disappointed by the paper given the amount of hype around it. The paper seems to reflect generally sloppy work and the authors overclaim what their results show (albeit not more so than the average ML conference submission). The paper fails to back up the authors’ claim that language models cannot “reason” due to “fundamental limitations”, or even (if you permit some snark) their claim that they performed “detailed analysis of reasoning traces”.

By now, others have highlighted many of the issues with the paper: see for example Twitter threads by Ryan Greenblatt or Lisan al Gaib, as well as the paper drafted by Alex Lawsen and Claude Opus 4[1] and Zvi Mowshowitz's Substack post. Or, if you're feeling really spicy, you can ask any of Gemini 2.5, o3, or Opus 4 to critique the paper as if they were reviewer #2.

3.

It's important to keep in mind that this paper is not a bombshell dropped out of the blue. Instead, it's merely the latest entry in 60 years of claims about neural networks' fundamental limitations. I'm by no means an expert in this literature, but here are 6 examples off the top of my head:

  1. In the late 1960s, Minsky and Papert published a book showing that single-layer perceptrons (a precursor to modern MLPs) cannot represent XOR, helping trigger the first AI winter.
  2. Gary Marcus argued in the 1990s and 2000s that undifferentiated, fully-connected neural networks cannot learn important aspects of natural language.
  3. From the 1990s to the mid 2010s, researchers from the statistical learning theory tradition argued that the class of hypotheses represented by neural networks has high intrinsic VC dimension – that is, such hypotheses are hard to learn in the worst case.
  4. A group of researchers from the natural language processing community have recently argued that large language models (LLMs) are “stochastic parrots” that probabilistically link together words and sentences without consideration of their meaning. A related line of academic work argues that transformers cannot learn causality from statistical data.
  5. Yet another line of work looks at the complexity classes of circuits that transformers can represent and finds that finite-precision transformers correspond to uniform TC0 – a very restricted class of circuits.
  6. The most related line of work to the Illusion of Thinking paper involves generating simple problems that humans can solve but that LLMs cannot – probably the highest profile of these is ARC-AGI, but other examples include the much earlier CommonSenseQA or some of Gary Marcus's puzzles. Also, LLMs cannot multiply 10-digit numbers.

Broadly speaking, the arguments tend to take the following form:

  • The authors concede that neural networks/LLMs can do seemingly impressive things in practice.
  • Current techniques fail to generalize to the clearly correct solution in a theoretical setting, or they fail empirically in a simple toy setting.
  • Ergo, their apparent impressiveness in practice is an illusion resulting from regurgitating memorized examples or heuristics from the training dataset.

(Unsurprisingly, the Illusion of Thinking paper also follows this structure: the authors create four toy settings, provide evidence that current LLMs cannot solve them when you scale the toy settings to be sufficiently large, and conclude that they must suffer from fundamental limitations of one form or another.)

It’s worth noting that there exist standard responses to each of the lines of work on “fundamental limitations” I mentioned above:

  1. Current neural networks have many layers, which allows them to represent more complicated functions.
  2. Current LLMs are not undifferentiated, fully connected neural nets. In fact, the field of deep learning as a whole moved away from fully connected neural networks in the mid 2010s, with the widespread adoption of CNNs and LSTMs (and later the transformer architecture).
  3. Several recent lines of theoretical work have argued that overparameterized neural networks exhibit a tendency toward simple solutions, such as the work on double descent, Singular Learning Theory, or Principles of Deep Learning Theory. Also, this is consistent with empirical work studying neural network generalization or adversarial examples.
  4. Academic work has shown that LLMs seem to be able to consistently respond with correct causal reasoning in a way inconsistent with pure dataset memorization, and there are even theoretical results in toy settings that show how this causal reasoning may arise. Also, many of the issues pointed to in earlier stochastic parrot papers seem to have been mitigated with increasing model scale.
  5. Increasing the precision of the attention mechanism greatly increases the representation power of the transformer forward pass. Also, while each individual forward pass may have limited capability, adding chain-of-thought (even while keeping precision fixed) also greatly increases the computational complexity of problems transformers can solve.
  6. Language models seem to be consistently getting better at all of these benchmarks – for example, o3 can solve 60.8% of ARC-AGI-1 problems, compared to a mere 30% from o1 and 4.5% from GPT-4o. CommonSenseQA is effectively retired, in that it's too easy for all frontier language models now, and o3/Sonnet 4 can both respond appropriately to all of the examples in that Gary Marcus post. Also, while frontier LLMs still cannot do 10-digit multiplication reliably, the length of multiplication problem they can solve has been increasing over time – as recently as five years ago, we were commenting on the fact that LLMs couldn't even reliably do 2-digit multiplication!

Again, I want to emphasize that the Illusion of Thinking paper is not a bombshell dropped out of the blue. It exists in a context of much prior work arguing both for the existence of limitations and against the applicability of these limitations in practice. Even without diving into this paper, it's worth tempering your expectations for how much it should really affect your beliefs about the fundamental limits of current LLMs.

4.

Having taken a long digression into historical matters, let us actually go over the content of the Illusion of Thinking paper.

The authors start by creating four different “reasoning tasks”, each parameterized by the number of objects in the problem n (which the authors refer to as the ‘complexity’ of the problem):[2]

Tower of Hanoi, where the model needs to output the (2^n - 1) steps needed to solve a Tower of Hanoi problem with n disks.

Checkers Jumping, where there are n blue checkers and n red checkers lined up on a board with (2n+1) spaces and the model needs to output the minimum sequence of moves to flip the initial board position.

River Crossing, where there are n pairs of actors and agents trying to cross a river on a boat that can hold k people, where the boat cannot travel empty, and where no actor can be in the presence of another agent without their own agent being present. This is generally known as the Missionaries and Cannibals problem (or sometimes the Jealous Husbands Problem).

Blocks World, where there are n ordered blocks divided evenly between two stacks with n/2 blocks each, with the goal of consolidating the two stacks into a single ordered stack using a third empty stack.

On all four tasks, the models are scored by their accuracy – the fraction of model generations that lead to a 100% correct solution.
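To make this scoring concrete, here is a minimal sketch of what exact-match grading might look like for the Tower of Hanoi task (my own sketch, not the authors' actual evaluation harness): a generation counts as correct only if every move is legal and the final state has all n disks on the target peg.

```python
def check_hanoi_solution(n, moves):
    """Return True iff `moves` (a list of (from_peg, to_peg) pairs, pegs numbered
    0-2) legally transfers all n disks from peg 0 to peg 2. A single illegal move
    anywhere fails the whole attempt, mirroring the all-or-nothing accuracy metric."""
    pegs = [list(range(n, 0, -1)), [], []]   # disk n at the bottom, disk 1 on top
    for src, dst in moves:
        if not pegs[src]:
            return False                     # tried to move from an empty peg
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return False                     # tried to place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))  # everything must end on peg 2, in order

# e.g. the optimal 7-move solution for n=3 passes:
assert check_hanoi_solution(3, [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)])
```

Note that under this kind of metric, a single transposed move anywhere in a multi-thousand-step transcript zeroes out the entire attempt.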

The authors then run several recent language models on all four tasks, and find that for each task and model, there appears to be a threshold after which accuracy seems to drop to zero. They argue that the existence of this “collapse point” suggests that LLMs cannot truly be doing “generalizable reasoning”.

The authors also do some statistical analysis of the model generated chains-of-thought (CoTs). From this, they first find that  “models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty”.  They also find that in the Tower of Hanoi case, “counterintuitively”, providing the correct algorithm to the model does not seem to improve performance. Finally, they find that Claude 3.7 Sonnet can solve Tower of Hanoi with n=5 but not River Crossing with n=3, and argue that this is the result of River Crossing not being on the internet.[3] 

5.

A classic blunder when interpreting model evaluation results is to ignore simple, mundane explanations in favor of the fancy hypothesis being tested. I think that the Illusion of Thinking contains several examples of this blunder.[4]

When I reproduced the paper's results on the Tower of Hanoi task, I noticed that for n >= 9, Claude 3.7 Sonnet would simply state that the task required too many tokens to complete manually, provide the correct Tower of Hanoi algorithm, and then output an (incorrect) solution in the desired format without reasoning about it. When I provide the question to Opus 4 on the Claude chatbot app, it regularly refuses to even attempt the manual solution![5] And for n=15 or n=20, none of the models studied have enough context length to output the correct answer, let alone reason their way to it in the authors' requested format.

A prototypical response from Claude Opus 4, where it calls the n=10 Tower of Hanoi task "extremely tedious and error prone" and refuses to do it.

The authors call it "counterintuitive" that language models use fewer tokens at high complexity, suggesting a "fundamental limitation." But this simply reflects models recognizing their limitations and seeking alternatives to manually executing thousands of possibly error-prone steps  – if anything, evidence of good judgment on the part of the models!

For River Crossing, there's an even simpler explanation for the observed failure at n ≥ 6: the problem is mathematically impossible, as proven in the literature; see, e.g., page 2 of this arxiv paper.[6]
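(If you would rather check the impossibility claim yourself than chase the citation, a brute-force search settles it in seconds. The sketch below is mine, not the paper's; it enforces the actor/agent safety constraint on the two banks only, ignoring whatever happens on the boat, which can only make the puzzle easier, and it still finds no solution for n = 6 pairs with a k = 3 boat, while n = 5 remains solvable.)

```python
from collections import deque
from itertools import combinations

def river_crossing_solvable(n, k):
    """Breadth-first search over (left-bank occupants, boat side) states for the
    actors/agents puzzle with n pairs and boat capacity k. The safety constraint
    (no actor on a bank with another agent unless their own agent is present) is
    checked on both banks after every crossing."""
    people = frozenset((i, role) for i in range(n) for role in ("actor", "agent"))

    def safe(bank):
        actors = {i for i, r in bank if r == "actor"}
        agents = {i for i, r in bank if r == "agent"}
        return all(i in agents or not agents for i in actors)

    start = (people, 0)             # everyone on the left bank, boat on the left
    seen, queue = {start}, deque([start])
    while queue:
        left, boat = queue.popleft()
        if not left:                # everyone has reached the right bank
            return True
        bank = left if boat == 0 else people - left
        for size in range(1, k + 1):
            for group in combinations(bank, size):
                group = frozenset(group)
                new_left = left - group if boat == 0 else left | group
                if safe(new_left) and safe(people - new_left):
                    state = (new_left, 1 - boat)
                    if state not in seen:
                        seen.add(state)
                        queue.append(state)
    return False

print(river_crossing_solvable(5, 3))  # True: 5 pairs can cross with a 3-person boat
print(river_crossing_solvable(6, 3))  # False: no sequence of crossings works
```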

Of the two environments I investigated in detail, there seem to be mundane reasons explaining the apparent collapse that the authors failed to consider. I did not have time to investigate or reproduce their results for the other two tasks, but I’d be surprised if similar problems didn’t plague the authors’ results for those as well.

Again, evals failing for mundane, boring reasons unrelated to the question you’re investigating (such as, “the model refuses to do it” or “the problem is impossible”) is a common experience in the field of LM evals. This is precisely why it’s so important to look at your data instead of just statistically testing your hypothesis or running a regex! The fact that the authors seemed to miss the explanation for why reasoning tokens decrease for large n suggests to me that they did not look at their data very carefully (if at all), and the fact that they posed an impossible problem as one of their four environments suggests that they also did not think about their environments very carefully.

I want to emphasize again that this is not an unusually bad ML paper. Writing good papers is hard and this is a preprint that has not been peer reviewed. Anyone who’s served as a peer reviewer for an ML conference knows that the ladder of paper quality goes all the way up and all the way down. Insofar as anyone involved with the paper deserves criticism beyond the standard sort, it’s the people who hyped it up based on the headline result or because it fit their narrative.

That being said, on the basis of research rigor alone, I think there’s good reason to doubt the conclusions of the paper. I do not think that this paper is particularly noteworthy as a contribution to the canon of fundamental limitations, let alone a “knockout blow” that shows the current LLM paradigm is doomed.

6.

Suppose we accepted the authors' results at face value, and accepted that language models could never manually execute thousands of algorithmic steps without error. Should we then conclude that LLMs fundamentally cannot do “generalizable reasoning” as we understand it? 

There’s a common assumption in many LLM critiques that reasoning ability is binary: either you have it, or you don’t. Either you have true generalization, or you have pure memorization.  Under this dichotomy, showing that LLMs fail to learn or implement the general algorithm in a toy example is enough to conclude that they must be blind pattern matchers in general. In the case of the Illusion of Thinking paper, the implicit argument is that, if an LLM cannot flawlessly execute simple toy algorithms, this constitutes damning evidence against generalizable reasoning. They argue this even though frontier LLMs can implement the algorithms in Python, often provide shorter solutions that fit within their context windows, and explain how to solve the problem in detail when asked. 

I’d argue this dichotomy does not reflect much of what we think of as “reasoning” in the real world. People can consistently catch thrown balls without knowing about or solving differential equations (as in the classic Richard Dawkins quote).[7] Even in the realm of mathematics, most mathematicians work via intuition and human-level abstractions, and do not just write formal Lean programs. And there’s a reason why some sociologists argue that heuristics learned from culture, not pure reasoning ability, are the secret of humanity’s success.

Whenever we deal with agents with bounded compute grappling with complicated real-world environments, we’ll see a reliance on heuristics that have worked in the past. In this view, generalization can be the limit of memorization: given only a few examples on a new task, you might start by memorizing individual data points, then learn “shallow heuristics”, and finally generalize to deeper, more useful heuristics. 

This is why I find most of the “fundamental limitation” claims to be unconvincing. The interesting question isn't whether a bounded agent relies on learned heuristics (of course it does!), but rather how well those heuristics generalize to domains of interest. Focusing on whether LLMs can implement simple algorithms in toy settings or theoretical domains, without consideration of how these results will apply elsewhere, risks missing this point entirely.

I'll concede to the authors that LLMs are clearly not best thought of as traditional computers. In fact, I'll concede that there's no way a modern LLM can output the 32,767 steps of the answer to the n=15 Tower of Hanoi in the authors' desired format, while even a simple Python script (written by one of these LLMs, no less) can do this in less than a second.
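(For concreteness, the kind of script I have in mind is just the textbook recursion below; I've omitted the specific output format the authors require, since reproducing it isn't the point.)

```python
def hanoi_moves(n, source=0, target=2, spare=1):
    """Return the (2**n - 1)-move solution as a list of (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, source, spare, target)
            + [(source, target)]
            + hanoi_moves(n - 1, spare, target, source))

moves = hanoi_moves(15)
print(len(moves))  # 32767: the full n=15 solution, generated near-instantly
```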

But at the risk of repeating myself, do the results thereby imply that LLMs cannot do “generalizable reasoning”? To answer this question, I argue that we ought to be able to look at evidence other than a simple binary “can the LLM implement the general algorithm manually”. For example, perhaps we should consider evidence like the fact that frontier LLMs can implement the algorithm in Python, provide shorter solutions, and explain how to solve the problem – all of which suggest that the LLMs do understand the problem.[8] I think insofar as the results show that there’s a real, fundamental limit on whether LLMs can manually execute algorithms for hundreds or thousands of steps, this is a very different claim than “LLMs cannot do generalizable reasoning”.

7.

I have a confession: setting aside the abstract arguments above, much of my interest in the matter is personal. Namely, seeing arguments about the fundamental limitations of LLMs sometimes makes me question the degree to which I can do “generalizable reasoning”.

People who know me tend to comment that I “have a good memory”. For example, I remember the exact blunder I made in a chess game with a friend two years ago on this day, as well as the conversations I had that day. By default, I tend to approach problems by quickly iterating through a list of strategies that have worked on similar problems in the past, and insofar as I do first-principles reasoning, I try my best to amortize the computation by remembering the results for future use. In contrast, many people are surprised when I can’t quickly solve problems requiring a lot of computation.

That’s not to say that I can’t reason; after all, I argue that writing this post certainly involved a lot of “reasoning”. I’ve also met smart people who rely even more on learned heuristics than I do. But from the inside it really does feel like much of my cognition is pattern matching (on ever-higher levels). Much of this post drew on arguments or results that I’ve seen before; and even the novel work involved applying previous learned heuristics.

I almost certainly cannot manually write out 1023 Tower of Hanoi steps without errors – like 3.7 Sonnet or Opus 4, I'd write a script instead. By the paper's logic, I lack “generalizable reasoning”. But the interesting question was never whether I can flawlessly execute an algorithm by hand; it's whether I can apply the right tools or heuristics to a given problem.

8.

To his credit, in his Substack post, Gary Marcus does mention the critique I provide above. However, he dismisses this by saying that:

That is, he makes the claim that in order to be AGI, the system needs to be able to reliably do 13-digit arithmetic, or consistently write out, by hand, the solution to Tower of Hanoi with 8 disks.

There are two responses I have to this: a constructive response and a snarky response.

The constructive response is that current LLMs such as Claude 3.7 Sonnet and o3 can easily write software that solves these tasks. Sure, an LLM might not be able to do “generalized reasoning” in the sense that the authors propose, but an LLM with a simple code interpreter definitely can. Here, the key question is why we must consider the LLM by itself, as opposed to an AI agent composed of an LLM and an agent scaffold – note that even chatbot-style apps such as ChatGPT provide the LLM with various tools such as a code interpreter and internet access. Why should we limit our discussion of AGIs to just the LLM component of an AI system, as opposed to the AI system as a whole?
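To be concrete about what “an LLM with a simple code interpreter” means here, the scaffold can be almost trivial. The sketch below is schematic and mine alone: `call_llm` is a hypothetical stand-in for whatever chat API you have access to, and a real scaffold would add sandboxing and error-driven retries.

```python
import subprocess
import tempfile

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-model API call; returns Python source code."""
    raise NotImplementedError("wire this up to whatever model/API you use")

def solve_with_code_interpreter(task_description: str) -> str:
    """Ask the model for a program that solves the task, run that program,
    and return its stdout."""
    code = call_llm(
        "Write a standalone Python program that prints the full solution to the "
        f"following puzzle, one move per line:\n\n{task_description}"
    )
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        ["python", path], capture_output=True, text=True, timeout=60
    )
    return result.stdout
```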

The snarky response is that sure, I'm happy to concede that “AGI” (as envisioned by Gary Marcus) requires being able to multiply 13-digit numbers or flawlessly write out a 1023-step Tower of Hanoi solution. But there's a sort of intelligence that humans possess, that is general in that it works across many domains, and that does not require being able to multiply 13-digit numbers or write out 1023 steps of Tower of Hanoi. This is the sort of intelligence that can notice when a computer algorithm would be better for a problem and write that algorithm instead of solving the problem by hand. This is the sort of intelligence that allows researchers to come up with new ideas, construction workers to tackle problems in the real world, and salespeople to persuade their customers to buy a product. This is the sort of intelligence that I use when I apply complex heuristics acquired from decades of reading and writing as I write this paragraph. When I think about whether or not LLMs have “fundamental limitations”, I'm interested in whether or not they might become superhumanly intelligent in this sense, not whether or not they're “AGI” in the sense laid out by Gary Marcus.

Or, if you’ll permit an amateur attempt at a meme:

9.

Having discussed why I think the paper is a bad critique of the existing LLM paradigm and why I find Gary Marcus’s rebuttal unconvincing, let us get back to the question of what a good “fundamental limitations” critique of LLMs would look like.

First, instead of relying only on a few toy examples, a good critique should ideally be based either on strong empirical trends or on good, applicable theoretical results. The history of machine learning is full of results on toy examples that failed to generalize outside of their narrow domains; if you want to convince me that there's a “fundamental limitation”, you'll need to offer me either a mathematical proof or strong empirical evidence.

Second, the critique should address the key reasons why people are bullish about LLMs – e.g. their ability to converse in fluent English; their ability to write code; their broad and deep knowledge of subjects such as biology; and their increasing ability to act as software agents. At the very least, the critique should explain both why it does not apply in these cases and why we should expect the limitation to matter in the future.

Yes, I know Gary Marcus doesn't like this graph. If there's enough interest, I'll write another post responding to his critiques.

Finally, critiques need to argue why these limitations will continue to apply despite continuing AI progress and the best efforts of researchers trying to overcome them. Many of the failure modes that LLMs exhibited in the past were solved over time with a combination of scale and researcher effort. In 2019, GPT-2 was barely able to string together sentences, and in 2020 GPT-3 could barely do 2-digit arithmetic. The LLMs of today can often accomplish software tasks that take humans an hour to complete.

Good critiques of LLMs exist – in fact, Gary Marcus has made many better critiques of LLM capabilities in the past, as have the authors of the Illusion of Thinking paper. But invariably, better critiques tend to be more boring and empirical; not a singular knock-down argument but a back-and-forth discussion with many small examples and personal anecdotes. And instead of originating from outside the field, these critiques center around issues that people who work on LLMs talk extensively about.

I’ll provide 3 such critiques below:

  1. There seem to be computational limitations to current LLMs. No modern LLM handles arbitrarily long context windows, and performance degrades over very long contexts. More importantly, the amount of compute required to train LLMs has been growing at an exponential rate, and this exponential trend cannot continue for very long into the future. We also might run out of data for pre-training, or of sufficiently diverse, long-horizon training environments for RL training. (That being said, I'm not sure how important handling super long contexts is, what level of capabilities we'll hit before running out of compute/data/environments, or how fundamental these limits are in the face of human research effort.)
  2. LLMs can be sensitive to the prompt in hard-to-foresee ways. Changing the framing of problems can greatly impact their behavior – again, see the GSM-NoOp work from the Illusion of Thinking authors.[9] This suggests that some sophisticated LLM behavior may be hard to elicit, if not entirely memorized. (That being said, LLMs seem to be becoming less susceptible to being tricked and more capable of solving novel problems over time. Also, humans are famously influenceable by small changes in framing as well.)
  3. LLMs have hallucinations and suffer from reliability issues in general. When I ask o3 or Claude Opus 4 to do research for me, I need to check their work, because they’ll sometimes flat-out lie about what a citation says. (But again, it seems that these issues are getting better over time, as evidenced by the METR time horizon results. Also, having worked with humans, I assure you that humans also make stuff up and suffer from reliability issues.)

Could these be fundamental limitations? I think it's possible. It's possible that I'll learn tomorrow that we cannot train models using more compute than our current ones, or observe that models continue to be easily distracted by irrelevant details in their prompts, or see the trend of increasing reliability stall at a level far below humans. I'd still want to think about whether these purported limitations would hold up against researchers trying to address them. But if they do, I would argue that these are good reasons to expect the modern LLM paradigm to hit a dead end.

None of these will be an “LLMs are a dead field” knockout blow. Insofar as LLM hype dies down due to limitations like these, it'll have been a death by a thousand cuts as evidence accumulates over time and trends reveal themselves. It will not be due to a single paper purporting to show “fundamental limitations”.

10.

One delightful irony is that, I suspect, most people would agree with the following tweet by Josh Wolfe, regardless of their thoughts on Gary Marcus-style skepticism:

LLM skeptics can read this as Apple vindicating the long-ignored, sage arguments from one of the foremost skeptics of LLM capabilities. But for others, “Gary Marcus” is synonymous with making pointless critiques that will soon be proven irrelevant, while completely failing to address the cruxes of those he’s arguing against.

I think this is a sad state of affairs. I much prefer a world in which “Gary Marcus”ing means making good, thoughtful critiques, engaging in good faith with critics, and focusing on the strongest points in favor of the skeptical position.

Empirically, this is not what is happening. Over the course of drafting this post, Gary Marcus has doubled down on this paper being conclusive evidence for LLM limitations, both on Twitter:

And in an opinion piece posted in the Guardian, where he points specifically to large n Tower of Hanoi as evidence for fundamental limitations:

There's plenty of room for nuanced critiques of LLMs. Lots of the LLM commentary is hype. Twitter abounds with hyperbolic statements that deserve to be brought back to earth. All language models have limited context windows, show sensitivity to prompts, and suffer from hallucinations. Most relevantly, AIs are worse than humans at many important things: despite their performance on benchmarks, Opus 4 and o3 cannot wholesale replace even moderately competent software engineers, notwithstanding many claims that software engineering is a dead discipline. The world needs thoughtful critiques of LLM capabilities grounded in empiricism and careful reasoning.

But the world does not need more tweets or popular articles misrepresenting studies (on either side of the AI debate), clinging to false dichotomies, and making invalid arguments. Useful critiques of the LLM paradigm need to go beyond theoretical claims or extrapolation from toy problems far removed from practice. Good-faith criticism should focus on the capabilities that “AGI believers” are hopeful for or concerned about, rather than redefining AGI to mean something else in order to dismiss their hopes or concerns out of hand, or defining “generalizable reasoning” in a way that implies the participants in the conversation themselves lack it.

The appeal of claiming fundamental limitations is obvious, as is the comparatively unsatisfying nature of empirical critiques. But given the track record, I continue to prefer reading careful analysis of empirical experiments over appreciating the “true significance” of bombastic, careless claims about so-called “fundamental limitations”.

 

Acknowledgements

Thanks to Ryan Greenblatt and Sydney von Arx for comments on this post. Thanks also to Justis Mills for copyediting assistance. 

 

  1. ^

    Edited to add: Though this paper is also quite sloppy, and I don't think all of the claims hold up. For example, it claims without citation that the block problem is PSPACE and river crossing is NP-hard. The former seems flat-out incorrect (you can clearly verify solutions efficiently, as the authors do). Generalized river crossing with arbitrary constraints and k=3 is known to be NP-hard, but I don't think it's the case for Agents/Actors or Missionaries/Cannibals. Maybe Opus got confused by how the river crossing problem was generalized?

  2. ^

     It’s worth noting that “complexity” as the authors use it is not the standard “computational complexity” – instead, the “complexity” of a problem is the number of objects n in the problem. Later on, the authors talk about the number of steps in an optimal solution; this is closer to computational complexity but not the same. For example, even though the solution for the Checkers Jumping task has length quadratic in n, the basic “guess and check” algorithm for finding this solution requires a number of steps exponential in n. Similarly, while the minimum solution length for Blocks World also scales linearly with the number of blocks n, the basic solution requires exploring an exponentially large state space.

  3. ^

     This “counterintuitive result” that Claude Sonnet “achieves near-perfect accuracy when solving the Tower of Hanoi with (N=5), which requires 31 moves, while it fails to solve the River Crossing puzzle when (N=3), which has a solution of 11 moves” has a simple explanation – the former requires executing a simple deterministic algorithm with 31 steps, while the latter requires searching over a much larger space of possible solutions.

    The authors' speculation that “... examples of River Crossing with N>2 are scarce on the web” also seems incorrect – a quick Google search for either Missionaries and Cannibals or the Jealous Husbands problem shows that there are plenty of n=3, k=2 solutions on the internet, including on Wikipedia. If anything, the fact that Claude 3.7 Sonnet fails at this task suggests that it is earnestly trying to solve the task, as opposed to regurgitating a memorized solution (!).

  4. ^

     The standard remedy for this blunder is to read model transcripts. Note that high-level statistical analysis can often fail to notice these simple alternative explanations (as seems to have happened with the authors here).

  5. ^

    Arguably, this behavior is a natural consequence of their RL training, where the environments tend to look like “solve a complicated math problem” or “write correct code for a coding task”, and not “manually execute an algorithm for hundreds of steps”. After all, if you’re given a coding task and you try to solve it by manually executing the algorithm, you’re probably not going to end up doing particularly well on the task. 

  6. ^

    (Edited to clarify: specifically, the authors use k=3 boat capacity for all problems with n>2 pairs. But for n>5 pairs, you need at least k=4 capacity to solve the problem.) 

  7. ^

     In fact, it seems likely that humans (and dogs!) follow a simple heuristic that allows them to chase down and catch a thrown ball.

  8. ^

    This is also my explanation for the authors' “counterintuitive observation” that giving LLMs the algorithm doesn’t improve their performance on the task – they already know the algorithm, it’s just hard for them to manually execute it for hundreds or thousands of steps in the requested format. 

  9. ^

    My best steelman of the Illusion of Thinking paper is also in this vein – the models seem to do a lot better on River Crossing with n=3, k=2 when you call it by the more common name of “jealous husbands” or “missionaries and cannibals”, rather than “actors and agents”. In fact, if you read their CoT, it seems that 3.7 Sonnet/Opus 4 can sometimes get the correct answer in their output even when their CoTs fail to get to the correct answer, suggesting that their performance here comes from memorizing a solution in their training data.