Beware General Claims about “Generalizable Reasoning Capabilities” (of Modern AI Systems)

by Lawrence Chan
11th Jun 2025
4 comments, sorted by top scoring
DanielFilan

Ironically, given that it's currently June 11th (two days after my last tweet was posted), my final tweet provides two examples of the planning fallacy.

"Hopefully" is not a prediction!

Lawrence Chan

Fair, but in my head I did plan to get it done on the 10th. The tweet is not in itself the prediction; it's just evidence that I made the prediction in my head.

And indeed I did finish the draft on June 10th, but at 11 PM and I decided to wait for feedback before posting. So I wasn't that off in the end, but I still consider it off. 

Hruss

I find that studies criticizing current models are often cited long after the issue has been fixed, or without consideration of what they actually showed. I wish technology reporting were more careful, as much of this misunderstanding seems to come from journalistic sources. Examples:

Hands in diffusion models

Text in diffusion models

Water usage

Model collapse - not an issue for actual commercial AI models; the original study was about synthetic data production, with model outputs fed back directly as the exclusive training data

LLMs = Autocorrect - chat models have RLHF post-training

Nightshade/Glaze: useless for modern training methods

AI understanding - yes, the weights are not understood, but the overall architecture is

 

It is surprising how often I hear these claims repeated with false context.

Lawrence Chan

There are indeed many, many silly claims out there, on either side of any debate. And yes, the people pretending that the AIs of 2025 have the limitations of those from 2020 are being silly, journalist or no.

I do want to clarify that I don't think this is a (tech) journalist problem. Presumably when you mention Nightshade dismissively, it's a combination of two reasons: 1) Nightshade artefacts are removable via small amounts of Gaussian blur and 2) Nightshade can't be deployed at scale on enough archetypal images to have a real effect? If you look at the Nightshade website, you'll see that the authors lie about 1):

As with Glaze, Nightshade effects are robust to normal changes one might apply to an image. You can crop it, resample it, compress it, smooth out pixels, or add noise, and the effects of the poison will remain.

So (assuming my recollection that Nightshade is defeatable by Gaussian noise is correct) this isn't an issue of journalists making stuff up or misunderstanding what the authors said, it's the authors putting things in their press release that, at the very least, are not at all backed up by their paper. 

(Also, either way, Gary Marcus is not a tech journalist!)


1.

Late last week, researchers at Apple released a paper provocatively titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity”, which “challenge[s] prevailing assumptions about [language model] capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning”.

Normally I refrain from publicly commenting on newly released papers. But then I saw the following tweet from Gary Marcus:

I have always wanted to engage thoughtfully with Gary Marcus. In a past life (as a psychology undergrad), I read both his work on infant language acquisition and his 2001 book The Algebraic Mind; I found both insightful and interesting. From reading his Twitter, Gary Marcus is thoughtful and willing to call it like he sees it. If he's right about language models hitting fundamental barriers, it's worth understanding why; if not, it's worth explaining where his analysis went wrong.

As a result, instead of writing a quick, off-the-cuff response in a few 280-character tweets, I read the paper and Gary Marcus's Substack post, reproduced some of the paper's results, and then wrote this 4000-word post.

Ironically, given that it's currently June 11th (two days after my last tweet was posted), my final tweet provides two examples of the planning fallacy.

2.

I don’t want to bury the lede here. While I find some of the observations interesting, I was quite disappointed by the paper given the amount of hype around it. The paper seems to reflect generally sloppy work and the authors overclaim what their results show (albeit not more so than the average ML conference submission). The paper fails to back up the authors’ claim that language models cannot “reason” due to “fundamental limitations”, or even (if you permit some snark) their claim that they performed “detailed analysis of reasoning traces”.

By now, others have highlighted many of the issues with the paper: see for example Twitter threads by Ryan Greenblatt or Lisan al Gaib, as well as the paper drafted by Alex Lawsen and Claude Opus 4[1] and Zvi Mowshowitz's Substack post. Or, if you're feeling really spicy, you can ask any of Gemini 2.5, o3, or Opus 4 to critique the paper as if they were reviewer #2.

3.

It's important to keep in mind that this paper is not a bombshell dropped out of the blue. Instead, it's merely the latest entry in 60 years of claims about neural networks' fundamental limitations. I'm by no means an expert in this literature, but here are 6 examples off the top of my head:

  1. In the late 1960s, Minsky and Papert published a book showing that single-layer perceptrons (a precursor to modern MLPs) cannot represent XOR, helping trigger the first AI winter.
  2. Gary Marcus argued in the 1990s and 2000s that undifferentiated, fully-connected neural networks cannot learn important aspects of natural language.
  3. From the 1990s to the mid 2010s, researchers from the statistical learning theory tradition argued that the class of hypotheses represented by neural networks has high intrinsic VC dimension – that is, such hypotheses are hard to learn in the worst case.
  4. A group of researchers from the natural language processing community have recently argued that large language models (LLMs) are “stochastic parrots” that probabilistically link together words and sentences without consideration of their meaning. A related line of academic work argues that transformers cannot learn causality from statistical data.
  5. Yet another line of work looks at the complexity classes of circuits that transformers can represent and finds that finite-precision transformers correspond to uniform TC0 – a very restricted class of circuits.
  6. The most related line of work to the Illusion of Thinking paper involves generating simple problems that humans can solve but that LLMs cannot – probably the highest profile of these is ARC-AGI, but other examples include the much earlier CommonSenseQA or some of Gary Marcus's puzzles. Also, LLMs cannot multiply 10-digit numbers.

Broadly speaking, the arguments tend to take the following form:

  • The authors concede that neural networks/LLMs can do seemingly impressive things in practice.
  • Current techniques fail to generalize to the clearly correct solution in a theoretical setting, or they fail empirically in a simple toy setting.
  • Ergo, their apparent impressiveness in practice is an illusion resulting from regurgitating memorized examples or heuristics from the training dataset.

(Unsurprisingly, the Illusion of Thinking paper also follows this structure: the authors create four toy settings, provide evidence that current LLMs cannot solve them when you scale the toy settings to be sufficiently large, and conclude that they must suffer from fundamental limitations of one form or another.)

It’s worth noting that there exist standard responses to each of the lines of work on “fundamental limitations” I mentioned above:

  1. Current neural networks have many layers, which allows them to represent more complicated functions.
  2. Current LLMs are not undifferentiated, fully connected neural nets. In fact, the field of deep learning as a whole moved away from fully connected neural networks in the mid 2010s, with the widespread adoption of CNNs and LSTMs (and later the transformer architecture).
  3. Several recent lines of theoretical work have argued that overparameterized neural networks exhibit a tendency toward simple solutions, such as the work on double descent, Singular Learning Theory, or Principles of Deep Learning Theory. Also, this is consistent with empirical work studying neural network generalization or adversarial examples.
  4. Academic work has shown that LLMs seem to be able to consistently respond with correct causal reasoning in a way inconsistent with pure dataset memorization, and there are even theoretical results in toy settings that show how this causal reasoning may arise. Also, many of the issues pointed to in earlier stochastic parrot papers seem to have been mitigated with increasing model scale.
  5. Increasing the precision of the attention mechanism greatly increases the representation power of the transformer forward pass. Also, while each individual forward pass may have limited capability, adding chain-of-thought (even while keeping precision fixed) also greatly increases the computational complexity of problems transformers can solve.
  6. Language models seem to be consistently getting better at all of these benchmarks – for example, o3 can solve 60.8% of ARC-AGI-1 problems, compared to a mere 30% from o1 and 4.5% from GPT-4o. CommonSenseQA is effectively retired, in that it's too easy for all frontier language models now, and o3/Sonnet 4 can both respond appropriately to all of the examples in that Gary Marcus post. Also, while frontier LLMs still cannot do 10-digit multiplication reliably, the length of multiplication problem they can solve has been increasing over time – as recently as five years ago, we were commenting on the fact that LLMs couldn't even reliably do 2-digit multiplication!

Again, I want to emphasize that the Illusion of Thinking paper is not a bombshell dropped out of the blue. It exists in a context of much prior work arguing both for the existence of limitations and against the applicability of these limitations in practice. Even without diving into this paper, it's worth tempering your expectations for how much it should really affect your beliefs about the fundamental limits of current LLMs.

4.

Having taken a long digression into historical matters, let us actually go over the content of the Illusion of Thinking paper.

The authors start by creating four different “reasoning tasks”, each parameterized by the number of objects in the problem n (which the authors refer to as the ‘complexity’ of the problem):[2]

Tower of Hanoi, where the model needs to output the (2^n - 1) steps needed to solve a Tower of Hanoi problem with n disks.

Checkers Jumping, where there are n blue checkers and n red checkers lined up on a board with (2n+1) spaces and the model needs to output the minimum sequence of moves to flip the initial board position.

River Crossing, where there are n pairs of actors and agents trying to cross a river on a boat that can hold k people, where the boat cannot travel empty, and where no actor can be in the presence of another agent without their own agent being present. This is generally known as the Missionaries and Cannibals problem (or sometimes the Jealous Husbands Problem).

Blocks World, where there are n ordered blocks divided evenly between two stacks with n/2 blocks each, with the goal of consolidating the two stacks into a single ordered stack using a third empty stack.

On all four tasks, the models are scored by their accuracy – the fraction of model generations that lead to a 100% correct solution.
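To make this scoring concrete, here is a minimal sketch of what exact-match grading might look like for the Tower of Hanoi task (my own sketch, not the authors' actual evaluation harness): a generation counts as correct only if every move is legal and the final state has all n disks on the target peg.

```python
def check_hanoi_solution(n, moves):
    """Return True iff `moves` (a list of (from_peg, to_peg) pairs, pegs numbered
    0-2) legally transfers all n disks from peg 0 to peg 2. A single illegal move
    anywhere fails the whole attempt, mirroring the all-or-nothing accuracy metric."""
    pegs = [list(range(n, 0, -1)), [], []]   # disk n at the bottom, disk 1 on top
    for src, dst in moves:
        if not pegs[src]:
            return False                     # tried to move from an empty peg
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return False                     # tried to place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))  # everything must end on peg 2, in order

# e.g. the optimal 7-move solution for n=3 passes:
assert check_hanoi_solution(3, [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)])
```

Note that under this kind of metric, a single transposed move anywhere in a multi-thousand-step transcript zeroes out the entire attempt.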

The authors then run several recent language models on all four tasks, and find that for each task and model, there appears to be a threshold after which accuracy seems to drop to zero. They argue that the existence of this “collapse point” suggests that LLMs cannot truly be doing “generalizable reasoning”.

The authors also do some statistical analysis of the model generated chains-of-thought (CoTs). From this, they first find that  “models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty”.  They also find that in the Tower of Hanoi case, “counterintuitively”, providing the correct algorithm to the model does not seem to improve performance. Finally, they find that Claude 3.7 Sonnet can solve Tower of Hanoi with n=5 but not River Crossing with n=3, and argue that this is the result of River Crossing not being on the internet.[3] 

5.

A classic blunder when interpreting model evaluation results is to ignore simple, mundane explanations in favor of the fancy hypothesis being tested. I think that the Illusion of Thinking contains several examples of this blunder.[4]

When I reproduced the paper's results on the Tower of Hanoi task, I noticed that for n >= 9, Claude 3.7 Sonnet would simply state that the task required too many tokens to complete manually, provide the correct Tower of Hanoi algorithm, and then output an (incorrect) solution in the desired format without reasoning about it. When I provide the question to Opus 4 on the Claude chatbot app, it regularly refuses to even attempt the manual solution![5] And for n=15 or n=20, none of the models studied have enough context length to output the correct answer, let alone reason their way to it in the authors' requested format.

A prototypical response from Claude Opus 4, where it calls the n=10 Tower of Hanoi task "extremely tedious and error prone" and refuses to do it.

The authors call it "counterintuitive" that language models use fewer tokens at high complexity, suggesting a "fundamental limitation." But this simply reflects models recognizing their limitations and seeking alternatives to manually executing thousands of possibly error-prone steps  – if anything, evidence of good judgment on the part of the models!

For River Crossing, there's an even simpler explanation for the observed failure at n ≥ 6: the problem is mathematically impossible, as proven in the literature; see, e.g., page 2 of this arxiv paper.[6]
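(If you would rather check the impossibility claim yourself than chase the citation, a brute-force search settles it in seconds. The sketch below is mine, not the paper's; it enforces the actor/agent safety constraint on the two banks only, ignoring whatever happens on the boat, which can only make the puzzle easier, and it still finds no solution for n = 6 pairs with a k = 3 boat, while n = 5 remains solvable.)

```python
from collections import deque
from itertools import combinations

def river_crossing_solvable(n, k):
    """Breadth-first search over (left-bank occupants, boat side) states for the
    actors/agents puzzle with n pairs and boat capacity k. The safety constraint
    (no actor on a bank with another agent unless their own agent is present) is
    checked on both banks after every crossing."""
    people = frozenset((i, role) for i in range(n) for role in ("actor", "agent"))

    def safe(bank):
        actors = {i for i, r in bank if r == "actor"}
        agents = {i for i, r in bank if r == "agent"}
        return all(i in agents or not agents for i in actors)

    start = (people, 0)             # everyone on the left bank, boat on the left
    seen, queue = {start}, deque([start])
    while queue:
        left, boat = queue.popleft()
        if not left:                # everyone has reached the right bank
            return True
        bank = left if boat == 0 else people - left
        for size in range(1, k + 1):
            for group in combinations(bank, size):
                group = frozenset(group)
                new_left = left - group if boat == 0 else left | group
                if safe(new_left) and safe(people - new_left):
                    state = (new_left, 1 - boat)
                    if state not in seen:
                        seen.add(state)
                        queue.append(state)
    return False

print(river_crossing_solvable(5, 3))  # True: 5 pairs can cross with a 3-person boat
print(river_crossing_solvable(6, 3))  # False: no sequence of crossings works
```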

Of the two environments I investigated in detail, there seem to be mundane reasons explaining the apparent collapse that the authors failed to consider. I did not have time to investigate or reproduce their results for the other two tasks, but I’d be surprised if similar problems didn’t plague the authors’ results for those as well.

Again, evals failing for mundane, boring reasons unrelated to the question you’re investigating (such as, “the model refuses to do it” or “the problem is impossible”) is a common experience in the field of LM evals. This is precisely why it’s so important to look at your data instead of just statistically testing your hypothesis or running a regex! The fact that the authors seemed to miss the explanation for why reasoning tokens decrease for large n suggests to me that they did not look at their data very carefully (if at all), and the fact that they posed an impossible problem as one of their four environments suggests that they also did not think about their environments very carefully.

I want to emphasize again that this is not an unusually bad ML paper. Writing good papers is hard and this is a preprint that has not been peer reviewed. Anyone who’s served as a peer reviewer for an ML conference knows that the ladder of paper quality goes all the way up and all the way down. Insofar as anyone involved with the paper deserves criticism beyond the standard sort, it’s the people who hyped it up based on the headline result or because it fit their narrative.

That being said, on the basis of research rigor alone, I think there’s good reason to doubt the conclusions of the paper. I do not think that this paper is particularly noteworthy as a contribution to the canon of fundamental limitations, let alone a “knockout blow” that shows the current LLM paradigm is doomed.

6.

Suppose we accepted the authors' results at face value, and accepted that language models could never manually execute thousands of algorithmic steps without error. Should we then conclude that LLMs fundamentally cannot do “generalizable reasoning” as we understand it? 

There’s a common assumption in many LLM critiques that reasoning ability is binary: either you have it, or you don’t. Either you have true generalization, or you have pure memorization.  Under this dichotomy, showing that LLMs fail to learn or implement the general algorithm in a toy example is enough to conclude that they must be blind pattern matchers in general. In the case of the Illusion of Thinking paper, the implicit argument is that, if an LLM cannot flawlessly execute simple toy algorithms, this constitutes damning evidence against generalizable reasoning. They argue this even though frontier LLMs can implement the algorithms in Python, often provide shorter solutions that fit within their context windows, and explain how to solve the problem in detail when asked. 

I’d argue this dichotomy does not reflect much of what we think of as “reasoning” in the real world. People can consistently catch thrown balls without knowing about or solving differential equations (as in the classic Richard Dawkins quote).[7] Even in the realm of mathematics, most mathematicians work via intuition and human-level abstractions, and do not just write formal Lean programs. And there’s a reason why some sociologists argue that heuristics learned from culture, not pure reasoning ability, are the secret of humanity’s success.

Whenever we deal with agents with bounded compute grappling with complicated real-world environments, we’ll see a reliance on heuristics that have worked in the past. In this view, generalization can be the limit of memorization: given only a few examples on a new task, you might start by memorizing individual data points, then learn “shallow heuristics”, and finally generalize to deeper, more useful heuristics. 

This is why I find most of the “fundamental limitation” claims to be unconvincing. The interesting question isn't whether a bounded agent relies on learned heuristics (of course it does!), but rather how well those heuristics generalize to domains of interest. Focusing on whether LLMs can implement simple algorithms in toy settings or theoretical domains, without consideration of how these results will apply elsewhere, risks missing this point entirely.

I'll concede to the authors that LLMs are clearly not best thought of as traditional computers. In fact, I'll concede that there's no way a modern LLM can output the 32,767 steps of the answer to the n=15 Tower of Hanoi in the authors' desired format, while even a simple Python script (written by one of these LLMs, no less) can do this in less than a second.
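(For concreteness, the kind of script I have in mind is just the textbook recursion below; I've omitted the specific output format the authors require, since reproducing it isn't the point.)

```python
def hanoi_moves(n, source=0, target=2, spare=1):
    """Return the (2**n - 1)-move solution as a list of (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, source, spare, target)
            + [(source, target)]
            + hanoi_moves(n - 1, spare, target, source))

moves = hanoi_moves(15)
print(len(moves))  # 32767: the full n=15 solution, generated near-instantly
```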

But at the risk of repeating myself, do the results thereby imply that LLMs cannot do “generalizable reasoning”? To answer this question, I argue that we ought to be able to look at evidence other than a simple binary “can the LLM implement the general algorithm manually”. For example, perhaps we should consider evidence like the fact that frontier LLMs can implement the algorithm in Python, provide shorter solutions, and explain how to solve the problem – all of which suggest that the LLMs do understand the problem.[8] I think insofar as the results show that there’s a real, fundamental limit on whether LLMs can manually execute algorithms for hundreds or thousands of steps, this is a very different claim than “LLMs cannot do generalizable reasoning”.

7.

I have a confession: setting aside the abstract arguments above, much of my interest in the matter is personal. Namely, seeing arguments about the fundamental limitations of LLMs sometimes makes me question the degree to which I can do “generalizable reasoning”.

People who know me tend to comment that I “have a good memory”. For example, I remember the exact blunder I made in a chess game with a friend two years ago on this day, as well as the conversations I had that day. By default, I tend to approach problems by quickly iterating through a list of strategies that have worked on similar problems in the past, and insofar as I do first-principles reasoning, I try my best to amortize the computation by remembering the results for future use. In contrast, many people are surprised when I can’t quickly solve problems requiring a lot of computation.

That’s not to say that I can’t reason; after all, I argue that writing this post certainly involved a lot of “reasoning”. I’ve also met smart people who rely even more on learned heuristics than I do. But from the inside it really does feel like much of my cognition is pattern matching (on ever-higher levels). Much of this post drew on arguments or results that I’ve seen before; and even the novel work involved applying previous learned heuristics.

I almost certainly cannot manually write out 1023 Tower of Hanoi steps without errors – like 3.7 Sonnet or Opus 4, I'd write a script instead. By the paper's logic, I lack “generalizable reasoning”. But the interesting question was never whether I can flawlessly execute an algorithm by hand; it's whether I can apply the right tools or heuristics to a given problem.

8.

To his credit, in his Substack post, Gary Marcus does mention the critique I provide above. However, he dismisses this by saying that:

That is, he makes the claim that in order to be AGI, the system needs to be able to reliably do 13-digit arithmetic, or consistently write out, by hand, the solution to Tower of Hanoi with 8 disks.

There are two responses I have to this: a constructive response and a snarky response.

The constructive response is that current LLMs such as Claude 3.7 Sonnet and o3 can easily write software that solves these tasks. Sure, an LLM might not be able to do “generalized reasoning” in the sense that the authors propose, but an LLM with a simple code interpreter definitely can. Here, the key question is why we must consider the LLM by itself, as opposed to an AI agent composed of an LLM and an agent scaffold – note that even chatbot-style apps such as ChatGPT provide the LLM with various tools such as a code interpreter and internet access. Why should we limit our discussion of AGIs to just the LLM component of an AI system, as opposed to the AI system as a whole?
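To be concrete about what “an LLM with a simple code interpreter” means here, the scaffold can be almost trivial. The sketch below is schematic and mine alone: `call_llm` is a hypothetical stand-in for whatever chat API you have access to, and a real scaffold would add sandboxing and error-driven retries.

```python
import subprocess
import tempfile

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-model API call; returns Python source code."""
    raise NotImplementedError("wire this up to whatever model/API you use")

def solve_with_code_interpreter(task_description: str) -> str:
    """Ask the model for a program that solves the task, run that program,
    and return its stdout."""
    code = call_llm(
        "Write a standalone Python program that prints the full solution to the "
        f"following puzzle, one move per line:\n\n{task_description}"
    )
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        ["python", path], capture_output=True, text=True, timeout=60
    )
    return result.stdout
```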

The snarky response is that sure, I'm happy to concede that “AGI” (as envisioned by Gary Marcus) requires being able to multiply 13-digit numbers or flawlessly write out a 1023-step Tower of Hanoi solution. But there's a sort of intelligence that humans possess, that is general in that it works across many domains, and that does not require being able to multiply 13-digit numbers or write out 1023 steps of Tower of Hanoi. This is the sort of intelligence that can notice when a computer algorithm would be better for a problem and write that algorithm instead of solving the problem by hand. This is the sort of intelligence that allows researchers to come up with new ideas, construction workers to tackle problems in the real world, and salespeople to persuade their customers to buy a product. This is the sort of intelligence that I use when I apply complex heuristics acquired from decades of reading and writing as I write this paragraph. When I think about whether or not LLMs have “fundamental limitations”, I'm interested in whether or not they might become superhumanly intelligent in this sense, not whether or not they're “AGI” in the sense laid out by Gary Marcus.

Or, if you’ll permit an amateur attempt at a meme:

9.

Having discussed why I think the paper is a bad critique of the existing LLM paradigm and why I find Gary Marcus’s rebuttal unconvincing, let us get back to the question of what a good “fundamental limitations” critique of LLMs would look like.

First, instead of relying only on a few toy examples, a good critique should ideally be based either on strong empirical trends or on good, applicable theoretical results. The history of machine learning is full of results on toy examples that failed to generalize outside of their narrow domains; if you want to convince me that there's a “fundamental limitation”, you'll need to offer me either a mathematical proof or strong empirical evidence.

Second, the critique should address the key reasons why people are bullish about LLMs – e.g. their ability to converse in fluent English; their ability to write code; their broad and deep knowledge of subjects such as biology; and their increasing ability to act as software agents. At the very least, the critique should explain both why it does not apply in these cases and why we should expect the limitation to matter in the future.

Yes, I know Gary Marcus doesn't like this graph. If there's enough interest, I'll write another post responding to his critiques.

Finally, critiques need to argue why these limitations will continue to apply despite continuing AI progress and the best efforts of researchers trying to overcome them. Many of the failure modes that LLMs exhibited in the past were solved over time with a combination of scale and researcher effort. In 2019, GPT-2 was barely able to string together sentences, and in 2020 GPT-3 could barely do 2-digit arithmetic. The LLMs of today can often accomplish software tasks that take humans an hour to complete.

Good critiques of LLMs exist – in fact, Gary Marcus has made many better critiques of LLM capabilities in the past, as have the authors of the Illusion of Thinking paper. But invariably, better critiques tend to be more boring and empirical; not a singular knock-down argument but a back-and-forth discussion with many small examples and personal anecdotes. And instead of originating from outside the field, these critiques center around issues that people who work on LLMs talk extensively about.

I’ll provide 3 such critiques below:

  1. There seem to be computational limitations to current LLMs. No modern LLM handles arbitrarily long context windows, and performance degrades over very long contexts. More importantly, the amount of compute required to train LLMs has been growing at an exponential rate, and this exponential trend cannot continue for very long into the future. We also might run out of data for pre-training, or of sufficiently diverse, long-horizon training environments for RL training. (That being said, I'm not sure how important handling super long contexts is, what level of capabilities we'll hit before running out of compute/data/environments, or how fundamental these limits are in the face of human research effort.)
  2. LLMs can be sensitive to the prompt in hard-to-foresee ways. Changing the framing of problems can greatly impact their behavior – again, see the GSM-NoOp work from the Illusion of Thinking authors.[9] This suggests that some sophisticated LLM behavior may be hard to elicit, if not entirely memorized. (That being said, LLMs seem to be becoming less susceptible to being tricked and more capable of solving novel problems over time. Also, humans are famously influenceable by small changes in framing as well.)
  3. LLMs have hallucinations and suffer from reliability issues in general. When I ask o3 or Claude Opus 4 to do research for me, I need to check their work, because they’ll sometimes flat-out lie about what a citation says. (But again, it seems that these issues are getting better over time, as evidenced by the METR time horizon results. Also, having worked with humans, I assure you that humans also make stuff up and suffer from reliability issues.)

Could these be fundamental limitations? I think it's possible. It's possible that I'll learn tomorrow that we cannot train models using more compute than our current ones, or observe that models continue to be easily distracted by irrelevant details in their prompts, or see the trend of increasing reliability stall at a level far below humans. I'd still want to think about whether these purported limitations would hold up against researchers trying to address them. But if they do, I would argue that these are good reasons to expect the modern LLM paradigm to hit a dead end.

None of these will be an “LLMs are a dead field” knockout blow. Insofar as LLM hype dies down due to limitations like these, it'll have been a death by a thousand cuts as evidence accumulates over time and trends reveal themselves. It will not be due to a single paper purporting to show “fundamental limitations”.

10.

One delightful irony is that, I suspect, most people would agree with the following tweet by Josh Wolfe, regardless of their thoughts on Gary Marcus-style skepticism:

LLM skeptics can read this as Apple vindicating the long-ignored, sage arguments from one of the foremost skeptics of LLM capabilities. But for others, “Gary Marcus” is synonymous with making pointless critiques that will soon be proven irrelevant, while completely failing to address the cruxes of those he’s arguing against.

I think this is a sad state of affairs. I much prefer a world in which “Gary Marcus”ing means making good, thoughtful critiques, engaging in good faith with critics, and focusing on the strongest points in favor of the skeptical position.

Empirically, this is not what is happening. Over the course of drafting this post, Gary Marcus has doubled down on this paper being conclusive evidence for LLM limitations, both on Twitter:

And in an opinion piece posted in the Guardian, where he points specifically to large n Tower of Hanoi as evidence for fundamental limitations:

There's plenty of room for nuanced critiques of LLMs. Lots of the LLM commentary is hype. Twitter abounds with hyperbolic statements that deserve to be brought back to earth. All language models have limited context windows, show sensitivity to prompts, and suffer from hallucinations. Most relevantly, AIs are worse than humans at many important things: despite their performance on benchmarks, Opus 4 and o3 cannot wholesale replace even moderately competent software engineers, notwithstanding many claims that software engineering is a dead discipline. The world needs thoughtful critiques of LLM capabilities grounded in empiricism and careful reasoning.

But the world does not need more tweets or popular articles misrepresenting studies (on either side of the AI debate), clinging to false dichotomies, and making invalid arguments. Useful critiques of the LLM paradigm need to go beyond theoretical claims or extrapolation from toy problems far removed from practice. Good-faith criticism should focus on the capabilities that “AGI believers” are hopeful for or concerned about, rather than redefining AGI to mean something else in order to dismiss their hopes or concerns out of hand, or defining “generalizable reasoning” in a way that implies the participants in the conversation themselves lack it.

The appeal of claiming fundamental limitations is obvious, as is the comparatively unsatisfying nature of empirical critiques. But given the track record, I continue to prefer reading careful analysis of empirical experiments over appreciating the “true significance” of bombastic, careless claims about so-called “fundamental limitations”.

 

Acknowledgements

Thanks to Ryan Greenblatt and Sydney von Arx for comments on this post. Thanks also to Justis Mills for copyediting assistance. 

 

  1. ^

    Edited to add: Though this paper is also quite sloppy, and I don't think all of the claims hold up. For example, it claims without citation that the block problem is PSPACE and river crossing is NP-hard. The former seems flat-out incorrect (you can clearly verify solutions efficiently, as the authors do). Generalized river crossing with arbitrary constraints and k=3 is known to be NP-hard, but I don't think it's the case for Agents/Actors or Missionaries/Cannibals. Maybe Opus got confused by how the river crossing problem was generalized?

  2. ^

     It’s worth noting that “complexity” as the authors use it is not the standard “computational complexity” – instead, the “complexity” of a problem is the number of objects n in the problem. Later on, the authors talk about the number of steps in an optimal solution; this is closer to computational complexity but not the same. For example, even though the solution for the Checkers Jumping task has length quadratic in n, the basic “guess and check” algorithm for finding this solution requires a number of steps exponential in n. Similarly, while the minimum solution length for Blocks World also scales linearly with the number of blocks n, the basic solution requires exploring an exponentially large state space.

  3. ^

     This “counterintuitive result” that Claude Sonnet “achieves near-perfect accuracy when solving the Tower of Hanoi with (N=5), which requires 31 moves, while it fails to solve the River Crossing puzzle when (N=3), which has a solution of 11 moves” has a simple explanation – the former requires executing a simple deterministic algorithm with 31 steps, while the latter requires searching over a much larger space of possible solutions.

    The authors' speculation that “... examples of River Crossing with N>2 are scarce on the web” also seems incorrect – a quick Google search for either Missionaries and Cannibals or the Jealous Husbands problem shows that there are plenty of n=3, k=2 solutions on the internet, including on Wikipedia. If anything, the fact that Claude 3.7 Sonnet fails at this task suggests that it is earnestly trying to solve the task, as opposed to regurgitating a memorized solution (!).

  4. ^

     The standard remedy for this blunder is to read model transcripts. Note that high-level statistical analysis can often fail to notice these simple alternative explanations (as seems to have happened with the authors here).

  5. ^

    Arguably, this behavior is a natural consequence of their RL training, where the environments tend to look like “solve a complicated math problem” or “write correct code for a coding task”, and not “manually execute an algorithm for hundreds of steps”. After all, if you’re given a coding task and you try to solve it by manually executing the algorithm, you’re probably not going to end up doing particularly well on the task. 

  6. ^

    (Edited to clarify: specifically, the authors use k=3 boat capacity for all problems with n>2 pairs. But for n>5 pairs, you need at least k=4 capacity to solve the problem.) 

  7. ^

     In fact, it seems likely that humans (and dogs!) follow a simple heuristic that allows them to chase down and catch a thrown ball.

  8. ^

    This is also my explanation for the authors' “counterintuitive observation” that giving LLMs the algorithm doesn’t improve their performance on the task – they already know the algorithm, it’s just hard for them to manually execute it for hundreds or thousands of steps in the requested format. 

  9. ^

    My best steelman of the Illusion of Thinking paper is also in this vein – the models seem to do a lot better on River Crossing with n=3, k=2 when you call it by the more common name of “jealous husbands” or “missionaries and cannibals”, rather than “actors and agents”. In fact, if you read their CoT, it seems that 3.7 Sonnet/Opus 4 can sometimes get the correct answer in their output even when their CoTs fail to get to the correct answer, suggesting that their performance here comes from memorizing a solution in their training data.