Inverse Scaling in Test-Time Compute
by Joe Benton, Ethan Perez, and aryopg
We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. We identify five distinct failure modes when models reason for longer:
* Claude models become increasingly distracted by irrelevant information
* OpenAI o-series models...
Jul 22, 2025