Inverse Scaling in Test-Time Compute
We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. We identify five distinct failure modes when models reason for longer: * Claude models become increasingly distracted by irrelevant information * OpenAI o-series models...
Jul 22, 202520

