Recent progress in AI has led to rapid saturation of most capability benchmarks, such as MMLU and RE-Bench. Even much more sophisticated benchmarks such as ARC-AGI or FrontierMath are seeing incredibly fast improvement, all while severe under-elicitation remains very salient.
As many have pointed out, general capability involves more than simple tasks such as these, which have a long history in the field of ML and are therefore easily saturated. Claude Plays Pokemon is a good example of a somewhat novel way of measuring progress, and it benefited from actually being a good proxy of model capability.
Taking inspiration from examples such as this, we considered domains of general capacity that are even further decoupled from existing exhaustive generators. We introduce BenchBench, the first standardized benchmark designed specifically to measure an AI model’s bench-pressing capability.
Bench pressing uniquely combines fundamental components of intelligence such as motor control, strategic resource allocation (energy and force), and resilience to fatigue. Just as text-based benchmarks serve as proxies for cognitive reasoning, bench pressing provides an objective measure of embodied intelligence.
BenchBench consists of three primary tasks:
BenchBench also ensures reproducibility through standardized equipment: all tests must be conducted with a calibrated Eleiko AI-Integrated barbell and RoboSpotter™ safety system.
We evaluated leading models:
Interestingly, Gemini 2.5 demonstrated an excellent theoretical understanding of the mechanics involved but also failed physically, matching GPT-4’s 0 kg performance.
Future iterations will include categories for deadlift (DeadBench) and squat (SquatBench), and introduce challenges under adversarial conditions, such as uneven floor surfaces and distracting gym music.
BenchBench represents a new standard in holistic AI benchmarking. We invite the research community to collaborate, compete, and push the boundaries—quite literally—of artificial intelligence.
Lifting above their weight class, so to speak.