AI ALIGNMENT FORUM
Sandbagging (AI)
Edited by Raemon, last updated 27th Mar 2025
Sandbagging is when an AI system pretends to be less capable than it actually is during training or evaluation.
Posts tagged Sandbagging (AI)
Most Relevant
50
Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking
Buck, Julian Stastny
4mo
2
32
Notes on countermeasures for exploration hacking (aka sandbagging)
ryan_greenblatt
5mo
6
41
White Box Control at UK AISI - Update on Sandbagging Investigations
Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney
2mo
5
32
The “no sandbagging on checkable tasks” hypothesis
Joe Carlsmith
2y
13
23
Automated Researchers Can Subtly Sandbag
gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez, Fabien Roger
5mo
0
35
[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown, Francis Rhys Ward
1y
0
19
Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings
Casey Barkan, Sid Black, Oliver Sourbut
2mo
0
19
An Introduction to AI Sandbagging
Teun van der Weij, Felix Hofstätter, Francis Rhys Ward
1y
0
12
How to mitigate sandbagging
Teun van der Weij
5mo
0
9
Exploration hacking: can reasoning models subvert RL?
Damon Falck, Joschka Braun, Eyon Jang
1mo
4
6
Won't vs. Can't: Sandbagging-like Behavior from Claude Models
Joe Benton, Zachary Witten
6mo
0