AI ALIGNMENT FORUM
Sandbagging (AI)
Edited by Raemon, last updated 27th Mar 2025
Sandbagging is when an AI system pretends to be less capable than it actually is during training or evaluation.
Posts tagged Sandbagging (AI)
Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking (Buck, Julian Stastny; 6mo)
Notes on countermeasures for exploration hacking (aka sandbagging) (ryan_greenblatt; 8mo)
White Box Control at UK AISI - Update on Sandbagging Investigations (Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney; 4mo)
The “no sandbagging on checkable tasks” hypothesis (Joe Carlsmith; 2y)
Automated Researchers Can Subtly Sandbag (gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez, Fabien Roger; 8mo)
[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations (Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown, Francis Rhys Ward; 1y)
Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings (Casey Barkan, Sid Black, Oliver Sourbut; 4mo)
An Introduction to AI Sandbagging (Teun van der Weij, Felix Hofstätter, Francis Rhys Ward; 2y)
How to mitigate sandbagging (Teun van der Weij; 8mo)
Exploration hacking: can reasoning models subvert RL? (Damon Falck, Joschka Braun, Eyon Jang; 4mo)
Won't vs. Can't: Sandbagging-like Behavior from Claude Models (Joe Benton, Zachary Witten; 9mo)