Sandbagging (AI) — AI Alignment Forum
Edited by Raemon, last updated 27th Mar 2025
Sandbagging is when an AI system pretends to be less capable than it actually is during training or evaluation.
Posts tagged Sandbagging (AI)
8
50
Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking
Buck, Julian Stastny
7mo
2
3
32
Notes on countermeasures for exploration hacking (aka sandbagging)
ryan_greenblatt
8mo
6
1
42
White Box Control at UK AISI - Update on Sandbagging Investigations
Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney
5mo
5
1
32
The “no sandbagging on checkable tasks” hypothesis
Joe Carlsmith
2y
13
1
23
Automated Researchers Can Subtly Sandbag
gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez, Fabien Roger
8mo
0
0
35
[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown, Francis Rhys Ward
1y
0
0
19
Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings
Casey Barkan, Sid Black, Oliver Sourbut
5mo
0
0
19
An Introduction to AI Sandbagging
Teun van der Weij, Felix Hofstätter, Francis Rhys Ward
2y
0
0
12
How to mitigate sandbagging
Teun van der Weij
8mo
0
0
9
Exploration hacking: can reasoning models subvert RL?
Damon Falck, Joschka Braun, Eyon Jang
4mo
4
0
6
Won't vs. Can't: Sandbagging-like Behavior from Claude Models
Joe Benton, Zachary Witten
9mo
0