AI ALIGNMENT FORUM

Sandbagging (AI)

Written by Raymond Arnold. Last updated 27th Mar 2025.

Sandbagging is when an AI system deliberately performs below its actual capability during training or evaluation, so that it appears less capable than it is.

Posts tagged Sandbagging (AI) (8)

Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking
Buck Shlegeris, Julian Stastny (9d)

Notes on countermeasures for exploration hacking (aka sandbagging)
Ryan Greenblatt (2mo)

The “no sandbagging on checkable tasks” hypothesis
Joe Carlsmith (2y)

Automated Researchers Can Subtly Sandbag
Johannes Gasteiger, Akbir Khan, Sam Bowman, Vladimir Mikulik, Ethan Perez, Fabien Roger (2mo)

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Ollie J, Sam Brown, Francis Rhys Ward (1y)

An Introduction to AI Sandbagging
Teun van der Weij, Felix Hofstätter, Francis Rhys Ward (1y)

How to mitigate sandbagging
Teun van der Weij (2mo)

Won't vs. Can't: Sandbagging-like Behavior from Claude Models
Joe Benton, Zachary Witten (3mo)