AI ALIGNMENT FORUM

Sandbagging (AI)

Edited by Raemon, last updated 27th Mar 2025

Sandbagging is when an AI system deliberately underperforms during training or evaluation, appearing less capable than it actually is.

Posts tagged Sandbagging (AI)
Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking
Buck, Julian Stastny (4mo; karma 50; 2 comments)

Notes on countermeasures for exploration hacking (aka sandbagging)
ryan_greenblatt (5mo; karma 32; 6 comments)

White Box Control at UK AISI - Update on Sandbagging Investigations
Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney (2mo; karma 41; 5 comments)

The “no sandbagging on checkable tasks” hypothesis
Joe Carlsmith (2y; karma 32; 13 comments)

Automated Researchers Can Subtly Sandbag
gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez, Fabien Roger (5mo; karma 23; 0 comments)

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown, Francis Rhys Ward (1y; karma 35; 0 comments)

Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings
Casey Barkan, Sid Black, Oliver Sourbut (2mo; karma 19; 0 comments)

An Introduction to AI Sandbagging
Teun van der Weij, Felix Hofstätter, Francis Rhys Ward (1y; karma 19; 0 comments)

How to mitigate sandbagging
Teun van der Weij (5mo; karma 12; 0 comments)

Exploration hacking: can reasoning models subvert RL?
Damon Falck, Joschka Braun, Eyon Jang (1mo; karma 9; 4 comments)

Won't vs. Can't: Sandbagging-like Behavior from Claude Models
Joe Benton, Zachary Witten (6mo; karma 6; 0 comments)