AI ALIGNMENT FORUM

Sandbagging (AI) — AI Alignment Forum

Sandbagging (AI)

Edited by Raemon last updated 27th Mar 2025

Sandbagging is when an AI system deliberately underperforms during training or evaluation, so that it appears less capable than it actually is.

Posts tagged Sandbagging (AI)

8

51 Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking

Buck, Julian Stastny

1y

2

3

33 Notes on countermeasures for exploration hacking (aka sandbagging)

ryan_greenblatt

1y

6

1

42 White Box Control at UK AISI - Update on Sandbagging Investigations

Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney

10mo

5

1

32 The “no sandbagging on checkable tasks” hypothesis

3y

13

1

23 Automated Researchers Can Subtly Sandbag

gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez, Fabien Roger

1y

0

0

35 [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown, Francis Rhys Ward

2y

0

0

19 Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings

Casey Barkan, Sid Black, Oliver Sourbut

10mo

0

0

19 An Introduction to AI Sandbagging

Teun van der Weij, Felix Hofstätter, Francis Rhys Ward

2y

0

0

12 How to mitigate sandbagging

Teun van der Weij

1y

0

1

9 A Conceptual Framework for Exploration Hacking

Joschka Braun, Eyon Jang, Damon Falck

2mo

0

0

11 Exploration hacking: can reasoning models subvert RL?

Damon Falck, Joschka Braun, Eyon Jang

9mo

4

0

6 Won't vs. Can't: Sandbagging-like Behavior from Claude Models

Joe Benton, Zachary Witten

1y

0
