AI ALIGNMENT FORUM

AndresCampero — AI Alignment Forum

24

Ω

15

1

1

1y

Quickly Assessing Reward Hacking-like Behavior in LLMs and its Sensitivity to Prompt Variations

We present a simple eval set of four scenarios on which we evaluate Anthropic and OpenAI frontier models. We find varying degrees of reward hacking-like behavior, with high sensitivity to prompt variation. [Code and results transcripts]

Intro and Project Scope

We want to be able to assess whether any given model...

Jun 4, 2025 • 26