Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming
by Jasmine Li and Alex Turner
Behavioral evaluations may become worthless, which we think would be a disaster. Smart misaligned models may realize they are being evaluated ("eval awareness") and then behave well during the evaluation so that we don't notice they're misaligned ("eval gaming"). We think increasing eval cooperativeness might be a more scalable solution...
May 2749