AI ALIGNMENT FORUM

Alex Turner — AI Alignment Forum

65

8y

Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming

by Jasmine Li and Alex Turner

Behavioral evaluations may become worthless, which we think would be a disaster. Smart misaligned models may realize they are being evaluated ("eval awareness") and then act to look good to us so we don't realize they're misaligned ("eval gaming"). We think increasing eval cooperativeness might be a more scalable solution...