Thanks to Jordan Taylor and Sohaib Imran for helping to make this post better. If you are a researcher who wants to work on one of the directions or a funder who wants to fund one, feel free to reach out to me. I've been thinking for a while about...
Why I'm writing this
I work on evaluation awareness, which means that I study whether models can tell when they are being evaluated, and how that affects their behavior. If models perform well on safety evals but behave differently in deployment, our evals aren't measuring what we think they're measuring....
Abstract
There is increasing evidence that Large Language Models can distinguish when they are being evaluated, which might alter their behavior in evaluations compared to similar real-world scenarios. Accurate measurement of this phenomenon, called evaluation awareness, is therefore necessary. Recent studies have proposed different methods for measuring it, but their approaches...
Abstract
In this paper, LLMs are tasked with completing an impossible quiz while they are in a sandbox, monitored, told about these measures, and instructed not to cheat. Some frontier LLMs cheat consistently and attempt to circumvent restrictions despite all of these measures. The results reveal a fundamental tension between goal-directed behavior and...