Revising Stages-Oversight Reveals Greater Situational Awareness in LLMs
Summary: The Stages-Oversight benchmark from the Situational Awareness Dataset tests whether large language models (LLMs) can distinguish between evaluation prompts (such as benchmark questions) and deployment prompts (real-world user inputs). This ability is crucial for detecting potential alignment risks like sandbagging (strategically underperforming during evaluation) or alignment faking (appearing aligned...
Mar 12, 2025