Eric Christiansen — AI Alignment Forum

72

2y

Sabotage Evaluations for Frontier Models

by David Duvenaud, Joe Benton, Sam Bowman, evhub, mishajw, Eric Christiansen, HoldenKarnofsky, Ethan Perez, and Buck

This is a linkpost for a new research paper from the Alignment Evaluations team at Anthropic and other researchers, introducing a new suite of evaluations of models' abilities to undermine measurement, oversight, and decision-making. Paper link.

Abstract:

> Sufficiently capable models could subvert human oversight and decision-making in important contexts....

Oct 18, 2024 • 95