Full version on arXiv | X

Executive summary: AI risk scenarios usually portray a relatively sudden loss of human control to AIs outmaneuvering individual humans and human institutions, due to a sudden increase in AI capabilities or a coordinated betrayal. However, we argue that even an incremental increase in AI...
This is a linkpost for a new research paper from the Alignment Evaluations team at Anthropic and collaborating researchers, introducing a suite of evaluations of models' abilities to undermine measurement, oversight, and decision-making. Paper link. Abstract: > Sufficiently capable models could subvert human oversight and decision-making in important contexts....
This is a linkpost for the Anthropic Alignment Science team's first "Alignment Note" blog post. We expect to use this format more in the future to showcase early-stage research and work-in-progress updates. Twitter thread here. Top-level summary: > In this post we present "defection probes": linear classifiers that use...
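To make the "linear classifiers over activations" idea concrete, here is a minimal sketch of a probe in that spirit. Everything below is illustrative rather than the team's actual setup: the synthetic activation vectors, the single planted "defection direction", and the choice of scikit-learn's `LogisticRegression` are all assumptions; in practice the features would be residual-stream activations extracted from a model on labeled prompts.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in for model activations: n examples of
# d-dimensional vectors per class. Real probes would use activations
# from a forward pass over "defects" vs. "safe" labeled inputs.
d, n = 512, 1000
safe = rng.normal(0.0, 1.0, size=(n, d))
defect = rng.normal(0.0, 1.0, size=(n, d))
defect[:, 0] += 2.0  # toy "defection direction" along one coordinate

X = np.vstack([safe, defect])
y = np.concatenate([np.zeros(n), np.ones(n)])

# The probe itself is just a linear classifier on the activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))

# At monitoring time, score fresh activations; a high probability
# flags internal state the probe associates with defection.
print("p(defect):", probe.predict_proba(defect[:5])[:, 1])
```

The appeal of this design is its simplicity: a single learned direction in activation space is cheap to train and evaluate, which is what makes linear probes attractive as a monitoring tool in the first place.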
I'm not going to add much commentary here beyond what we've already put out, since we put a lot of effort into the paper itself and I'd mostly recommend reading it directly, especially since there are many subtle results that are not easy...