This is a linkpost for a new research paper from the Alignment Evaluations team at Anthropic and other researchers, introducing a suite of evaluations of models' abilities to undermine measurement, oversight, and decision-making. Paper link. Abstract: > Sufficiently capable models could subvert human oversight and decision-making in important contexts....
One of the biggest reasons alignment might be hard is what I’ll call threat obfuscation: various dynamics that might make it hard to measure/notice cases where an AI system has problematic misalignment (even when the AI system is in a controlled environment and one is looking for signs of misalignment)....
I sometimes hear people asking: “What is the plan for avoiding a catastrophe from misaligned AI?” This post gives my working answer to that question - sort of. Rather than a plan, I tend to think of a playbook.1 * A plan connotes something like: “By default, we ~definitely fail....
In late 2022, Nate Soares gave some feedback on my Cold Takes series on AI risk (shared as drafts at that point), stating that I hadn't discussed what he sees as one of the key difficulties of AI alignment. I wanted to understand the difficulty he was pointing to, so...
Click lower right to download or find on Apple Podcasts, Spotify, Stitcher, etc. In previous pieces, I argued that there's a real and large risk of AI systems' aiming to defeat all of humanity combined - and succeeding. I first argued that this sort of catastrophe would be likely without...
I’ve argued that AI systems could defeat all of humanity combined, if (for whatever reason) they were directed toward that goal. Here I’ll explain why I think they might - in fact - end up directed toward...
When thinking about how to make the best of the most important century, two “problems” loom large in my mind: * The AI alignment problem: how to build AI systems that perform as intended, and avoid a world run by misaligned AI. * The AI deployment problem (briefly discussed here):...