This is a linkpost for a new research paper from the Alignment Evaluations team at Anthropic and other researchers, introducing a suite of evaluations of models' abilities to undermine measurement, oversight, and decision-making. Paper link. Abstract: > Sufficiently capable models could subvert human oversight and decision-making in important contexts....
One of the biggest reasons alignment might be hard is what I’ll call threat obfuscation: various dynamics that might make it hard to measure/notice cases where an AI system has problematic misalignment (even when the AI system is in a controlled environment and one is looking for signs of misalignment)....
I sometimes hear people asking: “What is the plan for avoiding a catastrophe from misaligned AI?” This post gives my working answer to that question - sort of. Rather than a plan, I tend to think of a playbook.1 * A plan connotes something like: “By default, we ~definitely fail....
In late 2022, Nate Soares gave some feedback on my Cold Takes series on AI risk (shared as drafts at that point), stating that I hadn't discussed what he sees as one of the key difficulties of AI alignment. I wanted to understand the difficulty he was pointing to, so...
Click lower right to download or find on Apple Podcasts, Spotify, Stitcher, etc. In previous pieces, I argued that there's a real and large risk of AI systems' aiming to defeat all of humanity combined - and succeeding. I first argued that this sort of catastrophe would be likely without...
I’ve argued that AI systems could defeat all of humanity combined, if (for whatever reason) they were directed toward that goal. Here I’ll explain why I think they might - in fact - end up directed toward...
When thinking about how to make the best of the most important century, two “problems” loom large in my mind: * The AI alignment problem: how to build AI systems that perform as intended, and avoid a world run by misaligned AI. * The AI deployment problem (briefly discussed here):...