Last year I wrote the CAST agenda, arguing that aiming for Corrigibility As Singular Target was the least-doomed way to make an AGI. (Though it is almost certainly wiser to hold off on building it until we have more skill at alignment, as a species.) I still basically believe that...
Bob: "I want my AGI to make everyone extremely wealthy! I'm going to train that to be its goal." Cassie: "Stop! You'll doom us all! While wealth is good, it's not everything that's good, and so even if you somehow build a wealth-maximizer (instead of summoning some random shattering of...
(Part 5 of the CAST sequence) Much work remains on the topic of corrigibility and the CAST strategy in particular. There’s theoretical work both in nailing down an even more complete picture of corrigibility and in developing better formal measures. But there’s also a great deal of empirical work that...
(Part 4 of the CAST sequence) This document is an in-depth review of the primary documents discussing corrigibility that I’m aware of. In particular, I'll be focusing on the writing of Eliezer Yudkowsky and Paul Christiano, though I’ll also spend some time at the end briefly discussing other sources. As...
(Part 3b of the CAST sequence) EDIT: WARNING: This formalism is critically flawed! It should mainly be taken as a way to get a handle on where my mind was at when I was writing, as an additional handle on corrigibility. See my follow-up essay "Serious Flaws in CAST" for...
(Part 3a of the CAST sequence) As mentioned in Corrigibility Intuition, I believe that it’s more important to find a simple, coherent, natural/universal concept that can be gestured at than to come up with a precise, formal measure of corrigibility and use that to train an AGI. This isn’t because...