In my most recent post, I introduced a corrigibility transformation that could take an arbitrary goal over external environments and define a corrigible goal with no hit to performance. That post focused on corrigibility and deception in training, which are some of the biggest problems in AI alignment, but the...
This post contains similar content to a forthcoming paper, in a framing more directly addressed to readers already interested in and informed about alignment. I include some less formal thoughts, and cut some technical details. That paper, A Corrigibility Transformation: Specifying Goals That Robustly Allow For Goal Modification, will be...
Thanks to Evan Hubinger for funding this project and for introducing me to predictive models, Johannes Treutlein for many fruitful discussions on related topics, and Dan Valentine for providing valuable feedback on my code implementation. In September 2023, I received four months of funding through Manifund to extend my initial...
Max Harms recently published an interesting series of posts on corrigibility, which argue that corrigibility should be the sole objective we try to give to a potentially superintelligent AI. A large installment in the series is dedicated to cataloging the properties that make up such a goal, with open questions...
Thanks to Johannes Treutlein and Paul Colognese for feedback on this post. Just over a year ago, the Conditioning Predictive Models paper was released. It laid out an argument and a plan for using powerful predictive models to reduce existential risk from AI, and outlined some foreseeable challenges to doing...
Thanks to Leo Gao, Nicholas Dupuis, Paul Colognese, Janus, and Andrei Alexandru for their thoughts. This post was mostly written in 2022, and pulled out of my drafts after recent conversations on the topic. My main update from then is the framing. Rather that suggesting searching for substeps of search...
Thanks to Charlotte Siegmann, Caspar Oesterheld, Spencer Becker-Kahn, and Evan Hubinger for providing feedback on this post. The issue of self-fulfilling prophecies, also known as performative prediction, arises when the act of making a prediction can affect its own outcome. Systems aiming for accurate predictions are then incentivized not only...