Epistemic status: confident that the underlying idea is useful; less confident about the details, though they're straightforward enough that I expect they're mostly in the right direction.

TLDR: This post describes a pre-mortem-like exercise that I find useful for thinking about AGI risk. It is the only way I know...
Subtitle: A partial defense of high-confidence AGI doom predictions.

Introduction

Consider these two kinds of accident scenarios:
1. In a default-success scenario, accidents are rare. For example, modern aviation is very safe thanks to decades of engineering effort and a safety culture (e.g. the widespread use of checklists). When something...
Status: working notes

Here's an exercise I've found very useful for intuition-building around alignment:
1. Propose a solution to the alignment problem.
2. Dig into the details until you understand why the proposal fails.
3. If there is an obvious fix, apply it and go back to step 2, iterating.
In this...
TLDR: If you work in AI alignment / safety research, please fill out this form on how useful access to extra compute would be for your research. This should take under 10 minutes, and you don't need to read the rest of this post beforehand; in fact it would be great...
Inner alignment and objective robustness have been frequently discussed in the alignment community since the publication of “Risks from Learned Optimization” (RFLO). These concepts identify a problem beyond outer alignment/reward specification: even if the reward or objective function is perfectly specified, there is a risk of a model pursuing a...