The wider AI research community is an almost-optimal engine of apocalypse. The primary metric of a paper's success is how much it improves capabilities along concrete metrics, publish-or-perish dynamics supercharge that, the safety side of things is neglected to the tune of 1:49 rate of safety to other research, and most results are made public so as to give everyone else in the world a fair shot at ending it too.
It doesn't have to be this way. The overwhelming majority of the people involved do not actually want to end the world. There must exist an equilibrium in which their intentions match their actions.
Even fractionally shifting the status quo towards that equilibrium would have massive pay-offs, as far as timelines are concerned. Fully overturning it may well constitute a sufficient condition for humanity's survival. Yet I've seen precious little work done in this area, compared to the technical questions of AI alignment. It seems to be picking up in recent months, though — and I'm happy to contribute.
This post is an attempt at a comprehensive high-level overview of the tactical and strategic options available to us.
1. Rationale
Why is it important? Why is it crucial?
First. We, uh, need to make sure that if we figure alignment out people actually implement it. Like, imagine that tomorrow someone comes up with a clever hack that robustly solves the alignment problem... but it increases the compute necessary to train any given ML model by 10%, or it's a bit tricky to implement, or something. Does the wider AI community universally adopt that solution? Or do they ignore it? Or do the industry leaders, after we extensively campaign, pinky-swear to use it the moment they start training models they feel might actually pose a threat, then predictably and fatally misjudge that?
In other words: When the time comes, we'll need to convince people that safety is important enough to fuss around a bit for its sake. But i