Note: These are all rough numbers; I'd expect to shift substantially on all of this with further debate.
Suppose we made humanity completely robust to biorisk, i.e. we did sufficient preparation such that the risk of bio-catastrophe (including AI-mediated bio-catastrophe) was basically 0.[1] How much would this reduce total x-risk?
The basic story for any specific takeover path not mattering much is that the AIs, conditional on wanting to take over, will self-improve until they find the next easiest takeover path and do that instead. I find this persuasive, but not fully, because:
So it might be important for misaligned AIs to attempt a takeover early in the intelligence explosion. Specifically, we can ask "how much x-risk is averted if the probability of misaligned AI takeover before TED AI goes to 0?", which attempts to capture all the worlds in which AIs attempt to take over before TED AI. I think my overall risk reduction is something like 1/4. In other words, there's something like a 3/4 chance the AIs lurk (and can goal-guard or align successors), or can backdoor successors, or control their successors.
Now, conditional on the pre-TED AIs attempting to takeover, what are the different routes that they might use? The most salient options to me are:
The bio path seems to me to be the most compelling path here by a fair amount; it maybe gets another 1/3 of the probability of this outcome. So, just from the risk of pre-TED AIs attempting to take over, we have something like 1/3 * 1/4 = 1/12 probability. If you multiply that by my likelihood of AI takeover, which is around 70%, you get ~6% risk flowing from this route. Then I update up to ~8% from other AIs, e.g. post-TED AIs relying on biorisk as a route to takeover.
So my overall view on how much x-risk flows through bio-catastrophe is around 8%.
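To make the arithmetic explicit, here's a rough sketch of the estimate above; all the numbers are my hedged guesses, not measured quantities:

```python
# Rough sketch of the estimate above; every value is a hedged guess, not a measured figure.
p_attempt_pre_ted = 1 / 4    # chance misaligned AIs attempt a takeover before TED AI (vs. lurking, goal-guarding, etc.)
p_bio_given_attempt = 1 / 3  # chance such an attempt routes through bio
p_ai_takeover = 0.7          # my overall probability of misaligned AI takeover

risk_pre_ted_bio = p_bio_given_attempt * p_attempt_pre_ted * p_ai_takeover
print(f"{risk_pre_ted_bio:.1%}")  # ~5.8%, i.e. ~6%; I then bump this to ~8% for e.g. post-TED AIs using bio
```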
Note that what exactly counts as a bio x-risk is somewhat unclear, e.g. at some point the AIs could build drones / nanotech to get into the bio-bunkers, and it's not obvious whether that counts.
This breakdown isn't exhaustive, because another salient possibility is that the AIs are clueless, e.g., they are misaligned with their successors but don't realize it, similar to Agent 3 in AI 2027.
AIs need to worry about their own alignment problem, meaning that they may not be able to self-improve in an unconstrained fashion.
I haven't thought too deeply about this, but I would guess that the AI self-alignment problem is quite a lot easier than the human AI-alignment problem.
I agree that AI successor-alignment is probably easier than the human AI alignment problem.
One additional difficulty for the AIs is that they need to solve the alignment problem in a way that humans won't notice/understand (or else the humans could take the alignment solution and use it for themselves / shut down the AIs). During the regime before human obsolescence, if we do a reasonable job at control, I think it'll be hard for them to pull that off.
I generally like your breakdown and way of thinking about this, thanks. Some thoughts:
--
Also, the above isn't even mentioning bio x-risk mediated by humans, or by trailing AIs during the chaos of takeoff. My guess is that those risks are substantially lower, e.g. maybe 1% and 2% respectively; again, I don't feel confident.
Credit: Mainly inspired by talking with Eli Lifland. Eli has a potentially-published-soon document here.
The basic case against Effective-FLOP.
A3 in https://blog.heim.xyz/training-compute-thresholds/ also discusses limitations of effective FLOPs.
Maybe I am being dumb, but why not do things on the basis of "actual FLOPs" instead of "effective FLOPs"? Seems like there is a relatively simple fact of the matter about how many actual FLOPs were performed in the training of a model, and that seems like a reasonable foundation for regulation and evals.
Yeah, actual FLOPs are the baseline thing that's used in the EO. But the OpenAI/GDM/Anthropic RSPs all reference effective FLOPs.
If there's a large algorithmic improvement, you might have a large gap in capability between two models trained with the same FLOPs, which is not desirable. Ideal thresholds in regulation / scaling policies are as tightly tied as possible to the risks.
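As a minimal sketch of the issue, assuming the common rough definition that effective FLOPs are actual FLOPs scaled by an algorithmic-efficiency multiplier relative to some reference (the multiplier values below are purely illustrative, not measured):

```python
# Two models trained with identical actual FLOPs, but one uses ~4x more efficient algorithms.
# Under an actual-FLOP threshold they look identical; under effective FLOPs they don't.
actual_flops = 1e26

efficiency_baseline = 1.0  # reference-year algorithms
efficiency_improved = 4.0  # hypothetical 4x algorithmic improvement

effective_flops_baseline = actual_flops * efficiency_baseline  # 1e26
effective_flops_improved = actual_flops * efficiency_improved  # 4e26: same actual FLOPs, higher capability
print(effective_flops_baseline, effective_flops_improved)
```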
Another downside that FLOPs and E-FLOPs share is that it's unpredictable what capabilities a 1e26 or 1e28 FLOP model will have. And it's unclear what capabilities will emerge from a small bit of scaling: it's possible that within a 4x FLOP scaling you get high capabilities that had not appeared at all in the smaller model.