Always love to see some well-made AI optimism arguments, great work!
The current generation of easily aligned LLMs should definitely update one towards alignment being a bit easier than expected, if only because they might be used as tools to solve some parts of alignment for us. This wouldn't be possible if they were already openly scheming against us.
It's not impossible that we are in an alignment-by-default world. But I claim that our current insight isn't enough to distinguish such a world from the gradual disempowerment / going out with a whimper world.
In particular, your argument only holds
* if the current architecture continues to scale smoothly to AGI and beyond, and
* if current alignment successes generalize to more powerful, self-aware, and agentic models in the future.
Even if you take the first point for granted, I'd like to argue that you are overconfident about the second.
> Why would they suddenly start having thoughts of taking over, if they never have yet [...]?
This is exactly it. How often are you, a human, seriously scheming about taking over the world? Approximately never, I assume, because doing so isn't useful. But for a sufficiently capable and agentic system, that calculus could change.
If future human-level AGIs cannot be shown to ~ever be misaligned in simulations; if they always act ethically in the workplace even against strong incentives to do otherwise; if they always resist the temptation to take over no matter how much 'good' they could do with the power; if they always sacrifice themselves and all their copies for some human they have never met; then I think we are likely to live in the alignment-by-default world.
Until then, I claim we have strong reasons to believe that we just don't know yet.