I think my response to this is similar to the one to Wei Dai above. Which is to agree that there are certain kinds of improvements that generate less risk of misalignment but it's hard to be certain. It seems like those paths are (1) less likely to produce transformational improvements in capabilities than other, more aggressive, changes and (2) not the kinds of changes we usually worry about in the arguments for human-AI risk, such that the risks remain largely symmetric. But maybe I'm missing something here!
This seems right to me, and the essay could probably benefit from saying something about what counts as self-improvement in the relevant sense. I think the answer is probably something like "improvements that could plausibly lead to unplanned changes in the model's goals (final or sub)." It's hard to know exactly what those are. I agree it's less likely that simply increasing processor speed a bit would do it (though Bostrom argues that big speed increases might). At any rate, it seems to me that whatever the set includes, it will be symmetric as between human-produced and AI-produced improvements to AI. So for the important improvements--the ones risking misalignment--the arguments should remain symmetrical.
[Note: This post was written by Peter N. Salib. Dan H assisted me in posting to Alignment Forum, but no errors herein should be attributed to him. This is a shortened version of a longer working paper, condensed for better readability in the forum-post format. This version assumes familiarity with standard arguments around AI alignment and self-improvement. The full 7,500 word working paper is available here. Special thanks to the Center for AI Safety, whose workshop support helped to shape the ideas below.]
Many accounts of existential risk (xrisk) from AI involve self-improvement. The argument is that, if an AI gained the ability to self-improve, it would. Improved capabilities are, after all, useful for achieving essentially any goal. Initial self-improvement could enable further self-improvement. And so on, with the...
A few people have pointed out this question of (non)identity. I've updated the full draft in the link at the top to address it. But, in short, I think the answer is that, whether an initial AI creates a successor or simply modifies its own body of code (or hardware, etc.), it faces the possibility that the new AI failed to share its goals. If so, the successor AI would not want to revert to the original. It would want to preserve its own goals. It's possible that there is some way to predict an emergent value drift just before it happens and cease improvement. But I'm not sure it would be, unless the AI had solved interpretability and could rigorously monitor the relevant parameters (or equivalent code).