Based on the method I used in "Robustness to Fundamental Uncertainty in AGI Alignment", we can analyze various proposals for building aligned AI, determine which appear to best trade off false positive risk against false negative risk, and recommend those that are conservatively safest. We can do this at various levels of granularity and for alignment methods specified at various levels of detail. That is, we can consider AI alignment as a whole or various sub-problems within it, like value learning and inner alignment, and we can consider high-level approaches to alignment or more specific proposals with more of the details worked out. For this initial post in what may become a series, I'll compare high-level approaches to addressing alignment as a whole.
Some ground rules:
I'll be comparing three high-level approaches to AI alignment that I term Iterated Distillation and Amplification (IDA), Ambitious Value Learning (AVL), and Normative Embedded Agency (NEA). By each of these, for the purposes of this post, I'll mean the following, which I believe captures the essence of these approaches but obviously leaves out lots of specifics about various ways they may be implemented.
I do not think these form an exhaustive categorization of all possible alignment schemes; rather, they are three that I have enough familiarity with to reason about and that I consider to be the most promising approaches people are investigating. There is at least a fourth approach I'm not considering here because I haven't thought about it enough (building AI that is aligned because it emulates how human brains function), and there are probably others different enough that they warrant their own category.
For each of the three approaches we must consider their false positive risks. Once we have done that, we can consider the risks of each approach relative to the others. Remember, here I'll be basing my analysis on my summary of these approaches given above, not on any specific alignment proposal.
I'll give some high-level thoughts on why each may fail and then make a specific statement summing each one up. I won't go into a ton of detail, either because others already have (I'll link when that's the case) or because these seem like fairly obvious observations that most readers of the Alignment Forum will readily agree with. If that's not the case, please bring it up in the comments and we can go into more detail.
Given the false positive risks identified above, we can now try to rank the approaches by asking whether any of those risks dominate the others, i.e. whether the false positive risks associated with one approach necessarily carry a higher chance of failure than those associated with another. I believe we can, and I make the following arguments.
Based on the above analysis, I'd argue that, all else equal, Ambitious Value Learning is safer than Iterated Distillation and Amplification, which in turn is safer than Normative Embedded Agency, as approaches to building aligned AI in terms of false positive risk. In short, risk(AVL) < risk(IDA) < risk(NEA), where risk here means false positive risk.
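To make the shape of this dominance argument concrete, here is a minimal sketch of how pairwise "this approach's false positive risk dominates that one's" claims can be turned into a ranking. The function and variable names are hypothetical, and the pairwise claims are simply hard-coded to mirror the conclusion stated above; the code does not derive them, and the real work of the method is in arguing for each claim. It assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Pairwise dominance claims as (riskier, safer) pairs.
# ASSUMPTION: these two pairs are hard-coded to match the post's conclusion;
# they are illustrative inputs, not something the code derives.
DOMINANCE_CLAIMS = [
    ("NEA", "IDA"),  # NEA's false positive risks dominate IDA's
    ("IDA", "AVL"),  # IDA's false positive risks dominate AVL's
]


def rank_by_false_positive_risk(claims):
    """Order approaches from lowest to highest false positive risk.

    Each claim (riskier, safer) is treated as an edge saying that `safer`
    must appear before `riskier`. Inconsistent (cyclic) claims raise
    graphlib.CycleError.
    """
    graph = {}
    for riskier, safer in claims:
        # TopologicalSorter expects a mapping of node -> predecessors.
        graph.setdefault(riskier, set()).add(safer)
        graph.setdefault(safer, set())
    return list(TopologicalSorter(graph).static_order())


ranking = rank_by_false_positive_risk(DOMINANCE_CLAIMS)
print(" < ".join(f"risk({name})" for name in ranking))
```

Because dominance is only a partial order, a set of claims that leaves two approaches incomparable would admit more than one valid ordering; the chain used here has exactly one.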
I think there's a lot that could be better about the above analysis. In particular, it's not very specific, and you might argue that I stood up straw versions of each approach that I then knocked down in ways that are not indicative of how specific proposals would work. I didn't get more specific because I'm more confident I can reason about high-level approaches than about the details of specific proposals, and it's unclear which specific proposals are worth learning in enough detail to perform this evaluation, so as a start this seemed like the best option.
We also have the problem that NEA is not as real an approach as IDA or AVL: the research I cited as the basis for the NEA approach is more likely to augment the IDA or AVL approaches than to offer an alternative to them. Still, I find including the NEA "approach" interesting, if for no other reason than that it points to a class of solutions researchers of the past would have proposed if they were trying to build aligned GOFAI, for example.
Finally, as I said above, my main goal here is to demonstrate the method, not to strongly make the case that AVL is safer than IDA (even though on reflection I personally believe this). My hope is that this inspires others to do more detailed analyses of this type on specific methods in order to recommend the safest-seeming alignment mechanisms, or that it generates enough interest that I'm encouraged to do that work myself. That said, feel free to fight out AVL vs. IDA at the object level in the comments if you like, but if you do, at least try to do so within the framework presented here.