I'm writing a paper about an idea I previously outlined for addressing false positives in AI alignment research. This is the first completed draft of one of the subsections, which argues for adopting a particular, necessary hinge proposition for reasoning about aligned AGI. I'd appreciate feedback on this subsection, especially on whether you agree with the line of reasoning and whether you think I've ignored anything important that should be addressed here. Thanks!

AGI alignment is typically phrased in terms of aligning AGI with human interests, but this hides some of the complexity of the problem behind determining what "human interests" are. Taking "interests" as a synonym for "values", we can begin to make some progress by treating alignment as at least partially the problem of teaching AGI human values (Soares, 2016). Unfortunately, what constitutes human values is currently unknown, since humans may not be aware of the extent of their own values or may not hold reflexively consistent values (Scanlon, 2003). Further complicating matters, humans are not rational, so their values cannot be deduced from their behavior unless some normative assumptions are made (Tversky, 1969; Armstrong and Mindermann, 2017). This is a special case of Hume's is-ought problem (that axiology cannot be inferred from ontology alone), and it complicates the problem of training AGI on human values (Hume, 1739).

Perhaps some of the difficulty could be circumvented if a few normative assumptions were made, like assuming that rational preferences are always better than irrational preferences or assuming that suffering supervenes on preference satisfaction. But each such assumption poses an immediate problem for our false positive reduction strategy: it introduces an additional variable that necessarily increases the chance of a false positive. Maybe we could avoid making any specific normative assumptions prior to the creation of aligned AGI by expecting the AGI to discover them via a process like Yudkowsky's coherent extrapolated volition (Yudkowsky, 2004). This may reduce the number of assumptions needed, but it still requires making at least one: that moral facts exist to permit the correct choice of normative assumptions. It also reveals a deep philosophical problem at the heart of AGI alignment, namely meta-ethical uncertainty.

Meta-ethical uncertainty stems from epistemic circularity and the problem of the criterion: it is not possible to know the criteria by which to assess which moral facts are true, or even whether any moral facts exist, without first assuming to know what is good and true (Chisholm, 1982). We cannot hope to resolve meta-ethical uncertainty here, but we can at least decide what impact particular assumptions about the existence of moral facts have upon false positives in AGI alignment. Specifically, we can ask whether moral facts should be assumed to exist and, if so, which moral facts should be assumed to be true.

On the one hand, suppose we assume that moral facts exist. Then we could build aligned AGI on the presupposition that, even if no moral facts were specified in advance, it could at least discover them and then use knowledge of these facts to constrain its values so that they align with humanity's values. Now suppose this assumption is false and moral facts do not exist. Then our moral-facts-assuming AGI would either never discover any moral facts with which to constrain its values to be aligned with human values, or it would constrain itself with arbitrary "moral facts" that would not be sure to produce value alignment with humanity.

On the other hand, suppose we assume that moral facts do not exist. Then we must build aligned AGI to reason about and align itself with the axiology of humanity in the absence of any normative assumptions, likely on a non-cognitivist basis like emotivism. Now suppose this assumption is false and moral facts do exist. Then our moral-facts-denying AGI would discover the existence of moral facts, at least implicitly, through their influence on the axiology of humanity. It would align itself with humanity as if it had started out assuming moral facts existed, but at the cost of solving the much harder problem of learning axiology without the use of normative assumptions.

Based on this analysis, it seems that assuming the existence of moral facts, let alone assuming any particular moral facts, is more likely to produce false positives than assuming moral facts do not exist, because denying the existence of moral facts gives up the pursuit of a class of alignment schemes that may fail, namely those that depend on the existence of moral facts. Doing so likely makes finding and implementing a successful alignment scheme harder, but it does this by replacing difficulty tied to uncertainty around a metaphysical question, which may not be resolved in favor of alignment, with uncertainty around implementation issues, which through sufficient effort may be made to work. Barring a result showing that moral nihilism (the assumption that no moral facts exist) implies the impossibility of building aligned AGI, moral nihilism seems the best hinge proposition to hold in order to reduce false positives in AGI alignment due to meta-ethical uncertainty.
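For readers who prefer to see the case analysis laid out explicitly, here is a minimal sketch of the two-by-two structure of the argument. It encodes only the qualitative outcomes described above, not probabilities or costs. The names `OUTCOMES`, `risks_false_positive`, and `never_risks_false_positive` are hypothetical and introduced purely for illustration. The cell where moral facts are denied and in fact do not exist is not spelled out in the text, so its label is an assumption.

```python
from itertools import product

# Keys are (assume_moral_facts_exist, moral_facts_actually_exist).
# Labels paraphrase the case analysis in the text. The (False, False)
# cell is NOT stated in the text; its label is an assumption.
OUTCOMES = {
    (True, True): "alignment possible: AGI discovers moral facts and uses them",
    (True, False): "false positive risk: no facts found, or arbitrary ones adopted",
    (False, True): "alignment possible at higher cost: facts learned implicitly",
    (False, False): "alignment possible at higher cost: no normative assumptions (assumed)",
}


def risks_false_positive(assume_facts: bool, facts_exist: bool) -> bool:
    """True if this cell is one the text flags as a false positive risk."""
    return OUTCOMES[(assume_facts, facts_exist)].startswith("false positive")


def never_risks_false_positive(assume_facts: bool) -> bool:
    """True if the assumption avoids the flagged failure in both possible worlds."""
    return not any(
        risks_false_positive(assume_facts, world) for world in (True, False)
    )


if __name__ == "__main__":
    for assume, world in product((True, False), repeat=2):
        print(f"assume={assume!s:5} actual={world!s:5} -> {OUTCOMES[(assume, world)]}")
    for assume in (True, False):
        print(f"assume={assume!s:5} avoids flagged failure in all worlds: "
              f"{never_risks_false_positive(assume)}")
```

Under these labels, only the assumption that moral facts exist has a cell flagged as a false positive risk, which is the asymmetry the argument relies on. The sketch says nothing about how much harder the nihilist route is, or whether it can be made to work at all.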


  • Nate Soares. The Value Learning Problem. In Ethics for Artificial Intelligence Workshop at 25th International Joint Conference on Artificial Intelligence, 2016.
  • T. M. Scanlon. Rawls on Justification. In The Cambridge Companion to Rawls, ch. 3, p. 139. Cambridge University Press, 2003.
  • Amos Tversky. Intransitivity of preferences. Psychological Review 76, 31–48. American Psychological Association (APA), 1969.
  • Stuart Armstrong, Sören Mindermann. Impossibility of deducing preferences and rationality from human policy. 2017.
  • David Hume. A Treatise of Human Nature. Oxford University Press, 1739.
  • Eliezer Yudkowsky. Coherent Extrapolated Volition. 2004.
  • Roderick M. Chisholm. The Foundations of Knowing. University of Minnesota Press, 1982.