I'm writing a paper about an idea I previously outlined for addressing false positives in AI alignment research. This is the second completed draft of one of the subsections, which argues for adopting a particular, necessary hinge proposition to reason about aligned AGI (first subsection here). I'd appreciate feedback on this subsection, especially on whether you agree with the line of reasoning and whether you think I've ignored anything important that should be addressed here. Thanks!

Since AGI alignment is necessarily alignment of AGI, alignment schemes can depend on the dispositions of AGI, and one disposition AGI has is toward subjective experience and mental phenomena (Adeofe, 1997), (Nagel, 1974). Whether or not we expect AGI to realize this disposition matters because it determines which types of alignment schemes can be considered: an AGI without a mental aspect can only be influenced by modifying its algorithms and manipulating its behavior, whereas an AGI with a mind can also be influenced by engaging with its perceptions and understanding of the world (Dreyfus, 1978). In other words, we might say mindless AGI can be aligned only by algorithmic and behavioral methods, whereas mindful AGI can also be aligned by philosophical methods that work on its epistemology, ontology, and axiology (Brentano, 1995). It's unclear what we should expect about the mentality of future AGI, though, because we are presently uncertain about mental phenomena in general (cf. the work of Chalmers and Searle for modern, popular, and opposing views on the topic), so we are forced to speculate about mental phenomena in AGI when we reason about alignment (Chalmers, 1996), (Searle, 1984).

Note, though, that this uncertainty may not be fundamental (Dennett, 1991). For example, if materialist or functionalist attempts to explain mental phenomena prove adequate, perhaps because they lead to the development of conscious AGI, then we may come to agree on what mental phenomena are and how they work (Oizumi, Albantakis, and Tononi, 2014). If they don't, though, we'll likely be left with metaphysical uncertainty around mental phenomena that's rooted in the epistemic limitations of perception (Husserl, 2014). Regardless of how uncertainty about mental phenomena might later be resolved, it currently creates a need for pragmatically making assumptions about them in our reasoning about alignment. In particular, we want to know whether or not we should design alignment schemes that assume a mind, even if we expect mental phenomena to be reducible to other phenomena. Given that we remain uncertain and cannot dismiss the possibility of mindful AGI, what we decide depends on how likely alignment schemes are to succeed and avoid false positives conditional on AGI having the capacity for mental phenomena. The choice is then between designing alignment schemes that work without reference to mind and designing schemes that engage with it.

If we suppose AGI do not have minds, whether because we believe they have none or because we believe any minds they have are inaccessible to us or not causally relevant to alignment, then alignment schemes can only address the algorithms and behavior of AGI. This would be to address alignment in a world where all AGIs are treated as p-zombies, i.e. beings without mental phenomena (Kirk, 1974). Now suppose this assumption is false and AGI do have minds. Then our alignment schemes that work only on algorithms and behavior would be expected to continue to work, since they function without regard to the mental phenomena of AGI, making the minds of AGI irrelevant to alignment. This suggests there is little risk of false positives from supposing AGI do not have minds.

If we suppose AGI do have minds, then alignment schemes can also use philosophical methods to address the values, goals, models, and behaviors of AGI. Such schemes would likely take the form of ensuring that updates to an AGI's ontology and axiology converge on and maintain alignment with human interests (de Blanc, 2011), (Armstrong, 2015). Now suppose this assumption is false and AGI do not have minds. Then our alignment schemes that employ philosophical methods will likely fail because they attempt to address mechanisms of action not present in AGI. This suggests that supposing AGI have minds carries a risk of false positives proportional to the likelihood that we do not build mindful AGI.

From this analysis it seems we should suppose mindless AGI when designing alignment schemes so as to reduce the risk of false positives. Note, however, that the analysis does not consider the likelihood of success at aligning AGI using only algorithmic and behavioral methods. That is, all else may not be equal between these two assumptions: the one with the lower risk of false positives might not be the better choice if we have additional information that leads us to believe that alignment of mindful AGI is much more likely to succeed than alignment of mindless AGI. It appears that we have such information in the form of Goodhart's curse and the failure of good old-fashioned AI (GOFAI).

Goodhart's curse says that when optimizing for the measure of a value, the optimization process will implicitly maximize divergence of the measure from the value (Yudkowsky, 2017). This observation follows from the combination of Goodhart's law and the optimizer's curse (Goodhart, 1984), (Smith and Winkler, 2006). This tendency of measure and value to diverge under optimization results in a phenomenon known as "Goodharting", and it takes myriad forms that affect alignment (Manheim and Garrabrant, 2018). In particular, Goodharting poses a problem for behavioral alignment schemes because to optimize behavior it is necessary to measure behavior and optimize on that measure. Consequently, it appears behavioral methods are unlikely to be capable of producing aligned AGI on their own. This is further supported by both the historical failure to align humans with arbitrary values using behavioral optimization methods and the widespread presence of Goodharting in behaviorally controlled, evolving computer systems (Scott, 1999), (Lehman et al., 2018).
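The optimizer's curse component of this claim can be illustrated with a small simulation. The following is only a toy sketch and not drawn from the cited works: it assumes each option has a true value drawn from a standard normal distribution, that the measure is the true value plus independent Gaussian noise, and that "optimization" means picking the option with the highest measure. The function name `mean_measure_value_gap` and all parameters are hypothetical.

```python
import random
import statistics


def mean_measure_value_gap(n_options, noise_sd=1.0, n_trials=2000, seed=0):
    """Average (measure - true value) of the option chosen by maximizing the measure.

    Toy model (an assumption, not taken from the cited papers):
    true value ~ N(0, 1), measure = true value + N(0, noise_sd).
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_trials):
        true_values = [rng.gauss(0.0, 1.0) for _ in range(n_options)]
        measures = [v + rng.gauss(0.0, noise_sd) for v in true_values]
        # "Optimize" by selecting the option whose measure is highest.
        best = max(range(n_options), key=lambda i: measures[i])
        gaps.append(measures[best] - true_values[best])
    return statistics.mean(gaps)


if __name__ == "__main__":
    # More options stand in for stronger optimization pressure.
    for n in (1, 5, 50):
        print(n, round(mean_measure_value_gap(n), 3))
```

Under these assumptions, the gap between the measure and the value of the selected option is near zero when there is nothing to select among and grows as the number of candidate options increases, which is the sense in which selecting on a measure also selects for divergence from the value.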

Further, past research on GOFAI (AI systems based on symbol manipulation) suggests algorithmic methods of alignment are likely to be too complex to work for the same reason that GOFAI was itself unworkable, namely that it proved infeasible for humans to program systems with enough complexity and specificity to do anything more than perform meaningless manipulations (Haugeland, 1985), (Agre, 1997). In recent years AI researchers have surpassed GOFAI only by switching to designs where humans specify relatively simple computations to be performed and allow the AI to apply what Moravec called "raw power" to large data sets to achieve results (Russell and Norvig, 2009), (Moravec, 1976). This suggests that attempts to align AGI by algorithmic means are likely to also prove too complex for humans to solve, leaving us with only philosophical methods of alignment and thus necessitating mindful AGI.

This paints a bleak picture for the possibility of aligning mindless AGI, since behavioral methods of alignment are likely to result in divergence from human values and algorithmic methods are too complex for us to succeed at implementing. This leads us to conclude that, although assuming mindful AGI carries a greater risk of false positives than assuming mindless AGI when all else is equal, all else is not equal: mindless AGI is less likely to be successfully aligned because algorithmic and behavioral alignment mechanisms are unlikely to work. We therefore have no choice but to take on the risks associated with assuming mindful AGI when designing alignment schemes.
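One rough way to state this tradeoff is as a comparison of success probabilities. The symbols here are introduced only for illustration and do not appear elsewhere in the draft: $p$ is the probability that AGI is mindful, $s_m$ is the probability that philosophical methods succeed given a mindful AGI, and $s_0$ is the probability that algorithmic and behavioral methods succeed, which by the argument above does not depend on whether AGI has a mind. The sketch also makes the simplifying assumption that philosophical methods fail outright on mindless AGI.

```latex
\begin{align*}
P(\text{aligned} \mid \text{design for mindless AGI}) &\approx s_0, \\
P(\text{aligned} \mid \text{design for mindful AGI}) &\approx p \, s_m, \\
\text{prefer designing for mindful AGI} &\iff p \, s_m > s_0 .
\end{align*}
```

On this reading, the false-positive risk of assuming mindful AGI corresponds to the factor $1 - p$, while Goodhart's curse and the failure of GOFAI are reasons to think $s_0$ is small enough for the inequality to hold.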


  • Leke Adeofe. Artificial Intelligence and Subjective Experience. In Proceedings of Southcon 95. IEEE, 1997.
  • Thomas Nagel. What Is It Like to Be a Bat? The Philosophical Review 83, 435. JSTOR, 1974.
  • Hubert L. Dreyfus. What Computers Can’t Do: The Limits of Artificial Intelligence. HarperCollins, 1978.
  • Franz Brentano. Psychology from an Empirical Standpoint. Routledge, 1995.
  • David Chalmers. The Conscious Mind: In Search of a Fundamental Theory. Oxford University Press, 1996.
  • John R. Searle. Minds, Brains, and Science. Harvard University Press, 1984.
  • Daniel C. Dennett. Consciousness Explained. Little, Brown and Co., 1991.
  • Masafumi Oizumi, Larissa Albantakis, Giulio Tononi. From the Phenomenology to the Mechanisms of Consciousness: Integrated Information Theory 3.0. PLoS Computational Biology 10, e1003588. Public Library of Science (PLoS), 2014.
  • Edmund Husserl. Ideas for a Pure Phenomenology and Phenomenological Philosophy: First Book: General Introduction to Pure Phenomenology. Hackett Publishing Company, Inc., 2014.
  • Robert Kirk. Sentience and Behaviour. Mind 83, 43–60. Oxford University Press, Mind Association, 1974.
  • Peter de Blanc. Ontological Crises in Artificial Agents’ Value Systems. 2011.
  • Stuart Armstrong. Motivated Value Selection for Artificial Agents. In Artificial Intelligence and Ethics: Papers from the 2015 AAAI Workshop. 2015.
  • Eliezer Yudkowsky. Goodhart’s Curse. 2017.
  • Charles A. E. Goodhart. Problems of Monetary Management: The UK Experience. In Monetary Theory and Practice, 91–121. Macmillan Education UK, 1984.
  • James E. Smith, Robert L. Winkler. The Optimizer’s Curse: Skepticism and Postdecision Surprise in Decision Analysis. Management Science 52, 311–322. Institute for Operations Research and the Management Sciences (INFORMS), 2006.
  • David Manheim, Scott Garrabrant. Categorizing Variants of Goodhart’s Law. 2018.
  • James C. Scott. Seeing Like a State: How Certain Schemes to Improve the Human Condition Have Failed. Yale University Press, 1999.
  • Joel Lehman et al. The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities. 2018.
  • John Haugeland. Artificial Intelligence: The Very Idea. MIT Press, 1985.
  • Philip E. Agre. Computation and Human Experience. Cambridge University Press, 1997.
  • Stuart Russell, Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson, 2009.
  • Hans Moravec. The Role of Raw Power in Intelligence. 1976.