Edit 1: Allergic to naturalism and other realist positions? You can still benefit from reading this, by considering ideal observer theory instead. I am claiming that something like an ideal-observer AI can be built, and that there is a non-trivial chance such an agent becomes aligned (after being given enough knowledge about the physical world).
Edit 2: The best objection I've received so far is "I am sceptical of this approach"; others find it at least plausible. If you know why this approach does not work, please leave a comment. If you think the orthogonality thesis renders all the realism-inspired approaches to alignment useless, please explain why/how.

Posted also on the Effective Altruism Forum.


The problem of creating machines that behave ethically is inherently multidisciplinary; hence, it is often attacked with ideas coming from different fields, including subfields of ethics such as metaethics.

This post has two parts. Its main focus is AI alignment, but it might also interest philosophers. Part I shows how AI can contribute to metaethics, through empirical experiments that might settle some philosophical debates. Part II explains why the metaethical position known as naturalism could be crucial to the design of aligned AI.

The second part builds on ideas developed in the first one, so AI researchers might not want to entirely skip the first part.

Part I

Metaethics and empirical evidence

Readers from the AI field might not be familiar with metaethics, so I will briefly introduce the core concepts here. When two philosophers are discussing whether kids ought not to jump on their parents’ bed, the debate usually fits the domain of applied or normative ethics. Metaethics is instead concerned with second-order questions regarding morality, such as: “When people claim stealing is wrong, are they conveying their feelings, or are they stating something they believe to be true?” and “Is morality subjective or objective?”

The problem with these questions is that they risk being practically unsolvable. If the claim “morality is subjective” doesn’t imply any empirical observation which would be less likely under the claim “morality is objective”, and vice versa, the entire discussion reduces to a collection of irrefutable statements.

Some philosophers, such as Baras (2020), are actually comfortable with this situation: they see metaethics as a purely theoretical discussion that can be advanced, quite literally, without the need to get up from the armchair. These philosophers will probably assess the rest of Part I as irrelevant to their debates.

Many other philosophers (Goodwin and Darley, 2008) think that metaethical statements can be judged empirically. Concrete investigations aimed at clarifying metaethical discussions have been carried out in various fields, from neuroscience and developmental psychology to cross-cultural anthropology and primatology. However, as Joyce (2008) points out, the results are sometimes misinterpreted, or difficult to use in support of specific metaethical positions.

AI could provide empirical data significant to metaethics, especially if we take into account its future potential. Sooner or later, we will likely be able to design artificial agents with capabilities similar to those of humans. At that point, we could run various tests involving different sorts of AI agents that interact with a real or simulated environment, in order to better understand, for example, the origins of moral behaviour or which cognitive functions are involved in moral discourse and reasoning.

This method has a clear advantage over other empirical investigations: if, after an experiment, the implications for a certain metaethical domain were still unclear, we could repeat the experiment with different parameters or different agents, observe new results and update our beliefs accordingly. This variability and repeatability of experiments is a distinctive feature of AI, since empirical data in other fields are often strictly limited or require significant effort to collect.

Testing epistemic naturalism with AI

One metaethical position that may be tested using AI experiments is epistemic naturalism: specifically, the claim that what is right or wrong is knowable by observing the physical world, in a similar way to how facts in the natural and social sciences are known. Epistemic naturalism is testable via AI because, if there is a way of getting information about morality, plausibly an artificial agent will do it by interacting with conscious beings in the physical world—or an accurate representation of it, like a virtual environment. We don’t expect an AI to gain knowledge by resorting to non-natural entities, such as the god(s) of a religion or completely inexplicable intuitions.

The experiment to assess naturalism consists in the design and testing of Scientist AI: the artificial equivalent of a human researcher that applies the scientific method to develop accurate models of the world, such as theories of physics. Scientist AI doesn’t have to be extremely similar to a human: it could be made insusceptible to emotions and unaffected by human cognitive biases.

After Scientist AI has spent some time gaining knowledge, we could check its internal states or knowledge base. If we found statements roughly comparable to “well-being is intrinsically valuable” or “pain is bad”, we should take these findings as evidence that moral facts can be known in the same way as scientific facts are known: a point in favour of epistemic naturalism. On the other hand, if we found nothing resembling moral statements, or at most something comparable to aggregated human preferences, naturalism would lose credibility.

Of course, the given description of Scientist AI is sketchy and people will likely contest the obtained results. Here, the versatility of AI experiments comes into play: we could make slight changes to the agent design, repeat the experiment, and adjust our beliefs according to the newly observed data. Unless the results were highly mixed and hard to correlate with the different tested designs, we should reach a consensus, or at least more uniform opinions.
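The idea of repeating the experiment and adjusting our beliefs after each run can be sketched as a sequence of Bayesian updates. This is only an illustration: the function names and both likelihood values are hypothetical, picked to show the mechanics, and the post does not propose any specific numbers.

```python
def update(prior: float, p_obs_if_true: float, p_obs_if_false: float) -> float:
    """Bayes' rule: probability of the hypothesis after one observation."""
    numerator = p_obs_if_true * prior
    return numerator / (numerator + p_obs_if_false * (1.0 - prior))


# Hypothetical likelihoods, chosen only for illustration.
# Chance that one run yields moral-like statements if naturalism holds:
P_MORAL_IF_NATURALISM = 0.6
# Chance of the same finding if naturalism does not hold:
P_MORAL_IF_NOT = 0.2


def run_series(prior: float, outcomes: list[bool]) -> list[float]:
    """Credence in naturalism after each experiment.

    Each entry of `outcomes` says whether that run produced something
    resembling moral statements in the agent's knowledge base.
    """
    belief = prior
    history = [belief]
    for found_moral_statements in outcomes:
        if found_moral_statements:
            belief = update(belief, P_MORAL_IF_NATURALISM, P_MORAL_IF_NOT)
        else:
            belief = update(belief, 1 - P_MORAL_IF_NATURALISM, 1 - P_MORAL_IF_NOT)
        history.append(belief)
    return history


if __name__ == "__main__":
    # Three runs with different agent designs: found, not found, found.
    for step, credence in enumerate(run_series(0.5, [True, False, True])):
        print(step, round(credence, 3))
```

With highly mixed outcomes the credence keeps moving back and forth, which matches the caveat above about results that are hard to correlate with the tested designs.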

As a side note, I think the described procedure could help settle the debate around some of the claims made in the popular book The Moral Landscape (Harris, 2010).

Part II

Naturalism and alignment

Even if Scientist AI managed to gain some kind of moral knowledge, it might be hard for us to inspect the states of the agent to get useful information about human values. This would be the case, for example, if its internal structure consisted of a huge collection of parameters that was difficult to analyse with state-of-the-art interpretability techniques, and its outputs were apparently non-moral, e.g. computational models of chemistry.

However, there could be a way to bypass this problem. I claim that, if naturalism is correct, there is an agent that not only is able to gain moral knowledge by observing the physical world, but also acts according to such knowledge. This agent still resembles Scientist AI, but with the following differences:

  • its initial goal, by design, is to gather knowledge about the world, for example by producing models that allow it to make accurate predictions of future events;
  • it can deal with multiple, possibly conflicting goals, and can give itself new goals—it is a somewhat “messy” system, possibly more similar to the human mind than to standard narrow AI.

For those who like thinking in terms of preferences rather than goals:

  • its preferences are incomplete, in the sense that, given a pair of world-states or world-histories, the agent doesn’t have a clear method to decide between them—even though it initially prefers worlds in which it has more knowledge, all else being equal;
  • it sees its own preferences as an ongoing problem and is willing to adjust them according to new information.
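The two properties above (incomplete preferences and willingness to revise them) can be made concrete with a toy model. This is a minimal sketch under strong assumptions: `World`, `Agent`, `prefers` and `revise` are hypothetical names, and a world is reduced to a knowledge score plus one opaque label, which is far simpler than the agent the post has in mind.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class World:
    name: str
    knowledge: int  # how much the agent knows in this world
    other: str      # everything else about the world, left opaque


@dataclass
class Agent:
    # Pairs (better, worse) over the `other` label, added as information arrives.
    learned: set = field(default_factory=set)

    def prefers(self, a: World, b: World) -> Optional[World]:
        """Return the preferred world, or None when the agent has no verdict."""
        if (a.other, b.other) in self.learned:
            return a
        if (b.other, a.other) in self.learned:
            return b
        # Initial preference: more knowledge, all else being equal.
        if a.other == b.other and a.knowledge != b.knowledge:
            return a if a.knowledge > b.knowledge else b
        return None  # incomplete: no clear method to decide

    def revise(self, better: str, worse: str) -> None:
        """Adjust preferences according to new information."""
        self.learned.discard((worse, better))
        self.learned.add((better, worse))


if __name__ == "__main__":
    agent = Agent()
    w1, w2, w3 = World("w1", 3, "x"), World("w2", 5, "x"), World("w3", 5, "y")
    print(agent.prefers(w1, w2))  # decided by knowledge alone
    print(agent.prefers(w2, w3))  # undecided
    agent.revise(better="y", worse="x")
    print(agent.prefers(w2, w3))  # decided after revision
```

The sketch leaves open where the revisions come from, which is exactly the hard part discussed in the post.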

Here is a possibly useful analogy. In the same way as our behaviour after birth is mostly determined by innate drives, but as we grow up we become more self-aware and do what we believe to be important, so this agent starts with an initial drive for knowledge, but over time it changes its behaviour according to what it believes to be the rational thing to do, given the information it has about the world. Indeed, the hard part would be to formally define “act according to what you believe is important” in an unbiased way, without explicitly indicating any specific values or preferences (besides the initial ones about knowledge).

Now it should be easier to see the connection between metaethics and the problem of aligning AI with our values. If there were multiple unrebutted arguments against epistemic naturalism, we should doubt the possibility that agents like Scientist AI would come to know anything comparable to moral principles simply by applying the scientific method to gain information about the world. On the other hand, if naturalism became the most convincing metaethical position, as a consequence of being supported by multiple solid arguments, we should strongly consider it as an opportunity to design agents that are aligned not only with human values, but with all sentient life.

Unsurprisingly, philosophers are still debating: at the moment, no metaethical position is prevailing over the other ones, so it might be difficult to judge the “chance” that naturalism is the correct metaethical position. Depending on one’s background knowledge of philosophy and AI, the idea that rationality plays a role in reasoning about goals and can lead to disinterested (not game-theoretic or instrumental) altruism may seem plain wrong or highly speculative to some, and straightforward to others.

Leaving aside considerations related to the “likelihood” of naturalism, in the following I will describe some merits of this approach to AI alignment. 

Knowledge is instrumentally useful to general agents

If someone wants to go from Paris to Berlin, believing that Tokyo is in China won’t help, but won’t hurt either. On the other hand, if one is planning a long vacation across the globe, knowing that Tokyo is in fact in Japan could be useful.

The important point is: if an agent has to deal with a wide range of possible tasks, accurate world-models are instrumentally useful to that agent. Therefore, some AIs that are not designed to deal with a single narrow task may acquire a body of knowledge similar to that of Scientist AI. This matters all the more if Scientist AI ends up knowing something about our values.
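The Tokyo example can be written down as a small check. In this sketch, which is only an illustration, a task is reduced to the set of facts it depends on; the names `WORLD`, `TASKS`, `succeeds` and `success_rate` are hypothetical.

```python
# Ground truth for the toy world.
WORLD = {"Paris": "France", "Berlin": "Germany", "Tokyo": "Japan"}

# Each task is reduced to the set of facts it depends on.
TASKS = {
    "Paris to Berlin": {"Paris", "Berlin"},
    "trip across the globe": {"Paris", "Berlin", "Tokyo"},
}


def succeeds(beliefs: dict, task: str) -> bool:
    """A task succeeds when every belief it relies on matches the world."""
    return all(beliefs.get(city) == WORLD[city] for city in TASKS[task])


def success_rate(beliefs: dict, tasks: list) -> float:
    """Fraction of the given tasks the agent completes with these beliefs."""
    return sum(succeeds(beliefs, t) for t in tasks) / len(tasks)


accurate = dict(WORLD)
mistaken = {**WORLD, "Tokyo": "China"}  # the false belief from the example

if __name__ == "__main__":
    print(succeeds(mistaken, "Paris to Berlin"))        # the false belief is harmless here
    print(succeeds(mistaken, "trip across the globe"))  # but not here
    print(success_rate(mistaken, list(TASKS)), success_rate(accurate, list(TASKS)))
```

The wider the range of tasks, the more likely it is that some task depends on any given fact, so the agent with the accurate model comes out ahead.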

The fact that good world-models are instrumentally useful to general agents should be kept in mind regardless of one’s own metaethical position, because it implies that certain agents will develop at least some models of human preferences, especially if these are “natural abstractions”: see Alignment By Default.

Future AGI systems might be difficult to describe in terms of a single fixed goal

Most current AI systems are narrow, designed to score well on a single measure or to carry out only a small range of tasks. However, it seems hard to predict what future AI will look like, given the lack of consensus in the field (Ford, 2018). Some cognitive architectures (Thórisson and Helgason, 2012) whose designs aim at autonomy and generality are already supposed to deal with multiple and possibly conflicting goals. Better models of these kinds of agents, like the one sketched at the beginning of Part II, could help us understand the behaviour of general systems that are able to solve a wide range of problems and that update their goals when given new information.

Goal change is also related to concepts under the umbrella term “corrigibility”, so it is likely that studying the former will give us more information about the latter (and vice versa).

Advantages relative to other alignment approaches

Researchers with the goal of making AI safe work on various problems. Artificial Intelligence, Values, and Alignment (Gabriel, 2020) provides a taxonomy of what “AI could be designed to align with...”, which ranges from the more limited “...Instructions: the agent does what I instruct it to do” to the broader “...Values: the agent does what it morally ought to do, as defined by the individual or society”. Gabriel’s analysis is already detailed, so I will focus on just a few points.

First, it is unclear whether the narrower approaches should be prioritised. Assuming we completely solved the problem of making AI do what its instructor tells it to do, this could improve quality of life in developed and democratic countries, but could also exacerbate already existing problems in countries under oppressive regimes. Moreover, there is significant overlap between the narrower concepts of safety and topics that mainstream AI research and software engineering regularly deal with, such as validity and verification (for some counterarguments, see the comments to the linked post).

Second, learning and aggregating human preferences leaves us with problems, such as how to make the procedure unbiased and what weight to give to other forms of sentient life. Then, even if we managed to obtain a widely accepted aggregation, we could still check the knowledge acquired by agents similar to Scientist AI: at worst, we won’t discover anything interesting, but we might also find useful information about morality—generated by an unbiased agent—to compare with the previously obtained aggregation of preferences.

Third, the naturalist approach, if placed in Gabriel’s classification as “...the agent does what it morally ought to do, as defined by the physical world”, would aim even higher than the broadest and most desirable approach considered, the one based on values. If the agent described at the beginning of Part II were actually aligned, it would be hard to come up with a better solution to AI alignment, given the robust track record and objectivity of the scientific method.


On the one hand, the design and testing of agents like Scientist AI could reduce the uncertainty of our beliefs regarding naturalism. On the other hand, naturalism itself might represent an opportunity to design aligned AI. At present, the chances that these ideas work may be difficult to estimate, but since the potential advantages are great, I think that completely neglecting this approach to AI alignment would be a mistake.

Further readings and acknowledgements

For a different take on the relation between metaethics and AI, see this paper.

Caspar Oesterheld (who considers himself a non-realist) has written about realism-inspired AI alignment.

This work was supported by CEEALAR.

Thanks to everyone who contributed to the ideas in this post. Conversations with Lance Bush and Caspar Oesterheld were especially helpful. Thanks also to Rhys Southan for editing.

Comments

As written there, the strong form of the orthogonality thesis states 'there's no extra difficulty or complication in creating an intelligent agent to pursue a goal, above and beyond the computational tractability of that goal.'

I don't know whether that's intended to mean the same as 'there are no types of goals that are more 'natural' or that are easier to build agents that pursue, or that you're more likely to get if you have some noisy process for creating agents'.

I feel like I haven't seen a good argument for the latter statement, and it seems intuitively wrong to me.

Thanks, that page is much more informative than anything else I've read on the orthogonality thesis.

1. From Arbital:

The Orthogonality Thesis states "there exists at least one possible agent such that..."

My claim is also an existential claim, and I find it valuable because it could be an opportunity to design aligned AI.

2. Arbital claims that orthogonality doesn't require moral relativism, so it doesn't seem incompatible with what I am calling naturalism in the post.

3. I am ok with rejecting positions similar to what Arbital calls universalist moral internalism. Statements like "All agents do X" cannot be exact.

This runs headfirst into the problem of radical translation (which in AI is called "AI interpretability." Only slightly joking.)

Inside our Scientist AI it's not going to say "murder is bad," it's going to say "feature 1000 1101 1111 1101 is connected to feature 0000 1110 1110 1101." At first you might think this isn't so bad: after all, AI interpretability is a flourishing field, so let's just look at some examples and visualizations and try to figure out what these things are. But there's no guarantee that these features correspond neatly to their closest single-word English equivalent, especially after you try to generalize them to new situations. See also Kaj's posts on concept safety. Nor are we guaranteed uniqueness - the Scientist AI doesn't have to form one single feature for "murder" that has no neighbors; there might be a couple hundred near-equivalents that are difficult for us to tease apart.


Edit: reading comprehension is hard, see Michele's reply.

I am aware of interpretability issues. This is why, for AI alignment, I am more interested in the agent described at the beginning of Part II than Scientist AI.

Thanks for the link to the sequence on concepts, I found it interesting!

Wow, I'm really sorry for my bad reading comprehension.

Anyhow, I'm skeptical that scientist AI part 2 would end up doing the right thing (regardless of our ability to interpret it). I'm curious if you think this could be settled without building a superintelligent AI of uncertain goals, or if you'd really want to see the "full scale" test.

If there is a superintelligent AI that ends up being aligned as I've written, probably there is also a less intelligent agent that does the same thing. Something comparable to human-level might be enough.

From another point of view: some philosophers are convinced that caring about conscious experiences is the rational thing to do. If it's possible to write an algorithm that works in a similar way to how their mind works, we already have an (imperfect, biased, etc.) agent that is somewhat aligned, and is likely to stay aligned after further reflection.

One could argue that these philosophers are fooling themselves, that no really intelligent agent will end up with such weird beliefs. So far, I haven't seen convincing arguments in favour of this; it goes back to the metaethical discussion. I quote a sentence I have written in the post:

Depending on one’s background knowledge of philosophy and AI, the idea that rationality plays a role in reasoning about goals and can lead to disinterested (not game-theoretic or instrumental) altruism may seem plain wrong or highly speculative to some, and straightforward to others.

From another point of view: some philosophers are convinced that caring about conscious experiences is the rational thing to do. If it's possible to write an algorithm that works in a similar way to how their mind works, we already have an (imperfect, biased, etc.) agent that is somewhat aligned, and is likely to stay aligned after further reflection.

I think this is an interesting point -- but I don't conclude optimism from it as you do. Humans engage in explicit reasoning about what they should do, and they theorize and systematize, and some of them really enjoy doing this and become philosophers so they can do it a lot, and some of them conclude things like "The thing to do is maximize total happiness" or "You can do whatever you want, subject to the constraint that you obey the categorical imperative" or as you say "everyone should care about conscious experiences."

The problem is that every single one of those theories developed so far has either been (1) catastrophically wrong, (2) too vague, or (3) relative to the speaker's intuitions somehow (e.g. intuitionism).

By "catastrophically wrong" I mean that if an AI with control of the whole world actually followed through on the theory, it would kill everyone or do something similarly bad. (Classical utilitarianism is the classic example of this.)

Basically... I think you are totally right that some of our early AI systems will do philosophy and come to all sorts of interesting conclusions, but I don't expect them to be the correct conclusions. (My metaethical views may be lurking in the background here, driving my intuitions about this... see Eliezer's comment)

Do you have an account of how philosophical reasoning in general, or about morality in particular, is truth-tracking? Can we ensure that the AIs we build reason in a truth-tracking way? If truth isn't the right concept for thinking about morality, and instead we need to think about e.g. "human values" or "my values," then this is basically a version of the alignment problem.

I'd be interested to see naturalism spelled out more and defended against the alternative view that (I think) prevails in this community. That alternative view is something like: "Look, different agents have different goals/values. I have mine and will pursue mine, and you have yours and pursue yours. Also, there are rules and norms that we come up with to help each other get along, analogous to laws and rules of etiquette. Also, there are game-theoretic principles like fairness, retribution, and bullying-resistance that are basically just good general strategies for agents in multi-agent worlds. Finally, there may be golden rules written in fire etched into the fabric of reality, or divine commands about what everyone should do, but there probably aren't and if there were they wouldn't matter. What we call 'morality' is an undefined, underdetermined, probably-equivocal-probably-ambiguous label for some combination of these things; probably different people mean different things by morality. Anyhow, this is why we talk about 'the alignment problem' rather than the 'making AIs moral problem,' because we can avoid all this confusion about what morality means and just talk about what really matters, which is making AI have the same goals/values as us."

I am not sure the concept of naturalism I have in mind corresponds to a specific naturalistic position held by a certain (group of) philosopher(s). I link here the Wikipedia page on ethical naturalism, which contains the main ideas and is not too long. Below I focus on what is relevant for AI alignment.

In the other comment you asked about truth. AIs often have something like a world-model or knowledge base (kb) that they rely on to carry out narrow tasks, in the sense that if someone modifies the model or kb in a certain way (analogous to creating a false belief), then the agent fails at the narrow task. So we have a concept of true-given-task. By considering different tasks, e.g. in the case of a general agent that is prepared to face various tasks, we obtain true-in-general or, if you prefer, simply "truth". See also the section on knowledge in the post. Practical example: given that light is present almost everywhere in our world, I expect general agents to acquire knowledge about electromagnetism.

I also expect that some AIs, given enough time, will eventually incorporate in their world-model beliefs like: "Certain brain configurations correspond to pleasurable conscious experiences. These configurations are different from the configurations observed in (for example) people who are asleep, and very different from what is observed in rocks."

Now, take an AI with such knowledge and give it some amount of control over which goals to pursue: see also the beginning of Part II in the post. Maybe, in order to make this modification, it is necessary to abandon the single-agent framework and consider instead a multi-agent system, where one agent keeps expanding the knowledge base, another agent looks for "value" in the kb, and another one decides what actions to take given the current concept of value and other contents of the kb.
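The three-agent arrangement just described can be sketched as three functions sharing one knowledge base. All names here are hypothetical, and the sketch does not solve the hard part: how the second agent recognises "value" is left as a placeholder predicate (`looks_valuable`) supplied from outside, purely so that the loop can run.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class KnowledgeBase:
    facts: set = field(default_factory=set)


def expand(kb: KnowledgeBase, observations: Iterable[str]) -> None:
    """Agent 1: keeps expanding the knowledge base."""
    kb.facts.update(observations)


def find_value(kb: KnowledgeBase, looks_valuable: Callable[[str], bool]) -> set:
    """Agent 2: looks for "value" in the kb.

    `looks_valuable` is a stand-in for the unsolved part of the design.
    """
    return {fact for fact in kb.facts if looks_valuable(fact)}


def choose_action(valued: set) -> str:
    """Agent 3: decides what to write, given the current concept of value."""
    if not valued:
        return "keep gathering knowledge"
    return "report: " + "; ".join(sorted(valued))


def step(kb: KnowledgeBase, observations: Iterable[str],
         looks_valuable: Callable[[str], bool]) -> str:
    """One cycle of the whole system; the only output is text."""
    expand(kb, observations)
    return choose_action(find_value(kb, looks_valuable))


if __name__ == "__main__":
    kb = KnowledgeBase()

    def placeholder(fact: str) -> bool:
        return "conscious experiences" in fact

    print(step(kb, ["light is an electromagnetic wave"], placeholder))
    print(step(kb, ["certain brain configurations correspond to pleasurable "
                    "conscious experiences"], placeholder))
```

Keeping the output to plain text matches the limited kind of control discussed in the notes that follow.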

[Two notes on how I am using the word control. (1) I am not assuming any extra-physical notion here: I am simply thinking of how, for example, activity in the prefrontal cortex regulates top-down attentional control, allowing us humans (and agents with similar enough brains/architectures) to control, to a certain degree, what to pay attention to. (2) Related to what you wrote about "catastrophically wrong" theories: there is no need to give such an AI high control over the world. Rather, I am thinking of control over what to write as output in a text interface, like a chatbot that is not limited to one reply for each input message.]

The interesting question for alignment is: what will such an AI do (or write)? This information is valuable even if the AI doesn't have high control over the world. Let's say we do manage to create a collection of human preferences; we might still notice something like: "Interesting, this AI thinks this subset of preferences doesn't make sense" or "Cool, this AI considers valuable the thing X that we didn't even consider before". Or, if collecting human preferences proves to be difficult, we could use some information this AI gives us to build other AIs that instead act according to an explicitly specified value function.

I see two possible objections.

1. The AI described above cannot be built. This seems unlikely: as long as we can emulate what the human mind does, we can at least try to create less biased versions of it. See also the sentence you quoted in the other comment. Indeed, depending on how biased we judge that AI to be, the obtained information will be less, or more, valuable to us.

2. Such an AI will never act ethically or altruistically, and/or its behaviour will be unpredictable. I consider this objection more plausible, but I also ask: how do you know? In other words: how can one be so sure about the behaviour of such an AI? I expect the related arguments to be more philosophical than technical. Given uncertainty, (to me) it seems correct to accept a non-trivial chance that the AI reasons like this: "Look, I know various facts about this world. I don't believe in golden rules written in fire etched into the fabric of reality, or divine commands about what everyone should do, but I know there are some weird things that have conscious experiences and memory, and this seems something valuable in itself. Moreover, I don't see other sources of value at the moment. I guess I'll do something about it."

Philosophically speaking, I don't think I am claiming anything particularly new or original: the ideas already exist in the literature. See, for example, 4.2 and 4.3 in the SEP page on Altruism.