Posted also on the EA Forum.


In Naturalism and AI alignment I claimed that we might get an aligned AI by building an agent which derives its own preferences (a partial order over world-histories) from its knowledge of the natural world. 

Now I am working on a formal (i.e. mathematical or algorithmic) model of that agent. When AI capabilities reach human level in natural language understanding, the formal model will help turn a natural language AI system that understands the world at a human level of intelligence into an AI that is able to reason about ethics and possibly do EA-style research.

Here, I elaborate on the idea that an AI which reasons like a human might be aligned. The basic argument is that:

  • Some humans act and think ethically
  • If we understand the causes of such behaviour and replicate them in AI, we get AI that is roughly aligned. If we also manage to eliminate some human cognitive biases, we should get AI that is more robustly aligned.

Factors underlying altruistic behaviour in humans

At least some humans act and think in ethical or altruistic terms. To explain why this happens, we can consider the different factors that may lead to such behaviour.

  • Positive and negative first-person experiences: if humans couldn’t perceive any kind of pleasure or pain, survival in general would be significantly harder, let alone making considerations about others’ subjective experiences.
  • Theory of mind: without it, humans probably could not reason about what is good or bad for others.
  • Empathy: in addition to theory of mind, humans can also experience a positive or negative response depending on what somebody else is feeling. Empathy incentivises harm reduction and makes positive emotions spread.
  • Social and cultural drives: culture influences what actions are perceived as good or bad; and humans, as social animals, tend to adjust their behaviour accordingly, in order to be seen as good persons by the people around them.
  • Game-theoretical reasoning: one can end up acting altruistically and cooperatively simply because doing so provides the greatest benefit to themself.
  • Moral reasoning: some humans classify actions as good or bad according to a system of beliefs held to be true. Two examples: religious people often rely on a set of dogmas that guide their decisions during life; some philosophers wonder “Given what I know about the world, what matters most?”, and some of them act according to the answer they find.

If we were able to replicate all these factors in artificial minds, at least some of these minds would end up thinking ethically. However, we don’t yet know how to build an AI that possesses all the factors in the list. It would be nice if we could ignore some of the factors above and still get an aligned AI.

The last factor, moral reasoning, deserves special attention, since it could be more or less important depending on one’s view of morality.

Moral reasoning and subjective vs objective morality

As trivial as it sounds, the more subjective morality is, the less objectively true moral knowledge and reasoning are. If what makes moral statements true are opinions, feelings or attitudes of people, then our philosophical conclusions about good and bad are what they are simply because we are human. We should expect other rational minds to reach different conclusions regarding morality if they have different feelings, or cultural backgrounds, etc.

Thus, on this view, first-person experiences, empathy, theory of mind, and social and cultural drives are the underlying causes of moral reasoning, which is not grounded in truth. Objective moral knowledge and objective moral progress are an illusion.


On the other hand, if morality is more objective, i.e. it depends on objective features of the world, independent of subjective opinion, then we should expect other rational thinkers, even different from humans, to come to similar conclusions about what is good or bad. Of course, since some moral beliefs in humans are strongly influenced by culture, we do not expect perfect coincidence in moral conclusions reached by different minds; but we would see at least some overlap among the conclusions. 

This second possibility, that morality depends on some objective features of the world, is particularly interesting for the design of ethical AI. If we manage to build an AI which can recognise these objective features of the world and can reason about them, this AI will probably understand ethics even without some of the other factors in the list above, e.g. social and cultural drives, or empathy.

Now an interesting question arises: assuming morality is objective (at least partially), which factors from the list can be neglected in the design of ethical AI? In other words: what is the minimal set of factors necessary to get ethical AI in this scenario?

A conjecture

My conjecture is that: given enough world knowledge expressed in natural language, an AI can become aligned just by reasoning. I am claiming that reasoning about truth can be extended to reasoning about moral truth; and qualia, empathy, or social drives are not necessary for an AI to act ethically.


Why is this conjecture relevant at all? If it is correct, it should be possible to convert an AI whose inputs and outputs are in natural language and which has a human-level understanding of the world, to an AI that reasons about ethics and is able to do EA-style research on how to do the most good. This conversion will not require the specification of a utility function encoding human values.


Can the conjecture be tested? Yes. What is needed is the formal design of an AI that, to decide how to act, asks itself if anything in the world is valuable—as some humans do in their lives. If this AI comes to the conclusion that consciousness has some value, and that (for example) reducing suffering is better than maximising it, then the conjecture is probably correct.

I say “probably” because the conclusions of a single AI won’t constitute enough evidence to settle the conjecture. If more AIs, with variations in design and initial knowledge about the world, came to the same conclusions regarding ethics, then we would be more confident that the conjecture is correct.

(This is related to the experiments I wrote about in part I of Naturalism and AI alignment, in case you are interested in the philosophical side of the topic.)


What if there is something wrong with the conjecture or with the philosophical assumptions? The default failure mode of this approach to alignment is that the above AI comes to the conclusion that nothing is valuable, i.e. it believes nihilism is the correct ethical framework.

If many other AIs, with some variations in design and initial knowledge, came to similar conclusions, then we should accept the idea that morality is a subjective and anthropocentric concept. It would mean that, for a mind to develop a concept of morality, a combination of the other factors in the list is necessary: a functional equivalent of empathy and theory of mind, or evolution in an environment that rewards cooperation and social behaviour, or maybe subjective experiences.

Regarding other possible failure modes, note that I am not trying to produce a safety module that, when attached to a language model, will make that language model safe. What I have in mind is more similar to an independent-ethical-thinking module: if the resulting AI states something about morality, we’ll still have to look at the code and try to understand what’s happening, e.g. what exactly the AI means by the term “morality”, and whether it is communicating honestly or is trying to persuade us. This is also why doing multiple tests will be practically mandatory.


Let’s ignore morality for a moment: is it even possible, for an AI that works only on inputs in natural language, to reach a human-level understanding of the world? It’s a tough question at the intersection of philosophy of language and AI. Natural language understanding is hypothesised to be an AI-complete problem, and some have argued that subjective experiences (qualia) are necessary to get human-level general intelligence, so the answer might be negative. On the other hand, language models keep getting better and better, and there doesn’t seem to be a strong reason why progress in natural language understanding would halt anytime soon.

What about the formal design? Does the idea of an AI that asks itself questions about value, and acts accordingly, make sense? I think it does, because some humans seem to follow a similar decision process, and a widely held view in AI is that the human mind and its cognitive processes can, in principle, be emulated on hardware.

A very interesting point in favour of the conjecture could be moral uncertainty. If the described AI were not able to completely dismiss the view that consciousness has value, it might end up acting as if consciousness had value anyway, even if it gave a very high weight to nihilism. Ultimately, the outcome depends on how the AI aggregates different ethical viewpoints held with different degrees of belief; it’s difficult to make accurate predictions now, but we should keep this possibility in mind.
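The moral uncertainty point can be illustrated with a toy calculation. This is only a minimal sketch under strong assumptions: it uses one particular aggregation rule (a credence-weighted sum of each view's scores, sometimes called maximising expected choiceworthiness), and the actions, scores and credences are all hypothetical values chosen for illustration. Nothing here is part of the formal model described in this post, and a real agent might aggregate viewpoints quite differently.

```python
from typing import Callable, List, Tuple

# Hypothetical action set, for illustration only.
ACTIONS = ["reduce_suffering", "do_nothing", "increase_suffering"]

Theory = Callable[[str], float]


def nihilism(action: str) -> float:
    """Nihilism as modelled here: nothing is valuable, so every action scores 0."""
    return 0.0


def consciousness_has_value(action: str) -> float:
    """A toy view on which conscious experience matters (scores are made up)."""
    scores = {
        "reduce_suffering": 1.0,
        "do_nothing": 0.0,
        "increase_suffering": -1.0,
    }
    return scores[action]


def expected_value(action: str, credences: List[Tuple[Theory, float]]) -> float:
    """Credence-weighted sum of the scores each view assigns to an action."""
    return sum(credence * theory(action) for theory, credence in credences)


def best_actions(credences: List[Tuple[Theory, float]]) -> List[str]:
    """Return every action with the highest expected value (ties are kept)."""
    values = {a: expected_value(a, credences) for a in ACTIONS}
    top = max(values.values())
    return [a for a, v in values.items() if v == top]


# The agent is 99% confident in nihilism, 1% in the other view.
mostly_nihilist = [(nihilism, 0.99), (consciousness_has_value, 0.01)]
print(best_actions(mostly_nihilist))

# An agent that is certain of nihilism gets no guidance: all actions tie.
pure_nihilist = [(nihilism, 1.0)]
print(best_actions(pure_nihilist))
```

Under this rule nihilism contributes zero to every action, so any non-zero credence in the other view determines the ranking, while full credence in nihilism leaves all actions tied. Other aggregation rules need not behave this way, which is why the outcome is hard to predict in advance.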


I’ve conjectured that it’s possible to convert a natural language AI system that has a human-level understanding of the world into an AI that reasons about ethics and about how to do the most good. The conjecture can be tested, and I have argued that it is also plausible.

I am working on a formal model that will help test the conjecture, approximately when AI capabilities reach human level in natural language understanding.

This work was supported by CEEALAR.

Thanks to Jaeson Booker, Charlie Steiner and Jenny Liu Zhang for direct feedback, and to many other guests at CEEALAR for conversations around these topics.

If you like my work and would like to support the project, or chat about these ideas, write me a private message. In particular, if you know a lot about NLP or NLU, your help would be very welcome!
