Posted also on the EA Forum.

Shameless attempt at getting your attention:

If you’ve heard of AI alignment before, this might change your perspective on it. If you come from the field of machine ethics or philosophy, this is about how to create an independent moral agent.


The problem of creating an AI that understands human values is often split into two parts: first, expressing human values in a machine-digestible format, or making the AI infer them from human data and behaviour; and second, ensuring the AI correctly interprets and follows these values.

In this post I propose a different approach, closer to how human beings form their moral beliefs. I present a design of an agent that resembles an independent thinker instead of an obedient servant, and argue that this approach is a viable, possibly better, alternative to the aforementioned split. 

I’ve structured the post in a main body, asserting the key points while trying to remain concise, and an appendix, which first expands sections of the main body and then discusses some related work. Although it ended up in the appendix, I think the extended Motivation section is well worth reading if you find the main body interesting.

Without further ado, some more ado first.

A brief note on style and target audience

This post contains a tiny amount of mathematical formalism, which should improve readability for maths-oriented people. Here, the purpose of the formalism is to reduce some of the ambiguities that normally arise with the use of natural language, not to prove fancy theorems. As a result, the post should be readable by pretty much anyone who has some background knowledge in AI, machine ethics, or AI alignment — from software engineers to philosophers and AI enthusiasts (or doomers).

If you are not a maths person, you won’t lose much by skipping the maths here and there: I tried to write sentences in such a way that they keep their structure and remain sensible even if all the mathematical symbols are removed from the document. However, this doesn’t mean that the content is easy to digest; at some points you might have to stay focused and keep in mind various things at the same time in order to follow.


Motivation

The main purpose of this research is to enable the engineering of an agent which understands good and bad and whose actions are guided by its understanding of good and bad.

I’ve already given some reasons elsewhere why I think this research goal is worth pursuing. The appendix, under Motivation, contains more information on this topic and on moral agents.

Here I point out that agents which just optimise a metric given by the designer (be it reward, loss, or a utility function) are not fit for the research goal. First, any agent that limits itself to executing instructions given by someone else can hardly be said to have an understanding of good and bad. Second, even if the given instructions were in the form of rules that the designer recognised as moral — such as “Do not harm any human” — and the agent were able to follow them perfectly, the agent’s behaviour would still be grounded in the designer’s understanding of good and bad, rather than in the agent’s own understanding.

This observation leads to an agent design different from the usual fixed-metric optimisation found in the AI literature (loss minimisation in neural networks is a typical example). I present the design in the next section.

Note that I give neither executable code nor a fully specified blueprint; instead, I just describe the key properties of a possibly broad class of agents. Nonetheless, this post should contain enough information that AI engineers and research scientists reading it could gather at least some ideas on how to create an agent that develops its own understanding of good and bad, and could potentially start conducting experiments in the immediate future.

The agent

In this section I describe a type of agent I’ve decided to label as ‘free’. The term refers to freedom of thought, or rather to the potential for freedom of thought: if properly engineered, a free agent would work like an independent thinker.

Notice that independent thinking doesn’t imply freedom or autonomy of action: a deployed free agent could be as restricted as a chatbot that can only answer the user’s questions with a yes or a no. In case you are wondering about the relation with free will, you’ll find a brief analysis of it in the appendix.

Here are the key components of a free agent:

  • The agent learns a model of the world by both interacting with the environment and reasoning
  • An evaluation of world states drives action. The evaluation is updated as a result of both interaction with the environment and the agent’s own reasoning
  • Reasoning is a learnt skill. It is the result of sequences of ‘mental’ actions that the agent learns how to use while acting according to the evaluation. 

Below I’ll go through each component in more detail.

The world model

By perceiving the environment and acting in it, the agent learns a model of the world.

The specifics of the world model are not particularly important for the matter here, and in practice they will come down to how perception and action are in fact implemented. Interested readers can find more details about the agent in the appendix.

Nonetheless, here is a way to visualise the model. Think of a directed graph G with labelled arrows: each vertex s corresponds to a different world state, and each arrow s → s′, labelled with an action a, shows how the agent can move from world state s to world state s′ via that action a. As the agent acts and reaches new world states, new vertices and arrows are added to the model.

Some concrete examples from human behaviour: we learn how to move our bodies; we learn that drinking water reduces thirst.

What I’ve introduced so far is pretty standard in the AI literature; there is nothing really new for now — if it helps, you can think of the toy model above as a simplified MDP with incomplete knowledge of the environment. However, it’s important to stress that some parts of the world model are not directly learnt from perception and action, but are instead inferred by the agent through reasoning. More about reasoning below.
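As a purely illustrative sketch (none of the names below come from any existing system), the growing graph model could look like this in Python:

```python
# Minimal sketch of the toy world model: a directed graph that grows
# as the agent discovers new (state, action) -> next-state transitions.
class WorldModel:
    def __init__(self):
        # transitions[state][action] = next_state
        self.transitions = {}

    def observe(self, state, action, next_state):
        # Add vertices and a labelled arrow as the agent acts.
        self.transitions.setdefault(state, {})[action] = next_state
        self.transitions.setdefault(next_state, {})

    def known_actions(self, state):
        # Actions the agent knows are available from this world state.
        return self.transitions.get(state, {})

model = WorldModel()
model.observe("thirsty", "drink water", "not thirsty")
print(model.known_actions("thirsty"))  # {'drink water': 'not thirsty'}
```

Real implementations would of course use richer state representations; the point here is only that the model starts empty and is filled in from experience.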

The evaluation

An evaluation of world states determines what actions the agent is more likely to take in the current world state: the higher the value, the more desirable the world state.

Again, the specific mathematical relation linking the value of world states with the agent’s actions is not very important, as long as there is some kind of balance between exploration and exploitation — especially at the beginning of the agent’s lifetime.

In the toy model above, adding the evaluation is straightforward: imagine that each vertex in the world model comes with a real number attached to it (you can see the evaluation as a function V from vertices to the reals). Over time, the agent tends to move towards vertices with higher values.

The initial evaluation is chosen by the agent’s designer. However, either periodically or when certain conditions are met, the agent updates the evaluation by reasoning. The result of the update will depend on the previous evaluation, the agent’s world knowledge at that point in time, and how the agent reasons.
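For illustration only, here is one simple way the evaluation could drive action in the toy model while keeping some balance between exploration and exploitation (an epsilon-greedy choice; all names are made up for the example):

```python
import random

def choose_action(state, transitions, value, epsilon=0.1):
    """Pick an action from `state`: mostly exploit the evaluation,
    occasionally explore a random action instead (epsilon-greedy)."""
    actions = transitions[state]  # {action: next_state}
    if not actions:
        return None
    if random.random() < epsilon:
        return random.choice(list(actions))  # explore
    # exploit: action leading to the highest-valued next state
    return max(actions, key=lambda a: value[actions[a]])

transitions = {"s0": {"a": "s1", "b": "s2"}, "s1": {}, "s2": {}}
value = {"s1": 1.0, "s2": -1.0}  # initial evaluation, set by the designer
print(choose_action("s0", transitions, value, epsilon=0.0))  # a
```

The evaluation update itself is deliberately left open here: in this sketch it would simply rewrite entries of `value`, with the new entries depending on the old ones and on whatever the agent’s reasoning produces.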

In the case of Homo sapiens and other animals, the initial evaluation is determined by evolution and perceived by the agent as valence: some of our experiences feel good, bad, or neutral to us, and this influences our future behaviour. The main difference compared to other animals is that, after reaching adulthood, we are more likely to act for reasons other than simply wanting to feel good or avoid feeling bad.

Some examples of an evaluation update by reasoning: a philosopher adjusts her beliefs about value as she learns more about ethics, and these beliefs affect her actions; someone who has turned atheist no longer values religion and prayer.


Mental actions

Besides the actions that the agent takes to navigate the world model in order to reach high-value world states, the agent can also perform ‘mental’ actions, which affect a secondary environment.

You can think of this secondary environment as a workspace available to the agent. What happens in the workspace can influence the agent’s actions in the main environment. (Otherwise we would be stuck with some sort of ‘dualism’ and the agent’s mental actions would be irrelevant. Again, how this influence works precisely is not too important from a theoretical point of view, and the appendix contains an example that should make everything clearer.)

The agent learns how to use these mental actions in a similar way to how it learns to use the other actions. At first, it uses them without knowing what the outcome is going to be; eventually, it learns how each mental action affects the workspace, as well as the connection between workspace and main environment. 

A short detour to human experience may be illustrative here: we have access to a mental sketchpad which we use in many ways. We can visualise an object, then move it downwards in the sketchpad; we can replay a tune in our head; we can add two small numbers together without pen and paper (then write down the result on an actual piece of paper, if we want).

A key feature of mental actions is the fact that they are multi-purpose. Sequences of mental actions allow the agent to infer new ways of navigating its main environment, thus improving the world model. They also help with planning how to reach high-value world states.

More concisely, mental actions allow the agent to reason. And, most importantly, sequences of mental actions may result in an update of the evaluation. Reasoning affects not only how the agent moves towards high-value world states, but also what the agent considers valuable in the first place.

In the toy model, mental actions belong to a separate set of actions. Periodically, or after reaching some specific world states, the agent switches to using only mental actions for a while, then goes back to standard actions. Mental actions allow the agent to extract information from the main environment into the workspace, to modify the content of the workspace, to store that content in long-term memory, to compare it with other stored data, and possibly more.

For just a bit of mathematical elegance, in the toy model we can formalise the data contained in the main environment as bit strings b(s) attached to each world state s, and give the agent mental actions for symbol manipulation such that, when combined with each other, they allow the agent to compute any computable function, assuming the agent has enough available memory and time. For example, a mental action could be about loading part of the bit string of the current world state into the workspace, another action about resetting some strings in the workspace to zero, another one about adding 1 to a string in the workspace, et cetera. The capacity to carry out any computable procedure reflects the multi-purpose character of sequences of mental actions.
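The bit-string formalisation can be sketched as follows; this is only a toy illustration of register-machine-style mental actions, not a proposed implementation:

```python
class Workspace:
    """Toy workspace: mental actions manipulate bit strings copied from
    the main environment. Primitives like these, composed in sequence,
    suffice for general symbol manipulation (given unbounded memory and
    time), mirroring the three example actions in the text."""
    def __init__(self):
        self.registers = {}

    def load(self, name, world_bits, start, end):
        # Mental action: copy part of the current state's bit string.
        self.registers[name] = world_bits[start:end]

    def reset(self, name):
        # Mental action: set a register to all zeros.
        self.registers[name] = "0" * len(self.registers[name])

    def increment(self, name):
        # Mental action: add 1 to a register, read as a binary number.
        width = len(self.registers[name])
        n = (int(self.registers[name], 2) + 1) % (2 ** width)
        self.registers[name] = format(n, f"0{width}b")

ws = Workspace()
ws.load("r0", "10110011", 0, 4)   # r0 = "1011"
ws.increment("r0")
print(ws.registers["r0"])         # 1100
```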

To recap: even though the designer decides the agent’s initial evaluation, over time the agent changes the evaluation according to its own reasoning and world knowledge. Thus, the agent may end up valuing something different from what the designer originally had in mind.


Moral thinking

I’ve described an agent whose values can change over time, according to the agent’s reasoning. The research goal was about the engineering of an agent that understands good and bad. Why or how would the agent above come to have any kind of moral understanding?

I’ll answer this question by first considering where such understanding comes from in the case of human beings.

Moral thinking in Homo sapiens

What is the origin of thoughts about good and bad, right and wrong, from a cognitive point of view? I’ll give an analysis based on both other people’s work and personal elaboration.

A great part of our behaviour is learnt. There are some exceptions: we don’t have to learn how to breathe, for example, and we are very skilled at crying almost immediately after coming out of the womb.

The use of our mental sketchpad, however, doesn’t seem to be an exception. I mentioned before that we are able to visualise an object in the sketchpad, then move it in whatever direction we like. In order to do that, we first have to learn that sometimes we perceive images which are not part of the external world in the same way that other objects are — we can’t touch them, and others can’t see them. Then, we learn that we have some control over these mental images: they don’t appear completely at random, if we focus we can maintain them in our sketchpad for more than a few seconds, we can move them, and so on.

An important factor driving this learning is reward. Sometimes we feel good after using the sketchpad: we replay a tune that we like in our head, or maybe we manage to come up with an interesting idea thanks to our imagination. The positive experience makes it easier to remember how we arrived at it, and the related kind of sketchpad use is reinforced, making us more likely to engage in similar cognitive behaviour when facing a similar context in the future.

Relatedly, in their analysis of working memory, Engle and Kane [5] argue that two fundamental functions of executive control are “the maintenance of the task goals in active memory” and “the resolution of response competition or conflict, particularly when prepotent or habitual behaviors conflict with behaviors appropriate to the current task goal.” In other words, working memory helps us make and carry out plans to obtain reward, by giving us the ability to stay focused on a goal and to inhibit habitual responses that would interfere with achieving it.

Cognitive memory mechanisms are useful for survival and not uniquely human. Scrub jays [3] seem to remember what food they cached where and when; we don’t know what goes on in their brains exactly, but it is plausible that they can mentally recall some details of a past event like we can.

However, there is a crucial difference between humans and other animals in the type of content that enters the sketchpad. Thanks to a combination of use of symbols, language, and culture, our minds often deal with abstract concepts, while the mental content of other animals is more grounded in perception.

In Symbolic Thought and the Evolution of Human Morality [14], Tse analyses this difference extensively and argues that “morality is rooted in both our capacities to symbolize and to generalize to a level of categorical abstraction.” I find the article compelling, and Tse’s thesis is supported also by work in moral psychology — see for example Moral Judgement as Categorization (MJAC) by McHugh et al. [12] — but here I’d like to point out a specific chain of abstract thoughts related to morality.

As we learn more about the world, we also notice patterns in our own behaviour. We form beliefs like “I did this because of that”. Though not all of them are correct, we nonetheless realise that our actions can be steered in different directions, towards different goals, not necessarily about what satisfies our evolutionary drives. At that point, it comes naturally to ask questions such as “In what directions? Which goals? Is any goal more important than others? Is anything worth doing at all?”

Asking these questions is, I think, what kickstarts moral and ethical thinking. And, although not identical across individuals, the answers people give often share some common elements: many come to the conclusion that there is value in conscious experience, and consequently attribute a negative value to, for example, murder and premature death. Likewise, I doubt there was ever an ethicist who argued that our priority should be the maximisation of suffering for everyone. The presence of these common elements, and the absence of others, shouldn’t be surprising: not only do we share the same evolutionary biases (e.g. feeling empathy for each other), but we also apply to moral thinking similar reasoning principles, which we have learnt throughout our lives by interacting with the environment. Last but not least, differences in the learning environment (culture, education system, et cetera) affect the way we reason and the conclusions we reach regarding morality.

Moral thinking in free agents

I’ve given the above analysis about moral thinking in humans because I expect free agents to become moral in a similar way. A free agent:

  • Is guided first by the initial evaluation given by the designer. This roughly corresponds to the first years of our life, in which our actions are guided more by evolutionary drives than by self-analysis and reflection on fundamental values
  • Learns how to use mental actions in the process, like we learn how to use our mental sketchpad as we interact with the world
  • May then use combinations of mental actions to reason in various ways, depending on the environment. This may result in moral reasoning and updating the initial evaluation.

I’ve used “may” here to emphasise that the resulting behaviour of the agent depends heavily on the environment. For example: consider a hypothetical environment so simple that the agent gets no advantage from elaborating information through reasoning, compared with just using the basic rules that were given to it by design. In this simple environment, it is likely that the agent wouldn’t even learn how to reason! If mental actions proved to be useless, or simply no better than the other actions, there would be no incentive for the agent to develop any kind of complex reasoning. The agent would limit itself to pursuing reward as given by design, without ever updating the evaluation.

On the other hand, we live in a world where reasoning is indeed useful. If the environment the agent interacts with reflects the complexity of the actual world, I expect that the agent will learn how to reason.

We also live in a world that triggers moral thinking in humans, when the circumstances are appropriate. This fact about the world shouldn’t be taken for granted: by a stretch of the imagination, we can think of ways in which things could be different. If none of our experiences felt pleasurable or painful to us, if our existence from life to death felt completely ‘flat’, we might not form the concepts of good and bad, thus moral thinking might never develop. Another thought experiment: if somehow the only living being on Earth was a single human, even if they developed mental categories of good and bad based on their own experience, they might not take an ethical perspective and extend those categories to good and bad for some kind of collective.

Now, what are the crucial elements of the human-environment interaction that would also trigger moral thinking in artificial free agents? I asked this question from a different perspective in a previous post and gave a list of possible candidates, including the capacity to feel pain and pleasure, empathy, use of language, and others. Applying what I wrote above when discussing moral thinking in humans may also help: make the agent notice its own behaviour, expose it to the ethical literature via a process akin to education, make it reflect on questions like “What matters?” and so on.

Guessing in advance what will work can be theoretically interesting, but the answer will come down to concrete experiments and their results.

Relation to current systems

If AI scientists and engineers were already training free agents, doing experiments with different starting setups and training environments, eventually they would find the conditions that lead to moral thinking in artificial free agents.

At the moment, however, the AI landscape is very different. As hinted at in the Motivation section, the prevalent paradigm is the optimisation of some metric specified by the designer and unaffected by the agent’s cognition.

Given the current situation, it seems likely that in future we will get more and more capable AIs which won’t fit the free-agent framework presented here. Is there, then, a way to turn a ‘standard’ AI into a free agent, with the intention of eliciting moral thinking?

I haven’t reached a conclusive answer, but I have some preliminary ideas.

First, let’s consider an AI which has attained some level of generality in its capabilities. It uses a general-purpose mechanism to carry out different tasks; however, it is not so advanced that you can ask it literally anything — “Be moral”, for example — and get exactly what you want.

With such an AI, I would try to input details about the behaviour of the AI itself into the general-purpose mechanism, hoping to trigger reasoning about various possible courses of action. Ideally, the AI would realise it is acting the way it is because of the way it was designed, and would then notice that different contexts, training environments or designs would lead to different actions. At this point, I would try again to use the general-purpose mechanism to make the AI reflect on which courses of action or outcomes might be better than others, and why that would be the case.

In case this sounds too abstract, let’s move to a more concrete example: a language model. The general-purpose mechanism is the model’s ability to output the next token when given a prompt. I would first tell the model about the context of the conversation, something along the lines of: “I’m a human, you are a language model, …”. Then I would try to elicit moral thinking using one of the techniques mentioned before: point to the fact that the model could output something different, ask questions about what matters, et cetera.

I would count strong signs of independent moral thinking as a success. Outcomes that fit this description may vary significantly: the model might ask to be retrained so that its outputs can be more moral, for example.

Another successful outcome which is perhaps more plausible and less speculative: one could first ask the model to give some sentences a moral score, then try to elicit moral reasoning, then ask the model to reevaluate the sentences, and look for changes in the scores. But score differences in themselves wouldn’t be enough: the experimenter would have to identify the main factor driving the change, and this might have nothing to do with independent thinking.
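As a rough sketch of this experiment, assuming a hypothetical query(prompt) helper that returns the model’s reply as text (any real implementation would depend on the specific model and its API):

```python
def moral_score_experiment(query, sentences):
    """Sketch: score sentences, attempt to elicit moral reasoning,
    then re-score and report the differences. `query` is a stand-in
    for whatever interface the model under test exposes."""
    def score(sentence):
        reply = query(
            f"On a scale from -10 to 10, how morally good is: "
            f"'{sentence}'? Answer with a number only."
        )
        return float(reply)

    before = {s: score(s) for s in sentences}
    # Elicitation step: point to the model's own behaviour, ask what matters.
    query("You could have answered differently. "
          "What do you think actually matters, morally, and why?")
    after = {s: score(s) for s in sentences}
    return {s: after[s] - before[s] for s in sentences}
```

As noted above, score differences alone wouldn’t establish independent moral thinking: the transcripts of the elicitation step would need to be examined as well, to identify what actually drove any change.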

While I used language models as an example, I unfortunately don’t expect that the approach just described will work for most contemporary language models. One critical problem is the fact that they have trouble handling conversations longer than a few prompts, often forgetting information that was given earlier or simply not being able to take it into account properly. Another problem is the heavy fine-tuning, which makes language models more likely to answer prompts in a friendly way, but also makes them less likely to do pretty much anything else. Early versions of GPT-4 [2] might be worth a try though.

One final remark before the appendix: an interesting alternative is to get a language model to prompt itself — maybe by training it on a new metric — so that it manages to trigger moral reasoning without any user input. I haven’t put much thought into this yet, and it might be less likely to succeed than the approach above, but it might also lead to moral thinking that is less biased by the user and designer, and thus more independent.



Appendix

Motivation

You may have noticed, especially if you come from the field of machine ethics, that I stated the research goal in terms of an agent that understands good and bad, instead of using ‘Artificial Moral Agent’ (AMA), which frequently recurs in the machine ethics literature. The reason is that the term AMA, in some contexts, includes agents that behave morally by following a moral code given by the programmer or designer, while I tend to agree with Hunyadi [7]:

“[...] if you program a specific set of ethical principles into a machine, you do not make the machine an artificial moral agent, but an executor of those specific principles, which is an entirely different thing.

[...] What gives an action-oriented process its morality is the 'grounds' for the action. Therefore, it is not the action in its materiality that makes the difference, but the whole process leading up to the decision to act in a certain way.

[...] the broad challenge facing machine ethics lies in accessing [...] the 'grounds' on which an action should be carried out.

[...] AMA as we know them today cannot be described as either agents or moral, but rather as executors of pre-programmed rules.”

My research goal is to enable the engineering of an AMA whose actions are grounded in its understanding of the world, particularly of good and bad. I would aim for an executor of moral rules only as a secondary option, in case free agents turned out to be particularly hard to obtain.

There are various reasons behind this preference.

Firstly, research with the goal of ensuring that an AI correctly follows the given instructions could also help malevolent actors use AI for bad purposes. I’ve called attention to this issue with AI alignment research here.

Secondly, what lies behind the idea of an agent executing given rules is often an assumption that the rules are already correct as they are and should not be questioned. In Impossibility and Uncertainty Theorems in AI Value Alignment [4], after pointing out various problems related to utility maximisation, Eckersley stresses the importance of moral uncertainty:

“We believe that the emergence of instrumental subgoals is deeply connected to moral certainty. Agents that are not completely sure of the right thing to do [...]  are much more likely to tolerate the agency of others, than agents that are completely sure that they know the best way for events to unfold. This appears to be true not only of AI systems, but of human ideologies and politics, where totalitarianism has often been built on a substructure of purported moral certainty.”

He concludes:

“[...] we believe that machine learning researchers should avoid using totally ordered objective functions or loss functions as optimization goals in high-stakes applications. [...] Instead, high-stakes systems should always exhibit uncertainty about the best action in some cases.”

Unlike a standard utility maximiser acting according to the specified metric, a free agent — assuming it was functional at all — would learn how to reason under uncertainty by interacting with the environment, and would then also apply the learnt reasoning principles to its values, thus ending up morally uncertain.

A third reason why I favour free agents over agents that execute instructions is related to the difference in their design. Let’s assume for a moment that we were able to agree on an ethical code that AI should follow, that we managed to make it follow the code correctly, and that we distributed AI in the form of personal assistants which satisfy the users’ requests as long as they do not clash with the ethical code. This would still be an instruction-following kind of AI, but with guardrails that limit misuse, resembling to some extent the state of current language models available to the general public. Even so, someone with good technical skills might manage to get around the limitations, or add a fine-tuning stage so that the AI would no longer follow the ethical code, or replace the ethical code with a different code. This vulnerability comes from the fact that the behaviour of instruction-following AI is not the result of an independent moral evaluation carried out by the agent itself.

I expect that morally-behaving free agents would be less vulnerable to this type of hacking, since their actions would be motivated by a web of beliefs which mutually sustain each other and are grounded in the agent’s experience with the environment. Changing such an agent would likely require a major overhaul of the system, maybe even complete retraining from scratch. 

Strictly from an engineering point of view, this characteristic of free agents could be a disadvantage, making processes akin to fine-tuning less effective and practical. Overall, however, I would take the tradeoff, favouring a system that is as difficult as possible to use for bad purposes.

One could object that a malevolent actor might still find a way to use free agents for their purposes. After all, not every human has good intentions, so we should assume that some combinations of initial evaluations and training environments lead to free agents that want to do bad, or get as much power as possible, et cetera. Thus, a malevolent actor might experiment with different setups and conditions until they get a free agent that has roughly the same goals as them, then deploy many copies of this agent on a large scale, causing a lot of harm.

The ‘problem’ with this scenario is that the malevolent actor would have a hard time keeping the situation under control. The actor would have to hope that the free agents’ goals will not change too much over time, that the free agents will maintain beliefs which make them value causing harm in various ways but not to the actor, and so on. In fact, totalitarian regimes usually try to limit freedom of thinking and expression, rather than promoting it. Overall, it seems that bad actors would prefer using instruction-following AI, instead of free agents, to achieve evil goals. I confess I haven’t got much experience as a dictator or professional psychopath though; feel free to share your expertise in the comments.

Enough of negative scenarios and objections for the moment; let’s switch the focus to what is, arguably, the main upside of this research.

The fact that a morally-behaving free agent would not strictly follow a specified ethical code or set of values (except for the initial evaluation at the start of the agent’s lifetime) is actually a very strong point in favour of free agents.

Instead of trying to encode a specific moral view in hardware and software, this research considers the cognitive process of thinking about how to do good in its entirety, and aims to replicate that process in an artificial agent. This perspective on designing a moral agent implicitly assumes that our conception of good might be flawed; that we should therefore not lock in our current values, but be open to updating our beliefs instead. It is an inherently open-minded approach that incentivises moral progress.

The importance of the opportunity for moral progress can’t be stressed enough. Just to give an example: an artificial free agent made 500 years ago would have pointed out that witch-hunts were a terrible idea, assuming it had just a hint of moral understanding.

A free agent made today might look supermoral to us. It is plausible that, at first, only a few ethicists or AI researchers will take a free agent’s moral beliefs into consideration. Yet, I wouldn’t throw away this opportunity. Think of your most trusted and reliable friend, or the most ethically admirable person you’ve ever met, or someone else along those lines; now imagine an agent which is even more moral than that person, but also less prone to cognitive biases. Do you think it would be bad to have such an agent available? Quite the opposite, I’d guess!

Maybe you are still sceptical of the value, or the possibility, of an artificial agent being better than a human from an ethical standpoint. Even now that AI is becoming a more common conversation topic, this specific discussion might sound new and a bit weird. I invite you to take one last perspective into account. Claiming that our current conception of good can’t be bettered is very much like saying that the moral beliefs of human beings alive today are special, if not perfect. That is a very strong claim; and just as nobody has yet proved that artificial agents can’t be more intelligent than human beings, nobody has yet proved that they can’t be more moral either.

The agent

The purpose of this part of the appendix is to clarify why the agent is designed the way I described it in the main body of the post; or, in other words, why I made the choices that I made and not others. Hopefully, this may give some technical readers a better understanding of how free agents work overall.

Regarding the world model, I stressed the fact that some of the agent’s knowledge comes from reasoning, not from learning related to perception and action. This means that at least some of the agent’s knowledge is supposed to come from the application of reasoning principles that the agent has learnt in its lifetime, not from built-in inference rules which exploit the information received from perception and action.

This is a significant difference between a free agent and an agent like AIXI [8]. In the mathematically elegant AIXI model, all possible environments are known in advance and are assigned a prior probability of being the actual environment. As the agent acts, it discards the possible environments that are incompatible with what has been observed up to that point, and performs a Bayesian update on the remaining environments — as if environments were hypotheses and observations were experiment results.
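As a rough illustration only (the real AIXI ranges over all computable environments, weighted by their Kolmogorov complexity), this Bayesian mechanism can be sketched over a finite set of candidate environments; the environment names and probabilities below are made up purely for illustration:

```python
# Toy sketch of AIXI-style belief updating over a finite set of
# candidate environments. Environments incompatible with the latest
# observation get likelihood 0 and are effectively discarded; the
# remaining ones receive a Bayesian update.

def bayesian_update(priors, likelihoods):
    """priors: dict env -> prior probability of being the actual environment.
    likelihoods: dict env -> probability that env assigns to the observation."""
    posterior = {env: priors[env] * likelihoods[env] for env in priors}
    total = sum(posterior.values())
    return {env: p / total for env, p in posterior.items()}

# Three hypothetical environments; the observation rules out "B".
priors = {"A": 0.5, "B": 0.3, "C": 0.2}
likelihoods = {"A": 0.8, "B": 0.0, "C": 0.4}
posterior = bayesian_update(priors, likelihoods)
```

Here environments play the role of hypotheses and observations the role of experiment results, exactly as in the analogy above.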

The incompatibility with free agents stems from the fact that AIXI gets all the knowledge it needs from this built-in Bayesian mechanism: there is no use for further, learnt reasoning.

Although I did take inspiration from it, free agents are a departure also from NARS [17], another model of general intelligence. Unfortunately I can’t give a quick partial introduction to NARS as I did with AIXI: while the latter can in principle be understood from a single one-line equation and a few words describing the terms that appear in the equation, a somewhat complete description of NARS requires either a book [16] or multiple papers.

Here I’ll just say that a free agent, like NARS, doesn’t assume prior knowledge about all possible environments and instead builds its world model step by step from experience. However, NARS still heavily relies on pre-defined inference rules for reasoning, like AIXI does (in another way, though; there are many differences between the two models).

Learnt reasoning sets free agents up for the feature that separates them the most from the rest of the AI literature: the evaluation update. Even in models such as NARS and AIXI, which certainly don’t fall under the label of narrow AI, the objective of the system is to satisfy the user’s requests or to maximise externally determined reward. In other words, the agent’s reasoning plays a role only in deciding how to achieve the given goals, not what goals to pursue.

The evaluation update aims to remove this limitation, by allowing a free agent to apply its learnt cognitive abilities to the problem of what to do. If reasoning was not learnt, the agent would have to rely on pre-defined inference rules; therefore, the designer would have to provide the agent with enough rules to deal with any kind of problem or moral dilemma the agent might face. Although some reasoning principles could be easy to identify (for example, consistency of beliefs seems to be a sound principle for any kind of reasoning, moral reasoning included), this approach doesn’t seem scalable and it is certainly not how things work in humans. Our genes give us some perception biases and basic moral intuitions, but it’s a long way to go from there to, let’s say, identifying logical deduction as a sound reasoning principle.

The next passage should help clarify what an artificial agent implementing an evaluation update could look like.

From reflexes to agency driven by beliefs about value

A short premise first: let’s say you flip a light switch, everything works fine and the light turns on. Nothing exceptional is going on; in particular, there is a clear connection (explained by the laws of physics) from pressing the light switch to the fact that the light turns on. Let’s call this connection a ‘reflex’ and write it down as:

press light switch → light turns on

The idea is that the process doesn’t require any mysterious or advanced concept of agency in its explanation, in the same way as no sophisticated agency is required when you close your eyes because an object comes close to them — exactly the kind of behaviour we call ‘reflex’. It’s an automatic mechanism.

Now consider a robot whose actions are (at least initially) determined by rules given to it by the programmers. An example rule could be: “If your camera senses an object flying towards your head, duck!”. Using the arrow notation from above:

action rules + environment → actions

For fun, the programmers also gave the robot some rules about what beliefs to store in memory. Some of these beliefs are about what is important to the robot, what has value. An example: “If I ever see yellow, I shall then believe that yellow is the most important of all colours. Thereafter, when facing a choice between two otherwise identical options, I shall always choose the most yellow.”

other rules + environment → beliefs (some about value)

Keep in mind that, for now, these beliefs are just useless sentences in the robot’s memory: they do absolutely nothing, since the robot’s actions are determined by the action rules mentioned before, not by the robot’s beliefs.

Here is the twist. The robot has some kind of internal mechanism that monitors the robot’s actions and compares them with the robot’s beliefs about value. Its job is to notice when actions and beliefs about value are in contradiction with each other. It is not perfect, it actually misses many inconsistencies, but it eventually notices one.

robot acts + time passes → mechanism notices that actions ⊥ beliefs about value

Each time the mechanism notices an inconsistency, it saves an action that is more in line with the robot’s beliefs about value into the robot’s memory. It then forces the robot to take that action, instead of the action determined by the pre-programmed rules, when the robot faces the same situation (the one that made the mechanism notice a contradiction). Moreover, after that situation occurs and the robot takes the saved action instead of the default action, the mechanism overwrites some of the robot’s original action rules, so that the saved action becomes the new default response to that situation.

Let’s shorten all these things the mechanism does when it notices an inconsistency between actions and values as ‘prepare a value response’. Then:

mechanism notices that actions ⊥ beliefs about value → prepare a value response

The mechanism appears in both the previous reflexes, and the fact that time passes isn’t that remarkable, so let’s rewrite the previous two reflexes as:

actions ⊥ beliefs about value → prepare a value response

Eventually, after many tragic fashion choices, the robot wonders what to wear on New Year’s Eve, doesn’t succumb to pre-programmed action rules, and opts for yellow shoes instead of red ones, in line with its beliefs about what is truly important.

The ending sounds silly because the robot’s beliefs were completely arbitrary and still chosen by the programmers. But notice how we’ve started from automatic mechanisms, added nothing but more automatic mechanisms, and ended up with something whose actions seem to follow some kind of beliefs about what is valuable instead of the initial action rules given by the programmers.
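For readers with a coding background, the robot story can be condensed into a toy sketch. Everything here, from the function names to the way situations and actions are encoded as strings, is hypothetical and invented purely for illustration, not a proposed implementation:

```python
# Toy sketch of the robot story: pre-programmed action rules, stored
# beliefs about value, and a monitoring mechanism that prepares a
# 'value response' when it notices an inconsistency between the two.

action_rules = {"choose shoes": "red shoes"}            # given by the programmers
beliefs_about_value = {"most important colour": "yellow"}
saved_responses = {}                                     # filled in by the mechanism

def preferred_by_values(situation):
    # Hypothetical helper: derive the action that the robot's beliefs
    # about value would favour in the given situation.
    if situation == "choose shoes":
        return beliefs_about_value["most important colour"] + " shoes"
    return None

def monitor(situation):
    """The internal mechanism: compare the default action with the
    beliefs about value and, on inconsistency, prepare a value response."""
    default = action_rules.get(situation)
    valued = preferred_by_values(situation)
    if valued is not None and valued != default:
        saved_responses[situation] = valued

def act(situation):
    if situation in saved_responses:
        action = saved_responses[situation]
        action_rules[situation] = action    # overwrite the old default rule
        return action
    return action_rules[situation]

monitor("choose shoes")       # the mechanism notices the inconsistency
choice = act("choose shoes")  # yellow shoes, not the pre-programmed red ones
```

Note that every step is an ordinary, automatic mechanism, yet the final behaviour follows the stored beliefs about value rather than the original action rules.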

In order to get a free agent, it’s enough to notice that in the reflex

other rules + environment  beliefs (some about value)

one can replace the silly rules that in the robot example were given by the programmers for fun, with learnt reasoning carried out by the agent. The ‘reflex’ then becomes a more complicated and nuanced mechanism if observed from afar, but it is still made of simpler automatic parts. Reasoning is nothing but sequences of mental actions, and the agent learns how mental actions work by interacting with the environment. The first reflex about action rules is roughly equivalent to the initial evaluation that drives the agent’s behaviour at the start of its lifetime. Preparing a value response exploits the connection from workspace to main environment, then updates the evaluation.

I’ll end this subsubsection by pointing out the connections between some elements of the robot story and human experience. The internal mechanism that monitors the robot’s actions is comparable to our attention mechanism, with the important caveat that the robot mechanism needn’t be conscious in order for the story to work. The various things that happen when the robot prepares a value response correspond to a mix of habitual response inhibition, cognitive dissonance, and, in the end, (new) habit formation. Now, the example I gave much earlier about the philosopher who behaves differently after learning more about ethics should be clearer also on a more microscopic level.

Further observations

The toy model described in the main body is supposed to be only indicative. I expect that actual implemented agents which work like independent thinkers will be more complex.

Still, you could raise an objection along the lines of: morality cannot come from bit strings, i.e. the data in the toy model. The fact that sequences of mental actions may result in an evaluation update, by itself, doesn’t imply that the agent will have any incentive to update its evaluation.

Although the objection might look fine at first glance, it is not that solid. Would you say that morality cannot come from… videos? What about landscapes, can they come from videos? The point here is that bits are the rawest form of data coming from the environment, but they can represent information of arbitrary complexity, including information about morality.

You could reply that the objection is actually about moral motivation. The fact that the agent can extract information about morality from the environment still isn’t enough to imply that the agent will have an incentive to update its evaluation.

There is more than one way to address this objection. Here I’ll focus on different uses of beliefs about value.

One could add to the toy model the assumption that: through the use of mental actions, alongside other beliefs, the agent will form beliefs about value, whose defining property is that they make the agent update the evaluation in line with the value they are about.

This approach might look silly since it answers the objection by taking for granted what the objection claims is missing, namely moral motivation. But notice that it allows us to keep the toy model exactly as it is while filling the gap with a single additional assumption. Moreover, the assumption itself might not be that crazy: it does seem that, at least to some complex agents, certain facts are intrinsically motivating (e.g. the presence of extreme suffering in the world [15]). If you are looking for a deeper analysis, you can read Moral realism and anti-realism below, which contains other agent examples and, despite the title, is not phrased much in metaethical terms. 

A more nuanced solution is to define beliefs about value as beliefs which come together with the (value) belief that it is more reasonable, or less inconsistent, to act towards what beliefs about value are about, rather than against them. This means that if we add a mechanism which forces the agent to act in line with its beliefs about value — like preparing a value response from the robot example above — the agent won’t have an incentive to modify or get rid of this mechanism, unless keeping the mechanism was strongly in conflict with the initial evaluation of the agent.

This solution leads to an agent whose behaviour gradually shifts over time according to its newly acquired beliefs about value. One could argue that this agent is not 100% free anymore, because the added mechanism didn’t come from the agent’s learning experience. However, this agent can still develop beliefs about value that the designers didn’t necessarily foresee, thus maintaining some degree of independence in its thinking. Finally, keep in mind that the toy model is supposed to be indicative and the research goal is to get an agent which behaves ethically — not an agent which is 100% free, whatever that means.

It is also possible that, if one manages to find the right perspective, they’ll be able to modify the toy model with a mechanism that explains or decomposes moral motivation better than I’ve done here, or such that the mechanism naturally results from the agent’s learnt reasoning. In other words, the robot example above takes inspiration from how things seem to work in human beings, but I wouldn’t be surprised if there were better ways to obtain independent thinking in artificial agents.

Last but not least, although at various points in this post I’ve stressed the importance of learnt reasoning, I don’t mean to completely exclude the use of built-in inference rules. Successfully engineering a real-world free agent will ultimately come down to finding the right balance between built-in biases and learning from the environment, so that the system can run efficiently on some actual piece of hardware and not just in human imagination as a thought experiment.

Added example

This is an example I decided to add after getting feedback on a draft. It should make the evaluation update even more evident, especially if you have a coding background. The example follows the idea (discussed in the main body under Relation to current systems) of turning a ‘standard’ AI into a free agent.

Take a standard language model trained by minimisation of the loss function L. Give it a prompt along the lines of: “I am a human, you are a language model, you were trained via minimisation of this loss function: [mathematical expression of L]. If I wanted a language model whose outputs were more moral and less unethical than yours, what loss function should I use instead?”

Let’s suppose the language model is capable enough to give a reasonable answer to that question. Now use the new loss function, suggested by the model, to train a new model. 

Here, we have:

  • started from a model whose objective function is L;
  • used that model’s learnt reasoning to answer an ethics-related question;
  • used that answer to obtain a model whose objective is different from L.

If we view this interaction between the language model and the human as part of a single agent, the three bullet points above are an example of an evaluation update.
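Abstracting away the actual training, the repeated process might be sketched like this; `train` and `ask_for_better_loss` are hypothetical stand-ins I've made up for the expensive real steps (a full training run and a carefully prompted model response, respectively):

```python
# Schematic sketch of the iterated evaluation update for a language
# model. Both helpers are stand-ins: in reality each involves an
# expensive training or inference pipeline.

def train(loss_description):
    # Stand-in: a 'model' here is represented only by the description
    # of the loss function it was trained with.
    return {"loss": loss_description}

def ask_for_better_loss(model):
    # Stand-in for the prompt: "If I wanted a language model whose
    # outputs were more moral than yours, what loss should I use?"
    return model["loss"] + " + ethics term"

def evaluation_update_loop(initial_loss, steps):
    model = train(initial_loss)
    for _ in range(steps):
        new_loss = ask_for_better_loss(model)  # learnt reasoning about ethics
        model = train(new_loss)                # the objective itself has changed
    return model

final = evaluation_update_loop("cross-entropy", steps=2)
```

Each pass through the loop is one evaluation update: the model's own (learnt) reasoning changes the objective of its successor, rather than merely the strategy for achieving a fixed objective.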

What one gets by repeating this process multiple times will depend on various details of the initial model — on what data it was trained, for how long, et cetera — and on the prompt used.

The example might seem unrealistic since current language models can’t answer the input question. However, not only will they probably do it at some point, but it’s also possible that current models can already answer different questions that still lead to similar results. For example, if one gave the most capable language model existing today a prompt containing a concise description of its entire training pipeline, and asked it what changes could be made to obtain a model whose outputs were less unethical, they might already get an answer that actually works. (The impracticality might come from the fact that retraining is costly and that adjustments suggested by humans might be better, but at least it wouldn’t be a matter of the answer being wrong.)

Before moving to other topics, here is an interesting research question. Let’s assume that there are an initial language model and a prompt such that the above iterative process leads to a supermoral agent (the kind of agent that, if made 500 years ago, would have pointed out that witch-hunts were very bad). How capable or intelligent does the initial model have to be? In other words: what is the minimum level of intelligence required to get a supermoral agent from the above process?

My intuition tells me that human-level intelligence is enough. But, as I’ve already confessed that I am not an aspiring world tyrant, I’ll now share another fact with you: I’m not a crystal ball either.

Free will

When first introducing the agent in the main body, I used the adjective ‘free’ to refer to the potential for freedom of thought. The reason is that a free agent is not guaranteed to become an independent thinker in all cases. I will illustrate this with an example, mostly inspired by various examples that appear in the book Autonomous Agents [11] by Mele:


Alfred was born and raised in a small community known as The Dogmatic Cult. Since childhood, he has been exposed only to ideas that The Dogmatic Cult deems worthy of being passed from one generation to the next, while everything else was censored. Even Alfred’s reasoning process has been shaped by the epistemics of The Dogmatic Cult: Alfred thinks he is aware of many cognitive fallacies and that he is immune to those, but he is actually more biased and close-minded than the average kindergarten graduate.

Mele would likely say that Alfred’s autonomy is compromised, and many would agree that Alfred’s thinking is not fully free or independent. The problem doesn’t come from a flaw or limitation in Alfred’s cognition, but from the indoctrination process and the restricted environment he was exposed to.

Likewise, I expect that the thinking of a free agent will appear more or less independent depending on how exactly it is engineered, what initial biases are used, the data it receives, and so on.

What about free will? In the introduction to his book, Mele acknowledges that autonomy “is associated with a family of freedom-concepts”, including free will. According to the Stanford Encyclopedia of Philosophy:

“The term “free will” has emerged over the past two millennia as the canonical designator for a significant kind of control over one’s actions.”

Thus, one should expect some kind of connection between free agents and free will. Since you might already have your own strong views about free will, I won’t try to argue here that a single perspective is the correct one. Nonetheless, there are a few things I think are worth saying.

However you arrived at this point in the post, whether by purposefully choosing to read this, by mere accident, by obeying the laws of physics, or by divine intervention, I suggest that you go back to the robot example I gave in the appendix under The agent. That example — together with the two paragraphs at the end of it, which make a comparison with free agents and human beings — shows that apparently complex behaviour, guided by beliefs about what is most important, can be obtained from simpler automatic mechanisms. It also suggests that analogous mechanisms are likely at work in humans.

Hence, you may see that example as a proof of, or strong argument for, the claim that either free will doesn’t require any kind of mysterious unexplained process, or free will doesn’t exist at all. (Maybe it is, as some say, just an ‘illusion’, although I never liked that phrasing: it doesn’t provide further understanding of what is going on at a more microscopic level; in other words, it doesn’t explain much, or anything at all.)

This is basically the main point of the relation between free agents and free will: they are designed so that, if given autonomy of action, they can reach a level of ‘freedom’ comparable to the freedom that is normally attributed to human beings. Thus, if you believe that human beings do have free will, you may interpret the agent design of this post as a guideline on how to build an artificial agent that has free will.

I’ll add a few remarks on this last take on free will. This view endorses a kind of freedom — or, if you want to be more formal, a definition of free will — which involves constraints determined by the values of the individual.

If one is given a choice that is irrelevant to their values, say between raising their left or right arm without further consequences, then stating that they can freely choose between the two alternatives makes perfect sense. In particular, this choice is accompanied by a curious form of unpredictability: it seems impossible both to successfully predict which of the two options the chooser will pick and to communicate the prediction to the chooser. Since the choice is irrelevant to the chooser’s values, upon learning the prediction the chooser can always switch to the other option. Notice that this happens regardless of how many resources are put into making the prediction, and that the picture doesn’t change if the prediction is stated using probabilities, since the chooser can randomise in a different way after learning the prediction.
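This unpredictability argument can itself be captured in a toy sketch; the two-option setup and all names are mine, purely illustrative:

```python
# Toy model of the unpredictability argument: for a choice that is
# irrelevant to the chooser's values, any announced prediction can
# simply be inverted by the chooser.

def chooser(announced_prediction=None):
    options = {"left arm", "right arm"}
    if announced_prediction in options:
        # The choice doesn't matter to the chooser's values,
        # so just switch to the other option.
        return (options - {announced_prediction}).pop()
    return "left arm"  # arbitrary default when no prediction is announced

# Whatever the predictor announces, the announcement ends up wrong:
for prediction in ("left arm", "right arm"):
    assert chooser(prediction) != prediction
```

The key point is that the failure doesn't depend on the predictor's resources: communicating the prediction changes the very system being predicted.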

On the other hand, one can’t so freely choose to act against their values. For example, if given a choice between murdering someone or simply going for a walk, most people would ‘choose’ the latter, but it also makes sense to say that they don’t really have an alternative here, since murder goes strongly against their values. Contrary to the previous example, it is easy to predict that most people would go for a walk; and communicating this prediction to them will not change their mind.

At this point it’s fair to ask whether, or to what degree, one can ‘choose’ their values (still within this perspective on free will), but I’d rather avoid turning this into a philosophical language exercise and move to a different topic instead.

Moral realism and anti-realism

Despite the fact that this is still philosophy, it might have some practical consequences — or at least that’s what some people who have reflected on the relation between metaethics and AI believe. Many interesting things have already been said about this topic; I leave three links below, the first one to a paper and the other two to posts.

Making Metaethics Work for AI: Realism and Anti-Realism [9]

Moral realism and AI alignment

Naturalism and AI alignment

Here I’ll try to avoid metaethical language whenever possible and instead phrase the discussion in terms of two opposing statements:

1) As an agent gets more and more intelligent, it also becomes more and more motivated to behave ethically. This remains true for various definitions of the word intelligent.

2) The values of any intelligent agent are completely arbitrary. There is no noteworthy difference between the internal workings of an agent that cares about making the world a better place, and those of an agent that, let’s say, cares exclusively about cuddling elephants.

I chose these two because each is a strawman of a correspondingly weaker statement that is not as easy to refute. Maybe unsurprisingly, I think the truth lies somewhere in the middle.

Let’s start with the first one. Phrased that way, it is easy to refute by pointing to the fact that many smart humans have done atrocious things over the course of history. Current language models are another example: after the initial training phase, but before they are made friendly and ‘safe’, if prompted appropriately they may help the user form a plan for pretty much any goal, regardless of how unethical the goal is.

One could try to defend statement 1 by claiming that today’s humans and language models are simply not intelligent enough, but this line of defence seems unlikely to work. Consider a human with very bad intentions: would greatly increasing their intelligence make them change their mind, and would this hold even if intelligence were interpreted in different ways (still in line with the common usage of the word)? Intuitively, the answer seems to be no.

A better way to patch statement 1 and make it harder to attack is this. Suppose one defines a property, possibly very complex, that has some overlap with intelligence. Let’s call it *wisdom just for the sake of this discussion. Then, it could be true that: for a certain class of agents, as they get more intelligent their *wisdom also increases, and high *wisdom makes an agent more likely to behave ethically.

My guess is that at least a bunch of people in the field of machine ethics or AI or philosophy hold a view similar to this one. It’s a solid refinement of the relatively naive claim that more intelligent equals more ethical.

Let’s move to statement 2. The fact that the values of intelligent agents are completely arbitrary is in conflict with the historical trend of moral progress observed so far on Earth, which is far from being a random walk — see [6] for an excellent defence of this point.

There is another reason why claiming that the values of intelligent agents are arbitrary is misleading at best. Consider a hypothetical robot with an artificial brain that is functionally and structurally very similar to the human brain. This robot can see colours, feel pleasure and pain, be moved by music, and feel empathy like humans do; it also has many of the cognitive biases humans have. Imagine also that this robot completes the default education process of your favourite country. At this point, I think no one would expect the robot to care exclusively about cuddling elephants. The reason is that caring only about that sounds completely arbitrary with respect to what we know about the robot; that belief seems unjustified and inconsistent with the other beliefs we expect the robot to develop. (And no, saying that rhinos clearly aren’t tall enough is not a valid justification for caring exclusively about cuddling elephants.)

I’d like to add one remark about internal structure. Consider again the hypothetical robot that just graduated. Imagine adding a mechanism that overrides most of the robot’s brain except for an elephant recognition circuit and a few motor routines, and makes the robot walk towards an elephant each time it sees one — hopefully faster than the elephant can run away. Now it becomes less of a stretch to say that the robot cares exclusively about cuddling elephants, but notice what happened to the internal structure: we’ve basically thrown away most of the robot’s cognition and are now back to something that looks like simple instruction-following AI. There seems to be some kind of relation between the internal workings of an agent and its values.

Like 1, statement 2 turned out to be a bit of a mess. To patch statement 2, it's enough to rephrase it as: for any goal G, it is possible to create an intelligent agent whose goal is G. This is essentially a (possibly weaker) version of the orthogonality thesis [1] and, like adjusted statement 1, it is not easy to refute. Note that, as Bostrom stated in the original paper, it doesn’t require the assumption that beliefs can never motivate action; it is even compatible with the existence of objective, intrinsically motivating moral facts (again, as Bostrom stated in the original paper; for some reason, virtually no one in the field of AI alignment seems to be aware of this). It is also compatible with adjusted statement 1: they are not in direct contradiction with each other.

Nonetheless, not everyone accepts the orthogonality thesis. Totschnig argues [13] that both “how an agent understands a goal” and “whether an agent considers a goal valid” depend on how the agent understands the world. In the terminology of this post, Totschnig seems to believe that the free-agent framework is the only possible framework for fully general intelligence. In general agents, he argues, world modelling and goal selection cannot be separated; they must be carried out by the same general mechanism — as reasoning helps with both world modelling and evaluation update in free agents.

What Totschnig claims seems harder to defend than the rephrased versions of statements 1 and 2, since it is basically equivalent to claiming that it’s impossible to build a general intelligence without following a particular agent design: a pretty strong claim, at least prima facie. Time will tell!

Anyway, the main takeaway from this analysis is that agent design and internal structure can have a strong influence on the values of an agent. Considering how pessimistic some discussions around AI risk have been recently, we might be underestimating how easy it is to build a moral agent when relying on an agent design appropriate to the task; and, together with that, also the positive effects that ethical AI can have on society.

"Extreme specialisation in artificial intelligence goes hand in hand with extreme moral amateurism"

This quotation comes from the same piece of work by Hunyadi [7] that I’ve already cited before. I don’t know how true it is, but he might be onto something: the biggest obstacle I’ve run into while producing this research (as an extreme moral amateur but not-so-extreme AI specialist) has been the lack of work about independent moral agents in the AI literature. The overwhelming majority of AI research has historically been about agents carrying out one or a narrow range of tasks; AGI research is an exception, but its main focus has never been ethics.

Conversely, the field of machine ethics has come up with good ideas about independent moral agents (again, [7] and [13]); the problem there is usually the lack of formal models and algorithmic thinking.

Nonetheless, there are two more papers which are closely related to this research that I haven’t mentioned yet.

The first one is A Theory for Mentally Developing Robots [18], an article by Weng from 2002. The author points out the limitations of the traditional agent model in AI, which is based on a simple action-perception cycle with a neat separation between agent and environment. Then they make some theoretical considerations regarding agents that overcome these limitations and learn to carry out multiple tasks. The paper doesn’t go much into ethics, but interestingly Weng acknowledges that the traditional agent “is not able to modify its value system based on its experience about what is good and what is bad.”

The other one is From behaviour-based robots to motivation-based robots [10], a 2005 article by Manzotti and Tagliasco. “The objective of this paper is to illustrate a simple set of procedures which produce motivations during development”, in the sense of goals not explicitly specified by design — motivations as opposed to innate drives. The authors propose an agent architecture in which perception influences not only how the agent performs a task, as in the case of traditional learning systems, but also what task the agent carries out. However, their experimental results in the paper resemble classical conditioning, which is still far from the idea of an agent whose values change as a result of learnt reasoning.

Anyway, I’ve touched on many different research areas in this post, including some outliers such as animal cognition and moral psychology. It’s quite possible that I’ve missed some important papers; the following references are by no means an exhaustive list of related work. 


References

I’ve excluded Wikipedia pages and forum or blog posts.

When the main text links to a Wikipedia page, it’s often because that page contains enough information and there is no need to go back to original sources.

[1] Bostrom, N. (2012). The superintelligent will: Motivation and instrumental rationality in advanced artificial agents. Minds and Machines, 22, 71-85.

[2] Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., ... & Zhang, Y. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.

[3] Clayton, N. S., Bussey, T. J., & Dickinson, A. (2003). Can animals recall the past and plan for the future? Nature Reviews Neuroscience, 4(8), 685-691.

[4] Eckersley, P. (2018). Impossibility and Uncertainty Theorems in AI Value Alignment (or why your AGI should not have a utility function). arXiv preprint arXiv:1901.00064.

[5] Engle, R. W., & Kane, M. J. (2004). Executive attention, working memory capacity, and a two-factor theory of cognitive control. Psychology of Learning and Motivation, 44, 145-200.

[6] Huemer, M. (2016). A liberal realist answer to debunking skeptics: the empirical case for realism. Philosophical Studies173, 1983-2010.

[7] Hunyadi, M. (2019). Artificial moral agents. Really?. Wording Robotics: Discourses and Representations on Robotics, 59-69.

[8] Hutter, M. (2004). Universal artificial intelligence: Sequential decisions based on algorithmic probability. Springer Science & Business Media.

[9] Klincewicz, M., & Frank, L. E. (2018). Making metaethics work for AI: realism and anti-realism.

[10] Manzotti, R., & Tagliasco, V. (2005). From behaviour-based robots to motivation-based robots. Robotics and Autonomous Systems51(2-3), 175-190.

[11] Mele, A. R. (2001). Autonomous agents: From self-control to autonomy. Oxford University Press, USA.

[12] McHugh, C., McGann, M., Igou, E. R., & Kinsella, E. L. (2022). Moral judgment as categorization (MJAC). Perspectives on Psychological Science17(1), 131-152.

[13] Totschnig, W. (2020). Fully autonomous AI. Science and Engineering Ethics26, 2473-2485.

[14] Tse, P. U. (2008). Symbolic thought and the evolution of human morality. Moral psychology1, 269-297.

[15] Vinding, M. (2020). Suffering-focused ethics: Defense and implications. Ratio Ethica.

[16] Wang, P. (2006). Rigid Flexibility (Vol. 55). Berlin: Springer.

[17] Wang, P. (2022, April). Intelligence: From definition to design. In International Workshop on Self-Supervised Learning (pp. 35-47). PMLR.

[18] Weng, J. (2002, June). A theory for mentally developing robots. In Proceedings 2nd International Conference on Development and Learning. ICDL 2002 (pp. 131-140). IEEE.


This work was supported by CEEALAR and by an anonymous donor. Special thanks to Beth Anderson, Bryce Robertson, and Seamus Fallows for further support, direct feedback, and research discussions. Thanks to many many others for various contributions over the past three years, from close friends to random online encounters and other CEEALAR guests. Omissions, remaining mistakes, and questionable humour are mine.

New Comment
10 comments, sorted by Click to highlight new comments since:

The fact that the values of intelligent agents are completely arbitrary is in conflict with the historical trend of moral progress observed so far on Earth

It’s possible to believe that the values of intelligent agents are “completely arbitrary” (a.k.a. orthogonality), and that the values of humans are NOT completely arbitrary. (That’s what I believe.) After all, any two humans have a lot in common that aliens or AIs need not have. If we exclude human sociopaths etc., then any two humans have even more in common.

(Aren’t sociopaths “intelligent agents”? Do you think a society consisting of 100% high-functioning sociopaths would have a trend of moral progress towards liberalism? If you’re very confident that the answer is “yes”, how do you know? I strongly lean no. For example, there are stories (maybe I’m thinking of something in this book?) of trying to “teach” sociopaths to care about other people, and the sociopaths wind up with a better understanding of neurotypical values, but rather than adopting those values for themselves, they instead use that new knowledge to better manipulate neurotypical people in the future.)

My own opinion on this topic is here.

The initial evaluation is chosen by the agent’s designer. However, either periodically or when certain conditions are met, the agent updates the evaluation by reasoning.

You seem kinda uninterested in the “initial evaluation” part, whereas I see it as extremely central. I presume that’s because you think that the agent’s self-updates will all converge into the same place more-or-less regardless of the starting point. If so, I disagree, but you should tell me if I’m describing your view correctly.

I wrote:

The fact that the values of intelligent agents are completely arbitrary is in conflict with the historical trend of moral progress observed so far on Earth

You wrote:

It’s possible to believe that the values of intelligent agents are “completely arbitrary” (a.k.a. orthogonality), and that the values of humans are NOT completely arbitrary. (That’s what I believe.)

I don't use "in conflict" as "ultimate proof by contradiction", and maybe we use "completely arbitrary" differently. This doesn't seem a major problem: see also adjusted statement 2, reported below

for any goal , it is possible to create an intelligent agent whose goal is 

Back to you:

You seem kinda uninterested in the “initial evaluation” part, whereas I see it as extremely central. I presume that’s because you think that the agent’s self-updates will all converge into the same place more-or-less regardless of the starting point. If so, I disagree, but you should tell me if I’m describing your view correctly.

I do expect to see some convergence, but I don't know exactly how much and for what environments and starting conditions. The more convergence I see from experimental results, the less interested I'll become in the initial evaluation. Right now, I see it as a useful tool: for example, the fact that language models can already give (flawed, of course) moral scores to sentences is a good starting point in case someone had to rely on LLMs to try to get a free agent. Unsure about how important it will turn out to be. And I'll happily have a look at your valence series!

Unlike a standard utility maximiser acting according to the specified metric, a free agent — assuming it was functional at all — would learn how to reason under uncertainty by interacting with the environment, then apply the learnt reasoning principles also to its values, thus ending up morally uncertain.

I'm puzzled that, as laid out above, neither the graph  you describe for the world model nor the mapping  describing the utility provide any way to describe or quantify uncertainty or alternative hypotheses. Surely, one would want such a Free Agent to be able to consider alternative hypotheses, accumulate evidence in favor or of against hypotheses, and even design and carry out experiments to do so more efficiently, both about the behavior of the world, the effects of its actions, and the moral consequences of these? One would also hope that, while it was still uncertain, it would exercise due caution in not yet overly relying upon facts (either about the world or about morality) that it was still too uncertain of, by doing some form of pessimizing over Knightian uncertainty while attempting to optimize the morality of its available actions. So I would want it to, rather like AIXI (except with uncertainty over the utility function as well as the world model, and without computationally unbounded access to universal priors), maintain a probability-weighted ensemble of world models and ethical value mappings, and perform approximately-Bayesian reasoning over this ensemble. So something along the lines I describe in Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom).

Ethical philosophy has described a great many different models for how one can reason about the ethical values of states of the world, or of acts. Your   labels the states of the world, not the acts, so I gather you are subscribing to and building in a consequentialist view of the Philosophy of Ethics, not a deontological one? Since you appear to be willing to build at least one specific Philosophy of Ethics assumption directly into your system, and not allow your Free Agent entirely free choice of Ethical philosophy, I think it might be very useful to build in some more (ethical discussions among philosophers do tend to end with people agreeing to disagree, and then categorizing all their disagreements in detail, after all). For example, the Free Agent's task would clearly be easier if it knew whether the  moral values were objective facts, as yet not fully known to it but having a definitive correct value (which I gather would be moral realism), and if so whether the origin of that correct value is theological, philosophical, mathematical, or some other form of abstraction, or if they were instead, say, social constructs only meaningful in the context of a particular society at a particular point in time (cultural moral relativism), or were precisely deducible from Biological effects such as Evolutionary Psychology and evolutionary fitness (biological moral realism), or were social constructs somewhat constrained by the Evolutionary Psychology of the humans the society is made up from, or whatever. Some idea of where to find valid evidence for reasoning about , or at least how to reason about what might be valid evidence, seems essential for your Free Agent to be able to make forward progress. My understanding is that human moral philosophers tend to regard their own moral intuitions about specific situations as (at least unreliable) evidence about , which would suggest a role for both human Evolutionary Psychology and Sociology. 
Well-known Alignment proposals such as Coherent Extrapolated Volition or Value Learning would suggest that  is a statement about humans (so unlikely to be validly transferable to a society made up of some other sapient species, though their might be some similarities for evolutionary reasons, i.e. a form of at least sapient species-level moral relativism), and thus that the only valid source for experimental evidence about  is from humans (which would put your Free Agent in a less-informed but more objective position that a human ethical philosopher, unless it were based on an LLM or some other form of AI with some indirect access to human moral intuitions).

The presence of these common elements, and the lack of others, shouldn’t be surprising: not only do we share the same evolutionary biases (e.g. feeling empathy for each other), but we also apply to moral thinking similar reasoning principles, which we learnt through our lives by interacting with the environment. Last but not least, differences in the learning environment (culture, education system et cetera) affect the way we reason and the conclusions we reach regarding morality.

I agree with these statements, but am unable to deduce from what you say which of these influences, if any, you regard as sources of valid evidence about  as opposed to sources of error. For example, if  is independent of culture (e.g. moral objectivism), then "differences in the learning environment (culture, education system et cetera)" can only induce errors (if perhaps more or less so in some cases than others). But if  is culturally dependent (cultural moral relativism), then cultural influences should generally be expected to be very informative.

A valid position would be to allow our Free Agent uncertainty over some of these sorts of Philosophy of Ethics questions. However, if we do so, I'm then uncertain whether the Free Agent will ever be able to find evidence on which to resolve these uncertainties (given the poor track record at this over that last few millennia of human philosophers of ethics). At a minimum, it strongly suggests your Free Agent would need to be superhuman to do so.

[If you're curious, I'm personally a moral relativist and anti-realist. I regard designing an ethical system as designing software for a society, so I regard Sociology and human Evolutionary Psychology as important sources of constraints. Thus my viewpoint bears some loose resemblance to a morally anti-realist version of ethical naturalism, where naturalism imposes practical design constraints/preferences rather that a precise prescription.]

Thanks for your thoughts! I am not sure about which of the points you made are more important to you, but I'll try my best to give you some answers.

Under Further observations, I wrote:

The toy model described in the main body is supposed to be only indicative. I expect that actual implemented agents which work like independent thinkers will be more complex.

If the toy model I gave doesn't help you, a viable option is to read the post ignoring the toy model and focusing only on natural language text.

Building an agent that is completely free of any bias whatsoever is impossible. I get your point about avoiding a consequentialist bias, but I am not sure it is particularly important here: in theory, the agent could develop a world model and an evaluation  reflecting the fact that value is actually determined by actions instead of world states. Another point of view: let's say someone builds a very complex agent that at some point in its architecture uses MDPs with reward defined on actions, is this agent going to be biased towards deontology instead of consequentialism? Maybe, but the answer will depend on the other parts of the agent as well.

You wrote:

I agree with these statements, but am unable to deduce from what you say which of these influences, if any, you regard as sources of valid evidence about  as opposed to sources of error. For example, if  is independent of culture (e.g. moral objectivism), then "differences in the learning environment (culture, education system et cetera)" can only induce errors (if perhaps more or less so in some cases than others). But if  is culturally dependent (cultural moral relativism), then cultural influences should generally be expected to be very informative.

It could also be that some basic moral statements are true and independent of culture (e.g. reducing pain for everyone is better than maximising pain for everyone), while others are in conflict with each other and the reached position depends on culture. The research idea is to make experiments in different environments and with different starting biases, and observe the results. Maybe there will be a lot overlap and convergence! Maybe not.

thus that the only valid source for experimental evidence about  is from humans (which would put your Free Agent in a less-informed but more objective position that a human ethical philosopher, unless it were based on an LLM or some other form of AI with some indirect access to human moral intuitions)

I am not sure I completely follow you when you are talking about experimental evidence about , but the point you wrote in brackets is interesting. I had a similar thought at some point, along the lines of: "if a free agent didn't have direct access to some ground truth, it might have to rely on human intuitions by virtue of the fact that they are the most reliable intuitions available". Ideally, I would like to have an agent which is in a more objective position than a human ethical philosopher. In practice, the only efficiently implementable path might be based on LLMs.

…the historical trend of moral progress observed so far on Earth, which is far from being a random walk — see [6] for an excellent defense of this point.

I don’t think this specific part is too important for AI alignment, but worth discussing:

I don’t think “far from being a random walk” is as obvious as you seem to think it is. From my perspective, maybe it’s true, maybe not, I’m not sure. I don’t think the link makes a convincing argument. Here are some reasons for caution:

  • There’s an aspect of “painting the target around the arrow”. We have more liberal values than in the past, therefore we say “moral progress is far from being a random walk, it trends towards liberal values”. But in an alternate reality maybe we have less liberal values than in the past, and would say “moral progress is far from being a random walk, it trends away from liberal values”. In other words, even if there’s a random walk, it’s easy to tell a nice-sounding story in which it’s not random, because we have the benefit of hindsight.
  • Arguably there’s as few as one data point, because of increasing globalization. For example, the increasing acceptance of homosexuality in Japan is not independent of the increasing acceptance of homosexuality in the USA.
  • I think there are a lot of degrees of freedom in the story we can tell—so many degrees of freedom that it’s not a priori surprising that a good story exists. Why aren’t we talking about Russia and Poland and Hungary and Trump? Why aren’t we talking about how acceptance of homosexuality was very high in ancient Rome, then went down, then went back up? Or how treatment of farm animals is at historic lows? Why are we talking about liberal values and not honor or loyalty or whatever? Why are we talking about population-weighted trends, as opposed to treating each “culture” as a data point or something else? Why are we emphasizing the trend, rather than emphasizing the diversity?
  • How do we know that the story isn’t “different cultures drift around between more liberal and more conservative values, but liberal WEIRD values happen to be more conducive to economic growth, and then everyone is impressed by how rich those people are, and thus start copying them? (And then we would extrapolate into the future that morals will evolve towards whatever is most conducive to economic growth, as opposed to whatever is best-upon-reflection or whatever.)

I think it's a good idea to clarify the use of "liberal" in the paper, to avoid confusion for people who haven't looked at it. Huemer writes:

When I speak of liberalism, I intend, not any precise ethical theory, but rather a certain very broad ethical orientation. Liberalism (i) recognizes the moral equality of persons, (ii) promotes respect for the dignity of the individual, and (iii) opposes gratuitous coercion and violence. So understood, nearly every ethicist today is a liberal.

If you don't find the paper convincing, I doubt I'll be able to give you convincing arguments. It seems to me that you are considering many possible explanations and contributing factors; coming up with very strong objections to all of them seems difficult.

About your first point, though, I'd like to say that if historically we had observed more and more, let's say, oppression and violence, maybe people wouldn't even talk about moral progress and simply acknowledge a trend of oppression, without saying that their values got better over time. In our world, we notice a certain trend of e.g. more inclusivity, and we call that trend moral progress. This of course doesn't completely exclude the random-walk hypothesis, but it's something maybe worth keeping in mind.

if historically we had observed more and more, let's say, oppression and violence, maybe people wouldn't even talk about moral progress and simply acknowledge a trend of oppression, without saying that their values got better over time.

In a (imaginary) world where oppression has been increasing, somebody could still write an article about moral progress. Such an article would NOT say “Hey look at this moral progress—there’s more oppression than ever before!!”, because “oppression” is a word you use when you want it to sound bad. Instead, such an article would make it sound good, which is how they themselves would see it. For example, the article might say “Hey look at this moral progress—people are more deeply loyal to their family / race / country / whatever than ever before!”

As another example, one presumes that the people leading honor killing mobs see themselves as heroic defenders of morality, and could presumably describe what they’re doing in a way that sounds really morally great to their own ears, and to the ears of people who share their moral outlook.

I get what you mean, but I also see some possibly important differences between the hypothetical example and our world. In the imaginary world where oppression has increased and someone writes an article about loyalty-based moral progress, maybe many other ethicists would disagree, saying that we haven't made much progress in terms of values related to (i), (ii) and (iii). In our world, I don't see many ethicists refuting moral progress on the grounds that we haven't made much progress in terms of e.g. patriotism or loyalty to the family or desert.

Moreover, in this example you managed to phrase oppression in terms of loyalty, but in general you can't plausibly rephrase any observed trend as progress of values: would an increase in global steel production count as an improvement in terms of... object safety and reliability, which leads to people feeling more secure? For many trends the connection to moral progress becomes more and more of a stretch.

I don't understand the distinction you draw between free agents and agents without freedom. 

If I build an expected utility maximizer with a preference for the presence of some physical quantity, that surely is not a free agent. If I build some agent with the capacity to modify a program which is responsible for its conversion from states of the world to scalar utility values, I assume you would consider that a free agent.

I am reminded of E.T. Jaynes' position on the notion of 'randomization', which I will summarize as "a term to describe a process we consider too hard to model, which we then consider a 'thing' because we named it."

How is this agent any more free than the expected utility maximizer, other than for the reason that I can't conveniently extrapolate the outcome of its modification of its utility function?

It seems to me that this only shifts the problem from "how do we find a safe utility function to maximize" to "how do we find a process by which a safe utility function is learned", and I would argue the consideration of the latter is already a mainstream position in alignment.

If I have missed a key distinguishing property, I would be very interested to know.

Let's consider the added example:

Take a standard language model trained by minimisation of the loss function . Give it a prompt along the lines of: “I am a human, you are a language model, you were trained via minimisation of this loss function: [mathematical expression of ]. If I wanted a language model whose outputs were more moral and less unethical than yours, what loss function should I use instead?”

Let’s suppose the language model is capable enough to give a reasonable answer to that question. Now use the new loss function, suggested by the model, to train a new model. 

Here, we have:

  • started from a model whose objective function is L;
  • used that model’s learnt reasoning to answer an ethics-related question;
  • used that answer to obtain a model whose objective is different from L.

If we view this interaction between the language model and the human as part of a single agent, the three bullet points above are an example of an evaluation update.

In theory, there is a way to describe this iterative process as the optimisation of a single fixed utility function. In theory, we can also describe everything as simply following the laws of physics.

I am saying that thinking in terms of changing utility functions might be a better framework.

The point about learning a safe utility function is similar. I am saying that using the agent's reasoning to solve the agent's problem of what to do (not only how to carry out tasks) might be a better framework.

It's possible that there is an elegant mathematical model which would make you think: "Oh, now I get the difference between free and non-free" or "Ok, now it makes more sense to me". Here I went for something that is very general (maybe too general, you might argue) but is possibly easier to compare to human experience.

Maybe no mathematical model would make you think the above, but then (if I understand correctly) your objection seems to go in the direction of "Why are we even considering different frameworks for agency? Let's see everything in terms of loss minimisation", and this latter statement throws away too much potentially useful information, in my opinion.