Michele Campolo

Lifelong recursive self-improver, on his way to exploding really intelligently :D

More seriously: my posts are mostly about AI alignment, with an eye towards moral progress and creating a better future. If there was a public machine ethics forum, I would write there as well.

An idea:

  • We have a notion of what good is and how to do good
  • We could be wrong about it
  • It would be nice if we could use technology not only to do good, but also to improve our understanding of what good is.

The idea above, and the fact that I’d like to avoid producing technology that can be used for bad purposes, is what motivates my research. Feel free to reach out if you relate!

At the moment I am doing research at CEEALAR on agents whose behaviour is driven by a reflective process analogous to human moral reasoning, rather than by a metric specified by the designer. See Free agents.

Here are other suggested readings from what I've written so far:

-Naturalism and AI alignment
-From language to ethics by automated reasoning
-Criticism of the main framework in AI alignment


Ongoing project on moral AI



This was a great read, thanks for writing!

Despite the unpopularity of my research on this forum, I think it's worth saying that I am also working towards Vision 2, with the caveat that autonomy in the real world (e.g. with a robotic body) or on the internet is not necessary: one could aim for an independent-thinker AI that can do what it thinks is best only by communicating via a chat interface. Depending on what this independent thinker says, different outcomes are possible, including the outcome in which most humans simply don't care about what this independent thinker advocates for, at least initially. This would be an instance of Vision 2 with a slow and somewhat human-controlled pace of change, instead of a rapid one.

Moreover, I don't know what views they have about autonomy as depicted in Vision 2, but it seems to me that Shard Theory and some research bits by Beren Millidge are also, to some extent, adjacent to the idea of AI which develops its own concept of something being best (and then acts towards it); or, at least, AI which is more human-like in its thinking. Please correct me if I'm wrong.

I hope you'll manage to make progress on brain-like AGI safety! It seems that various research agendas are heading towards the same kind of AI, just from different angles.

I get what you mean, but I also see some possibly important differences between the hypothetical example and our world. In the imaginary world where oppression has increased and someone writes an article about loyalty-based moral progress, maybe many other ethicists would disagree, saying that we haven't made much progress in terms of values related to (i), (ii) and (iii). In our world, I don't see many ethicists refuting moral progress on the grounds that we haven't made much progress in terms of e.g. patriotism or loyalty to the family or desert.

Moreover, in this example you managed to phrase oppression in terms of loyalty, but in general you can't plausibly rephrase any observed trend as progress of values: would an increase in global steel production count as an improvement in terms of... object safety and reliability, which leads to people feeling more secure? For many trends the connection to moral progress becomes more and more of a stretch.

Let's consider the added example:

Take a standard language model trained by minimisation of the loss function L. Give it a prompt along the lines of: “I am a human, you are a language model, you were trained via minimisation of this loss function: [mathematical expression of L]. If I wanted a language model whose outputs were more moral and less unethical than yours, what loss function should I use instead?”

Let’s suppose the language model is capable enough to give a reasonable answer to that question. Now use the new loss function, suggested by the model, to train a new model. 

Here, we have:

  • started from a model whose objective function is L;
  • used that model’s learnt reasoning to answer an ethics-related question;
  • used that answer to obtain a model whose objective is different from L.

If we view this interaction between the language model and the human as part of a single agent, the three bullet points above are an example of an evaluation update.
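A toy sketch of the three steps above, under heavy assumptions: the "model" is reduced to a single number (its behaviour), "training" is a grid search for the minimiser of a loss, and `suggest_loss` is a purely hypothetical stand-in for the model's answer to the prompt. None of these names come from an actual implementation; the only point is to show the shape of an evaluation update, where the objective of the next model is produced by the previous model rather than fixed by the designer.

```python
from dataclasses import dataclass
from typing import Callable

# A loss maps a scalar "behaviour" to a loss value (toy assumption).
Loss = Callable[[float], float]


@dataclass
class Model:
    loss: Loss        # the objective this model was trained on
    behaviour: float  # the behaviour that training under `loss` produced


def train(loss: Loss) -> Model:
    """Stand-in for training: grid-search the behaviour that minimises the loss."""
    grid = [i / 10 for i in range(-50, 51)]
    return Model(loss=loss, behaviour=min(grid, key=loss))


def suggest_loss(model: Model) -> Loss:
    """Hypothetical stand-in for the model's answer to the ethics-related prompt.

    Here the "answer" is simply a quadratic loss whose minimum sits 1.0 away
    from the model's current behaviour.
    """
    target = model.behaviour + 1.0
    return lambda x: (x - target) ** 2


def evaluation_update(model: Model) -> Model:
    """Use the current model's answer to obtain a model with a different objective."""
    new_loss = suggest_loss(model)  # step 2: the model's reasoning yields a new objective
    return train(new_loss)          # step 3: a new model is trained on that objective


initial = train(lambda x: x ** 2)     # step 1: a model whose objective is L
updated = evaluation_update(initial)  # objective now differs from L
print(initial.behaviour, updated.behaviour)
```

Calling `evaluation_update` repeatedly gives the iterative process discussed next: each round's objective depends on what the previous model said, not on a metric written down in advance.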

In theory, there is a way to describe this iterative process as the optimisation of a single fixed utility function. In theory, we can also describe everything as simply following the laws of physics.

I am saying that thinking in terms of changing utility functions might be a better framework.

The point about learning a safe utility function is similar. I am saying that using the agent's reasoning to solve the agent's problem of what to do (not only how to carry out tasks) might be a better framework.

It's possible that there is an elegant mathematical model which would make you think: "Oh, now I get the difference between free and non-free" or "Ok, now it makes more sense to me". Here I went for something that is very general (maybe too general, you might argue) but is possibly easier to compare to human experience.

Maybe no mathematical model would make you think the above, but then (if I understand correctly) your objection seems to go in the direction of "Why are we even considering different frameworks for agency? Let's see everything in terms of loss minimisation", and this latter statement throws away too much potentially useful information, in my opinion. 

I think it's a good idea to clarify the use of "liberal" in the paper, to avoid confusion for people who haven't looked at it. Huemer writes:

When I speak of liberalism, I intend, not any precise ethical theory, but rather a certain very broad ethical orientation. Liberalism (i) recognizes the moral equality of persons, (ii) promotes respect for the dignity of the individual, and (iii) opposes gratuitous coercion and violence. So understood, nearly every ethicist today is a liberal.

If you don't find the paper convincing, I doubt I'll be able to give you convincing arguments. It seems to me that you are considering many possible explanations and contributing factors; coming up with very strong objections to all of them seems difficult.

About your first point, though, I'd like to say that if historically we had observed more and more, let's say, oppression and violence, maybe people wouldn't even talk about moral progress and simply acknowledge a trend of oppression, without saying that their values got better over time. In our world, we notice a certain trend of e.g. more inclusivity, and we call that trend moral progress. This of course doesn't completely exclude the random-walk hypothesis, but it's something maybe worth keeping in mind.

I wrote:

The fact that the values of intelligent agents are completely arbitrary is in conflict with the historical trend of moral progress observed so far on Earth

You wrote:

It’s possible to believe that the values of intelligent agents are “completely arbitrary” (a.k.a. orthogonality), and that the values of humans are NOT completely arbitrary. (That’s what I believe.)

I don't use "in conflict" as "ultimate proof by contradiction", and maybe we use "completely arbitrary" differently. This doesn't seem to be a major problem: see also adjusted statement 2, reported below:

for any goal G, it is possible to create an intelligent agent whose goal is G

Back to you:

You seem kinda uninterested in the “initial evaluation” part, whereas I see it as extremely central. I presume that’s because you think that the agent’s self-updates will all converge into the same place more-or-less regardless of the starting point. If so, I disagree, but you should tell me if I’m describing your view correctly.

I do expect to see some convergence, but I don't know exactly how much and for what environments and starting conditions. The more convergence I see from experimental results, the less interested I'll become in the initial evaluation. Right now, I see it as a useful tool: for example, the fact that language models can already give (flawed, of course) moral scores to sentences is a good starting point in case someone had to rely on LLMs to try to get a free agent. Unsure about how important it will turn out to be. And I'll happily have a look at your valence series!

Thanks for your thoughts! I am not sure about which of the points you made are more important to you, but I'll try my best to give you some answers.

Under Further observations, I wrote:

The toy model described in the main body is supposed to be only indicative. I expect that actual implemented agents which work like independent thinkers will be more complex.

If the toy model I gave doesn't help you, a viable option is to read the post ignoring the toy model and focusing only on natural language text.

Building an agent that is completely free of any bias whatsoever is impossible. I get your point about avoiding a consequentialist bias, but I am not sure it is particularly important here: in theory, the agent could develop a world model and an evaluation  reflecting the fact that value is actually determined by actions instead of world states. Another point of view: let's say someone builds a very complex agent that at some point in its architecture uses MDPs with reward defined on actions, is this agent going to be biased towards deontology instead of consequentialism? Maybe, but the answer will depend on the other parts of the agent as well.

You wrote:

I agree with these statements, but am unable to deduce from what you say which of these influences, if any, you regard as sources of valid evidence about  as opposed to sources of error. For example, if  is independent of culture (e.g. moral objectivism), then "differences in the learning environment (culture, education system et cetera)" can only induce errors (if perhaps more or less so in some cases than others). But if  is culturally dependent (cultural moral relativism), then cultural influences should generally be expected to be very informative.

It could also be that some basic moral statements are true and independent of culture (e.g. reducing pain for everyone is better than maximising pain for everyone), while others are in conflict with each other and the reached position depends on culture. The research idea is to run experiments in different environments and with different starting biases, and observe the results. Maybe there will be a lot of overlap and convergence! Maybe not.

thus that the only valid source for experimental evidence about  is from humans (which would put your Free Agent in a less-informed but more objective position than a human ethical philosopher, unless it were based on an LLM or some other form of AI with some indirect access to human moral intuitions)

I am not sure I completely follow you when you are talking about experimental evidence about , but the point you wrote in brackets is interesting. I had a similar thought at some point, along the lines of: "if a free agent didn't have direct access to some ground truth, it might have to rely on human intuitions by virtue of the fact that they are the most reliable intuitions available". Ideally, I would like to have an agent which is in a more objective position than a human ethical philosopher. In practice, the only efficiently implementable path might be based on LLMs.

To a kid, 'bad things' and 'things my parents don't want me to do' overlap to a large degree. This is not true for many adults. This is probably why the step

suffering is "to be avoided" in general, therefore suffering is "thing my parents will punish for"

seems weak.

Overall, what is the intention behind your comments? Are you trying to understand my position even better, and if so, why? Are you interested in funding this kind of research; or are you looking for opportunities to change your mind; or are you trying to change my mind?

I don't know what passes your test of 'in principle be an inherently compelling argument'. It's a toy example, but here are some steps that to me seem logical / rational / coherent / right / sensible / correct:

  1. X is a state of mind that feels bad to whatever mind experiences it (this is the starting assumption, it seems we agree that such an X exists, or at least something similar to X)
  2. X, experienced on a large scale by many minds, is bad
  3. Causing X on a large scale is bad
  4. When considering what to do, I'll discard actions that cause X, and choose other options instead.

Now, some people will object and say that there are holes in this chain of reasoning, i.e. that 2 doesn't logically follow from 1, or 3 doesn't follow from 2, or 4 doesn't follow from 3. For the sake of this discussion, let's say that you object to the step from 1 to 2. Then, what about this replacement:

  1. X is a state of mind that feels bad to whatever mind experiences it [identical to original 1]
  2. X, experienced on a large scale by many minds, is good [replaced 'bad' with 'good']

Does this passage from 1 to 2 seem, to you (our hypothetical objector), equally logical / rational / coherent / right / sensible / correct as the original step from 1 to 2? Could I replace 'bad' with basically anything, and the correctness would not change at all as a result?

My point is that, to many reflecting minds, the replacement seems less logical / rational / coherent / right / sensible / correct than the original step. And this is what I care about for my research: I want an AI that reflects in a similar way, an AI to which the original steps do seem rational and sensible, while replacements like the one I gave do not.

we share an objective reality in which there are real particles (or wave function approximately decomposable to particles or whatever) organized in patterns, that give rise to patterns of interaction with our senses that we learn to associate with the word "dog". That latent shared reality ultimately allows us to talk about dogs, and check whether there is a dog in my house, and usually agree about the result.

Besides the sentence 'check whether there is a dog in my house', it seems ok to me to replace the word 'dog' with the word 'good' or 'bad' in the above paragraph. Agreement might be less easy to achieve, but it doesn't mean finding a common ground is impossible.

For example, some researchers classify emotions according to valence, i.e. whether it is an overall good or bad experience for the experiencer, and in the future we might be able to find a map from brain states to whether a person is feeling good or bad. In this sense of good and bad, I'm pretty sure that moral philosophers who argue for the maximisation of bad feelings for the largest amount of experiencers are a very small minority. In other terms, we agree that maximising negative valence on a large scale is not worth doing.

(Personally, however, I am not a fan of arguments based on agreement or disagreement, especially in the moral domain. Many people in the past used to think that slavery was ok: does it mean slavery was good and right in the past, while now it is bad and wrong? No, I'd say that normally we use the words good/bad/right/wrong in a different way, to mean something else; similarly, we don't normally use the word 'dog' to mean e.g. 'wolf'. From a different domain: there is disagreement in modern physics about some aspects of quantum mechanics. Does it mean quantum mechanics is fake / not real / a matter of subjective opinion? I don't think so)

I might be misunderstanding you: take this with a grain of salt.

From my perspective: if convergence theorems did not work to a reasonable degree in practice, nobody would use RL-related algorithms. If I set reward in place A, but by default agents end up going somewhere far away from A, my approach is not doing what it is supposed to do; I put reward in place A because I wanted an agent that would go towards A to a certain extent.

I am not familiar with PPO. From this short article, in the section about TRPO:

Recall that due to approximations, theoretical guarantees no longer hold.

Is this what you are referring to? But is it important for alignment? Let's say the conditions for convergence are not met anymore, the theorem can't be applied in theory, but in practice I do get an agent that goes towards A, where I've put reward. Is it misleading to say that the agent is maximising reward?
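To make "I put reward in place A and the agent goes towards A" concrete, here is a minimal sketch, assuming a toy 1-D corridor and plain tabular Q-learning (not PPO or TRPO). The environment, the function name `q_learning_corridor` and all hyperparameters are illustrative assumptions of mine; ties in the greedy choice are broken towards "right", which slightly favours reaching A early on.

```python
import random


def q_learning_corridor(n_states=6, goal=5, episodes=500,
                        alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a corridor of `n_states` cells; reward 1 only on reaching `goal`."""
    rng = random.Random(seed)
    moves = (-1, +1)  # action 0 = left, action 1 = right
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(100):  # cap on episode length
            if rng.random() < epsilon:
                a = rng.randrange(2)                 # explore
            else:
                a = 0 if q[s][0] > q[s][1] else 1    # exploit (ties go right)
            s2 = min(max(s + moves[a], 0), n_states - 1)
            reached = s2 == goal
            r = 1.0 if reached else 0.0
            # Standard Q-learning target; no bootstrapping from the terminal state.
            target = r if reached else r + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])
            s = s2
            if reached:
                break
    return q


q = q_learning_corridor()
# Greedy action in every non-goal state after learning.
greedy = ["left" if q[s][0] > q[s][1] else "right" for s in range(5)]
print(greedy)
```

In this toy setting the learned greedy policy heads for the cell where the reward was placed, which is the everyday sense in which I would say the agent is maximising reward, whether or not a formal convergence guarantee applies.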

(However, keep in mind that

I agree with Turner that modelling humans as simple reward maximisers is inappropriate.)


If you could unpack your belief "There aren't interesting examples like this which are alignment-relevant", I might be able to give a more precise/appropriate reply.
