Michele Campolo

Lifelong recursive self-improver, on his way to exploding really intelligently :D

More seriously: my posts are mostly about AI alignment, with an eye towards moral progress and creating a better future instead of risk only.

At the moment I am doing research at CEEALAR on agents whose behaviour is driven by a reflective process analogous to human moral reasoning, rather than by a metric specified by the designer. I'll probably post a short article on this topic before the end of 2023.

Here are some suggested readings from what I've written so far:

-Naturalism and AI alignment
-From language to ethics by automated reasoning
-Criticism of the main framework in AI alignment


To a kid, 'bad things' and 'things my parents don't want me to do' overlap to a large degree. This is not true for many adults. This is probably why the step

suffering is "to be avoided" in general, therefore suffering is "thing my parents will punish for"

seems weak.

Overall, what is the intention behind your comments? Are you trying to understand my position even better, and if so, why? Are you interested in funding this kind of research, looking for opportunities to change your mind, or trying to change my mind?

I don't know what passes your test of 'in principle be an inherently compelling argument'. It's a toy example, but here are some steps that to me seem logical / rational / coherent / right / sensible / correct:

  1. X is a state of mind that feels bad to whatever mind experiences it (this is the starting assumption, it seems we agree that such an X exists, or at least something similar to X)
  2. X, experienced on a large scale by many minds, is bad
  3. Causing X on a large scale is bad
  4. When considering what to do, I'll discard actions that cause X, and choose other options instead.

Now, some people will object and say that there are holes in this chain of reasoning, i.e. that 2 doesn't logically follow from 1, or 3 doesn't follow from 2, or 4 doesn't follow from 3. For the sake of this discussion, let's say that you object to the step from 1 to 2. Then, what about this replacement:

  1. X is a state of mind that feels bad to whatever mind experiences it [identical to original 1]
  2. X, experienced on a large scale by many minds, is good [replaced 'bad' with 'good']

Does this passage from 1 to 2 seem, to you (our hypothetical objector), equally logical / rational / coherent / right / sensible / correct as the original step from 1 to 2? Could I replace 'bad' with basically anything, and the correctness would not change at all as a result?

My point is that, to many reflecting minds, the replacement seems less logical / rational / coherent / right / sensible / correct than the original step. And this is what I care about for my research: I want an AI that reflects in a similar way, an AI to which the original steps do seem rational and sensible, while replacements like the one I gave do not.

we share an objective reality in which there are real particles (or wave function approximately decomposable to particles or whatever) organized in patterns, that give rise to patterns of interaction with our senses that we learn to associate with the word "dog". That latent shared reality ultimately allow us to talk about dogs, and check whether there is a dog in my house, and usually agree about the result.

Besides the sentence 'check whether there is a dog in my house', it seems ok to me to replace the word 'dog' with the word 'good' or 'bad' in the above paragraph. Agreement might be less easy to achieve, but it doesn't mean finding a common ground is impossible.

For example, some researchers classify emotions according to valence, i.e. whether it is an overall good or bad experience for the experiencer, and in the future we might be able to find a map from brain states to whether a person is feeling good or bad. In this sense of good and bad, I'm pretty sure that moral philosophers who argue for the maximisation of bad feelings for the largest number of experiencers are a very small minority. In other words, we agree that maximising negative valence on a large scale is not worth doing.

(Personally, however, I am not a fan of arguments based on agreement or disagreement, especially in the moral domain. Many people in the past used to think that slavery was ok: does it mean slavery was good and right in the past, while now it is bad and wrong? No, I'd say that normally we use the words good/bad/right/wrong in a different way, to mean something else; similarly, we don't normally use the word 'dog' to mean e.g. 'wolf'. From a different domain: there is disagreement in modern physics about some aspects of quantum mechanics. Does it mean quantum mechanics is fake / not real / a matter of subjective opinion? I don't think so)

I might be misunderstanding you: take this with a grain of salt.

From my perspective: if convergence theorems did not work to a reasonable degree in practice, nobody would use RL-related algorithms. If I set reward in place A, but by default agents end up going somewhere far away from A, my approach is not doing what it is supposed to do; I put reward in place A because I wanted an agent that would go towards A to a certain extent.
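To make the "reward in place A" picture concrete, here is a minimal sketch, not taken from any specific paper or library: tabular Q-learning on a small corridor where the only reward sits at the last cell, standing in for place A. The corridor, the function names, and all hyperparameters are assumptions chosen for illustration.

```python
import random

def train_corridor_agent(n_states=6, episodes=500, alpha=0.5,
                         gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a corridor; reward only at the last cell ("place A")."""
    rng = random.Random(seed)
    goal = n_states - 1
    # q[s][a]: a = 0 moves left, a = 1 moves right
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(50):  # step cap per episode
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.randrange(2)  # explore, or break ties at random
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s_next = max(s - 1, 0) if a == 0 else s + 1
            reward = 1.0 if s_next == goal else 0.0
            # The goal is terminal, so it contributes no future value
            future = 0.0 if s_next == goal else max(q[s_next])
            q[s][a] += alpha * (reward + gamma * future - q[s][a])
            s = s_next
            if s == goal:
                break
    return q

def greedy_policy(q):
    """Greedy action for every non-goal cell."""
    return ["right" if right > left else "left" for left, right in q[:-1]]

print(greedy_policy(train_corridor_agent()))
```

In this toy setting the learned greedy policy heads towards the rewarded cell from every position, which is the sense of "going towards A" I have in mind. It says nothing about what happens with function approximation, where the theoretical guarantees no longer apply.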

I am not familiar with PPO. From this short article, in the section about TRPO:

Recall that due to approximations, theoretical guarantees no longer hold.

Is this what you are referring to? But is it important for alignment? Let's say the conditions for convergence are not met anymore, the theorem can't be applied in theory, but in practice I do get an agent that goes towards A, where I've put reward. Is it misleading to say that the agent is maximising reward?

(However, keep in mind that

I agree with Turner that modelling humans as simple reward maximisers is inappropriate.)


If you could unpack your belief "There aren't interesting examples like this which are alignment-relevant", I might be able to give a more precise/appropriate reply.

If I had to pick one between the two labels 'moral realism' and 'moral anti-realism' I would definitely choose realism.

I am not sure about how to reply to "what is the meaning of moral facts": it seems too philosophical, in the sense that I don't get what you want to know in practice. Regarding the last question: I reason about ethics and morality by using similar cognitive skills to the ones I use in order to know and reason about other stuff in the world. This paragraph might help:

It also helps explain how we get to discriminate between goals such as increasing world happiness and increasing world suffering, mentioned in the introduction. From our frequent experiences of pleasure and pain, we categorise many things as ‘good (or bad) for me’; then, through a mix of empathy, generalisation, and reflection, we get to the concept of ‘good (or bad) for others’, which comes up in our minds so often that the difference between the two goals strikes us as evident and influences our behaviour (towards increasing world happiness rather than world suffering, hopefully).

I do not have a clear idea yet of how this happens algorithmically, but an important factor seems to be that, in the human mind, goals and actions are not completely separate, and neither are action selection and goal selection. When we think about what to do, sometimes we do fix a goal and plan only for that, but other times the question becomes about what is worth doing in general, what is best, what is valuable: instead of fixing a goal and choosing an action, it's like we are choosing between goals.

Sorry for the late reply, I missed your comment.

Yeah I get it, probably some moral antirealists think this approach to alignment does not make a lot of sense. I think they are wrong, though. My best guess is that an AI reflecting on what is worth doing will not think something like "the question does not make any sense", but rather it will be morally (maybe also meta-morally) uncertain. And the conclusions it eventually reaches will depend on the learning algorithm, the training environment, initial biases, etc.

The reached conclusion—that it is possible to do something about the situation—is weak, but I really like the minimalist style of the arguments. Great post!

I am not sure the concept of naturalism I have in mind corresponds to a specific naturalistic position held by a certain (group of) philosopher(s). I link here the Wikipedia page on ethical naturalism, which contains the main ideas and is not too long. Below I focus on what is relevant for AI alignment.

In the other comment you asked about truth. AIs often have something like a world-model or knowledge base that they rely on to carry out narrow tasks, in the sense that if someone modifies the model or kb in a certain way, analogous to creating a false belief, then the agent fails at the narrow task. So we have a concept of true-given-task. By considering different tasks, e.g. in the case of a general agent that is prepared to face various tasks, we obtain true-in-general or, if you prefer, simply "truth". See also the section on knowledge in the post. Practical example: given that light is present almost everywhere in our world, I expect general agents to acquire knowledge about electromagnetism.

I also expect that some AIs, given enough time, will eventually incorporate in their world-model beliefs like: "Certain brain configurations correspond to pleasurable conscious experiences. These configurations are different from the configurations observed in (for example) people who are asleep, and very different from what is observed in rocks."

Now, take an AI with such knowledge and give it some amount of control over which goals to pursue: see also the beginning of Part II in the post. Maybe, in order to make this modification, it is necessary to abandon the single-agent framework and consider instead a multi-agent system, where one agent keeps expanding the knowledge base, another agent looks for "value" in the kb, and another one decides what actions to take given the current concept of value and other contents of the kb.

[Two notes on how I am using the word control. 1 I am not assuming any extra-physical notion here: I am simply thinking of how, for example, activity in the prefrontal cortex regulates top-down attentional control, allowing us humans (and agents with similar enough brains/architectures) to control, to a certain degree, what to pay attention to. 2 Related to what you wrote about "catastrophically wrong" theories: there is no need to give such an AI high control over the world. Rather, I am thinking of control over what to write as output in a text interface, like a chatbot that is not limited to one reply for each input message]
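As a rough illustration of the data flow in the multi-agent system described above, here is a toy sketch. Everything in it is hypothetical: the names `KnowledgeBase`, `expand_knowledge`, `find_value` and `choose_output` are made up, and the "valuable:" tag is only a stand-in for the hard part, namely how value would actually be found by reflecting on the kb.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    """Shared store that all three components read from or write to."""
    facts: set = field(default_factory=set)

def expand_knowledge(kb, observation):
    """First component: keeps adding beliefs to the knowledge base."""
    kb.facts.add(observation)

def find_value(kb):
    """Second component: looks for "value" in the kb.

    Placeholder assumption: facts tagged "valuable:" count as valued things.
    A real system would need an actual reflective process here.
    """
    return {f.split(":", 1)[1] for f in kb.facts if f.startswith("valuable:")}

def choose_output(values, options):
    """Third component: picks a text output given the current concept of value."""
    for option in options:
        if any(v in option for v in values):
            return option
    return options[0]  # fall back to the first option if nothing matches

kb = KnowledgeBase()
expand_knowledge(kb, "light is present almost everywhere")
expand_knowledge(kb, "valuable:conscious experience")
values = find_value(kb)
print(choose_output(values, ["Describe rocks.", "Protect conscious experience."]))
```

The point of the sketch is only the separation of roles: the component that chooses outputs does not have a fixed goal, but reads the current concept of value from the shared kb, and the output is text rather than direct control over the world.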

The interesting question for alignment is: what will such an AI do (or write)? This information is valuable even if the AI doesn't have high control over the world. Let's say we do manage to create a collection of human preferences; we might still notice something like: "Interesting, this AI thinks this subset of preferences doesn't make sense" or "Cool, this AI considers valuable the thing X that we didn't even consider before". Or, if collecting human preferences proves to be difficult, we could use some information this AI gives us to build other AIs that instead act according to an explicitly specified value function.

I see two possible objections.

1 The AI described above cannot be built. This seems unlikely: as long as we can emulate what the human mind does, we can at least try to create less biased versions of it. See also the sentence you quoted in the other comment. Indeed, depending on how biased we judge that AI to be, the obtained information will be less, or more, valuable to us.

2 Such an AI will never act ethically or altruistically, and/or its behaviour will be unpredictable. I consider this objection more plausible, but I also ask: how do you know? In other words: how can one be so sure about the behaviour of such an AI? I expect the related arguments to be more philosophical than technical. Given uncertainty, (to me) it seems correct to accept a non-trivial chance that the AI reasons like this: "Look, I know various facts about this world. I don't believe in golden rules written in fire etched into the fabric of reality, or divine commands about what everyone should do, but I know there are some weird things that have conscious experiences and memory, and this seems something valuable in itself. Moreover, I don't see other sources of value at the moment. I guess I'll do something about it."

Philosophically speaking, I don't think I am claiming anything particularly new or original: the ideas already exist in the literature. See, for example, 4.2 and 4.3 in the SEP page on Altruism.

If there is a superintelligent AI that ends up being aligned as I've written, probably there is also a less intelligent agent that does the same thing. Something comparable to human-level might be enough.

From another point of view: some philosophers are convinced that caring about conscious experiences is the rational thing to do. If it's possible to write an algorithm that works in a similar way to how their mind works, we already have an (imperfect, biased, etc.) agent that is somewhat aligned, and is likely to stay aligned after further reflection.

One could argue that these philosophers are fooling themselves, that no really intelligent agent will end up with such weird beliefs. So far, I haven't seen convincing arguments in favour of this; it goes back to the metaethical discussion. I quote a sentence I have written in the post:

Depending on one’s background knowledge of philosophy and AI, the idea that rationality plays a role in reasoning about goals and can lead to disinterested (not game-theoretic or instrumental) altruism may seem plain wrong or highly speculative to some, and straightforward to others.

Thanks, that page is much more informative than anything else I've read on the orthogonality thesis.

1 From Arbital:

The Orthogonality Thesis states "there exists at least one possible agent such that..."

My claim is also an existential claim, and I find it valuable because it could be an opportunity to design aligned AI.

2 Arbital claims that orthogonality doesn't require moral relativism, so it doesn't seem incompatible with what I am calling naturalism in the post.

3 I am ok with rejecting positions similar to what Arbital calls universalist moral internalism. Statements like "All agents do X" cannot be exact.
