## AI ALIGNMENT FORUM

Michele Campolo

Lifelong recursive self-improver, on his way to exploding really intelligently :D

Background in mathematics, research at CEEALAR now. I focus on AI alignment, with an eye towards moral progress rather than just risk.

- Naturalism and AI alignment
- From language to ethics by automated reasoning


# Wiki Contributions

The conclusion reached, that it is possible to do something about the situation, is weak, but I really like the minimalist style of the arguments. Great post!

I am not sure the concept of naturalism I have in mind corresponds to a specific naturalistic position held by a certain (group of) philosopher(s). I link here the Wikipedia page on ethical naturalism, which contains the main ideas and is not too long. Below I focus on what is relevant for AI alignment.

In the other comment you asked about truth. AIs often have something like a world-model or knowledge base that they rely on to carry out narrow tasks, in the sense that if someone modifies the model or kb in a certain way (analogous to creating a false belief), then the agent fails at the narrow task. So we have a concept of true-given-task. By considering different tasks, e.g. in the case of a general agent that is prepared to face various tasks, we obtain true-in-general or, if you prefer, simply "truth". See also the section on knowledge in the post. Practical example: given that light is present almost everywhere in our world, I expect general agents to acquire knowledge about electromagnetism.

I also expect that some AIs, given enough time, will eventually incorporate in their world-model beliefs like: "Certain brain configurations correspond to pleasurable conscious experiences. These configurations are different from the configurations observed in (for example) people who are asleep, and very different from what is observed in rocks."

Now, take an AI with such knowledge and give it some amount of control over which goals to pursue: see also the beginning of Part II in the post. Maybe, in order to make this modification, it is necessary to abandon the single-agent framework and consider instead a multi-agent system, where one agent keeps expanding the knowledge base, another agent looks for "value" in the kb, and another one decides what actions to take given the current concept of value and other contents of the kb.

[Two notes on how I am using the word control. 1 I am not assuming any extra-physical notion here: I am simply thinking of how, for example, activity in the prefrontal cortex regulates top-down attentional control, allowing us humans (and agents with similar enough brains/architectures) to control, to a certain degree, what to pay attention to. 2 Related to what you wrote about "catastrophically wrong" theories: there is no need to give such an AI high control over the world. Rather, I am thinking of control over what to write as output in a text interface, like a chatbot that is not limited to one reply for each input message]

The interesting question for alignment is: what will such an AI do (or write)? This information is valuable even if the AI doesn't have high control over the world. Let's say we do manage to create a collection of human preferences; we might still notice something like: "Interesting, this AI thinks this subset of preferences doesn't make sense" or "Cool, this AI considers valuable the thing X that we didn't even consider before". Or, if collecting human preferences proves to be difficult, we could use some information this AI gives us to build other AIs that instead act according to an explicitly specified value function.

I see two possible objections.

1 The AI described above cannot be built. This seems unlikely: as long as we can emulate what the human mind does, we can at least try to create less biased versions of it. See also the sentence you quoted in the other comment. Indeed, depending on how biased we judge that AI to be, the obtained information will be less, or more, valuable to us.

2 Such an AI will never act ethically or altruistically, and/or its behaviour will be unpredictable. I consider this objection more plausible, but I also ask: how do you know? In other words: how can one be so sure about the behaviour of such an AI? I expect the related arguments to be more philosophical than technical. Given uncertainty, (to me) it seems correct to accept a non-trivial chance that the AI reasons like this: "Look, I know various facts about this world. I don't believe in golden rules written in fire etched into the fabric of reality, or divine commands about what everyone should do, but I know there are some weird things that have conscious experiences and memory, and this seems something valuable in itself. Moreover, I don't see other sources of value at the moment. I guess I'll do something about it."

Philosophically speaking, I don't think I am claiming anything particularly new or original: the ideas already exist in the literature. See, for example, 4.2 and 4.3 in the SEP page on Altruism.

If there is a superintelligent AI that ends up being aligned as I've written, probably there is also a less intelligent agent that does the same thing. Something comparable to human-level might be enough.

From another point of view: some philosophers are convinced that caring about conscious experiences is the rational thing to do. If it's possible to write an algorithm that works in a similar way to how their mind works, we already have an (imperfect, biased, etc.) agent that is somewhat aligned, and is likely to stay aligned after further reflection.

One could argue that these philosophers are fooling themselves, that no really intelligent agent will end up with such weird beliefs. So far, I haven't seen convincing arguments in favour of this; it goes back to the metaethical discussion. I quote a sentence I have written in the post:

Depending on one’s background knowledge of philosophy and AI, the idea that rationality plays a role in reasoning about goals and can lead to disinterested (not game-theoretic or instrumental) altruism may seem plain wrong or highly speculative to some, and straightforward to others.

1 From Arbital:

The Orthogonality Thesis states "there exists at least one possible agent such that..."

My claim is also an existential claim, and I find it valuable because it could be an opportunity to design aligned AI.

2 Arbital claims that orthogonality doesn't require moral relativism, so it doesn't seem incompatible with what I am calling naturalism in the post.

3 I am ok with rejecting positions similar to what Arbital calls universalist moral internalism. Statements like "All agents do X" cannot be exact.

I am aware of interpretability issues. This is why, for AI alignment, I am more interested in the agent described at the beginning of Part II than Scientist AI.

Thanks for the link to the sequence on concepts, I found it interesting!

Ok, if you want to clarify—I'd like to—we can have a call, or discuss in other ways. I'll contact you somewhere else.

Omega, a perfect predictor, flips a coin. If it comes up heads, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up tails and you were told it was tails. If it comes up tails, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up heads and you were told it was heads.

Here there is no question, so I assume it is something like: "What do you do?" or "What is your policy?"

That formulation is analogous to standard counterfactual mugging, stated in this way:

Omega flips a coin. If it comes up heads, Omega will give you 10000 in case you would pay 100 when tails. If it comes up tails, Omega will ask you to pay 100. What do you do?

According to these two formulations, the correct answer seems to be the one corresponding to the first intuition.
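To make the before-the-flip calculation concrete for the standard counterfactual mugging stated above, here is a minimal sketch. It assumes a fair coin (the formulation only says that Omega flips a coin) and that Omega's prediction always matches the chosen policy; the names `PRIZE`, `COST`, `P_HEADS` and `expected_value` are hypothetical.

```python
from fractions import Fraction

PRIZE = 10_000            # what Omega gives on heads if you would pay on tails
COST = 100                # what Omega asks for on tails
P_HEADS = Fraction(1, 2)  # assumption: the coin is fair


def expected_value(pays_when_tails: bool) -> Fraction:
    """Expected payoff of a policy, evaluated before the coin is flipped."""
    heads_payoff = PRIZE if pays_when_tails else 0
    tails_payoff = -COST if pays_when_tails else 0
    return P_HEADS * heads_payoff + (1 - P_HEADS) * tails_payoff


ev_pay = expected_value(True)
ev_refuse = expected_value(False)
print("pay when tails:", ev_pay)
print("refuse when tails:", ev_refuse)
```

Under these assumptions the paying policy has an expected value of 4950 and the refusing policy 0, which is the calculation evaluated before the outcome is known.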

Now consider instead this formulation of counterfactual PD:

Omega, a perfect predictor, tells you that it has flipped a coin, and it has come up heads. Omega asks you to pay 100 (here and now) and gives you 10000 (here and now) if you would pay in case the coin landed tails. Omega also explains that, if the coin had come up tails—but note that it hasn't—Omega would tell you such and such (symmetrical situation). What do you do?

The answer of the second intuition would be: I refuse to pay here and now, and I would have paid in case the coin had come up tails. I get 10000.

And this formulation of counterfactual PD is analogous to this formulation of counterfactual mugging, where the second intuition refuses to pay.

The answer of the second intuition would be: I refuse to pay here and now, and I would have paid in case the coin had come up tails. I get 10000.

is false/not admissible/impossible? Or are you saying something else entirely? In any case, if you could motivate your opinion, whatever that is, you would help me understand. Thanks!

It seems you are arguing for the position that I called "the first intuition" in my post. Before knowing the outcome, the best you can do is (pay, pay), because that leads to 9900.

On the other hand, as in standard counterfactual mugging, you could be asked: "You know that, this time, the coin came up tails. What do you do?". And here the second intuition applies: the DM (decision maker) can decide not to pay in this case and to pay when heads. Omega recognises the intent of the DM, and gives 10000.

Maybe you are not even considering the second intuition because you take for granted that the agent has to decide one policy "at the beginning" and stick to it, or, as you wrote, "pre-commit". One of the points of the post is that it is unclear where this assumption comes from, and what it exactly means. It's possible that my reasoning in the post was not clear, but I think that if you reread the analysis you will see the situation from both viewpoints.

If the DM knows the outcome is heads, why can't he refuse to pay in that case and decide to pay in the other case? In other words: why can't he adopt the policy (not pay when heads; pay when tails), which leads to 10000?

The fact that it is "guaranteed" utility doesn't make a significant difference: my analysis still applies. After you know the outcome, you can avoid paying in that case and get 10000 instead of 9900 (second intuition).
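The numbers 9900 and 10000 used in these comments can be checked with a small payoff table for counterfactual PD. This is only a sketch: it assumes Omega's prediction simply reads off the agent's policy, and it takes no position on whether the agent may pick that policy after learning the outcome, which is the point under discussion. The names `payoff` and `table` are hypothetical.

```python
from itertools import product

PRIZE = 10_000  # paid if Omega predicts you would pay in the other branch
COST = 100      # what Omega asks for in the actual branch


def payoff(outcome: str, pay_heads: bool, pay_tails: bool) -> int:
    """Payoff in the actual branch `outcome` for the policy (pay_heads, pay_tails)."""
    pays_now = pay_heads if outcome == "heads" else pay_tails
    pays_other = pay_tails if outcome == "heads" else pay_heads
    return (PRIZE if pays_other else 0) - (COST if pays_now else 0)


# Map each policy to its payoffs (if heads, if tails).
table = {
    (ph, pt): (payoff("heads", ph, pt), payoff("tails", ph, pt))
    for ph, pt in product([True, False], repeat=2)
}

for (ph, pt), (if_heads, if_tails) in table.items():
    print(f"pay when heads={ph}, pay when tails={pt}: "
          f"heads -> {if_heads}, tails -> {if_tails}")
```

The policy (pay, pay) yields 9900 whatever the outcome. The policy (not pay when heads; pay when tails) yields 10000 if the coin came up heads and -100 if it came up tails, so the extra 100 is available only to an agent who fixes that policy knowing the outcome is heads.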