Moral uncertainty

Edited by Eliezer Yudkowsky et al., last updated 19th Feb 2025

"Moral uncertainty" in the context of AI refers to an agent with an "uncertain utility function". That is, we can view the agent as pursuing a that takes on different values in different subsets of possible worlds.

For example, an agent might have a meta-utility function saying that eating cake has a utility of 8 in worlds where Lee Harvey Oswald shot John F. Kennedy, and a utility of 10 in worlds where it was the other way around. This agent will be motivated to inquire into political history to find out which utility function is probably the 'correct' one (relative to this meta-utility function), though it will never be absolutely sure.
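Continuing the sketch above (using the article's hypothetical numbers, with credences chosen purely for illustration), inquiring into political history amounts to updating p_world; stronger evidence moves the agent toward acting on one candidate utility function, but its credence never reaches exactly 1:

```python
# Cake example: utility 8 if Oswald shot Kennedy, utility 10 if it was the
# other way around (all other outcomes scored 0 for simplicity).
candidates = {
    "oswald_shot_kennedy": lambda outcome: 8.0 if outcome == "eat_cake" else 0.0,
    "other_way_around":    lambda outcome: 10.0 if outcome == "eat_cake" else 0.0,
}

p_world = {"oswald_shot_kennedy": 0.5, "other_way_around": 0.5}
print(meta_utility("eat_cake", candidates, p_world))  # 9.0 before any inquiry

# A (hypothetical) decisive history lesson shifts the credences, and with them
# the value the agent assigns to eating cake, but never to full certainty.
p_world = {"oswald_shot_kennedy": 0.99, "other_way_around": 0.01}
print(meta_utility("eat_cake", candidates, p_world))  # 8.02 after strong evidence
```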

Moral uncertainty must be resolvable by some conceivable observation in order to function as uncertainty. Suppose, for example, that an agent's probability distribution ΔU over the 'true' utility function U asserts a dependency on a fair quantum coin that was flipped inside a sealed box, which was then destroyed by explosives: the 'true' utility function is U1 in worlds where the coin came up heads, and U2 in worlds where it came up tails. If the agent thinks it has no way of ever figuring out what happened inside the box, it will thereafter behave as if it had a single, constant, certain utility function equal to 0.5⋅U1 + 0.5⋅U2.
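The same sketch shows this collapse: if no observation can ever move the agent's credences off 0.5/0.5, then the meta-utility of every outcome is just the fixed mixture 0.5⋅U1 + 0.5⋅U2, so the agent acts exactly as if that mixture were its single, certain utility function (U1, U2, and the outcome names below are arbitrary stand-ins):

```python
# Sealed-box case: the credence over heads/tails is fixed forever, so the
# meta-utility of every outcome equals the constant mixture 0.5*U1 + 0.5*U2.
U1 = lambda outcome: 1.0 if outcome == "press_button_A" else 0.0  # if coin was heads
U2 = lambda outcome: 1.0 if outcome == "press_button_B" else 0.0  # if coin was tails
p_coin = {"heads": 0.5, "tails": 0.5}  # unresolvable: the box was destroyed

for outcome in ("press_button_A", "press_button_B", "do_nothing"):
    mixed = 0.5 * U1(outcome) + 0.5 * U2(outcome)
    assert meta_utility(outcome, {"heads": U1, "tails": U2}, p_coin) == mixed
```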

Posts tagged Moral uncertainty

- Normativity (Abram Demski, 24 points, 9 comments, 5y)
- 2018 AI Alignment Literature Review and Charity Comparison (Larks, 40 points, 4 comments, 7y)
- 2019 AI Alignment Literature Review and Charity Comparison (Larks, 39 points, 8 comments, 6y)
- AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah (Lucas Perry, 24 points, 21 comments, 5y)
- AXRP Episode 3 - Negotiable Reinforcement Learning with Andrew Critch (DanielFilan, 11 points, 0 comments, 5y)
- For the past, in some ways only, we are moral degenerates (Stuart Armstrong, 6 points, 8 comments, 6y)
- Updated Deference is not a strong argument against the utility uncertainty approach to alignment (Ivan Vendrov, 13 points, 2 comments, 3y)
- Morally underdefined situations can be deadly (Stuart Armstrong, 8 points, 0 comments, 4y)
- RFC: Meta-ethical uncertainty in AGI alignment (Gordon Seidoh Worley, 4 points, 0 comments, 7y)