As part of the AI Safety Camp, I've been diving a bit deeper into the foundations of expected utility theory and preference learning. In this post, I am making explicit a connection between those two things that (I assume) many people already made implicitly. But I couldn't find a nice exposition of this argument so I wrote it up. Any feedback is of course highly welcome!

**Preference utilitarianism and the Von-Neumann-Morgenstern theorem.**

At the risk of sounding drab, I briefly want to revisit some well-trodden territory in the realm of expected utility theory to motivate the rest of this post.

When we are thinking about the question of how to align advanced AI with human values, we are confronted with the question of whether we want to capture "How humans act" or "How humans *should* act". Both approaches have virtue, but if we're being ambitious^{[1]} we probably want to aim for the latter. However, our understanding of "How humans *should* act" is still rather confused and ready-made solutions to "plug into" an AGI are not available.

Rather than tackling the entire problem at once, we might focus on one particularly well-formalized portion of ethics first. A possible answer to the question "How should I act?" comes from __preference utilitarianism__ where we focus on satisfying everyone's *preferences *(as revealed by their choices). I say that this portion of ethics is well-formalized because, it __turns out__, if you are being reasonable about your preferences, we can represent them succinctly in a "utility function," . This utility function has the neat property that for two possible options, and , you *prefer* option over option iff .

This is the famous "__von-Neumann-Morgenstern theorem__" which sits at the heart of an approach of "doing good better".

Now given that the utility function lives in the realm of mathematics, there is a seemingly natural strategy to use it to steer AI. Get everyone's utility functions, combine them into a target function, and then let the AI pick the actions that increase the target function the most. If we *have* to pick a single number to maximize, there is a case to be made that utility is one of the best single numbers we can hope for.

Sounds good. Where is the catch?

**The futility of computing utility.**

Let's start by trying to write down a utility function. **The proof of the von-Neumann-Morgenstern is *** constructive*, i.e., it doesn't only guarantee the existence of a utility function, it also shows us the way to get there. Here is a step-by-step guide:

**1. Write down all the possible elementary outcomes that we might want to know the utility of.**

Ah. Yeah. That's going to be a problem.

**"All the things" is a lot of things.** We might (and will, in this post) limit ourselves to a toy domain to make some progress, but that will be a poor substitute for the thing we want: all the possible outcomes affecting all existing humans. We might not think of some outcomes because they appear too good to be true. (Or too weird to be thinkable.) We might even want to include those outcomes *in particular,* as they might be the best option that nobody realized was on the table. But if we can't think of them, we can't write them down^{[2]}.

(Perhaps there is a way out. An essential nuance in the first step is writing down "all the possible *elementary* outcomes." We don't need to consider all the outcomes immediately. We only need to consider a set from which we can construct the more complicated outcomes. We need a __basis__ of the space of possibilities. That's already infinitely easier, and we're *always guaranteed* to find a basis^{[3]}. Hopefully, we can find a basis of a system that is rich enough to describe all relevant outcomes and simple enough to allow for linear interpolation? Of course, a __semantic embedding of natural language__ comes to mind, but the imprecision of language is probably a deal-breaker. Perhaps the semantic embedding of a formal language/programming language is more appropriate^{[4]}?)

Let's use a toy domain, call this an open problem, and move on.

**2. Order all the elementary outcomes from worst to best.**

For some inexplicable reason, my intro to computer science course at uni has prepared me for __this exact task, and this task alone__. We have a bunch of great algorithms for sorting the elements of a list by comparing them with each other. If we're moderately unlucky, **we can hope to sort N-many outcomes with ** **comparisons**. After sorting thousands of elementary outcomes, we pick the worst (bumping your toe against a chair^{[5]}) and the best (ice cream sandwich) and assign them utility 0 and 1. For 1000 elementary outcomes, we could do with as little as 3000 comparisons which you could knock out in an afternoon. For 100000 elementary outcomes, we're talking 500000 comparisons which will keep you busy for a while. But it's for the survival of humankind, so perhaps it's fine! This could work!

Unless... somebody screws up. Those 3000 comparisons have to be *spot on*. **If you accidentally mess up one comparison, the sorting algorithm might not be able to recover**^{[6]}**.** And since we're working with humans, some errors are guaranteed. Can you confidently say that you prefer a mushroom risotto over a new pair of shoes? Thought so.

But now comes the real killer.

**3. Do a sequence of psychophysics experiments where humans indicate where the exact probabilistic combination of the worst- and the best possible outcome is equivalent to an intermediate outcome.**

After sorting all outcomes from worst to best, we offer you a sequence of lotteries for each intermediate outcome (mushroom risotto): "Would you rather accept a 10% chance of stubbing your toe and a 90% chance of an ice cream sandwich, or a guaranteed mushroom risotto?" At some point (45% stubbing your toe), your answer will change from "Yes" to "No," and then we know we are very close to finding your utility for mushroom risotto (in this case, a utility of ~0.45).

I was halfway through setting up __MTurk__, but let's be realistic - this will not work. __Allais paradox__ aside, I don't have the patience to set this up, so who will have the patience to go through this? Of course, we should have seen this coming. **Just because a proof is "constructive" doesn't mean that we can apply it in the messy real world.** Getting a utility function out of a human will require some elbow grease.

**Human fallibility and reward modeling.**

Let’s shuffle some symbols. How do we account for all the human messiness and the tendency to make mistakes? A standard trick is to call it "noise"^{[7]}. Given a *true* estimate of utility, , we might write the perceived utility at any given time as a random variable, , distributed around the true value,

Given two outcomes, and , the probability of assessing the utility of higher than the utility of is distributed as the difference of two Gaussians, which is again a Gaussian:

(assuming independence of and ). The __error function__ (lovingly called ) is nasty because it doesn't have a closed-form solution. But it *does* have a close relative that looks almost the same and is a lot nicer mathematically, and that has a much more pleasant-sounding name than ""...

I'll replace the with a stereotypical __logistic function__, ,

Much nicer. Even though the logistic function has largely __fallen out of favor as an activation function in machine learning__, its __grip on psychophysics is unbroken__.

Now that we have the mathematical machinery in place, we need to calibrate it to reality. A natural choice is to take the cross-entropy between our machinery, , and the *actual* comparisons provided by humans, or ,

Surprise^{[8]}! The resulting loss function is also used for the __reward modeling__ that I discussed in a previous __post__^{[9]}. The researchers who originally proposed this technique for learning human preferences say that it is similar to the __Elo rating system from chess__ and the __"Bradley & Terry model."__ I find the motivation of reward modeling as an approximation of the von-Neumann-Morgenstern theorem a lot more __Romantic__, though.

Having uncovered this connection, a natural strategy for inferring a utility function through training a neural network with comparisons of pairs of elements from the domain presents itself.

Can this work? It doesn’t involve MTurk for now, so I am happy to try!

# A natural representation of utility functions.

I found that the neural network achieves near-zero loss on the comparisons after 20k steps (panel **a**). Runtime appears to increase linearly with the number of elements^{[10]}. The resulting utility functions are monotonic and increase by an (approximately) equal amount from one item to another (panel **b**). This demonstrates that given enough time, a multilayer perceptron can sort a list.

Some might say that I used several hours of compute time on a __Google Colab__ GPU *just* to sort lists, but that would be rather uncharitable. My primary motivation for this approach (human tendency to make mistakes) bears fruits in the following experiment. When I add noise to the choice procedure (resulting in 10% “random” choices), the network is still able to recover the appropriate ordering (panel **c**). And, even more remarkable, **when I make the choice procedure loopy (i.e., **__nontransitive__**), the network can still recover a reasonable approximation of what the utility function looks like** **(panel** **d)!**

This last set of experiments is exciting because introducing loops leads to nonlinear utility functions that are squashed together in the vicinity of the loop. Intuitively, if outcomes #3 and #4 are impossible to distinguish reliably, this might indicate that their utility is indeed very similar. The exciting possibility is that step 3 of the procedure above (psychometric calibration of utilities) could automatically be satisfied when options are sufficiently similar and sometimes confused^{[11]}.

# Concluding thoughts and what’s next?

As mentioned at the beginning of this post, I suspect that this way of interpreting preference learning was already clear to some, but I hope that making the connection explicit is useful. I also find it great to see that **we can recover a consistent utility function from inconsistent preferences** (panel d of last figure). There are a lot more experiments I want to run on this toy domain to probe the limits to which preference orderings can be turned into utility functions:

- What is the largest number of elements we can sort with a given architecture? How does training time change as a function of the number of elements?
- How does the network architecture affect the resulting utility function? How do the maximum and minimum of the unnormalized utility function change?
- Which portion of possible comparisons needs to be presented (on average) to infer the utility function?
- How far can we degenerate a preference ordering until no consistent utility function can be inferred anymore?

But independent of those experiments, there are some fascinating directions that I plan to explore in a future post. Now that I have a natural way to induce utility functions, I think I can further explore the __utility monster__ and some of the classic literature on __(un-)comparability of utility functions__. I also really want to write a proper treatment of __value handshakes__, which is a topic in __dire need of exploration__.

I'd be curious to hear any additional ideas for what to try, as well as comments on how feasible this approach sounds/where it's likely to break down.

^{^}^{^}I suspect that if Goodhart creeps into the argument, this is around the place where it happens.

^{^}if we're willing to

__embrace Zorn.__^{^}Has anybody ever looked at that? What happens when I subtract Fizzbuzz from the Fibonacci sequence?

^{^}A very close "win" over

__S-risk scenarios__.^{^}And the worst-case runtime shoots up to ∞ as soon as you have the probability of errors). In this situation,

__Bogosort__might just be the most reliable solution - at least it won't terminate until it's done.^{^}Joke aside, it’s an interesting question which factors need to come together to make something “noise.” I guess the central limit theorem can help if there are many independent factors, but that’s never really the case. The more general question is under which circumstances the

__residual factors are random and when they are adversarial.__^{^}Please ignore that it's already written in the section title.

^{^}The original

__Leike paper__has an equivalent expression that does not write out the logarithm applied to the sigmoid.^{^}My money would be on or worse, of course.

^{^}I’d expect that the difference in utility, , will be proportional to the probability of confusing the order, .

There's a bit of math directly relevant to this problem: Hodge decomposition of graph flows, for the discrete case, and vector fields, for the continuous case. Basically if you have a bunch of arrows, possibly loopy, you can always decompose it into a sum of two components: a "pure cyclic" one (no sources or sinks, stuff flowing in cycles) and a "gradient" one (arising from a utility function). No neural network needed, the decomposition is unique and can be computed explicitly. See this post, and also the comments by FactorialCode and me.

Fantastic, thank you for the pointer, learned something new today! A unique and explicit representation would be very neat indeed.