AI ALIGNMENT FORUM
AF

Inferring utility functions from locally non-transitive preferences — AI Alignment Forum

As part of the AI Safety Camp, I've been diving a bit deeper into the foundations of expected utility theory and preference learning. In this post, I am making explicit a connection between those two things that (I assume) many people already made implicitly. But I couldn't find a nice exposition of this argument so I wrote it up. Any feedback is of course highly welcome!

Preference utilitarianism and the Von-Neumann-Morgenstern theorem.

At the risk of sounding drab, I briefly want to revisit some well-trodden territory in the realm of expected utility theory to motivate the rest of this post.

When we are thinking about the question of how to align advanced AI with human values, we are confronted with the question of whether we want to capture "How humans act" or "How humans should act". Both approaches have virtue, but if we're being ambitious^[1] we probably want to aim for the latter. However, our understanding of "How humans should act" is still rather confused and ready-made solutions to "plug into" an AGI are not available.

Rather than tackling the entire problem at once, we might focus on one particularly well-formalized portion of ethics first. A possible answer to the question "How should I act?" comes from preference utilitarianism where we focus on satisfying everyone's preferences (as revealed by their choices). I say that this portion of ethics is well-formalized because, it turns out, if you are being reasonable about your preferences, we can represent them succinctly in a "utility function," . This utility function has the neat property that for two possible options, $L$ and $M$ , you prefer option $M$ over option $L$ iff $u (M) > u (L)$ .

This is the famous "von-Neumann-Morgenstern theorem" which sits at the heart of an approach of "doing good better".

Now given that the utility function lives in the realm of mathematics, there is a seemingly natural strategy to use it to steer AI. Get everyone's utility functions, combine them into a target function, and then let the AI pick the actions that increase the target function the most. If we have to pick a single number to maximize, there is a case to be made that utility is one of the best single numbers we can hope for.

Sounds good. Where is the catch?

The futility of computing utility.

Let's start by trying to write down a utility function. The proof of the von-Neumann-Morgenstern is constructive, i.e., it doesn't only guarantee the existence of a utility function, it also shows us the way to get there. Here is a step-by-step guide:

1. Write down all the possible elementary outcomes that we might want to know the utility of.

Ah. Yeah. That's going to be a problem.

"All the things" is a lot of things. We might (and will, in this post) limit ourselves to a toy domain to make some progress, but that will be a poor substitute for the thing we want: all the possible outcomes affecting all existing humans. We might not think of some outcomes because they appear too good to be true. (Or too weird to be thinkable.) We might even want to include those outcomes in particular, as they might be the best option that nobody realized was on the table. But if we can't think of them, we can't write them down^[2].

(Perhaps there is a way out. An essential nuance in the first step is writing down "all the possible elementary outcomes." We don't need to consider all the outcomes immediately. We only need to consider a set from which we can construct the more complicated outcomes. We need a basis of the space of possibilities. That's already infinitely easier, and we're always guaranteed to find a basis^[3]. Hopefully, we can find a basis of a system that is rich enough to describe all relevant outcomes and simple enough to allow for linear interpolation? Of course, a semantic embedding of natural language comes to mind, but the imprecision of language is probably a deal-breaker. Perhaps the semantic embedding of a formal language/programming language is more appropriate^[4]?)

A toy domain containing all the possible elementary outcomes one can possibly think about.

Let's use a toy domain, call this an open problem, and move on.

2. Order all the elementary outcomes from worst to best.

For some inexplicable reason, my intro to computer science course at uni has prepared me for this exact task, and this task alone. We have a bunch of great algorithms for sorting the elements of a list by comparing them with each other. If we're moderately unlucky, we can hope to sort N-many outcomes with $N log N$ comparisons. After sorting thousands of elementary outcomes, we pick the worst (bumping your toe against a chair^[5]) and the best (ice cream sandwich) and assign them utility 0 and 1. For 1000 elementary outcomes, we could do with as little as 3000 comparisons which you could knock out in an afternoon. For 100000 elementary outcomes, we're talking 500000 comparisons which will keep you busy for a while. But it's for the survival of humankind, so perhaps it's fine! This could work!

Unless... somebody screws up. Those 3000 comparisons have to be spot on. If you accidentally mess up one comparison, the sorting algorithm might not be able to recover^[6]. And since we're working with humans, some errors are guaranteed. Can you confidently say that you prefer a mushroom risotto over a new pair of shoes? Thought so.

New shoes sound fantastic, but that mushroom risotto looks de-li-cious!

But now comes the real killer.

3. Do a sequence of psychophysics experiments where humans indicate where the exact probabilistic combination of the worst- and the best possible outcome is equivalent to an intermediate outcome.

After sorting all outcomes from worst to best, we offer you a sequence of lotteries for each intermediate outcome (mushroom risotto): "Would you rather accept a 10% chance of stubbing your toe and a 90% chance of an ice cream sandwich, or a guaranteed mushroom risotto?" At some point (45% stubbing your toe), your answer will change from "Yes" to "No," and then we know we are very close to finding your utility for mushroom risotto (in this case, a utility of ~0.45).

I was halfway through setting up MTurk, but let's be realistic - this will not work. Allais paradox aside, I don't have the patience to set this up, so who will have the patience to go through this? Of course, we should have seen this coming. Just because a proof is "constructive" doesn't mean that we can apply it in the messy real world. Getting a utility function out of a human will require some elbow grease.

Human fallibility and reward modeling.

Let’s shuffle some symbols. How do we account for all the human messiness and the tendency to make mistakes? A standard trick is to call it "noise"^[7]. Given a true estimate of utility, $u ˉ (A)$ , we might write the perceived utility at any given time as a random variable, $u (A)$ , distributed around the true value,

u (A) \sim N (¯ u (A), σ^{2})

Given two outcomes, $O_{1}$ and $O_{2}$ , the probability of assessing the utility of $O_{2}$ higher than the utility of $O_{1}$ is distributed as the difference of two Gaussians, which is again a Gaussian:

P (u (O_{2}) > u (O_{1})) = erf (¯ u (O_{2}) - ¯ u (O_{1}))

(assuming independence of $O_{2}$ and $O_{1}$ ). The error function (lovingly called $erf$ ) is nasty because it doesn't have a closed-form solution. But it does have a close relative that looks almost the same and is a lot nicer mathematically, and that has a much more pleasant-sounding name than " $erf$ "...

I'll replace the $erf$ with a stereotypical logistic function, $S$ ,

erf (¯ u (O_{2}) - ¯ u (O_{1})) \approx S (¯ u (O_{2}) - ¯ u (O_{1})) = 1 / (1 + e^{- (¯ u (O_{2}) - ¯ u (O_{1})})

Much nicer. Even though the logistic function has largely fallen out of favor as an activation function in machine learning, its grip on psychophysics is unbroken.

Now that we have the mathematical machinery in place, we need to calibrate it to reality. A natural choice is to take the cross-entropy between our machinery, $P (u (O_{2}) > u (O_{1}))$ , and the actual comparisons provided by humans, $O_{2} > O_{1}$ or $O_{1} > O_{2}$ ,

\begin{matrix} L & = - (O_{2} > O_{1}) \times log (P (u (O_{2}) > u (O_{1}))) - (O_{1} > O_{2}) \times log (P (u (O_{1}) > u (O_{2}))) = log (1 + e^{¯ u (O_{bad}) - ¯ u (O_{good})}) \end{matrix}

Surprise^[8]! The resulting loss function is also used for the reward modeling that I discussed in a previous post^[9]. The researchers who originally proposed this technique for learning human preferences say that it is similar to the Elo rating system from chess and the "Bradley & Terry model." I find the motivation of reward modeling as an approximation of the von-Neumann-Morgenstern theorem a lot more Romantic, though.

Having uncovered this connection, a natural strategy for inferring a utility function through training a neural network with comparisons of pairs of elements from the domain presents itself.

Unordered elements from the domain of “things we might care about” (a) are translated into an orthonormal basis of a high dimensional vector space (b) and transformed into a scalar output through a multilayer perception (c).

Can this work? It doesn’t involve MTurk for now, so I am happy to try!

A natural representation of utility functions.

I found that the neural network achieves near-zero loss on the comparisons after 20k steps (panel a). Runtime appears to increase linearly with the number of elements^[10]. The resulting utility functions are monotonic and increase by an (approximately) equal amount from one item to another (panel b). This demonstrates that given enough time, a multilayer perceptron can sort a list.

a. Cross-entropy loss as a function of iterations for different sizes of the toy domain (indicated by color). b. Predicted utility of different items after 250k iterations of a “rational" (i.e., a total order) teaching signal. c. Predicted utility of items after 250k iterations of a “noisy rational” (i.e., 90% chance of total order, 10% chance of inverting preference) teaching signal. d. Predicted utility of items after 250k iterations of a “loopy rational” (i.e., total order, except for elements 3 and 4, or 3 and 7, which are connected in a loop) teaching signal. In b to c, the predicted utility is normalized between 0 and 1.

Some might say that I used several hours of compute time on a Google Colab GPU just to sort lists, but that would be rather uncharitable. My primary motivation for this approach (human tendency to make mistakes) bears fruits in the following experiment. When I add noise to the choice procedure (resulting in 10% “random” choices), the network is still able to recover the appropriate ordering (panel c). And, even more remarkable, when I make the choice procedure loopy (i.e., nontransitive), the network can still recover a reasonable approximation of what the utility function looks like (panel d)!

This last set of experiments is exciting because introducing loops leads to nonlinear utility functions that are squashed together in the vicinity of the loop. Intuitively, if outcomes #3 and #4 are impossible to distinguish reliably, this might indicate that their utility is indeed very similar. The exciting possibility is that step 3 of the procedure above (psychometric calibration of utilities) could automatically be satisfied when options are sufficiently similar and sometimes confused^[11].

Concluding thoughts and what’s next?

As mentioned at the beginning of this post, I suspect that this way of interpreting preference learning was already clear to some, but I hope that making the connection explicit is useful. I also find it great to see that we can recover a consistent utility function from inconsistent preferences (panel d of last figure). There are a lot more experiments I want to run on this toy domain to probe the limits to which preference orderings can be turned into utility functions:

What is the largest number of elements we can sort with a given architecture? How does training time change as a function of the number of elements?
How does the network architecture affect the resulting utility function? How do the maximum and minimum of the unnormalized utility function change?
Which portion of possible comparisons needs to be presented (on average) to infer the utility function?

How far can we degenerate a preference ordering until no consistent utility function can be inferred anymore?

But independent of those experiments, there are some fascinating directions that I plan to explore in a future post. Now that I have a natural way to induce utility functions, I think I can further explore the utility monster and some of the classic literature on (un-)comparability of utility functions. I also really want to write a proper treatment of value handshakes, which is a topic in dire need of exploration.

I'd be curious to hear any additional ideas for what to try, as well as comments on how feasible this approach sounds/where it's likely to break down.

^{^}
Some argue that narrow solutions are not stable in any case.
^{^}
I suspect that if Goodhart creeps into the argument, this is around the place where it happens.
^{^}
if we're willing to embrace Zorn.
^{^}
Has anybody ever looked at that? What happens when I subtract Fizzbuzz from the Fibonacci sequence?
^{^}
A very close "win" over S-risk scenarios.
^{^}
And the worst-case runtime shoots up to ∞ as soon as you have the probability of errors). In this situation, Bogosort might just be the most reliable solution - at least it won't terminate until it's done.
^{^}
Joke aside, it’s an interesting question which factors need to come together to make something “noise.” I guess the central limit theorem can help if there are many independent factors, but that’s never really the case. The more general question is under which circumstances the residual factors are random and when they are adversarial.
^{^}
Please ignore that it's already written in the section title.
^{^}
The original Leike paper has an equivalent expression that does not write out the logarithm applied to the sigmoid.
^{^}
My money would be on $O (n log n)$ or worse, of course.
^{^}
I’d expect that the difference in utility, $| u ˉ (O_{1}) - u ˉ (O_{2}) |$ , will be proportional to the probability of confusing the order, $P (u (O_{1}) > u (O_{2}))$ .