I often notice that in many (not all) discussions about utility functions, one side argues "for" their relevance while the other argues "against" their usefulness, without either side saying explicitly what they mean. I don't think this is causing any deep confusions among researchers here, but I'd still like to take a stab at disambiguating some of this, if nothing else for my own sake. Here are some distinct (albeit related) ways that utility functions can come up in AI safety, in terms of what assumptions/hypotheses they give rise to:

AGI utility hypothesis: The first AGI will behave as if it is maximizing some utility function

ASI utility hypothesis: As AI capabilities improve well beyond human-level, it will behave more and more as if it is maximizing some utility function (or will have already reached that ideal earlier and stayed there)

Human utility hypothesis: Even though in some experimental contexts humans do not even seem particularly goal-directed, utility functions are often a useful model of human preferences to use in AI safety research

Coherent Extrapolated Volition (CEV) hypothesis: For a given human H, there exists some utility function V such that if H is given the appropriate time/resources for reflection, H's values would converge to V

Some points to be made:

  • The "Goals vs Utility Functions" chapter of Rohin's Value Learning sequence, and the resulting discussion focused on differing intuitions about the AGI and ASI utility hypotheses. Specifically, the main post there pointed out that seemingly anything can be trivially modeled as being a "utility maximizer" (further discussion here), whereas only some intelligent agents can be described as being "goal-directed" (as defined in this post), and the latter is a more useful concept for reasoning about AI safety.
  • AGI utility doesn't logically imply ASI utility, but I'd be surprised if anyone thinks it's very plausible for the former to be true while the latter fails. In particular, the coherence arguments and other pressures that move agents toward VNM seem to roughly scale with capabilities. A plausible stance could be that we should expect most ASIs to hew close to the VNM ideal, but these pressures aren't quite so overwhelming at the AGI level; in particular, humans are fairly goal-directed but only "partially" VNM, so the goal-directedness pressures on an AGI will likely be at this order of magnitude. Depending on takeoff speeds, we might get many years to try aligning AGIs at this level of goal-directedness, which seems less dangerous than playing sorcerer's apprentice with VNM-based AGIs at the same level of capability. (Note: I might be reifying VNM here too much, in thinking of things as having a measure of "goal-directedness" with "very goal-directed" approximating VNM. But this basic picture could be wrong in all sorts of ways.)
  • The human utility hypothesis is much more vague than the others, and seems ultimately context-dependent. To my knowledge, the main argument in its favor is the fact that most of economics is founded on it. On the other hand, behavioral economists have formulated models like prospect theory for when greater precision is required than the simplistic VNM model gives, not to mention the cases where it breaks down more drastically. I haven't seen prospect theory used in AI safety research; I'm not sure if this reflects more a) the size of the field and the fact that few researchers have had much need to explicitly model human preferences, or b) that we don't need to model humans more than superficially, since this kind of research is still at a very early theoretical stage with all sorts of real-world error terms abounding.
  • The CEV hypothesis can be strengthened, consistent with Yudkowsky's original vision, to say that every human will converge to about the same values. But the extra "values converge" assumption seems orthogonal to one's opinions about the relevance of utility functions, so I'm not including it in the above list.
  • In practice a given researcher's opinions on these tend to be correlated, so it makes sense to talk of "pro-utility" and "anti-utility" viewpoints. But I'd guess the correlation is far from perfect, and at any rate, the arguments connecting these hypotheses seem somewhat tenuous.
In particular, the coherence arguments and other pressures that move agents toward VNM seem to roughly scale with capabilities.

One nit I keep picking whenever it comes up: VNM is not really a coherence theorem. The VNM utility theorem operates from four axioms, and only two of those four are relevant to coherence. The main problem is that the axioms relevant to coherence (acyclicity and completeness) do not say anything at all about probability and the role that it plays - the "expected" part of "expected utility" does not arise from a coherence/exploitability/pareto optimality condition in the VNM formulation of utility.

The actual coherence theorems which underpin Bayesian expected utility maximization are things like Dutch book theorems, Wald's complete class theorem, the fundamental theorem of asset pricing, and probably others.
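To make "Dutch book" concrete, here is a minimal sketch (the specific prices are made up for illustration, not anything from the theorems themselves): an agent whose prices for bets on X and on not-X don't sum to 1 hands a bookie a guaranteed fee, with no statistical model of X required.

```python
def dutch_book_profit(price_if_X: float, price_if_not_X: float) -> float:
    """Guaranteed profit for a bookie facing an agent whose prices for
    the tickets "$1 if X" and "$1 if not-X" do not sum to 1.

    The agent buys each ticket at its stated price; exactly one of the
    two tickets pays out $1.  If the prices sum to more than 1, the
    bookie sells both tickets and keeps the surplus no matter how X
    turns out; no statistical model of X is needed.
    """
    total_collected = price_if_X + price_if_not_X
    payout = 1.0  # exactly one of X, not-X occurs, so exactly one ticket pays
    return total_collected - payout

# Made-up incoherent prices: $0.70 for "$1 if X", $0.50 for "$1 if not-X".
dutch_book_profit(0.70, 0.50)  # 0.2, regardless of whether X happens
```

That guaranteed, model-free profit is the thing a genuine Dutch book secures, in contrast with the statistical arbitrage discussed further down the thread.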

Why does this nitpick matter? Three reasons:

  • In my experience, most people who object to the use of utilities have only encountered VNM, and correctly point out problems with VNM which do not apply to the real coherence theorems.
  • VNM utility stipulates that agents have preferences over "lotteries" with known, objective probabilities of each outcome. The probabilities are assumed to be objectively known from the start. The Bayesian coherence theorems do not assume probabilities from the start; they derive probabilities from the coherence criteria, and those probabilities are specific to the agent.
  • Because VNM is not really a coherence theorem, I do not expect agent-like systems in the wild to be pushed toward VNM expected utility maximization. I expect them to be pushed toward Bayesian expected utility maximization.

I think you're underestimating VNM here.

only two of those four are relevant to coherence. The main problem is that the axioms relevant to coherence (acyclicity and completeness) do not say anything at all about probability

It seems to me that the independence axiom is a coherence condition, unless I misunderstand what you mean by coherence?

correctly point out problems with VNM

I'm curious what problems you have in mind, since I don't think VNM has problems that don't apply to similar coherence theorems.

VNM utility stipulates that agents have preferences over "lotteries" with known, objective probabilities of each outcome. The probabilities are assumed to be objectively known from the start. The Bayesian coherence theorems do not assume probabilities from the start; they derive probabilities from the coherence criteria, and those probabilities are specific to the agent.

One can construct lotteries with probabilities that are pretty well understood (e.g. flipping coins that we have accumulated a lot of evidence are fair), and restrict attention to lotteries only involving uncertainty coming from such sources. One may then get probabilities for other, less well-understood sources of uncertainty by comparing preferences involving such uncertainty to preferences involving easy-to-quantify uncertainty (e.g. if A is preferred to B, and you're indifferent between 60%A+40%B and "A if X, B if not-X", then you assign probability 60% to X). Perhaps not quite as philosophically satisfying as deriving probabilities from scratch, but this doesn't seem like a fatal flaw in VNM to me.
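As a minimal sketch of that elicitation procedure (the preference-oracle interface, the lottery encoding, and the toy agent are my own illustrative assumptions):

```python
def assign_probability(prefers, event_lottery, A, B, tol=1e-6):
    """Back out the probability the agent implicitly assigns to event X.

    `prefers(L1, L2)` is an oracle for the agent's strict preference
    between lotteries.  Lotteries built from the well-understood source
    are encoded as ("mix", p, A, B), meaning "A with objective
    probability p, else B"; `event_lottery` stands for "A if X, B if
    not-X".  Assumes A is preferred to B and the VNM axioms hold, so
    the indifference point pins down P(X).
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        p = (lo + hi) / 2
        if prefers(("mix", p, A, B), event_lottery):
            hi = p  # the objective mixture is already better, so P(X) < p
        else:
            lo = p  # the event lottery is at least as good, so P(X) >= p
    return (lo + hi) / 2

# Toy agent that (unbeknownst to us) believes P(X) = 0.6, with u(A)=1, u(B)=0.
def toy_prefers(L1, L2):
    def expected_utility(L):
        if isinstance(L, tuple) and L[0] == "mix":
            return L[1]  # "A w.p. p, else B" has expected utility p
        return 0.6       # "A if X, B if not-X" has expected utility P(X)
    return expected_utility(L1) > expected_utility(L2)

assign_probability(toy_prefers, "A if X, B if not-X", "A", "B")  # ~0.6
```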

I do not expect agent-like systems in the wild to be pushed toward VNM expected utility maximization. I expect them to be pushed toward Bayesian expected utility maximization.

I understood those as being synonyms. What's the difference?

I would argue that independence of irrelevant alternatives is not a real coherence criterion. It looks like one at first glance: if it's violated, then you get an Allais Paradox-type situation where someone pays to throw a switch and then pays to throw it back. The problem is, the "arbitrage" of throwing the switch back and forth hinges on the assumption that the stated probabilities are objectively correct. It's entirely possible for someone to come along who believes that throwing the switch changes the probabilities in a way that makes it a good deal. Then there's no real arbitrage, it just comes down to whose probabilities better match the outcomes.

My intuition for this not being real arbitrage comes from finance. In finance, we'd call it "statistical arbitrage": it only works if the probabilities are correct. The major lesson of the collapse of Long Term Capital Management in the '90s is that statistical arbitrage is definitely not real arbitrage. The whole point of true arbitrage is that it does not depend on your statistical model being correct.

This directly leads to the difference between VNM and Bayesian expected utility maximization. In VNM, agents have preferences over lotteries: the probabilities of each outcome are inputs to the preference function. In Bayesian expected utility maximization, the only inputs to the preference function are the choices available to the agent - figuring out the probabilities of each outcome under each choice is the agent's job.

(I do agree that we can set up situations where objectively correct probabilities are a reasonable model, e.g. in a casino, but the point of coherence theorems is to be pretty generally applicable. A theorem only relevant to casinos isn't all that interesting.)
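One way to make the type-signature difference concrete is the following sketch; the class names, toy utility, and toy world model are all my own illustrative assumptions:

```python
from typing import Callable, Dict

Outcome = str
Lottery = Dict[Outcome, float]  # outcome -> probability

class VNMAgent:
    """Preferences are over lotteries: the probabilities are an *input*,
    handed to the agent as part of the problem statement."""
    def __init__(self, utility: Callable[[Outcome], float]):
        self.utility = utility

    def evaluate(self, lottery: Lottery) -> float:
        return sum(p * self.utility(o) for o, p in lottery.items())

class BayesianAgent:
    """Preferences are over actions: the probabilities come from the
    agent's own world model, which it is the agent's job to maintain."""
    def __init__(self, utility: Callable[[Outcome], float],
                 world_model: Callable[[str], Lottery]):
        self.utility = utility
        self.world_model = world_model  # action -> subjective lottery

    def evaluate(self, action: str) -> float:
        lottery = self.world_model(action)  # the "inference layer"
        return sum(p * self.utility(o) for o, p in lottery.items())

# Toy usage: same utility function, different kinds of input.
def u(o: Outcome) -> float:
    return {"win": 1.0, "lose": 0.0}[o]

def beliefs(action: str) -> Lottery:
    return {"win": 0.8, "lose": 0.2} if action == "bet" else {"lose": 1.0}

VNMAgent(u).evaluate({"win": 0.5, "lose": 0.5})  # 0.5, probabilities supplied
BayesianAgent(u, beliefs).evaluate("bet")        # 0.8, probabilities its own
```

On this picture the Bayesian agent is an inference layer with a VNM evaluator on top, which is the decomposition taken up a few comments below; the open question there is whether anything in a real agent's internals corresponds to that decomposition.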

Ok, I see what you mean about independence of irrelevant alternatives only being a real coherence condition when the probabilities are objective (or otherwise known to be equal because they come from the same source, even if there isn't an objective way of saying what their common probability is).

But I disagree that this makes VNM only applicable to settings in which all sources of uncertainty have objectively correct probabilities. As I said in my previous comment, you only need there to exist some source of objective probabilities, and you can then use preferences over lotteries involving objective probabilities and preferences over related lotteries involving other sources of uncertainty to determine what probability the agent must assign for those other sources of uncertainty.

Re: the difference between VNM and Bayesian expected utility maximization, I take it from the word "Bayesian" that the way you're supposed to choose between actions does involve first coming up with probabilities of each outcome resulting from each action, and from "expected utility maximization", that these probabilities are to be used in exactly the way the VNM theorem says they should be. Since the VNM theorem does not make any assumptions about where the probabilities came from, these still sound essentially the same, except with Bayesian expected utility maximization being framed to emphasize that you have to get the probabilities somehow first.

Let me repeat back your argument as I understand it.

If we have a Bayesian utility maximizing agent, that's just a probabilistic inference layer with a VNM utility maximizer sitting on top of it. So our would-be arbitrageur comes along with a source of "objective" randomness, like a quantum random number generator. The arbitrageur wants to interact with the VNM layer, so it needs to design bets to which the inference layer assigns some specific probability. It does that by using the "objective" randomness source in the bet design: just incorporate that randomness in such a way that the inference layer assigns the probabilities the arbitrageur wants.

This seems correct insofar as it applies. It is a useful perspective, and not one I had thought much about before this, so thanks for bringing it in.

The main issue I still don't see resolved by this argument is the architecture question. The coherence theorems only say that an agent must act as if it performs Bayesian inference and then chooses the option with highest expected value based on those probabilities. In the agent's actual internal architecture, there need not be separate modules for inference and decision-making (a Kalman filter is one example). If we can't neatly separate the two pieces somehow, then we don't have a good way to construct lotteries with specified probabilities, so we don't have a way to treat the agent as a VNM-type agent.

This directly follows from the original main issue: VNM utility theory is built on the idea that probabilities live in the environment, not in the agent. If there's a neat separation between the agent's inference and decision modules, then we can redefine the inference module to be part of the environment, but that neat separation need not always exist.

EDIT: Also, I should point out explicitly that VNM alone doesn't tell us why we ever expect probabilities to be relevant to anything in the first place. If we already have a Bayesian expected utility maximizer with separate inference and decision modules, then we can model that as an inference layer with VNM on top, but then we don't have a theorem telling us why inference layers should magically appear in the world.

Why do we expect (approximate) expected utility maximizers to show up in the real world? That's the main question coherence theorems answer, and VNM cannot answer that question unless all of the probabilities involved are ontologically fundamental.

The human utility hypothesis is much more vague than the others, and seems ultimately context-dependent. To my knowledge, the main argument in its favor is the fact that most of economics is founded on it.

I would say, rather, that the arguments in its favor are the same ones which convinced economists.

Humans aren't well-modeled as perfect utility maximizers, but utility theory is a theory of what we can reflectively/coherently value. Economists might have been wrong to focus only on rational preferences, and have moved toward prospect theory and the like to remedy this. But it may make sense to think of alignment in these terms nonetheless.

I am not saying that it does make sense -- I'm just saying that there's a much better argument for it than "the economists did it", and I really don't think prospect theory addresses issues which are of great interest to alignment.

  • If a system is trying to align with idealized reflectively-endorsed values (similar to CEV), then one might expect such values to be coherent. The argument for this position is the combination of the various arguments for expected utility theory: VNM; money-pump arguments; the various dutch-book arguments; Savage's theorem; the Jeffrey-Bolker theorem; the complete class theorem. One can take these various arguments and judge them on their own terms (perhaps finding them lacking).
  • Arguably, you can't fully align with inconsistent preferences; if so, one might argue that there is no great loss in making a utility-theoretic approximation of human preferences: it would be impossible to perfectly satisfy inconsistent preferences anyway, so representing them by a utility function is a reasonable compromise.
  • In aligning with inconsistent preferences, the question seems to be what standards to hold a system to in attempting to do so. One might argue that the standards of utility theory are among the important ones; and thus, that the system should attempt to be consistent even if humans are inconsistent.
  • To the extent that human preferences are inconsistent, it may make more sense to treat humans as fragmented multi-agents, and combine the preferences of the sub-agents to get an overall utility function -- essentially aligning with one inconsistent human the same way one would align with many humans. This approach might be justified by Harsanyi's theorem.
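As a minimal sketch of that last option (the sub-agents, outcomes, weights, and numbers are all made up for illustration), a Harsanyi-style aggregation is just a fixed-weight linear combination of the sub-agents' utilities:

```python
from typing import Dict, List

def harsanyi_aggregate(subagent_utilities: List[Dict[str, float]],
                       weights: List[float]) -> Dict[str, float]:
    """Combine sub-agent utility functions into a single overall utility
    function as a fixed-weight linear sum, Harsanyi-style.

    Each sub-agent utility maps outcomes to utilities; the weights say
    how much each fragment of the person counts."""
    outcomes = set().union(*subagent_utilities)
    return {o: sum(w * u.get(o, 0.0)
                   for w, u in zip(weights, subagent_utilities))
            for o in outcomes}

# Made-up fragments of one person, with made-up weights.
parent    = {"kids_thrive": 10.0, "chocolate": 0.0}
indulgent = {"kids_thrive": 0.0,  "chocolate": 1.0}
harsanyi_aggregate([parent, indulgent], weights=[0.9, 0.1])
# {'kids_thrive': 9.0, 'chocolate': 0.1}
```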

On the other hand, there are no strong arguments for representing human utility via prospect theory. It holds up better in experiments than utility theory does, but not so well that we would want to make it a bedrock assumption of alignment. The various arguments for expected utility make me somewhat happy for my preferences to be represented utility-theoretically even though they are not really like this; but, there is no similar argument in favor of a prospect-theoretic representation of my preferences. Essentially, I think one should either stick to a more-or-less utility-theoretic framework, or resort to taking a much more empirical approach where human preferences are learned in all their inconsistent detail (without a background assumption such as prospect theory).

That's still a false dichotomy, but I think it is an appropriate response to many critiques of utility theory.

I don't think you're putting enough weight on what REALLY convinced economists, which was the tractability that assuming utility provides, and their enduring physics envy. (But to be fair, who wouldn't wish that their domain was as tractable as Newtonian physics ended up being?)

But yes, utility is a useful enough first approximation for humans that it's worth using as a starting point. But only as a starting point. Unfortunately, too many economists are instead busy building castles on their assumptions, without trying to work with better approximations. (Yes, prospect theory and related. But it's hard to do the math, so the micro-economic foundations of macroeconomics mostly just aren't being rebuilt.)

I certainly agree that this isn't a good reason to consider humans' inability to approximate a utility function when modeling AGI. But it's absolutely critical when discussing what we're doing to align with human "values," and figuring out what that looks like. That's why I think that far more discussion on this is needed.

Yeah, I don't 100% buy the arguments which I gave in bullet-points in my previous comment.

But I guess I would say the following:

I expect to basically not buy any descriptive theory of human preferences. It doesn't seem likely we could find a super-prospect theory which really successfully codified the sort of inconsistencies we see in human values, and then reap some benefits for AI alignment.

So it seems like what you want to do instead is make very few assumptions at all. Assume that the human can do things like answer questions, but don't expect responses to be consistent even in the most basic sense of "the same answer to the same question". Of course, this can't be the end of the story, since we need to have a criterion -- what it means to be aligned with such a human. But hopefully the criterion would also be as agnostic as possible. I don't want to rely on specific theories of human irrationality.

So, when you say you want to see more discussion of this because it is "absolutely critical", I am curious about your model of what kind of answers are possible and useful.

My current best understanding is that if we assume people have arbitrary inconsistencies, it will be impossible to do better than satisfice on different human values by creating near-pareto improvements for intra-human values. But inconsistent values don't even allow pareto improvements! Any change makes things incomparable. Given that, I think we do need a super-prospect theory that explains in a systematic way what humans do "wrong", so that we can pick what an AI should respect of human preferences, and what can be ignored.

For instance, I love my children, and I like chocolate. I'm also inconsistent in these preferences, in ways that differ between them; at a given moment of time, I'm much more likely to be upset with my kids and not want them around than I am to not want chocolate. I want the AI to respect my greater but inconsistent preference for my children over the more consistent preference for candy. I don't know how to formalize this in a way that would generalize, which seems like a problem. Similar problems exist for time preference and similar typical inconsistencies - they are either inconsistent, or at least can be exploited by an AI that has a model which doesn't think about resolving those inconsistencies.

With a super-prospect theory, I would hope we may be able to define a CEV or similar, which allows large improvements by ignoring the fact that those improvements are bad for some tiny part of my preferences. And perhaps the AI should find the needed super-prospect theory and CEV - but I am deeply unsure about the safety of doing this, or the plausibility of trying to solve it first.

(Beyond this, I think we need to expect that between-human values will differ, and we can keep things safe by insisting on near-pareto improvements: only changes that are a pareto improvement with respect to a very large portion of people, and at most relatively minor dis-improvements for the remainder. But that's a different discussion.)
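A minimal sketch of the kind of criterion I mean, assuming away the problem of actually measuring per-person value changes (the threshold numbers are placeholders, not proposals):

```python
from typing import Sequence

def is_near_pareto_improvement(deltas: Sequence[float],
                               min_fraction_unharmed: float = 0.9,
                               max_individual_loss: float = 0.01) -> bool:
    """Check whether a proposed change is a "near-pareto" improvement:
    almost everyone is unharmed or better off, and anyone who does lose
    loses at most a small amount.

    `deltas[i]` is the (assumed-comparable) change in person i's values
    under the proposal; the thresholds are placeholders."""
    n = len(deltas)
    unharmed = sum(1 for d in deltas if d >= 0)
    worst_loss = max((-d for d in deltas), default=0.0)
    return (unharmed / n >= min_fraction_unharmed
            and worst_loss <= max_individual_loss)

is_near_pareto_improvement([0.5] * 9 + [-0.005])  # True: one tiny loss
is_near_pareto_improvement([10.0] * 9 + [-0.5])   # False: that loss is too big
```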

That all seems pretty fair.

If a system is trying to align with idealized reflectively-endorsed values (similar to CEV), then one might expect such values to be coherent.

That's why I distinguished between the hypotheses of "human utility" and CEV. It is my vague understanding (and I could be wrong) that some alignment researchers see it as their task to align AGI with current humans and their values, thinking the "extrapolation" less important or that it will take care of itself, while others consider extrapolation an important part of the alignment problem. For the former group, human utility is more salient, while the latter probably cares more about the CEV hypothesis (and the arguments you list in favor of it).

Arguably, you can't fully align with inconsistent preferences

My intuitions tend to agree, but I'm also inclined to ask "why not?" e.g. suppose my preferences are absurdly cyclical, but we get AGI to imitate me perfectly (or me + faster thinking + more information); in what sense of the word is it "unaligned" with me? More generally, what is it about these other coherence conditions that prevents meaningful "alignment"? (Maybe it opens a big discursive can of worms, but I actually haven't seen this discussed on a serious level so I'm quite happy to just read references).

Essentially, I think one should either stick to a more-or-less utility-theoretic framework, or resort to taking a much more empirical approach where human preferences are learned in all their inconsistent detail (without a background assumption such as prospect theory).
That's still a false dichotomy, but I think it is an appropriate response to many critiques of utility theory.

Hadn't thought about it this way. Partially updated (but still unsure what I think).

I didn't reply to this originally, probably because I think it's all pretty reasonable.

That's why I distinguished between the hypotheses of "human utility" and CEV. It is my vague understanding (and I could be wrong) that some alignment researchers see it as their task to align AGI with current humans and their values, thinking the "extrapolation" less important or that it will take care of itself, while others consider extrapolation an important part of the alignment problem.

My thinking on this is pretty open. In some sense, everything is extrapolation (you don't exactly "currently" have preferences, because every process is expressed through time...). But OTOH there may be a strong argument for doing as little extrapolation as possible.

My intuitions tend to agree, but I'm also inclined to ask "why not?" e.g. suppose my preferences are absurdly cyclical, but we get AGI to imitate me perfectly (or me + faster thinking + more information)

Well, imitating you is not quite right. (E.g., the now-classic example introduced with the CIRL framework: you want the AI to help you make coffee, not learn to drink coffee itself.) Of course maybe it is imitating you at some level in its decision-making, like imitating your way of judging what's good.

under what sense of the word is it "unaligned" with me?

I'm thinking things like: will it disobey requests which it understands and is capable of carrying out? Will it fight you? Not to say that those things are universally wrong to do, but they could be types of alignment we're shooting for, and inconsistencies do seem to create trouble there. Presumably if we know that it might fight us, we would want to have some kind of firm statement about what kind of "better" reasoning would make it do so (e.g., it might temporarily fight us if we were severely deluded in some way, but we want pretty high standards for that).


"Arguably, you can't fully align with inconsistent preferences"
My intuitions tend to agree, but I'm also inclined to ask "why not?" e.g. suppose my preferences are absurdly cyclical, but we get AGI to imitate me perfectly (or me + faster thinking + more information); in what sense of the word is it "unaligned" with me? More generally, what is it about these other coherence conditions that prevents meaningful "alignment"? (Maybe it opens a big discursive can of worms, but I actually haven't seen this discussed on a serious level so I'm quite happy to just read references).

I've been thinking about whether you can have AGI that only aims for pareto-improvements, or a weaker formulation of that, in order to align with inconsistent values among groups of people. This is strongly based on Eric Drexler's thoughts on what he has called "pareto-topia". (I haven't gotten anywhere thinking about this because I'm spending my time on other things.)

Yeah, I think something like this is pretty important. Another reason is that humans inherently don't like to be told, top-down, that X is the optimal solution. A utilitarian AI might redistribute property forcefully, where a pareto-improving AI would seek to compensate people.

An even more stringent requirement which seems potentially sensible: only pareto-improvements which both parties understand and endorse. (I.e., there should be something like consent.) This seems very sensible with small numbers of people, but unfortunately seems infeasible for large numbers of people (given the way all actions have side-effects for many, many people).

See my other reply about pseudo-pareto improvements - but I think the "understood + endorsed" idea is really important, and worth further thought.

the discussion was whether those agents will be broadly goal-directed at all, a weaker condition than being a utility maximizer

Uh, that chapter was claiming that "being a utility maximizer" is vacuous, and therefore "goal-directed" is a stronger condition than "being a utility maximizer".

Whoops, mea culpa on that one! Deleted and changed to:

the main post there pointed out that seemingly anything can be trivially modeled as being a "utility maximizer" (further discussion here), whereas only some intelligent agents can be described as being "goal-directed" (as defined in this post), and the latter is a more useful concept for reasoning about AI safety.

It seems to me that every program behaves as if it were maximizing some utility function. You could try to restrict it by saying the utility function has to be "reasonable", but how?

  • If you say the utility function must have low complexity, that doesn't work - human values are pretty complex.
  • If you say the utility function has to be about world states, that doesn't work - human values are about entire world histories, you'd prevent suffering in the past if you could.
  • If you say the utility function has to be comprehensible to a human, that doesn't work - an AI extrapolating octopus values could give you something pretty alien.

So I'm having trouble spelling out precisely, even to myself, how AIs that satisfy the "utility hypothesis" differ from those that don't. How would you tell, looking at the AI and what it does?
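For reference, here is the trivial construction behind the vacuousness worry, as a minimal sketch (the policy/history encoding is my own illustrative assumption): score a behavior history 1 exactly when it is what the program would actually do, and 0 otherwise.

```python
from typing import Callable, Sequence, Tuple

Observation = str
Action = str
History = Tuple[Tuple[Observation, Action], ...]
Policy = Callable[[Sequence[Observation]], Action]  # the program in question

def trivial_utility(policy: Policy, history: History) -> float:
    """A utility function under which `policy` is a "utility maximizer":
    score 1 iff the history is exactly what the policy would output on
    those observations, else 0.  Every program maximizes *its own* such
    function, which is why the unrestricted claim predicts nothing."""
    observations = []
    for obs, action in history:
        observations.append(obs)
        if policy(observations) != action:
            return 0.0
    return 1.0
```

The bullet points above are candidate restrictions meant to rule out degenerate utility functions like this one.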

Utility arguments often include type errors, referring to contextual utility in one part of the argument and some sort of god's-eye, contextless utility in other parts. Sometimes the "gotcha" of the problem hinges on this.

Can you give an example?

Consider various utilitarian fixes to classic objections to it: https://www4.uwsp.edu/philosophy/dwarren/IntroBook/ValueTheory/Consequentialism/ActVsRule/FiveObjections.htm

In each case, the utilitarian wants to fix the issue by redrawing buckets around what counts as utility, what counts as actions, what counts as consequences, and the time binding/window on each of them. But this sort of ontological sidestep proves too much. If taken as a general approach, rather than just as an ad hoc fix for each individual conundrum, it becomes obvious that this doesn't specify anything about agents' actions at all, as discussed in Hoagy's post: https://www.alignmentforum.org/posts/yGuo5R9fgrrFLYWuv/when-do-utility-functions-constrain-1

Another way to see it is as a kind of motte-and-bailey issue, with domain- or goal-specific utility as the motte and the god's-eye view as the bailey.

Through this lens it becomes obvious that a lot of population ethics problems, for instance, are just restatements of the sorites paradox or other such problems with continuums. You can also run this the other way and use 'utility' to turn any conflict in mathematical intuitions into a moral puzzle.