This post has benefitted from discussion with Sam Eisenstat, Scott Garrabrant, Tsvi Benson-Tilsen, Daniel Demski, Daniel Kokotajlo, and Stuart Armstrong. It started out as a thought about Stuart Armstrong's research agenda.

In this post, I hope to say something about what it means for a rational agent to have preferences. The view I am putting forward is relatively new to me, but it is not very radical. It is, dare I say, a conservative view -- I hold close to Bayesian expected utility theory. However, my impression is that it differs greatly from common impressions of Bayesian expected utility theory.

I will argue against a particular view of expected utility theory -- a view which I'll call reductive utility. I do not recall seeing this view explicitly laid out and defended (except in in-person conversations). However, I expect at least a good chunk of the assumptions are commonly made.

Reductive Utility

The core tenets of reductive utility are as follows:

  • The sample space of a rational agent's beliefs is, more or less, the set of possible ways the world could be -- which is to say, the set of possible physical configurations of the universe. Hence, each world is one such configuration.
  • The preferences of a rational agent are represented by a utility function from worlds to real numbers.
  • Furthermore, the utility function should be a computable function of worlds.

Since I'm setting up the view which I'm knocking down, there is a risk I'm striking at a straw man. However, I think there are some good reasons to find the view appealing. The following subsections will expand on the three tenets, and attempt to provide some motivation for them.

If the three points seem obvious to you, you might just skip to the next section.

Worlds Are Basically Physical

What I mean here resembles the standard physical-reductionist view. However, my emphasis is on certain features of this view:

  • There is some "basic stuff" -- like like quarks or vibrating strings or what-have-you.
  • What there is to know about the world is some set of statements about this basic stuff -- particle locations and momentums, or wave-form function values, or what-have-you.
  • These special atomic statements should be logically independent from each other (though they may of course be probabilistically related), and together, fully determine the world.
  • These should (more or less) be what beliefs are about, such that we can (more or less) talk about beliefs in terms of the sample space as being the set of worlds understood in this way.

This is the so-called "view from nowhere", as Thomas Nagel puts it.

I don't intend to construe this position as ruling out certain non-physical facts which we may have beliefs about. For example, we may believe indexical facts on top of the physical facts -- there might be (1) beliefs about the universe, and (2) beliefs about where we are in the universe. Exceptions like this violate an extreme reductive view, but are still close enough to count as reductive thinking for my purposes.

Utility Is a Function of Worlds

So we've got the "basically physical" . Now we write down a utility function . In other words, utility is a random variable on our event space.

What's the big deal?

One thing this is saying is that preferences are a function of the world. Specifically, preferences need not only depend on what is observed. This is incompatible with standard RL in a way that matters.

But, in addition to saying that utility can depend on more than just observations, we are restricting utility to only depend on things that are in the world. After we consider all the information in , there cannot be any extra uncertainty about utility -- no extra "moral facts" which we may be uncertain of. If there are such moral facts, they have to be present somewhere in the universe (at least, derivable from facts about the universe).

One implication of this: if utility is about high-level entities, the utility function is responsible for deriving them from low-level stuff. For example, if the universe is made of quarks, but utility is a function of beauty, consciousness, and such, then needs to contain the beauty-detector and consciousness-detector and so on -- otherwise how can it compute utility given all the information about the world?

Utility Is Computable

Finally, and most critically for the discussion here, should be a computable function.

To clarify what I mean by this: should have some sort of representation which allows us to feed it into a Turing machine -- let's say it's an infinite bit-string which assigns true or false to each of the "atomic sentences" which describe the world. should be a computable function; that is, there should be a Turing machine which takes a rational number and takes , prints a rational number within of , and halts. (In other words, we can compute to any desired degree of approximation.)

Why should be computable?

One argument is that should be computable because the agent has to be able to use it in computations. This perspective is especially appealing if you think of as a black-box function which you can only optimize through search. If you can't evaluate , how are you supposed to use it? If exists as an actual module somewhere in the brain, how is it supposed to be implemented? (If you don't think this sounds very convincing, great!)

Requiring to be computable may also seem easy. What is there to lose? Are there preference structures we really care about being able to represent, which are fundamentally not computable?

And what would it even mean for a computable agent to have non-computable preferences?

However, the computability requirement is more restrictive than it may seem.

There is a sort of continuity implied by computability: must not depend too much on "small" differences between worlds. The computation only accesses finitely many bits of before it halts. All the rest of the bits in must not make more than difference to the value of .

This means some seemingly simple utility functions are not computable.

As an example, consider the procrastination paradox. Your task is to push a button. You get 10 utility for pushing the button. You can push it any time you like. However, if you never press the button, you get -10. On any day, you are fine with putting the button-pressing off for one more day. Yet, if you put it off forever, you lose!

We can think of as a string like 000000100.., where the "1" is the day you push the button. To compute the utility, we might look for the "1", outputting 10 if we find it.

But what about the all-zero universe, 0000000...? The program must loop forever. We can't tell we're in the all-zero universe by examining any finite number of bits. You don't know whether you will eventually push the button. (Even if the universe also gives your source code, you can't necessarily tell from that -- the logical difficulty of determining this about yourself is, of course, the original point of the procrastination paradox.)

Hence, a preference structure like this is not computable, and is not allowed according to the reductive utility doctrine.

The advocate of reductive utility might take this as a victory. The procrastination paradox has been avoided, and other paradoxes with a similar structure. (The St. Petersburg Paradox is another example.)

On the other hand, if you think this is a legitimate preference structure, dealing with such 'problematic' preferences motivates abandonment of reductive utility.

Subjective Utility: The Real Thing

We can strongly oppose all three points without leaving orthodox Bayesianism. Specifically, I'll sketch how the Jeffrey-Bolker axioms enable non-reductive utility. (The title of this section is a reference to Jeffrey's book Subjective Probability: The Real Thing.)

However, the real position I'm advocating is more grounded in logical induction rather than the Jeffrey-Bolker axioms; I'll sketch that version at the end.

The View From Somewhere

The reductive-utility view approached things from the starting-point of the universe. Beliefs are for what is real, and what is real is basically physical.

The non-reductive view starts from the standpoint of the agent. Beliefs are for things you can think about. This doesn't rule out a physicalist approach. What it does do is give high-level objects like tables and chairs an equal footing with low-level objects like quarks: both are inferred from sensory experience by the agent.

Rather than assuming an underlying set of worlds, Jeffrey-Bolker assume only a set of events. For two events and , the conjunction exists, and the disjunction , and the negations and . However, unlike in the Kolmogorov axioms, these are not assumed to be intersection, union, and complement of an underlying set of worlds.

Let me emphasize that: we need not assume there are "worlds" at all.

In philosophy, this is called situation semantics -- an alternative to the more common possible-world semantics. In mathematics, it brings to mind pointless topology.

In the Jeffrey-Bolker treatment, a world is just a maximally specific event: an event which describes everything completely. But there is no requirement that maximally-specific events exist. Perhaps any event, no matter how detailed, can be further extended by specifying some yet-unmentioned stuff. (Indeed, the Jeffrey-Bolker axioms assume this! Although, Jeffrey does not seem philosophically committed to that assumption, from what I have read.)

Thus, there need not be any "view from nowhere" -- no semantic vantage point from which we see the whole universe.

This, of course, deprives us of the objects which utility was a function of, in the reductive view.

Utility Is a Function of Events

The reductive-utility makes a distinction between utility -- the random variable itself -- and expected utility, which is the subjective estimate of the random variable which we use for making decisions.

The Jeffrey-Bolker framework does not make a distinction. Everything is a subjective preference evaluation.

A reductive-utility advocate sees the expected utility of an event as derived from the utility of the worlds within the event. They start by defining ; then, we define the expected utility of an event as -- or, more generally, the corresponding integral.

In the Jeffrey-Bolker framework, we instead define directly on events. These preferences are required to be coherent with breaking things up into sums, so = -- but we do not define one from the other.

We don't have to know how to evaluate entire worlds in order to evaluate events. All we have to know is how to evaluate events!

I find it difficult to really believe "humans have a utility function", even approximately -- but I find it much easier to believe "humans have expectations on propositions". Something like that could even be true at the neural level (although of course we would not obey the Jeffrey-Bolker axioms in our neural expectations).

Updates Are Computable

Jeffrey-Bolker doesn't say anything about computability. However, if we do want to address this sort of issue, it leaves us in a different position.

Because subjective expectation is primary, it is now more natural to require that the agent can evaluate events, without any requirement about a function on worlds. (Of course, we could do that in the Kolmogorov framework.)

Agents don't need to be able to compute the utility of a whole world. All they need to know is how to update expected utilities as they go along.

Of course, the subjective utility can't be just any way of updating as you go along. It needs to be coherent, in the sense of the Jeffrey-Bolker axioms. And, maintaining coherence can be very difficult. But it can be quite easy even in cases where the random-variable treatment of the utility function is not computable.

Let's go back to the procrastination example. In this case, to evaluate the expected utility of each action at a given time-step, the agent does not need to figure out whether it ever pushes the button. It just needs to have some probability, which it updates over time.

For example, an agent might initially assign probability to pressing the button at time , and to never pressing the button. Its probability that it would ever press the button, and thus its utility estimate, would decrease with each observed time-step in which it didn't press the button. (Of course, such an agent would press the button immediately.)

Of course, this "solution" doesn't touch on any of the tricky logical issues which the procrastination paradox was originally introduced to illustrate. This isn't meant as a solution to the procrastination paradox -- only as an illustration of how to coherently update discontinuous preferences. This simple is uncomputable by the definition of the previous section.

It also doesn't address computational tractability in a very real way, since if the prior is very complicated, computing the subjective expectations can get extremely difficult.

We can come closer to addressing logical issues and computational tractability by considering things in a logical induction framework.

Utility Is Not a Function

In a logical induction (LI) framework, the central idea becomes "update your subjective expectations in any way you like, so long as those expectations aren't (too easily) exploitable to Dutch-book." This clarifies what it means for the updates to be "coherent" -- it is somewhat more elegant than saying "... any way you like, so long as they follow the Jeffrey-Bolker axioms."

This replaces the idea of "utility function" entirely -- there isn't any need for a function any more, just a logically-uncertain-variable (LUV, in the terminology from the LI paper).

Actually, there are different ways one might want to set things up. I hope to get more technical in a later post. For now, here's some bullet points:

  • In the simple procrastination-paradox example, you push the button if you have any uncertainty at all. So things are not that interesting. But, at least we've solved the problem.
  • In more complicated examples -- where there is some real benefit to procrastinating -- a LI-based agent could totally procrastinate forever. This is because LI doesn't give any guarantee about converging to correct beliefs for uncomputable propositions like whether Turing machines halt or whether people stop procrastinating.
  • Believing you'll stop procrastinating even though you won't is perfectly coherent -- in the same way that believing in nonstandard numbers is perfectly logically consistent. Putting ourselves in the shoes of such an agent, this just means we've examined our own decision-making to the best of our ability, and have put significant probability on "we don't procrastinate forever". This kind of reasoning is necessarily fallible.
  • Yet, if a system we built were to do this, we might have strong objections. So, this can count as an alignment problem. How can we give feedback to a system to avoid this kind of mistake? I hope to work on this question in future posts.
New Comment
45 comments, sorted by Click to highlight new comments since:

In this post, the author presents a case for replacing expected utility theory with some other structure which has no explicit utility function, but only quantities that correspond to conditional expectations of utility.

To provide motivation, the author starts from what he calls the "reductive utility view", which is the thesis he sets out to overthrow. He then identifies two problems with the view.

The first problem is about the ontology in which preferences are defined. In the reductive utility view, the domain of the utility function is the set of possible universes, according to the best available understanding of physics. This is objectionable, because then the agent needs to somehow change the domain as its understanding of physics grows (the ontological crisis problem). It seems more natural to allow the agent's preferences to be specified in terms of the high-level concepts it cares about (e.g. human welfare or paperclips), not in terms of the microscopic degrees of freedom (e.g. quantum fields or strings). There are also additional complications related to the unobservability of rewards, and to "moral uncertainty".

The second problem is that the reductive utility view requires the utility function to be computable. The author considers this an overly restrictive requirement, since it rules out utility functions such as in the procrastination paradox (1 is the button is ever pushed, 0 if the button is never pushed). More generally, computable utility function have to be continuous (in the sense of the topology on the space of infinite histories which is obtained from regarding it as an infinite cartesian product over time).

The alternative suggested by the author is using the Jeffrey-Bolker framework. Alas, the author does not write down the precise mathematical definition of the framework, which I find frustrating. The linked article in the Stanford Encyclopedia of Philosophy is long and difficult, and I wish the post had a succinct distillation of the relevant part.

The gist of Jeffrey-Bolker is, there are some propositions which we can make about the world, and each such proposition is assigned a number (its "desirability"). This corresponds to the conditional expected value of the utility function, with the proposition serving as a condition. However, there need not truly be a probability space and a utility function which realizes this correspondence, instead we can work directly with the assignment of numbers to propositions (as long as it satisfies some axioms).

In my opinion, the Jeffrey-Bolker framework seems interesting, but the case presented in the post for using it is weak. To see why, let's return to our motivating problems.

The problem of ontology is a real problem, in this I agree with the author completely. However, Jeffrey-Bolker only offers some hint of a solution at best. To have a complete solution, one would need to explain in what language are propositions are constructed and how the agent updates the desirability of propositions according to observations, and then prove some properties about the resulting framework which give it prescriptive power. I think that the author believes this can be achieved using Logical Induction, but the burden of proof is not met.

Hence, Jeffrey-Bolker is not sufficient to solve the problem. Moreover, I believe it is also not necessary! Indeed, infra-Bayesian physicalism offers a solution to the ontology problem which doesn't require abandoning the concept of a utility function (although one has to replace the ordinary probabilistic expectations with infra-Bayesian expectations). That solution certainly has caveats (primarily, the monotonicity principle), but at the least it shows that utility functions are not entirely incompatible with solving the ontology problem.

On the other hand, with the problem of computability, I am not convinced by the author's motivation. Do we truly need uncomputable utility functions? I am skeptical towards inquiries which are grounded in generalization for the sake of generalization. I think it is often more useful to thoroughly understand the simplest non-trivial special case, before we can confidently assert which generalizations are possible or desirable. And it is not the case with rational agent theory that the special case of computable utility functions is so thoroughly understood.

Moreover, I am not convinced that Jeffrey-Bolker allows us handling uncomputable utility functions as easily as the authors suggests. The author's argument goes: the utility function might be uncomputable, but as long as its conditional expectations w.r.t. "valid" propositions are computable, there is no problem for rational behavior to be computable. But, how often does it happen that the utility function is uncomputable but all the relevant conditional expectations are computable?

The author suggests the following example: take the procrastination utility function and take some computable distribution over the first time when the button is pushed, plus a probability for the button to never be pushed. Then, we can compute the probability the button is pushed conditional that it wasn't pushed for the first rounds. Alright, but now let's consider a different distribution. Suppose a random Turing machine is chosen[1] at the beginning of time, and on round the button is pushed iff halts after steps. Notice that this distribution on sequences is perfectly computable[2]. But now, computing the probability that the button is pushed is impossible, since it's the (in)famous Chaitin constant.

Here too, the author seems to believe that Logical Induction should solve the procrastination paradox and issues with uncomptuable utility functions more generally, as a special case of Jeffrey-Bolker. But, so far I remain unconvinced.


  1. That is, we compose a random program for a prefix-free UTM by repeatedly flipping a fair coin, as usual in algorithmic information theory. ↩︎

  2. It's even polynomial-time sampleable. ↩︎

IIUC, you argue that for an embedded agent to have an explicit utility function, it needs to be a function of the microscopic description of the universe. This is unsatisfactory since the agent shouldn't start out knowing microscopic physics. The alternative you suggest is using the more exotic Jeffrey-Bolker approach. However, this is not how I believe embedded agency should work.

Instead, you should consider a utility function that depends on the universe described in whatever ontology the utility function is defined (which we may call "macroscopic"). Microscopic physics comes in when the agent learns a fine-grained model of the dynamics in the macroscopic ontology. In particular, this fine-grained model can involve a fine-grained state space.

The other issue discussed is utility functions of the sort exemplified by the procrastination paradox. I think that besides being uncomputable, this brings in other pathologies. For example, since the utility functions you consider are discontinuous, it is no longer guaranteed an optimal policy exists at all. Personally, I think discontinuous utility functions are strange and poorly motivated.

I don't want to make a strong argument against your position here. Your position can be seen as one example of "don't make utility a function of the microscopic".

But let's pretend for a minute that I do want to make a case for my way of thinking about it as opposed to yours.

  • Humans are not clear on what macroscopic physics we attach utility to. It is possible that we can emulate human judgement sufficiently well by learning over macroscopic-utility hypotheses (ie, partial hypotheses in your framework). But perhaps no individual hypothesis will successfully capture the way human value judgements fluidly switch between macroscopic ontologies -- perhaps human reasoning of this kind can only be accurately captured by a dynamic LI-style "trader" who reacts flexibly to an observed situation, rather than a fixed partial hypothesis. In other words, perhaps we need to capture something about how humans reason, rather than any fixed ontology (even of the flexible macroscopic kind).
  • Your way of handling macroscopic ontologies entails knightian uncertainty over the microscopic possibilities. Isn't that going to lack a lot of optimization power? EG, if humans reasoned this way using intuitive physics, we'd be afraid that any science experiment creating weird conditions might destroy the world, and try to minimize chances of those situations being set up, or something along those lines? I'm guessing you have some way to mitigate this, but I don't know how it works.

As for discontinuous utility:

For example, since the utility functions you consider are discontinuous, it is no longer guaranteed an optimal policy exists at all. Personally, I think discontinuous utility functions are strange and poorly motivated.

My main motivating force here is to capture the maximal breadth of what rational (ie coherent, ie non-exploitable) preferences can be, in order to avoid ruling out some human preferences. I have an intuition that this can ultimately help get the right learning-theoretic guarantees as opposed to hurt, but, I have not done anything to validate that intuition yet.

With respect to procrastination-like problems, optimality has to be subjective, since there is no foolproof way to tell when an agent will procrastinate forever. If humans have any preferences like this, then alignment means alignment with human subjective evaluations of this matter -- if the human (or some extrapolated human volition, like HCH) looks at the system's behavior and says "NO!! Push the button now, you fool!!" then the system is misaligned. The value-learning should account for this sort of feedback in order to avoid this. But this does not attempt to minimize loss in an objective sense -- we export that concern to the (extrapolated?) human evaluation which we are bounding loss with respect to.

With respect to the problem of no-optimal-policy, my intuition is that you try for bounded loss instead; so (as with logical induction) you are never perfect but you have some kind of mistake bound. Of course this is more difficult with utility than it is with pure epistemics.

Humans are not clear on what macroscopic physics we attach utility to. It is possible that we can emulate human judgement sufficiently well by learning over macroscopic-utility hypotheses (ie, partial hypotheses in your framework). But perhaps no individual hypothesis will successfully capture the way human value judgements fluidly switch between macroscopic ontologies...

First, it seems to me rather clear what macroscopic physics I attach utility to. If I care about people, this means my utility function comes with some model of what a "person" is (that has many free parameters), and if something falls within the parameters of this model then it's a person, and if it doesn't then it isn't a person (ofc we can also have a fuzzy boundary, which is supported in quasi-Bayesianism).

Second, what does it mean for a hypothesis to be "individual"? If we have a prior over a family of hypotheses, we can take their convex combination and get a new individual hypothesis. So I'm not sure what sort of "fluidity" you imagine that is not supported by this.

Your way of handling macroscopic ontologies entails knightian uncertainty over the microscopic possibilities. Isn't that going to lack a lot of optimization power? EG, if humans reasoned this way using intuitive physics, we'd be afraid that any science experiment creating weird conditions might destroy the world, and try to minimize chances of those situations being set up, or something along those lines?

The agent doesn't have full Knightian uncertainty over all microscopic possibilities. The prior is composed of refinements of an "ontological belief" that has this uncertainty. You can even consider a version of this formalism that is entirely Bayesian (i.e. each refinement has to be maximal), but then you lose the ability to retain an "objective" macroscopic reality in which the agent's point of view is "unspecial", because if the agent's beliefs about this reality have no Knightian uncertainty then it's inconsistent with the agent's free will (you could "avoid" this problem using an EDT or CDT agent but this would be bad for the usual reasons EDT and CDT are bad, and ofc you need Knightian uncertainty anyway because of non-realizability).

First, it seems to me rather clear what macroscopic physics I attach utility to. If I care about people, this means my utility function comes with some model of what a “person” is (that has many free parameters), and if something falls within the parameters of this model then it’s a person,

This does not strike me as the sort of thing which will be easy to write out. But there are other examples. What if humans value something like observer-independent beauty? EG, valuing beautiful things existing regardless of whether anyone observes their beauty. Then it seems pretty unclear what ontological objects it gets predicated on.

Second, what does it mean for a hypothesis to be “individual”? If we have a prior over a family of hypotheses, we can take their convex combination and get a new individual hypothesis. So I’m not sure what sort of “fluidity” you imagine that is not supported by this.

What I have in mind is complicated interactions between different ontologies. Suppose that we have one ontology -- the ontology of classical economics -- in which:

  • Utility is predicated on individuals alone.
  • Individuals always and only value their own hedons; any apparent revealed preference for something else is actually an indication that observing that thing makes the person happy, or that behaving as if they value that other thing makes them happy. (I don't know why this is part of classical economics, but it seems at least highly correlated with classical-econ views.)
  • Aggregate utility (across many individuals) can only be defined by giving an exchange rate, since utility functions of different individuals are incomparable. However, an exchange rate is implicitly determined by the market.

And we have another ontology -- the hippie ontology -- in which:

  • Energy, aka vibrations, is an essential part of social interactions and other things.
  • People and things can have good energy and bad energy.
  • People can be on the same wavelength.
  • Etc.

And suppose what we want to do is try to reconcile the value-content of these two different perspectives. This isn't going to be a mixture between two partial hypotheses. It might actually be closer to an intersection between two partial hypotheses -- since the different hypotheses largely talk about different entities. But that won't be right either. Rather, there is philosophical work to be done, figuring out how to appropriately mix the values which are represented in the two ontologies.

My intuition behind allowing preference structures which are "uncomputable" as functions of fully specified worlds is, in part, that one might continue doing this kind of philosophical work in an unbounded way -- IE there is no reason to assume there's a point at which this philosophical work is finished and you now have something which can be conveniently represented as a function of some specific set of entities. Much like logical induction never finishes and gives you a Bayesian probability function, even if it gets closer over time.

The agent doesn’t have full Knightian uncertainty over all microscopic possibilities. The prior is composed of refinements of an “ontological belief” that has this uncertainty. You can even consider a version of this formalism that is entirely Bayesian (i.e. each refinement has to be maximal),

OK, that makes sense!

but then you lose the ability to retain an “objective” macroscopic reality in which the agent’s point of view is “unspecial”, because if the agent’s beliefs about this reality have no Knightian uncertainty then it’s inconsistent with the agent’s free will (you could “avoid” this problem using an EDT or CDT agent but this would be bad for the usual reasons EDT and CDT are bad, and ofc you need Knightian uncertainty anyway because of non-realizability).

Right.

First, it seems to me rather clear what macroscopic physics I attach utility to...

This does not strike me as the sort of thing which will be easy to write out.

Of course it is not easy to write out. Humanity preferences are highly complex. By "clear" I only meant that it's clear something like this exists, not that I or anyone can write it out.

What if humans value something like observer-independent beauty? EG, valuing beautiful things existing regardless of whether anyone observes their beauty.

This seems ill-defined. What is a "thing"? What does it mean for a thing to "exist"? I can imagine valuing beautiful wild nature, by having "wild nature" be a part of the innate ontology. I can even imagine preferring certain computations to have results with certain properties. So, we can consider a preference that some kind of simplicity-prior-like computation outputs bit sequences with some complexity theoretic property we call "beauty". But if you want to go even more abstract than that, I don't know how to make sense of that ("make sense" not as "formalize" but just as "understand what you're talking about").

It would be best if you had a simple example, like a diamond maximizer, where it's more or less clear that it makes sense to speak of agents with this preference.

What I have in mind is complicated interactions between different ontologies. Suppose that we have one ontology -- the ontology of classical economics -- in which...

And we have another ontology -- the hippie ontology -- in which...

And suppose what we want to do is try to reconcile the value-content of these two different perspectives.

Why do we want to reconcile them? I think that you might be mixing two different questions here. The first question is what kind of preferences ideal "non-myopic" agents can have. About this I maintain that my framework provides a good answer, or at least a good first approximation of the answer. The second question is what kind of preferences humans can have. But humans are agents with only semi-coherent preferences, and I see no reason to believe things like reconciling classical economics with hippies should follow from any natural mathematical formalism. Instead, I think we should model humans as having preferences that change over time, and the detailed dynamics of the change is just a function the AI needs to learn, not some consequence of mathematical principles of rationality.

I certainly don't evaluate my U on quarks. Omega is not the set of worlds, it is the set of world models, and we are the ones who decide what that model should be. In "procrastination" example you intentionally picked a bad model, so it proves nothing (if the world only has one button we care about, then maybe |Omega|=2 and everything is perfectly computable).

Further on, it seems to me that if we set our model to be a list of "events" we've observed, then we get the exact thing you're talking about. Although you're imprecise and inconsistent about what an event is, how it's represented, how many there are, so I'm not sure if that's supposed to make anything more tractable.

In general, asking questions about the domain of U (and P!) is a good idea, and something that all introductions to Utility lack. But the ease with which you abandon a perfectly good formalism is concerning. LI is cool, and it doesn't use U, but that's not an argument against U, at best you can say that U was not as useful as you'd hoped.

My own take is that the domain of U is the type of P. That is, U is evaluated on possible functions P. P certainly represents everything the agent cares about in the world, and it's also already small and efficient enough to be stored and updated in the agent, so this solution creates no new problems. 
 

I agree that it makes more sense to suppose "worlds" are something closer to how the agent imagines worlds, rather than quarks. But on this view, I think it makes a lot of sense to argue that there are no maximally specific worlds -- I can always "extend" a world with an extra, new fact which I had not previously included. IE, agents never "finish" imagining worlds; more detail can always be added (even if only in separate magisteria, eg, imagining adding epiphenomenal facts). I can always conceive of the possibility of a new predicate beyond all the predicates which a specific world-model discusses.

If you buy this, then I think the Jeffrey-Bolker setup is a reasonable formalization.

If you don't buy this, my next question would be whether you really think that the sort of "world" ("world model", as you called it) which an agent attaches value to always are "closed off" (ie sperify all the facts one way or the other; do not admit further detail) -- or, perhaps, you merely want to argue that this can sometimes be the case but not always. (Because if it's sometimes the case but not always, this argues against both the traditional view where Omega is the set which the probability is a measure over & the utility function is a function of, and against the Jeffrey-Bolker picture.)

I find it implausible that the sort of "world model" which we can model humans as having-values-as-a-function-of is "closed off" -- we can appreciate ideas like atoms and quarks, adding these to our ontology, without necessarily changing other aspects of our world-model. Perhaps sometimes we can "close things off" like this -- we can consider the possibility that there "is nothing else" -- but even so, I think this is better-modeled as an additional assertion which we add to the set of propositions defining a possibility rather than modeling us as having bottomed out in an underlying set of "world" which inherently decide all propositions.

In "procrastination" example you intentionally picked a bad model, so it proves nothing (if the world only has one button we care about, then maybe |Omega|=2 and everything is perfectly computable).

You seem to be suggesting that any such example could be similarly re-written to make things nicely computable. I find this implausible. We could expand the scenario so that every "day" is represented by an n-bit string. The computable function b() looks at a "day" and tells us whether the button was pressed or not on that day. As before, we get -10 utility if the button is never pressed. But we also have some (computable) reward, r(), which is a function of a "day" and tells us how good or bad that day was. The discounted reward is such that these priorities are never more important than whether or not the button is pressed; but so long as the button is eventually pressed, we prefer to get more reward rather than less. How would you change the representation now?

More generally, do you believe that any plausible utility function on bit-strings can be re-represented as a computable function (perhaps on some other representation, rather than bit-strings)? Why would you particularly expect this to be the case?

I think in arguing that I intentionally picked a bad model, you mean that the world-model representation which I chose was totally ad-hoc and chosen specifically to make things difficult to compute, and without having the goal in mind of making things difficult to compute, someone else would have chosen something simpler like |Omega|=2. But I think there is a good reason to imagine that the agent structures its ontology around its perceptions. The agent cannot observe whether-the-button-is-ever-pressed; it can only observe, on a given day, whether the button has been pressed on that day. |Omega|=2 is too small to even represent such perceptions.


Further on, it seems to me that if we set our model to be a list of "events" we've observed, then we get the exact thing you're talking about. Although you're imprecise and inconsistent about what an event is, how it's represented, how many there are, so I'm not sure if that's supposed to make anything more tractable.

I didn't understand this part.

In general, asking questions about the domain of U (and P!) is a good idea, and something that all introductions to Utility lack. But the ease with which you abandon a perfectly good formalism is concerning. LI is cool, and it doesn't use U, but that's not an argument against U, at best you can say that U was not as useful as you'd hoped.

Jeffrey-Bolker is fairly commonly advocated amongst decision theorists in philosophy (from both sides of the CDT-EDT debate!), although as far as I'm aware it hasn't made its way into stats textbooks at any level. It can be seen as part of a broader movement in mathematics, away from set-theoretic representations and toward more algebraic representations. A related example is pointless topology -- instead of understanding a topology as a structure imposed on a set of points, the structure of "opens" (no longer "open sets") is examined in its own right. In the same way that discarding "worlds" moves the formalism closer to concepts which the agent can actually realistically manipulate, discarding "points" from topology moves the math closer to the pieces which mathematicians are actually interested in manipulating.

My own take is that the domain of U is the type of P. That is, U is evaluated on possible functions P. P certainly represents everything the agent cares about in the world, and it's also already small and efficient enough to be stored and updated in the agent, so this solution creates no new problems. 

This is an interesting alternative, which I have never seen spelled out in axiomatic foundations.

Answering out of order:

<...> then I think the Jeffrey-Bolker setup is a reasonable formalization.

Jeffrey is a reasonable formalization, it was never my point to say that it isn't. My point is only that U is also reasonable, and possibly equivalent or more general. That there is no "case against" it. Although, if you find Jeffery more elegant or comfortable, there is nothing wrong with that.

do you believe that any plausible utility function on bit-strings can be re-represented as a computable function (perhaps on some other representation, rather than bit-strings)?

I don't know what "plausible" means, but no, that sounds like a very high bar. I believe that if there is at least one U that produces an intelligent agent, then utility functions are interesting and worth considering. Of course I believe that there are many such "good" functions, but I would not claim that I can describe the set of all of them. At the same time, I don't see why any "good" utility function should be uncomputable.

I think there is a good reason to imagine that the agent structures its ontology around its perceptions. The agent cannot observe whether-the-button-is-ever-pressed; it can only observe, on a given day, whether the button has been pressed on that day. |Omega|=2 is too small to even represent such perceptions.

I agree with the first sentence, however Omega is merely the domain of U, it does not need to be the entire ontology. In this case Omega={"button has been pressed", "button has not been pressed"} and P("button has been pressed" | "I'm pressing the button")~1. Obviously, there is also no problem with extending Omega with the perceptions, all the way up to |Omega|=4, or with adding some clocks.

We could expand the scenario so that every "day" is represented by an n-bit string.

If you want to force the agent to remember the entire history of the world, then you'll run out of storage space before you need to worry about computability. A real agent would have to start forgetting days, or keep some compressed summary of that history. It seems to me that Jeffrey would "update" the daily utilities into total expected utility; in that case, U can do something similar.

I can always "extend" a world with an extra, new fact which I had not previously included. IE, agents never "finish" imagining worlds; more detail can always be added

You defined U at the very beginning, so there is no need to send these new facts to U, it doesn't care. Instead, you are describing a problem with P, and it's a hard problem, but Jeffrey also uses P, so that doesn't solve it.

>  ... set our model to be a list of "events" we've observed ...
I didn't understand this part.

If you "evaluate events", then events have some sort of bit representation in the agent, right? I don't clearly see the events in your "Updates Are Computable" example, so I can't say much and I may be confused, but I have a strong feeling that you could define U as a function on those bits, and get the same agent.

This is an interesting alternative, which I have never seen spelled out in axiomatic foundations.

The point would be to set U(p) = p("button has been pressed") and then decide to "press the button" by evaluating U(P conditioned on "I'm pressing the button") * P("I'm pressing the button" | "press the button"), where P is the agent's current belief, and p is a variable of the same type as P.

My point is only that U is also reasonable, and possibly equivalent or more general. That there is no "case against" it. 

I do agree that my post didn't do a very good job of delivering a case against utility functions, and actually only argues that there exists a plausibly-more-useful alternative to a specific view which includes utility functions as one of several elements

Utility functions definitely aren't more general.

A classical probability distribution over  with a utility function understood as a random variable can easily be converted to the Jeffrey-Bolker framework, by taking the JB algebra as the sigma-algebra, and V as the expected value of U. Technically the sigma-algebra needs to be atomless to fit JB exactly, but Zoltan Domotor (Axiomatization of Jeffrey Utilities) generalizes this considerably.

I've heard people say that there is a way to convert in the other direction, but that it requires ultrafilters (so in some sense it's very non-constructive). I haven't been able to find this construction yet or had anyone explain how it works.

So it seems to me, but I recognize that I haven't shown in detail, that the space of computable values is strictly broader in the JB framework; computable utility functions + computable probability gives us computable JB-values, but computable JB-values need not correspond to computable utility functions.

Thus, the space of minds which can be described by the two frameworks might be equivalent, but the space of minds which can be described by computations does not seem to be; the JB space, there, is larger.

I don't see why any "good" utility function should be uncomputable.

Well, the Jeffrey-Bolker kind of explanation is as follows: agents really only need to consider and manipulate the probabilities and expected values of events (ie, propositions in the agent's internal language). So it makes some sense to assume that these probabilities and expected values are computable. But this does not imply (as far as I know) that we can construct 'worlds' as maximal specifications of which propositions are true/false and then define a utility function on those worlds which is consistent with the computable expected values and have that utility function itself be computable. And indeed it seems rather plausible to me that this is not the case, even for values which otherwise seem relatively unremarkable, as illustrated by examples like the procrastination paradox.

I think there is a good reason to imagine that the agent structures its ontology around its perceptions. The agent cannot observe whether-the-button-is-ever-pressed; it can only observe, on a given day, whether the button has been pressed on that day. |Omega|=2 is too small to even represent such perceptions.

I agree with the first sentence, however Omega is merely the domain of U, it does not need to be the entire ontology. In this case Omega={"button has been pressed", "button has not been pressed"} and P("button has been pressed" | "I'm pressing the button")~1. Obviously, there is also no problem with extending Omega with the perceptions, all the way up to |Omega|=4, or with adding some clocks.

I'm not sure why you say Omega can be the domain of U but not the entire ontology. This seems to mean that we don't know how to take expected values for arbitrary events. Also it means you are no longer advocating for the model I'm arguing against, where U is a random variable.

We could expand the scenario so that every "day" is represented by an n-bit string.

If you want to force the agent to remember the entire history of the world, then you'll run out of storage space before you need to worry about computability. A real agent would have to start forgetting days, or keep some compressed summary of that history. It seems to me that Jeffrey would "update" the daily utilities into total expected utility; in that case, U can do something similar.

I agree that we can put even more stringent (and realistic) requirements on the computational power of the agent, and then both JB and random-variable treatments become implausible, in so far as those treatments involve infinitely large representations.

I still think that the Jeffreyesque representational choice of using compact event-propositions, rather than fully-specified worlds, seems more plausible with respect to such bounded agents.

You defined U at the very beginning, so there is no need to send these new facts to U, it doesn't care. Instead, you are describing a problem with P, and it's a hard problem, but Jeffrey also uses P, so that doesn't solve it.

As per my earlier comment on "Omega is merely the domain of U", I think here you're abandoning elements of the random-variable approach to U, and in fact reasoning in a more JB-esque way.

>  ... set our model to be a list of "events" we've observed ...
I didn't understand this part.

If you "evaluate events", then events have some sort of bit representation in the agent, right? I don't clearly see the events in your "Updates Are Computable" example, so I can't say much and I may be confused, but I have a strong feeling that you could define U as a function on those bits, and get the same agent.

Yeah, it seems like we're talking past each other here and would need to do more work to unpack what's going on. All I can think to say right now is this: the usual random-variable approach to defining U requires that probabilities respect countable additivity, because the event of "the button being pressed" is just the set of individual worlds where that happens (where the button gets pressed on a particular day). This is the root of the computational difficulty in the standard approach. JB doesn't require countable additivity, since it isn't a rule which agents can enforce on their beliefs by touching only finitely many of them. This harkens back to something you said earlier:

Instead, you are describing a problem with P, and it's a hard problem, but Jeffrey also uses P, so that doesn't solve it.

Which I agree with in this case, except that JB does "solve" it by explicitly relaxing that constraint.

Again, this is a way in which JB is more general, not less; JB could follow that constraint, if you like.

A classical probability distribution over  with a utility function understood as a random variable can easily be converted to the Jeffrey-Bolker framework, by taking the JB algebra as the sigma-algebra, and V as the expected value of U.

Ok, you're saying that JB is just a set of axioms, and U already satisfies those axioms. And in this construction "event" really is a subset of Omega, and "updates" are just updates of P, right? Then of course U is not more general, I had the impression that JB is a more distinct and specific thing.

Regarding the other direction, my sense is that you will have a very hard time writing down these updates, and when it works, the code will look a lot like one with an utility function. But, again, the example in "Updates Are Computable" isn't detailed enough for me to argue anything. Although now that I look at it, it does look a lot like the U(p)=1-p("never press the button").

events (ie, propositions in the agent's internal language)

I think you should include this explanation of events in the post.

construct 'worlds' as maximal specifications of which propositions are true/false

It remains totally unclear to me why you demand the world to be such a thing.

I'm not sure why you say Omega can be the domain of U but not the entire ontology.

My point is that if U has two output values, then it only needs two possible inputs. Maybe you're saying that if |dom(U)|=2, then there is no point in having |dom(P)|>2, and maybe you're right, but I feel no need to make such claims. Even if the domains are different, they are not unrelated, Omega is still in some way contained in the ontology.

I agree that we can put even more stringent (and realistic) requirements on the computational power of the agent

We could and I think we should. I have no idea why we're talking math, and not writing code for some toy agents in some toy simulation. Math has a tendency to sweep all kinds of infinite and intractable problems under the rug.

It remains totally unclear to me why you demand the world to be such a thing.

Ah, if you don't see 'worlds' as meaning any such thing, then I wonder, are we really arguing about anything at all?

I'm using 'worlds' that way in reference to the same general setup which we see in propositions-vs-models in model theory, or in  vs the -algebra in the Kolmogorov axioms, or in Kripke frames, and perhaps some other places. 

We can either start with a basic set of "worlds" (eg, ) and define our "propositions" or "events" as sets of worlds, where that proposition/event 'holds' or 'is true' or 'occurs'; or, equivalently, we could start with an algebra of propositions/events (like a -algebra) and derive worlds as maximally specific choices of which propositions are true and false (or which events hold/occur).

My point is that if U has two output values, then it only needs two possible inputs. Maybe you're saying that if |dom(U)|=2, then there is no point in having |dom(P)|>2, and maybe you're right, but I feel no need to make such claims.

Maybe I should just let you tell me what framework you are even using in the first place. There are two main alternatives to the Jeffrey-Bolker framework which I have in mind: the Savage axioms, and also the thing commonly seen in statistics textbooks where you have a probability distribution which obeys the Kolmogorov axioms and then you have random variables over that (random variables being defined as functions of type ). A utility function is then treated as a random variable.

It doesn't sound like your notion of utility function is any of those things, so I just don't know what kind of framework you have in mind.

Maybe I should just let you tell me what framework you are even using in the first place.

I'm looking at the Savage theory from your own https://plato.stanford.edu/entries/decision-theory/ and I see U(f)=∑u(f(si))P(si), so at least they have no problem with the domains (O and S) being different. Now I see the confusion is that to you Omega=S (and also O=S), but to me Omega=dom(u)=O.

Furthermore, if O={o0,o1}, then I can group the terms into u(o0)P("we're in a state where f evaluates to o0") + u(o1)P("we're in a state where f evaluates to o1"), I'm just moving all of the complexity out of EU and into P, which I assume to work by some magic (e.g. LI), that doesn't involve literally iterating over every possible S.

We can either start with a basic set of "worlds" (eg, ) and define our "propositions" or "events" as sets of worlds <...>

That's just math speak, you can define a lot of things as a lot of other things, but that doesn't mean that the agent is going to be literally iterating over infinite sets of infinite bit strings and evaluating something on each of them.

By the way, I might not see any more replies to this.

I'm looking at the Savage theory from your own https://plato.stanford.edu/entries/decision-theory/ and I see U(f)=∑u(f(si))P(si), so at least they have no problem with the domains (O and S) being different. Now I see the confusion is that to you Omega=S (and also O=S), but to me Omega=dom(u)=O.

(Just to be clear, I did not write that article.)

I think the interpretation of Savage is pretty subtle. The objects of preference ("outcomes") and objects of belief ("states") are treated as distinct sets. But how are we supposed to think about this?

  • The interpretation Savage seems to imply is that both outcomes and states are "part of the world", but the agent has somehow segregated parts of the world into matters of belief and matters of preference. But however the agent has done this, it seems to be fundamentally beyond the Savage representation; clearly within Savage, the agent cannot represent meta-beliefs about which matters are matters of belief and which are matters of preference. So this seems pretty weird. 
  • We could instead think of the objects of preference as something like "happiness levels" rather than events in the world. The idea of the representation theorem then becomes that we can peg "happiness levels" to real numbers. In this case, the picture looks more like standard utility functions; S is the domain of the function that gives us our happiness level (which can be represented by a real-valued utility). 
  • Another approach which seems somewhat common is to take the Savage representation but require that S=O. Savage's "acts" then become maps from world to world, which fits well with other theories of counterfactuals and causal interventions. 

So even within a Savage framework, it's not entirely clear that we would want the domain of the utility function to be different from the domain of the belief function.

I should also have mentioned the super-common VNM picture, where utility has to be a function of arbitrary states as well.

That's just math speak, you can define a lot of things as a lot of other things, but that doesn't mean that the agent is going to be literally iterating over infinite sets of infinite bit strings and evaluating something on each of them.

The question is, what math-speak is the best representation of the things we actually care about? 

An Orthodox Case Against Utility Functions was a shocking piece to me. Abram spends the first half of the post laying out a view he suspects people hold, but he thinks is clearly wrong, which is a perspective that approaches things "from the starting-point of the universe". I felt dread reading it, because it was a view I held at the time, and I used as a key background perspective when I discussed bayesian reasoning. The rest of the post lays out an alternative perspective that "starts from the standpoint of the agent". Instead of my beliefs being about the universe, my beliefs are about my experiences and thoughts.

I generally nod along to a lot of the 'scientific' discussion in the 21st century about how the universe works and how reasonable the whole thing is. But I don't feel I knew in-advance to expect the world around me to operate on simple mathematical principles and be so reasonable. I could've woken up in the Harry Potter universe of magic wands and spells. I know I didn't, but if I did, I think I would be able to act in it? I wouldn't constantly be falling over myself because I don't understand how 1 + 1 = 2 anymore? There's some place I'm starting from that builds up to an understanding of the universe, and doesn't sneak it in as an 'assumption'.

And this is what this new perspective does that Abram lays out in technical detail. (I don't follow it all, for instance I don't recall why it's important that the former view assumes that utility is computable.) In conclusion, this piece is a key step from the existing philosophy of agents to the philosophy of embedded agents, or at least it was for me, and it changes my background perspective on rationality. It's the only post in the early vote that I gave +9.

(This review is taken from my post Ben Pace's Controversial Picks for the 2020 Review.)

(I don't follow it all, for instance I don't recall why it's important that the former view assumes that utility is computable.)

Partly because the "reductive utility" view is made a bit more extreme than it absolutely had to be. Partly because I think it's extremely natural, in the "LessWrong circa 2014 view", to say sentences like "I don't even know what it would mean for humans to have uncomputable utility functions -- unless you think the brain is uncomputable". (I think there is, or at least was, a big overlap between the LW crowd and the set of people who like to assume things are computable.) Partly because the post was directly inspired by another alignment researcher saying words similar to those, around 2019.

Without this assumption, the core of the "reductive utility" view would be that it treats utility functions as actual functions from actual world-states to real numbers. These functions wouldn't have to be computable, but since they're a basic part of the ontology of agency, it's natural to suppose they are -- in exactly the same way it's natural to suppose that an agent's beliefs should be computable, and in a similar way to how it seems natural to suppose that physical laws should be computable.

Ah, I guess you could say that I shoved the computability assumption into the reductive view because I secretly wanted to make 3 different points:

  1. We can define beliefs directly on events, rather than needing "worlds", and this view seems more general and flexible (and closer to actual reasoning).
  2. We can define utility directly on events, rather than "worlds", too, and there seem to be similar advantages here.
  3. In particular, uncomputable utility functions seem pretty strange if you think utility is a function on worlds; but if you think it's defined as a coherent expectation on events, then it's more natural to suppose that the underlying function on worlds (that would justify the event expectations) isn't computable.

Rather than make these three points separately, I set up a false dichotomy for illustration.

Also worth highlighting that, like my post Radical Probabilism, this post is mostly communicating insights that it seems Richard Jeffrey had several decades ago. 

It seems to me that the Jeffrey-Bolker framework is a poor match for what's going on in peoples' heads when they make value judgements, compared to the VNM framework. If I think about how good the consequences of an action are, I try to think about what I expect to happen if I take that action (ie the outcome), and I think about how likely that outcome is to have various properties that I care about, since I don't know exactly what the outcome will be with certainty. This isn't to say that I literally consider probability distributions in my mind, since I typically use qualitative descriptions of probability rather than numbers in [0,1], and when I do use numbers, they are very rough, but this does seem like a sort of fuzzy, computationally limited version of a probability distribution. Similarly, my estimations of how good various outcomes are are often qualitative, rather than numerical, and again this seems like a fuzzy, computationally limited version of utility function. In order to determine the utility of the event "I take action A", I need to consider how good and how likely various consequences are, and take the expectation of the 'how good' with respect to the 'how likely'. The Jeffrey-Bolker framework seems to be asking me to pretend none of that ever happened.

Perhaps it goes without saying, but obviously, both frameworks are flexible enough to allow for most phenomena -- the question here is what is more natural in one framework or another.

My main argument is that the procrastination paradox is not natural at all in a Savage framework, as it suggests an uncomputable utility function. I think this plausibly outweighs the issue you're pointing at.

But with respect to the issue you are pointing at:

I try to think about what I expect to happen if I take that action (ie the outcome), and I think about how likely that outcome is to have various properties that I care about,

In the Savage framework, an outcome already encodes everything you care about. So the computation which seems to be suggested by Savage is to think of these maximally-specified outcomes, assigning them probability and utility, and then combining those to get expected utility. This seems to be very demanding: it requires imagining these very detailed scenarios.

Alternately, we might say (as as Savage said) that the Savage axioms apply to "small worlds" -- small scenarios which the agent abstracts from its experience, such as the decision of whether to break an egg for an omelette. These can be easily considered by the agent, if it can assign values "from outside the problem" in an appropriate way.

But then, to account for the breadth of human reasoning, it seems to me we also want an account of things like extending a small world when we find that it isn't sufficient, and coherence between different small-world frames for related decisions.

This gives a picture very much like the Jeffrey-Bolker picture, in that we don't really work with outcomes which completely specify everything we care about, but rather, work with a variety of simplified outcomes with coherence requirements between simpler and more complex views.

So overall I think it is better to have some picture where you can break things up in a more tractable way, rather than having full outcomes which you need to pass through to get values.

In the Jeffrey-Bolker framework, you can re-estimate the value of an event by breaking it up into pieces, estimating the value and probability of each piece, and combining them back together. This process could be iterated in a manner similar to dynamic programming in RL, to improve value estimates for actions -- although one needs to settle on a story about where the information originally comes from. I currently like the logical-induction-like picture where you get information coming in "somehow" (a broad variety of feedback is possible, including abstract judgements about utility which are hard to cash out in specific cases) and you try to make everything as coherent as possible in the meanwhile.

In the Savage framework, an outcome already encodes everything you care about.

Yes, but if you don't know which outcome is the true one, so you're considering a probability distribution over outcomes instead of a single outcome, then it still makes sense to speak of the probability that the true outcome has some feature. This is what I meant.

So the computation which seems to be suggested by Savage is to think of these maximally-specified outcomes, assigning them probability and utility, and then combining those to get expected utility. This seems to be very demanding: it requires imagining these very detailed scenarios.

You do not need to be able to imagine every possible outcome individually in order to think of functions on or probability distributions over the set of outcomes, any more than I need to be able to imagine each individual real number in order to understand the function or the standard normal distribution.

It seems that you're going by an analogy like Jeffrey-Bolker : VNM :: events : outcomes, which is partially right, but leaves out an important sense in which the correct analogy is Jeffrey-Bolker : VNM :: events : probability distributions, since although utility is defined on outcomes, the function that is actually evaluated is expected utility, which is defined on probability distributions (this being a distinction that does not exist in Jeffrey-Bolker, but does exist in my conception of real-world human decision making).

If I think about how good the consequences of an action are, I try to think about what I expect to happen if I take that action (ie the outcome), and I think about how likely that outcome is to have various properties that I care about, since I don't know exactly what the outcome will be with certainty... I need to consider how good and how likely various consequences are, and take the expectation of the 'how good' with respect to the 'how likely'.

I don't understand JB yet, but when I introspected just now, my experience of decision-making doesn't have any separation between beliefs and values, so I think I disagree with the above. I'll try to explain why by describing my experience. (Note: Long comment below is just saying one very simple thing. Sorry for length. There's a one-line tl;dr at the end.)

Right now I'm considering doing three different things. I can go and play a videogame that my friend suggested we play together, I can do some LW work with my colleague, or I can go play some guitar/piano. I feel like the videogame isn't very fun right now because I think the one my friend suggested not that interesting of a shared experience. I feel like the work is fun because I'm excited about publishing the results of the work, and the work itself involves a kind of cognition I enjoy. And playing piano is fun because I've been skilling up a lot lately and I'm going to do accompany some of my housemates in some hamilton songs.

Now, I know some likely ways that what seems valuable to me might change. There are other videogames I've played lately that have been really fascinating and rewarding to play together, that involve problem solving where 2 people can be creative together. I can imagine the work turning out to not actuallybe the fun part but the boring parts. I can imagine that I've found no traction (skill-up) in playing piano, or that we're going to use a recorded soundtrack rather than my playing for the songs we're learning.

All of these to me feel like updates in my understanding of what events are reachable to me; this doesn't feel like changing my utility evaluation of the events. The event of "play videogame while friend watches bored" could change to "play videogame while creatively problem-solving with friend". The event of "gain skill in piano and then later perform songs well with friends" could change to "struggle to do something difficult and sound bad and that's it".

If I think about changing my utility function, I expect that would feel more like... well, I'm not sure. My straw version is "I creatively solve problems with my friend on a videogame, but somehow that's objectively bad so I will not do it". That's where some variable in the utility function changed while all the rest of the facts about my psychology and reality stay the same. This doesn't feel to me like my regular experience of decision-making.

But, maybe that's not the idea. The idea is like if I had some neurological change, perhaps I become more of a sociopath and stop feeling empathy and everyone just feels like objects to me rather than alive. Then a bunch of the social experiences above would change, they'd lose any experience of things like vicarious enjoyment and pleasure of bonding with friends. Perhaps that's what VNM is talking about in my experience.

I think that some of the standard "updates to my ethics / utility function" ideas that people discuss often don't feel like this to me. Like, some people say that reflecting onf population ethics leads them to change their utility function and start to care about the far future. That's not my experience – for me it's been things like the times in HPMOR when Harry thinks about civilizations of the future, what they'll be like/think, and how awesome they can be. It feels real to me, like a reachable state, and this is what has changed a lot of my behaviour, in contrast with changing some variable in a function of world-states that's independent from my understanding of what events are achievable.

To be clear, sometimes I describe my experience more like the sociopath example, where my fundamental interests/values change. I say things like "I don't enjoy videogames as much as I used to" or "These days I value honesty and reliability a lot more than politeness", and there is a sense there where I now experience the same events very differently. "I had a positive meeting with John" might now be "I feel like he was being evasive about the topic we were discussing". The things that are salient to me change. And I think that the language of "my values have changed" is often an effective one for communicating that – even if my experience does not match beliefs|utility, any sufficiently coherent agent can be described this way, and it is often easy to help others model me by describing my values as having changed.

But I think my internal experience is more that I made substantial updates about what events I'm moving towards, and the event "We had a pleasant interaction which will lead to use working effectively together" has changed to "We were not able to say the possibly unwelcome facts of the matter, which will lead to a world where we don't work effectively together". So internally it feels like an update about what events are reachable, even though someone from the outside who doesn't understand my internal experience might more naturally say "It seems like Ben is treating the same event differently now, so I'll model him as having changed his values".

tl;dr: While I often talk separately about what actions I/you/we could take and how valuable those actions are are, internally when when I'm 'evaluating' the actions, I'm just trying to visualise what they are, and there is no second step of running my utility function on those visualisations.

As I say, I'm not sure I understand JB, so perhaps this is also inconsistent with it. I just read your comment and noticed it didn't match my own introspective experience, so I thought I'd share my experience.

I agree that the considerations you mentioned in your example are not changes in values, and didn't mean to imply that that sort of thing is a change in values. Instead, I just meant that such shifts in expectations are changes in probability distributions, rather than changes in events, since I think of such things in terms of how likely each of the possible outcomes are, rather than just which outcomes are possible and which are ruled out.

Ah, I see, that makes sense.

2 points about how I think about this that differs significantly. (I just read up on Bolker and Jeffrey, as I was previously unfamiliar.) I had been thinking about writing this up more fully, but have been busy. (i.e. if people think it's worthwhile, tell me and I will be more likely do so.)

First, utility is only ever computed over models of reality, not over reality itself, because it is a part of the decision making process, not directly about any self-monitoring or feedback process. It is never really evaluated against reality, nor does it need to be. Evidence for this in humans is that people suck at actually noticing how they feel, what they like, etc. The updating of their world model is a process that happens alongside planning and decision making, and is only sometimes actively a target of maximizing utility because people's model can include correspondence with reality as a goal. Many people simply don't do this, or care about map/reality correspondence. They are very unlikely to read or respond to posts here, but any model of humans should account for their existence, and the likely claim that their brains work the same way other people's brains do.

Second, Jeffrey's "News Value" is how he fits in a relationship between utility and reality. As mentioned, for many people their map barely corresponds to the territory, and they don't seem to suffer much. (Well, unless an external event imposes itself on them in a way that affects them in the present. And even then, how often do they update their model?) So I don't think Jeffrey is right. Instead, I don't think an agent could be said to "have" utility at all - utility maximization is a process, never an evaluated goal. The only reason reality matters is because it provides feedback to the model over which people evaluate utility, not because utility is lost or gained. I think this also partly explains happiness set points - as a point of noticing reality, humans are motivated by anticipated reward more than reward. I think the model I propose makes this obvious, instead of surprising.

Planned summary for the Alignment Newsletter:

How might we theoretically ground utility functions? One approach could be to view the possible environments as a set of universe histories (e.g. a list of the positions of all quarks, etc. at all times), and a utility function as a function that maps these universe histories to real numbers. We might want this utility function to be computable, but this eliminates some plausible preferences we might want to represent. For example, in the procrastination paradox, the subject prefers to push the button as late as possible, but disprefers never pressing the button. If the history is infinitely long, no computable function can know for sure that the button was never pressed: it's always possible that it was pressed at some later day.
Instead, we could use _subjective utility functions_, which are defined over _events_, which is basically anything you can think about (i.e. it could be chairs and tables, or quarks and strings). This allows us to have utility functions over high level concepts. In the previous example, we can define an event "never presses the button", and reason about that event atomically, sidestepping the issues of computability.
We could go further and view _probabilities_ as subjective (as in the Jeffrey-Bolkor axioms), and only require that our beliefs are updated in such a way that we cannot be Dutch-booked. This is the perspective taken in logical induction.

I've curated this. This seems to me like an important conceptual step in understanding agency, the subjective view is very interesting and surprising to me. This has been written up very clearly and well, I expect people to link back to this post quite a lot, and I'm really excited to read more posts on this. Thanks a lot Abram.

I think that computable is obviously too strong a condition for classical utility; enumerable is better.

Imagine you're about to see the source code of a machine that's running, and if the machine eventually halts then 2 utilons will be generated. That's a simpler problem to reason about than the procrastination paradox, and your utility function is enumerable but not computable. (Likewise, logical inductors obviously don't make PA approximately computable, but their properties are what you'd want the definition of approximately enumerable to be, if any such definition were standard.)

I suspect that the procrastination paradox leans heavily on the computability requirement as well.

I'm not sure what it would mean for a real-valued function to be enumerable. You could call a function