Let’s start with the simplest coherence theorem: suppose I’ll pay to upgrade pepperoni pizza to mushroom, pay to upgrade mushroom to anchovy, and pay to upgrade anchovy to pepperoni. This does not bode well for my bank account balance. And the only way to avoid having such circular preferences is if there exists some “consistent preference ordering” of the three toppings - i.e. some ordering such that I will only pay to upgrade to a topping later in the order, never earlier. That ordering can then be specified as a utility function: a function which takes in a topping, and gives the topping’s position in the preference order, so that I will only pay to upgrade to a topping with higher utility.
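As a minimal sketch of that claim (not part of the original argument), the pizza example can be checked mechanically: treat each "I'll pay to upgrade X to Y" as an edge from X to Y, and look for an ordering consistent with every edge. The function name `utility_from_upgrades` and the choice of "position in a topological order" as the utility value are illustrative assumptions; any order-preserving numbering would do equally well.

```python
from graphlib import TopologicalSorter, CycleError  # standard library, Python 3.9+


def utility_from_upgrades(upgrades):
    """Derive an ordinal utility from paid upgrades, if one exists.

    `upgrades` is an iterable of (worse, better) pairs meaning
    "I will pay to go from `worse` to `better`".
    Returns a dict mapping topping -> position in a consistent
    preference order, or None if the preferences are circular.
    """
    sorter = TopologicalSorter()
    for worse, better in upgrades:
        # `worse` is a predecessor of `better`, so it is placed earlier.
        sorter.add(better, worse)
    try:
        order = list(sorter.static_order())
    except CycleError:
        return None  # no consistent preference ordering exists
    return {topping: rank for rank, topping in enumerate(order)}


# The circular preferences from the text: no utility function exists.
circular = [("pepperoni", "mushroom"),
            ("mushroom", "anchovy"),
            ("anchovy", "pepperoni")]

# Dropping the last upgrade leaves a consistent ordering.
consistent = [("pepperoni", "mushroom"),
              ("mushroom", "anchovy")]

no_utility = utility_from_upgrades(circular)
utility = utility_from_upgrades(consistent)
```

Here `no_utility` is `None`, while `utility` ranks pepperoni below mushroom below anchovy, so every paid upgrade moves to a strictly higher utility.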

More advanced coherence theorems remove a lot of implicit assumptions (e.g. I could learn over time, and I might just face various implicit tradeoffs in the world rather than explicit offers to trade), and add more machinery (e.g. we can incorporate uncertainty and derive expected utility maximization and Bayesian updates). But they all require something-which-works-like-money.

Money has two key properties in this argument:

• Money is additive across decisions. If I pay $1 to upgrade anchovy to pepperoni, and another $1 to upgrade pepperoni to mushroom, then I have spent $1 + $1 = $2.
• All else equal, more money is good. If I spend $3 trading anchovy -> pepperoni -> mushroom -> anchovy, then I could have just stuck with anchovy from the start and had strictly more money, which would be better.
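The two properties can be illustrated with a small bookkeeping sketch. The helper `run_trades` and the $1-per-trade prices are assumptions chosen to match the $3 cycle in the example: additivity is the running sum, and "more money is good" is the comparison against simply keeping the starting topping.

```python
def run_trades(start, trades):
    """Follow a sequence of paid trades and total the money spent.

    `trades` is a list of (from_topping, to_topping, price) tuples.
    Returns (final_topping, total_spent).
    """
    topping, spent = start, 0
    for src, dst, price in trades:
        if src != topping:
            raise ValueError(f"cannot trade away {src!r} while holding {topping!r}")
        topping = dst
        spent += price  # additivity: payments across decisions simply sum
    return topping, spent


# Hypothetical prices: $1 per upgrade, as in the example above.
cycle = [("anchovy", "pepperoni", 1),
         ("pepperoni", "mushroom", 1),
         ("mushroom", "anchovy", 1)]

final_topping, total_spent = run_trades("anchovy", cycle)
same_topping_for_free, nothing_spent = run_trades("anchovy", [])

# Same topping either way, but the trader who cycled has strictly less money.
cycling_was_wasteful = (final_topping == same_topping_for_free
                        and total_spent > nothing_spent)
```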

These are the conditions which make money a “measuring stick of utility”: more money is better (all else equal), and money adds. (Indeed, these are also the key properties of a literal measuring stick: distances measured by the stick along a straight line add, and bigger numbers indicate more distance.)

Why does this matter?

There’s a common misconception that every system can be interpreted as a utility maximizer, so coherence theorems don’t say anything interesting. After all, we can always just pick some “utility function” which is maximized by whatever the system actually does. It’s the measuring stick of utility which makes coherence theorems nontrivial: if I spend $3 trading anchovy -> pepperoni -> mushroom -> anchovy, then it implies that either (1) I don’t have a utility function over toppings (though I could still have a utility function over some other silly thing, like e.g. my history of topping-upgrades), or (2) more money is not necessarily better, given the same toppings. Sure, there are ways for that system to “maximize a utility function”, but it can’t be a utility function over toppings which is measured by our chosen measuring stick.

Another way to put it: coherence theorems assume the existence of some resources (e.g. money), and talk about systems which are pareto optimal with respect to those resources - e.g. systems which “don’t throw away money”. Implicitly, we're assuming that the system generally "wants" more resources (instrumentally, not necessarily as an end goal), and we derive the system's "preferences" over everything else (including things which are not resources) from that. The agent "prefers" X over Y if it expends resources to get from Y to X. If the agent reaches a world-state which it could have reached with strictly less resource expenditure in all possible worlds, then it's not an expected utility maximizer - it "threw away money" unnecessarily. We assume that the resources are a measuring stick of utility, and then ask whether the system maximizes any utility function over the given state-space measured by that measuring stick.
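One way to read the "threw away money" condition is as a dominance check, sketched below under assumptions that go beyond the text: each policy is summarized as a mapping from a possible world to the pair (final world-state, resources spent), and the names `is_dominated`, `cycler`, and `stay_put`, along with the two worlds, are hypothetical.

```python
def is_dominated(actual, alternative):
    """Check whether `alternative` shows that `actual` threw away resources.

    Both arguments map world -> (final_state, resources_spent).
    Returns True if, in every possible world, `alternative` reaches the
    same final state as `actual` while spending strictly less.
    """
    if actual.keys() != alternative.keys():
        raise ValueError("policies must be evaluated on the same set of worlds")
    return all(
        alternative[world][0] == actual[world][0]      # same world-state reached
        and alternative[world][1] < actual[world][1]   # strictly less spent
        for world in actual
    )


# Hypothetical worlds; in both, the cycler pays $3 to end up where it started.
cycler = {"world_a": ("anchovy", 3), "world_b": ("anchovy", 3)}
stay_put = {"world_a": ("anchovy", 0), "world_b": ("anchovy", 0)}

cycler_is_wasteful = is_dominated(cycler, stay_put)
stay_put_is_wasteful = is_dominated(stay_put, cycler)
```

Under this reading, `cycler` is dominated by `stay_put` but not the other way around, so only `stay_put` is still a candidate for Pareto optimality with respect to money.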

Ok, but what about utility functions which don’t increase with resources?

As a general rule, we don’t actually care about systems which are “utility maximizers” in some trivial sense, like the rock which “optimizes” for sitting around being a rock. These systems are not very useful to think of as optimizers. We care about things which steer some part of the world into a relatively small state-space.

To the extent that we buy instrumental convergence, using resources as a measuring stick is very sensible. There are various standard resources in our environment, like money or energy, which are instrumentally useful for a very wide variety of goals. We expect a very wide variety of optimizers to “want” those resources, in order to achieve their goals. Conversely, we intuitively expect that systems which throw away such resources will not be very effective at steering parts of the world into a relatively small state-space. They will be limited to fewer bits of optimization than systems which use those same resources Pareto optimally.

So there’s an argument to be made that we don’t particularly care about systems which “maximize utility” in some sense which isn’t well measured by resources. That said, it’s an intuitive, qualitative argument, not really a mathematical one. What would be required in order to make it into a formal argument, usable for practical quantification and engineering?

The Measuring Stick Problem

The main problem is: how do we recognize a “measuring stick of utility” in the wild, in situations where we don’t already think of something as a resource? If somebody hands me a simulation of a world with some weird physics, what program can I run on that simulation to identify all the “resources” in it? And how does that notion of “resources” let me say useful, nontrivial things about the class of utility functions for which those resources are a measuring stick? These are the sorts of questions we need to answer if we want to use coherence theorems in a physically-reductive theory of agency.

If we could answer those questions, in a way derivable from physics without any “agenty stuff” baked in a priori, then the coherence theorems would give us a nontrivial sense in which some physical systems do contain embedded agents, and other physical systems don’t. It would, presumably, allow us to bound the number of bits-of-optimization which a system can bring to bear, with more-coherent-as-measured-by-the-measuring-stick systems able to apply more bits of optimization, all else equal.
