Utility Maximization = Description Length Minimization

johnswentworth

Basic Foundations for Agent Models

7 min read, 18th Feb 2021, 17 comments

Information Theory, Optimization, Utility Functions, AI, Rationality

Curated

There’s a useful intuitive notion of “optimization” as pushing the world into a small set of states, starting from any of a large number of states. Visually:

Yudkowsky and Flint both have notable formalizations of this “optimization as compression” idea.

This post presents a formalization of optimization-as-compression grounded in information theory. Specifically: to “optimize” a system is to reduce the number of bits required to represent the system state using a particular encoding. In other words, “optimizing” a system means making it compressible (in the information-theoretic sense) by a particular model.

This formalization turns out to be equivalent to expected utility maximization, and allows us to interpret any expected utility maximizer as “trying to make the world look like a particular model”.

Conceptual Example: Building A House

Before diving into the formalism, we’ll walk through a conceptual example, taken directly from Flint’s Ground of Optimization: building a house. Here’s Flint’s diagram:

The key idea here is that there’s a wide variety of initial states (piles of lumber, etc) which all end up in the same target configuration set (finished house). The “perturbation” indicates that the initial state could change to some other state - e.g. someone could move all the lumber ten feet to the left - and we’d still end up with the house.

In terms of information-theoretic compression: we could imagine a model which says there is probably a house. Efficiently encoding samples from this model will mean using shorter bit-strings for world-states with a house, and longer bit-strings for world-states without a house. World-states with piles of lumber will therefore generally require more bits than world-states with a house. By turning the piles of lumber into a house, we reduce the number of bits required to represent the world-state using this particular encoding/model.

If that seems kind of trivial and obvious, then you’ve probably understood the idea; later sections will talk about how it ties into other things. If not, then the next section is probably for you.

Background Concepts From Information Theory

The basic motivating idea of information theory is that we can represent information using fewer bits, on average, if we use shorter representations for states which occur more often. For instance, Morse code uses only a single bit (“.”) to represent the letter “e”, but four bits (“- - . -”) to represent “q”. This creates a strong connection between probabilistic models/distributions and optimal codes: a code which requires minimal average bits for one distribution (e.g. with lots of e’s and few q’s) will not be optimal for another distribution (e.g. with few e’s and lots of q’s).

For any random variable $X$ generated by a probabilistic model $M$ , we can compute the minimum average number of bits required to represent $X$ . This is Shannon’s famous entropy formula

$- \sum_{X} P [X | M] \log P [X | M]$

Assuming we’re using an optimal encoding for model $M$ , the number of bits used to encode a particular value $x$ is $- \log P [X = x | M]$ . (Note that this is sometimes not an integer! Today we have algorithms which encode many samples at once, potentially even from different models/distributions, to achieve asymptotically minimal bit-usage. The “rounding error” only happens once for the whole collection of samples, so as the number of samples grows, the rounding error per sample goes to zero.)
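As a concrete check, here is a minimal Python sketch of ideal code lengths $- \log_2 P$ and the resulting entropy. The four-symbol distribution is made up for illustration (it is not real letter-frequency data), and `code_length_bits` and `entropy_bits` are hypothetical helper names.

```python
import math

# Hypothetical toy model M over four symbols (probabilities are made up).
model = {"e": 0.5, "t": 0.25, "a": 0.125, "q": 0.125}

def code_length_bits(p):
    """Ideal code length, in bits, for an outcome of probability p."""
    return -math.log2(p)

def entropy_bits(dist):
    """Shannon entropy: the average ideal code length under the model itself."""
    return sum(p * code_length_bits(p) for p in dist.values() if p > 0)

# Frequent symbols get short codes, rare symbols get long ones.
lengths = {x: code_length_bits(p) for x, p in model.items()}
print(lengths)
print(entropy_bits(model))
```

With probabilities that are powers of 1/2, every ideal length comes out as a whole number of bits; for other distributions the lengths are fractional, which is the "rounding" issue mentioned above.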

Of course, we could be wrong about the distribution - we could use a code optimized for a model $M_{2}$ which is different from the “true” model $M_{1}$ . In this case, the average number of bits used will be

$- \sum_{X} P [X | M_{1}] \log P [X | M_{2}] = E [- \log P [X | M_{2}] | M_{1}]$
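To see the cost of a mismatched code, here is a small Python sketch that computes the average bits per sample when data comes from one model but the code is optimal for another. Both two-symbol distributions are hypothetical, and `expected_bits` is an illustrative helper name.

```python
import math

def expected_bits(true_dist, code_dist):
    """Average bits per sample when data follows true_dist
    but the code is optimal for code_dist."""
    return -sum(p * math.log2(code_dist[x]) for x, p in true_dist.items() if p > 0)

m1 = {"e": 0.7, "q": 0.3}   # hypothetical "true" model: lots of e's, few q's
m2 = {"e": 0.1, "q": 0.9}   # hypothetical mismatched model: few e's, lots of q's

matched = expected_bits(m1, m1)      # code optimal for the true model (the entropy)
mismatched = expected_bits(m1, m2)   # code optimal for the wrong model

print(matched, mismatched)
```

The mismatched code can never beat the matched one on average; the gap between the two numbers is what the later sections exploit.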

In this post, we’ll use a “wrong” model $M_{2}$ intentionally - not because we believe it will yield short encodings, but because we want to push the world into states with short $M_{2}$ -encodings. The model $M_{2}$ serves a role analogous to a utility function. Indeed, we’ll see later on that every model $M_{2}$ is equivalent to a utility function, and vice-versa.

Formal Statement

Here are the variables involved in “optimization”:

World-state random variables $X$
Parameters $θ, θ^{'}$ which will be optimized
Probabilistic world-model $M_{1} (θ)$ representing the distribution of $X$
Probabilistic world-model $M_{2}$ representing the encoding in which we wish to make $X$ more compressible

An “optimizer” takes in some parameter-values $θ$ , and returns new parameter-values $θ^{'}$ such that

$E [- \log P [X | M_{2}] | M_{1} (θ^{'})] \leq E [- \log P [X | M_{2}] | M_{1} (θ)]$

… with equality if-and-only-if $θ$ already achieves the smallest possible value. In English: we choose $θ^{'}$ to reduce the average number of bits required to encode a sample from $M_{1} (θ^{'})$ , using a code optimal for $M_{2}$ . This is essentially just our formula from the previous section for the number of bits used to encode a sample from $M_{1}$ using a code optimal for $M_{2}$ .

Other than the information-theory parts, the main thing to emphasize is that we’re mapping one parameter-value $θ$ to a “more optimal” parameter-value $θ^{'}$ . This should work for many different “initial” $θ$ -values, implying a kind of robustness to changes in $θ$ . (This is roughly the same concept which Flint captured by talking about “perturbations” to the system-state.) In the context of iterative optimizers, our definition corresponds to one step of optimization; we could of course feed $θ^{'}$ back into the optimizer and repeat. We could even do this without having any distinguished “optimizer” subsystem - e.g. we might just have some dynamical system in which $θ$ is a function of time, and successive values of $θ_{t}$ satisfy the inequality condition.
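Here is a toy Python sketch of an iterated optimizer in this sense. Everything in it is an assumption made for illustration: the two-state world (`house` / `lumber`), the target model `M2`, the parameterization `m1(theta)`, and the fixed-step update rule. The point is only that each step satisfies the inequality, whatever the starting $θ$.

```python
import math

M2 = {"house": 0.9, "lumber": 0.1}  # hypothetical target model

def m1(theta):
    """Hypothetical world-model: theta is the probability that a house gets built."""
    return {"house": theta, "lumber": 1.0 - theta}

def expected_bits(theta):
    """E[-log2 P[X | M2] | M1(theta)]."""
    return -sum(p * math.log2(M2[x]) for x, p in m1(theta).items() if p > 0)

def optimizer_step(theta, step=0.1):
    """One optimization step: nudge theta toward the cheaper-to-encode outcome."""
    return min(1.0, theta + step)

def run(theta, max_steps=100):
    """Feed theta' back into the optimizer until it stops changing."""
    trajectory = [theta]
    for _ in range(max_steps):
        new_theta = optimizer_step(theta)
        if new_theta == theta:
            break
        theta = new_theta
        trajectory.append(theta)
    return trajectory

# Many different initial theta-values all end up at the same optimum.
for start in (0.0, 0.2, 0.75):
    path = run(start)
    print(start, "->", path[-1], round(expected_bits(path[-1]), 4))
```

Because the expected bit count is linear and decreasing in `theta` for this particular `M2`, any step that raises `theta` reduces it, and the step leaves `theta` unchanged only at the minimum.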

Finally, note that our model $M_{1}$ is a function of $θ$ . This form is general enough to encompass all the usual decision theories. For instance, under EDT, $M_{1} (θ)$ would be some base model $M$ conditioned on the data $θ$ . Under CDT, $M_{1} (θ)$ would instead be a causal intervention on a base model $M$ , i.e. $M_{1} (θ) = \mathrm{do} (M, Θ = θ)$ .

Equivalence to Expected Utility Optimization

Minimizing our expression $E [- \log P [X | M_{2}] | M_{1} (θ)]$ is obviously equivalent to maximizing an expected utility: just set $u (X) = \log P [X | M_{2}]$ . The slightly more interesting claim is that we can always go the other way: for any utility function $u (X)$ , there is a corresponding model $M_{2}$ , such that maximizing expected utility $u (X)$ is equivalent to minimizing expected bits to encode $X$ using $M_{2}$ .

The main trick here is that we can always add a constant to $u (X)$ , or multiply $u (X)$ by a positive constant, and it will still “be the same utility” - i.e. an agent with the new utility will always make the same choices as the old. So, we set

$α u (X) + β = \log P [X | M_{2}] \implies P [X | M_{2}] = e^{β} e^{α u (X)}$

… and look for $α, β$ which give us a valid probability distribution (i.e. all probabilities are nonnegative and sum to 1).

Since everything is in an exponent, all our probabilities will be nonnegative for any $α, β$ , so that constraint is trivially satisfied. To make the distribution sum to one, we simply set $β = - \ln \sum_{X} e^{α u (X)}$ . So, not only can we find a model $M_{2}$ for any $u (X)$ , we actually find a whole family of them - one for each $α > 0$ .

(This also reveals a degree of freedom in our original definition: we can always create a new model $M_{2}^{'}$ with $P [X | M_{2}^{'}] = \frac{1}{Z} P [X | M_{2}]^{α}$ without changing the behavior.)
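The construction can be sketched in Python as follows. The utilities, the three candidate choices, and the value of $α$ are all hypothetical; `model_from_utility` just implements $P [X | M_{2}] = e^{α u (X)} / Z$ , and the code uses natural logs, so costs are in nats rather than bits.

```python
import math

def model_from_utility(utility, alpha=1.0):
    """Build P[X | M2] = exp(alpha * u(X)) / Z from a utility function."""
    weights = {x: math.exp(alpha * u) for x, u in utility.items()}
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}

def expected_utility(dist, utility):
    return sum(p * utility[x] for x, p in dist.items())

def expected_nats(dist, m2):
    """E[-ln P[X | M2]] when X is distributed according to dist."""
    return -sum(p * math.log(m2[x]) for x, p in dist.items() if p > 0)

utility = {"house": 10.0, "shed": 3.0, "lumber": 0.0}   # hypothetical utilities
options = {                                              # hypothetical M1(theta) per choice
    "build": {"house": 0.8, "shed": 0.1, "lumber": 0.1},
    "tinker": {"house": 0.1, "shed": 0.7, "lumber": 0.2},
    "idle": {"house": 0.0, "shed": 0.0, "lumber": 1.0},
}

m2 = model_from_utility(utility, alpha=0.5)

# Rank the choices two ways: highest expected utility first, fewest expected nats first.
by_utility = sorted(options, key=lambda k: -expected_utility(options[k], utility))
by_nats = sorted(options, key=lambda k: expected_nats(options[k], m2))
print(by_utility, by_nats)
```

The two rankings agree because the expected code length equals $\ln Z - α E [u (X)]$ , a decreasing affine function of expected utility. Changing `alpha` to any other positive value rescales the costs but leaves the ranking alone.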

So What Does This Buy Us?

If this formulation is equivalent to expected utility maximization, why view it this way?

Intuitively, this view gives more semantics to our “utility functions”. They have built-in “meanings”; they’re not just preference orderings.

Mathematically, the immediately obvious step for anyone with an information theory background is to write:

$E [- \log P [X | M_{2}] | M_{1}] = - \sum_{X} P [X | M_{1}] \left( \log P [X | M_{1}] + \log \frac{P [X | M_{2}]}{P [X | M_{1}]} \right)$

$= H (X | M_{1}) + D_{KL} (M_{1} . X \| M_{2} . X)$

The expected number of bits required to encode $X$ using $M_{2}$ is the entropy of $X$ under $M_{1}$ plus the Kullback-Leibler divergence of (distribution of $X$ under model $M_{1}$ ) relative to (distribution of $X$ under model $M_{2}$ ), i.e. $\sum_{X} P [X | M_{1}] \log \frac{P [X | M_{1}]}{P [X | M_{2}]}$ . Both of those terms are nonnegative. The first measures “how noisy” $X$ is, the second measures “how close” the distributions are under our two models.

Intuitively, this math says that we can decompose the objective $E [- log P [X | M_{2}] | M_{1}]$ into two pieces:

Make $X$ more predictable
Make the distribution of $X$ “close to” the distribution $P [X | M_{2}]$ , with closeness measured by KL-divergence

Combined with the previous section: we can take any expected utility maximization problem, and decompose it into an entropy minimization term plus a “make-the-world-look-like-this-specific-model” term.
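A quick numerical check of the decomposition, as a Python sketch. The two three-outcome distributions are hypothetical; `kl_divergence(p, q)` computes $\sum_{X} p \log_2 (p / q)$ , with `p` playing the role of $M_{1}$ and `q` the role of $M_{2}$ .

```python
import math

def entropy(p):
    """H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p.values() if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) in bits."""
    return sum(pi * math.log2(pi / q[x]) for x, pi in p.items() if pi > 0)

def cross_entropy(p, q):
    """Expected bits to encode samples from p with a code optimal for q."""
    return -sum(pi * math.log2(q[x]) for x, pi in p.items() if pi > 0)

m1 = {"house": 0.6, "shed": 0.3, "lumber": 0.1}    # hypothetical world-model
m2 = {"house": 0.8, "shed": 0.15, "lumber": 0.05}  # hypothetical target model

total = cross_entropy(m1, m2)
noise_term = entropy(m1)
closeness_term = kl_divergence(m1, m2)
print(total, noise_term + closeness_term)
```

The two printed numbers should match up to floating-point error: the expected code length splits exactly into the "predictability" piece and the "closeness" piece.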

This becomes especially interesting in situations where the entropy of $X$ cannot be reduced - e.g. thermodynamics. If the entropy $H (X)$ is fixed, then only the KL-divergence term remains. In this case, we can directly interpret the optimization problem as “make the world-state distribution look like $P [X | M_{2}]$ ”. If we started from an expected utility optimization problem, then we derive a model $M_{2}$ such that optimizing expected utility is equivalent to making the world look as much as possible like $M_{2}$ .

In fact, even when $H (X)$ is not fixed, we can build equivalent models $M_{1}^{'}, M_{2}^{'}$ for which it is fixed, by adding new variables to $X$ . Suppose, for example, that we can choose between flipping a coin and rolling a die to determine $X_{0}$ . We can change the model so that both the coin flip and the die roll always happen, and we include their outcomes in $X$ . We then choose whether to set $X_{0}$ equal to the coin flip result or the die roll result, but in either case the entropy of $X$ is the same, since both are included. $M_{2}^{'}$ simply ignores all the new components added to $X$ (i.e. it implicitly has a uniform distribution on the new components).
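The coin/die example can be checked with a short Python sketch. It assumes a fair coin and a fair die, and represents the enlarged world-state as a tuple `(coin, die, x0)`; the function names are illustrative.

```python
import math
from collections import defaultdict
from itertools import product

COIN = ["heads", "tails"]
DIE = [1, 2, 3, 4, 5, 6]

def joint(choice):
    """Distribution of X = (coin, die, x0), where x0 copies
    the coin or the die depending on our choice."""
    dist = {}
    for c, d in product(COIN, DIE):
        x0 = c if choice == "coin" else d
        dist[(c, d, x0)] = 1 / (len(COIN) * len(DIE))
    return dist

def marginal_x0(dist):
    """Distribution of x0 alone, summing out the coin and die."""
    out = defaultdict(float)
    for (_, _, x0), p in dist.items():
        out[x0] += p
    return dict(out)

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

for choice in ("coin", "die"):
    full = joint(choice)
    print(choice, entropy(marginal_x0(full)), entropy(full))
```

The entropy of `x0` alone depends on the choice, but the entropy of the full `X` does not, since both random outcomes are always part of the state.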

So, starting from an expected utility maximization problem, we can transform to an equivalent minimum coded bits problem, and from there to an equivalent minimum KL-divergence problem. We can then interpret the optimization as “choose $θ$ to make $M_{1} (θ)$ as close as possible to $M_{2}$ ”, with closeness measured by KL-divergence.

What I Imagine This Might Be Useful For

In general, interpretations of probability grounded in information theory are much more solid than interpretations grounded in coherence theorems. However, information-theoretic groundings only talk about probability, not about "goals" or "agents" or anything utility-like. Here, we've transformed expected utility maximization into something explicitly information-theoretic and conceptually natural. This seems like a potentially-promising step toward better foundations of agency. I imagine there are probably purely-information-theoretic "coherence theorems" to be found.

Another natural direction to take this in is thermodynamic connections, e.g. combining it with a generalized heat engine. I wouldn't be surprised if this also tied in with information-theoretic "coherence theorems" - in particular, I imagine that negentropy could serve as a universal "resource", replacing the "dollars" typically used as a measuring stick in coherence theorems.

Overall, the whole formulation smells like it could provide foundations much more amenable to embedded agency.

Finally, there's probably some nice connection to predictive processing. In all likelihood, Karl Friston has already said all this, but it has yet to be distilled and disseminated to the rest of us.


17 comments

Thomas Kwa (2y)

The ultimate goal of John Wentworth’s sequence "Basic Foundations for Agent Models" is to prove a selection theorem of the form:

Premise (as stated by John): “a system steers far-away parts of the world into a relatively-small chunk of their state space”
Desired conclusion: The system is very likely (probability approaching 1 with increasing model size / optimization power / whatever) consequentialist, in that it has an internal world-model and search process. Note that this is a structural rather than behavioral property.

John has not yet proved such a result and it would be a major advance in the selection theorems agenda. I also find it plausible that someone without specific context could do meaningful work here. As such, I’ll offer a $5000 bounty to anyone who finds a precise theorem statement and beats John to the full proof (or disproof + proof of a well-motivated weaker statement). This bounty will decrease to zero as the sequence is completed and over the next ~12 months. Partial contributions will be rewarded proportionally.

Alana (1y)

Any updates on this?

[This comment is no longer endorsed by its author]

Thomas Kwa (1y)

There's a clarification by John here. I heard it was going to be put on Superlinear but unclear if/when.

Adele Lopez (3y)

This gives a nice intuitive explanation for the Jeffrey-Bolker rotation which basically is a way of interpreting a belief as a utility, and vice versa.

Some thoughts:

What do probabilities mean without reference to any sort of agent? Presumably it has something to do with the ability to "win" De Finetti games in expectation. For avoiding subtle anthropomorphization, it might be good to think of this sort of probability as being instantiated in a bacterium's chemical sensor, or something like that. And in this setting, it's clear it wouldn't mean anything without the context of the bacterium. Going further, it seems to me like the only mechanism which makes this mean anything is the fact that it helps make the bacterium "exist more" i.e. reproduce and thrive. So I think having a probability mean a probability inherently requires some sort of self-propagation -- it means something if it's part of why it exists. This idea can be taken to an even deeper level, where according to Zurek you can get the Born probabilities by looking at what quantum states allow information to persist through time (from within the system).
Does this imply anything about the difficulty of value learning? An AGI will be able to make accurate models of the world, so it will have the raw algorithms needed to do value learning... the hard part seems to be, as usual, pointing to the "correct" values. Not sure this helps with that so much.
A bounded agent creating a model will have to make decisions about how much detail to model various aspects of the world in. Can we use this idea to "factor" out that sort of trade-off as part of the utility function?

Alex Mennen (3y)

I don't see the connection to the Jeffrey-Bolker rotation? There, to get the shouldness coordinate, you need to start with the epistemic probability measure, and multiply it by utility; here, utility is interpreted as a probability distribution without reference to a probability distribution used for beliefs.

Daniel Kokotajlo (3y)

Probably confused noob question:

It seems like your core claim is that we can reinterpret expected-utility maximizers as expected-number-of-bits-needed-to-describe-the-world-using-M2 minimizers, for some appropriately chosen model of the world M2.

If so, then it seems like something weird is happening, because typical utility functions (e.g. "pleasure - pain" or "paperclips") are unbounded above and below, whereas bits are bounded below, meaning a bit-minimizer is like a utility function that's bounded above: there's a best possible state the world could be in according to that bit-minimizer.

Or are we using a version of expected utility theory that says utility must be bounded above and below? (In that case, I might still ask, isn't that in conflict with how number-of-bits is unbounded above?)

Rohin Shah (3y)

The core conceptual argument is: the higher your utility function can go, the bigger the world must be, and so the more bits it must take to describe it in its unoptimized state under M2, and so the more room there is to reduce the number of bits.

If you could only ever build 10 paperclips, then maybe it takes 100 bits to specify the unoptimized world, and 1 bit to specify the optimized world.

If you could build 10^100 paperclips, then the world must be humongous and it takes 10^101 bits to specify the unoptimized world, but still just 1 bit to specify the perfectly optimized world.

If you could build ∞ paperclips, then the world must be infinite, and it takes ∞ bits to specify the unoptimized world. Infinities are technically challenging, and John's comment goes into more detail about how you deal with this sort of case.

For more intuition, notice that exp(x) is a bijective function from (-∞, ∞) to (0, ∞), so it goes from something unbounded on both sides to something unbounded on one side. That's exactly what's happening here, where utility is unbounded on both sides and gets mapped to something that is unbounded only on one side.

Daniel Kokotajlo (3y)

Ahh, thanks!

johnswentworth (3y)

Awesome question! I spent about a day chewing on this exact problem.

First, if our variables are drawn from finite sets, then the problem goes away (as long as we don't have actually-infinite utilities). If we can construct everything as limits from finite sets (as is almost always the case), then that limit should involve a sequence of world models.

The more interesting question is what that limit converges to. In general, we may end up with an improper distribution (conceptually, we have to carry around two infinities which cancel each other out). That's fine - improper distributions happen sometimes in Bayesian probability, we usually know how to handle them.

Daniel Kokotajlo (3y)

Thanks for the reply, but I might need you to explain/dumb-down a bit more.

--I get how if the variables which describe the world can only take a finite combination of values, then the problem goes away. But this isn't good enough because e.g. "number of paperclips" seems like something that can be arbitrarily big. Even if we suppose they can't get infinitely big (though why suppose that?) we face problems, see below.

--What does it mean in this context to construct everything as limits from finite sets? Specifically, consider someone who is a classical hedonistic utilitarian. It seems that their utility is unbounded above and below, i.e. for any setting of the variables, there is a setting which is a zillion times better and a setting which is a zillion times worse. So how can we interpret them as minimizing the bits needed to describe the variable-settings according to some model M2? For any M2 there will be at least one minimum-bit variable-setting, which contradicts what we said earlier about every variable-setting having something which is worse and something which is better.

johnswentworth (3y)

I'll answer the second question, and hopefully the first will be answered in the process.

First, note that $P [X | M_{2}] \propto e^{α u (X)}$ , so arbitrarily large negative utilities aren't a problem - they get exponentiated, and yield probabilities arbitrarily close to 0. The problem is arbitrarily large positive utilities. In fact, they don't even need to be arbitrarily large, they just need to have an infinite exponential sum; e.g. if $u (X)$ is $1$ for any whole number of paperclips $X$ , then to normalize the probability distribution we need to divide by $\sum_{X = 0}^{\infty} e^{α \cdot 1} = \infty$ . The solution to this is to just leave the distribution unnormalized. That's what "improper distribution" means: it's a distribution which can't be normalized, because it sums to $\infty$ .

The main question here seems to be "ok, but what does an improper distribution mean in terms of bits needed to encode X?". Basically, we need infinitely many bits in order to encode X, using this distribution. But it's "not the same infinity" for each X-value - not in the sense of "set of reals is bigger than the set of integers", but in the sense of "we constructed these infinities from a limit so one can be subtracted from the other". Every X value requires infinitely many bits, but one X-value may require 2 bits more than another, or 3 bits less than another, in such a way that all these comparisons are consistent. By leaving the distribution unnormalized, we're effectively picking a "reference point" for our infinity, and then keeping track of how many more or fewer bits each X-value needs, compared to the reference point.

In the case of the paperclip example, we could have a sequence of utilities $u_{n} (X)$ which each assigns utility $X$ to any number of paperclips $X < n$ (i.e. 1 util per clip, up to $n$ clips), and then we take the limit $n \to \infty$ . Then our $n^{\text{th}}$ unnormalized distribution is $P_{\text{unnorm}} [X | M_{n}] = e^{α X} I [X < n]$ , and the normalizing constant is $Z_{n} = \frac{1 - e^{α n}}{1 - e^{α}}$ , which grows like $O (e^{α n})$ as $n \to \infty$ . The number of bits required to encode a particular value $X < n$ is

$- \log \frac{P_{\text{unnorm}} [X | M_{n}]}{Z_{n}} = \log \frac{1 - e^{α n}}{1 - e^{α}} - α X$

Key thing to notice: the first term, $\log \frac{1 - e^{α n}}{1 - e^{α}}$ , is the part which goes to $\infty$ with $n$ , and it does not depend on $X$ . So, we can take that term to be our "reference point", and measure the number of bits required for any particular $X$ relative to that reference point. That's exactly what we're implicitly doing if we don't normalize the distribution: ignoring normalization, we compute the number of bits required to encode X as

$- \log P_{\text{unnorm}} [X | M_{n}] = - α X$

... which is exactly the "adjustment" from our reference point.
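A small Python sketch of the same calculation, for readers who want to see the numbers. The choice `alpha = 1.0` and the particular paperclip counts are arbitrary illustrations, and the code uses natural logs, so costs are in nats rather than bits.

```python
import math

def normalizer(n, alpha):
    """Z_n = sum over X = 0..n-1 of exp(alpha * X)."""
    return sum(math.exp(alpha * x) for x in range(n))

def encoding_cost(x, n, alpha):
    """-ln(P_unnorm[x] / Z_n) = ln Z_n - alpha * x, in nats, for x < n."""
    return math.log(normalizer(n, alpha)) - alpha * x

alpha = 1.0  # hypothetical choice; any alpha > 0 behaves the same way
for n in (10, 20, 40):
    reference = math.log(normalizer(n, alpha))   # keeps growing as n increases
    gap = encoding_cost(2, n, alpha) - encoding_cost(5, n, alpha)
    print(n, round(reference, 3), round(gap, 6))
```

The reference point grows with `n`, but the gap between the costs of 2 and 5 paperclips stays at `3 * alpha` throughout, which is the sense in which the comparisons stay consistent in the limit.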

(Side note: this is exactly how information theory handles continuous distributions. An infinite number of bits is required to encode a real number, so we pull out a term $\log dx$ which diverges in the limit $dx \to 0$ , and we measure everything relative to that. Equivalently, we measure the number of bits required to encode up to precision $dx$ , and as long as the distribution is smooth and $dx$ is small, the number of bits required to encode the rest of $x$ using the distribution won't depend on the value of $x$ .)

Does this make sense? Should I give a different example/use more English?

Edouard Harris (3y)

Late comment here, but I really liked this post and want to make sure I've fully understood it. In particular there's a claim near the end which says: if $H (X)$ is not fixed, then we can build equivalent models $M_{1}^{'}$ , $M_{2}^{'}$ for which it is fixed. I'd like to formalize this claim to make sure I'm 100% clear on what it means. Here's my attempt at doing that:

For any pair of models $M_{1} (θ)$ , $M_{2}$ where $H (X_{0} | M_{1} (θ)) \neq H (X_{0} | M_{1} (θ^{'}))$ , there exists a variable $X$ (of which $X_{0}$ is a subset) and a pair of models $M_{1}^{'} (θ)$ , $M_{2}^{'}$ such that 1) $H (X | M_{1}^{'} (θ)) = H (X | M_{1}^{'} (θ^{'}))$ for any $θ$ , $θ^{'}$ ; and 2) the behavior of the system is the same under $M_{1}^{'} (θ)$ , $M_{2}^{'}$ as it was under $M_{1} (θ)$ , $M_{2}$ .

To satisfy this claim, we construct our $X$ as the conjunction of $X_{0}$ and some "extra" component $X_{0}^{'}$ . e.g., $X_{0} \in \{\text{heads}, \text{tails}\}$ for a coin flip, $X_{0}^{'} \in \{1, 2, 3, 4, 5, 6\}$ for a die roll, and so $X = X_{0} X_{0}^{'} \in \{(\text{heads}, 1), (\text{tails}, 1), (\text{heads}, 2), \ldots\}$ is the conjunction of the coin flip and the die roll, and the domain of $X$ is the outer product of the coin flip domain and of the die roll domain.

Then we construct our $M_{1}^{'} (θ)$ by imposing 1) $P (X_{0} X_{0}^{'} | M_{1}^{'} (θ)) = P (X_{0} | M_{1}^{'} (θ)) P (X_{0}^{'} | M_{1}^{'} (θ))$ (i.e., $X_{0}$ , $X_{0}^{'}$ are logically independent given $M_{1}^{'} (θ)$ for every $θ$ ); and 2) $P (X_{0} | M_{1}^{'} (θ)) = P (X_{0} | M_{1} (θ))$ (i.e., the marginal prob given $M_{1}^{'} (θ)$ equals the original prob under $M_{1} (θ)$ ).

Finally we construct $M_{2}^{'}$ by imposing the analogous 2 conditions that we did for $M_{1}^{'}$ : 1) $P (X_{0} X_{0}^{'} | M_{2}^{'}) = P (X_{0} | M_{2}^{'}) P (X_{0}^{'} | M_{2}^{'})$ and 2) $P (X_{0} | M_{2}^{'}) = P (X_{0} | M_{2})$ . But we also impose the extra condition 3) $P (X_{0}^{'} | M_{2}^{'}) = \frac{1}{| X_{0}^{'} |}$ (assuming finite sets, etc.).

We can always find $X$ , $M_{1}^{'} (θ)$ and $M_{2}^{'}$ that satisfy the above conditions, and with these choices we end up with $H (X | M_{1}^{'} (θ)) = H (X | M_{1}^{'} (θ^{'}))$ for all $θ$ , $θ^{'}$ (i.e., $H$ is fixed) and $E [- \log (P (X | M_{2}^{'})) | M_{1}^{'} (θ)] = E [- \log (P (X_{0} | M_{2}^{'})) | M_{1}^{'} (θ)] + \text{constant}$ (i.e., the system retains the same dynamics).
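The last identity (expected bits under $M_{2}^{'}$ equal the original expected bits plus a constant) can be checked numerically. This Python sketch only tests that one claim, under made-up distributions; the product construction and the uniform distribution on the extra component follow the conditions above, and all names are illustrative.

```python
import math
from itertools import product

def product_model(p_x0, p_extra):
    """Joint distribution over (x0, x0') with the two components independent."""
    return {(a, b): pa * pb
            for (a, pa), (b, pb) in product(p_x0.items(), p_extra.items())}

def expected_bits(true_dist, code_dist):
    """E[-log2 P[X | code_dist]] when X follows true_dist."""
    return -sum(p * math.log2(code_dist[x]) for x, p in true_dist.items() if p > 0)

m1_x0 = {"heads": 0.7, "tails": 0.3}            # hypothetical P(X0 | M1(theta))
m2_x0 = {"heads": 0.9, "tails": 0.1}            # hypothetical P(X0 | M2)
extra_true = {1: 0.5, 2: 0.3, 3: 0.2}           # hypothetical P(X0' | M1'(theta))
extra_uniform = {k: 1 / 3 for k in extra_true}  # M2' is uniform on X0'

m1_prime = product_model(m1_x0, extra_true)
m2_prime = product_model(m2_x0, extra_uniform)

original = expected_bits(m1_x0, m2_x0)
augmented = expected_bits(m1_prime, m2_prime)
constant = math.log2(len(extra_true))
print(augmented, original + constant)
```

Since the constant is $\log_2 | X_{0}^{'} |$ regardless of $θ$ , comparing any two parameter values gives the same answer before and after the augmentation.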

Is this basically right? Or is there something I've misunderstood?

johnswentworth (3y)

The construction is correct.

Note that for $M_{2}$ , conceptually we don't need to modify it, we just need to use the original $M_{2}$ but apply it only to the subcomponents of the new $X$ -variable which correspond to the original $X$ -variable. Alternatively, we can take the approach you do: construct $M_{2}^{'}$ which has a distribution over the new $X$ , but "doesn't say anything" about the new components, i.e. it's just maxentropic over the new components. This is equivalent to ignoring the new components altogether.

Edouard Harris (3y)

Ah yes, that's right. Yeah, I just wanted to make this part fully explicit to confirm my understanding. But I agree it's equivalent to just let $M_{2}$ ignore the extra $X_{0}^{'}$ (or whatever) component.

Thanks very much!

Oliver Habryka (3y)

Promoted to curated: As Adele says, this feels related to a bunch of the Jeffrey-Bolker rotation ideas, which I've referenced many many times since then, but in a way that feels somewhat independent, which makes me more excited about there being some deeper underlying structure here.

I've also had something like this in my mind for a while, but haven't gotten around to formalizing it, and I think I've seen other people make similar arguments in the past, which makes this a valuable clarification and synthesis that I expect to get referenced a bunch.

James Fox (10mo)

I know you've acknowledged Friston at the end, but I'm just commenting for other interested readers' benefit that this is very close to Karl Friston’s active inference framework, which posits that all agents minimise the discrepancies (or prediction errors) between their internal representations of the world and their incoming sensory information through both action and perception.

romeostevensit (3y)

Hypothesis: in a predictive coding model, the bottom up processing is doing lossless compression and the top down processing is doing lossy compression. I feel excited about viewing more cognitive architecture problems through a lens of separating these steps.
