When we think of “optimization” as compressing some part of the universe into a relatively small number of possible statesit’s very natural to quantify that compression in terms of “bits of optimization”. Example: we have a marble which could be in any of 16 different slots in a box (assume/approximate uniform probability on each). We turn the box on its side, shake it, and set it down so the marble will end up in one of the 4 slots on the downward side (again, assume/approximate uniform probability on each). Then we’ve compressed the marble-state from 16 possibilities to 4, cut the state-space in half twice, and therefore performed two bits of optimization.

In the language of information theory, this quantity is the KL-divergence between the initial and final distributions.

In this post, we’ll prove our first simple theorem about optimization at a distance: the number of bits of optimization applied can only decrease over a distance. In particular, in our optimization at a distance picture:

… the number of bits of optimization applied to the far-away optimization target cannot be any larger than the number of bits of optimization applied to the optimizer’s direct outputs.

The setup: first, we’ll need two distributions to compare over the optimizer’s direct outputs. You might compare the actual output-distribution to uniform randomness, or independent outputs. If the optimizer is e.g. a trained neural net, you might compare its output-distribution to the output-distribution of a randomly initialized net. If the optimizer has some sort of explicit optimization loop (like e.g. gradient descent), then you might compare its outputs to the initial outputs tested in that loop. These all have different interpretations and applications; the math here will apply to all of them.

Let’s name some variables in this setup:

  • Optimizer’s direct outputs:  (for “actions”) 
  • ’th intermediating layer:  (with )
  • Reference distribution over everything: 
  • Actual distribution over everything: 

By assumption, only the optimizer differs between the reference and actual distributions; the rest of the Bayes net is the same. Mathematically, that means  (and of course both distributions factor over the same underlying graph).

Once we have two distributions over the optimizer’s direct outputs, they induce two distributions over each subsequent layer of intermediating variables, simply by propagating through each layer:

At each layer, we can compute the number of bits of optimization applied to that layer, i.e. how much that layer’s state-space is compressed by the actual distribution relative to the reference distribution. That’s the KL-divergence between the distributions: .

To prove our theorem, we just need to show that . To do that, we’ll use the chain rule of KL divergence to expand  in two different ways. First:

Recall that  and  are the same, so , and our first expression simplifies to . Second:

KL-divergence is always nonnegative, so we can drop the second term above and get an inequality: 

Now we just combine these two expressions for  and find

… which is what we wanted to prove.

So: if we measure the number of bits of optimization applied to the optimizer’s direct output, or to any particular layer, that provides an upper bound on the number of bits of optimization applied further away.

17

6 comments, sorted by Click to highlight new comments since: Today at 8:49 AM
New Comment

To check my noob understanding:

Suppose you are playing a MMO video game which accepts about 10,000 binary inputs per second, in the form of various keyboard buttons being pressed or not. You want to cause something crazy and meme-worthy to happen in the game, like everyone in the game falling over dead simultaneously. You have an hour or so of playtime total. What's the craziest thing you can make happen?

Answer: Well, whatever it is, it has to be at least 1/2^(10,000x3600) likely to happen already. If it was less likely than that, you wouldn't be able to make it likely even if you played perfectly.

That's a correct intuitive understanding.

I think this assumes implicitly that P(A|ref) is uniformly distributed over all the 10,000 options. In a video game I‘d think more that the ”reference” is always to output 0s since the player isn’t interacting. Then The KL divergence could be arbitrarily large. But it’s not really clear in general how to interpret the reference distribution, perhaps someone can clarify?

One thing that I had to remind myself while reading this post is that "far away" is across space-time, emphasis on time. So "far away" can be about optimizing the future.

Suppose I send a few lines of code to a remote server. Those lines are enough to bootstrap a superintelligence which goes on to strongly optimize every aspect of the world. This counts as a fairly small amount of optimization power, as the chance of me landing on those lines of code by pure chance isn't that small. 

Correct, though note that this doesn't let you pick a specific optimization target for that AGI. You'd need more bits in order to specify an optimization target. In other words, there's still a meaningful sense in which you can't "steer" the world into a specific target other than one it was likely to hit anyway, at least not without more bits.