My aim is to define optimization without making reference to the following things:

  1. A "null" action or "nonexistence" of the optimizer. This is generally poorly defined, and choices of different null actions give different answers.
  2. Repeated action. An optimizer should still count even if it only does a single action.
  3. Uncertainty. We should be able to define an optimizer in a fully deterministic universe.
  4. Absolute time. This will be the hardest, but it would be nice to define optimization without reference to the "state" of the universe at "time $t$".

Attempt one

First let's just eliminate the concept of a null action. Imagine the state of the universe at a time $t$.

Let's divide the universe into two sections and call these $A$ and $B$. They have states $a$ and $b$. If we want to use continuous states we'll need to have some metric which applies to these states, so we can calculate things like the variance and entropy of probability distributions over them.

Treat $A$ and $B$ as parts of a Read-Eval-Print-Loop. Each state $a$ produces some output $f_a$, which acts like a function mapping $B \to B$, and vice versa ($b$ produces $f_b : A \to A$). The outputs $f_a$ and $f_b$ can be thought of as the things which cross the Markov blanket between the two sections.

Sadly we still have to introduce probability distributions. Let's consider a joint probability distribution $P(a, b)$, and also the two marginal distributions $P(a)$ and $P(b)$.

By defining distributions over the outputs $f_a$ and $f_b$ based on the distribution $P$, we can define the next-step distribution $P'$ in the "normal" way. This looks like integrating over the spaces $A$ and $B$ like so:

$$P'(a', b') = \int_A \int_B P(a, b)\,[f_b(a) = a']\,[f_a(b) = b']\ \mathrm{d}b\ \mathrm{d}a$$

(where $[\cdot]$ is 1 when the condition inside holds and 0 otherwise). What this is basically saying is that to define the probability distribution over states $a'$ and $b'$, we integrate over all states $a$ and $b$ and sum up the probability mass of those pairs where the $f_b$ corresponding to $b$ maps $a$ to the given $a'$, and the $f_a$ corresponding to $a$ maps $b$ to the given $b'$.

Now let's define an "uncorrelated" version of $P'$, which we will refer to as $P'_u$: the result of evolving the product of the marginals, $P(a)\,P(b)$, forward in exactly the same way.

This loosely represents what happens if we decorrelate $A$ and $B$. In the language of humans, this is like an agent taking a random move from a selection.

We can refer to a probability distribution $P$ as an "optimizing" probability distribution if $P'_u$ has higher entropy than $P'$.
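For finite state spaces the whole construction fits in a few lines of code. A minimal sketch in Python (the function names are my own, and `step` plays the role of applying $f_a$ and $f_b$ simultaneously):

```python
from collections import defaultdict
from math import log2

def evolve(P, step):
    """Push the joint distribution P[(a, b)] through one REPL step,
    where step(a, b) returns the successor pair (a', b')."""
    P_next = defaultdict(float)
    for (a, b), p in P.items():
        P_next[step(a, b)] += p
    return dict(P_next)

def decorrelate(P):
    """Replace P(a, b) with the product of its marginals P(a)P(b)."""
    Pa, Pb = defaultdict(float), defaultdict(float)
    for (a, b), p in P.items():
        Pa[a] += p
        Pb[b] += p
    return {(a, b): pa * pb for a, pa in Pa.items() for b, pb in Pb.items()}

def entropy(P):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in P.values() if p > 0)

def optimizing_strength(P, step):
    """Entropy of the decorrelated evolution P'_u minus entropy of the
    true evolution P'; P is "optimizing" when this is positive."""
    return entropy(evolve(decorrelate(P), step)) - entropy(evolve(P, step))
```

A joint distribution is just a dict from `(a, b)` pairs to probabilities; `optimizing_strength` then measures, in bits, how much entropy the correlation between the two sections removes.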


For an example, imagine the universe is divided into two parts: a room $R$ and a thermostat $T$. The room can have states in the set $\{\text{cold}, \text{medium}, \text{hot}\}$, and the thermostat can have states $\{\text{heat}, \text{off}, \text{cool}\}$. Imagine that $f_t$ and $f_r$ are defined as follows:

- $f_t$ (the thermostat's output, acting on the room): heat moves the room one step warmer (hot stays hot), off leaves it unchanged, and cool moves it one step colder (cold stays cold).
- $f_r$ (the room's output, acting on the thermostat): a cold room sends the thermostat to heat, a medium room sends it to off, and a hot room sends it to cool.

Basically, the thermostat decides whether the room gets warmer, stays the same, or gets colder, and the room decides which state the thermostat moves to.

We can also consider the probability mass flowing from each of the nine states to another one:

| From | To |
|---|---|
| (cold, heat) | (medium, heat) |
| (cold, off) | (cold, heat) |
| (cold, cool) | (cold, heat) |
| (medium, heat) | (hot, off) |
| (medium, off) | (medium, off) |
| (medium, cool) | (cold, off) |
| (hot, heat) | (hot, cool) |
| (hot, off) | (hot, cool) |
| (hot, cool) | (medium, cool) |

Imagine the following $P$ (rows are thermostat states, columns are room states):

| $P$ | cold | medium | hot |
|---|---|---|---|
| heat | 1/3 | 0 | 0 |
| off | 0 | 1/3 | 0 |
| cool | 0 | 0 | 1/3 |

This will give us the following $P'$:

| $P'$ | cold | medium | hot |
|---|---|---|---|
| heat | 0 | 1/3 | 0 |
| off | 0 | 1/3 | 0 |
| cool | 0 | 1/3 | 0 |

Which has 1.6 bits of entropy.

And the following $P'_u$:

| $P'_u$ | cold | medium | hot |
|---|---|---|---|
| heat | 2/9 | 1/9 | 0 |
| off | 1/9 | 1/9 | 1/9 |
| cool | 0 | 1/9 | 2/9 |

Which has 2.7 bits of entropy.

This means that the joint-ness of the probability distribution $P$ has removed 1.1 bits of entropy from the system. We say that our choice of $P$ is optimizing, with an optimizing strength of 1.1 bits.
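These entropies can be checked directly. A minimal sketch in Python, assuming the transition rules above (the state labels are my own):

```python
from collections import Counter
from math import log2

# Successor of each (room, thermostat) state under the simple dynamics:
# the thermostat warms, holds, or cools the room, and the room sets the
# thermostat to heat, off, or cool.
step = {
    ('cold', 'heat'):   ('medium', 'heat'),
    ('cold', 'off'):    ('cold',   'heat'),
    ('cold', 'cool'):   ('cold',   'heat'),
    ('medium', 'heat'): ('hot',    'off'),
    ('medium', 'off'):  ('medium', 'off'),
    ('medium', 'cool'): ('cold',   'off'),
    ('hot', 'heat'):    ('hot',    'cool'),
    ('hot', 'off'):     ('hot',    'cool'),
    ('hot', 'cool'):    ('medium', 'cool'),
}

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def push(dist):
    """Move each state's probability mass to its successor."""
    out = Counter()
    for state, p in dist.items():
        out[step[state]] += p
    return out

# Correlated P: room and thermostat start perfectly matched.
P = {('cold', 'heat'): 1/3, ('medium', 'off'): 1/3, ('hot', 'cool'): 1/3}
# Uncorrelated version: the product of P's marginals is uniform over all nine states.
P_u = {state: 1/9 for state in step}

print(round(entropy(push(P)), 1))    # 1.6
print(round(entropy(push(P_u)), 1))  # 2.7
```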


But what if we consider a "smarter" thermostat, which turns itself off just before the temperature changes? Here the room's output $f_r$ sends the thermostat to off whenever the thermostat's current action is about to bring the room to medium, and behaves as before otherwise. The probability mass now flows like this:

| From | To |
|---|---|
| (cold, heat) | (medium, off) |
| (cold, off) | (cold, heat) |
| (cold, cool) | (cold, heat) |
| (medium, heat) | (hot, off) |
| (medium, off) | (medium, off) |
| (medium, cool) | (cold, off) |
| (hot, heat) | (hot, cool) |
| (hot, off) | (hot, cool) |
| (hot, cool) | (medium, off) |

With the same choice of $P$:

| $P$ | cold | medium | hot |
|---|---|---|---|
| heat | 1/3 | 0 | 0 |
| off | 0 | 1/3 | 0 |
| cool | 0 | 0 | 1/3 |

This will give us the following $P'$:

| $P'$ | cold | medium | hot |
|---|---|---|---|
| heat | 0 | 0 | 0 |
| off | 0 | 1 | 0 |
| cool | 0 | 0 | 0 |

With an entropy of zero.

And the following $P'_u$:

| $P'_u$ | cold | medium | hot |
|---|---|---|---|
| heat | 2/9 | 0 | 0 |
| off | 1/9 | 1/3 | 1/9 |
| cool | 0 | 0 | 2/9 |

Which has 2.2 bits of entropy.

In the new system, $P$ has an optimizing strength of 2.2 bits, approximately twice as much as before. This indicates that the latter system is "better" at optimizing the resulting distribution in some way.
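These two entropies can also be verified numerically (again a sketch; the state labels are my own):

```python
from collections import Counter
from math import log2

# Successor of each (room, thermostat) state for the "smarter" thermostat,
# which switches to off whenever its current action is about to bring the
# room to medium.
step = {
    ('cold', 'heat'):   ('medium', 'off'),
    ('cold', 'off'):    ('cold',   'heat'),
    ('cold', 'cool'):   ('cold',   'heat'),
    ('medium', 'heat'): ('hot',    'off'),
    ('medium', 'off'):  ('medium', 'off'),
    ('medium', 'cool'): ('cold',   'off'),
    ('hot', 'heat'):    ('hot',    'cool'),
    ('hot', 'off'):     ('hot',    'cool'),
    ('hot', 'cool'):    ('medium', 'off'),
}

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def push(dist):
    """Move each state's probability mass to its successor."""
    out = Counter()
    for state, p in dist.items():
        out[step[state]] += p
    return out

P = {('cold', 'heat'): 1/3, ('medium', 'off'): 1/3, ('hot', 'cool'): 1/3}
P_u = {state: 1/9 for state in step}  # product of P's uniform marginals

print(round(abs(entropy(push(P))), 1))  # 0.0 — all mass lands on (medium, off)
print(round(entropy(push(P_u)), 1))     # 2.2
```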


So we have eliminated the idea of needing the optimizer to have clearly-defined existence/nonexistence cases, or needing some "null" action to compare its outputs to. This is good. We have also eliminated the concept of repeated action.

Next I will attempt to eliminate the need to start with a probability distribution. In both of the examples above, our choice of $P$ was important. I want to find a more "natural" way of defining probability distributions.
