AI
Frontpage

5 min read

# 21

Background: Natural Latents: The Math, Natural Latents: The Concepts, Why Care About Natural Latents?, the prototypical semantics use-case. This post does not assume that you’ve read all of those, or even any of them.

Suppose I roll a biased die 1000 times, and then roll the same biased die another 1000 times. Then...

• Mediation: The first 1000 rolls are approximately independent of the second 1000 given the bias (to reasonable precision).
• Redundancy: I can estimate the die’s  bias (to reasonable precision) with high confidence from either the first or second 1000 rolls.

The die’s bias is therefore a natural latent, which means it has various nice properties.

• Minimality: The bias is the smallest summary of all the information about the first 1000 rolls relevant to the second 1000 (and vice-versa).
• Maximality: The bias is the largest piece of information which can be calculated from the first 1000 rolls and also can separately be calculated from the second 1000 rolls.
• Any other variable which satisfies the above properties must tell us (approximately) the same information about the die rolls as the bias.

Furthermore, the bias is a(n approximate) deterministic natural latent: the die’s bias (to reasonable precision) is approximately determined by[1] the first 1000 die rolls, and also approximately determined by the second 1000 die rolls. That implies one more nice property:

We’ve proven all that before, mostly in Natural Latents: The Math (including the addendum added six months after the rest of the post). But it turns out that the math is a lot shorter and simpler, and easily yields better bounds, if we’re willing to assume (approximate) determinism up-front. That does lose us some theoretical tools (notably the resampling construction), but it gives a cleaner foundation for our expected typical use cases (like e.g. semantics). The goal of this post is to walk through that math.

## Background Tool: Determinism in Diagrams

We’re going to use diagrammatic proofs, specifically using Bayes nets. But it’s non-obvious how to express (approximate) determinism using Bayes nets, or what rules diagrams follow when determinism is involved, so we’ll walk through that first.

This diagram says that  is (approximately) determined by :

Intuitively, the literal interpretation of the diagram is:  mediates between  and , i.e.  itself tells me nothing more about  once I already know . That only makes sense if  tells me everything there is to know about , i.e.  is determined by .

In the approximate case, we express the approximation error of the diagram as a KL-divergence, same as usual:

If you get confused later about what it means to have two copies of the same variable in a diagram, go back to that line; that’s the definition of the approximation error of the diagram. (One way to view that definition: there’s actually two variables  and , but  says that  and  always have the same value.)

That approximation error simplifies:

So the diagram says  is determined by , and the approximation error of the diagram is the entropy  of  given  - i.e. the number of bits required on average to specify  once one already knows . Very intuitive!

### The Dangly Bit Lemma

Intuitively, if  is determined by , then  always mediates between  and anything else. So, when working with diagrams, we can always add a copy of  with only  as a parent.

Here’s the lemma:

If  holds to within  bits, and any other diagram  involving  holds to within  bits, then we can create a new diagram  which is identical to  but has another copy of  (the “dangly bit”) as a child of . The new diagram  will hold to within  bits.

Proof (click to expand)

Let  be the distribution over  and any other variables  specified by the diagram  (note that  may include some copies of ). Then  specifies the distribution , so the approximation error for  is

A simple example: for any  the diagram  is always satisfied exactly. So, if

is satisfied, then

is satisfied for any . If we write out the approximation errors for these diagrams, this example is equivalent to .

## Deterministic Natural Latents

### Fundamental Theorem

We’ll start with the core idea of natural latents: if one variable  mediates between two other (sets of) variables , and another variable  is redundantly represented in both  and , then  is also represented in . Intuitively: imagine that  is a pipe between  and , and the only way for information to move between  and  is through that pipe. Then if some information  is present in both  and , it must have gone through the pipe.

Diagrammatically, the theorem says:

The proof is just two applications of The Dangly Bit Lemma, followed by marginalizing out  and :

### Minimality, Maximality, and Isomorphism of Deterministic Natural Latents

Now suppose that  (approximately) satisfies both the mediation and redundancy conditions. Then:

•  (approximately) satisfies the mediation condition, and
• For any other  which (approximately) satisfies the mediation condition,  (approximately) by the Fundamental Theorem.

So,  is (approximately) the smallest variable which (approximately) satisfies the mediation condition, in the sense that  is (approximately) determined by any other variable which (approximately) satisfies the mediation condition. This is the “minimality” property of deterministic natural latents.

Similarly:

•  (approximately) satisfies the redundancy condition, and
• For any other  which (approximately) satisfies the redundancy condition,  (approximately) by the Fundamental Theorem.

So,  is (approximately) the largest variable which (approximately) satisfies the redundancy condition, in the sense that any other variable which (approximately) satisfies the redundancy condition is (approximately) determined by . This is the “maximality” property of deterministic natural latents.

Finally, suppose that  and  both satisfy both the mediation and redundancy conditions. Then by the Fundamental Theorem, each is (approximately) determined by other; they’re approximately isomorphic. So, the (approximate) deterministic natural latent is (approximately) unique up to isomorphism.

And just like that, we’ve proven the main properties of deterministic natural latents which we typically want to use: (approximate) minimality, (approximate) maximality, and (approximate) uniqueness.

1. ^

When we say "Y is (approximately) determined by X" in this post, we mean "conditional on X, the entropy of Y is (approximately) zero". This may or may not imply any other notion of "Y is (approximately) determined by X", like e.g. "there exists a deterministic function of X which is equal to Y with high probability".

New Comment