Less Basic Inframeasure Theory

So, we've also got an analogue of KL-divergence for crisp infradistributions.

We'll be using $P$ and $Q$ for crisp infradistributions, and $p$ and $q$ for probability distributions associated with them. $D_{K L}$ will be used for the KL-divergence of infradistributions, and $d_{K L}$ will be used for the KL-divergence of probability distributions. For crisp infradistributions, the KL-divergence is defined as

$D_{K L} (P | Q) := {max}_{q \in Q} {min}_{p \in P} d_{K L} (p | q)$

I'm not entirely sure why it's like this, but it has the basic properties you would expect of the KL-divergence, like concavity in both arguments and interacting well with continuous pushforwards and semidirect product.

Straight off the bat, we have:

Proposition 1: $D_{K L} (P | Q) \geq 0$

Proof: KL-divergence between probability distributions is always nonnegative, by Gibb's inequality.

Proposition 2: $D_{K L} (P | Q) = 0 \leftrightarrow Q \subseteq P$

$D_{K L} (P | Q) = 0 \leftrightarrow {max}_{q \in Q} {min}_{p \in P} d_{K L} (p | q) = 0 \leftrightarrow \forall q \in Q : {min}_{p \in P} d_{K L} (p | q) = 0$

$\forall q \in Q : {min}_{p \in P} d_{K L} (p | q) = 0 \leftrightarrow \forall q \in Q \exists p \in P : d_{K L} (p | q) = 0$

And now, because KL-divergence between probability distributions is 0 only when they're equal, we have:

$\forall q \in Q \exists p \in P : d_{K L} (p | q) = 0 \leftrightarrow \forall q \in Q : q \in P \leftrightarrow Q \subseteq P$

Proposition 3: If $u$ is the uniform distribution on $X$ , then $D_{K L} (P | u) = ln (| X |) - H (P)$

$D_{K L} (P | u) = {min}_{p \in P} d_{K L} (p | u) = {min}_{p \in P} (H (p, u) - H (p))$

And the cross-entropy of any distribution with the uniform distribution is always $ln (| X |)$ , so:

$= ln (| X |) + {min}_{p \in P} (- H (p)) = ln (| X |) - {max}_{p \in P} H (p) = ln (| X |) - H (P)$

Proposition 4: $D_{K L}$ is a concave function over $(□^{c r i s p} X)^{2}$ .

Proof: Let's use $a$ as our number in $[0, 1]$ in order to talk about mixtures. Then,

$D_{K L} (a P_{1} + (1 - a) P_{2} | a Q_{1} + (1 - a) Q_{2}) = {max}_{q \in a Q_{1} + (1 - a) Q_{2}} {min}_{p \in a P_{1} + (1 - a) P_{2}} d_{K L} (p | q)$

$= {max}_{q_{1} \in Q_{1}, q_{2} \in Q_{2}} {min}_{p_{1} \in P_{1}, p_{2} \in P_{2}} d_{K L} (a p_{1} + (1 - a) p_{2} | a q_{1} + (1 - a) q_{2})$

Then we apply concavity of the KL-divergence for probability distributions to get:

$\leq {max}_{q_{1} \in Q_{1}, q_{2} \in Q_{2}} {min}_{p_{1} \in P_{1}, p_{2} \in P_{2}} (a d_{K L} (p_{1} | q_{1}) + (1 - a) d_{K L} (p_{2} | q_{2}))$

$= {max}_{q_{1} \in Q_{1}, q_{2} \in Q_{2}} (a {min}_{p_{1} \in P_{1}} d_{K L} (p_{1} | q_{1}) + (1 - a) {min}_{p_{2} \in P_{2}} d_{K L} (p_{2} | q_{2}))$

$= a {max}_{q_{1} \in Q_{1}} {min}_{p_{1} \in P_{1}} d_{K L} (p_{1} | q_{1}) + (1 - a) {max}_{q_{2} \in Q_{2}} {min}_{p_{2} \in P_{2}} d_{K L} (p_{2} | q_{2})$

$= a D_{K L} (P_{1} | Q_{1}) + (1 - a) D_{K L} (P_{2} | Q_{2})$

Proposition 5: $D_{K L} (P ⋉ K | Q ⋉ J) \geq D_{K L} (P | Q) + {min}_{p \in P} E_{p} [λ x . D_{K L} (K (x) | J (x))]$

$D_{K L} (P ⋉ K | Q ⋉ J) = {max}_{q \in Q, j \in \prod_{x} J (x)} {min}_{p \in P, k \in \prod_{x} K (x)} d_{K L} (p ⋉ k | q ⋉ j)$

$= {max}_{q \in Q, j \in \prod_{x} J (x)} {min}_{p \in P, k \in \prod_{x} K (x)} \sum_{x, y} (p ⋉ k) (x, y) \cdot ln (\frac{(p ⋉ k) (x, y)}{(q ⋉ j) (x, y)})$

$= {max}_{q \in Q, j \in \prod_{x} J (x)} {min}_{p \in P, k \in \prod_{x} K (x)} \sum_{x, y} p (x) \cdot k_{x} (y) \cdot ln (\frac{p (x) \cdot k_{x} (y)}{q (x) \cdot j_{x} (y)})$

$= {max}_{q \in Q, j \in \prod_{x} J (x)} {min}_{p \in P, k \in \prod_{x} K (x)} \sum_{x, y} p (x) \cdot k_{x} (y) \cdot (ln (\frac{p (x)}{q (x)}) + ln (\frac{k_{x} (y)}{j_{x} (y)}))$

$= {max}_{q \in Q, j \in \prod_{x} J (x)} {min}_{p \in P, k \in \prod_{x} K (x)} (\sum_{x, y} p (x) \cdot k_{x} (y) \cdot ln (\frac{p (x)}{q (x)}) + \sum_{x, y} p (x) \cdot k_{x} (y) \cdot ln (\frac{k_{x} (y)}{j_{x} (y)}))$

$= {max}_{q \in Q, j \in \prod_{x} J (x)} {min}_{p \in P, k \in \prod_{x} K (x)} ((\sum_{x} p (x) \cdot ln (\frac{p (x)}{q (x)}) \cdot (\sum_{y} k_{x} (y)))$
$+ (\sum_{x} p (x) \cdot (\sum_{y} k_{x} (y) \cdot ln (\frac{k_{x} (y)}{j_{x} (y)}))))$

At this point we can abbreviate the KL-divergence, and observe that we have a multiplication by 1, to get:

$= {max}_{q \in Q, j \in \prod_{x} J (x)} {min}_{p \in P, k \in \prod_{x} K (x)} (d_{K L} (p | q) + \sum_{x} p (x) \cdot d_{K L} (k_{x} | j_{x}))$

And then pack up the expectation

$= {max}_{q \in Q, j \in \prod_{x} J (x)} {min}_{p \in P, k \in \prod_{x} K (x)} (d_{K L} (p | q) + E_{p} [λ x . d_{K L} (k_{x} | j_{x})])$

Then, with the choice of $q$ and $j$ fixed, we can move the choice of the $k_{x}$ all the way inside, to get:

$= {max}_{q \in Q, j \in \prod_{x} J (x)} {min}_{p \in P} (d_{K L} (p | q) + E_{p} [λ x . {min}_{k_{x} \in K (x)} d_{K L} (k_{x} | j_{x})])$

Now, there's something else we can notice. When choosing $j$ , it doesn't matter what $p$ is selected, you want to take every $x$ and maximize the quantity inside the expectation, that consideration selects your $j_{x}$ . So, then we can get:

$= {max}_{q \in Q} {min}_{p \in P} (d_{K L} (p | q) + E_{p} [λ x . {max}_{j_{x} \in J (x)} {min}_{k_{x} \in K (x)} d_{K L} (k_{x} | j_{x})])$

And pack up the KL-divergence to get:

$= {max}_{q \in Q} {min}_{p \in P} (d_{K L} (p | q) + E_{p} [λ x . D_{K L} (K (x) | J (x))])$

And distribute the min to get:

$\geq {max}_{q \in Q} ({min}_{p \in P} d_{K L} (p | q) + {min}_{p \in P} E_{p} [λ x . D_{K L} (K (x) | J (x))])$

And then, we can pull out that fixed quantity and get:

$= {max}_{q \in Q} {min}_{p \in P} d_{K L} (p | q) + {min}_{p \in P} E_{p} [λ x . D_{K L} (K (x) | J (x))]$

And pack up the KL-divergence to get:

$= D_{K L} (P | Q) + {min}_{p \in P} E_{p} [λ x . D_{K L} (K (x) | J (x))]$

Proposition 6: $D_{K L} (P_{1} \times P_{2} | Q_{1} \times Q_{2}) = D_{K L} (P_{1} | Q_{1}) + D_{K L} (P_{2} | Q_{2})$

To do this, we'll go through the proof of proposition 5 to the first place where we have an inequality. The last step before inequality was:

$D_{K L} (P ⋉ K | Q ⋉ J) = {max}_{q \in Q} {min}_{p \in P} (d_{K L} (p | q) + E_{p} [λ x . D_{K L} (K (x) | J (x))])$

Now, for a direct product, it's like semidirect product but all the $K (x)$ and $J (x)$ are the same infradistribution, so we have:

$D_{K L} (P_{1} \times P_{2} | Q_{1} \times Q_{2}) = {max}_{q_{1} \in Q_{1}} {min}_{p_{1} \in P_{1}} (d_{K L} (p_{1} | q_{1}) + E_{p_{1}} [λ x . D_{K L} (P_{2} | Q_{2})])$

Now, this is a constant, so we can pull it out of the expectation to get:

$= {max}_{q_{1} \in Q_{1}} {min}_{p_{1} \in P_{1}} (d_{K L} (p_{1} | q_{1}) + D_{K L} (P_{2} | Q_{2}))$

$= {max}_{q_{1} \in Q_{1}} {min}_{p_{1} \in P_{1}} d_{K L} (p_{1} | q_{1}) + D_{K L} (P_{2} | Q_{2}) = D_{K L} (P_{1} | Q_{1}) + D_{K L} (P_{2} | Q_{2})$

Proposition 7: $D_{K L} (g_{*} (P) | g_{*} (Q)) \leq D_{K L} (P | Q)$

For this, we'll need to use the Disintegration Theorem (the classical version for probability distributions), and adapt some results from Proposition 5. Let's show as much as we can before showing this.

$D_{K L} (g_{*} (P) | g_{*} (Q)) = {max}_{q^{'} \in g_{*} (Q)} {min}_{p^{'} \in g_{*} (P)} d_{K L} (p^{'} | q^{'}) = {max}_{q \in Q} {min}_{p \in P} d_{K L} (g_{*} (p) | g_{*} (q))$

Now, hypothetically, if we had

$d_{K L} (g_{*} (p) | g_{*} (q)) \leq d_{K L} (p | q)$

then we could use that result to get

$\leq {max}_{q \in Q} {min}_{p \in P} d_{K L} (p | q) = D_{K L} (P | Q)$

and we'd be done. So, our task is to show

$d_{K L} (g_{*} (p) | g_{*} (q)) \leq d_{K L} (p | q)$

for any pair of probability distributions $p$ and $q$ . Now, here's what we'll do. The $p$ and $q$ gives us probability distributions over $X$ , and the $g_{*} (p)$ and $g_{*} (q)$ are probability distributions over $Y$ . So, let's take the joint distribution over $X \times Y$ given by selecting a point from $X$ according to the relevant distribution and applying $g$ . By the classical version of the disintegration theorem, we can write it either way as starting with the marginal distribution over $X$ and a semidirect product to $Y$ , or by starting with the marginal distribution over $Y$ and you take a semidirect product with some markov kernel to $X$ to get the joint distribution. So, we have:

$d_{K L} (p ⋉ g | q ⋉ g) = d_{K L} (g_{*} (p) ⋉ k | g_{*} (q) ⋉ j)$

for some Markov kernels $k, j : Y \to X$ . Why? Well, the joint distribution over $X \times Y$ is given by $p ⋉ g$ or $q ⋉ g$ respectively (you have a starting distribution, and $g$ lets you take an input in $X$ and get an output in $Y$ ). But, breaking it down the other way, we start with the marginal distribution of those joint distributions on $Y$ (the pushforward w.r.t. $g$ ), and can write the joint distribution as semidirect product going the other way. Basically, it's just two different ways of writing the same distributions, so that's why KL-divergence doesn't vary at all.

Now, it is also a fact that, for semidirect products (sorry, we're gonna let $p, q, k, j$ be arbitrary here and unconnected to the fixed ones we were looking at earlier, this is just a general property of semidirect products), we have:

$d_{K L} (p ⋉ k | q ⋉ j) = d_{K L} (p | q) + E_{p} [d_{K L} (k (x) | j (x))]$

To see this, run through the proof of Proposition 5, because probability distributions are special cases of infradistributions. Running up to right up before the inequality, we had

$D_{K L} (P ⋉ K | Q ⋉ J) = {max}_{q \in Q} {min}_{p \in P} (d_{K L} (p | q) + E_{p} [λ x . D_{K L} (K (x) | J (x))])$

But when we're dealing with probability distributions, there's only one possible choice of probability distribution to select, so we just have

$d_{K L} (p ⋉ k | q ⋉ j) = d_{K L} (p | q) + E_{p} [d_{K L} (k (x) | j (x))]$

Applying this, we have:

$d_{K L} (p | q) + E_{p} [d_{K L} (g (x) | g (x))] = d_{K L} (p ⋉ g | q ⋉ g)$
$= d_{K L} (g_{*} (p) ⋉ k | g_{*} (q) ⋉ j) = d_{K L} (g_{*} (p) | g_{*} (q)) + E_{g_{*} (p)} [d_{K L} (k (y) | j (y))]$

The first equality is our expansion of semidirect product for probability distributions, second equality is the probability distributions being equal, and third equality is, again, expansion of semidirect product for probability distributions. Contracting the two sides of this, we have:

$d_{K L} (p | q) + E_{p} [d_{K L} (g (x) | g (x))] = d_{K L} (g_{*} (p) | g_{*} (q)) + E_{g_{*} (p)} [d_{K L} (k (y) | j (y))]$

Now, the KL-divergence between a distribution and itself is 0, so the expectation on the left-hand side is 0, and we have

$d_{K L} (p | q) = d_{K L} (g_{*} (p) | g_{*} (q)) + E_{g_{*} (p)} [d_{K L} (k (y) | j (y))] \geq d_{K L} (g_{*} (p) | g_{*} (q))$

And bam, we have $d_{K L} (p | q) \geq d_{K L} (g_{*} (p) | g_{*} (q))$ which is what we needed to carry the proof through.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

11

11

Introduction

Section 0: Notation Glossary

Section 1: Generalizations

Section 2: Types of Infradistributions

Section 3: Operations on Infradistributions

Section 4: Ultradistributions

Section 5: Conditional Probability/Updates

Section 6: IKR-distance

Section 7: Markov Processes

Section 8: Entropy

Section 9: Future directions