Fun fact: the KL divergence of distribution $P[X]$ from distribution $Q[X]$ is convex in the pair $(P, Q)$. Writing it out:

$D_{KL}(\lambda P_1[X] + (1 - \lambda) P_2[X] \,\|\, \lambda Q_1[X] + (1 - \lambda) Q_2[X]) \leq \lambda D_{KL}(P_1[X] \,\|\, Q_1[X]) + (1 - \lambda) D_{KL}(P_2[X] \,\|\, Q_2[X])$

with $0 \leq \lambda \leq 1$.

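This is particularly interesting if we take $P$ and $Q$ to be two different models, and take the indices 1, 2 to be different values of another random variable $Y$ with distribution $P[Y]$ given by $(\lambda, 1 - \lambda)$, so that $P_i[X] = P[X \mid Y = i]$ and $Q_i[X] = Q[X \mid Y = i]$, with both models weighting the values of $Y$ by the same $P[Y]$. The mixtures on the left-hand side are then just the marginals $P[X]$ and $Q[X]$, and the above inequality becomes: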

$D_{KL}(P[X] \,\|\, Q[X]) \leq E_Y[D_{KL}(P[X \mid Y] \,\|\, Q[X \mid Y])]$

In English: the divergence between our models of the $X$-distribution ignoring $Y$ is no larger than the average divergence between our models of the $X$-distribution given $Y$. This is true regardless of what the two models are - any approximation of the observable distribution improves (or at least gets no worse) when we integrate out a hidden variable, compared to fixing the value of the hidden variable.
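A quick numerical sanity check of the inequality, as a minimal sketch: it assumes a discrete $X$ with four outcomes and a binary $Y$, and draws the conditionals and $\lambda$ at random. The helper names (`kl`, `random_dist`, `mix`, `check_once`) are made up for illustration.

```python
import math
import random

def kl(p, q):
    """KL divergence D(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_dist(n, rng):
    """Draw a strictly positive random distribution over n outcomes."""
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def mix(lam, a, b):
    """Mixture lam * a + (1 - lam) * b, i.e. the marginal over a binary Y."""
    return [lam * x + (1 - lam) * y for x, y in zip(a, b)]

def check_once(rng, n_x=4):
    """Return (divergence ignoring Y, average divergence given Y) for one random draw."""
    lam = rng.random()
    p1, p2 = random_dist(n_x, rng), random_dist(n_x, rng)  # P[X | Y=1], P[X | Y=2]
    q1, q2 = random_dist(n_x, rng), random_dist(n_x, rng)  # Q[X | Y=1], Q[X | Y=2]
    marginal = kl(mix(lam, p1, p2), mix(lam, q1, q2))
    conditional = lam * kl(p1, q1) + (1 - lam) * kl(p2, q2)
    return marginal, conditional

rng = random.Random(0)
results = [check_once(rng) for _ in range(1000)]

# The marginal divergence should never exceed the conditional average
# (up to floating-point rounding).
print(all(m <= c + 1e-12 for m, c in results))
```

Random draws are of course not a proof, but a failure here would point to a mistake in how the inequality was written down.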

Of course, this doesn't say anything about how much the approximation improves. Presumably for bad approximations, the divergence will not converge to anywhere near zero as we integrate more and more hidden variables. And if the hidden variable doesn't actually interact with the observables significantly, then presumably the divergence decrease will be near-zero.
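To make the last point concrete, here is a small sketch under one particular reading of "doesn't interact": the conditionals $P[X \mid Y]$ and $Q[X \mid Y]$ barely depend on $Y$. The specific distributions and the knob `eps` (how strongly the $Y = 2$ conditionals differ from the $Y = 1$ ones) are arbitrary choices for illustration, not anything canonical.

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mix(w, a, b):
    """Mixture w * a + (1 - w) * b."""
    return [w * x + (1 - w) * y for x, y in zip(a, b)]

LAM = 0.5                                 # P[Y = 1]
P1, Q1 = [0.7, 0.3], [0.5, 0.5]           # conditionals given Y = 1
P_ALT, Q_ALT = [0.2, 0.8], [0.6, 0.4]     # what Y = 2 looks like at full strength

def gap(eps):
    """Conditional average minus marginal divergence, as a function of Y's influence."""
    p2 = mix(eps, P_ALT, P1)              # eps = 0 means Y has no effect on X under P
    q2 = mix(eps, Q_ALT, Q1)              # ... and likewise under Q
    conditional = LAM * kl(P1, Q1) + (1 - LAM) * kl(p2, q2)
    marginal = kl(mix(LAM, P1, p2), mix(LAM, Q1, q2))
    return conditional - marginal

for eps in (0.0, 0.01, 0.1, 1.0):
    print(eps, gap(eps))
```

With `eps = 0` the two sides coincide, so integrating out $Y$ buys nothing; the gap only opens up as $Y$ starts to change the conditionals.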

So when would we expect this to matter?

I'd expect it to matter mainly when the observable $X$ consists of multiple variables which are "far apart" in a large model - i.e. there are many hidden variables mediating the interactions between observables. In other words, I'd expect this phenomenon to mainly be relevant to information at a distance. It's a hint that information at a distance, in complex systems, converges to some sort of universal behavior/properties, which is simpler in some sense than the full fine-grained model.