Fun fact: the KL divergence of distribution $P$ from distribution $Q$ is convex in the pair $(P, Q)$. Writing it out:

$$D_{KL}\big(\lambda P_1 + (1-\lambda) P_2 \,\big\|\, \lambda Q_1 + (1-\lambda) Q_2\big) \le \lambda \, D_{KL}(P_1 \,\|\, Q_1) + (1-\lambda) \, D_{KL}(P_2 \,\|\, Q_2)$$

with $\lambda \in [0, 1]$.

This is particularly interesting if we take $P$ and $Q$ to be two different models, and take the indices 1, 2 to be different values of another random variable $Y$ with distribution given by $\lambda$. In that case, the above inequality becomes:

$$D_{KL}\big(P[X] \,\big\|\, Q[X]\big) \le E_Y\Big[ D_{KL}\big(P[X|Y] \,\big\|\, Q[X|Y]\big) \Big]$$

In English: the divergence between our models of the $X$-distribution *ignoring* $Y$ is at least as small as the average divergence between our models of the $X$-distribution *given* $Y$. This is true regardless of what the two models are - *any* approximation of the observable distribution improves (or gets no worse) when we integrate out a hidden variable, compared to fixing the value of the hidden variable.
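A quick numerical sanity check of that claim, as a minimal sketch rather than a proof. The labels here are assumptions for illustration: `p_cond` and `q_cond` hold the two models' distributions over the observable given each value of the hidden variable, and `lam` is the hidden variable's distribution, which both models are assumed to share. All helper names are hypothetical.

```python
import math
import random

def kl(p, q):
    """KL divergence of p from q, in nats, for discrete distributions (q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_dist(n, rng):
    """A random distribution over n outcomes with strictly positive entries."""
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def mix(lam, conds):
    """Integrate out the hidden variable: sum over y of lam[y] * conds[y]."""
    n = len(conds[0])
    return [sum(l * c[i] for l, c in zip(lam, conds)) for i in range(n)]

def marginal_and_average_kl(lam, p_cond, q_cond):
    """Return (divergence ignoring the hidden variable, average divergence given it)."""
    marg = kl(mix(lam, p_cond), mix(lam, q_cond))
    avg = sum(l * kl(p, q) for l, p, q in zip(lam, p_cond, q_cond))
    return marg, avg

rng = random.Random(0)
k, n = 3, 5  # hidden variable takes k values, observable takes n values
lam = random_dist(k, rng)
p_cond = [random_dist(n, rng) for _ in range(k)]
q_cond = [random_dist(n, rng) for _ in range(k)]

marg, avg = marginal_and_average_kl(lam, p_cond, q_cond)
print(f"ignoring hidden variable: {marg:.4f}, averaged given hidden variable: {avg:.4f}")
```

With more than two values of the hidden variable this relies on the many-component version of the convexity inequality, which follows from the two-component one by induction.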

Of course, this doesn't say anything about *how much* the approximation improves. Presumably for bad approximations, the divergence will not converge to anywhere near zero as we integrate more and more hidden variables. And if the hidden variable doesn't actually interact with the observables significantly, then presumably the divergence decrease will be near-zero.
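Two hand-picked toy cases illustrate the range of possible improvement. This is only a sketch with made-up numbers and hypothetical helper names: in the first case the hidden variable is independent of the observable under both models, so integrating it out changes nothing; in the second, the two models disagree badly given each hidden value yet agree exactly once it is integrated out.

```python
import math

def kl(p, q):
    """KL divergence of p from q, in nats, for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mix(lam, conds):
    """Integrate out the hidden variable: sum over y of lam[y] * conds[y]."""
    return [sum(l * c[i] for l, c in zip(lam, conds)) for i in range(len(conds[0]))]

def improvement(lam, p_cond, q_cond):
    """Average divergence given the hidden variable, minus divergence ignoring it."""
    avg = sum(l * kl(p, q) for l, p, q in zip(lam, p_cond, q_cond))
    return avg - kl(mix(lam, p_cond), mix(lam, q_cond))

# Case 1: the hidden variable does not interact with the observable.
# Each model's conditional is the same for every hidden value, so the improvement is zero
# (up to floating-point rounding).
gap_independent = improvement(
    [0.25, 0.75],
    [[0.7, 0.3], [0.7, 0.3]],
    [[0.4, 0.6], [0.4, 0.6]],
)

# Case 2: the models are mirror images of each other given the hidden variable,
# but both marginals are uniform, so the divergence drops all the way to zero.
gap_mirrored = improvement(
    [0.5, 0.5],
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.1, 0.9], [0.9, 0.1]],
)

print(f"independent hidden variable: improvement = {gap_independent:.6f}")
print(f"mirrored conditionals:       improvement = {gap_mirrored:.6f}")
```

In the mirrored case the improvement equals the full average conditional divergence, $0.8 \ln 9$ nats, since nothing is left after integrating out the hidden variable.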

So when would we expect this to matter?

I'd expect it to matter mainly when the observable consists of multiple variables which are "far apart" in a large model - i.e. there are many hidden variables mediating the interactions between observables. In other words, I'd expect this phenomenon to mainly be relevant to information at a distance. It's a hint that information at a distance, in complex systems, converges to some sort of universal behavior/properties, which is simpler in some sense than the full fine-grained model.