Intuitions on Universal Behavior of Information at a Distance

johnswentworth

What’s special about information at a distance? Some examples of what I mean:

What information about a little patch of pixels in a photo is relevant to another little patch of pixels far away in the same photo?
Relationship between exact voltages in two transistors which are far apart on the same CPU
When planning a road trip from San Francisco to LA, which details of the route within SF are relevant to the details of the route within LA?

The unifying feature in all these examples is that we imagine high-level observables which are “far apart” within some low-level model, in the sense that interactions between any two observables are mediated by many layers of hidden variables. In an undirected causal model, we could draw the picture like this:

There’s a handful of observed variables (black nodes), whose interactions are mediated by many layers (dotted groups) of unobserved variables (white nodes). Somewhere in the middle, everything connects up. We’re interested in the limit where the number of layers of hidden variables between any two observables is large.

In the limit of many hidden variables intermediating interactions, I suspect that there’s some kind of universal behavior going on - i.e. all models (within some large class) converge to behavior with some common features.

This post will give some hand-wavy, semi-mathematical arguments about what that behavior looks like. The main goal is to show some intuitive pictures; later posts can more carefully build out the details and boundaries of the phenomena sketched here.

Outline:

Background intuition on information distributed across variables, which is not contained in any one variable. Intuitively, it seems like this information should be quickly wiped out in large, noisy systems.
How to quantify distributed information.
Characterization of systems with zero distributed information.
Quick test: in a large system of normal variables, do we actually see the expected behavior? We do, and it looks a lot like a prototypical phase change in complex systems theory.

Intuition on Distributed Information

Fun fact: with three random variables , $Y$ , $Z$ , it’s possible for $X$ to contain zero information about $Z$ , and $Y$ to contain zero information about $Z$ , but for the pair $(X, Y)$ together to contain nonzero information about $Z$ .

Classic example: $X$ and $Y$ are uniform random bits, and $Z = X \oplus Y$ , the exclusive-or of the other two. A priori, the distribution $P [Z]$ is 50/50. If we learn that $X$ is 0 (without knowing $Y$ ) then it’s still 50/50; if we learn that $X$ is 1 (without knowing $Y$ ) then it’s also still 50/50. Likewise for $Y$ ; learning either variable by itself does not change our distribution for $Z$ . Noise in either input wipes out the signal from the other input.

We can also generalize this example to more variables: $X_{1} \dots X_{n}$ are uniform random bits, and $Z = X_{1} \oplus \dots \oplus X_{n}$ . If there’s even just one input $X_{i}$ whose value we don’t know, then we have zero information about $Z$ , even if we know the values of all the other inputs.

On the other hand, it is not the case in general that all information is always wiped out by noise in a few variables. Suppose we have a whole bunch of earthquake-sensors in an area, and one of them drops offline; this does not significantly inhibit our ability to detect earthquakes in the area. In this case, each individual sensor contains information on its own - there’s even redundant information across sensors, since they’re all presumably highly correlated.

We have two prototypical pictures here: one in which information is stored in the relationship between variables, and another in which information is stored in the variables themselves. Intuitively, information stored in relationships seems to be much more sensitive to noise - uncertainty in even just one variable can wipe out the signal from all the others. Thus, a hypothesis: in a large complex system, with lots of noisy unobserved variables mediating observable interactions, information stored in many-variable relationships tends to be wiped out.

I’ll call information stored in relationships “distributed information”. We can re-state our hypothesis as: in a complex system, noise tends to wipe out many-variable distributed information. (In an earlier draft, I called it “entanglement information” in analogy to quantum; this produces the familiar-sounding-to-physicists hypothesis that “noise tends to wipe out many-variable entanglement”.)

A Measure of Distributed Information

In order to formalize this idea, we first need an information-theoretic notion of “information stored in the relationship between variables”. Although the notion is qualitatively discussed in the context of multivariate mutual information, I have surprisingly not seen an actual formulation which separates distributed info from redundancy; multivariate mutual information just adds the two together.

After trying a few approaches, here’s what I’ve settled on.

We have variables $X_{1} . . . X_{n}$ and $Y$ ; we want to talk about the information about $Y$ contained in $X$ , i.e. the mutual information $I (X; Y)$ . We’d like to decompose this into two parts: the information contained in individual variables $X_{i}$ , and the information stored only across multiple variables. But we can’t just add up the information $I (X_{i}, Y)$ stored in each individual variable to estimate the non-distributed info, because some of that information is redundant - we’d be double-counting.

So, strategy: let’s build a model which breaks the entanglement, forcibly throw out the distributed info, without throwing out any information in the individual variables. How do we do that? Start with the model which generated all of our variables (i.e. the experiment). Now, imagine running $n$ independent copies of that model (i.e. $n$ runs of the experiment), but condition on the outcome being $Y$ - i.e. select only runs with the same $Y$ . Finally, imagine that each $X_{i}$ came from a different run. This breaks the relationships between the $X_{i}$ , while still maintaining all the information between any individual $X_{i}$ and $Y$ .

Write it out, and this model is essentially a Naive Bayes approximation: it says that

$P [Y | X] \approx ~ P [Y | X] = \frac{1}{Z (X)} P [Y] \prod_{i} P [X_{i} | Y]$

… where $Z (X)$ is a normalizer. This makes sense - the Naive Bayes approximation ignores information contained in interactions between the $X_{i}$ .

Now, we define the “distributed information” $I_{D}$ as the amount of information thrown out by this approximation:

$I_{D} (Y; X_{1}, \dots, X_{n}) := E_{X} [D_{K L} (P [Y | X] | | ~ P [Y | X])]$

… where $D_{K L}$ is the KL divergence. The “non-distributed information” is then whatever mutual information is not distributed, i.e. $I (Y; X) - I_{D} (Y; X_{1}, \dots, X_{n})$ .

Now, let’s imagine that $X_{1} \dots X_{k}$ are observables, and $X_{k + 1} \dots X_{n}$ are hidden. How does the distributed information of $X_{1} \dots X_{k}$ about $Y$ when the hidden variable values are known compare to the distributed information when the hidden variable values are unknown? We’d guess that unknown hidden variables wipe out some of the distributed information in the observables, and we can prove that using the convexity of KL-divergence:

$I_{D} (Y; X_{1}, \dots, X_{k}) = E_{X_{1} . . . X_{k}} [D_{K L} (P [Y | X_{1} . . . X_{k}] | | ~ P [Y | X_{1} . . . X_{k}])] \leq E_{X_{1} . . . X_{k}} [E_{X_{k + 1} . . . X_{n}} [D_{K L} (P [Y | X] | | ~ P [Y | X])]] = E_{X_{k + 1} . . . X_{n}} [I_{D} (Y; X_{1}, \dots, X_{k} | X_{k + 1} . . . X_{n})]$

In other words: distributed information can only decrease as we integrate hidden variables out of the model. Just as hypothesized, those noisy unobserved variables wipe out distributed information.

Another way to look at it: as we integrate out hidden variables, Naive Bayes becomes a better approximation for the remaining variables.

Now, we haven’t actually shown how much distributed information is lost as we integrate out hidden variables. But we’ll punt that problem to a future post. For now, suppose that as we integrate out unobserved variables, all distributed information falls to zero. What would that distribution look like?

Zero Distributed Information

Just based on the definition, “zero distributed information” of $X_{1} \dots X_{n}$ about $Y$ means that the Naive Bayes approximation is exact:

$P [Y | X] = \frac{1}{Z (X)} P [Y] \prod_{i} P [X_{i} | Y]$

But we don’t just want zero distributed information of some particular variables with respect to one other variable. We want zero information of any subset of observable variables with respect to any other variable:

$P [X_{i} | X_{S}] = \frac{1}{Z (X_{S})} P [X_{i}] \prod_{j \in S} P [X_{j} | X_{i}]$

… where $X_{S}$ is any subset of the observable variables, and $X_{i}$ is any observable not in $X_{S}$ . (Without loss of generality, I’ll usually use the name $Y$ for the variable $X_{i}$ whose probability we’re applying Naive Bayes to, in order to make the formulas a bit easier to follow.)

At this point, we can throw math at the problem. Skip the next two subsections (“First Trick” and “Second Trick”) if you just want the summary and intuition.

First Trick

The first big condition we need to satisfy is that Naive Bayes still holds after integrating out one more variable, i.e. it holds for both $P [Y | X_{S}]$ and $P [Y | X_{S^{'}}]$ , where $S^{'} = S \cup {j}$ for some index $j \notin S, j \neq i$ . We can calculate via the usual rules of probability:

$P [Y | X_{S}] = \sum_{X_{j}} P [Y | X_{S^{'}}] P [X_{j} | X_{S}]$

Applying Naive Bayes to both sides and cancelling common terms then yields:

$\frac{1}{Z (X_{S})} = \sum_{X_{j}} \frac{1}{Z (X_{S^{'}})} P [X_{j} | Y] P [X_{j} | X_{S}]$

Key thing to notice: the left-hand-side does not depend on $Y$ , so the dependence on $Y$ on the right must somehow cancel out. There are two “obvious” ways for this to happen:

$X_{j}$ is independent of $Y$ , so $P [X_{j} | Y] = P [X_{j}]$ , or
$Z (X_{S^{'}}) = P [X_{j} | X_{S}] α (X_{S})$ for some $α$ , so most terms factor out of the sum and we’re left with $\sum_{X_{j}} P [X_{j} | Y] = 1$ (since probabilities sum to 1)

These two can be combined - e.g. if $X_{j}$ has multiple components, then one component could be independent of $Y$ , and $Z (X_{S^{'}})$ could be proportional to the distribution of the other component. There may also be other solutions - I haven’t done all the math yet, especially in the continuous case - but I don’t expect that there’s any qualitatively different behavior.

Anyway, let’s make a note that there can be independent variation in the variables, and then look closer at the second condition $Z (X_{S^{'}}) = P [X_{j} | X_{S}] α (X_{S})$ . If we have a set of observables, none of which are pairwise independent, then we can inductively apply this condition to each of them one-by-one. I won’t write that all out, but it turns out that (barring independences):

$Z (X_{S}) = P [X_{S}]$

Second Trick

Still assuming no pairwise independencies, we have

$P [Y | X_{S}] = \frac{1}{P [X_{S}]} \prod_{j \in S} P [X_{j} | Y] P [Y]$

… or, moving the $P [X_{S}]$ to the other side,

$P [Y, X_{S}] = \prod_{j \in S} P [X_{j} | Y] P [Y]$

In other words: we can choose any set of the observables, pick one of them to be $Y$ , and get a factorization of the joint distribution. This needs to work no matter which one we pick to be $Y$ , so our joint distribution has to factor in multiple different ways.

Let’s play with this a bit.

Pick any three observables - without loss of generality, call them $X_{1}, X_{2}, X_{3}$ . If we pick $X_{1}$ to be $Y$ , then we get

$P [X_{1}, X_{2}, X_{3}] = P [X_{2} | X_{1}] P [X_{3} | X_{1}] P [X_{1}]$

… but we could just as easily pick $X_{2}$ to be $Y$ :

$P [X_{1}, X_{2}, X_{3}] = P [X_{1} | X_{2}] P [X_{3} | X_{2}] P [X_{2}]$

Equate these two, and we find:

$P [X_{3} | X_{1}] = P [X_{3} | X_{2}]$

… which is rather unusual. Combine this with the Naive Bayes condition and apply Bayes’ Rule a couple times:

$P [X_{1} | X_{2}, X_{3}] = \frac{1}{P [X_{2}, X_{3}]} P [X_{2} | X_{1}] P [X_{3} | X_{1}] P [X_{1}] = \frac{1}{P [X_{2}, X_{3}]} P [X_{2} | X_{1}] P [X_{3} | X_{2}] P [X_{1}] = \frac{1}{P [X_{2}]} P [X_{2} | X_{1}] P [X_{1}] = P [X_{1} | X_{2}]$

This proof also generalizes inductively to more variables: $P [X_{1} | X_{2} \dots X_{k}] = P [X_{1} | X_{2}]$ .

What does all this mean? It means that:

Any variable contains exactly the same information about any other variable - all non-independent information is redundant.
Any variable is independent of any other variable given any third variable.

More intuitively, it says that any information which is not strictly local (i.e. not independent of all other variables) must be global (i.e. present in ALL variables).

Summary and Intuition for Zero Distributed Information

Here’s what a system with zero distributed information looks like:

Variables and components of variables can always be independent of the rest of the system.
Within any set of variables without any pairwise independencies, all information is either strictly local (i.e. independent of other variables) or strictly global (i.e. present in all variables). Any variable is independent of any other variable given any third variable.

This isn’t a complete picture yet - we still need to better fill out the cases where some variables are pairwise independent and others are not. But it seems to point toward the idea that any piece of information is globally present within some set of variables, and completely independent of anything outside of that set.

Let’s see what this looks like in a more concrete example.

Quick Test: System of Normals

We’ll build a system of normal random variables. Each variable $X_{i}$ is given by

$X_{i} = Z_{i} + \sum_{j < i} θ_{i j} X_{j}$

… with $Z_{i}$ a standard random normal, and each $θ_{i j}$ either zero with probability $1 - p$ (used to tune density of connections) or uniform between $0$ and $s$ (used to tune strength of connections). We make the system reasonably large, with reasonably low connection-density $p$ , and randomly pick some subset of the $X_{i}$ to consider “observed”.

(Note: this is not a particularly careful rendition of the assumptions made earlier. That’s fine; it’s supposed to be a quick test, and sensitivity to the exact conditions is something we care about anyway.)

Assuming the zero distributed information picture applies, what should this look like?

Well, we should see a mix of two behaviors:

Independence (i.e. diagonal covariance matrix)
Perfect correlation (i.e. rank-one covariance matrix)

Because each individual variable is only one-dimensional and normal, there isn’t really “enough room” for subcomponents of a variable to display different behaviors - each variable should be perfectly correlated with some other variables, and independent of everything else.

So what do we see when we run this?

Well, at very low connection density and strength, of course everything is basically independent. As we turn up the density and strength, we see a correlation matrix like this:

(100 variables from a 1000 variable system, p=0.05, s=0.3)

See those strong horizontal and vertical lines? That’s what a rank-one matrix looks like. See the diagonal? Those are components independent of the globally-correlated rank-one components. In this case, the rank-one component dominates the behavior of the system. Quantifying via the Frobenius norm, the rank-one component accounts for about 98.9% of the correlation.

We can roughly quantify the phenomenon using diagonal and rank-one contributions to the Frobenius norm of the correlation matrix. This only captures the biggest cluster of correlated variables (not any other clusters of perfectly-correlated variables), but looking at the output, the biggest component looks heavily dominant anyway. As we adjust the connection strength, we get this picture:

(200 variables from a 2000 variable system, p=0.05)

This looks like a classic phase transition: at low connection strength, everything varies approximately independently. As the connection strength increases, the system rapidly jumps to a state where most variables are near-perfectly correlated. (Physical examples of this sort of qualitative behavior include melting/freezing crystals, or magnetization of iron as we adjust the temperature.)

So this provides one idea of what “zero distributed information” behavior can look like. Most of our variables are near-completely uncorrelated in one regime, and near-perfectly correlated in another. In either case, any pair of variables is independent given any third variable.

Of course, this is still just scratching the surface of possible behaviors under zero distributed info. Next planned post will come at the problem from a different direction, and hopefully provide a stronger mathematical foundation for exploring what happens when models contain large numbers of unobserved variables.