What is it that makes the concept of “pencil” a good abstraction?
One way to frame it: there are lots of “pencils” in the world - lots of little chunks of the world which I could look at which all contain information about pencils-in-general. Many different places from which I could gain information about “pencils”, and many different places where I can apply that information to make predictions. If I don’t know that pencils are usually made of wood with a graphite core, there are many different chunks of the world which I could look at to gain that information. If I do know that pencils are usually made of wood with a graphite core, there are many different chunks of the world where I could apply that information to predict the materials from which something is made.
Alternatively, consider a gear, spinning in a gearbox. What makes the gear's rotational speed a good abstraction? Well, there are lots of little patches of metal comprising the gear. If I don't know the gear's rotation speed, I could precisely estimate it from any of the little patches. If I do know the gear's rotation speed, I can use it to predict the rotation of many different little patches of metal within the gear.
This seems like an intuitively-reasonable notion of what makes abstractions “good” in general: there are many different places from which we can learn the information, and many different places where we can apply the information to make predictions. In other words, a good abstraction is highly redundant: it appears in many different places, and in any of those places we can use the abstract information to make predictions and/or gain more information about the abstraction in general.
In this post I’ll sketch out one way to formalize this idea mathematically, and show that it’s equivalent to the formalization of abstraction as information-at-a-distance. In particular, in the gear example: conditional on the highly redundant information (i.e. the overall rotation of the gear), the low-level rattling of far-apart chunks of metal is statistically independent. More generally, redundancy yields the same high-level abstract information as the Telephone Theorem, but in a mathematically-cleaner form, and without the need for different summaries between different variables.
Advice for readers: I try to keep the more-dense math in a few specific sections and the appendices, which can all be skimmed/skipped if you just want a conceptual picture.
Epistemic status: I’m aiming for physics-level rigor here. The proofs involve some shenanigans with infinite limits, and I expect them to contain subtle errors. However, I also expect that the results will accurately predict reality in practice, and that whatever subtle errors the proofs contain can be patched by the sort of mathematician who enjoys dealing with tricky limit shenanigans.
Thankyou to Rob Miles and Adam Shimi for suggesting I try Excalidraw.
Basic Idea: Redundant Information Is Conserved Under Resampling
Suppose I sample the genomes of two random humans, and What information is redundant across these two random variables?
One intuitively-reasonable answer: if I threw away one of the two sequences, then the redundant information is whatever I could still figure out from the other. More generally, if I have a whole bunch of genomes from random humans, , the redundant information is whatever I could still figure out after throwing one of them away.
To formalize “what I could still figure out after throwing one away”, we’ll use the idea of resampling: I throw away the value of , and then sample a new value for , using my original probability distribution conditioned on all the other genomes. So, for instance, I throw away , then I look at all the other genomes and see that in most places they’re the same - so when I sample my new I know that it should match all the other genomes in all those places. However, there are a few locations where the other genomes differ, e.g. maybe 10% of them contain a particular mutation. So, when I sample my new , it will contain that mutation with roughly 10% probability (assuming there’s enough data to swamp the impact of my priors).
Now I repeat this process many times, each time throwing away one randomly-chosen genome and resampling it conditional on the others. Intuitively, I expect that the information conserved by this process will be whatever information is highly redundant, so that approximately-zero loss occurs at each step.
- Start with a bunch of random variables , a joint distribution , and a value for each variable. (See the Equations section below for more details on the notation.)
- At each “timestep”, pick one variable at random, throw away its value, and resample it conditional on all the other variable values.
- Run this process for a long time.
- See what information about the initial conditions is conserved.
Conceptual Example: Gear
Suppose I have a gear, spinning around in a gearbox. I look at a few nanometer-size patches on the gear’s surface, just a hundred-ish atoms each, and measure the (approximate) position and velocity of each atom in each little patch. Notation: random variable gives the positions and velocities of each atom in patch .
What information is redundant across these different patches? Intuitively, I can look at any patch and average together the rotational velocities of the atoms about the gear’s center to get a reasonably-precise estimate of the gear’s overall rotation speed. If I throw away , I can still precisely estimate the gear’s overall rotation from the other ’s. Then, when I resample , I will resample it so that the average rotational velocity of the atoms in matches the gear’s overall rotation.
Notation: we’ll use for the variable-values after t steps of this process, and for variable-values after running the process for an arbitrarily long time. So, we’re mainly interested in the mutual information between and . The resampling process itself specifies the distribution .
We’ll use “model” variables and to distinguish probabilities from the “base” distribution (which is only over ) vs probabilities from the resampling process (which is over ). One potential point of confusion: both models contain a variable called “”, but these two variables are “in different scopes” (in the programmer’s sense); “” means something different depending on which model it’s in. In the base model, is just a single instance of our base random variables. In the resampling model, consists of many instances , one for each timestep , and each individual instance is distributed the same as the base model’s . We use the same variable name for both because there’s a conceptual correspondence between the two.
The resampling process defines the full relationship between the two models:
(assuming variable i is resampled at resampling-time ; for all the other variables, , since their values just stay the same)
These equations follow directly from the process outlined above, and define the distribution in terms of the distribution .
… So We’re Running MCMC?
If you’ve worked with Markov Chain Monte Carlo (MCMC) algorithms before, this should look familiar: we’re basically asking which information about the initial conditions will be conserved as we run a standard MCMC process.
If you’ve worked with MCMC algorithms before, you might also guess that this question has two answers:
- In theory, assuming everywhere, the resampling-distribution always approaches as we run the process regardless of initial conditions. Since we always approach the same distribution regardless of initial conditions, no information about the initial conditions is conserved.
- In practice, the time it takes for that limit to kick in increases (often exponentially) with the number of variables in the model, so lots of information is conserved in large models over any practically-achievable number of steps of the process.
In this post, we’re going to think about infinitely large models, so the process no longer converges to at all; that convergence time goes to infinity. The distribution does still converge, but the limiting distribution depends on the initial conditions, and that dependence is exactly what we're interested in. Unfortunately, this introduces a bunch of tricky subtleties about how we take the limits: do we take the limit of infinitely many variables first, or the limit of infinite time first, or do we take them at the same time with some fixed relationship between the two? I’ll handle those subtleties mainly by ignoring them and hoping a mathematician comes along to clean it up later. Remember, we’re aiming for physics-level rigor here.
The important takeaway of this section is that we have a ton of data on how these sorts of processes actually behave in practice, thanks to the popularity of MCMC algorithms. So we don’t just have to rely on physics-level-of-rigor arguments; anyone with firsthand experience with MCMC on large models can use their intuition as a guide. (I mostly won’t explicitly talk more here about lessons from MCMC; I expect that those of you already familiar with the topic can reason it through for yourselves, and explaining the relevant experience/intuitions to people not already familiar is beyond the scope of this post.)
Let’s start with a trivial example to show how any information at all can be conserved by the resampling process. We have two random variables, and . Our model for the two variable values is:
- is an unbiased coin flip
- is a record on a piece of paper indicating whether the coin came up heads or tails (“H” if the coin came up heads, “T” if the coin came up tails)
What happens when we run our resampling process on this system? Well, first we throw away the coin flip, and resample it given our record. If the record says “H”, then we know the coin came up heads, so our sampler selects heads again for ; vice-versa for tails. The first coin is therefore reset to its original value. Then, we throw away the record, and resample it given our coin. If the coin is heads, we know the record says “H”; vice-versa for tails. The record is therefore reset to its original value.
The process continues, back and forth, with each variable “storing the information” when the other is thrown away, and the information then perfectly copied back over into the resampled variable value.
On the other hand, imagine that our record-keeping is imperfect - maybe there’s a 10% chance that records the wrong value. Then, at each “timestep” of the resampling process, there’s roughly a 20% chance (10% for each variable) that we’ll lose the original value. Given, say, 100 timesteps, we’ll lose approximately-all of the information about the original values.
General point: only information which is perfectly conserved at each timestep will be conserved indefinitely; everything else is completely lost.
In general, the information perfectly conserved indefinitely will be the value of deterministic constraints, i.e. functions and such that with probability 1 in the base distribution . (We can prove this via the Telephone Theorem plus an equilibrium condition, but it’s not the main theorem of interest in this post.)
Many Measurements Of One Thing
Let’s say we have a stick of length , and take conditionally-independent measurements of . We’ll model each measurement as normally distributed with mean and standard deviation , and we’ll assume that we have enough data that the prior on doesn’t matter much (i.e. we’ll just pretend the prior is flat).
To simplify the analysis a little, we’ll resample the variables in order rather than at random - i.e. we resample conditional on all the measurements, then resample each of the measurements conditional on , and repeat.
When we resample , we draw from a normal distribution with mean equal to the average of the measurements (i.e. ) and standard deviation , so our new will be about away from the previous average. When we resample the measurements, their new average will be normally distributed with mean and standard deviation , so the new average will be about away from the previous . In other words: and the measurement average follow a random walk, drifting about per timestep. Over timesteps, they will drift a distance of about . (In general, the distance a random walk drifts scales with rather than , since it often wanders back on itself.)
So: if we run the process for a number of steps , then all information about the initial conditions is lost. On the other hand, if the number of variables , then the drift is close to zero, so and the measurement average are approximately conserved. The order in which we take our limits matters.
Practically speaking, we’re looking for information which is approximately conserved, i.e. the “timescale” over which it’s lost is large. So it makes sense to consider and the measurement average as approximately-conserved when is large. That’s our abstract information in this example.
Now for our first theorem about this kind of resampling process.
Imagine that our base distribution is local - i.e. each variable only “directly interacts with” a few “neighbor variables”. When modeling a physical gear, for instance, each little chunk of metal only interacts directly with the chunks of metal spatially adjacent. Any longer-range interactions have to “go through” those direct interactions.
Our theorem says that interactions are still local after controlling for the information conserved by the resampling process. In the gear example, after controlling for the high-level rotation of the gear, the remaining low-level vibrations and rattling are still local; the low-level details of each chunk of metal interact directly only with the low-level details in chunks spatially adjacent.
Why does this matter? In general, locality is the main tool which lets us reason about our high-dimensional world at all. It means we can look at one part of the world, and understand what’s going on there without having to understand everything that’s happening in the whole universe. The factorization theorem says that this still applies when we condition on our high-level knowledge - in other words, we can “zoom in” on lower-level details, and add them to our high-level picture without having to understand all the other low-level details in the whole universe. Conditional on our high-level knowledge, any low-level information still has to flow through neighboring variables in order to influence things “far away” in the graph.
That’s going to be key to our next theorem, which is the main item of interest in this post.
Resampler Conserves Locality: If the base distribution factors over some graph , then so does the limiting resampling distribution . This factorization theorem applies to both undirected graphical models (i.e. Markov Random Fields) and directed graphical models (i.e. Bayes Nets/Causal Models). See the appendices for a proof sketch.
In the gear example, the graph would be the adjacency graph for chunks of metal: each chunk is a node, and the edges show spatial adjacency. The factorization follows the standard formulas for factorization of Markov Random Fields or Bayes Nets, depending on which type of graphical model we’re using.
The Interesting Part: Resampler-Conserved Quantities Mediate Information At A Distance
Time for the big claim from earlier: in the gear example, conditional on the highly redundant information (i.e. the overall rotation of the gear), the low-level rattling of far-apart chunks of metal is statistically independent.
More generally: assume that our base distribution factors on a graph . Conditional on all the quantities perfectly conserved by the resampling process, variables far apart in are independent. If you’ve read the Telephone Theorem, this is basically the same, but with one big upgrade: our “high-level summary” no longer depends on which notion of “far away” we use; the same summary applies to any sequence of nonoverlapping nested Markov blankets. We can take the information conserved by the resampling process to be the “high-level abstractions” for the whole model.
Let’s unpack that. First, we’ll do a quick recap of the Telephone Theorem.
We start with a large Causal Model/Bayes Net:
Each node is a random variable, and the arrows show direct causal influence; paths show indirect causal influence. If you’ve used MCMC before, you were probably picturing something like this already; we usually use MCMC with Bayes nets, because the locality structure makes it easy to resample variables (we only need to look at neighbor values rather than the values of all variables).
Now, just like in the Telephone Theorem, we picture a sequence of nested Markov blankets in our model:
Each possible sequence of blankets cuts the graph into pieces, with each piece only connected directly to the piece before and the piece after. A choice of sequence of blankets defines a notion of “far away” - i.e. if two variables are separated by a large number of “layers” of blankets in the sequence, then they are “far apart”. In order for to have any mutual information with , that information must propagate through each of the layers in between.
The basic idea of the Telephone Theorem is that information is either perfectly conserved or completely lost as we move through enough layers; information can only propagate “far away” if the information can be perfectly computed from each layer individually.
… but if some information can be perfectly computed from each layer individually, then that information will be conserved by our resampler. When resampling , I can still perfectly calculate the information from or , and therefore the information will be perfectly conserved. So, the only information which can propagate far away is information which is perfectly conserved by resampling. That means that conditional on the information conserved by resampling, the mutual information must drop to zero.
A slightly more formal version of that argument is in the appendices, but that’s the core idea.
Let satisfy , i.e. encodes the values of all conserved quantities in the resampling process. For any infinite sequence of nested nonoverlapping Markov blankets on the base model, the conditional mutual information as .
In the gear example, our Markov blankets might be nested layers of metal:
“Far away” then indicates moving through a large number of such layers.
In order for information to propagate far away, we must be able to compute it from each layer - i.e. we can very precisely compute the overall rotation of the gear from the overall rotation of chunks of metal in each layer. So, when we resample a chunk of metal, the overall rotation will be conserved - it will be “stored in” the other chunks.
This reasoning must apply to any information which propagates through many layers: if the information propagates far away, it will be conserved by resampling. So, assuming the overall rotation is the only quantity conserved by the resampling process, far-apart chunks of the gear must be independent given the overall rotation.
Intuitively, this lets us factor the gear into a “global” component and a “local” component. The global component, the gear’s overall rotation, is redundantly represented; it can be estimated by looking at many different little chunks, and we expect these estimates to (approximately) agree. The local component captures everything else, and is guaranteed to be “local” in the sense that far apart pieces are independent.
We started with the intuition that abstraction is about redundant information: there are many different places from which we can learn the information, and many different places where we can apply the information to make predictions. That’s what makes abstractions generalizable and useful.
Then, we showed that a formalization of this intuition based on resampling variables reproduces the main ideas of abstraction as information-relevant-at-a-distance. In particular, the resampling approach yields a better version of the Telephone Theorem.
Personally, I came to all this in a different order: I noticed that the Telephone Theorem required redundancy of information, figured out the resampling thing, and only backed out the intuition of abstraction-as-redundant-information later. Nonetheless, it was an exciting thing to find: when different intuitive formulations of an idea turn out to be basically-equivalent, that’s strong evidence that we’re on the right track.
In terms of applications, the locality results in particular are exactly the sort of thing which I’ve been looking for since my last update on testing the Natural Abstraction Hypothesis. In combination with the generalized Koopman-Pitman-Darmois theorem, they get us to the point where calculating abstractions from a base model is roughly-tractable, i.e. it can be done in something like time with respect to the size of the base model. That still isn’t quite efficient enough to handle big models, and unfortunately big models are exactly where we expect to see nontrivial results, so I’m still not quite at the point of running empirical tests. But it feels like the hard part is now past; there are some clear steps forward on the math, and I expect that those will basically close the gap to efficient calculation.
On the theoretical side, the first leg of the Natural Abstraction Hypothesis says:
For most physical systems, the information relevant “far away” can be represented by a summary much lower-dimensional than the system itself.
Assuming the proofs in this post basically hold up, and the loopholes aren’t critical, I think this claim is now basically proven. There’s still some operationalization to be done (e.g. the “dimension” of the summary hasn’t actually been addressed yet), loopholes to close (e.g. deterministic computation makes things tricky), some legwork to flesh it all out (e.g. including numerical approximation), various extensions (e.g. logical uncertainty), and a lot of distillation needed, but I think this math is enough to conclude that the basic claim is probably true in worlds following local laws (like ours).
The basic idea is also useful even in nonlocal models, which I’ll hopefully write about in the not-too-distant future. That form is more readily applicable to clustering-style applications, like e.g. recognizing “pencils” as a kind of object.
Proof Sketch: Factorization
We’ll use three facts. First, is invariant under resampling any variable - i.e. the distribution reaches an equilibrium as we run the resampling process. I won’t actually prove that, because I don’t expect that there’s anything new involved; standard MCMC convergence proofs should largely carry over (allowing for conserved quantities, of course). Formally:
Second, the resampler is local:
… where indicates the neighbors of in the graph on which factors.
Third, screens off from (thus the “Markov Chain” part of “Markov Chain Monte Carlo). So, .
Combining these three, we find
… i.e. the neighbors which screen off from everything else in the base model also screen off from everything else in , given . The Hammersley-Clifford Theorem tells us that this is a sufficient condition for to factor over the graph (modulo taking some limits).
Note that the Hammersley-Clifford theorem applies to undirected graphical models. I won’t prove the extension of our theorem to Bayes Nets here because, again, there's nothing particularly novel involved.
Proof Sketch: Resampler Telephone Theorem
Once we have locality of given , this theorem is easy: it’s exactly like the Telephone Theorem, but on the distribution rather than . Because factors over the same graph as , we can pick any sequence of nonoverlapping nested Markov blankets in the base model, carry them over directly to , and apply the Telephone Theorem argument:
- The relationship between and is mediated by , so . In other words, mutual information with B_1 decreases as we move outward through the layers.
- MI is nonnegative and decreasing, so it must approach a limit as .
- Once the MI is arbitrarily close to that limit, information about is arbitrarily perfectly conserved.
- Information is perfectly conserved only when the information is carried by a deterministic constraint, i.e. where with probability 1 for some .
The key thing to notice here is the deterministic constraint . In the two-variable example, we saw that exactly this kind of information is conserved by our resampling process: when we resample variables in , the constraint value is perfectly “remembered” by , and when we resamples variables in , the constraint value is perfectly “remembered” by . Newly-generated sample values are forced to be perfectly consistent with the constraint value, so that information sticks around.
So, if with probability 1, and the blankets and are nonoverlapping, then the value is perfectly conserved when resampling. It’s perfectly conserved by the variables in when resampling a variable outside of , and perfectly conserved by the variables in when resampling a variable outside of . So, conditional on information conserved by the resampling process (or, equivalently for purposes of the distribution, conditional on ), the value of is known; it does not give any information at all. So, conditional on quantities conserved by resampling, the mutual information must drop to zero in the limit of large .