The Lightcone Theorem: A Better Foundation For Natural Abstraction?

15th May 2023


The Lightcone Theorem says: conditional on $X^0$, any sets of variables in $X$ which are a distance of at least $2T$ apart in the graphical model are independent.

I am confused. This sounds to me like:

If you have sets of variables that start with no mutual information (conditioning on $X^0$), and they are so far away that nothing other than $X^0$ could have affected both of them (distance of at least $2T$), then they continue to have no mutual information (independent).

Some things that I am confused about as a result:

- I don't see why you are surprised, or why you would have said it wouldn't work for finite T. (It seems obviously true to me from the statement, which makes me think I'm missing some subtlety.)
- I don't understand why the distribution of $X^0$ must be the same as the distribution of $X$. It seems like it should hold for arbitrary $X^0$.
- I don't see why this is relevant for natural abstractions. To me, the interesting part about abstractions is that it is generally fine to keep track of a small amount of information, even though there is tons and tons of information that "could have" been relevant (and *does* affect outcomes but in a way that is "noise" rather than "signal"). But this theorem is only telling you that you can throw away information that could never possibly have been relevant.

If you have sets of variables that start with no mutual information (conditioning on $X^0$), and they are so far away that nothing other than $X^0$ could have affected both of them (distance of at least $2T$), then they continue to have no mutual information (independent).

Yup, that's basically it. And I agree that it's pretty obvious once you see it - the key is to notice that distance $2T$ implies that nothing other than $X^0$ could have affected both of them. But man, when I didn't know that was what I should look for? Much less obvious.

I don't understand why the distribution of $X^0$ must be the same as the distribution of $X$. It seems like it should hold for arbitrary $X^0$.

It does, but then $X^T$ doesn't have the same distribution as the original graphical model (unless we're running the sampler long enough to equilibrate). So we can't view $X^0$ as a latent generating that distribution.

But this theorem is only telling you that you can throw away information that could never possibly have been relevant.

Not quite - note that the resampler itself throws away a ton of information *about* $X^0$ while going from $X^0$ to $X^T$. And that is indeed information which "could have" been relevant, but almost always gets wiped out by noise. That's the information we're looking to throw away, for abstraction purposes.

So the reason this is interesting (for the thing you're pointing to) is not that it lets us ignore information from far-away parts of $X$ which could not possibly have been relevant given $X^0$, but rather that we want to further throw away information from $X^0$ itself (while still maintaining conditional independence at a distance).

Yup, that's basically it. And I agree that it's pretty obvious once you see it - the key is to notice that distance $2T$ implies that nothing other than $X^0$ could have affected both of them. But man, when I didn't know that was what I should look for? Much less obvious.

... I feel compelled to note that I'd pointed out a very similar thing a while ago.

Granted, that's not exactly the same formulation, and the devil's in the details.

Okay, that mostly makes sense.

note that the resampler itself throws away a ton of information *about* $X^0$ while going from $X^0$ to $X^T$. And that is indeed information which "could have" been relevant, but almost always gets wiped out by noise. That's the information we're looking to throw away, for abstraction purposes.

I agree this is true, but why does the Lightcone theorem matter for it?

It is also a theorem that a Gibbs resampler initialized at equilibrium will produce $X^T$ distributed according to $P[X]$, and as you say it's clear that the resampler throws away a ton of information about $X^0$ in computing it. Why not use that theorem as the basis for identifying the information to throw away? In other words, why not throw away information from $X^0$ while maintaining ?

EDIT: Actually, *conditioned on $X^0$*, it is not the case that $X^T$ is distributed according to $P[X]$.

(Simple counterexample: Take a graphical model where node A can be 0 or 1 with equal probability, and A causes B through a chain of > 2T steps, such that we always have B = A for a true sample from X. In such a setting, for a true sample from X, B should be equally likely to be 0 or 1, but * *, i.e. it is deterministic.)

Of course, this is a problem for both my proposal and for the Lightcone theorem -- in either case you can't view $X^0$ as a latent that generates $X$ (which seems to be the main motivation, though I'm still not quite sure why that's the motivation).

Sounds like we need to unpack what "viewing $X^0$ as a latent which generates $X$" is supposed to mean.

I start with a distribution $P[X]$. Let's say $X$ is a bunch of rolls of a biased die, of unknown bias. But I don't know that's what $X$ is; I just have the joint distribution of all these die-rolls. What I want to do is look at that distribution and somehow "recover" the underlying latent variable (bias of the die) and factorization, i.e. notice that I can write the distribution as $P[X] = \sum_\Lambda P[\Lambda] \prod_i P[X_i|\Lambda]$, where $\Lambda$ is the bias in this case. Then when reasoning/updating, we can usually just think about how an individual die-roll interacts with $\Lambda$, rather than all the other rolls, which is useful insofar as $\Lambda$ is much smaller than all the rolls.

Note that $P[X|\Lambda]$ is not supposed to match $P[X]$; then the representation would be useless. It's the marginal $\sum_\Lambda P[\Lambda] \prod_i P[X_i|\Lambda]$ which is supposed to match $P[X]$.
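To make the die example concrete, here is a minimal sketch. The two candidate biases, their prior, and every name in the code are assumptions invented for illustration, not something from the thread. It shows that two rolls are dependent once the bias is marginalized out, even though each roll is independent given the bias, and that the marginal over the latent is a proper joint distribution.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical latent: the die is either fair or loaded towards 6.
PRIOR = {"fair": F(1, 2), "loaded": F(1, 2)}
ROLL_GIVEN_BIAS = {
    "fair": {face: F(1, 6) for face in range(1, 7)},
    "loaded": {**{face: F(1, 10) for face in range(1, 6)}, 6: F(1, 2)},
}

def joint(rolls):
    """P[X = rolls] = sum over biases of P[bias] * prod_i P[X_i | bias]."""
    total = F(0)
    for bias, p_bias in PRIOR.items():
        term = p_bias
        for face in rolls:
            term *= ROLL_GIVEN_BIAS[bias][face]
        total += term
    return total

def marginal_first(face):
    """P[X_1 = face], obtained by summing the joint over the second roll."""
    return sum(joint((face, other)) for other in range(1, 7))

p_66 = joint((6, 6))        # probability of two sixes in a row
p_6 = marginal_first(6)     # probability of a single six
total_mass = sum(joint(r) for r in product(range(1, 7), repeat=2))
```

Here `p_66` differs from `p_6 * p_6`, so the rolls are not independent under the marginal, while inside `joint` each term factorizes over rolls once the bias is fixed.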

The lightcone theorem lets us do something similar. Rather than all the $X_i$'s being independent given $X^0$, only those $X_i$'s sufficiently far apart are independent, but the concept is otherwise similar. We express $P[X]$ as $\sum_{X^0} P[X^0] P[X|X^0]$ (or, really, $\sum_\Lambda P[\Lambda] P[X|\Lambda]$, where $\Lambda$ summarizes info in $X^0$ relevant to $X$, which is hopefully much smaller than all of $X$).

Okay, I understand how that addresses my edit.

I'm still not quite sure why the lightcone theorem is a "foundation" for natural abstraction (it looks to me like a nice concrete example on which you could apply techniques) but I think I should just wait for future posts, since I don't really have any concrete questions at the moment.

I'm still not quite sure why the lightcone theorem is a "foundation" for natural abstraction (it looks to me like a nice concrete example on which you could apply techniques)

My impression is that it being a concrete example *is* the why. "What is the right framework to use?" and "what is the environment-structure in which natural abstractions can be defined?" are core questions of this research agenda, and this sort of multi-layer locality-including causal model is one potential answer.

The fact that it loops-in the speed of causal influence is also suggestive — it seems fundamental to the structure of our universe, crops up in a lot of places, so the proposition that natural abstractions are somehow downstream of it is interesting.

Hmm. I may be currently looking at it from the wrong angle, but I'm skeptical that it's the right frame for defining abstractions. It seems to group low-level variables based on raw distance, rather than the detailed environment structure? Which seems like a very weak constraint. That is,

By further iteration, we can conclude that any number of sets of variables which are all separated by a distance of $2T$ are independent given $X^0$. That’s the full Lightcone Theorem.

We can make *literally any* choice of those sets subject to this condition: we can draw the boundaries any way we want. Which means the abstractions we'd recover are not going to be convergent: there's a free parameter of the boundary choice.

Ah, no, I suppose that part is supposed to be handled by whatever approximation process we define for ? That is, the "correct" definition of the "most minimal approximate summary" would implicitly constrain the possible choices of boundaries for which is equivalent to ?

The eigendecomposition/mesoscale-approximation/gKPD approaches seem like they might move in that direction, though I admit I don't see their implications at a first glance.

If we ignore the sketchy part - i.e. pretend that regions $X^0_{R_1},\dots,X^0_{R_m}$ cover all of $X^0$ and are all independent given $X$ - then gKPD would say roughly:

if $\Lambda$ can be represented as $n/2$ dimensional or smaller

What's the $n$ here? Is it meant to be ?

Ah, no, I suppose that part is supposed to be handled by whatever approximation process we define for ? That is, the "correct" definition of the "most minimal approximate summary" would implicitly constrain the possible choices of boundaries for which is equivalent to ?

Almost. The hope/expectation is that different choices yield approximately the same $\Lambda$, though still probably modulo some conditions (like e.g. sufficiently large ).

What's the $n$ here? Is it meant to be ?

System size, i.e. number of variables.

Almost. The hope/expectation is that different choices yield approximately the same $\Lambda$, though still probably modulo some conditions (like e.g. sufficiently large ).

Can you elaborate on this expectation? Intuitively, $\Lambda$ should consist of a number of higher-level variables as well, and each of them should correspond to a specific set of lower-level variables: abstractions and the elements they abstract over. So for a given $\Lambda$, there should be a specific "correct" way to draw the boundaries in the low-level system.

But if ~any way of drawing the boundaries yields the same $\Lambda$, then what does this mean?

Or perhaps the "boundaries" in the mesoscale-approximation approach represent something other than the factorization of into individual abstractions?

Sure, but isn't the goal of the whole agenda to show that *does* have a certain correct factorization, i. e. that abstractions are convergent?

I suppose it may be that any choice of low-level boundaries results in the same $\Lambda$, but the $\Lambda$ itself has a canonical factorization, and going from $\Lambda$ back to $X$ reveals the corresponding canonical factorization of $X$? And then depending on how close the initial choice of boundaries was to the "correct" one, $\Lambda$ is easier or harder to compute (or there's something else about the right choice that makes it nice to use).

Yes, there is a story for a canonical factorization of $\Lambda$, it's just separate from the story in this post.

By the way, do we need the proof of the theorem to be quite this involved? It seems we can just note that for any two (sets of) variables , separated by distance , the earliest sampling-step at which their values can intermingle (= their lightcones intersect) is (since even in the "fastest" case, they can't do better than moving towards each other at 1 variable per 1 sampling-step).

Is there a good primer somewhere on how causal models interact with the standard model of physics?

Do you have any cached thoughts on the matter of "ontological inertia" of abstract objects? That is:

- We usually think about abstract environments in terms of DAGs. In particular, ones without global time, and with no situations where we update-in-place a variable. A node in a DAG is a one-off.
- However, that's not faithful to reality. In practice, objects have a continued existence, and a good abstract model should have a way to track e. g. the state of a particular human across "time"/the process of system evolution. But if "Alice" is a variable/node in our DAG, she only exists for an instant...
- The model in this post deals with this by assuming that the *entire* causal structure is "copied" every timestep. So every timestep has an "Alice" variable, and is a function of plus some neighbours...
- But that's not right either. Structure *does* change; people move around (acquire new causal neighbours and lose old ones) and are born (new variables are introduced), etc.

I think we want our model of the environment to be "flexible" in the sense that it doesn't assume the graph structure gets copied over fully every timestep, *but* that it has some language for talking about "ontological inertia"/one variable being an "updated version" of another variable. But I'm not quite sure how to describe this relationship.

At the bare minimum, it has to be of same "type" as (e. g., "human"), be directly causally connected to , 's value has to be largely determined by 's value... But that's not enough, because by this definition Alice's newborn child will probably also count as Alice.

Or maybe I'm overcomplicating this, and every variable in the model would just have an "identity" signifier baked-in? Such that ?

Going up or down the abstraction levels doesn't seem to help either. ( isn't necessarily an abstraction over the same set of lower-level variables as , nor does she necessarily have the same relationship with the higher-level variables.)

Back to my question: do you have any cached thoughts on that?

Thank you to David Lorell for his contributions as the ideas in this post were developed.

For about a year and a half now, my main foundation for natural abstraction math has been The Telephone Theorem: long-range interactions in a probabilistic graphical model (in the long-range limit) are mediated by quantities which are conserved (in the long-range limit). From there, the next big conceptual step is to argue that the quantities conserved in the long-range limit are also conserved by resampling, and therefore the conserved quantities of an MCMC sampling process on the model mediate all long-range interactions in the model.

The most immediate shortcoming of the Telephone Theorem and the resampling argument is that they talk about behavior in infinite limits. To use them, either we need to have an infinitely large graphical model, or we need to take an approximation. For practical purposes, approximation is clearly the way to go, but just directly adding epsilons and deltas to the arguments gives relatively weak results.

This post presents a different path.

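The core result is the Lightcone Theorem: model the variables $X$ of the graphical model as the output $X^T$ of $T$ timesteps of a Gibbs sampler whose initial condition $X^0$ has the same distribution as $X$. Then, conditional on $X^0$, any sets of variables in $X$ which are a distance of at least $2T$ apart in the graphical model are independent.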

In short: the initial condition of the resampling process provides a latent, conditional on which we have *exact* independence at a distance.

This was… rather surprising to me. If you’d floated the Lightcone Theorem as a conjecture a year ago, I’d have said it would probably work as an approximation for large $T$, but no way it would work exactly for finite $T$. Yet here we are.

## The Proof, In Pictures

The proof is best presented visually.^{[1]} High-level outline:

We start with the graphical model:

Within that graphical model, we’ll pick some tuple of variables $X_R$ (“R” for “region”)^{[2]}. I’ll use the notation $X_{D(R,t)}$ for the variables a distance $t$ away from $R$, $X_{D(R,>t)}$ for variables a distance greater than $t$ away from $R$, $X_{D(R,<t)}$ for variables a distance less than $t$ away from $R$, etc.

Note that for any $t$, $X_{D(R,t)}$ (the variables a distance $t$ away from $R$) is a Markov blanket, mediating interaction between $X_{D(R,<t)}$ (everything less than distance $t$ from $X_R$), and $X_{D(R,>t)}$ (everything more than distance $t$ from $X_R$).

Next, we’ll draw the Gibbs resampler as a graphical model. We’ll draw the full state $X^t$ at each timestep as a “layer”, with $X^0$ as the initial layer and $X = X^T$ as the final layer. At each timestep, some (nonadjacent) variables are resampled conditional on their neighbors, so we have arrows from the neighbor-variables in the previous timestep. The rest of the variables stay the same, so they each just have a single incoming arrow from themselves at the previous timestep.
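The layered picture can be sketched in code. This is only an illustration under assumptions of my own choosing: a one-dimensional Ising chain with coupling `J`, spins of +1/-1, and a schedule that alternates between even and odd sites so that the variables resampled in one timestep are never adjacent. None of these choices come from the post.

```python
import math
import random

J = 0.8  # assumed coupling strength of a toy Ising chain

def resample_site(state, i, rng):
    """Draw x_i from P[x_i | neighbours] for an Ising chain with spins +1/-1."""
    left = state[i - 1] if i > 0 else 0
    right = state[i + 1] if i < len(state) - 1 else 0
    p_up = 1.0 / (1.0 + math.exp(-2.0 * J * (left + right)))
    return 1 if rng.random() < p_up else -1

def gibbs_layers(x0, T, rng):
    """Return the layers [X^0, X^1, ..., X^T] of the resampling process.

    Odd timesteps resample the even sites, even timesteps resample the odd
    sites. Every other variable is copied from the previous layer.
    """
    layers = [list(x0)]
    for t in range(1, T + 1):
        prev = layers[-1]
        nxt = list(prev)                      # default: copy from previous layer
        first_site = (t - 1) % 2              # 0 on odd timesteps, 1 on even ones
        for i in range(first_site, len(prev), 2):
            nxt[i] = resample_site(prev, i, rng)  # arrows from previous-layer neighbours
        layers.append(nxt)
    return layers

layers = gibbs_layers([1] * 7, T=3, rng=random.Random(0))
```

Each entry of `layers` is one "layer" of the graphical model; a resampled site reads only its neighbours in the previous layer, and an untouched site has a single parent, namely itself one layer earlier.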

Now for the core idea: we’re going to perform a do() operation on the resampler-graph. Specifically, we’re going to hold $X_{D(R,T)}$ constant; none of the variables in that Markov blanket are ever resampled in the do()-transformed resampling process.

Notice that, in the do()-operated process, knowing $X^0$ also tells us the value of $X^t_{D(R,T)}$ for all $t$. So, if we condition on $X^0$, then visually we’re conditioning on:

Note that the “cylinder” (including the “base”) separates $X^T$ into two pieces - one contains everything less than distance $T$ from $X^1_R,\dots,X^T_R$, and the other everything more than distance $T$ from $X^1_R,\dots,X^T_R$. The separation indicates that $X^0$ is a Markov blanket between those pieces… at least within the do()-operated resampling process.

Now for the last step: we’ll draw a forward “lightcone” around our do()-operation. As the name suggests, it expands outward along outgoing arrows, starting from the nodes intervened-upon, to include everything downstream of the do()-intervention.

*Outside* of that lightcone, the distribution of the do()-operated process matches that of the non-do()-operated process.

Crucially, $X^0$, $X^T_R = X_R$, and $X^T_{D(R,\geq 2T)} = X_{D(R,\geq 2T)}$ are all outside of the lightcone, so their joint distribution is the *same* under the do()-operated and non-do()-operated resampling process.

Since $X^0$ mediates between $X_R$ and $X^T_{D(R,\geq 2T)}$ in the do()-operated process, and the joint distribution is the same between the do()-operated and non-do()-operated process, $X^0$ must mediate between $X_R$ and $X^T_{D(R,\geq 2T)}$ under the non-do()-operated process.
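The lightcone step can be checked mechanically on a toy case. The sketch below assumes a chain-shaped graphical model (sites `0..n_sites-1`, distance `|i - j|`) and, as a worst case, draws an arrow into every site from itself and both neighbours at every timestep. The function name and the example parameters are hypothetical.

```python
def forward_lightcone(n_sites, T, region):
    """Nodes (t, i) of the layered resampler graph downstream of the do().

    The do() holds the sites at distance exactly T from `region` fixed at
    timesteps 1..T. Arrows run from (t, j) to (t + 1, i) for |i - j| <= 1.
    """
    def distance_to_region(i):
        return min(abs(i - r) for r in region)

    blanket = [i for i in range(n_sites) if distance_to_region(i) == T]
    cone = {(t, i) for t in range(1, T + 1) for i in blanket}
    frontier = list(cone)
    while frontier:
        t, i = frontier.pop()
        if t == T:
            continue  # final layer has no outgoing arrows
        for j in (i - 1, i, i + 1):
            if 0 <= j < n_sites and (t + 1, j) not in cone:
                cone.add((t + 1, j))
                frontier.append((t + 1, j))
    return cone

# Toy case: region R = {0} on a 9-site chain, T = 2 timesteps.
cone = forward_lightcone(n_sites=9, T=2, region=[0])
outside_final = [i for i in range(9) if (2, i) not in cone]
```

In this toy case the final-layer sites outside the lightcone are site 0 (the region itself) and sites 4 through 8 (distance at least $2T = 4$ from the region), and no node of the initial layer is inside it.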

In other words: any two sets of variables at least a distance of $2T$ apart (i.e. $X_R$ and $X^T_{D(R,\geq 2T)}$) are independent given $X^0$. That’s the Lightcone Theorem for two sets of variables.

Finally, note that we can further pick out two subsets of $X^T_{D(R,\geq 2T)}$ which are themselves separated by $2T$, and apply the Lightcone Theorem for two sets of variables again to conclude that $X_R$ and the two chosen sets of variables are all mutually independent given $X^0$. By further iteration, we can conclude that any number of sets of variables which are all separated by a distance of $2T$ are independent given $X^0$. That’s the full Lightcone Theorem.
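As a sanity check, the two-set version of the theorem can be verified by exact enumeration on a small model. The sketch below is an illustration under assumed specifics, not the post's own construction: a 7-site Ising chain with coupling `J`, $T = 2$ timesteps (even sites resampled first, then odd sites), and a crude dependence measure. It computes $P[X^T | X^0 = x^0]$ exactly for every $x^0$ and compares a pair of sites at distance $4 = 2T$ with a pair at distance $2$.

```python
import itertools
import math

J = 0.8          # assumed coupling strength
N = 7            # chain of 7 spins (+1/-1), sites 0..6
SPINS = (-1, 1)

def p_up(state, i):
    """P[x_i = +1 | neighbours] for the Ising chain."""
    left = state[i - 1] if i > 0 else 0
    right = state[i + 1] if i < N - 1 else 0
    return 1.0 / (1.0 + math.exp(-2.0 * J * (left + right)))

def resample(dist, sites):
    """Exact one-timestep update of {state: prob}; `sites` are pairwise nonadjacent."""
    out = {}
    for state, prob in dist.items():
        ups = [p_up(state, i) for i in sites]
        for values in itertools.product(SPINS, repeat=len(sites)):
            new_state = list(state)
            weight = prob
            for i, value, up in zip(sites, values, ups):
                new_state[i] = value
                weight *= up if value == 1 else 1.0 - up
            key = tuple(new_state)
            out[key] = out.get(key, 0.0) + weight
    return out

def final_given_x0(x0):
    """P[X^T | X^0 = x0] for T = 2: even sites at timestep 1, odd sites at timestep 2."""
    dist = resample({tuple(x0): 1.0}, [0, 2, 4, 6])
    return resample(dist, [1, 3, 5])

def dependence(dist, i, j):
    """Largest |P[x_i, x_j] - P[x_i] P[x_j]| over the four value pairs."""
    pair, p_i, p_j = {}, {}, {}
    for state, prob in dist.items():
        a, b = state[i], state[j]
        pair[a, b] = pair.get((a, b), 0.0) + prob
        p_i[a] = p_i.get(a, 0.0) + prob
        p_j[b] = p_j.get(b, 0.0) + prob
    return max(abs(pair.get((a, b), 0.0) - p_i.get(a, 0.0) * p_j.get(b, 0.0))
               for a in SPINS for b in SPINS)

def worst_case_dependence(i, j):
    """Maximum dependence between sites i and j over every initial condition x0."""
    return max(dependence(final_given_x0(x0), i, j)
               for x0 in itertools.product(SPINS, repeat=N))

far = worst_case_dependence(1, 5)    # distance 4 = 2T
near = worst_case_dependence(1, 3)   # distance 2 < 2T
```

Under these assumptions `far` is zero up to floating-point error for every initial condition, while `near` is clearly nonzero because sites 1 and 3 both read the freshly resampled site 2.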

## How To Use The Lightcone Theorem?

The rest of this post is more speculative.

$X^0$ mediates interactions in $X$ over distance of at least $2T$, but $X^0$ also typically has a bunch of “extra” information in it that we don’t really care about - information which is lost during the resampling process. So, the next step is to define the latent $\Lambda$:

$$\Lambda(X^0) := \left(x \mapsto \ln P[X = x | X^0]\right)$$
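To see how much of $X^0$ this latent discards, here is a small sketch on an assumed toy model (a 5-site Ising chain with coupling `J`, $T = 2$, even sites resampled first and odd sites second; all names are hypothetical). It represents $\Lambda(x^0)$ as the table of log-probabilities $\ln P[X = x | X^0 = x^0]$ and counts how many distinct tables arise across all initial conditions.

```python
import itertools
import math

J = 0.8          # assumed coupling strength
N = 5            # chain of 5 spins (+1/-1), sites 0..4
SPINS = (-1, 1)

def p_up(state, i):
    """P[x_i = +1 | neighbours] for the Ising chain."""
    left = state[i - 1] if i > 0 else 0
    right = state[i + 1] if i < N - 1 else 0
    return 1.0 / (1.0 + math.exp(-2.0 * J * (left + right)))

def resample(dist, sites):
    """Exact one-timestep update of {state: prob}; `sites` are pairwise nonadjacent."""
    out = {}
    for state, prob in dist.items():
        ups = [p_up(state, i) for i in sites]
        for values in itertools.product(SPINS, repeat=len(sites)):
            new_state = list(state)
            weight = prob
            for i, value, up in zip(sites, values, ups):
                new_state[i] = value
                weight *= up if value == 1 else 1.0 - up
            key = tuple(new_state)
            out[key] = out.get(key, 0.0) + weight
    return out

def latent(x0):
    """Lambda(x0): the function x -> ln P[X = x | X^0 = x0], as a tuple.

    Entries are listed in sorted order of x and rounded so that equal
    functions compare equal.
    """
    dist = resample(resample({tuple(x0): 1.0}, [0, 2, 4]), [1, 3])
    return tuple(round(math.log(dist[x]), 10) for x in sorted(dist))

all_x0 = list(itertools.product(SPINS, repeat=N))
distinct_latents = {latent(x0) for x0 in all_x0}
```

In this toy setup the 32 possible initial conditions collapse to 4 distinct values of $\Lambda$, since only the initial values at sites 1 and 3 survive the first resampling step.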

By the minimal map argument, $P[X|\Lambda] = P[X|X^0]$; $\Lambda$ is an informationally-minimal summary of all the information in $X^0$ relevant to $X$.

… but that doesn’t mean that $\Lambda$ is minimal among *approximate* summaries, nor that it’s a very efficient representation of the information. Those are the two main open threads: approximation and efficient representation.

On the approximation front, a natural starting point is to look at the eigendecomposition of the transition matrix for the resampling process. This works great in some ways, but plays terribly with information-theoretic quantities (logs blow up for near-zero probabilities). An eigendecomposition of the *log* transition probabilities plays better with information-theoretic quantities, but worse with composition as we increase $T$, and we’re still working on theorems to fully justify the use of the log eigendecomposition.

On the efficient representation front, a natural stat-mech-style starting point is to pick “mesoscale regions”: $m$ sets of variables $X_{R_1},\dots,X_{R_m}$ which are all separated by at least distance $2T$, but big enough to contain the large majority of variables. Then

$$\ln P[X_{R_1},\dots,X_{R_m}|\Lambda] = \sum_i \ln P[X_{R_i}|\Lambda]$$

At this point the physicists wave their hands REALLY vigorously and say “... and since $X_{R_1},\dots,X_{R_m}$ includes the large majority of variables, that sum will approximate $\ln P[X|\Lambda]$”. Which is of course *extremely* sketchy, but for some (very loose) notions of “approximate” it can work for some use cases. I’m still in the process of understanding exactly how far that kind of approximation can get us, and for what use-cases.

Insofar as the mesoscale approximation does work, another natural step is to invoke generalized Koopman-Pitman-Darmois (gKPD). This requires a little bit of a trick: a Gibbs sampler run backwards is still a Gibbs sampler. So, we can swap $X^0$ and $X$ in the Lightcone Theorem: subsets of $X^0$ separated by a distance of at least $2T$ are independent given $X$. From there:

… which is most of what we need to apply gKPD. If we ignore the sketchy part - i.e. pretend that regions $X^0_{R_1},\dots,X^0_{R_m}$ cover all of $X^0$ and are all independent given $X$ - then gKPD would say roughly:

if $\Lambda$ can be represented as $n/2$ dimensional or smaller, then $\Lambda$ is isomorphic to $\sum_i f_i(X_{R_i})$ for some functions $f_i$, plus a limited number of “exception” terms.

There’s a lot of handwaving there, but that’s very roughly the sort of argument I hope works. If successful, it would imply a nice maxent form (though not as nice as the maxent form I was hoping for a year ago, which I don’t think actually works), and would justify using an eigendecomposition of the log transition matrix for approximation.

^{[1]} I’ve omitted from the post various standard things about Gibbs samplers, e.g. explaining why we can model the variables of the graphical model as the output of a Gibbs sampler, how big $T$ needs to be in order to resample all the variables at least once, how to generate $X^0$ from $X$ (rather than vice-versa), etc. Leave a question in the comments if you need more detail on that.

^{[2]} Notation convention: capital-letter indices like $X_A$ indicate index-tuples, i.e. if $A = (1,2,3)$ then $X_A = (X_1, X_2, X_3)$.