When constructing a high-level abstract causal DAG from a low-level DAG, one operation which comes up quite often is throwing away information from a node. This post is about how to do that.
First, how do we throw away information from random variables in general? Sparknotes:
Given a random variable $X$, we can throw away information by replacing $X$ with $f(X)$ for some function $f$.
Given some other random variable $Y$, $f(X)$ “contains all information in $X$ relevant to $Y$” if and only if $P[Y|f(X)]=P[Y|X]$. In particular, the full distribution function $y \mapsto P[Y=y|X]$ is a minimal representation of the information about $Y$ contained in $X$.
For more explanation of this, see Probability as Minimal Map.
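To make that condition concrete, here's a toy sketch (the joint distribution, the function name keeps_all_info, and the example numbers are all made up for illustration): it checks whether $f(X)$ keeps everything in $X$ relevant to $Y$ by comparing $P[Y|X=x]$ against $P[Y|f(X)=f(x)]$ over a small discrete joint distribution.

```python
from collections import defaultdict

def keeps_all_info(joint, f, tol=1e-9):
    """Return True iff P[Y | f(X)] == P[Y | X] under the given joint distribution.

    joint: dict mapping (x, y) -> probability.
    f:     the candidate "throw away information" function of x.
    """
    p_x, p_xy, p_f, p_fy = (defaultdict(float) for _ in range(4))
    for (x, y), p in joint.items():
        p_x[x] += p
        p_xy[(x, y)] += p
        p_f[f(x)] += p
        p_fy[(f(x), y)] += p
    # f(X) keeps all info about Y iff conditioning on f(X) gives the same
    # answer as conditioning on X, for every (x, y) with positive probability.
    return all(abs(p_xy[(x, y)] / p_x[x] - p_fy[(f(x), y)] / p_f[f(x)]) < tol
               for (x, y) in joint)

# Y depends on X only through X**2, so f(x) = x**2 throws away the sign of X
# without losing anything relevant to Y; a constant f loses too much.
joint = {(-1, "a"): 0.2, (-1, "b"): 0.1,
         ( 1, "a"): 0.2, ( 1, "b"): 0.1,
         ( 0, "a"): 0.1, ( 0, "b"): 0.3}
print(keeps_all_info(joint, f=lambda x: x * x))  # True
print(keeps_all_info(joint, f=lambda x: 0))      # False
```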
For our purposes, starting from a low-level causal DAG, we want to:
Pick a node $X_i$
Pick a set of nodes $X_S$ (with $i \in S$)
Replace $X_i$ by $f(X_i)$ for some function $f$, such that
$P[X_{\bar S}|f(X_i)]=P[X_{\bar S}|X_i]$
Here $\bar S$ denotes all the node indices outside $S$. (Specifying $S$ rather than $\bar S$ directly will usually be easier in practice, since $X_S$ is usually a small neighborhood of nodes around $X_i$.) In English: we want to throw away information from $X_i$, while retaining all information relevant to nodes outside the set $S$.
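This DAG condition can be checked the same way, by brute-force enumeration on a small example. Below is a minimal sketch under assumed conditions (the three-node chain and its probabilities are invented for illustration): it marginalizes the full joint distribution down to $(X_i, X_{\bar S})$ and compares the two conditionals directly.

```python
from collections import defaultdict

def abstraction_ok(joint_nodes, i, S, f, tol=1e-9):
    """Check P[X_{bar S} | f(X_i)] == P[X_{bar S} | X_i] by direct enumeration.

    joint_nodes: dict mapping a tuple of node values (X_0, ..., X_{n-1}) -> probability.
    """
    p_x, p_xy, p_f, p_fy = (defaultdict(float) for _ in range(4))
    for values, p in joint_nodes.items():
        x = values[i]
        rest = tuple(v for j, v in enumerate(values) if j not in S)  # X_{bar S}
        p_x[x] += p
        p_xy[(x, rest)] += p
        p_f[f(x)] += p
        p_fy[(f(x), rest)] += p
    return all(abs(p_xy[(x, rest)] / p_x[x] - p_fy[(f(x), rest)] / p_f[f(x)]) < tol
               for (x, rest) in p_xy)

# Chain X0 -> X1 -> X2, with X2 = X1**2 deterministically. Replacing X1 by
# X1**2 (taking S = {0, 1}) keeps everything the nodes outside S care about;
# a constant f does not.
chain = {(0, -1, 1): 0.125, (0, 1, 1): 0.125, (0, 0, 0): 0.25,
         (1, -1, 1): 0.2,   (1, 1, 1): 0.2,   (1, 0, 0): 0.1}
print(abstraction_ok(chain, i=1, S={0, 1}, f=lambda x: x * x))  # True
print(abstraction_ok(chain, i=1, S={0, 1}, f=lambda x: 0))      # False
```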
Two prototypical examples:
In a digital circuit, we pick the voltage in a particular wire at a particular time as $X_i$. Assuming the circuit is well designed, we will find that only the binary value $\operatorname{bin}(X_i)$ is relevant to voltages far away in the circuit or in time. So, with all “nearby” voltages as $X_S$, we can replace $X_i$ by $\operatorname{bin}(X_i)$.
In a fluid, we pick the (microscopic) positions and momenta of all the particles in a little cell of spacetime as $X_i$. Assuming uniform temperature, identical particles, and some source of external noise - even just a little bit - we expect that only the total number and momentum of particles in the cell will be relevant to the positions and momenta of particles far away in space and time. So, with all “nearby” cells/particles as $X_S$, we can replace the microscopic positions and momenta of all particles in the cell with the total number and momentum of particles in the cell. (Toy versions of both functions are sketched below.)
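As a toy illustration (the threshold, voltage units, and particle data are invented, not taken from any real circuit or fluid model), here is what those two information-throwing-away functions might look like:

```python
def f_circuit(voltage, v_threshold=1.65):
    # Keep only the wire's binary value; discard the exact analog level.
    return 1 if voltage >= v_threshold else 0

def f_fluid(particles):
    # particles: list of (position, momentum) pairs for one cell.
    # Keep only the particle count and total momentum; discard which particle
    # is which and where exactly each one sits inside the cell.
    return len(particles), sum(momentum for _, momentum in particles)

print(f_circuit(3.28))                     # 1  ("high", despite analog noise)
print(f_fluid([(0.1, 2.0), (0.7, -0.5)]))  # (2, 1.5)
```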
In both examples, we’re throwing out “local” information, while maintaining any information which is relevant “globally”. This will mean that local queries - e.g. the voltage in one wire given the voltage in a neighboring wire at the same time - are not supported; short-range correlations violate the abstraction. However, large-scale queries - e.g. the voltage in a wire now given the voltage in a wire a few seconds ago - are supported.
Modifying Children
We still have one conceptual question to address: when we replace $X_i$ by $f(X_i)$, how do we modify the child nodes of $X_i$ to use $f(X_i)$ instead?
The first and most important answer is: it doesn't matter, so long as whatever they do is consistent with $f(X_i)$. For instance, suppose $X_i$ ranges over $\{-1, 0, 1\}$, and $f(X_i)=X_i^2$. When $f(X_i)=1$, the children can act as though $X_i$ were -1 or 1 - it doesn't matter which, so long as they don't act like $X_i=0$. As long as the children's behavior is consistent with the information in $f(X_i)$, we will be able to support long-range queries.
There is one big catch, however: the children do all need to behave as if $X_i$ had the same value, whatever value they choose. The joint distribution $P[X_{ch(i)}|X_{sp(i)},f(X_i)]$ (where $ch(i)$ = children of $i$ and $sp(i)$ = spouses of $i$) must be equal to $P[X_{ch(i)}|X_{sp(i)},X_i^*]$ for some value $X_i^*$ consistent with $f(X_i)$. The simplest way to achieve this is to pick a particular “representative” value $X_i^*(f^*)$ for each possible value $f^*$ of $f(X_i)$, so that $f(X_i^*(f^*))=f^*$.
Example: in the digital circuit case, we would pick one representative “high” voltage (for instance the supply voltage $V_{DD}$) and one representative “low” voltage (for instance the ground voltage $V_{SS}$). $X_i^*(f(X_i))$ would then map any high voltages to $V_{DD}$ and any low voltages to $V_{SS}$.
Once we have our representative value function $X_i^*(f(X_i))$, we just have the children use $X_i^*(f(X_i))$ in place of $X_i$.
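Here's a minimal sketch of that for the circuit case (the specific voltage levels and the inverter child are stand-ins picked for illustration): the representative map sends every “high” voltage to one canonical high value and every “low” voltage to one canonical low value, and the child computes from the canonical value rather than the raw analog voltage.

```python
V_DD, V_SS, V_THRESHOLD = 3.3, 0.0, 1.65    # assumed supply, ground, threshold

def f(voltage):
    # The abstraction: keep only the binary value of the wire.
    return 1 if voltage >= V_THRESHOLD else 0

def representative(f_value):
    # X_i^*(f^*): one concrete low-level value per abstract value, chosen so
    # that f(representative(f_value)) == f_value.
    return V_DD if f_value == 1 else V_SS

def child_gate(voltage_in):
    # A child node (here an idealized inverter) originally defined on raw voltages.
    return V_SS if voltage_in >= V_THRESHOLD else V_DD

x_i = 2.9                                   # some messy "high" analog voltage
print(child_gate(representative(f(x_i))))   # 0.0 -- same output as child_gate(x_i)
```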
If we want, we could even simplify one step further: we could just choose f to spit out representative values directly. That convention is cleaner for proofs and algorithms, but a bit more confusing for human usage and examples.
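In code, that simplification just composes the two maps from the sketch above (reusing the hypothetical f and representative defined there):

```python
def f_canonical(voltage):
    # f composed with the representative map: the abstract value *is* a
    # canonical low-level value, e.g. 2.9 -> 1 -> 3.3.
    return representative(f(voltage))
```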