The previous post in this sequence discussed how to throw away information in causal DAGs. This post provides a detailed example using a Markov chain.

Suppose we have an n-state Markov chain (CS people: picture a finite state machine with n states and random state transitions at each timestep). A matrix of state transition probabilities gives the probability of transitioning to state when the system starts the timestep in state . Writing the state at time as , we have . More generally, , where denotes a matrix power (i.e. matrix-multiplied by itself times). To complete the specification, we’ll assume that the system starts in a random state at time , with the initial distribution given.

As a causal DAG, this system is just a chain: the state at time depends only on the state at time :

People typically draw basic Markov chains the same way we draw finite state machines: a graph with one node for each state, and arcs indicating transitions. Unlike an FSM, where the next arc is chosen by a symbol from some input stream, here the next arc is chosen randomly - so each arc has a probability associated with it. An example:

This is NOT a causal diagram, it is a state transition diagram. It says that, if the system is in state 1, then at the next timestep it will randomly transition to state 1, 2, or 5. (I haven’t included the probabilities on each arc; all that matters for our purposes is that each arc shown has nonzero probability.) Since we have two graph representations of the system (the state transition diagram and the causal DAG), I will generally refer to vertices in the state transition diagram “states” and vertices in the causal diagram as “nodes”.

What happens if we __throw away long-term-irrelevant information__ from a node in this Markov chain?

Here’s the idea:

- Pick the node
- Pick the set of nodes from to for some large-ish (we’ll denote this set )
- Throw away all info from which is not relevant to nodes outside : replace with , a representation of the function .

Let’s think about what that last piece looks like. could be any of the states 1 through 6; must assign different values to any two states with different distributions . But for large , many of the states will have (approximately) the *same* long-run distribution - this is the foundational idea of __ergodicity__. In the example above, nodes 5 & 6 will have the same long-run distribution, and nodes 2, 3, 4 will have the same long-run distribution.

To see why, imagine what happens if we start in state 5, assuming that the 5 -> 6 transition is much more likely than the 5 -> 5 transition. Well, since 5 -> 6 is much more likely than 5 -> 5, we’ll probably jump to state 6 next. And state 6 always jumps back to 5, so in two steps we’ll be back to 5. And so forth - back and forth, alternating between state 5 and 6 every timestep. But every one in awhile, we’ll jump from 5 -> 5, throwing the back-and-forth oscillation out of sync. If we imagine two copies of this chain running side-by-side, they’d start out oscillating in sync, but eventually drift out of sync. If we walk away for a while and look back at the chain much later, we’d expect that it’s roughly equally likely to be in state 5 or 6, regardless of which it started in.

That’s the key: if the chain started in state 5 or 6, with 5 -> 6 much more likely than 5 -> 5, than after a while, it would be roughly equally likely to be in state 5 or state 6. for large m. Even if 5 -> 6 is not much more likely than 5 -> 5, the two long-run distributions will still be the same - the long-run probabilities of 5 and 6 just won’t be roughly equal (we’ll stay in state 5 somewhat more often than 6).

A more general criteria:

- View the state transition diagram as a directed graph, and ask which states are connected in both directions - i.e. a set of states in which we can reach any state from any other by following the arrows
- Some arrows “knock oscillations out of sync” - read up on reducibility and ergodicity in Markov chains for technical details (I first saw this stuff in an operations research class)

If both of these criteria are met for some set of states, then each of those states i implies the same long-run behavior .

Getting back to our abstraction: doesn’t need to distinguish between states 5 and 6, or between states 2, 3, 4. Our states are grouped like this:

… and is A, B, or C. Our causal diagram looks exactly like before, with in place of :

We need to choose representative -values for each of A, B, C, so we’ll pick . So, if , then is B with probability 1 (since is 5 or 6, both of which map to B). is then chosen as though were 6, since 6 is our representative value for B.

Our abstract model no longer supports short-range queries around . To see what goes wrong, consider , assuming once again that 5 -> 6 is much more likely than 5 -> 5. In the original model, this gave rise to oscillation between states 5 and 6, so if the system was in state 6 at time , then it would most likely be in state 6 again at time . But in the new model, throws away information distinguishing states 5 and 6 - both are just “B”. If = 6, then = B, and behaves as though were the representative value 6 - implying that is 5, rather than 6. No match :(.

Yet this does not impact the validity of long-range queries at all! Because both and imply the same *long-run* predictions, the model *does* support long-range queries, like .

Finally, we can imagine cleaning up the model a bit by abstracting the whole chain, rather than just one node. Using the same info-throw-away transformation on every node, the abstraction looks like this:

Intuitively, not only do we have a Markov chain on the high-level variables , we also have enough information in the high-level model to predict correlations between low-level , as long as the 's we query are at least m timesteps apart. That's the property which makes this abstraction "natural" - more on that later.