Writing Causal Models Like We Write Programs

johnswentworth

Clunc

We’ll start with a made-up programming language called Clunc. The distinguishing feature of Clunc is that it combines classes and functions into a single type, called a clunc. It looks like this:

quad = Clunc {
    x = 4
    constant = 3
    linear = 2*x + constant
    result = x*x + linear
}

We could then go to a terminal:

>>> quad.result
27
>>> quad.linear
11

In order to use this clunc like a function, we apply the do() operator. For instance,

>>> quad3 = do(quad, x=2)

… creates a new clunc which is just like quad, except that x is 2 rather than 4:

>>> quad3
Clunc {
    x = 2
    constant = 3
    linear = 2*x + constant
    result = x*x + linear
}

When we query fields of quad3, they reflect the new x-value:

>>> quad3.result
11
>>> quad3.linear
7

There’s no designated “input” or “output”; we can use the do() operator to override any values we please. For instance:

>>> quad_zero_linear = do(quad, linear=0)
>>> quad_zero_linear
Clunc {
    x = 4
    constant = 3
    linear = 0
    result = x*x + linear
}
>>> quad_zero_linear.result
16

A few quick notes:

- Clunc is purely clunctional: everything is immutable, and each variable can only be written once within a clunc. No in-place updates.
- Clunc is lazy.
- Variables can be set randomly, e.g. “x = random.normal(0, 1)”.
- The do() operator creates a new clunc instance with the changes applied. If there are any random variables, they are re-sampled within the new clunc. If we want multiple independent samples of a randomized clunc M, then we can call do(M) (without any changes applied) multiple times.
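Clunc is made up, but its behavior is easy to emulate. Here is a minimal Python sketch, assuming we represent each field as either a constant or a function of the clunc instance. The names `Clunc`, `do`, `_fields` and `_cache` are choices made for this illustration, not part of any real library.

```python
import random


class Clunc:
    """A bundle of named fields; each is a constant or a function of the clunc."""

    def __init__(self, **fields):
        self._fields = dict(fields)
        self._cache = {}  # each field is evaluated at most once per instance

    def __getattr__(self, name):
        # Reached only when normal lookup fails, i.e. for clunc fields.
        fields = self.__dict__.get("_fields", {})
        if name not in fields:
            raise AttributeError(name)
        cache = self.__dict__["_cache"]
        if name not in cache:  # lazy: compute on first access only
            value = fields[name]
            cache[name] = value(self) if callable(value) else value
        return cache[name]


def do(clunc, **overrides):
    """Return a new clunc with some fields overridden; the original is untouched."""
    return Clunc(**{**clunc._fields, **overrides})


quad = Clunc(
    x=4,
    constant=3,
    linear=lambda m: 2 * m.x + m.constant,
    result=lambda m: m.x * m.x + m.linear,
)
quad3 = do(quad, x=2)
quad_zero_linear = do(quad, linear=0)

print(quad.result, quad.linear)    # 27 11
print(quad3.result, quad3.linear)  # 11 7
print(quad_zero_linear.result)     # 16

# A random field is re-sampled in every new instance, so calling do() with no
# overrides yields independent samples.
noisy = Clunc(x=lambda m: random.gauss(0.0, 1.0), result=lambda m: m.x * m.x)
samples = [do(noisy).result for _ in range(3)]
```

The per-instance cache is what makes a random field behave like a single write-once variable: within one instance every reference to `x` sees the same draw, while a new instance from do() gets a fresh one.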

To make this whole thing Turing complete, we need one more piece: recursion. Cluncs can “call” other cluncs, including themselves:

factorial = Clunc {
    n = 4
    base_result = 1
    recurse_result = do(factorial, n=n-1).result
    result = (n == 0) ? base_result : n * recurse_result
}

… and that’s where things get interesting.
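To check that the recursive definition really does terminate under lazy evaluation, here is the factorial clunc in a small Python emulation. This is a sketch under the assumption that fields are constants or functions of the instance; `Clunc` and `do` are hypothetical helpers written for this example.

```python
class Clunc:
    """A bundle of named fields; each is a constant or a function of the clunc."""

    def __init__(self, **fields):
        self._fields = dict(fields)
        self._cache = {}

    def __getattr__(self, name):
        fields = self.__dict__.get("_fields", {})
        if name not in fields:
            raise AttributeError(name)
        cache = self.__dict__["_cache"]
        if name not in cache:  # lazy: a field is computed only when requested
            value = fields[name]
            cache[name] = value(self) if callable(value) else value
        return cache[name]


def do(clunc, **overrides):
    """Return a new clunc with some fields overridden."""
    return Clunc(**{**clunc._fields, **overrides})


factorial = Clunc(
    n=4,
    base_result=1,
    # The clunc "calls" a modified copy of itself.
    recurse_result=lambda m: do(factorial, n=m.n - 1).result,
    # Only the chosen branch is evaluated, so recurse_result is never
    # requested once n reaches 0 and the recursion stops there.
    result=lambda m: m.base_result if m.n == 0 else m.n * m.recurse_result,
)

print(factorial.result)              # 24
print(do(factorial, n=6).result)     # 720
```

Without laziness, every instance would eagerly build `recurse_result` and the definition would recurse forever; here the `n == 0` instance simply never asks for it.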

Causal Models

Hopefully the mapping from clunc to probabilistic causal models is obvious: any clunc with random variables in it is a typical Pearl-style causal DAG, and the $\text{do}()$ operator works exactly like it does for causal models. The “clunc” is really a model, given by structural equations. The one big change is the possibility of recursion: causal models “calling” other models or other instances of themselves.

To get some practice with this idea, let’s build a reasonably-involved analogue model of a ripple-carry adder circuit.

We’ll start at the level of a NAND gate (levels below that involve equilibrium models, which would require a bunch of tangential explanation). We’ll assume that we have some model $M_{NAND}$, and we use $\text{do}(M_{NAND}, a = \ldots, b = \ldots).\text{result}$ to get the (noisy) output voltage of the NAND gate in terms of the input voltages $a$ and $b$. Since we’re building an analogue model, we’ll be using actual voltages (including noise), not just their binarized values.

We’ll take $M_{NAND}$ as given (i.e. assume somebody else built that model). Building everything out of NAND gates directly is annoying, so we’ll make an XOR as well:

$$
\begin{array}{l}
M_{XOR} = \text{Model}\ \{ \\
\quad a = 0.0 \\
\quad b = 0.0 \\
\quad \text{intermediate} = \text{do}(M_{NAND}, a = a, b = b).\text{result} \\
\quad \text{left} = \text{do}(M_{NAND}, a = a, b = \text{intermediate}).\text{result} \\
\quad \text{right} = \text{do}(M_{NAND}, a = \text{intermediate}, b = b).\text{result} \\
\quad \text{result} = \text{do}(M_{NAND}, a = \text{left}, b = \text{right}).\text{result} \\
\}
\end{array}
$$

This looks like a program which performs an XOR using NAND gates. But really, it’s a Pearl-style causal DAG model which uses a lot of NAND-submodels. We can write out the joint probability distribution $P[a = a^*, b = b^*, \text{intermediate} = i^*, \text{left} = l^*, \text{right} = r^*, \text{result} = \text{result}^* \mid M_{XOR}]$ via the usual method, with each line in the model generating a term in the expansion:

$$
\begin{aligned}
P[a = a^* \mid M_{XOR}] &= I[a^* = 0] \\
P[b = b^* \mid M_{XOR}] &= I[b^* = 0] \\
P[\text{intermediate} = i^* \mid M_{XOR}, a = a^*, b = b^*] &= P[\text{result} = i^* \mid \text{do}(M_{NAND}, a = a^*, b = b^*)] \\
P[\text{left} = l^* \mid M_{XOR}, a = a^*, \text{intermediate} = i^*] &= P[\text{result} = l^* \mid \text{do}(M_{NAND}, a = a^*, b = i^*)] \\
P[\text{right} = r^* \mid M_{XOR}, \text{intermediate} = i^*, b = b^*] &= P[\text{result} = r^* \mid \text{do}(M_{NAND}, a = i^*, b = b^*)] \\
P[\text{result} = \text{result}^* \mid M_{XOR}, \text{left} = l^*, \text{right} = r^*] &= P[\text{result} = \text{result}^* \mid \text{do}(M_{NAND}, a = l^*, b = r^*)]
\end{aligned}
$$

The full distribution is the product of those terms.
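To make the factorization concrete, here is a rough Python sketch of the XOR model. Since the NAND model is taken as given, the sketch has to invent one: it assumes an ideal analogue NAND of $1 - ab$ on voltages in $[0, 1]$ plus Gaussian noise. That noise model, and all the function names, are assumptions for illustration only. Sampling proceeds one model line at a time, and the log of the joint density is the sum of one log term per NAND line.

```python
import math
import random


def nand_mean(a, b):
    # Hypothetical ideal analogue NAND on voltages in [0, 1].
    return 1.0 - a * b


def sample_nand(a, b, rng, noise):
    # Stand-in for do(M_NAND, a=a, b=b).result: ideal output plus Gaussian noise.
    return nand_mean(a, b) + rng.gauss(0.0, noise)


def nand_logpdf(out, a, b, noise):
    # log P[result = out | do(M_NAND, a=a, b=b)] under the assumed Gaussian noise.
    z = (out - nand_mean(a, b)) / noise
    return -0.5 * z * z - math.log(noise * math.sqrt(2.0 * math.pi))


def sample_xor(a, b, rng, noise=0.01):
    # Ancestral sampling: one draw per line of the XOR model, parents first.
    v = {"a": a, "b": b}
    v["intermediate"] = sample_nand(v["a"], v["b"], rng, noise)
    v["left"] = sample_nand(v["a"], v["intermediate"], rng, noise)
    v["right"] = sample_nand(v["intermediate"], v["b"], rng, noise)
    v["result"] = sample_nand(v["left"], v["right"], rng, noise)
    return v


def xor_log_joint(v, noise=0.01):
    # Sum of log terms = log of the product of the four NAND factors.
    # a and b are treated as set by do(), so their indicator terms contribute 0.
    return (
        nand_logpdf(v["intermediate"], v["a"], v["b"], noise)
        + nand_logpdf(v["left"], v["a"], v["intermediate"], noise)
        + nand_logpdf(v["right"], v["intermediate"], v["b"], noise)
        + nand_logpdf(v["result"], v["left"], v["right"], noise)
    )


rng = random.Random(0)
draw = sample_xor(1.0, 0.0, rng)
print(round(draw["result"]), xor_log_joint(draw))
```

Swapping in a different NAND model only means changing `nand_mean`, `sample_nand` and `nand_logpdf`; the XOR code stays the same, which is the point of treating NAND as a submodel.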

That’s just the first step. Next, we need a full adder, a circuit block which computes the sum and carry bits for one “step” of binary long addition. It looks like this:

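$$
\begin{array}{l}
M_{\text{full\_adder}} = \text{Model}\ \{ \\
\quad a = 0 \\
\quad b = 0 \\
\quad c = 0 \\
\quad s_{ab} = \text{do}(M_{XOR}, a = a, b = b).\text{result} \\
\quad s = \text{do}(M_{XOR}, a = s_{ab}, b = c).\text{result} \\
\quad \text{carry}_{ab} = \text{do}(M_{NAND}, a = a, b = b).\text{result} \\
\quad \text{carry}_{c} = \text{do}(M_{NAND}, a = s_{ab}, b = c).\text{result} \\
\quad \text{carry} = \text{do}(M_{NAND}, a = \text{carry}_{ab}, b = \text{carry}_{c}).\text{result} \\
\}
\end{array}
$$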

As before, we can write out the components of the joint distribution line-by-line. I’ll just do a few this time:

$$
\begin{aligned}
P[a = a^* \mid M_{\text{full\_adder}}] &= I[a^* = 0] \\
&\;\;\vdots \\
P[s_{ab} = s_{ab}^* \mid M_{\text{full\_adder}}, a = a^*, b = b^*] &= P[\text{result} = s_{ab}^* \mid M_{XOR}, a = a^*, b = b^*] \\
P[s = s^* \mid M_{\text{full\_adder}}, s_{ab} = s_{ab}^*, c = c^*] &= P[\text{result} = s^* \mid M_{XOR}, a = s_{ab}^*, b = c^*] \\
&\;\;\vdots
\end{aligned}
$$

Notice that some of these involve probabilities on the model $M_{XOR}$, which we could further expand using the joint distribution of $M_{XOR}$ variables from earlier.

Finally, we can hook up a bunch of full adders to make our 32-bit ripple-carry adder:

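$$
\begin{array}{l}
M_{rc} = \text{Model}\ \{ \\
\quad a = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0] \\
\quad b = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0] \\
\quad f_0 = \text{do}(M_{\text{full\_adder}}, a = a[0], b = b[0]) \\
\quad f_i = \text{do}(M_{\text{full\_adder}}, a = a[i], b = b[i], c = f_{i-1}.\text{carry}) \quad \forall i \in 1, \ldots, 31 \\
\quad \text{result} = [f_0, \ldots, f_{31}] \\
\}
\end{array}
$$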

The components of the joint distribution for this one are left as an exercise for the reader.
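For readers who would rather simulate than expand, here is a rough Python sketch of the whole stack, from NAND up to the ripple-carry adder. It again assumes a made-up NAND model (ideal output $1 - ab$ plus Gaussian noise), and it simplifies the top level by returning the sum voltages and the final carry rather than the list of full-adder instances. All function names are hypothetical.

```python
import random


def nand(a, b, rng, noise):
    # Hypothetical stand-in for do(M_NAND, a=a, b=b).result.
    return 1.0 - a * b + rng.gauss(0.0, noise)


def xor(a, b, rng, noise):
    # Same wiring as the XOR model: four NAND submodels.
    intermediate = nand(a, b, rng, noise)
    left = nand(a, intermediate, rng, noise)
    right = nand(intermediate, b, rng, noise)
    return nand(left, right, rng, noise)


def full_adder(a, b, c, rng, noise):
    # Mirrors the full adder model line by line; returns (s, carry) voltages.
    s_ab = xor(a, b, rng, noise)
    s = xor(s_ab, c, rng, noise)
    carry_ab = nand(a, b, rng, noise)
    carry_c = nand(s_ab, c, rng, noise)
    carry = nand(carry_ab, carry_c, rng, noise)
    return s, carry


def ripple_carry(a_bits, b_bits, rng, noise=0.001):
    # Bits are least-significant first; the first adder uses carry-in 0.
    carry = 0.0
    sums = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry, rng, noise)
        sums.append(s)
    return sums, carry


def to_bits(n, width=32):
    # Integer -> list of 0.0/1.0 voltages, least-significant bit first.
    return [float((n >> i) & 1) for i in range(width)]


def from_voltages(voltages):
    # Binarize each noisy voltage at 0.5 and reassemble the integer.
    return sum(1 << i for i, v in enumerate(voltages) if v > 0.5)


rng = random.Random(0)
sums, carry_out = ripple_carry(to_bits(1234567), to_bits(7654321), rng)
print(from_voltages(sums))
```

Each function body is one submodel, and each call is one use of do() on that submodel with fresh noise, so the call graph of this program is the structure of the causal model.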

Why Is This Useful?

Classes/functions let us re-use code; we don’t have to repeat ourselves. Likewise, clunc-ish causal models let us re-use submodels; we don’t have to repeat ourselves.

Obviously this has many of the same advantages as in programming. We can modularize our models, and fiddle with the internals of one submodel independently of other submodels. We can “subclass” our models via the $\text{do}()$-operator, to account for different contexts. Different people can work on different submodels independently - we could even imagine libraries of submodels. An electrical engineer could write a probabilistic causal model representing the low-level behavior of a chip; others could then import that model and use it as a reference when designing things which need to work with the chip, like packaging, accessories, etc.

From a more theoretical perspective, when we write programs with unbounded runtime, we have to have some way to re-use code: there are only so many lines in the program, so the program must visit some of the lines multiple times in the course of execution. Some lines must be re-used. Likewise for probabilistic models: if we want to define large models - including unbounded models - with small/finite definitions, then we need some way to re-use submodels. We could do that by writing things like “$\forall i: \langle \text{submodel}_i \rangle$”, but if we want Turing completeness anyway, we might as well go for recursion.
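As a small illustration of an unbounded model with a finite definition, consider a model that flips a coin and, on “continue”, includes a fresh instance of itself. The sketch below is a made-up example, not something from the models above; `sample_chain_length`, `p_stop` and the `random.bernoulli` call in the comment are all hypothetical names.

```python
import random


def sample_chain_length(rng, p_stop=0.5):
    # Clunc analogue (hypothetical):
    #   chain = Clunc {
    #       stop = random.bernoulli(p_stop)
    #       rest = do(chain).result
    #       result = stop ? 0 : 1 + rest
    #   }
    stop = rng.random() < p_stop
    if stop:
        return 0
    # "Calling" a fresh instance of the same model re-samples its variables.
    return 1 + sample_chain_length(rng, p_stop)


rng = random.Random(0)
lengths = [sample_chain_length(rng) for _ in range(20000)]
print(max(lengths), sum(lengths) / len(lengths))
```

The definition is a handful of lines, yet the number of random variables in any particular instance has no fixed upper bound. The result follows a geometric distribution, so with `p_stop = 0.5` the average should come out close to 1.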

From a pure modelling perspective, the real world contains lots of repeating structures. If we’re modelling things like cars or trees, we can re-use a lot of the information about one car when modelling another car. We think of cars as variations on a template, and that’s exactly what the do() operator provides: we give it some “template” model, and apply modifications to it. The corresponding inverse problem then says: given a world full of things which are variations on some templates, find the templates and match them to the things - i.e. learn to recognize and model “cars” and “trees”. Clunc-ish causal models are a natural fit for this sort of problem; they naturally represent things like “corvette with a flat tire”.

Finally, the main reason I’ve been thinking about this is to handle abstraction. Clunc-ish models make layers of abstraction natural; lower-level behaviors can be encapsulated in submodels, just as we saw above with the ripple-carry adder. If we want to write abstraction-learning algorithms - algorithms which take in raw data and spit out multi-level models with layers of abstraction - then clunc-ish models are a natural form for their output. This is what multi-level world models look like.

Comment by xuan:

Belatedly seeing this post, but I wanted to note that probabilistic programming languages (PPLs) are centered around this basic idea! Some useful links and introductions to PPLs as a whole:
- Probabilistic models of cognition (web book)
- WebPPL
- An introduction to models in Pyro
- Introduction to Modeling in Gen

And here's a really fascinating paper by some of my colleagues that tries to model causal interventions that go beyond Pearl's do-operator, by formalizing causal interventions as (probabilistic) program transformations:

Bayesian causal inference via probabilistic program synthesis
Sam Witty, Alexander Lew, David Jensen, Vikash Mansinghka
https://arxiv.org/abs/1910.14124
Causal inference can be formalized as Bayesian inference that combines a prior distribution over causal models and likelihoods that account for both observations and interventions. We show that it is possible to implement this approach using a sufficiently expressive probabilistic programming language. Priors are represented using probabilistic programs that generate source code in a domain specific language. Interventions are represented using probabilistic programs that edit this source code to modify the original generative process. This approach makes it straightforward to incorporate data from atomic interventions, as well as shift interventions, variance-scaling interventions, and other interventions that modify causal structure. This approach also enables the use of general-purpose inference machinery for probabilistic programs to infer probable causal structures and parameters from data. This abstract describes a prototype of this approach in the Gen probabilistic programming language.
