# 16

This is an edited transcript of Scott Garrabrant's May 23 LessWrong talk on finite factored sets. Video:

Since there are existing introductions to finite factored sets that are shorter (link) or more formal (link), I decided to do things a bit differently with this transcript:

• I've incorporated Scott's slides into the transcript, and cut out some parts of the transcript that are just reading a slide verbatim.
• There was a text chat and Google Doc attendees used to discuss the talk while it was happening; I've inserted excerpts from this conversation into the transcript, in grey boxes.
• Scott also paused his talk at various times to answer questions; I've included these Q&A sections in white boxes.

# Finite Factored Sets

## Some Context

Scott: I'm going to start with some context. For people who are not already familiar with my work: My main motivation is to reduce existential risk. This leads me to want to figure out how to align advanced AI. This leads me to want to become less confused about intelligence and about agents embedded in their environment. I think there are a lot of hard open problems in trying to figure out how to do this.

This leads me to do a bunch of weird math and philosophy, and this talk is going to be an example of some of that weird math and philosophy.

## Factoring the Talk

Scott: This talk can be split into  parts. For example, this slide is in part 1 (the combinatorics talk) and .

I recognize that this table of contents has a lot more table and a lot less content than tables of contents normally have. But that's okay, because we'll have another table of contents in 6 more slides.

## Set Partitions

Scott: All right. Let's do some background math.

A partition of a set  is a way to view  as a disjoint union. You have a bunch of sets that are disjoint from each other, and you can union the sets together to get .

We'll also be using the lattice structure on partitions:

• We'll say that  is finer than  (and  is coarser than ) if for all elements , if  and  are in the same part in , they have to be in the same part in . So if  is finer than , then you can think of  as making some distinctions; and  has to make all of those distinctions, and possibly some more distinctions.
• The common refinement of two partitions  and  is the coarsest partition that is finer than both  and . So it has to make all of the distinctions that either  or  make, and no more.

This picture on the right is all of the partitions of a four-element set, and the graph structure is showing the "finer" relation, with finer things closer to the bottom.

## Set Factorizations

Scott: Now for some new stuff.

A factorization is a set  of non-trivial partitions ("factors") of some set , such that for each way of choosing one part from each factor in , there exists a unique element of  in the intersection of those parts.

A factorization is basically a multiplicative version of a partition. A factorization of  is a way to view  as a product, in the same way that a partition of  is a way to view  as a disjoint union.

Scott: If we have a factorization  of our set , there will be a bijection between  and the product . This bijection comes from sending an element  to the tuple consisting only of parts containing that element, .

Because of this, the size of  is always going to be equal to the product of the sizes of 's factors.

This bijection is basically restating the definition above. It's saying that for each way of choosing one part from each factor, we're going to have a unique element in the intersection of those parts.

Another thing we can observe is that any factor in a factorization must always be a partition into parts of equal size, because otherwise you're not going to be able to make this bijection work out.

We will often be working with a finite factored set, which is a pair , where  is a finite set and  is a factorization of .

This definition is a little bit loopy. There's this thing where I'm first introducing , and I'm saying that  is a collection of partitions of . But there's another way you can view it in which you first define , which is just a collection of sets. Then you take the product of all of the sets in , and this gives you a set , which is tuples of the product of the elements in . Then you can identify, with each element of a , the subset of the space of tuples that you get by projecting onto that part in the tuple.

So there's this temptation to define  first and then define  from , and there's also a temptation to do the opposite. This bijection is saying that we can take either lens.

Scott: To me, this is an extremely natural notion. I want to really push this naturalness, so I want to describe how factorization is dual to the definition of a partition.

You can see here I have two alternative definitions of partition and factorization.

For partition: We have a bunch of (non-empty) subsets of ; we have functions to , which are the inclusions; we can add all these functions together to give a function from their disjoint union to ; and we want this function to be a bijection.

Similarly, a factorization is a set of (non-trivial, meaning not having exactly one element) partitions of  such that the obvious function to the product of the elements of  from  is a bijection. So for each partition in , there's this projection onto that partition; and we can multiply all of these together to get a function from  to the product, and we want this function to be a bijection.

You can see that in these two definitions, I’m pointwise dualizing a bunch of the words to show that there's a duality between partitions and factorizations. You'll notice that there’s also a duality between subsets and partitions. You can view a partition as "that which is dual to a subset," and you can also view it as "that which is built up out of subsets in a certain way." When I say that factorizations are dual to partitions, I'm using the "that which is built up from subsets in a certain way" conception of partitions.

I'm first going to give some examples, and then I'll open up the floor for questions about factorizations.

## Enumerating Factorizations

Scott: Exercise: What are the factorizations of a four-element set, ?

.

.

.

.

.

.

.

.

.

.

.

.

The first thing to observe is that given any set with at least one element, we're always going to be able to define at least one factorization, a sort of trivial factorization.

This is a kind of one-dimensional factorization. We have one factor, which is the discrete partition.

Recall that in the definition of factorization, we want it to be the case that for each way of choosing one part from each factor, we want a unique element of  in the intersection of those parts. The trivial factorization satisfies that.

What if we want a nontrivial factorization? Well, recall that the product of the sizes of the factors always has to equal the size of . The only way to express  as a nontrivial product is to express it as a  product. So we're looking for two partitions/factors that are going to combine to give us .

All of our factors also have to break  into parts of equal size. So we're looking for  partitions that each break our -element set into  parts of size . There are exactly three partitions that break a 4-element set into  parts of size , which separates the small numbers from the large numbers; , which separates the even numbers from the odd numbers; and , which separates the inner numbers from the outer numbers.

We're going to be able to choose any pair of these. So we're going to have three more factorizations:

The point here is that for each way of choosing, you're going to get a unique element that is the conjunction of those two choices. "Small or large?" and "even or odd?" jointly pick out a specific element of .

Scott: In general, given an -element set, you can ask how many factorizations there are of that set. We have the sequence here.

You can notice that if  is prime, there’s only one factorization, which is the trivial one.

A very surprising fact to me was that this sequence did not appear on OEIS. OEIS is a database that combinatorialists use to share their sequences of integers, and check whether their sequences have been studied by other people and look for connections. This sequence didn't show up, even when I did minor modifications to deal with the degenerate cases.

I just don't get this. To me this basically feels like the multiplicative version of the Bell numbers. The Bell numbers count how many partitions there are in a set of size . It's sequence #110 out of over 300,000 on the OEIS. And yet this sequence doesn't show up at all. I don't really have an understanding of why that happened.

Okay. Now I'm going to pause for questions.

# The Main Talk (It's About Time)

Scott: We can't talk about time without talking about Pearlian causality. This paradigm allows you to, given a collection of variables and a joint probability distribution over those variables, infer causal/temporal relationships between the variables.

(In this talk, I'm going to be using "causal" and "temporal" interchangeably, though I think that there might be more to say there philosophically.)

Pearl can infer temporal data from statistical data. I'm not going to explain exactly how it works, because I'm not going to be technically dependent on it. This is really great. There's this adage, "Correlation does not imply causation,” and I think that it's just wrong and Pearl is correctly inferring causation from correlation all over the place using the combinatorial structure of the correlation.

The way that he does this is by inferring these graphs, like the one on the left. I think this graph is not entirely inferable from a probability distribution on the variables as it is, but some graphs are. You can definitely infer from a probability distribution that came from this graph, using the Pearlian paradigm, whether or not "the ground is wet" is before "the ground is slippery," for example.

Scott: The Pearlian paradigm is great. It's my main starting point. However, I claim that it's cheating: the "given a collection of variables" is actually hiding a lot of the work.

Normally people emphasize the joint distribution in saying that Pearl is inferring temporal data from statistical data, but I think that what's actually going on is that Pearl is inferring temporal data from statistical data together with factorization data.

For example, let's say all the variables in this picture are binary. The choice that we're working with these 5 variables, as opposed to just working with 32 distinct worlds without labels, is actually doing a lot of the work.

I think this issue is also entangled with a failure to adequately handle abstraction and determinism.

Let's say you're given these 5 variables and a joint probability distribution on the variables. One thing you could do is say, "I'm going to forget the variables that were given to me, and I'm just going to consider all Bell number of 32 different ways that the world can be."

And you could try to do Pearlian inference on this set of variables. The set is huge, but I'm working from an ontology where I don't care about things being huge. But you're going to run into a problem, which is that many of these variables that you define are going to be deterministic functions of each other, or partially deterministic functions of each other. Some of your variables/partitions will be coarsenings of other variables/partitions.

It turns out that the Pearlian paradigm, or the part that I'm trying to copy here, doesn't work well with all of this determinism. So you can't just throw away the variables you were given.

That's why I think that this cheating I'm accusing the Pearlian paradigm of doing is entangled with not being able to handle abstraction and determinism. By "abstraction" I mean I want some of my variables to be able to be coarser or finer copies of each other, which is similar to determinism.

## We Can Do Better

Scott: So we're going to do something new.

• Where Pearl was given a collection of variables, we're going to take all partitions of a given set.
• Where Pearl inferred a directed acyclic graph, we're going to infer a finite factored set.
• In the Pearlian world, we could read off of our directed acyclic graph something that we'd want to call time. When we have a path from one node to another in a DAG, that means that the first node is before the second node.

In the factored sets paradigm, we're going to have to define our own way of looking at a finite factored set and telling when some partitions are before or after other partitions.
• Pearl draws special attention to when two nodes in your DAG have no common ancestor. This corresponds to orthogonality, or independence. (In this talk, I'm going to use the word "independence" when I'm talking about probability, and I'm going to use the word "orthogonality" when I'm talking about combinatorial properties.)

We're going to be able to read something analogous off of our finite factored sets.
• Pearl has a concept, d-separation, which is kind of complicated. And the reason d-separation is interesting is that it's equivalent to conditional independence in all probability distributions you could put on your DAG.

We're going to define an analogous concept, conditional orthogonality, which we're going to be able to read off of a finite factored set.
• In the Pearlian world, we have that d-separation gives a compositional graphoid. In our world, we're going to have a compositional semigraphoid satisfied by conditional orthogonality. (There's a fifth graphoid axiom that we're not going to satisfy, but I argue we didn't want to satisfy it in the first place.)
• As with d-separation in Pearl, we're going to have a result saying that conditional orthogonality exactly corresponds to conditional independence in all probability distributions that you could put on your structure.
• We're going to be able to do temporal inference.
• And we're going to discuss some applications.

This is a green slide, so it's a table of contents slide; here are some slide numbers.

## Time and Orthogonality

Scott: Before we define time and orthogonality, I'm going to define something called history.

We're starting with a fixed finite factored set, and that’s our context.  is going to be a partition of , and the history of  is the smallest set of factors  such that, if I take an element of  and hide it from you, and you want to know what part in  it's in, it suffices for me to tell you what part it's in within each of the factors in .

So if I want to determine the -value of an element of , it suffices for me to know the value of all of the factors in . So what's written here is saying that if two elements agree on all of the factors in , then they have to agree on .

(Whenever I have a partition, I can kind of think of it as a variable. So when I say "the -value of an element," I mean "what part it's in within .")

This definition is the central starting point. We're going to define everything from this, so it's worth pausing on.

You can kind of think of factors as basic properties from which everything can be computed. And I want to know which factors are going into the computation of , which factors determine the value of .

Scott: Then we're going to define time. We'll say that  is weakly before  if the history of  is a subset of the history of . And we'll say that  is strictly before  if the history of  is a strict subset of the history of .

One way you might want to think about this is that the history of  is kind of like a past light cone. One spacetime point being before another spacetime point means that the past light cone of the first point is a subset of the past light cone of the second point. That gives some intuition as to why one might want to think of a temporal relationship as a subset relationship.

We're also going to define orthogonality. We’ll say that partitions  and  are orthogonal if their histories are disjoint.

I think I actually want to pause for questions here, so let's do that.

## Game of Life

Scott: Let's let  be the set of all Game of Life computations starting from a board of some size.

I don't want to have to deal with weird stuff about the boundary, and I also want it to be a finite factored set. The way that I'm dealing with this is I'm looking at this pyramid of only the cells that are computable from the initial board state.

So you can imagine starting with your  board, and we're just pushing the computation forward in time. (In the picture, going up is forward in time.) And as we progress through the time steps in Game of Life, our board is getting smaller, because the boundary is shrinking in to say, "I don't know what this is anymore, because it's dependent on stuff outside the boundary." So we have this set of all pyramids of Game of Life computations.

Because Game of Life is deterministic, we actually have a bijection between the set of all Game of Life computations and the set of all initial board states. And we have an obvious factorization of the set of all initial board states, which is that we have one factor  for each cell that is the partition that separates out "Is that cell alive or dead at time ?".

We can also talk about other partitions. There are a whole bunch of partitions on this huge structure.

I'm going to draw attention to partitions that take a cell and take a time step, and ask the question "Is this cell alive or dead at this given time step?", which is going to separate out your space into two sets.

In this picture on the right, I have three (cell, time) pairs—the red cell, the blue cell, and the green cell—which correspond to partitions. And the grid on the bottom is the initial state, time .

The multi-cell squares in the initial board state are the history of the red cell, the blue cell, and the green cell. So the light red squares on the bottom are saying, "Which cells at time  do you need to know the value of in order to be able to compute the value of the red cell at time ?".

This, again, pushes the picture that histories are like past light cones. We can see that the history of the red partition is a subset of the history of the blue partition. And so the partition corresponding to “Is that red cell alive or dead?” is before the partition corresponding to “Is that blue cell alive or dead?”.

The red partition is orthogonal to the green partition. And the blue partition is not orthogonal to the green partition, because their histories intersect.

The blue cell and the green cell are almost orthogonal, in that if I conditioned on the values of these two cyan squares at the bottom, then they would become orthogonal. So that's what we're going to talk about next—conditional orthogonality, or "When are things orthogonal conditional on certain assumptions?".

## Conditional Orthogonality

Scott: To define conditional orthogonality, we're first going to define conditional history. The conditional history of  given  is the smallest set of factors (again, smallest in the subset ordering) satisfying these two conditions.

The first condition is basically the same as in history, except we make this assumption that we're in . So the first condition is saying that if I want to know the value of , and I'm working under the assumption that my point is in , it suffices for me to know the value of all of the factors in .

The second condition is going to be a little hard to motivate. It's not going to mention ; it's just going to be about  and . The second condition says that if I take an element of  and I want to determine whether or not it's in , this process of determining whether or not it's in  can be parallelized:

• I can look at the factors in  and ask, "Is the assignment of values to the factors in  consistent with my point being in ?"
• And then I can separately look at the factors in  and ask, "Is the value of the factors in  consistent with my point being in ?"

And then if both of these questions return "true," then my point has to be in .

I'll say that without the second condition, conditional history would not even be well-defined. I do not think it should be obvious that it's well-defined anyway! In order for this history to be well-defined, I need to be able to say that there's a smallest set of factors (in the subset ordering) satisfying these conditions.

One way to see why we need the second condition is: You could imagine that we have a finite factored set that consists of two bits. There's my bit and there's your bit, and they're each either  or . And my factorization is the obvious factorization, with one factor for “What is my bit?” and one factor for “What is your bit?”. So then the history of “What is my bit?”, the history of the partition that separates out the different values that my bit can be, is just going to be the one factor that is “What is my bit?”.

But then imagine conditioning on the assumption that our bits are the same. In that case, knowing my bit is sufficient to know my bit, and knowing your bit is sufficient to know my bit, but the intersection of knowing my bit and knowing your bit is empty.

This is why we need the second condition. The second condition is saying that the act of conditioning on the fact that the two bits were the same, entangled my bit with your bit; so you're not allowed to then take one and not the other. This makes conditional history well-defined.

From conditional history we're going to be able to define conditional orthogonality. We say that  and  are orthogonal given the subset  if their conditional histories are disjoint. And we'll say that  and  are orthogonal given a partition, if  and  are orthogonal given each individual part in .

I think it's pretty hard to motivate this definition. If I go over to categorical language, I think I have a picture that makes it look a little bit nicer than this, or a little bit more motivated. But I'm mostly just going to appeal to this definition's consequences to try to convince you that it's a good definition, as opposed to trying to push that it's really natural.

## Compositional Semigraphoid Axioms

Scott: Here's some consequences. This is just a bunch of properties showing that conditional orthogonality is pretty well-behaved.

There are five graphoid axioms, of which the first four are the semigraphoid axioms. And then there's the axiom of composition, which is the fifth item on the above list and is the reason we have a "compositional" semigraphoid.

The field that works with Pearlian inference sometimes talks about these axioms. When they talk about it, they talk about it with sets of variables; we're instead working with partitions. So where they took a union of a set of variables, we're instead going to work with a common refinement. But that's all an easy translation.

All of these properties have a  in all of the conditions. You can just ignore the  and think of it without the  if you want to.

This first property says that  is orthogonal to  if and only if  is orthogonal to , which hopefully should make sense.

The second and the fifth are converses that together say that  is orthogonal to the common refinement of  and , if and only if  is separately orthogonal to  and to . So they're collectively saying that being orthogonal to a common refinement is equivalent to being orthogonal to each of the pieces.

That's especially nice, because it's saying that if you want to know whether something is orthogonal, it suffices to break it up however you want and check whether it's orthogonal to each individual piece.

d-separation in DAGs also satisfies all these properties.

## The Fundamental Theorem

Scott: Those were just some nice properties. The main thing I want to push is that conditional orthogonality is really a combinatorial fragment of conditional independence.

We're going to show that  is orthogonal to  given  if and only if  is independent of  given  in all probability distributions that you can put on that factored set. (If you divide both sides by the probability of  in the equation, it's more obvious that this is asserting conditional independence.)

And so I need to explain what it means to put a probability distribution on a factored set.

A probability distribution on a finite factored set  is just a probability distribution on  that factors. What I mean by “factors” is that you can view it as coming from a probability distribution on each of the individual factors in your factorization. I take a probability distribution on  for each , and then I multiply together all those probability distributions and take the product of those measures to get a probability distribution on .

We'll call a probability distribution on  a probability distribution on  if it can be constructed in this way; or equivalently, if the probability of any given element is equal to the product of the probabilities of each of the parts in a factor that contain that element.

And then we get the fundamental theorem of finite factored sets. We can also probably extend this theorem to talk about all probability distributions in general position, or something like that, rather than talking about all probability distributions; but I'm not going to worry about that right now.

This theorem is going to allow us to infer orthogonality data from probabilistic data. If we're given a bunch of probabilistic data—for example, a big probability distribution on our set—we can use the fundamental theorem to tell us when things should be orthogonal.

Next, we're going to show how to infer temporal data from orthogonality data. And we're going to do that working entirely combinatorially; so this slide is actually the only place I'm going to mention probability. Part of the thing that I like about this is that I kind of push all the probability away and think about these questions just using combinatorial properties.

I think I'm going to take another question break.

## Temporal Inference

Scott: We're going to start with a set , which is going to be our sample space—although I don't want to carry around any probability intuitions. I want to think about everything combinatorially, and I want the only place probability shows up to be the previous slide.  is a finite set, and it’s something like the space of all possible observations we can make.

One thing you might think that we'd want to do would be to infer a factorization of , but that's not what we're going to do. Instead, we're going to find a factored set model of . We're doing this because I'm thinking of  as the observably distinct situations that we can be in, and I want to allow for inferring latent structure.

There was a question about how to make a factored set out of a three-node Pearlian DAG. There were eight possibilities, but I said, "We don't want a factorization on an 8-element set; we want a factorization on a 32-element set, and then we want to send that over to an 8-element set with a function." In that example, the function  was latent structure.

In the model :

• Not assuming  is injective is what allows for latent structure.
• Not assuming  is surjective means that it's okay if some of the elements of  don't actually get hit. This is useful if we just want to start with some variables and take their product and have that be my sample space. I don't want to assume that every combination of variables is necessarily possible; and that's what not assuming surjectivity is going to do.

Given a partition of , we can send that partition backwards across the function  to give a partition of . Which is just saying that if you have all of your parts, and then you take the preimage as functions of the parts, this will give you a partition of your original space.

We also have an orthogonality database on , which is a collection of rules about orthogonality in . One place you could think of this database as coming from is a probability distribution. You sample a whole bunch of times, and you get a probability distribution, and whenever you observe that two things are independent you'll put in a rule saying they're orthogonal. And then whenever you observe that things are not independent, you put in a rule that says those things should not be orthogonal. Which makes sense because of the fundamental theorem.

But you could also imagine that orthogonality database coming from anywhere else. This is entirely a combinatorial notion.

A "rule" means putting a triple of partitions either in the set  (for "orthogonal") or the set  (for "not orthogonal").

So we have this collection of rules, and we want to infer a factored set model. We're not actually going to be able to infer a full factored set model, because whenever you have one model, you can always just add in more factors and add in more stuff, and then just delete all that data. Instead, we're going to be able to infer properties of factored set models—properties that all factored set models that satisfy our database are going to have to satisfy.

One of these properties is: we'll say that  is before  according to the database  if  of  is before  of  for all models that satisfy our database.

This is what we're going to mean by temporal inference. Now we have a mathematical specification of (though not a computational procedure for), "Given a bunch of orthogonality rules and non-orthogonality rules, when are those rules sufficient to infer a temporal relationship?"

I've specified this combinatorial notion of temporal inference. But I haven't shown this notion isn't vacuous. So there are these questions: “Are we ever going to be able to infer time with this?" "How is our approach to temporal inference going to be able to compare to Pearlian temporal inference?”

Pearlian temporal inference is actually very strong. If you have a big system with a bunch of variables, you're very likely going to be able to infer a lot of temporal relations from it. Can we keep up with that?

To answer that question, we're going to look at an example.

## Two Binary Variables (Pearl)

Scott: If  and  are independent, then we're not going to have a path from  to  or vice versa. If  and  are not independent, then maybe there's a path from  to , maybe there's a path from  to , maybe there's some third node that has a path to both of them.

In either case, we're not going to be able to infer any positive temporal relationships.

To me, this feels like it's where "correlation does not imply causation" comes from. If you just have two variables, you can't really infer causation; Pearl needs a lot more variables in order to have a richer combinatorial structure that will allow his approach to infer time.

But the natural next question is: "Is  independent of ?"

In the Pearlian ontology, we have  and , and those are our variables, and we ask whether they're independent.

In the factored sets ontology, we take our  and our , and the first thing we do is we take a product of  and , and now we have four different possibilities. And then we forget all our labels, and we consider all possible variables on these four possibilities. And one of those variables is —are  and  the same?

And if  is independent of , then the finite factored set paradigm is actually going to be able to conclude that  is before .

So not only is the paradigm non-vacuous, and not only can it keep up with Pearl and infer things that Pearl can't, but also we can use it to infer time even from two binary variables, which is kind of the simplest possible thing that you could hope to be able to do.

Pearl is also going to be able to infer some things that we can't, but as a reminder, Pearl is using extra assumptions. Relative to our ontology, there's extra factorization data that Pearl is using; and I'm not exactly sure where one would get that extra data.

## Two Binary Variables (Factored Sets)

Scott: Let  be the set of all binary strings of length two. We're going to define an orthogonality database which is only going to have two rules.

• The first rule is that  ("What is the first bit?") is orthogonal to  ("Do the bits match?"). The  here is just saying we're not making any conditions. (You don't have to worry about the complicated conditional orthogonality definitions for this proof; everything is just going to be normal orthogonality, which will make it easier to think through.)
• Second,  is not orthogonal to itself. This just means that  is nondeterministic; both of the parts in  are supported by our function , both parts are possible.

If we got this orthogonality database from a probability distribution in general position, we would have a lot more rules in our database. We would have a rule for basically every triple of partitions. But inferring time is going to be monotonic with respect to adding partitions: adding more rules just lets you infer more stuff. So we can just work with the smallest database that we're going to need in order to be able to show that  ("What is the first bit?") is before  ("What is the second bit?").

You should probably just assume that everything is not orthogonal to itself; but " is not orthogonal to itself" is the only statement of that form we're going to need.

So, theorem:  is before . Which, recall, means that  is before  for all factored set models.

Proof. First, we’re going to prove that  is weakly before :

• Let  be a model that satisfies our database, and write  for the history of , etc. Since  is orthogonal to , we're going to have that  and  are disjoint.

If the  makes it a little bit harder to see what these things are saying, you can kind of ignore that part. If you just think of , and  as partitions of , where , it might be a little bit easier to understand what's going on here.

• Because  is in  is orthogonal to , which means the history of  is disjoint from the history of .
• Because  is not orthogonal to itself, the history of  is not disjoint with itself, i.e., it's not empty.
•  is a coarsening of the common refinement of  and . And since we can compute the value of  from the values of  and , we can also compute the value of  from the values of  and . Which means that the history of , has to be a subset of .

To see why this is, recall that a history is the smallest set of factors I need to be able to compute a partition's value. If I have a set of factors that is sufficient to compute the value of  and sufficient to compute the value of , that's going to be sufficient to compute the value of , because  is a deterministic function of .

• We also get this subset relation for any other combination of , and . So , and .
• Since , and  is disjoint from , we get that —which is to say,  is (weakly) before  in all models.

Let's extend this further, and show that  is strictly before .

• Assume for the purpose of contradiction that . Since , we'll then just have that  is a subset of , because  will be .
• But  is disjoint from , which means that  is just going to have to be empty. But we assumed that  is nonempty; so this is a contradiction.

Thus  is not equal to . So  is a strict subset of , and  is before

Proofs in this space largely look like this. We look at a bunch of histories of our variables, we look at various different Boolean combinations of our histories, and we keep track of when they are empty and non-empty; and we use this to eventually infer that some of our histories are subsets of others.

I haven't spent much time on temporal inference, and I haven't really looked into the computational problem of inferring time from orthogonality databases yet. I basically just have two proofs of concept: this one, and another proof of concept that uses something that looks a little bit more like screening off, and uses conditional orthogonality to infer time in a more complicated situation.

## Applications / Future Work / Speculation

Scott: I'm going to break some directions I want to go in into three categories. The first category is inference, which has a lot of computational questions. The second category is infinity, which has more mathematical questions. And the third category is embedded agency, which has more philosophical questions.

### Inference

Decidability of temporal inference — I haven't been able to even prove whether the temporal inference method I just described is decidable, because there are infinitely many factored set models that you might have to consider.

I suspect it is decidable. Given the size of , there is an upper bound on how complicated of an  you have to consider; and I conjecture that this upper bound is small, and probably computable by some simple function.

There's then the further question of whether we could do efficient temporal inference.

Conceptual inference — In our example, we inferred that  is before , and we kind of inferred this structure that what's going on is that the universe is computing  and , and then it's computing  from  together with . So  and  look like primitive properties, and  looks like a derived property that you might compute from  and . And I think that this "being a primitive property" is actually also pointing at something about what kind of concepts you want to use.

Imagine that I have a number that's either 0 or 1, and it's also either blue or green. I can define a property called "bleen," which is that the number is a blue 0  a green 1. Now I have these three different properties I can talk about: number, color, and bleen-ness.

And we're doing something that looks like inferring that color is primitive, while bleen-ness is derived.

This system has the potential to be able to distinguish between blue and bleen. I think finite factored sets have the potential to do to "there's no principled difference between blue and bleen" what Pearl did to the adage "correlation does not imply causation". So that's a thing I'm really excited about maybe being able to do.

Temporal inference from raw data and fewer ontological assumptions — We can do a more end-to-end form of temporal inference, one that doesn't have to see where our variables are coming from.

We can do temporal inference with deterministic relationships

Time without orthogonality — You could actually imagine starting with a huge factored set with a large number of dimensions, and then taking some  and forgetting a lot of the structure and just taking some weird subset of our big factored set.

If you do this, it might be the case that there is no orthogonality besides the vacuous orthogonalities. There might be no orthogonality in , but you can still make inferences from the probability distribution. The probability distribution will still lie on some algebraic curve that's pointing at the structure. And so even when there's no orthogonality at all, you might still be able to infer temporal relationships just from the probabilities.

Conditioned factored sets — Instead of assuming there's a factored set model of , you could imagine a partial factored set model where my  can be a partial function.

This kind of corresponds to pretending that you had a factored set and you got some samples from , but those samples got passed through a filter. And so you can try to do inference on the more general class where maybe your data also went through a filter before you got it.

### Infinity

Extending definitions to the infinite case — I talked a little bit about this earlier. I think that the fundamental theorem should be able to be extended to finite-dimensional factored sets, where by "finite-dimensional" I mean that  is finite, but  is infinite. But my proof assume finiteness, not just finite dimension.

Continuous time — Here's something I'm pretty excited about. Basically, I think that Pearl is not really ready to be continuous. There’s this blog post by Eliezer called Causal Universes that pointed out that physics kind of looks like a continuous Bayes net. But we didn't have a continuous Bayes net, and we still don't have a continuous Bayes net; and I feel like we're a lot closer to having a continuous Bayes net now.

Everything's still finite, and so we're running into problems there. But I think that once we’ve figured out this infinity stuff, we're going to be able to make things continuous, and I think that for two reasons.

1. The Pearlian world is really dependent on parents. We really cared about if this node was a parent of this node. In the finite factored sets paradigm, we only care about the analog of ancestor. So we're not really talking about an edge from this node to this node; we're only talking about “Are there paths?”, which is a lot closer to being able to be continuous.
2. We have this ability to zoom in and out on our variables. Instead of having fixed variables, we're taking them to be partitions, which allows for this zooming in and out.

I think that because of these two properties, we're a lot closer to being able to have not just continuous probabilities, but continuous time—continuity in the part that in Pearl was the DAG.

New lens on physics — Finite factored sets might be able to shine some new light on physics. I don't know; I'm not a physicist.

### Embedded agency

I'm mostly viewing finite factored sets as the successor to Cartesian frames. In fact, if you look at the definition of a factored set model, it's almost equivalent; a binary factored set model is basically a Cartesian frame.

So I want to be able to model things like embedded observations. I want to be able to take a partition that corresponds to an action of an agent and take another partition that corresponds to an observation, and be able to say, "Ah, the thing that is deciding, this agent, kind of gets to do whatever functions it wants on this other thing." And I actually have some preliminary thoughts on how to define observations in a factored set world.

A point I want to make here is that in these embedded agency applications, I'm largely doing something different than I was doing in the temporal inference set. I focused in this talk on being able to do temporal inference because I think it's really concrete to be able to say, "Hey, we can do something like Pearlian temporal inference, but better and with fewer assumptions."

When I'm thinking about embedded agency, I'm not really thinking about doing inference. I'm just thinking about being able to set up these factored set models where I would otherwise set up a DAG.

Or it doesn't even need to be a DAG. Anywhere we would want to set up a graph that has edges between nodes that correspond to causality, or information flow, or something like that, there's some hope of being able to translate that over to the factored set world and thus be able to have it work better with abstraction.

So I'm not really thinking about where these things are coming from. I'm more thinking about "How we can use this as a new way of modeling things that we would otherwise model with graphs?"

So:

Embedded observations — I have a definition of embedded observation. It looks a lot like the Cartesian frames version, except it's cleaner.

Counterfactability — I have a notion of counterfactability that I can get out of factored sets. It's basically taking the position that you can't always take counterfactuals, but some places you can. I'm using the fact that a factored set takes your world and views it as a bunch of tuples to give me "interventions." My "interventions" correspond to coming in and tweaking some of the values in the tuples. So the factored set is also giving you a theory of how to do interventions.

Cartesian Frame successor — I think this framework is largely capturing a lot of the Cartesian frames content.

Unraveling causal loops — This is largely what I talk about in Saving Time. It feels like often we have these causal loops, and it feels like a bottleneck to being able to not have causal loops in our picture is to be able to have something that has time and also abstraction in the picture. And that's kind of what we're doing.

Conditional time — This is observing that just like we defined conditional orthogonality, we could define conditional time. We just take conditional time to be that the conditional histories are subsets of each other.

I have some hope that this might actually be interesting and useful, especially for the unraveling causal loops thing. I think that conditional time allows you to have multiple different timelines on your structure simultaneously, and have them be concretely playing together. I'm not sure whether it's actually going to end up being useful, but I'm just noticing that there's a very tempting way to define conditional time.

Logical causality from logical induction — One of the first things that we did after finding logical induction was to ask, "Can we extend this to a theory of logical counterfactuals?" And we said, "How can we do this? Well, we could take our big probability distribution over logical sentences that we have at any given time. Can we do Pearl to that?”

And the answer is, "No, because there's a whole bunch of determinism everywhere."

I think that we still can't just go and do logical causality from something like logical induction. Basically, the finite factored set framework kind of doesn't believe in the uniform distribution on a four-element set. And I have some hopes of being able to fix this, but if you imagine a four-element set and the uniform distribution on it, and you consider the four two-part partitions on your four-element set (like the , and  in our "two binary variables" example). If you have the uniform distribution, we have that  is independent of , and  is independent of , but  is not independent of .

So the independence structure of the uniform distribution on a four-element set does not actually satisfy the compositional semigraphoid axioms. It specifically doesn't satisfy the composition axiom.

Similarly to how I'm excited about the infinite case even though the fundamental theorem will fail, I think that we're not going to be able to just directly use the fundamental theorem, but we could still ask, "What are some things consistent with this data?" We might have some extra independence that comes from something other than orthogonality. I could talk more about that, but I think I'll move on now.

Orthogonality as simplifying assumptions for decisions — We talked about orthogonality as coming from independence, but I think part of being an agent interacting with the world is that there are some parts of the world out there that I just believe are not going to be affected by my action. And I make this assumption, and I act based on this assumption. And I think this is another way that we can think of orthogonality as coming in.

And so I just wanted to point out that orthogonality doesn't need to come from probability. We can think of it as coming from somewhere else.

And finally, conditional orthogonality as an abstraction desideratum — This is largely the kind of stuff that John Wentworth talks about. Like, an abstraction of  should have the property that everything that I care about is orthogonal to , given the abstraction of . So I only throw away irrelevant data, and I can express "I only throw away irrelevant data" as "everything I care about is orthogonal to the underlying structure, given the abstraction."

All right. Those are my long rants about potential research directions for finite factored sets.

I'm going to say one more thing that's not on this slide: You can think of history as similar to entropy. You can think of my notion of history as a non-quantitative version of entropy.

And actually, when proving the semigraphoid axioms, I use some lemmas that look a lot like theorems about entropy, and then I replaced the sums and the expectations with unions.

And when you look at time—well, time is the subset relation on history, which is like saying that time is an inequality relation on entropy, and "later" corresponds to higher entropy. This is speculation, but this is a whole other ontology that you could put on what history means, etc.

# Q&A

[Select questions from the Q&A are included below. The full Q&A transcript is here.]

New Comment