I don't understand the motivation behind the fundamental theorem, and I'm wondering if you could say a bit more about it. In particular, it suggests that if I want to propose a family of probability distributions that "represent" observations somehow ("represents" maybe means in the sense of Bayesian predictions or in the sense of frequentist limits), I want to also consider this family to arise from some underlying mutually independent family along with some function. I'm not sure why I would want to propose an underlying family and a function at all, and even if I do I'm not sure why I want to suppose it is mutually independent.

One thought I had is that maybe this underlying family of distributions on S is supposed to represent "interventions". The reasoning would be something like: there is some set of things that fully control my observations that I can control independently and which also vary independently on their own. I don't find this convincing, though - I don't see why independent controllability should imply independent variation.

Another thought I had was that maybe it arises from some kind of maximum entropy argument, but it's not clear why I would want to maximise the entropy of a distribution on some S for every possible configuration of marginals.

Also, do you know how your model relates to structural equation models with hidden variables? Your factored set S looks a lot like a set of independent "noises", and the function f:S->Y looks a lot like a set of structural equations and I think it's straightforward to introduce hidden variables as needed to account for any lossiness. In particular, given a model and a compatible orthogonality database, I can construct an SEM by taking all the variables that appear in the database and defining the structural equation for X to be X:=X∘f. However, the set of all SEMs that are compatible with a given orthogonality database I think is larger than the set of all FFS models that are similarly compatible. This is because SEMs (in the Pearlean sense) can be distinct even if they have "apparently identical" structural equations. For example, X:=1,Y:=Xand X:=1,Y:=1 are distinct because interventions on X will have different results, while my understanding is that they will correspond to the same FFS model.

Your result 2e looks interestingly similar to the DAG result that says X⊥Z and X⊥/Z|Y implies something like X→Y←Z (where ⊥ is d-separation). In fact, I think it extends existing graph learning algorithms: in addition to checking independences among the variables as given, you can look for independences between any functions of the given variables. This seems like it would give you many more arrows than usual, though I imagine it would also increase your risk of spurious indepdendences. In fact, I think this connects to approaches to causal identification like regression with subsequent independence test: if X is independent of Y−E[Y|X], we prefer X→Y, and if Y is independent of X−E[X|Y], prefer Y→X.

I don't understand the motivation behind the fundamental theorem, and I'm wondering if you could say a bit more about it. In particular, it suggests that if I want to propose a family of probability distributions that "represent" observations somehow ("represents" maybe means in the sense of Bayesian predictions or in the sense of frequentist limits), I want to also consider this family to arise from some underlying mutually independent family along with some function. I'm not sure why I would want to propose an underlying family and a function at all, and even if I do I'm not sure why I want to suppose it is mutually independent.

One thought I had is that maybe this underlying family of distributions on S is supposed to represent "interventions". The reasoning would be something like: there is some set of things that fully control my observations that

Ican control independently and which also vary independently on their own. I don't find this convincing, though - I don't see why independent controllability should imply independent variation.Another thought I had was that maybe it arises from some kind of maximum entropy argument, but it's not clear why I would want to maximise the entropy of a distribution on some S for every possible configuration of marginals.

Also, do you know how your model relates to structural equation models with hidden variables? Your factored set S looks a lot like a set of independent "noises", and the function f:S->Y looks a lot like a set of structural equations and I think it's straightforward to introduce hidden variables as needed to account for any lossiness. In particular, given a model and a compatible orthogonality database, I can construct an SEM by taking all the variables that appear in the database and defining the structural equation for X to be X:=X∘f. However, the set of all SEMs that are compatible with a given orthogonality database I think is larger than the set of all FFS models that are similarly compatible. This is because SEMs (in the Pearlean sense) can be distinct even if they have "apparently identical" structural equations. For example, X:=1,Y:=Xand X:=1,Y:=1 are distinct because interventions on X will have different results, while my understanding is that they will correspond to the same FFS model.

Your result 2e looks interestingly similar to the DAG result that says X⊥Z and X⊥/Z|Y implies something like X→Y←Z (where ⊥ is d-separation). In fact, I think it extends existing graph learning algorithms: in addition to checking independences among the variables as given, you can look for independences between any functions of the given variables. This seems like it would give you many more arrows than usual, though I imagine it would also increase your risk of spurious indepdendences. In fact, I think this connects to approaches to causal identification like regression with subsequent independence test: if X is independent of Y−E[Y|X], we prefer X→Y, and if Y is independent of X−E[X|Y], prefer Y→X.