Produced as part of SERI MATS 3.0. Thanks to Vivek Hebbar and Paul Colognese for discussion.
Behind the problem of human counterfactuals creeps the problem of understanding abstraction / ontology identification.
A nice theory of counterfactuals would be useful for many things, including low-impact measures for corrigible AI:
a flooded workshop changes a lot of things that don't have to change as a consequence of the cauldron being filled at all, averaged over a lot of ways of filling the cauldron. [the natural operationalization of this averaging requires counterfactuals]
So whence the difficulty of obtaining one?
Well, we do have at least one well-defined class of counterfactuals: "just take a chunk of atoms, replace it with another, and continue running the laws of physics". This introduces a discontinuity that would never occur in the real world, but we don't care about that: we can just keep running the mathematical laws of physics from the edited state, as if we were dealing with a Game of Life board.
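As a minimal sketch of this well-defined kind of counterfactual, take the Game of Life literally: overwrite a region of the board (a discontinuity the dynamics would never produce on their own) and just keep applying the rules. The functions and board below are illustrative, not from the post.

```python
from collections import Counter

def step(live):
    """One Game of Life update on a set of live (x, y) cells."""
    counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is live next step if it has 3 live neighbours,
    # or 2 live neighbours and is currently live.
    return {c for c, n in counts.items() if n == 3 or (n == 2 and c in live)}

def physical_counterfactual(state, region, replacement, steps):
    """Overwrite `region` of the board with `replacement`, then just keep
    running the rules from the edited state."""
    edited = {c for c in state if c not in region} | replacement
    for _ in range(steps):
        edited = step(edited)
    return edited

# A blinker oscillates with period 2 under the unedited dynamics...
blinker = {(0, 1), (1, 1), (2, 1)}
assert step(step(blinker)) == blinker
# ...but we can counterfactually wipe its region clean and run from there.
region = {(x, y) for x in range(3) for y in range(3)}
assert physical_counterfactual(blinker, region, set(), 5) == set()
```

The edit itself is not a legal move of the dynamics; only the continuation is. That is exactly the sense in which this class of counterfactuals is well-defined.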
But this doesn't correspond to our intuitive notion of counterfactuals. When humans think about counterfactuals, we are basically changing the state of a latent variable inside our heads, and rerunning a computation. For example, maybe we change the state of the "yesterday's weather" variable from "sunny" to "rainy", and rerun the computation "how did the picnic go?".
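The mental operation above can be sketched as a toy program: the "world" in our head is a few latent variables feeding a computation, and the counterfactual just overwrites one variable and reruns everything downstream. The variable names and the picnic rule are made up for illustration.

```python
def picnic_outcome(latents):
    """The downstream computation: how did the picnic go?"""
    if latents["weather"] == "rainy":
        return "cancelled"
    return "lovely" if latents["friends_came"] else "lonely"

def counterfactual(latents, variable, value):
    """Overwrite one latent variable (a do()-style surgery) and rerun."""
    intervened = {**latents, variable: value}
    return picnic_outcome(intervened)

actual = {"weather": "sunny", "friends_came": True}
assert picnic_outcome(actual) == "lovely"
# "What if yesterday's weather had been rainy?"
assert counterfactual(actual, "weather", "rainy") == "cancelled"
```

Inside the model this operation is trivial and always well-defined; the whole difficulty discussed below is that the model's variables need not carve physical reality at its joints.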
The problem is that our latent variables don't neatly correspond to parts of physical reality. Sometimes they don't correspond to any part of physical reality at all! And so some (in fact, most) of the variable changes we offhandedly perform don't unambiguously correspond to physical counterfactuals natively expressed in our laws of physics.
If you just replace a three-dimensional chunk of atmosphere with one containing a rain cloud, people will notice a cloud appearing out of nowhere. As a necessary consequence, they will be freaked out by this artificial fact, which is not at all what you had in mind for your counterfactual. Sometimes you'll be able to just add the cloud when no one is looking. But most of the time, and especially when dealing with messier human concepts, the physical counterfactual will be under-determined, or none of the candidate instantiations will correspond to what you had in mind using your neatly compartmentalized variables.
This is not to say human counterfactuals are meaningless: they are a way of taking advantage of regularities discovered in the world. When a physicist says "if I had put system A there, it would have evolved into system B", they just mean that said causal relation has been demonstrated by their experiments, or is predicted by their well-tested gears-level theories (modulo the philosophical problem of induction, as always). Similarly, a counterfactual might help you notice or remember that rainy days are no good for picnics, which is useful for future action.
But it becomes clear that such natural language counterfactuals depend on the mind's native concepts. And so, instead of a neat and objective mathematical definition that makes sense of these counterfactuals, we should expect ontology identification (matching our concepts with physical reality) to be the hard part of operationalizing them.
More concretely, suppose we had a solution to ontology identification: a probability distribution P(Mindstate|Worldstate). By additionally having a prior over worldstates (or mindstates), we can obtain the dual distribution P(Worldstate|Mindstate). Given that, we can just use the do() operator on a mindstate to natively implement the counterfactual, and then condition on the new mindstate to find which probability distribution over reality it corresponds to.
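A discrete sketch of this pipeline, assuming we are simply handed P(Mindstate|Worldstate) as a table. The worldstates, mindstates, and all the numbers below are made up for illustration.

```python
worlds = ["rainy_world", "sunny_world"]
prior = {"rainy_world": 0.3, "sunny_world": 0.7}  # prior over worldstates

# The assumed solution to ontology identification: P(mindstate | worldstate).
p_mind_given_world = {
    "rainy_world": {"believes_rainy": 0.9, "believes_sunny": 0.1},
    "sunny_world": {"believes_rainy": 0.2, "believes_sunny": 0.8},
}

def p_world_given_mind(mind):
    """Bayes-invert to get the dual distribution P(worldstate | mindstate)."""
    unnorm = {w: p_mind_given_world[w][mind] * prior[w] for w in worlds}
    z = sum(unnorm.values())
    return {w: p / z for w, p in unnorm.items()}

# do(): surgically set the mindstate to the counterfactual value...
counterfactual_mind = "believes_rainy"
# ...then condition on it to recover a distribution over physical reality.
dist = p_world_given_mind(counterfactual_mind)
assert abs(sum(dist.values()) - 1.0) < 1e-9
assert dist["rainy_world"] > dist["sunny_world"]
```

In this toy setting the inversion is a one-line Bayes computation; the substantive open problem is, of course, obtaining P(Mindstate|Worldstate) in the first place.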
Maybe we should expect the distribution P(Mindstate|Worldstate) to contain lots of contingent information depending on how the human brain came about and learned (especially if Natural Abstractions fails). And hence the perfect operationalization of natural language counterfactuals would also be far from a simple definition.
Even the "rearranging atoms" notion might not be well-defined. The actual laws of physics might be expressed in terms other than particle positions, for example wave functions. In that case, "rearranging atoms" is under-determined, and the counterfactuals we can natively talk about are of a different form: "what if this function suddenly became this other function?". It is enough for my point to consider any such "native counterfactuals" for whatever mathematically expressed laws of physics we use.
This presupposes not only that such laws exist, but also that they can be run on any physical setup expressible in their language. It does seem like we live in such a world, but it is mathematically possible for the laws of physics to be under-determined on certain setups.
As an example of this under-determination: if we ponder "what if Mr. Smith had won the last election?", are we thinking of just the final vote count changing out of nowhere? Or of people actually casting different votes? Do we also have to change the machinery in their heads that led them to cast those votes? Each of these implementations breaks other variables we wanted to hold constant. In the first case, people might discover there has been some kind of mistake in the vote counting. In the second, people will be surprised to find they voted for Smith, even though they meant to cast a different vote. In the third, we need to make a myriad more decisions about operationalization. We might find that any instantiation of the counterfactual necessarily brings about other unrealistic changes we didn't want to implement.
Notice that ontology identification is usually taken to mean "mapping the AI's concepts to human concepts". Here, instead, we are trying to map our concepts directly to physical reality (although that could be understood as "our best guess about physical reality", which is still made of human concepts).
We can think of a mindstate as a value assignment to the nodes of the causal graph of our concepts. The worldstates don't need any additional structure.