This is a technical note explaining how to define counterfactuals on partially observable Markov decision processes (POMDPs).

The POMDP formalism is explained here. This note will just sketch out how counterfactuals are defined; the full details will be in the final paper.

## Taking another action

Suppose that an agent has been active for $n$ turns on a POMDP μ, and has seen history $h_n$. Suppose it then wants to estimate what would have happened if it had taken other actions after timestep $t<n$. That is, what is the counterfactual probability of a history $h_m$, with $t<m$? This can be written as:

$$\mu(h_m \mid (h_n)^c, h_t).$$

This rather clunky notation (let me know if there's a better way of writing this) denotes the probability of $h_m$, given that $h_n$ is the counterfactual history and that both $h_n$ and $h_m$ start with $h_t$ before diverging.

It might seem surprising that no policy is mentioned in that expression; after all, the probability of a history is determined by both the environment μ and the agent's action choices. But histories like $h_m$ include the action choices, so these don't need to be specified separately.

The first thing to notice is that μ and $h_n$ give a probability distribution over $s_t$, the (hidden) state at timestep $t$. And the value of $s_t$ can change the subsequent probability of $h_m$. So the counterfactual probability is defined as:

$$\mu(h_m \mid (h_n)^c, h_t) = \sum_{s \in S} \mu(s_t = s \mid h_n)\, \mu(h_m \mid h_t, s_t = s).$$
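The definition above can be sketched in code. The following Python is a minimal illustration rather than the author's implementation: it assumes a finite POMDP with deterministic observations, stored in a hypothetical dictionary layout (`init`, `trans`, `obs`), with histories written as tuples `(o0, a1, o1, a2, o2, ...)`. All function names and the toy model are invented for this sketch.

```python
def _advance(pomdp, weights, action, observation):
    """Push state weights through one (action, observation) step."""
    new = {}
    for s, p in weights.items():
        # States with no entry for this action are treated as terminal.
        for s2, q in pomdp["trans"].get((s, action), {}).items():
            if pomdp["obs"][s2] == observation:
                new[s2] = new.get(s2, 0.0) + p * q
    return new


def prefix_weights(pomdp, prefix):
    """Unnormalised P(s_t = s, observations in prefix | actions in prefix)."""
    weights = {s: p for s, p in pomdp["init"].items()
               if pomdp["obs"][s] == prefix[0]}
    for i in range(1, len(prefix), 2):
        weights = _advance(pomdp, weights, prefix[i], prefix[i + 1])
    return weights


def tail_prob(pomdp, state, tail):
    """P(observations in tail | actions in tail, s_t = state)."""
    weights = {state: 1.0}
    for i in range(0, len(tail), 2):
        weights = _advance(pomdp, weights, tail[i], tail[i + 1])
    return sum(weights.values())


def counterfactual_prob(pomdp, h_m, h_n, t):
    """Sum over s of P(s_t = s | h_n) * P(h_m | h_t, s_t = s)."""
    cut = 2 * t + 1  # length of h_t: t actions plus t + 1 observations
    if h_m[:cut] != h_n[:cut]:
        raise ValueError("h_m and h_n must share the prefix h_t")
    # Posterior over s_t given all of h_n: prefix weight times tail likelihood.
    post = {s: w * tail_prob(pomdp, s, h_n[cut:])
            for s, w in prefix_weights(pomdp, h_n[:cut]).items()}
    total = sum(post.values())
    if total == 0:
        raise ValueError("h_n has probability zero under this POMDP")
    return sum(w / total * tail_prob(pomdp, s, h_m[cut:])
               for s, w in post.items())


# Hypothetical toy model: a hidden initial state fixes the outcome of each action.
TOY = {
    "init": {"A": 0.5, "B": 0.5},
    "trans": {("A", "x"): {"Ax": 1.0}, ("A", "y"): {"Ay": 1.0},
              ("B", "x"): {"Bx": 1.0}, ("B", "y"): {"By": 1.0}},
    "obs": {"A": "o", "B": "o", "Ax": "p", "Bx": "q", "Ay": "u", "By": "v"},
}
# Having seen ("o", "x", "p"), what if action "y" had been taken at the start?
p_u = counterfactual_prob(TOY, ("o", "y", "u"), ("o", "x", "p"), 0)
p_v = counterfactual_prob(TOY, ("o", "y", "v"), ("o", "x", "p"), 0)
print(p_u, p_v)
```

In this toy model, observing `p` after `x` pins the hidden initial state to `A`, so the counterfactual outcome of `y` is certain. The policy never appears: the actions are read directly from the two histories.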

## Counterfactual equivalence

Though the counterfactual probability is defined in terms of the states, the expression itself only involves histories.

Thus if μ and μ′ are two POMDPs with the same sets of actions and observations (but potentially different sets of states) we can say they are counterfactually equivalent if they generate the same counterfactual probabilities:

$$\forall h_m, h_n, h_t:\ \mu(h_m \mid (h_n)^c, h_t) = \mu'(h_m \mid (h_n)^c, h_t).$$

Consider the simple POMDP μ (actually, an MDP, since it's fully observable), defined as:

The reasons for the notation will be explained in a later post; but here, starting from a single state, the agent can take one of two actions, and each action has a 0.5 chance of ending up in one of two states.

Now consider the POMDP μ′, defined as:

Here there are two possible initial states, each equally likely, and each action then leads with certainty to a state determined by the hidden initial state and the action choice.

Note that there are four possible histories on μ and μ′, and two are compatible with each action, and those two are equally probable given that action. So every history will appear with equal probability on μ and μ′: they are observationally equivalent.
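The observational equivalence can be checked by brute force. The sketch below encodes μ and μ′ from the prose descriptions only, so the state names, the action labels `ar` and `aw`, the observation labels, and in particular the wiring of μ′ (which hidden state leads to which outcome) are all assumptions made for illustration.

```python
# mu: one initial state; each action lands in one of two states with chance 0.5.
MU = {
    "init": {"s0": 1.0},
    "trans": {("s0", "ar"): {"r0": 0.5, "r1": 0.5},
              ("s0", "aw"): {"w0": 0.5, "w1": 0.5}},
    "obs": {"s0": "o0", "r0": "or0", "r1": "or1", "w0": "ow0", "w1": "ow1"},
}
# mu': two equally likely hidden initial states; actions are then deterministic.
MU_PRIME = {
    "init": {"sA": 0.5, "sB": 0.5},
    "trans": {("sA", "ar"): {"r0": 1.0}, ("sA", "aw"): {"w0": 1.0},
              ("sB", "ar"): {"r1": 1.0}, ("sB", "aw"): {"w1": 1.0}},
    "obs": {"sA": "o0", "sB": "o0",
            "r0": "or0", "r1": "or1", "w0": "ow0", "w1": "ow1"},
}


def history_prob(model, action, observation):
    """P(initial observation o0, then `observation` | `action`)."""
    total = 0.0
    for s, p in model["init"].items():
        if model["obs"][s] != "o0":
            continue
        for s2, q in model["trans"][(s, action)].items():
            if model["obs"][s2] == observation:
                total += p * q
    return total


HISTORIES = [("ar", "or0"), ("ar", "or1"), ("aw", "ow0"), ("aw", "ow1")]
table = {h: (history_prob(MU, *h), history_prob(MU_PRIME, *h))
         for h in HISTORIES}
for h, (p_mu, p_mu_prime) in table.items():
    print(h, p_mu, p_mu_prime)
```

Every row shows the same probability for both models, so no experiment that only samples histories can tell them apart.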

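However, they are not counterfactually equivalent. For instance, μ(o0awow0∣o0aror1,o0)=0.5, but on μ′ the same counterfactual probability is either 0 or 1: the observation that follows the first action reveals the hidden initial state, and that state fixes the outcome of the other action with certainty.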

Conversely, if we consider the POMDP μ′′:

It is then not hard to check that μ′′ is counterfactually equivalent to μ.
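The contrast between μ and μ′ discussed earlier can be reproduced numerically. This sketch encodes both models in an assumed dictionary layout with invented state names, and the wiring of μ′ is again an assumption. It specialises the counterfactual definition to $t=0$ and one-step histories, so it is an illustration rather than a general implementation.

```python
MU = {
    "init": {"s0": 1.0},
    "trans": {("s0", "ar"): {"r0": 0.5, "r1": 0.5},
              ("s0", "aw"): {"w0": 0.5, "w1": 0.5}},
    "obs": {"s0": "o0", "r0": "or0", "r1": "or1", "w0": "ow0", "w1": "ow1"},
}
# Assumed wiring: hidden state sA yields r0 / w0, hidden state sB yields r1 / w1.
MU_PRIME = {
    "init": {"sA": 0.5, "sB": 0.5},
    "trans": {("sA", "ar"): {"r0": 1.0}, ("sA", "aw"): {"w0": 1.0},
              ("sB", "ar"): {"r1": 1.0}, ("sB", "aw"): {"w1": 1.0}},
    "obs": {"sA": "o0", "sB": "o0",
            "r0": "or0", "r1": "or1", "w0": "ow0", "w1": "ow1"},
}


def step_prob(model, state, action, observation):
    """P(observation after taking action | current state)."""
    return sum(q for s2, q in model["trans"][(state, action)].items()
               if model["obs"][s2] == observation)


def counterfactual_one_step(model, cf_action, cf_obs, seen_action, seen_obs):
    """Counterfactual probability with t = 0.

    First form the posterior over the initial state given the step that was
    actually seen, then average the chance of the counterfactual step over it.
    """
    post = {s: p * step_prob(model, s, seen_action, seen_obs)
            for s, p in model["init"].items()}
    total = sum(post.values())
    return sum(w / total * step_prob(model, s, cf_action, cf_obs)
               for s, w in post.items())


# Seen: o0, ar, or1.  Counterfactual: o0, aw, ow0.
cf_mu = counterfactual_one_step(MU, "aw", "ow0", "ar", "or1")
cf_mu_prime = counterfactual_one_step(MU_PRIME, "aw", "ow0", "ar", "or1")
print(cf_mu, cf_mu_prime)
```

Under this wiring the value on μ′ comes out as 0; swapping which hidden state leads to `w0` would give 1 instead. Either way it differs from the 0.5 obtained on μ.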