Counterfactuals on POMDP

Stuart_Armstrong

A putative new idea for AI control; index here.

This is a technical note explaining how to define counterfactuals on partially observable Markov decision processes (POMDPs).

The POMDP formalism is explained here. This note will just sketch out how counterfactuals are defined; the full details will be in the final paper.

Taking another action

Suppose that an agent has been active for $n$ turns on POMDP $μ$, and has seen history $h_{n}$. Suppose it then wants to estimate what would have happened if it had taken other actions after timestep $t < n$. The quantity of interest is the counterfactual probability of a history $h_{m}$, $t < m$, which can be written as:

$μ (h_{m} ∣ (h_{n})^{c}, h_{t})$ .

This rather clunky notation (let me know if there's a better way of writing this) denotes the probability of $h_{m}$, given that $h_{n}$ is the counterfactual history, and that both $h_{n}$ and $h_{m}$ start with $h_{t}$ before diverging.

It might seem surprising that no policy is mentioned in that expression - after all, the probability of a history is given by the environment $μ$ and the agent's action choices. But histories like $h_{m}$ include action choices, so these don't need to be specified separately.

The first thing to notice is that $μ$ and $h_{n}$ give a probability distribution over $s_{t}$ , the (hidden) state at timestep $t$ . And the value of $s_{t}$ can change the subsequent probability of $h_{m}$ . So the counterfactual probability is defined as:

$μ (h_{m} ∣ (h_{n})^{c}, h_{t}) = \sum_{s \in S} μ (s_{t} = s ∣ h_{n}) μ (h_{m} ∣ h_{t}, s_{t} = s)$ .
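The definition can be turned into a short computation. The following is a minimal Python sketch, not the construction from the final paper. It assumes a finite POMDP stored as a dictionary with hypothetical keys `init`, `trans`, and `obs`, a history written as a list `[o_0, a_0, o_1, a_1, ...]`, and it reads $μ (h_{m} ∣ h_{t}, s_{t} = s)$ as the probability of the observations in $h_{m}$ after timestep $t$, given the actions in $h_{m}$. The helper names and the `TOY` environment are made up for illustration.

```python
def step(pomdp, weights, action, obs):
    """Push unnormalised state weights through one (action, observation) pair."""
    out = {}
    for s, w in weights.items():
        for s2, p in pomdp["trans"][(s, action)].items():
            v = w * p * pomdp["obs"][s2].get(obs, 0.0)
            if v > 0.0:
                out[s2] = out.get(s2, 0.0) + v
    return out


def run(pomdp, weights, tail):
    """Apply step() along a history tail of the form [a, o, a, o, ...]."""
    for action, obs in zip(tail[0::2], tail[1::2]):
        weights = step(pomdp, weights, action, obs)
    return weights


def likelihood(pomdp, s, tail):
    """Probability of the tail's observations, given s_t = s and the tail's actions."""
    return sum(run(pomdp, {s: 1.0}, tail).values())


def counterfactual(pomdp, h_m, h_n, t):
    """Sum over s of mu(s_t = s | h_n) * mu(h_m | h_t, s_t = s)."""
    cut = 2 * t + 1  # length of h_t = [o_0, a_0, ..., o_t]
    if h_m[:cut] != h_n[:cut]:
        raise ValueError("h_m and h_n must share the prefix h_t")
    h_t = h_n[:cut]
    # alpha[s]: joint weight of s_t = s and the observations in h_t.
    start = {s: p * pomdp["obs"][s].get(h_t[0], 0.0)
             for s, p in pomdp["init"].items()}
    alpha = run(pomdp, start, h_t[1:])
    # Unnormalised posterior over s_t given the whole of h_n.
    post = {s: a * likelihood(pomdp, s, h_n[cut:]) for s, a in alpha.items()}
    z = sum(post.values())
    if z == 0.0:
        raise ValueError("h_n has probability zero under this POMDP")
    return sum(p / z * likelihood(pomdp, s, h_m[cut:]) for s, p in post.items())


# Hypothetical toy environment: a hidden state that "flip" swaps and "stay"
# keeps, observed through a noisy hi/lo signal.
TOY = {
    "init": {"A": 0.5, "B": 0.5},
    "trans": {
        ("A", "stay"): {"A": 1.0}, ("A", "flip"): {"B": 1.0},
        ("B", "stay"): {"B": 1.0}, ("B", "flip"): {"A": 1.0},
    },
    "obs": {"A": {"hi": 0.8, "lo": 0.2}, "B": {"hi": 0.2, "lo": 0.8}},
}

p = counterfactual(TOY, ["hi", "flip", "lo"], ["hi", "stay", "hi"], t=0)
print(p)
```

In this toy, the second "hi" in $h_{n}$ shifts the posterior on the initial state to $16/17$ for `A`, so the counterfactual probability is $13/17$, whereas conditioning on $h_{t}$ alone would give $0.68$. The gap between the two is exactly the information that $h_{n}$ carries about the hidden state.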

Counterfactual equivalence

Though the counterfactual probability is defined in terms of the states, the expression itself only involves histories.

Thus if $μ$ and $μ^{'}$ are two POMDPs with the same sets of actions and observations (but potentially different sets of states) we can say they are counterfactually equivalent if they generate the same counterfactual probabilities:

$\forall h_{m}, h_{n}, h_{t} : μ (h_{m} ∣ (h_{n})^{c}, h_{t}) = μ^{'} (h_{m} ∣ (h_{n})^{c}, h_{t})$ .

Consider the simple POMDP $μ$ (actually, an MDP, since it's fully observable), defined as:

The reasons for the notation will be explained in a later post; but here, starting from a single state, the agent can take one of two actions, and each action has a $0.5$ chance of ending up in one of two states.

Now consider the POMDP $μ^{'}$ , defined as:

Here there are two possible initial states, each equally likely, and each action leads with certainty to a state determined by the hidden initial state and the action choice.

Note that there are four possible histories on $μ$ and $μ^{'}$ , and two are compatible with each action, and those two are equally probable given that action. So every history will appear with equal probability on $μ$ and $μ^{'}$ : they are observationally equivalent.

However, they are not counterfactually equivalent. For instance, $μ (o_{0} a^{w} o^{w 0} ∣ (o_{0} a^{r} o^{r 1})^{c}, o_{0}) = 0.5$, but

$μ^{'} (o_{0} a^{w} o^{w 0} ∣ (o_{0} a^{r} o^{r 1})^{c}, o_{0}) = \sum_{i = 0}^{1} μ^{'} (s_{0} = s_{0}^{i} ∣ o_{0} a^{r} o^{r 1}) μ^{'} (o_{0} a^{w} o^{w 0} ∣ o_{0}, s_{0} = s_{0}^{i}) = 0 \cdot 1 + 1 \cdot 0 = 0$ .
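The comparison above can be checked numerically. The following is a minimal Python sketch, not the author's code. Since the diagrams are not reproduced here, the state structure of $μ^{'}$ is an assumption read off the prose and the sum above: from hidden state $s_{0}^{i}$, action $a^{x}$ leads to observation $o^{x i}$ with certainty. The dictionary layout and the names `MU`, `MU_PRIME`, `observational`, and `counterfactual` are hypothetical, and the shared initial observation $o_{0}$ is left implicit.

```python
# mu: one initial state; each action yields outcome 0 or 1 with probability 0.5.
MU = {
    "init": {"s0": 1.0},
    "outcome": {
        ("s0", "r"): {"r0": 0.5, "r1": 0.5},
        ("s0", "w"): {"w0": 0.5, "w1": 0.5},
    },
}

# mu' (assumed structure): two equally likely hidden initial states, and the
# outcome is fixed by the hidden state together with the action.
MU_PRIME = {
    "init": {"s0_0": 0.5, "s0_1": 0.5},
    "outcome": {
        ("s0_0", "r"): {"r0": 1.0}, ("s0_0", "w"): {"w0": 1.0},
        ("s0_1", "r"): {"r1": 1.0}, ("s0_1", "w"): {"w1": 1.0},
    },
}


def observational(model, action, obs):
    """Probability of seeing obs after taking action, hidden state marginalised."""
    return sum(p * model["outcome"][(s, action)].get(obs, 0.0)
               for s, p in model["init"].items())


def counterfactual(model, cf_action, cf_obs, seen_action, seen_obs):
    """Probability of cf_obs under cf_action, given that seen_action gave seen_obs."""
    # Joint weight of each initial state with the history actually seen.
    joint = {s: p * model["outcome"][(s, seen_action)].get(seen_obs, 0.0)
             for s, p in model["init"].items()}
    z = sum(joint.values())
    # Posterior over the initial state, pushed through the other action.
    return sum(j / z * model["outcome"][(s, cf_action)].get(cf_obs, 0.0)
               for s, j in joint.items())


for name, model in (("mu", MU), ("mu'", MU_PRIME)):
    print(name,
          observational(model, "w", "w0"),
          counterfactual(model, "w", "w0", "r", "r1"))
```

Under these assumptions, both models assign probability $0.5$ to each observation given each action, while the counterfactual query returns $0.5$ on $μ$ and $0$ on $μ^{'}$.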

Conversely, if we consider the POMDP $μ^{''}$ :

Then it's not hard to check that it is counterfactually equivalent to $μ$.