In order to distinguish EDT or UDT from CDT, we need to consider inferences from the choice of policy other than the causal effects of the agent following it. For example, we need to exploit some correlation between the agent's choice of policy and the choices or beliefs of another agent. This seems to require a richer model than we usually use in reinforcement learning.
If we don't have any nontrivial non-causal interaction between the policy and the outcome, then EDT and CDT collapse into the same recommendation. This seems to happen in your setting.
I don't see a good way to construct a suitable practical model, though I haven't thought about it too much.
The difference between EDT and CDT only appears when there are non-causal correlations between the environment and the agent's choice of policy. But in the setting you described, the only impact of the policy is on the agent's actions, which then causally affect the environment. In this setting, EDT always makes the same recommendation as CDT, since conditioning on the policy is the same as causally intervening on it.
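To make the point above concrete, here is a minimal sketch on a toy Newcomb-like problem. Everything in it (the payoffs, the function names, the `accuracy` and `prior_one_box` parameters) is an assumption chosen for illustration, not part of the model under discussion. The EDT value conditions on the action, so the belief about the prediction shifts with it; the CDT value intervenes on the action, so the prediction keeps its prior.

```python
# Toy Newcomb-like problem. All names and numbers are illustrative assumptions.
ACTIONS = ("one_box", "two_box")


def payoff(action, predicted):
    """Opaque box holds 1,000,000 iff one-boxing was predicted; the other box holds 1,000."""
    opaque = 1_000_000 if predicted == "one_box" else 0
    transparent = 1_000 if action == "two_box" else 0
    return opaque + transparent


def edt_value(action, accuracy):
    """E[U | A = action]: conditioning on the action changes the belief about the prediction."""
    other = "two_box" if action == "one_box" else "one_box"
    return accuracy * payoff(action, action) + (1 - accuracy) * payoff(action, other)


def cdt_value(action, prior_one_box):
    """E[U | do(A = action)]: the prediction keeps its prior distribution."""
    return (prior_one_box * payoff(action, "one_box")
            + (1 - prior_one_box) * payoff(action, "two_box"))


def best(value_fn, param):
    """Return the action that maximizes the given value function."""
    return max(ACTIONS, key=lambda a: value_fn(a, param))
```

With `accuracy = 0.99`, `best(edt_value, 0.99)` picks `one_box` while `best(cdt_value, 0.5)` picks `two_box`. With `accuracy = 0.5` the prediction carries no information about the action, and `edt_value(a, 0.5)` equals `cdt_value(a, 0.5)` for both actions, which is the collapse described above.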
UDT also makes the same choices as EDT, because the "behavior after observing X" only affects what happens after observing X. So it doesn't matter whether we update.
I might be missing some aspect of the model that correlates decisions and outcomes though.
One candidate is forgetfulness. If you don't keep a record of your past states, then your decision not only has an impact on what happens in the future, it also affects your beliefs about what state you are currently in. So I guess if we have forgetting, then the model you described can capture a difference.
It seems like RL with forgetting is generally a pretty subtle subject. As far as I can tell, direct policy search is the only algorithm people use in this setting. If episodes are independent, this is the same as UDT (not CDT).
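As a rough illustration of policy search under forgetting, here is a sketch using the absent-minded driver problem, which is my choice of example and not something from the model being discussed. The driver passes two intersections it cannot tell apart, so a memoryless policy is a single probability of continuing. The payoffs (0 for exiting first, 4 for exiting second, 1 for continuing past both) and the grid search standing in for direct policy search are assumptions for this sketch.

```python
# Absent-minded driver: two indistinguishable intersections, one policy parameter.
# Payoffs and the grid search are illustrative assumptions.

def expected_payoff(p_continue):
    """Ex-ante expected payoff of continuing with probability p_continue at each intersection."""
    exit_first = (1 - p_continue) * 0.0
    exit_second = p_continue * (1 - p_continue) * 4.0
    continue_both = p_continue ** 2 * 1.0
    return exit_first + exit_second + continue_both


def policy_search(num_points=3001):
    """Pick the policy parameter with the highest ex-ante expected payoff on a uniform grid."""
    grid = [i / (num_points - 1) for i in range(num_points)]
    return max(grid, key=expected_payoff)


best_p = policy_search()
print(best_p, expected_payoff(best_p))
```

The search scores each policy by its expected payoff over whole episodes, evaluated from before the episode starts, and never asks what the driver should believe at a given intersection. That ex-ante evaluation is the sense in which it lines up with UDT here.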
"But in the setting you described, the only impact of the policy is on the agent’s actions"
I don't think so. P_M(\zeta | \pi) is meant to describe the distribution over trajectories given a policy (according to the model). Unless I'm missing something, the model could contain non-causal correlations.
I see; you're right.
You mention that P_M could reflect the true dynamics of the environment; I read that and assumed it was a causal model mapping a (state, action) pair to the next state. But if it captures a more general state of uncertainty, then this does pick up the difference between EDT/UDT and CDT.
Note that if P_M reflects the agent's logical uncertainty about its own behavior, then we can't generally expand the expectations out as an integral over possible trajectories.
For example, if I train one model to map x to E[y(x)|x], and I train another model to map (x,b) to P(y(x)=b), then the two quantities won't generally be related by integration.
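Spelled out, with \hat{f} and \hat{g} as hypothetical names for the two trained models and y taking discrete values b, the point is roughly the following. The true quantities satisfy the first identity, but nothing in training forces the learned estimates to satisfy the second line with equality.

```latex
\mathbb{E}[y(x) \mid x] = \sum_b b \, P(y(x) = b)

\hat{f}(x) \approx \mathbb{E}[y(x) \mid x], \qquad
\hat{g}(x, b) \approx P(y(x) = b), \qquad
\hat{f}(x) \neq \sum_b b \, \hat{g}(x, b) \quad \text{in general}
```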
When thinking about the connection between theoretical frameworks and practical algorithms, I think this would be an interesting issue to push on.
I don't understand why you say:
it "seems to require a richer model than we usually use in [RL]".
"This seems to happen in your setting."
Are you suggesting that a model as I've defined it is not satisfactory/sufficient for some reason?
Can you elaborate a bit?