In memoryless Cartesian environments, every UDT policy is a CDT+SIA policy

jessicata

Summary: I define memoryless Cartesian environments (which can model many familiar decision problems), note the similarity to memoryless POMDPs, and define a local optimality condition for policies, which can be roughly stated as "the policy is consistent with maximizing expected utility using CDT and subjective probabilities derived from SIA". I show that this local optimality condition is necessary but not sufficient for global optimality (UDT).

Memoryless Cartesian environments

I'll define a memoryless Cartesian environment to consist of:

a set of states $S$
a set of actions $A$
a set of observations $O$
an initial state $s_{1} \in S$
a transition function $t : S \times A \to Δ S$ , determining the distribution of states resulting from starting in a state and taking a certain action
an observation function $m : S \to O$ , determining what the agent sees in a given state
a set $S_{T} \subset S$ of terminal states. If the environment reaches a terminal state, the game ends.
a utility function $U : S_{T} \to [0, 1]$ , measuring the value of each terminal state.

On each iteration, the agent observes some observation, and takes some action. Unlike in a POMDP, the agent has no memory of previous observations: the agent's policy must take into account only the current observation. That is, the policy $π$ is of type $O \to Δ A$ . In this analysis I'll assume that, for any state and policy, the expected number of iterations in the Cartesian environment starting from that state and using that policy is finite.

Memoryless Cartesian environments can be used to define many familiar decision problems, for example the absent-minded driver problem, Newcomb's problem with opaque or transparent boxes (assuming Omega runs a copy of the agent to make its prediction), and counterfactual mugging (also assuming Omega simulates the agent). Translating a decision problem to a memoryless Cartesian environment obviously requires making some Cartesian assumptions/decisions, though; in the case of Newcomb's problem, we have to isolate Omega's simulation of the agent as a copy of the agent.
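As a concrete illustration, here is a minimal Python sketch that encodes the absent-minded driver as a memoryless Cartesian environment and samples a single episode. The class name `MemorylessEnv`, the state names, the dummy observation `done` for terminal states, and the rescaling of the usual payoffs 0, 4, 1 to 0, 1, 0.25 (so that utilities lie in $[0, 1]$) are illustrative assumptions, not part of the definition above.

```python
import random
from dataclasses import dataclass

@dataclass
class MemorylessEnv:
    states: set
    actions: set
    observations: set
    initial: str
    transition: dict   # (state, action) -> {next_state: probability}
    observe: dict      # state -> observation
    terminal: set
    utility: dict      # terminal state -> value in [0, 1]

def absent_minded_driver():
    # Hypothetical encoding: X and Y are the two intersections, which look
    # identical to the driver (same observation).
    return MemorylessEnv(
        states={"X", "Y", "exit_X", "exit_Y", "end"},
        actions={"EXIT", "CONT"},
        observations={"intersection", "done"},
        initial="X",
        transition={
            ("X", "EXIT"): {"exit_X": 1.0},
            ("X", "CONT"): {"Y": 1.0},
            ("Y", "EXIT"): {"exit_Y": 1.0},
            ("Y", "CONT"): {"end": 1.0},
        },
        observe={"X": "intersection", "Y": "intersection",
                 "exit_X": "done", "exit_Y": "done", "end": "done"},
        terminal={"exit_X", "exit_Y", "end"},
        utility={"exit_X": 0.0, "exit_Y": 1.0, "end": 0.25},
    )

def rollout(env, policy, rng):
    """Sample one episode; policy maps observation -> {action: probability}."""
    s = env.initial
    while s not in env.terminal:
        dist = policy[env.observe[s]]          # depends on the observation only
        a = rng.choices(list(dist), weights=list(dist.values()))[0]
        nxt = env.transition[(s, a)]
        s = rng.choices(list(nxt), weights=list(nxt.values()))[0]
    return env.utility[s]

env = absent_minded_driver()
mixed = {"intersection": {"EXIT": 1 / 3, "CONT": 2 / 3}}
print(rollout(env, mixed, random.Random(0)))
```

Because the policy is indexed by observation rather than by state or history, the driver necessarily behaves the same way at both intersections.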

Globally and locally optimal policies

Memoryless Cartesian environments are much like memoryless POMDPs, and the following analysis is quite similar to that given in some previous work on memoryless POMDPs: the main difference is that I am targeting (local) optimality given a known world model, while previous work usually targets asymptotic (local) optimality given an unknown world model.

Let us define the expected utility of a particular state, given a policy:

$V_{π} (s) := U (s) \text{ if } s \in S_{T}$
$V_{π} (s) := \sum_{a} π (a | m (s)) \sum_{s'} t (s' | s, a) V_{π} (s') \text{ otherwise}$

Although this definition is recursive, the recursion is well-founded (since the expected number of iterations starting from any particular state is finite). Note that the agent's initial expected utility is just $V_{π} (s_{1})$ . Now we can also define a Q function, determining the expected utility of being in a certain state and taking a certain action:

$Q_{π} (s, a) := \sum_{s'} t (s' | s, a) V_{π} (s')$
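The two definitions can be evaluated numerically. The following is a rough sketch, assuming a hypothetical dictionary encoding of the absent-minded driver (payoffs rescaled to 0, 1, 0.25) and a fixed number of sweeps; for environments with cycles the sweeps only approximate $V_{π}$, and the names `T`, `M`, `U`, `evaluate` and `q_value` are illustrative.

```python
# Absent-minded driver with payoffs rescaled to [0, 1] (illustrative encoding).
T = {("X", "EXIT"): {"exit_X": 1.0}, ("X", "CONT"): {"Y": 1.0},
     ("Y", "EXIT"): {"exit_Y": 1.0}, ("Y", "CONT"): {"end": 1.0}}
M = {"X": "i", "Y": "i"}                          # observation of each non-terminal state
U = {"exit_X": 0.0, "exit_Y": 1.0, "end": 0.25}   # utility of each terminal state
ACTIONS = ("EXIT", "CONT")

def q_value(V, s, a):
    """Q_pi(s, a): expected V_pi of the successor state."""
    return sum(p * V[s2] for s2, p in T[(s, a)].items())

def evaluate(policy, sweeps=100):
    """Approximate V_pi by repeatedly applying its recursive definition."""
    V = {**U, **{s: 0.0 for s in M}}   # terminal values are fixed by U
    for _ in range(sweeps):
        for s in M:
            V[s] = sum(policy[M[s]][a] * q_value(V, s, a) for a in ACTIONS)
    return V

policy = {"i": {"EXIT": 1 / 3, "CONT": 2 / 3}}
V = evaluate(policy)
print(V["X"], q_value(V, "X", "CONT"), q_value(V, "Y", "EXIT"))
```

With a continue-probability of 2/3, the initial value comes out at roughly 1/3, while $Q_{π}$ of continuing at the first intersection is roughly 1/2.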

Let $N$ be a random variable indicating the total number of iterations, and $S_{1}, \ldots, S_{N}$ be random variables indicating the state on each iteration. It is now possible to define the frequency of a given state (i.e. the expected number of times the agent will encounter this state):

$F_{π} (s) := E \left[ \sum_{i = 1}^{N} [S_{i} = s] \mid π \right]$

These frequencies are bounded since the expectation of $N$ is bounded. Given an observation, the agent may be uncertain which state it is in (since multiple states might result in the same observation). It is possible to use SIA to define subjective state probabilities using these frequencies:

$\mathrm{SIA}_{π} (s | o) := [m (s) = o] F_{π} (s)$

Note that I've defined SIA to return an un-normalized probability distribution; this turns out to be useful later, since it naturally handles the case when the observation $o$ occurs with probability 0.
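A rough sketch of how the frequencies and the un-normalized SIA weights could be computed, again on a hypothetical dictionary encoding of the absent-minded driver. Only non-terminal states are tracked (an assumption; terminal states play no role in choosing actions), and the fixed number of propagation sweeps is a simplification.

```python
T = {("X", "EXIT"): {"exit_X": 1.0}, ("X", "CONT"): {"Y": 1.0},
     ("Y", "EXIT"): {"exit_Y": 1.0}, ("Y", "CONT"): {"end": 1.0}}
M = {"X": "i", "Y": "i"}   # observation of each non-terminal state
START = "X"

def frequencies(policy, sweeps=100):
    """Expected visit counts F_pi(s) for the non-terminal states."""
    F = {s: 0.0 for s in M}
    for _ in range(sweeps):
        # The start state is visited once; other mass flows in from predecessors.
        new = {s: (1.0 if s == START else 0.0) for s in M}
        for s in M:
            for a, pa in policy[M[s]].items():
                for s2, p in T[(s, a)].items():
                    if s2 in M:            # terminal states are not tracked
                        new[s2] += F[s] * pa * p
        F = new
    return F

def sia(policy, o):
    """Un-normalized SIA weights: [m(s) = o] * F_pi(s)."""
    F = frequencies(policy)
    return {s: (F[s] if M[s] == o else 0.0) for s in M}

weights = sia({"i": {"EXIT": 1 / 3, "CONT": 2 / 3}}, "i")
print(weights)
```

For a continue-probability of 2/3 the weights are 1 for the first intersection and about 2/3 for the second, which would normalize to 3/5 and 2/5. If the policy always exits, the second intersection simply gets weight 0, with no division by zero anywhere.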

How might an agent decide which action to take? Under one approach (UDT), the agent simply computes the globally optimal policy $π$ that results in maximum expected utility (that is, a policy $π$ maximizing $V_{π} (s_{1})$ ) and takes the action recommended by this policy (perhaps stochastically). While UDT is philosophically satisfying, it is not a very direct algorithm. It would be nice to have a better intuition for how an agent using UDT acts, such that we could (in some cases) derive a polynomial-time algorithm.

So let's consider a local optimality condition. Intuitively, the condition states that if the agent has a nonzero probability of taking an action $a$ given observation $o$ , then that action should maximize expected utility (given the agent's uncertainty about which state it is in). More formally, the local optimality condition states:

$\forall o \in O, a \in A : π (a | o) > 0 \Rightarrow a \in \arg\max_{a' \in A} \sum_{s} \mathrm{SIA}_{π} (s | o) Q_{π} (s, a')$

Philosophically, a policy is locally optimal iff it is consistent with CDT (using SIA probabilities). This local optimality condition is not sufficient for global optimality (for the same reason that not all Nash equilibria in cooperative games are optimal), but it is necessary. The proof follows.
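A small sketch may help show the gap between local and global optimality. It uses a hypothetical two-step coordination game (an illustrative assumption, not an example from the text): the agent picks A or B under observation `first`, then again under observation `second`, and gets 1 for choosing A twice, 0.5 for choosing B twice, and 0 otherwise. The helper names are likewise illustrative.

```python
T = {("s1", "A"): {"sA": 1.0}, ("s1", "B"): {"sB": 1.0},
     ("sA", "A"): {"AA": 1.0}, ("sA", "B"): {"miss": 1.0},
     ("sB", "A"): {"miss": 1.0}, ("sB", "B"): {"BB": 1.0}}
M = {"s1": "first", "sA": "second", "sB": "second"}
U = {"AA": 1.0, "BB": 0.5, "miss": 0.0}
ACTIONS, START = ("A", "B"), "s1"

def q_value(V, s, a):
    return sum(p * V[s2] for s2, p in T[(s, a)].items())

def evaluate(policy, sweeps=50):
    V = {**U, **{s: 0.0 for s in M}}
    for _ in range(sweeps):
        for s in M:
            V[s] = sum(policy[M[s]][a] * q_value(V, s, a) for a in ACTIONS)
    return V

def frequencies(policy, sweeps=50):
    F = {s: 0.0 for s in M}
    for _ in range(sweeps):
        new = {s: float(s == START) for s in M}
        for s in M:
            for a in ACTIONS:
                for s2, p in T[(s, a)].items():
                    if s2 in M:
                        new[s2] += F[s] * policy[M[s]][a] * p
        F = new
    return F

def locally_optimal(policy, tol=1e-9):
    """Check the CDT+SIA condition for every observation."""
    V, F = evaluate(policy), frequencies(policy)
    for o in set(M.values()):
        # SIA-weighted Q value of each action under observation o
        score = {a: sum(F[s] * q_value(V, s, a) for s in M if M[s] == o)
                 for a in ACTIONS}
        best = max(score.values())
        if any(policy[o][a] > 0 and score[a] < best - tol for a in ACTIONS):
            return False
    return True

def pure(a1, a2):
    """Deterministic policy: a1 under 'first', a2 under 'second'."""
    return {"first": {a: float(a == a1) for a in ACTIONS},
            "second": {a: float(a == a2) for a in ACTIONS}}

for a1, a2 in [("A", "A"), ("B", "B"), ("A", "B")]:
    pi = pure(a1, a2)
    print(a1, a2, evaluate(pi)["s1"], locally_optimal(pi))
```

In this toy game, always choosing B passes the local check with expected utility 0.5, even though always choosing A achieves 1. Each choice of B is a best response to the other, just as in a bad Nash equilibrium of a cooperative game.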

Global optimality implies local optimality

Let $s$ be a state and $π$ be a policy. Consider a perturbation of the policy $π$ : given observation $o$ , the agent will take action $a_{+}$ more often, and action $a_{-}$ less often. This results in a change of the agent's expected utility starting from each state:

$d_{π} (o, a_{+}, a_{-}, s) := \frac{\partial}{\partial (π (a_{+} | o) - π (a_{-} | o))} V_{π} (s)$

$d_{π} (o, a_{+}, a_{-}, s) = 0 \text{ if } s \in S_{T}$
$d_{π} (o, a_{+}, a_{-}, s) = \left( \sum_{a} π (a | m (s)) \sum_{s'} t (s' | s, a) d_{π} (o, a_{+}, a_{-}, s') \right) + [m (s) = o] (Q_{π} (s, a_{+}) - Q_{π} (s, a_{-})) \text{ otherwise}$

This has a natural interpretation: to compute $d_{π} (o, a_{+}, a_{-}, s)$ , we compute the expected value of simulating a run starting from $s$ using policy $π$ and summing $Q_{π} (s^{'}, a_{+}) - Q_{π} (s^{'}, a_{-})$ (i.e. how much better $a_{+}$ is than $a_{-}$ in expectation in state $s^{'}$ ) for all visited states $s^{'}$ with $m (s^{'}) = o$ .

To determine the optimal policy, we are concerned with $d_{π} (o, a_{+}, a_{-}, s_{1})$ for different observations $o$ and actions $a_{+}, a_{-}$ . To compute this, we imagine starting from the state $s_{1}$ and following policy $π$ , and sum $Q_{π} (s, a_{+}) - Q_{π} (s, a_{-})$ for all visited states $s$ with $m (s) = o$ . This expected sum is actually equivalent to

$\sum_{s : m (s) = o} F_{π} (s) (Q_{π} (s, a_{+}) - Q_{π} (s, a_{-})) = \sum_{s} \mathrm{SIA}_{π} (s | o) (Q_{π} (s, a_{+}) - Q_{π} (s, a_{-}))$

i.e. the expected value of $Q_{π} (s, a_{+}) - Q_{π} (s, a_{-})$ with $s$ having SIA probabilities (up to a multiplicative constant). From here the implication should be clear: if a policy $π$ is not locally optimal, then there is some $o, a_{+}, a_{-}$ triple such that a small change in making $a_{+}$ more likely and $a_{-}$ less likely given observation $o$ will increase expected utility (just set $a_{-}$ to the non-optimal action having nonzero probability given $o$ , and set $a_{+}$ to be a better alternative action). So this policy $π$ would not be globally optimal either.
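The identity can be sanity-checked numerically. The sketch below, on a hypothetical encoding of the absent-minded driver with payoffs rescaled to 0, 1, 0.25, compares the SIA-weighted difference in $Q_{π}$ with a finite-difference estimate of the change in $V_{π} (s_{1})$ when a small amount of probability mass is moved from exiting ($a_{-}$) to continuing ($a_{+}$). The helper names and the step size are illustrative assumptions.

```python
T = {("X", "EXIT"): {"exit_X": 1.0}, ("X", "CONT"): {"Y": 1.0},
     ("Y", "EXIT"): {"exit_Y": 1.0}, ("Y", "CONT"): {"end": 1.0}}
M = {"X": "i", "Y": "i"}
U = {"exit_X": 0.0, "exit_Y": 1.0, "end": 0.25}
ACTIONS, START = ("EXIT", "CONT"), "X"

def policy(p):
    """Continue with probability p at every intersection."""
    return {"i": {"EXIT": 1 - p, "CONT": p}}

def q_value(V, s, a):
    return sum(pr * V[s2] for s2, pr in T[(s, a)].items())

def evaluate(pi, sweeps=50):
    V = {**U, **{s: 0.0 for s in M}}
    for _ in range(sweeps):
        for s in M:
            V[s] = sum(pi[M[s]][a] * q_value(V, s, a) for a in ACTIONS)
    return V

def frequencies(pi, sweeps=50):
    F = {s: 0.0 for s in M}
    for _ in range(sweeps):
        new = {s: float(s == START) for s in M}
        for s in M:
            for a in ACTIONS:
                for s2, pr in T[(s, a)].items():
                    if s2 in M:
                        new[s2] += F[s] * pi[M[s]][a] * pr
        F = new
    return F

def sia_advantage(p):
    """Sum over s of SIA(s | o) * (Q(s, CONT) - Q(s, EXIT))."""
    pi = policy(p)
    V, F = evaluate(pi), frequencies(pi)
    return sum(F[s] * (q_value(V, s, "CONT") - q_value(V, s, "EXIT")) for s in M)

def finite_difference(p, h=1e-6):
    """Central-difference estimate of dV(s_1)/dp."""
    return (evaluate(policy(p + h))[START] - evaluate(policy(p - h))[START]) / (2 * h)

for p in (0.1, 0.5, 2 / 3, 0.9):
    print(p, sia_advantage(p), finite_difference(p))
```

The two columns should agree up to numerical error, and both vanish near $p = 2/3$, the only continue-probability at which neither action looks strictly better under CDT+SIA.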

Conclusion

In memoryless Cartesian environments, policies consistent with CDT+SIA are locally optimal in some sense, and all globally optimal (UDT) policies are locally optimal in this sense. Therefore, if we look at (Cartesian) UDT the right way, it's doing CDT+SIA with some method for making sure the resulting policy is globally optimal rather than just locally optimal. It is not clear how to extend this analysis to non-Cartesian environments where logical updatelessness is important (e.g. agent simulates predictor), but this seems like a useful research avenue.

Since Briggs [1] shows that EDT+SSA and CDT+SIA are both ex-ante-optimal policies in some class of cases, one might wonder whether the result of this post transfers to EDT+SSA. I.e., in memoryless POMDPs, is every (ex ante) optimal policy also consistent with EDT+SSA in a similar sense? I think it is, as I will try to show below.

Given some existing policy $π$ , EDT+SSA recommends that upon receiving observation $o$ we should choose an action from $\arg\max_{a} \sum_{s_{1} \ldots s_{n}} \sum_{i = 1}^{n} \mathrm{SSA} (s_{i} \text{ in } s_{1} \ldots s_{n} \mid o, π_{o \to a}) U (s_{1} \ldots s_{n}) .$ (For notational simplicity, I'll assume that policies are deterministic, but, of course, actions may encode probability distributions.) Here, $π_{o \to a} (o') = a$ if $o = o'$ and $π_{o \to a} (o') = π (o')$ otherwise. $\mathrm{SSA} (s_{i} \text{ in } s_{1} \ldots s_{n} \mid o, π_{o \to a})$ is the SSA probability of being in state $s_{i}$ of the environment trajectory $s_{1} \ldots s_{n}$ given the observation $o$ and the fact that one uses the policy $π_{o \to a}$ .

The SSA probability $\mathrm{SSA} (s_{i} \text{ in } s_{1} \ldots s_{n} \mid o, π_{o \to a})$ is zero if $m (s_{i}) \neq o$ and $\mathrm{SSA} (s_{i} \text{ in } s_{1} \ldots s_{n} \mid o, π_{o \to a}) = P (s_{1} \ldots s_{n} \mid π_{o \to a}, o) \frac{1}{\# (o, s_{1} \ldots s_{n})}$ otherwise. Here, $\# (o, s_{1} \ldots s_{n}) = \sum_{i = 1}^{n} [m (s_{i}) = o]$ is the number of times $o$ occurs in $s_{1} \ldots s_{n}$ . Note that this is the minimal reference class version of SSA, also known as the double-halfer rule (because it assigns 1/2 probability to tails in the Sleeping Beauty problem and sticks with 1/2 if it's told that it's Monday). $P (s_{1} \ldots s_{n} \mid π_{o \to a}, o)$ is the (regular, non-anthropic) probability of the sequence of states $s_{1} \ldots s_{n}$ , given that $π_{o \to a}$ is played and $o$ is observed at least once. If (as in the sum above) $o$ is observed at least once in $s_{1} \ldots s_{n}$ , we can rewrite this as $P (s_{1} \ldots s_{n} \mid π_{o \to a}, o) = \frac{P (s_{1} \ldots s_{n} \mid π_{o \to a})}{P (o \mid π_{o \to a})} .$ Importantly, note that $P (o \mid π_{o \to a})$ is constant in $a$ , i.e., the probability that you observe $o$ at least once cannot (in the present setting) depend on what you would do when you observe $o$ .

Inserting this into the above, we get $\arg\max_{a} \sum_{s_{1} \ldots s_{n}} \sum_{i = 1}^{n} \mathrm{SSA} (s_{i} \text{ in } s_{1} \ldots s_{n} \mid o, π_{o \to a}) U (s_{1} \ldots s_{n}) = \arg\max_{a} \sum_{s_{1} \ldots s_{n} \text{ with } o} \; \sum_{i = 1 \ldots n, \, m (s_{i}) = o} \frac{P (s_{1} \ldots s_{n} \mid π_{o \to a})}{\# (o, s_{1} \ldots s_{n}) P (o \mid π_{o \to a})} U (s_{1} \ldots s_{n}),$ where the first sum on the right-hand side is over all histories that give rise to observation $o$ at some point. Dividing by the number of agents with observation $o$ in a history and setting the policy for all agents at the same time cancel each other out, such that this equals $\arg\max_{a} \frac{1}{P (o \mid π_{o \to a})} \sum_{s_{1} \ldots s_{n} \text{ with } o} P (s_{1} \ldots s_{n} \mid π_{o \to a}) U (s_{1} \ldots s_{n}) = \arg\max_{a} \sum_{s_{1} \ldots s_{n} \text{ with } o} P (s_{1} \ldots s_{n} \mid π_{o \to a}) U (s_{1} \ldots s_{n}) = \arg\max_{a} \sum_{s_{1} \ldots s_{n}} P (s_{1} \ldots s_{n} \mid π_{o \to a}) U (s_{1} \ldots s_{n}) .$ Obviously, any optimal policy chooses in agreement with this. But the same disclaimers apply; if there are multiple observations, then multiple policies might satisfy the right-hand side of this equation and not all of these are optimal.
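The cancellation can be checked on a small case. The sketch below enumerates the three histories of the absent-minded driver (payoffs rescaled to 0, 1, 0.25, an illustrative assumption) and treats the "action" as the probability $p$ of continuing, in line with the remark that actions may encode probability distributions. It computes the SSA-weighted sum directly and compares it with the plain ex ante expected utility. Function names are hypothetical, and since this example has a single observation that always occurs, the $P (o \mid π_{o \to a})$ factor is trivially 1 here.

```python
OBS = {"X": "i", "Y": "i"}  # observation at each non-terminal state

def trajectories(p):
    """(states, probability, utility) of each history when continuing w.p. p."""
    return [(("X", "exit_X"), 1 - p, 0.0),
            (("X", "Y", "exit_Y"), p * (1 - p), 1.0),
            (("X", "Y", "end"), p * p, 0.25)]

def count_obs(states, o):
    """#(o, s_1 ... s_n): how often observation o occurs in a history."""
    return sum(OBS.get(s) == o for s in states)

def edt_ssa_value(p, o="i"):
    """Sum over histories and instances of SSA weight times utility."""
    trajs = trajectories(p)
    p_obs = sum(pr for states, pr, _ in trajs if count_obs(states, o) > 0)
    total = 0.0
    for states, pr, u in trajs:
        n_o = count_obs(states, o)
        for s in states:
            if OBS.get(s) == o:
                # P(history | o) split evenly among the instances observing o
                total += (pr / p_obs) / n_o * u
    return total

def ex_ante_value(p):
    return sum(pr * u for _, pr, u in trajectories(p))

grid = [k / 300 for k in range(301)]
best = max(grid, key=edt_ssa_value)
print(best, edt_ssa_value(best), ex_ante_value(best))
```

On this grid the EDT+SSA objective and the ex ante expected utility coincide up to rounding, and the maximizer lands on $p = 2/3$, the ex ante optimum for these payoffs.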

[1] Rachael Briggs (2010): Putting a value on Beauty. In Tamar Szabo Gendler and John Hawthorne, editors, Oxford Studies in Epistemology: Volume 3, pages 3–34. Oxford University Press, 2010. http://joelvelasco.net/teaching/3865/briggs10-puttingavalueonbeauty.pdf

Caveat: The version of EDT provided above only takes dependences between instances of EDT making the same observation into account. Other dependences are possible because different decision situations may be completely "isomorphic"/symmetric even if the observations are different. It turns out that the result is not valid once one takes such dependences into account, as shown by Conitzer [2]. I propose a possible solution in https://casparoesterheld.com/2017/10/22/a-behaviorist-approach-to-building-phenomenological-bridges/ . Roughly speaking, my solution is to identify with all objects in the world that are perfectly correlated with you. However, the underlying motivation is unrelated to Conitzer's example.

[2] Vincent Conitzer: A Dutch Book against Sleeping Beauties Who Are Evidential Decision Theorists. Synthese, Volume 192, Issue 9, pp. 2887-2899, October 2015. https://arxiv.org/pdf/1705.03560.pdf

I noticed that the sum inside $\arg\max_{a} \sum_{s_{1}, \ldots, s_{n}} \sum_{i = 1}^{n} \mathrm{SSA} (s_{i} \text{ in } s_{1}, \ldots, s_{n} \mid o, π_{o \to a}) U (s_{n})$ is not actually an expected utility, because the SSA probabilities do not add up to 1 when there is more than one possible observation. The issue is that conditional on making an observation, the probabilities for the trajectories not containing that observation become 0, but the other probabilities are not renormalized. So this seems to be part way between "real" EDT and UDT (which does not set those probabilities to 0 and of course also does not renormalize).

This zeroing of probabilities of trajectories not containing the current observation (and renormalizing, if one were to do that) seems at best useless busywork, and at worst prevents coordination between agents making different observations. In this formulation of EDT, such coordination is ruled out in another way, namely by specifying that conditional on o→a, the agent is still sure the rest of π is unchanged (i.e., copies of itself receiving other observations keep following π). If we remove the zeroing/renormalizing and say that the agent ought to have more realistic beliefs conditional on o→a, I think we end up with something close to UDT1.0 (modulo differences in the environment model from the original UDT).

(Oh, I ignored the splitting up of probabilities of trajectories into SSA probabilities and then adding them back up again, which may have some intuitive appeal but ends up being just a null operation. Does anyone see a significance to that part?)

Sorry for taking an eternity to reply (again).

On the first point: Good point! I've now finally fixed the SSA probabilities so that they sum to 1, as they should if this is really to be a version of EDT.

>prevents coordination between agents making different observations.

Yeah, coordination between different observations is definitely not optimal in this case. But I don't see an EDT way of doing it well. After all, there are cases where given one observation, you prefer one policy and given another observation you favor another policy. So I think you need the ex ante perspective to get consistent preferences over entire policies.

>(Oh, I ignored the splitting up of probabilities of trajectories into SSA probabilities and then adding them back up again, which may have some intuitive appeal but ends up being just a null operation. Does anyone see a significance to that part?)

The only significance is to get a version of EDT, which we would traditionally assume to have self-locating beliefs. From a purely mathematical point of view, I think it's nonsense.

I now have a draft for a paper that gives this result and others.

Elsewhere, I illustrate this result for the absent-minded driver.

This result features in the paper by Piccione and Rubinstein that introduced the absent-minded driver problem [1].

Philosophers like decision theories that self-ratify, and this is indeed a powerful self-ratification principle.

This self-ratification principle does, however, rely on SIA probabilities that assume the current policy. We have shown that, conditioning on your current policy, you will want to continue with your current policy, i.e. the policy will be a Nash equilibrium. There can, however, be Nash equilibria for other policies $π'$ . The UDT policy will by definition equal or beat these from the ex ante point of view. However, others can achieve higher expected utility conditioning on the initial observation, i.e. higher $\mathrm{SIA}_{π'} (s | o) Q_{π'} (s, a)$ . This apparent paradox is discussed in [2] [3], and seems to reduce to disagreement over counterfactual mugging.

So why do we like the UDT solution over solutions that are more optimal locally, and that also locally self-ratify? Obviously we want to avoid resorting to circular reasoning (i.e. it gets the best utility ex ante). I think there are some okay reasons:

i) it is reflectively stable (i.e. will not self-modify, will not hide future evidence)
ii) it makes sense assuming modal realism or the many-worlds interpretation (then we deem it parochial to focus on any reference frame other than equal weighting across the whole wavefunction/universe)
iii) it makes sense if we assume that self-location somehow does not
iv) it's simpler (utility function given weighting 1 across all worlds). In principle, UDT can also include the locally optimal
v) it transfers better to scenarios without randomization, as in Nate + Ben Levinstein's forthcoming [4].

I imagine there are more good arguments that I don't yet know.

[1] Piccione, Michele, and Ariel Rubinstein. "On the interpretation of decision problems with imperfect recall." Games and Economic Behavior 20.1 (1997): 3-24. See p. 19.
[2] Schwarz, Wolfgang. "Lost memories and useless coins: revisiting the absentminded driver." Synthese 192.9 (2015): 3011-3036.
[3] http://lesswrong.com/lw/3dy/has_anyone_solved_psykoshs_nonanthropic_problem/
[4] Soares, Nate, and Ben Levinstein. "Cheating Death in Damascus." Forthcoming.