(From conversations with Sam, Abram, Tsvi, Marcello, and Ashwin Sah) A basic EDT agent starts with a prior, updates on a bunch of observations, and then has a choice between various actions. It conditions on each possible action it could take, and takes the action for which this conditional leads to the highest expected utility. An updateless (but non-policy selection) EDT agent has a problem here. It wants to not update on the observations, but it wants to condition on the fact that it takes a specific action given its observations. It is not obvious what this conditional should look like. In this post, I argue for a particular way to interpret conditioning on this conditional (of taking a specific action given a specific observation).

Formally, we have a set of observations $O$, a set of actions $A$, and a set of possible utilities $U$. The agent has a prior distribution $P$ over $O \times A \times U$. For simplicity, we assume that $U$ takes one of two possible values, 0 and 1. Our original EDT agent, upon seeing $O=o$, takes the action

$$\operatorname{argmax}_a P(U=1 \mid O=o \wedge A=a).$$
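As a minimal sketch of this decision rule, assume the prior is stored as a dictionary from (observation, action, utility) triples to probabilities. The toy numbers and the helper names `cond_prob` and `edt_action` are illustrative assumptions, not part of the original setup.

```python
from fractions import Fraction as F

# Hypothetical toy prior over (observation, action, utility) worlds; sums to 1.
P = {
    ("o1", "a1", 1): F(1, 8),  ("o1", "a1", 0): F(1, 8),
    ("o1", "a2", 1): F(3, 16), ("o1", "a2", 0): F(1, 16),
    ("o2", "a1", 1): F(1, 4),  ("o2", "a1", 0): F(0),
    ("o2", "a2", 1): F(1, 8),  ("o2", "a2", 0): F(1, 8),
}

def cond_prob(P, event, given):
    """P(event | given); both arguments are predicates on worlds.
    Assumes P(given) > 0, otherwise this raises ZeroDivisionError."""
    denom = sum(p for w, p in P.items() if given(w))
    numer = sum(p for w, p in P.items() if given(w) and event(w))
    return numer / denom

def edt_action(P, o, actions):
    """argmax over a of P(U=1 | O=o and A=a)."""
    def value(a):
        return cond_prob(P,
                         lambda w: w[2] == 1,
                         lambda w: w[0] == o and w[1] == a)
    return max(actions, key=value)

print(edt_action(P, "o1", ["a1", "a2"]))  # a2
print(edt_action(P, "o2", ["a1", "a2"]))  # a1
```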

Similarly, we would like our updateless EDT agent, upon seeing $O=o$, to take the action

$$\operatorname{argmax}_a P(U=1 \mid (A=a \mid O=o)),$$

but it is not clear what the "conditional distribution" $Q(X)=P(X \mid (A=a \mid O=o))$ should look like. We are not conditioning on a standard event. We will define $Q$ as having minimal KL divergence with $P$ subject to the constraint that $Q(A=a \mid O=o)=1$. However, there is still more than one way to define this: first, because KL divergence is not symmetric, and second, because in one direction the KL divergence is infinite, so we need to take a limit, and it matters how we take this limit. Three candidates are:

1. The $Q$ minimizing $D_{KL}(Q \| P)$ subject to $Q(A=a \mid O=o)=1$.

2. The limit as $x$ goes to 1 of the $Q$ minimizing $D_{KL}(P \| Q)$ subject to $Q(A=a \mid O=o)=x$.

3. The limit as $x$ goes to 0 of the $Q$ minimizing $D_{KL}(P \| Q)$ subject to $Q(O=o \wedge A \neq a)=x$.

Note that all of the above methods are natural generalizations of a procedure that gives the normal Bayesian update when conditioning on an actual event.

Here, recall that $D_{KL}(P \| Q)=\sum_i P(i)\log\frac{P(i)}{Q(i)}$.
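To illustrate the asymmetry and the infinite direction mentioned earlier, here is a small sketch of the KL divergence on finite distributions. The function name `kl` and the two example distributions are assumptions chosen for illustration; the convention that terms with P(i)=0 contribute nothing is the standard one.

```python
import math

def kl(P, Q):
    """D_KL(P||Q) = sum_i P(i) * log(P(i) / Q(i)).
    Terms with P(i) == 0 contribute 0; if P(i) > 0 and Q(i) == 0 the result is infinite."""
    total = 0.0
    for i, p in P.items():
        if p == 0:
            continue
        q = Q.get(i, 0.0)
        if q == 0:
            return math.inf
        total += p * math.log(p / q)
    return total

# A uniform prior over four worlds, and a distribution that rules one world out.
P = {"w1": 0.25, "w2": 0.25, "w3": 0.25, "w4": 0.25}
Q = {"w1": 1 / 3, "w2": 0.0, "w3": 1 / 3, "w4": 1 / 3}

print(kl(Q, P))  # finite, equal to log(4/3)
print(kl(P, Q))  # inf, because Q assigns probability 0 to a world P allows
```

Since any Q that satisfies the constraint exactly gives zero probability to worlds the prior allows, the second direction is only usable through a limit.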

As a simple example, assume that $A$ and $O$ are independently uniform between two options. If we condition on $A=a_1 \mid O=o_1$, it seems most natural to preserve the fact that the probability of $o_1$ is 1/2. Our choice of action should not change the probability of $o_1$. However, both definitions 1 and 3 give $Q(o_1)=1/3$. We are in effect ruling out the world in which $O=o_1$ and $A=a_2$, and distributing our probability mass equally between the other 3 worlds.

Option 2, however, updates to $Q(O=o_1 \wedge A=a_1)=1/2$ while keeping $Q(O=o_2 \wedge A=a_1)$ and $Q(O=o_2 \wedge A=a_2)$ at 1/4 each, so $Q(o_1)$ stays at 1/2.
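The example above can be checked with a short sketch. It assumes closed forms for the three methods rather than running a numerical minimization: methods 1 and 3 are taken to be the ordinary Bayesian update on the event "not (O=o and A≠a)", and method 2 is taken to leave every world with O≠o untouched while moving all of the O=o mass onto A=a. The function names are hypothetical.

```python
from fractions import Fraction as F

# Prior from the example: O and A independent and uniform over two options.
P = {(o, a): F(1, 4) for o in ("o1", "o2") for a in ("a1", "a2")}

def condition_on_implication(P, o, a):
    """Methods 1 and 3 (assumed closed form): Bayesian update on the
    event 'not (O=o and A!=a)', i.e. on the implication O=o -> A=a."""
    ruled_out = lambda w: w[0] == o and w[1] != a
    z = sum(p for w, p in P.items() if not ruled_out(w))
    return {w: (F(0) if ruled_out(w) else p / z) for w, p in P.items()}

def condition_on_conditional(P, o, a):
    """Method 2 (assumed closed form of the x -> 1 limit): worlds with
    O != o keep their mass; the O=o mass is rescaled onto A=a.
    Assumes P(O=o and A=a) > 0."""
    mass_o = sum(p for w, p in P.items() if w[0] == o)
    mass_oa = sum(p for w, p in P.items() if w[0] == o and w[1] == a)
    Q = {}
    for w, p in P.items():
        if w[0] != o:
            Q[w] = p
        elif w[1] == a:
            Q[w] = p * mass_o / mass_oa
        else:
            Q[w] = F(0)
    return Q

def prob_obs(Q, o):
    """Marginal probability Q(O=o)."""
    return sum(p for w, p in Q.items() if w[0] == o)

Q13 = condition_on_implication(P, "o1", "a1")
Q2 = condition_on_conditional(P, "o1", "a1")
print(prob_obs(Q13, "o1"))  # 1/3
print(prob_obs(Q2, "o1"))   # 1/2
```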

Philosophically, I think that option 1 just minimizes KL divergence in the wrong direction (usually the free parameter that you are minimizing over should be the second argument of the KL divergence), while option 2 correctly conditions on the conditional $(a \mid o)$ and option 3 conditions on the implication $o \to a$.

Thus, I will interpret $P(- \mid (A=a \mid O=o))$ as calculated via method 2. I may also want to use $P(- \mid O=o \to A=a)$, which would be calculated via method 3 (or, equivalently, by just updating on the event $O=o \to A=a$).
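A rough sketch of how both quantities could be computed once utilities are included follows. It again relies on assumed closed forms (method 2 keeps the O≠o worlds fixed and rescales the O=o mass onto A=a; method 3 is a plain update on the implication), and the toy prior and all function names are hypothetical.

```python
from fractions import Fraction as F

# Hypothetical toy prior over (observation, action, utility) worlds; sums to 1.
P = {
    ("o1", "a1", 1): F(1, 8),  ("o1", "a1", 0): F(1, 8),
    ("o1", "a2", 1): F(3, 16), ("o1", "a2", 0): F(1, 16),
    ("o2", "a1", 1): F(1, 4),  ("o2", "a1", 0): F(0),
    ("o2", "a2", 1): F(1, 8),  ("o2", "a2", 0): F(1, 8),
}

def prob(Q, event):
    """Q(event) for a predicate on worlds."""
    return sum(p for w, p in Q.items() if event(w))

def given_conditional(P, o, a):
    """P(- | (A=a | O=o)) via method 2 (assumed closed form).
    Assumes P(O=o and A=a) > 0."""
    mass_o = prob(P, lambda w: w[0] == o)
    mass_oa = prob(P, lambda w: w[0] == o and w[1] == a)
    Q = {}
    for w, p in P.items():
        if w[0] != o:
            Q[w] = p                      # other observations are untouched
        elif w[1] == a:
            Q[w] = p * mass_o / mass_oa   # O=o mass moves onto A=a
        else:
            Q[w] = F(0)
    return Q

def given_implication(P, o, a):
    """P(- | O=o -> A=a): ordinary update on 'not (O=o and A!=a)'."""
    ruled_out = lambda w: w[0] == o and w[1] != a
    z = prob(P, lambda w: not ruled_out(w))
    return {w: (F(0) if ruled_out(w) else p / z) for w, p in P.items()}

def expected_utility(Q):
    """Q(U=1), which is the expected utility when U is 0 or 1."""
    return prob(Q, lambda w: w[2] == 1)

def updateless_edt_action(P, o, actions):
    """argmax over a of P(U=1 | (A=a | O=o)), using method 2."""
    return max(actions, key=lambda a: expected_utility(given_conditional(P, o, a)))

for a in ("a1", "a2"):
    print(a,
          expected_utility(given_conditional(P, "o1", a)),
          expected_utility(given_implication(P, "o1", a)))
print(updateless_edt_action(P, "o1", ["a1", "a2"]))
```

On this toy prior the two conditionals assign different values to `a1` (5/8 under method 2, 2/3 under method 3) and the same value to `a2` (3/4).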

Conditioning in this way causes the agent to believe that it can never change the probability of its observation, which seems bad for situations like transparent Newcomb. It appears you might be able to get past this by having multiple different instances of the agent (in this case one real and one in Omega's head), and conditioning on the conditional for all instances simultaneously. This way, one instance can change the probability of the observation for the other instance. However, this seems unsatisfying, since it requires knowing in advance where all copies of yourself are.
