Related: Conceptual Problems with UDT and Policy Selection, Formalising decision theory is hard


Anyone who is interested in decision theory. The post is pretty general and not really technical; some familiarity with counterfactual mugging can be useful, but overall the required background knowledge is not much.


The post develops the claim that identifying the correct solution to some decision problems might be intricate, if not impossible, when certain details about the specific scenario are not given. First I show that, in counterfactual mugging, some important elements in the problem description and in a possible formalisation are actually underspecified. Next I describe issues related to the concept of perfect prediction and briefly discuss whether they apply to other decision scenarios involving predictors. Then I present some advantages and disadvantages of the formalisation of agents as computer programs. A summary with bullet points concludes.

Missing parts of a “correct” solution

I focus on the version of the problem with cards and two humans since, to me, it feels more grounded in reality—a game that could actually be played—but what I say applies also to the version with a coin toss and Omega.

What makes the problem interesting is the conflict between these two intuitions:

  • Before Player A looks at the card, the best strategy seems to never show the card, because it is the strategy that makes Player A lose the least in expectation, given the uncertainty about the value of the card (50/50 high or low)
  • After Player A sees a low card, showing it seems a really good idea, because that action gives Player A a loss of 0, which is the best possible result considering that the game is played only once and never again. Thus, the incentive to not reveal the card seems to disappear after Player A knows that the card is low.

[In the other version, the conflict is between paying before the coin toss and refusing to pay after knowing the coin landed tails.]

One attempt at formalising the problem is to represent it as a tree (a formalisation similar to the following one is considered here). The root is a 50/50 chance node representing the possible values of the card. Then Player A chooses between showing and not showing the card; each action leads to a leaf with a value which indicates the loss for Player A. The peculiarity of counterfactual mugging is that some payoffs depend on actions taken in a different subtree.

[The tree of the other version is a bit different since the player has a choice only when the coin lands tails; anyway, the payoff in the heads case is “peculiar” in the same sense of the card version, since it depends on the action taken when the coin lands tails.]

With this representation, it is easy to see that we can assign an expected value (EV) to each deterministic policy available to the player: we start from the root of the tree, then we follow the path prescribed by the policy until we reach a payoff, which is assigned a weight according to the chance nodes that we’ve run into.

Therefore it is possible to order the policies according to their expected values and determine which one gives the lowest expected loss [or, in the other version, the highest EV] with respect to the root of the tree. This is the formalism behind the first of the two intuitions presented before.

On the other hand, one could object that it is far from trivial that the correct thing to do is to minimise expected loss from the root of the tree. In fact, in the original problem statement, the card is low [tails], so the relevance of the payoffs in the other subtree—where the card is high [heads]—is not clear and the focus should be on the decision node with the low card, not on the root of the tree. This is the formalism behind the second intuition.
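To make the two evaluation points concrete, here is a minimal sketch for the coin-toss version. It assumes the payoffs of the standard statement of counterfactual mugging (Omega gives 10000 on heads if the player would pay 100 on tails); the function and variable names are purely illustrative, and the card version would use losses instead of gains.

```python
from fractions import Fraction

# Assumed payoffs (standard counterfactual mugging, coin-toss version):
# heads: Omega pays REWARD if the player would pay on tails;
# tails: the player is asked to pay COST.
REWARD = 10000
COST = 100
P_HEADS = Fraction(1, 2)

def payoff(coin, pays_on_tails):
    """Payoff at a leaf; the heads payoff depends on the action in the tails subtree."""
    if coin == "heads":
        return REWARD if pays_on_tails else 0
    return -COST if pays_on_tails else 0

def expected_value(pays_on_tails):
    """Value of a deterministic policy evaluated at the root (first intuition)."""
    return (P_HEADS * payoff("heads", pays_on_tails)
            + (1 - P_HEADS) * payoff("tails", pays_on_tails))

def value_after_tails(pays_on_tails):
    """Value of the same policy evaluated at the tails decision node (second intuition)."""
    return payoff("tails", pays_on_tails)

policies = {"pay on tails": True, "refuse on tails": False}

# Rank the policies from best to worst under each evaluation point.
root_ranking = sorted(policies, key=lambda name: expected_value(policies[name]), reverse=True)
node_ranking = sorted(policies, key=lambda name: value_after_tails(policies[name]), reverse=True)

print("best from the root:", root_ranking[0])
print("best from the tails node:", node_ranking[0])
```

The same tree and the same payoffs produce opposite rankings depending only on the node from which value is computed, which is exactly the choice that the problem statement leaves open.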

Even though the objection related to the second intuition sounds reasonable, I think one could point to other, more important issues underlying the problem statement and formalisation. Why is there a root in the first place and what does it represent? What do we mean when we say that we minimise loss “from the start”?

These questions are more complicated than they seem: let me elaborate on them. Suppose that the advice of maximising EV “from the start” is generally correct from a decision theory point of view. It is not clear how we should apply that advice in order to make correct decisions as humans, or to create an AI that makes correct decisions. Should we maximise value...

  1. ...from the instant in which we are “making the decision”? This seems to bring us back to the second intuition, where we want to show the card once we’ve seen it is low.
  2. ...from our first conscious moment, or from when we started collecting data about the world, or maybe from the moment which the first data point in our memory is about? In the case of an AI, this would correspond to the moment of the “creation” of the AI, whatever that means, or maybe to the first instant which the data we put into the AI points to.
  3. ...from the very first moment since the beginning of space-time? After all, the universe we are observing could be one possible outcome of a random process, analogous to the 50/50 high/low card [or the coin toss].

Regarding point 1, I’ve mentioned the second intuition, but other interpretations could be closer to the first intuition instead. The root could represent the moment in which we settle our policy, and this is what we would mean by “making the decision”.

Then, however, other questions should be answered about policy selection. Why and when should we change policy? If selecting a policy is what constitutes a decision, what exactly is the role of actions, or how is changing policy fundamentally different from other actions? It seems we are treating policies and actions as concepts belonging to two different levels in a hierarchy: if this is a correct model, it is not clear to me why we do not use further levels, or why we need two different levels, especially when thinking in terms of embedded agency.

Note that giving precise answers to the questions in the previous paragraph could help us find a criterion to distinguish fair problems from unfair ones, which would be useful to compare the performance of different decision theories, as pointed out in the conclusion of the paper on FDT. Considering fair all the problems in which the outcome depends only on the agent’s behavior in the dilemma at hand (p.29) is not a satisfactory criterion when all the issues outlined before are taken into account: the lack of clarity about the role of root, decision nodes, policies and actions makes the “borders” of a decision problem blurred, and leaves the agent’s behaviour as an underspecified concept.

Moreover, resolving the ambiguities in the expression “from the start” could also explain why it seems difficult to apply updatelessness to game theory (see the sections “Two Ways UDT Hasn’t Generalized” and “What UDT Wants”).


A weird scenario with perfect prediction

So far, we’ve reasoned as if Player B (who determines the loss of Player A by choosing the value that best represents his belief that the card is high) can perfectly guess the strategy that Player A adopts. Analogously, in the version with the coin toss, Omega is capable of perfectly predicting what the decision maker does when the coin lands tails, because that information is necessary to determine the payoff in case the coin lands heads.

However, I think that the concept of perfect prediction also deserves further investigation: not because it is an implausible idealisation of a highly accurate prediction, but because it can lead to strange conclusions, if not downright contradictions, even in very simple settings.

Consider a human that is going to choose only one between two options: M or N. Before the choice, a perfect predictor analyses the human and writes the letter (M or N) corresponding to the predicted choice on a piece of paper, which is given to the human. Now, what exactly prevents the human from reading the piece of paper and choosing the other option instead?

From a slightly different perspective: assume there exists a human, facing a decision between M and N, who is capable of reading a piece of paper containing only one letter, M or N, and choosing the opposite; this seems quite a weak assumption. Is a “perfect predictor” that writes the predicted option on a piece of paper and gives it to the human… always wrong?

Note that allowing probabilities doesn’t help: a human capable of always choosing M when reading a prediction like “probability p of choosing M, probability 1-p of choosing N” seems as plausible as the previous human, but again would make the prediction always wrong.
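The deterministic version of this argument can be sketched in a few lines. The names below (`contrarian`, `is_correct`) are hypothetical and only model the weak assumption stated above: a human who reads the letter and picks the other option.

```python
def contrarian(written_prediction):
    """A human who reads the letter on the paper and chooses the other option."""
    return "N" if written_prediction == "M" else "M"

def is_correct(written_prediction, human):
    """A written prediction is correct iff the human's choice matches it."""
    return human(written_prediction) == written_prediction

# Try every letter the predictor could write and hand to the human.
results = {letter: is_correct(letter, contrarian) for letter in ("M", "N")}
print(results)
```

Under these assumptions the map from written letter to choice has no fixed point, so no communicated prediction can come out right, however much the predictor knows about the human.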

Other predictions

Unlike the previous example, Newcomb’s and other problems involve decision makers who are not told about the prediction outcome. However, the difference might not be as clear-cut as it first appears. If the decision maker regards some information—maybe elements of the deliberation process itself—as evidence about the imminent choice, the DM will also have information about the prediction outcome, since the predictor is known to be reliable. To what extent is this information about the prediction outcome different from the piece of paper in the previous example? What exactly can be considered evidence about one’s own future choices? The answer seems to be related to the details of the prediction process and how it is carried out.

It may be useful to consider how a prediction is implemented as a specific program. In this paper by Critch, an algorithm plays the prisoner’s dilemma by cooperating if it successfully predicts that the opponent will cooperate, and defecting otherwise. Here the “prediction” consists of a search for proofs, up to a certain length, that the other algorithm outputs Cooperate when given the first algorithm as input. Thanks to a bounded version of Löb’s theorem, this specific prediction implementation allows the algorithm to cooperate when playing against itself.

Results of this kind (open-source game theory / program equilibrium) could be especially relevant in a future in which important policy choices are made by AIs that interact with each other. Note, however, that no claim is made about the rationality of the algorithm’s overall behaviour: it is debatable whether its decision to cooperate against a program that always cooperates is correct.

Moreover, seeing decision makers as programs can be confusing and less precise than one would intuitively think, because it is still unclear how to properly formalise concepts such as action, policy and decision-making procedure, as discussed previously. If actions in certain situations correspond to program outputs given certain inputs, does policy selection correspond to program selection? If so, why is policy selection not an action like the other ones? And—related to what I said before about using a hierarchy of exactly two levels—why don’t we also “select” the code fragment that does policy selection?

In general, approaches that use some kind of formalism tend to be more precise than purely philosophical approaches, but there are some disadvantages as well. Focusing on low-level details can make us lose sight of the bigger picture and limit lateral thinking, which can be a great source of insight for finding alternative solutions in certain situations. In a blackmail scenario, besides the decision to pay or not, we could consider what factors caused the leakage of sensitive information, or the exposure of something we care about, to adversarial agents. Another example: in a prisoner’s dilemma, the equilibrium can shift to mutual cooperation thanks to the intervention of an external actor that makes the payoffs for defection worse (the chapter on game theory in Algorithms to Live By gives a nice presentation of this equilibrium shift and related concepts).

We may also take into account that, for efficiency reasons, predictions in practice might be made with methods different from close-to-perfect physical or algorithmic simulation, and the specific method used could be relevant for an accurate analysis of the situation, as mentioned before. In the case of human interaction, sometimes it is possible to infer something about one’s future actions by reading facial expressions; but this also means that a predictor can be tricked if one is capable of masking their own intentions by keeping a poker face.


  • The claim that a certain decision is correct because it maximises utility may require further explanation, since every decision problem sits in a context which might not be fully captured in the problem formalisation.
  • Perfect prediction leads to seemingly paradoxical situations. It is unclear whether these problems underlie other scenarios involving prediction. This does not mean the concept must be rejected; but our current understanding of prediction might lack critical details. Certain problems may require clarification of how the prediction is made before a solution is claimed as correct.
  • The use of precise mathematical formalism can resolve some ambiguities. At the same time, interesting solutions to certain situations may lie “outside” the original problem statement.

Thanks to Abram Demski, Wolfgang Schwarz and Caspar Oesterheld for extensive feedback.

This work was supported by CEEALAR.



There are biases in favor of the there-is-always-a-correct-solution framework. Uncovering the right solution in decision problems can be fun, and finding the Decision Theory to solve them all can be appealing.

On “wrong” solutions

Many of the reasons provided in this post also explain why it’s tricky to determine what a certain decision theory does in a problem, and whether the given solution is wrong. But I want to provide another reason, namely the following informal...

Conjecture: for any decision problem that you believe CDT/EDT gets wrong, there exists a paper or book in which a particular version of CDT/EDT gives the solution that you believe is correct, and/or a paper or book that argues that the solution you believe is correct is actually wrong.

Here’s an example about Newcomb’s problem.




Hey Michael, I agree that it is important to look very closely at problems like Counterfactual Mugging and not accept solutions that involve handwaving.

Suppose the predictor knows that if it writes M on the paper you'll choose N, and if it writes N on the paper you'll choose M. Further, if it writes nothing you'll choose M. That isn't a problem since regardless of what it writes it would have predicted your choice correctly. It just can't write down the choice without making you choose the opposite.

I was quite skeptical of paying in Counterfactual Mugging until I discovered the Counterfactual Prisoner's Dilemma which addresses the problem of why you should care about counterfactuals given that they aren't factual by definition.

Ideally you'd start doing something like UDT from the beginning of time, but humans don't know UDT when they are born, so you'd have to adjust it to take this into account by treating these initial decisions as independent of your UDT policy.

Hi Chris!

Suppose the predictor knows that if it writes M on the paper you'll choose N, and if it writes N on the paper you'll choose M. Further, if it writes nothing you'll choose M. That isn't a problem since regardless of what it writes it would have predicted your choice correctly. It just can't write down the choice without making you choose the opposite.

My point in the post is that the paradoxical situation occurs when the prediction outcome is communicated to the decision maker. We have a seemingly correct prediction—the one that you wrote about—that ceases to be correct after it is communicated. And later in the post I discuss whether this problematic feature of prediction extends to other scenarios, leaving the question open. What did you want to say exactly?

I was quite skeptical of paying in Counterfactual Mugging until I discovered the Counterfactual Prisoner's Dilemma which addresses the problem of why you should care about counterfactuals given that they aren't factual by definition.

I've read the problem and the analysis I did for (standard) counterfactual mugging applies to your version as well.

The first intuition is that, before knowing the toss outcome, the DM wants to pay in both cases, because that gives the highest utility (9900) in expectation.

The second intuition is that, after the DM knows (wlog) the outcome is heads, he doesn't want to pay anymore in that case—and wants to be someone who pays when tails is the outcome, thus getting 10000.

Well, you can only predict conditional on what you write, you can't predict unconditionally. However, once you've fixed what you'll write in order to make a prediction, you can't then change what you'll write in response to that prediction.

Actually, it isn't about utility in expectation. If you are the kind of person who pays you gain $9900, if you aren't you gain $100. This is guaranteed utility, not expected utility.

The fact that it is "guaranteed" utility doesn't make a significant difference: my analysis still applies. After you know the outcome, you can avoid paying in that case and get 10000 instead of 9900 (second intuition).

"After you know the outcome, you can avoid paying in that case and get 10000 instead of 9900 (second intuition)" - No you can't. The only way to get 10,000 is to pay if the coin comes up the opposite way it comes up. And that's only a 50/50 chance.

If the DM knows the outcome is heads, why can't he not pay in that case and decide to pay in the other case? In other words: why can't he adopt the policy (not pay when heads; pay when tails), which leads to 10000?

If you pre-commit to that strategy (heads don't pay, tails pay) it provides 10000, but it only works half the time.

If you decide that after you see the coin, not to pay in that case, then this will lead to the strategy (not pay, not pay) which provides 0.

It seems you are arguing for the position that I called "the first intuition" in my post. Before knowing the outcome, the best you can do is (pay, pay), because that leads to 9900.

On the other hand, as in standard counterfactual mugging, you could be asked: "You know that, this time, the coin came up tails. What do you do?". And here the second intuition applies: the DM can decide to not pay (in this case) and to pay when heads. Omega recognises the intent of the DM, and gives 10000.

Maybe you are not even considering the second intuition because you take for granted that the agent has to decide one policy "at the beginning" and stick to it, or, as you wrote, "pre-commit". One of the points of the post is that it is unclear where this assumption comes from, and what it exactly means. It's possible that my reasoning in the post was not clear, but I think that if you reread the analysis you will see the situation from both viewpoints.

I am considering the second intuition. Acting according to it results in you receiving $0 in Counterfactual Prisoner's Dilemma, instead of losing $100. This is because if you act updatefully when it comes up heads, you have to also act updatefully when it comes up tails. If this still doesn't make sense, I'd encourage you to reread the post.

Omega, a perfect predictor, flips a coin. If it comes up heads, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up tails and you were told it was tails. If it comes up tails, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up heads and you were told it was heads.

Here there is no question, so I assume it is something like: "What do you do?" or "What is your policy?"

That formulation is analogous to standard counterfactual mugging, stated in this way:

Omega flips a coin. If it comes up heads, Omega will give you 10000 in case you would pay 100 when tails. If it comes up tails, Omega will ask you to pay 100. What do you do?

According to these two formulations, the correct answer seems to be the one corresponding to the first intuition.

Now consider instead this formulation of counterfactual PD:

Omega, a perfect predictor, tells you that it has flipped a coin, and it has come up heads. Omega asks you to pay 100 (here and now) and gives you 10000 (here and now) if you would pay in case the coin landed tails. Omega also explains that, if the coin had come up tails—but note that it hasn't—Omega would tell you such and such (symmetrical situation). What do you do?

The answer of the second intuition would be: I refuse to pay here and now, and I would have paid in case the coin had come up tails. I get 10000.

And this formulation of counterfactual PD is analogous to this formulation of counterfactual mugging, where the second intuition refuses to pay.

Is your opinion that

The answer of the second intuition would be: I refuse to pay here and now, and I would have paid in case the coin had come up tails. I get 10000.

is false/not admissible/impossible? Or are you saying something else entirely? In any case, if you could motivate your opinion, whatever that is, you would help me understand. Thanks!

To be honest, this thread has gone on long enough that I think we should end it here. It seems to me that you are quite confused about this whole issue, though I guess from your perspective it seems like I am the one who is confused. I considered asking a third person to try looking at this thread, but I decided it wasn't worth calling in a favour.

I made a slight edit to my description of Counterfactual Prisoner's Dilemma, but I don't think this will really help you understand:

Omega, a perfect predictor, flips a coin and tells you how it came up. If it comes up heads, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up tails. If it comes up tails, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up heads. In this case it was heads.

Ok, if you want to clarify—I'd like to—we can have a call, or discuss in other ways. I'll contact you somewhere else.