
I was attempting to write a reference post on the concept of comparability in decision theory problems, but I realised that I don't yet have a strong enough grasp of the various positions one could adopt to write a post worthy of being a reference. I'll quote my draft quite liberally below:

In the context of decision theory, comparability is about whether or not it is fair to compare counterfactuals when evaluating decisions, a decision algorithm or a decision theory. Perhaps the best way to illustrate this is with the example of a medical trial. Let's suppose we're trying to see if aspirin reduces the amount of pain experienced. If we create two groups, give aspirin to one, and then observe that this group experienced less pain, that is evidence that aspirin does what we want. However, if the aspirin group was healthy and most of the other group had cancer, then this wouldn't be a fair test. We would be treating two groups as comparable when they differed in an attribute relevant to the outcome we cared about.
Given a decision problem, we normally apply a decision theory to construct counterfactuals, then calculate the utility for each and finally make a decision. Prima facie, it appears that these counterfactuals must be comparable in order for a decision theory to count as being reasonable. Otherwise, we would open ourselves to the same critique as the naive researcher in the aspirin example.
We can clarify this with an example. Causal decision theorists recommend 2-boxing for Newcomb's Problem. They admit that a 1-boxer will receive an extra $1 million, but they would likely argue that this isn't a fair comparison since the opaque box contains $1 million for the 1-boxer, but not for the 2-boxer. This is essentially a dispute over whether these counterfactuals are comparable.
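
To make the dispute concrete, here is a minimal Python sketch of the two ways of constructing the counterfactuals. The payoffs are the usual Newcomb ones ($1,000,000 in the opaque box, $1,000 in the transparent one); the predictor accuracy of 0.99 and all function names are assumptions made purely for illustration.

```python
# Illustrative sketch: two ways of building counterfactuals for Newcomb's Problem.
# ACCURACY and every name below are assumptions for illustration only.

BIG = 1_000_000   # opaque box contents if 1-boxing was predicted
SMALL = 1_000     # transparent box contents
ACCURACY = 0.99   # assumed probability that the predictor is right


def payoff(action: str, opaque_full: bool) -> int:
    """Money received for an action, given the state of the opaque box."""
    total = BIG if opaque_full else 0
    if action == "two-box":
        total += SMALL
    return total


def fixed_past_value(action: str, p_full: float) -> float:
    """CDT-style: the box contents are held fixed across both actions."""
    return p_full * payoff(action, True) + (1 - p_full) * payoff(action, False)


def correlated_value(action: str) -> float:
    """1-boxer-style: the box contents vary with the action taken."""
    p_full = ACCURACY if action == "one-box" else 1 - ACCURACY
    return fixed_past_value(action, p_full)


for p_full in (0.0, 0.5, 1.0):
    gap = fixed_past_value("two-box", p_full) - fixed_past_value("one-box", p_full)
    print(f"fixed past, P(full)={p_full}: two-boxing gains {gap:.0f}")

for action in ("one-box", "two-box"):
    print(f"correlated past, {action}: {correlated_value(action):.0f}")
```

Under the fixed-past construction, 2-boxing comes out ahead by the small amount whatever the opaque box contains; under the correlated construction, 1-boxing comes out far ahead. The arithmetic is not in dispute. What is in dispute is which pair of counterfactuals counts as a fair comparison.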

Why is this important?

Well, as far as I can tell attempts to understand counterfactuals have taken us to logical counterfactuals at which point we've become stuck. Asking an easier question could help us to become unstuck. And determining if counterfactuals are comparable seems easier than saying what counterfactuals are. Indeed, I would go so far as to say that if we don't know what we mean by comparability then we don't fully even know what we are looking for.

Given this, I find it strange that this notion hasn't been discussed to any significant degree at all on Less Wrong as far as I can tell, although I haven't performed an in-depth search.

But before we go any further, it's worth asking what objections could be made to this approach. I'll quote my draft again:

Firstly, comparability only makes sense if the notion of counterfactuals makes sense. If they don't exist, then we would have to abandon the quest.
Secondly, we could admit counterfactuals, but deny comparability. Why might we do this? Firstly, the requirement for the past to be comparable regardless of our action seems to assume that our action shouldn't affect the past. But if there wasn't any fundamental difference between forwards and backwards causation, this assumption would seem unsupported. Secondly, we might think that only the partial information we have before we are told our action must be the same and that there is no requirement for the counterfactuals to actually be comparable. Evidential decision theory could be justified on these terms.
Thirdly, we could argue for a notion of comparability that can be trivially satisfied. Causal decision theory leaves the past unchanged and only intervenes at the point of the decision. So these kinds of counterfactuals are always trivially comparable, as it seems reasonable to presume that comparability only depends on the past and that identical pasts are automatically comparable. Note that it might be possible to argue for causal decision theory within the comparability framework. If only exactly identical pasts were comparable, then that would exclude almost every theory except for CDT.
Fourthly, we could argue that there are many different notions of comparability, so the question, "What does it mean for counterfactuals to be comparable?" is meaningless without further information about the purpose we are asking.

Three questions:

I'll finish this post with three questions designed to help clarify the notion of comparability. If you have time, I'd really appreciate it if you thought about the questions before writing your own answers, as that'd likely increase the diversity of responses.

1) Suppose you have the option to choose one of two boxes: the first containing an item worth 5 utility and the second containing an item worth 10 utility. Almost no-one would dispute that these counterfactuals are comparable, but why?

2) As discussed above, a causal decision theorist would likely argue that the counterfactuals constructed by a timeless decision theorist aren't comparable because the 1-boxer has $1 million in the mystery box, while the 2-boxer's box is empty. Most people on Less Wrong think that the causal decision theorist is wrong. How can we respond to this claim? Does it satisfy another notion of comparability or does this show the notion of comparability is irrelevant?

3) An evidential decision theorist wouldn't smoke in the Smoking Lesion problem so they don't get cancer. Most people argue that they are incorrect because when evaluating smoking we can't compare a group of people predisposed to cancer to a normal group of people. Is this correct? And if so, does this mean the 2-boxer is correct in rejecting timeless decision theory counterfactuals as non-comparable? (There have been criticisms of the Smoking Lesion problem, but I think we could just make the same argument with Counterfactual Blackmail instead.)

This post was supported by the AI Safety Research Program and was influenced by discussion with Davide Zagami and Pablo Moreno although the opinions expressed here are my own. It is an extension of work performed at the EA Hotel.


My impression is that logical counterfactuals, counterfactuals, and comparability are, at the moment, too confused, and most disagreements here are "merely verbal" ones. Most of your questions (seem to me to) point in the direction of different people using different definitions. I feel slightly worried about going too deep into discussions along the lines of "Vojta reacts to Chris' claims about what other LW people argue against hypothetical 1-boxing CDT researchers from classical academia that they haven't met" :D.

My take on how to do counterfactuals correctly is that this is not a property of the world, but of your mental models:

Definition (comparability according to Vojta): Two scenarios are comparable (given model M and observation sequence o) if they are both possible in M and consistent with o.
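
As a rough illustration of this definition, here is a small Python sketch in which a model is simply a list of the worlds it considers possible and an observation sequence is a partial description of a world. The dictionary representation, the two-box example and all names are assumptions made for illustration, not a claim about how such models should really be represented.

```python
# Illustrative sketch of the definition above; the representation is an assumption.
from typing import Dict, List

World = Dict[str, object]        # a full assignment of values to variables
Observation = Dict[str, object]  # a partial assignment (what has been seen)


def consistent(world: World, obs: Observation) -> bool:
    """A world is consistent with obs if it agrees on every observed variable."""
    return all(world.get(key) == value for key, value in obs.items())


def possible(scenario: Observation, model: List[World], obs: Observation) -> bool:
    """A scenario is possible if some world in the model matches both it and obs."""
    return any(consistent(w, scenario) and consistent(w, obs) for w in model)


def comparable(a: Observation, b: Observation,
               model: List[World], obs: Observation) -> bool:
    """Two scenarios are comparable if both are possible in the model given obs."""
    return possible(a, model, obs) and possible(b, model, obs)


# A model that is uncertain about which of two boxes the agent picks.
model = [
    {"boxes_seen": True, "choice": "box_5", "utility": 5},
    {"boxes_seen": True, "choice": "box_10", "utility": 10},
]
obs = {"boxes_seen": True}

print(comparable({"choice": "box_5"}, {"choice": "box_10"}, model, obs))  # True

# A model that is certain the agent maximises utility contains no box_5 world.
maximiser_model = [w for w in model if w["choice"] == "box_10"]
print(comparable({"choice": "box_5"}, {"choice": "box_10"}, maximiser_model, obs))  # False
```

The second call returns False only because the stricter model has no uncertainty left about the choice, which is the sense in which comparability here is a property of the model rather than of the world.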

According to this view, counterfactuals only make sense if your model contains uncertainty...

(Aside on logical counterfactuals: Note that there is a difference between the model that I use and the hypothetical models I would be able to infer were I to use all my knowledge. Indeed, I can happily reason about the 6th digit of π being 7, since I don't know what it is, despite knowing the formula for calculating π. I would only get into trouble if I were to do the calculations (and process their implications for the real world). Updating your models with new logical information seems like an important problem, but one I think is independent from counterfactual reasoning.)

...however, there remains the fact that humans do counterfactual reasoning all the time, even about impossible things ("What if I decided to not write this comment?", "What if the Sun revolved around the Earth?"). I think this is consistent with the above definition, for three reasons. First, the models that humans use are complicated, fragmented, incomplete, and wrong. So much so that positing logical impossibilities (the Sun going around the Earth thing) doesn't make the model inconsistent (because it is so fragmented and incomplete). Second, when doing counterfactuals, we might take it for granted that you are to replace the actual observation history o by some alternative o′. So you then apply the above definition to M and o′ (e.g., me not starting to write this comment). When o′ is compatible with the model we use, everything is logically consistent (in M). For example, it might actually be impossible for me to not have started writing this comment, but it was perfectly consistent with my (wrong) model. Finally, when some counterfactual would be inconsistent with our model, we might take it for granted that we are supposed to relax M in some manner. Moreover, people might often implicitly assume the same or a similar relaxation. For example, suppose I know that the month of May has 31 days. The natural relaxation is to be uncertain about month lengths while still remembering that they are somewhere between 28 and 31 days. I might thus say that 30 was a perfectly reasonable length, while being indignant upon being asked to consider a May that is 370 days long.
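
The month-length example can be sketched in a few lines of Python. Here a model is reduced to the set of values it considers possible for the length of May, and the relaxation rule (forget the exact value but keep the remembered bounds) is an assumption made for illustration, as are all the names.

```python
# Illustrative sketch: relaxing a model so that a counterfactual becomes consistent.
# The relaxation rule (fall back to a remembered range) is an assumption.

def admits(model: set, value: int) -> bool:
    """A counterfactual value is admissible if the model considers it possible."""
    return value in model


def relax(model: set, remembered_range: range) -> set:
    """Forget the exact value but keep the remembered bounds."""
    return model | set(remembered_range)


exact_model = {31}                                  # "I know May has 31 days"
relaxed_model = relax(exact_model, range(28, 32))   # month lengths are 28..31

print(admits(exact_model, 30))     # False
print(admits(relaxed_model, 30))   # True
print(admits(relaxed_model, 370))  # False
```

A 30-day May becomes admissible after the relaxation, while a 370-day May stays outside the model, matching the intuition that some counterfactuals feel reasonable and others do not.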

As for the implications for your question: The phrasing of 1) seems to suggest a model that has uncertainty about your decision procedure. Thus picking both 10 and 5 seems possible (and consistent with observation history of seeing the two boxes), and thus comparable. Note that this would seem fishier if you additionally posited that you are a utility maximizer (but, I argue, most people would implicitly relax this assumption if you asked them to consider the 5 counterfactual). Regarding 2) I think that "a typical AF reader" uses a model in which "a typical CDT adherent" can deliberate, come to the one-boxing conclusion, and find 1M in the box, making the options comparable for "typical AF readers". I think that "a typical CDT adherent" uses a model in which "CDT adherents" find the box empty while one-boxers find it full, thus making the options incomparable. The third question I didn't understand.

Disclaimer: I haven't been keeping up to date on discussions regarding these matters, so it might be that what I write has some obvious and known holes in it...

Hey Vojta, thanks so much for your thoughts.

I feel slightly worried about going too deep into discussions along the lines of "Vojta reacts to Chris' claims about what other LW people argue against hypothetical 1-boxing CDT researchers from classical academia that they haven't met" :D.

Fair enough. Especially since this post isn't so much about the way people currently frame their arguments but an attempt to persuade people to reframe the discussion around comparability.

My take on how to do counterfactuals correctly is that this is not a property of the world, but of your mental models

I feel similarly. I've explained my reasons for believing this in the Co-operation Game, Counterfactuals are an Answer, not a Question and Counterfactuals as a matter of Social Convention.

According to this view, counterfactuals only make sense if your model contains uncertainty...

I would frame this slightly differently and say that this is the paradigmatic case which forms the basis of our initial definition. I think the example of numbers can be instructive here. The first numbers to be defined are the counting numbers: 1, 2, 3, 4... It is then convenient to add fractions, then zero, then negative numbers and eventually we extend to the complex numbers. In each case we've slightly shifted the definition of what a number is and this choice is solely determined by convention. Of course, convention isn't arbitrary, but determined by what is natural.

Similarly, the cases where there is actual uncertainty provide the initial domain over which we define counterfactuals. And we can then try to extend this as you are doing above. I see this as a very promising approach.

A lot of what you are saying there aligns with my most recent research direction (Counterfactuals as a matter of Social Convention), although it's unfortunately stalled with coronavirus and my focus being mostly on attempting to write up my ideas from the AI safety program. There seem to be a bunch of properties that make a situation more or less likely to be accepted by humans as a valid counterfactual. I think it would be viable to identify the main factors, with the actual weighting being decided by each human. This would acknowledge both the subjective, constructed nature of counterfactuals and the objective elements with real implications that keep this from being a completely arbitrary choice. I would be keen to discuss further/bounce ideas off each other if you'd be up for it.

Finally, when some counterfactual would be inconsistent with our model, we might take it for granted that we are supposed to relax M in some manner

This sounds very similar to the erasure approach I was previously promoting, but have shifted away from. Basically, when I started thinking about it, I realised that only allowing counterfactuals to be constructed by erasing information didn't match how humans actually use counterfactuals.

Second, when doing counterfactuals, we might take it for granted that you are to replace the actual observation history o by some alternative o′

This is much more relevant to how I think now.

I think that "a typical AF reader" uses a model in which "a typical CDT adherent" can deliberate, come to the one-boxing conclusion, and find 1M in the box, making the options comparable for "typical AF readers". I think that "a typical CDT adherent" uses a model in which "CDT adherents" find the box empty while one-boxers find it full, thus making the options incomparable

I think that's an accurate framing of where they are coming from.

The third question I didn't understand.

What was unclear? I made one typo where I said an EDT agent would smoke when I meant they wouldn't smoke. Is it clearer now?

An evidential decision theorist would smoke in the Smoking Lesion problem so they don't get cancer.

Is this possibly a typo, and should it instead say that EDT would not smoke? (I never seem to remember the details of Smoking Lesion, but this seems inconsistent with the "so they don't get cancer".)

Yeah, sorry, that's a typo, fixed now.