AI ALIGNMENT FORUM

Chris_Leong

Applying the Counterfactual Prisoner's Dilemma to Logical Uncertainty

I'm curious, do you find this argument for paying in Logical Counterfactual Mugging persuasive? What about the Counterfactual Prisoner's Dilemma argument for the basic Counterfactual Mugging?

Another approach is to change the example to remove the objection

Interesting point about the poker game version. It's still a one shot game, so there's no real reason to hide a 0 unless you think they're a pretty powerful predictor, but it is always predicting something coherent.

I don't see how you're applying CPD to LU

The claim is that you should pay in the Logical Counterfactual Prisoner's Dilemma and hence pay in Logical Counterfactual Mugging, which is the logically uncertain version of Counterfactual Mugging.

Symmetric? The original is already symmetric. But "symmetric" is a concept which applies to multi-player games. Counterfactual PD makes PD into a one-player game. Presumably you meant "a one-player version"?

Edited now. I meant it's a symmetric version of counterfactual mugging. So not in the game theory sense, but just that there is now no difference between heads and tails.

This is only true if you use classical CDT, yeah? Whereas EDT can get $9900 in both cases, provided it believes in a sufficient correlation between what it does upon seeing heads vs tails.

Point noted. Maybe I should have been more careful about specifying what I was comparing.

Also, why is this posted as a question?

Accident. It's fixed now.

Decision Theory is multifaceted

To be honest, this thread has gone on long enough that I think we should end it here. It seems to me that you are quite confused about this whole issue, though I guess from your perspective it seems like I am the one who is confused. I considered asking a third person to try looking at this thread, but I decided it wasn't worth calling in a favour.

I made a slight edit to my description of Counterfactual Prisoner's Dilemma, but I don't think this will really help you understand:

Omega, a perfect predictor, flips a coin and tells you how it came up. If it comes up heads, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up tails. If it comes up tails, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up heads. In this case it was heads.

Decision Theory is multifaceted

I am considering the second intuition. Acting according to it results in you receiving $0 in Counterfactual Prisoner's Dilemma, instead of losing $100. This is because if you act updatefully when it comes up heads, you have to also act updatefully when it comes up tails. If this still doesn't make sense, I'd encourage you to reread the post.

Decision Theory is multifaceted

If you pre-commit to that strategy (heads don't pay, tails pay) it provides 10000, but it only works half the time. If you instead decide, after you see the coin, not to pay in that case, then this will lead to the strategy (not pay, not pay), which provides 0.

Decision Theory is multifaceted

"After you know the outcome, you can avoid paying in that case and get 10000 instead of 9900 (second intuition)" - No you can't. The only way to get 10,000 is to pay if the coin comes up the opposite way to how it actually comes up. And that's only a 50/50 chance.

Decision Theory is multifaceted

Well, you can only predict conditional on what you write, you can't predict unconditionally. However, once you've fixed what you'll write in order to make a prediction, you can't then change what you'll write in response to that prediction.

Actually, it isn't about utility in expectation. If you are the kind of person who pays you gain $9900, if you aren't you gain $100. This is guaranteed utility, not expected utility.
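To make the payoffs in the Counterfactual Prisoner's Dilemma concrete, here is a minimal Python sketch that enumerates all four policies. It assumes a perfect predictor, the $100 and $10,000 figures from the description above, and that a policy is fully described by whether you pay on heads and whether you pay on tails. The names `payoff` and `payoff_table` are just illustrative.

```python
from itertools import product

COST = 100       # amount Omega asks for
PRIZE = 10_000   # amount Omega pays if it predicts you pay in the other branch


def payoff(policy, coin):
    """Return your payoff when `coin` is the actual flip.

    `policy` maps "heads"/"tails" to True (pay) or False (refuse).
    Assumption: Omega predicts the policy perfectly.
    """
    other = "tails" if coin == "heads" else "heads"
    cost = COST if policy[coin] else 0      # what you hand over in the actual branch
    prize = PRIZE if policy[other] else 0   # reward depends on the counterfactual branch
    return prize - cost


def payoff_table():
    """Map (pay_on_heads, pay_on_tails) to (payoff if heads, payoff if tails)."""
    table = {}
    for pay_heads, pay_tails in product([True, False], repeat=2):
        policy = {"heads": pay_heads, "tails": pay_tails}
        table[(pay_heads, pay_tails)] = (
            payoff(policy, "heads"),
            payoff(policy, "tails"),
        )
    return table


if __name__ == "__main__":
    for (pay_heads, pay_tails), (on_heads, on_tails) in payoff_table().items():
        print(f"pay on heads={pay_heads}, pay on tails={pay_tails}: "
              f"heads -> {on_heads}, tails -> {on_tails}")
```

Under these assumptions, always paying yields 9900 whichever way the coin lands, never paying yields 0 either way, and the mixed policies yield 10000 in one branch but -100 in the other.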

Decision Theory is multifaceted

Hey Michael, I agree that it is important to look very closely at problems like Counterfactual Mugging and not accept solutions that involve handwaving.

Suppose the predictor knows that if it writes M on the paper you'll choose N and if it writes N on the paper you'll choose M. Further, if it writes nothing you'll choose M. That isn't a problem since regardless of what it writes it would have predicted your choice correctly. It just can't write down the choice without making you choose the opposite.
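A small Python sketch of this situation, under the assumption that the agent is a simple deterministic function of what is written on the paper (the `agent` function and the use of `None` for a blank paper are hypothetical modelling choices):

```python
def agent(written):
    """Hypothetical contrarian agent: chooses the opposite of what is written.

    `written` is "M", "N", or None (nothing written on the paper).
    """
    if written == "M":
        return "N"
    return "M"  # covers both "N" and a blank paper


def conditional_predictions(messages=("M", "N", None)):
    """What the predictor knows: the agent's choice conditional on each message."""
    return {m: agent(m) for m in messages}


def self_fulfilling(messages=("M", "N", None)):
    """Messages that, once written down, would match the agent's actual choice."""
    return [m for m in messages if m is not None and agent(m) == m]


if __name__ == "__main__":
    print(conditional_predictions())
    print(self_fulfilling())
```

In this toy model the predictor's conditional predictions are all correct, yet there is no message it can write that ends up matching the choice.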

I was quite skeptical of paying in Counterfactual Mugging until I discovered the Counterfactual Prisoner's Dilemma which addresses the problem of why you should care about counterfactuals given that they aren't factual by definition.

Ideally you'd start doing something like UDT from the beginning of time, but since humans don't know UDT when they are born, you'd have to adjust it to take this into account by treating these initial decisions as independent of your UDT policy.

A few thoughts:

• Even if we could theoretically double output for a product, it doesn't mean that there will be sufficient demand for it to be doubled. This potential depends on how much of the population already has thing X
• Even if we could effectively double our workforce, if we are mostly replacing low-value jobs, then our economy wouldn't double
• Even if we could say halve the cost of producing robot workers, that might simply result in extra profits for a company instead of increasing the size of the economy
• Even if we have a technology that could double global output, it doesn't mean that we could or would deploy it in that time, especially given that companies are likely to be somewhat risk averse and may not scale up as fast as possible because they are worried about demand. This is the weakest of the four arguments in my opinion, which is why it is last.

So economic progress may not accurately represent technological progress, meaning that if we use this framing we may get caught up in a bunch of economic debates instead of debates about capacity.

What makes counterfactuals comparable?

Yeah, sorry, that's a typo, fixed now.

What makes counterfactuals comparable?

Hey Vojta, thanks so much for your thoughts.

I feel slightly worried about going too deep into discussions along the lines of "Vojta reacts to Chris' claims about what other LW people argue against hypothetical 1-boxing CDT researchers from classical academia that they haven't met" :D.

Fair enough. Especially since this post isn't so much about the way people currently frame their arguments as an attempt to persuade people to reframe the discussion around comparability.

My take on how to do counterfactuals correctly is that this is not a property of the world, but of your mental models

I feel similarly. I've explained my reasons for believing this in "the Co-operation Game", "Counterfactuals are an Answer, not a Question" and "Counterfactuals as a matter of Social Convention".

According to this view, counterfactuals only make sense if your model contains uncertainty...

I would frame this slightly differently and say that this is the paradigmatic case which forms the basis of our initial definition. I think the example of numbers can be constructive here. The first numbers to be defined are the counting numbers: 1, 2, 3, 4... It is then convenient to add fractions, then zero, then negative numbers and eventually we extend to the complex numbers. In each case we've slightly shifted the definition of what a number is and this choice is solely determined by convention. Of course, convention isn't arbitrary, but determined by what is natural.

Similarly, the cases where there is actual uncertainty provide the initial domain over which we define counterfactuals. And we can then try to extend this as you are doing above. I see this as a very promising approach.

A lot of what you are saying there aligns with my most recent research direction (Counterfactuals as a matter of Social Convention), although it's unfortunately stalled with coronavirus and my focus being mostly on attempting to write up my ideas from the AI safety program. There seem to be a bunch of properties that make a situation more or less likely to be accepted by humans as a valid counterfactual. I think it would be viable to identify the main factors, with the actual weighting being decided by each human. This would acknowledge both the subjective, constructed nature of counterfactuals and the objective elements with real implications that keep this from being a completely arbitrary choice. I would be keen to discuss further/bounce ideas off each other if you'd be up for it.

Finally, when some counterfactual would be inconsistent with our model, we might take it for granted that we are supposed to relax M in some manner

This sounds very similar to the erasure approach I was previously promoting, but have shifted away from. Basically, when I started thinking about it, I realised that only allowing counterfactuals to be constructed by erasing information didn't match how humans actually use counterfactuals.

Second, when doing counterfactuals, we might take it for granted that you are to replace the actual observation history o by some alternative o′

This is much more relevant to how I think now.

I think that "a typical AF reader" uses a model in which "a typical CDT adherent" can deliberate, come to the one-boxing conclusion, and find 1M in the box, making the options comparable for "typical AF readers". I think that "a typical CDT adherent" uses a model in which "CDT adherents" find the box empty while one-boxers find it full, thus making the options incomparable

I think that's an accurate framing of where they are coming from.

The third question I didn't understand.

What was unclear? I made one typo where I said an EDT agent would smoke when I meant they wouldn't smoke. Is it clearer now?