AI ALIGNMENT FORUM

The issue with you-in-all-detail vs. your-decision-algorithm is that a decision algorithm can have different levels of updatelessness: it's unclear what the decision algorithm already knows vs. what a policy it chooses takes as input. So we pick some intermediate level that is updateless enough to allow acausal coordination among relevant entities (agents/predictors), and updateful enough to make a decision without running out of time/memory while being implemented in its instances. But that level/scope is different for different collections of entities being coordinated.

So I think a boundary shouldn't be drawn around "a decision algorithm", but around whatever common knowledge of each other the entities being acausally coordinated happen to have (where they don't need to have common knowledge of everything). When packaged as a decision algorithm, the common knowledge becomes an adjudicator, to which these entities can grant influence over their actions. To the extent the influence they allow an adjudicator is common knowledge among them, it also becomes knowledge of the adjudicator, available for its decision-making reasoning.

Importantly for the reframing, an adjudicator is not a decision algorithm belonging to either agent individually, it's instead a shared decision algorithm. It's a single decision algorithm purposefully built out of the agents' common knowledge of each other, rather than a collection of their decision algorithms that luckily happen to have common knowledge of each other. It's much easier for there to be some common knowledge than for there to be common knowledge of individually predefined decision algorithms that each agent follows.

Who am I?

I think a bunch of your sense of oddness about the "magic" that "you can write on whiteboards light-years away" is stemming from a faulty framing you have. In particular, the part where the word "you" points to a single physical instantiation of your algorithm in the universe. I'd say: insofar as your algorithm is multiply instantiated throughout the universe, there is no additional fact about which one is really you.

For analogy, consider tossing a coin in a quantum-mechanical universe, and covering it with your hand. The coin is superpositioned between heads and tails, and once you look at it, you'll decohere into Joe-who-saw-heads and Joe-who-saw-tails, both of whom stem from Joe-who-hasn't-looked-yet. So, before you look, are you Joe-who-saw-heads or Joe-who-saw-tails?

Wrong question! These two entities have not yet diverged; the pasts of those two separate entities coincide. The word "you", at the time before you split, refers to ~one configuration. The time-evolution splits the amplitude on that configuration between ~two distinct future configurations, and once they've split (by making different observations), each will be able to say "me" in a way that refers to them and not the other, but before the split there is no distinction to be made, no extra physical fact, and no real question as to whether pre-split Joe "is" Joe-who-will-see-heads versus Joe-who-will-see-tails.

(It's also maybe informative to imagine what happens if the quantum coin is biased. I'd say, even when the coin is 99.99999% biased towards heads, it's still the case that there isn't a real question about whether Joe-who-has-not-looked-at-the-coin is Joe-who-will-see-heads versus Joe-who-will-see-tails. There is a question of to what degree Joe-who-has-not-looked becomes Joe-who-saw-heads versus Joe-who-saw-tails, but that's a different sort of question.)

One of my most-confident guesses about anthropics is that being multiply-instantiated in other ways is analogous. For instance, if there are two identical physical copies of you (in physical rooms that are identical enough that you're going to make the same observations for the length of the hypothetical, etc.), then my guess is that there isn't a real question about which one is you. They are both you. You are the pattern, not the meat.

This person may become multiple people in the future, insofar as they see different things in different places-that-embed-them. But before the differing observations come in, they're both you. You can tell because the situation is symmetric: once you know all the physical facts, there's no additional bit telling you which one is "you".

From this perspective, the "magic" is much less mysterious: whenever you are multiply-instantiated, your actions are also multiply-instantiated. If you're multiply-instantiated in two places separated by a 10-light-year gap, then when you act, the two meat-bodies move in the same way on each side of the gap. This is all much less surprising once you acknowledge that "you" refers to everything that instantiates you(-who-have-seen-what-you-have-seen). Which, notably, is a viewpoint more-or-less forced upon us by quantum mechanics anyway.

Also, a subtlety: literal multiple-instantiation of your entire mind (in a place with sufficiently similar physics) is what you need to get "You can draw a demon kitten eating a windmill. You can scream, and dance, and wave your arms around, however you damn well please. Feel the wind on your face, cowboy: this is liberty. And yet, he will do the same." But it's much easier to find other creatures that make the same choice in a limited decision problem, but that won't draw the same demon kitten.

In particular, the thing you need for rational cooperation in a one-shot prisoner's dilemma, is multiple instantiation of your decision algorithm, which is notably smaller than your entire mind. Imagining multiple-instantiation of your entire mind is a fine intuition-pump, but the sort of multiple-instantiation humans find in real life is just of the decision-making fragment (which is enough).

Corollary: To a first approximation, the answer to "Can you control the past?" is "Well, you can be multiply instantiated at different points in time, and control the regions afterwards of the places you’re instantiated, and it’s possible for some of those to be beforewards of other places you’re instantiated. But you can’t control anything beforewards of your earliest instantiation."

To a second approximation, the above is true not only of you (in all your detailed glory, having learned everything you've learned and seen everything you've seen), but of your decision algorithm — a much smaller fragment of you, that is instantiated much more often, and thus can readily affect regions beforewards of the earliest instantiation of you-in-all-your-glory. This is what’s going on in the version of Newcomb’s problem, for instance, where Omega doesn’t simulate you in all your glory, but does reason accurately about the result of your decision algorithm (thereby instantiating it in the relevant sense).

More generally, I think it's worth distinguishing you from your decision algorithm. You can let your full self bleed into your decision-making fragment, by feeling the wind on your face and using specifics of your recent train-of-thought to determine what you draw. Or you can prevent your full self from bleeding into your decision-making fragment, by boiling the problem before you down into a simple and abstract decision problem.

Consider Omega's little sister Omicron, who can't figure out what you'll draw, but has no problem figuring out whether you'll one-box. You-who-have-felt-the-wind-on-your-face are not instantiated in the past, but your decision algorithm on a simple problem could well be. It's the latter that controls things that are beforewards of you (but afterwards of Omicron).

I personally don't think I (Nate-in-all-his-glory) can personally control the past. I think that my decision-procedure can control the future laid out before each and every one of its instantiations.

Is the box in Newcomb's problem full because I one-box? Well, it's full because The Algorithm one-boxes, and I'm a full-ass person wrapped around The Algorithm, but I'm not the instance of The Algorithm that Omicron was looking at, so it seems a bit weird to blame it on me. Like how when you use a calculator to check whether 7 divides 1331 and use that knowledge to decide how to make a bet, and then later I use a different calculator to see whether 1331 is prime in a way that includes (as an intermediate step) checking whether 7 divides it, it's a bit weird to say that my longer calculation was the cause of your bet.
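The calculator analogy can be sketched in a few lines of Python. This is only an illustration; the function names `divides` and `is_prime` are hypothetical, and trial division is an assumed stand-in for whatever the second calculator does. The point is that the short check is an intermediate step inside the longer calculation, and it gives the same answer wherever it is run.

```python
def divides(d, n):
    """The short calculation: does d divide n?"""
    return n % d == 0

def is_prime(n):
    """The longer calculation: trial division, which runs the same
    `divides` check (including d = 7) as an intermediate step."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if divides(d, n):
            return False
        d += 1
    return True

# Your calculator: the short check that informs your bet.
print(divides(7, 1331))   # False
# My calculator: the longer check, which passes through the same step.
print(is_prime(1331))     # False, since 1331 = 11 * 11 * 11
```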

I'm a longer calculation than The Algorithm. It wasn't me who controlled the past, it was The Algorithm Omega looked at, and that I follow.

If you ever manage to get two copies of me (the cowboy who feels the wind on his face) at different times, then in that case I'll say that I (who am both copies) control the earlier-copy's future and the later-copy's past (necessarily in ways that the later copy has not yet observed, for otherwise we are not true copies). Till then, it is merely the past instances of my decision algorithm that control my past, not me.

(Which doesn't mean that I can choose something other than what my decision algorithm selects in any given case, thereby throwing off the yoke; that's crazytalk; if you think you can throw off the yoke of your own decision algorithm then you've failed to correctly identify the fragment of you that makes decisions.)

Joe Carlsmith:

You-who-have-felt-the-wind-on-your-face are not instantiated in the past, but your decision algorithm on a simple problem could well be. It's the latter that controls things that are beforewards of you (but afterwards of Omicron).

I currently expect this part to continue to feel kind of magical to me, due to my identification with the full-on self. E.g., if my decision algorithm is instantiated 10 lightyears away in a squid-person, it will feel like "I" can control "something else" very far away.

Nate Soares: If you were facing me in a game that turns out (after some simple arithmetic) to be isomorphic to a stag hunt, would you feel like you can control my action, despite me being on the other side of the room?

(What I'd say is that we both notice that the game is a stag hunt, and then do the same utility calculation + a bit of reasoning about the other player, and come to the same conclusion, and those calculations control both our actions, but neither of us controls the other player.)

(You can tell this in part from how our actions would not be synchronized in any choice that turns on a bunch of the extra details of Joe that Nate lacks. Like, if we both need to draw a picture that would make a child laugh, and we get an extra bonus from the pictures having identical content, then we might aim for Schelling drawings, but it's not going to work, because it was the simple stag-hunt calculation that was controlling both our actions, rather than all-that-is-Joe.)

(This is part of why I'd say, if your decision algorithm is instantiated 10 light-years away in a squid person, then you don't control them; rather, your shared decision algorithm governs the both of you. The only cases where you (in all your detailed glory) control multiple distant things are cases where exact copies of your brain occur multiple times, in which case it's not that one of you can control things 10ly away, it's that the term 'you' refers to multiple locations simultaneously.)

(Of course, this could just ground out into a question of how we define 'you'. In which case I'd be happy to fall back to first (a) claiming that there's a concept 'you' for which the above makes sense, and then separately (b) arguing that this is the correct way to rescue the English word "you" in light of multiple instantiation.)

Joe Carlsmith: Cool, the stag-hunt example is useful for giving me a sense of where you’re coming from. I can still imagine the sense that “if I hunt hare, the suitably-good-predictor of me will probably hunt hare too; and if I hunt stag, they will probably hunt stag too” giving me a sense of control over what they do, but it feels like we’ll quickly just run into debates about the best way to talk; your way seems coherent, and I’m not super attached to which is preferable from a “rescue” perspective.

Nate Soares: My reply: if a predictor is looking at you and copying your answer, then yes, you control them. But it's worth distinguishing between predictors that look at the-simple-shard-of-you-that-utility-maximizes-in-simple-games and you-in-all-your-detailed-glory. Like, in real life, it's much more common to find a predictor that can tell you'll go for a stag, than a predictor that can predict which drawing you'll make. And saying that 'you' control the former has some misleading implications, that are clarified away by specifying that the simple rules of decisionmaking are embedded in you and are all that the predictor needs to look at (in the former case) to get the right answer.

(We may already agree on these points, but also you might appreciate hearing my phrasing of the obvious reply, so \shrug)

Joe Carlsmith:

Well, it's full because The Algorithm one-boxes, and I'm a full-ass person wrapped around The Algorithm, but I'm not the instance of The Algorithm that Omicron was looking at, so it seems a bit weird to blame it on me.

Do you not control the output of the algorithm?

Nate Soares: In case it's not clear by this point, my reply is "the algorithm controls the output of me". Like, try as I might, I cannot make LDT 2-box on Newcomb's problem — I can't make 2-boxing be higher-utility, and I can't make LDT be anything other than utility-maximizing. I happen to make my choices according to LDT, in a way that is reflectively stable on account of all the delicious delicious utility I get that way.

From this point of view, the point where I'd start saying that it is "me" choosing something (rather than my simpler decision-making core) is when the decision draws on a bunch of extra personal details about Nate-in-particular.

There is of course another point of view, which says "the output of Joe in (say) Newcomb's problem is determined by Joe". This viewpoint is sometimes useful to give to people who are reflecting on themselves and struggling to decide between (say) CDT and LDT.

It's perhaps useful to note that these people tend to have complicated, messy, heuristical decision-procedures, that they're currently in the process of reflecting upon, in ways that are sensitive to various details of their personality and arguments they just heard. Which is to say, someone who's waffling on Newcomb's problem does have much more of their full self engaged in the choice than (say) I do. Their decision procedure is much more unique to them; it involves much more of their true name; all-that-is-them is much more of an input to it.

At that point, "their decision algorithm" and "them" are much closer to synonymous, and I won't quibble much if we say "their algorithm is what determines them" or "they are what determines the output of their algorithm". But in my case, having already passed through the reflective gauntlet, it's much clearer that the algorithm guides me, than that the parts of me wrapped around the algorithm guide it.

(Of course, the algorithm is also part of me, as it is part of many, and so it is still true that some part of me controls the output of The Algorithm. Namely, The Algorithm controls the output of The Algorithm.)

LDT doesn’t pass up guaranteed payoffs

Logical decision theorists firmly deny that they pass up guaranteed payoffs. (I can't quite tell from a skim whether you understand this; apologies if I missed the parts where you acknowledge this.)

As you probably know, in a twin PD problem, a CDT agent might protest that by cooperating you pass up a guaranteed payoff, because (they say) defecting is a dominant strategy. A logical decision theorist counters that the CDT agent has made an error, by imagining that "I defect while my twin cooperates" is a possibility, when in fact it is not.

In particular, when the CDT agent closes their eyes and imagines defecting, they (wrongly) imagine that the action of their twin remains fixed. Among the actual possibilities (cooperate, cooperate) and (defect, defect), the former clearly dominates. The disagreement is not about whether to take dominated strategies, but about what possibilities to admit in the matrix from which we calculate what is dominated and what is not.
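To make the disagreement about the matrix concrete, here is a minimal Python sketch. The payoff numbers are illustrative assumptions (any standard prisoner's-dilemma ordering would do), and the function names are hypothetical. It is not a full implementation of either decision theory, only of which cells each one admits when comparing actions.

```python
# Assumed payoffs for "me", keyed by (my_action, twin_action).
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}

def best_reply_holding_twin_fixed(twin_action):
    """CDT-style comparison: imagine my action varying while the twin's stays put."""
    return max("CD", key=lambda mine: PAYOFF[(mine, twin_action)])

def best_action_among_possible_outcomes():
    """LDT-style comparison: admit only the cells where the twin acts as I do."""
    return max("CD", key=lambda mine: PAYOFF[(mine, mine)])

# With the twin held fixed, defecting looks dominant in both columns.
print(best_reply_holding_twin_fixed("C"), best_reply_holding_twin_fixed("D"))  # D D
# Among (cooperate, cooperate) and (defect, defect), cooperating is better.
print(best_action_among_possible_outcomes())  # C
```

Both functions maximize payoff; they differ only in which entries of the matrix they are willing to read.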

Now consider Parfit's hitchhiker. An LDT agent withdraws the $10k and gives it to the selfish man. Will MacAskill objects, "you're passing up a guaranteed payoff of $10k, now that you're certain you're in the city!". The LDT agent says "you have made an error, by imagining ‘I fail to pay while being in the city’ is a possibility, when in fact it is not. In particular, when you close your eyes and imagine not paying, you (wrongly) imagine that your location remains fixed, and wind up imagining an impossibility."

Objecting “it's crazy to imagine your location changing if you fail to pay” is a fair criticism. Objecting that logical decision theorists pass up guaranteed payoffs is not.

The whole question at hand is how to evaluate the counterfactuals. Causal decision theorists say "according to my counterfactuals, if you pay you lose $10k, thus passing up a guaranteed payoff", whereas logical decision theorists say "your counterfactuals are broken, if I don't pay then I die, life is worth more than $10k to me, I am taking the action with the highest payoff". You're welcome to argue that logical decision theorists calculate their counterfactuals wrong, if you think that, but saying we pass up guaranteed payoffs is either confused or disingenuous.

Joe Carlsmith:

(I can't quite tell from a skim whether you understand this; apologies if I missed the parts where you acknowledge this.)

I think I could’ve been clearer about it in the piece, and in my own head. Your comments here were useful on that front.

Joe Carlsmith:

Objecting “it’s crazy to imagine your location changing if you fail to pay” is a fair criticism.

Yeah I suppose this is where my inner “guaranteed payoffs” objector would go next. Could imagine thinking: “well, that just seems flat out metaphysically wrong, and in this sense worse than violating guaranteed payoffs, because just saying false stuff about what happens if you do X is worse than saying weird stuff about what’s ‘rational.’”

Nate Soares: I agree "you're flat-out metaphysically wrong (in a way that seems even worse than violating guaranteed payoffs)" is a valid counterargument to my actual position (in a way that "you violate guaranteed payoffs" is not). :-)

Parfit’s hitchhiker and contradicting the problem statement

There's a cute theorem I've proven (or, well, I've jotted down what looks to me like a proof somewhere, but haven't machine-checked it or anything), which says that if you want to disagree with logical decision theorists, then you have to disagree in cases where the predictor is literally perfect. The idea is that we can break any decision problem down by cases (like "insofar as the predictor is accurate, ..." and "insofar as the predictor is inaccurate, ...") and that all the competing decision theories (CDT, EDT, LDT) agree about how to aggregate cases. So if you want to disagree, you have to disagree in one of the separated cases. (And, spoilers, it's not going to be the case where the predictor is on the fritz.)

I see this theorem as the counter to the decidedly human response "but in real life, predictors are never perfect". "OK!", I respond, "But decomposing a decision problem by cases is always valid, so what do you suggest we do under the assumption that the predictor is accurate?"

Even if perfect predictors don't exist in real life, your behavior in the more complicated probabilistic setting should be assembled out of a mixture of ways you'd behave in simpler cases. Or, at least, so all the standard leading decision theories prescribe. So, pray tell, what do you do insofar as the predictor reasoned accurately?
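The aggregation step can be sketched as follows, using Parfit's hitchhiker at the ATM. Everything numeric here except the $10k is an assumption: the utility assigned to dying in the desert is made up for illustration, and the "inaccurate" case is modeled as a driver who would have taken you to the city regardless. The per-case outcomes are the ones the logical decision theorist assigns; what belongs in the accurate case is exactly the part that is in dispute, while the weighted sum across cases is the part all the theories share.

```python
DEATH = -1_000_000   # assumed utility of dying alone in the desert
PAY_COST = -10_000   # the $10k handed to the driver

# Outcome of each action within each case (LDT-style counterfactuals).
CASES = {
    "accurate":   {"pay": PAY_COST, "stiff": DEATH},
    "inaccurate": {"pay": PAY_COST, "stiff": 0},
}

def expected_utility(action, p_accurate):
    """Aggregate the two cases, weighted by how likely each one is."""
    return (p_accurate * CASES["accurate"][action]
            + (1 - p_accurate) * CASES["inaccurate"][action])

def best_action(p_accurate):
    return max(("pay", "stiff"), key=lambda a: expected_utility(a, p_accurate))

print(best_action(0.95))  # pay
print(best_action(0.0))   # stiff
```

With these assumed numbers, paying wins whenever the predictor is accurate more than 1% of the time.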

I think this is a good intuition pump for the thing where logical decision theorists are like "if I imagine stiffing the driver, then I imagine dying in the desert." Insofar as the predictor is accurate, imagining being in the city after stiffing the driver is just as bonkers as imagining defecting while your twin cooperates.

One way I like to think about it is, this decision problem is set up in a fashion that purports to reveal the agent's choice to them before they make it. What, then, happens in the case where the agent acts inconsistently with this revelation? The scenario is ill-defined.

Like, consider the decision problem "You may have either a cookie or a bonk on the head, and you're going to choose the bonk on the head. Which do you choose?" The cookie might seem more appealing than the bonk, but observe that taking the cookie refutes the problem statement. It's at least a little weird to confidently assert that, in that case, you get a cookie. What you really get is a contradiction. And sure, ex falso quodlibet, but it seems a bit strange to anchor on the cookie.

It's not the fault of the agent that this problem statement is refutable by some act of the agent! The problem is ill-defined without someone telling us what actually happens if we refute the problem statement. If you try to take the cookie, you don’t actually wind up with a cookie; you yeet yourself clean out of the hypothetical. To figure out whether to take the cookie, you need to know where you'd land.

Parfit's hitchhiker, at the point where you're standing at the ATM, is much like this. The alleged problem statement is "you may either lose $0 or $10,000, and you're going to choose to lose $10,000". At which point we're like "Hold on a sec, the problem statement makes an assertion about my choice, which I can refute. What happens if I refute the problem statement?" At which point the question-poser is like "haha oops, yeah, if you refute the problem statement then you die alone in the desert". At which point, yeah, when the logical decision theorist closes their eyes and imagines stiffing the driver, then (under the assumption that the driver is accurate) they're like "oh dang, this would refute my observations; what happens in that case again? right, I'd die alone in the desert, which is worse than losing $10,000", and then they pay.

(I also note that this counterfactual they visualize is correct. Insofar as the predictor is accurate, if they wouldn't pay, then they would die alone in the desert instead. That is, in real life, what happens to non-payers who face accurate predictors. The "$0" was a red herring; that case is contradictory and cannot actually be attained.)

(In the problem where you may have either a cookie or a bonk, and you're going to take the bonk, but if you render the problem inconsistent then you get two cookies, by all means, take the cookie. But in the problem where you may have either a cookie or a bonk, and you're going to take the bonk, but if you render the problem inconsistent then you die alone in the desert, then take the dang bonk.)
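A small Python sketch of the cookie-or-bonk problem, under stated assumptions: the utility numbers are invented for illustration, and the names `outcome` and `best_choice` are hypothetical. It shows that the choice cannot be evaluated until someone supplies `on_contradiction`, the place you land if you refute the problem statement.

```python
# Assumed utilities, for illustration only.
UTILITY = {
    "cookie": 1,
    "bonk": -1,
    "two cookies": 2,
    "die alone in the desert": -1000,
}

def outcome(choice, predicted, on_contradiction):
    """You get your choice if it matches the problem statement's assertion;
    otherwise you get whatever the poser says happens on refutation."""
    return choice if choice == predicted else on_contradiction

def best_choice(predicted, on_contradiction):
    return max(
        ("cookie", "bonk"),
        key=lambda c: UTILITY[outcome(c, predicted, on_contradiction)],
    )

print(best_choice("bonk", "two cookies"))              # cookie
print(best_choice("bonk", "die alone in the desert"))  # bonk
```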

This sort of thing definitely runs counter to some human intuitions — presumably because, in real life, we rarely observe consequences of actions we haven't made yet.

(Well, except for in a variety of social settings, where we have patches such as "honor" and "reputation" that, notably, give the correct answer in this case, but I digress.)

This is where I think my cute theorem makes it easier to see what's going on: insofar as the predictor is perfect, it doesn't make sense to visualize being in the city after stiffing the driver. When you're standing in front of the ATM, and you screw your eyes shut and imagine what happens if you just run off instead of withdrawing the money, then in the case where the predictor reasoned correctly, your visualizer should be like ERROR ERROR HOW DID WE GET TO THE CITY?, and then fall back to visualizing you dying alone in the desert.

Is it weird that your counterfactual-visualizer paints pictures of you being in the desert, even though you remember being driven to the city? Yep. But it's not the agent's fault that they were shown a consequence of their choice before making their choice; they're not the one who put the potential for contradiction into the decision problem. Avoiding contradiction isn’t their problem. One of their available choices is contradictory with observation (at least under the assumption that the predictor is accurate), and they need to handle the contradiction somehow, and the problem says right there on the tin that if you would cause a contradiction then you die alone in the desert instead.

(Humans, of course, implement the correct decision in this case via a sense of honor or suchlike. Which is astute! "I will pay, because I said I would pay and I am a man of my word" can be seen as a shadow of the correct line of reasoning, cast onto monkey brains that were otherwise ill-suited for it. I endorse the practice of recruiting your intuitions about honor to perform correct counterfactual reasoning.)

(And these counterfactuals are true, to be clear. You can't go find people who were accurately predicted, driven to the city, and then stiffed the driver. There are none to be found.)

Do you see how useful this cute little theorem is? I love it. Instead of worrying about "but what if the driver was simply a fool, and I can save $10k?", we get to break the decision problem down into cases, one where the driver was incorrect, and one where they were correct. We all agree that insofar as they're incorrect you have to stiff them, and we all agree about how to aggregate cases, so the remaining question is what you do insofar as they're accurate. And insofar as they're accurate, the contradiction is laid bare. And the "stand in front of the ATM, but visualize yourself dying in the desert" thing feels quite justified, at least to me, as a response to a full-on contradiction.

Just remember that it's not your job to render the universe consistent, and that contradictions can't actually happen. Insofar as the predictor is accurate, imagining yourself surviving and then stiffing the driver makes just as much sense as imagining yourself defecting against your cooperating clone.

Joe Carlsmith:

"You may have either a cookie or a bonk on the head, and you're going to choose the bonk on the head. Which do you choose?"

I think this is a useful way of illustrating some of the puzzles that come up with transparent-Newcomb-like cases.

Joe Carlsmith:

we get to break the decision problem down into cases, one where the driver was incorrect, and one where they were correct

Do you have something like "reliable" in mind, here, rather than "correct"? E.g., presumably you don't care if he's correct, but he flipped a coin to determine his prediction. It seems like what matters is whether his prediction was sensitive to your choice or not — a modal thing.

Nate Soares: Yeah, that's actually my preferred way to think about it. That adds some extra subtleties that turn out to make no difference, though, so skipped over it for the sake of exposition.

(Like, an easy way to do it is to say "I think there's a 95% chance they reason correctly about me, and a 5% chance they make at least one reasoning error, and in the latter case it's equally likely (in a manner uncorrelated with my action) that the error pushes them to an invalid true conclusion as an invalid false conclusion, and so we can model this as one case where they're correct, and one case where they toss a coin and guess accordingly". And this turns out to be equivalent to assuming that they're 97.5% right and 2.5% wrong, which is why it makes no difference. But this still doesn't match real life, because in real life they're using fallible stuff like intuition and plausible-seeming deductive leaps, but whatever, I claim it still basically comes down to "were they taking the relevant considerations about me into account, and reasoning validly to their conclusion, or not?" \shrug)
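The arithmetic behind the 97.5% figure can be checked directly. This short Python sketch uses exact fractions; the variable names are hypothetical, and the only inputs are the 95%/5% split and the 50/50 coin from the paragraph above.

```python
from fractions import Fraction

p_reasons_correctly = Fraction(95, 100)
p_reasoning_error = 1 - p_reasons_correctly
p_coin_lands_right = Fraction(1, 2)   # error case: uncorrelated coin toss

# Right either by valid reasoning, or by erring and guessing right anyway.
p_right = p_reasons_correctly + p_reasoning_error * p_coin_lands_right
p_wrong = p_reasoning_error * (1 - p_coin_lands_right)

print(float(p_right), float(p_wrong))  # 0.975 0.025
```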

Joe Carlsmith: Cool, would like to think about this more (I do feel like being X% accurate won't always be relevantly equivalent to being Y% infallible and Z% something else), but breaking things down into cases like this seems useful regardless. In particular, seems like the "can't I just control whether he's accurate" response discussed below should apply in the Y%-infallible-Z%-something-else case.

Nate Soares: (I agree it won't always be relevantly equivalent. It happens to be equivalent in this case, and in most other simple decision problems where you care only about whether (and not why) the predictor got the answer right. Which is not supposed to be terribly obvious, and I'll consider myself to have learned a lesson about using expositional simplifications where the fact that it is a simplification is not trivial. :-p)

Joe Carlsmith:

We all agree that insofar as they're incorrect you have to stiff them, and we all agree about how to aggregate cases, so the remaining question is what you do insofar as they're accurate. And insofar as they're accurate, the contradiction is laid bare. And the "stand in front of the ATM, but visualize yourself dying in the desert" thing feels quite justified, at least to me, as a response to a full-on contradiction.

Rephrasing to make sure I understand (using the "reliable/sensitive" interpretation I flagged above): “You stand in front of the ATM. Thus, he’s predicted that you pay. Now, either it’s the case that, if it weren’t the case that you pay, you’d be in the desert dead; or it’s the case that, if it weren’t the case that you pay, you’d still be at the ATM. In the former case, not paying is a contradiction. In the latter case, you should not pay.”

I wonder if the one-boxer could accept this but say: “OK, but given that I’m standing in front of the ATM, if I don’t pay, then I’m in the case where I should not pay, so it’s fine to not pay, so I won’t." E.g., by not paying in the city, you can "make it not the case" that if you don't pay, you die in the desert five hours ago — after all, you're alive in the city now.

Nate Soares:

Rephrasing to make sure I understand [...]

That's right!

I wonder if the one-boxer could accept this but say [...]

There are decision theories that have this behavior! (With some caveats.) Note that this corresponds to an agent that 1-boxes in Newcomb's problem, but 2-boxes in the transparent Newcomb's problem. I don't know of anyone who seriously advocates for that theory, but it's a self-consistent middle-ground.

One caveat is that this isn't reflectively consistent (e.g., such agents expect to die in the desert in any future Parfit's hitchhiker, and would pay in advance to self-modify into something that pays the driver if the driver makes their prediction after the moment of modification). Another caveat is that such agents are easily exploitable by blackmail.

I also suspect that this decision theory violates the principle where you can break down a decision problem by cases? But I'm not sure. You can almost surely get them to pay you to not reveal information. You can maybe money pump them, though I haven't tried.

But those aren't quite my true objection to this sort of thinking. And indeed, the error in this line of thinking ("if I stiff the driver, then I must thereby render them inaccurate, because I've already seen the ATM") is precisely what my lemma about problem decomposition is intended to ward off.

Like, one thing that's wrong with this sort of thinking is that it's hallucinating that the driver's accuracy is under your (decision algorithm's) control. It isn't (and I suspect that the mistake can be money-pumped).

Another thing that's wrong with it is that it's comparing counterfactuals with different degrees of consistency.

Like, consider the problem "you can choose a cookie or a bonk on the head; also, I tossed a coin that comes up 'bonk' 99.9999% of the time and 'cookie' 0.0001% of the time, and your choice matches the coin."

Now, choosing 'cookie' only has a 99.9999% chance of being inconsistent with the problem statement, but this doesn't put the two choices on equal footing. Like, yes, now you can only probabilistically render this problem-statement false, but it's still pretty weird that you can probably render this problem-statement false! And the fact that I mixed in a little uncertainty, doesn't mean that you can now make your choice without knowing what happens if you render the problem statement false! The fact that we mixed in a little uncertainty doesn't justify comparing a bonk directly to a cookie; the problem statement is still incomplete; you still need to know what would actually happen insofar as your action contradicts the allegation that it matches the biased coin.

And, like, there's an intuition that it would be pretty weird, given that problem-statement, to imagine that your choice controls the coin. The coin isn't about you; it's not about your algorithm; there's nothing linking your action to the coin. The weird thing about this problem-statement is the bizarre assertion that your action is known to match the coin. Like… whichever way the coin came up, what if you did the opposite of that?

This is an intuition behind the idea that we should be able to case on the value of the coin and consider each of the cases independently. Like, no matter what the value of the coin is, one of our actions reveals the problem statement to be bogus. And someone needs to tell us what happens if we render the whole problem-statement bogus. And so even when there's uncertainty, we need to know the consequences of refuting the problem statement in order to choose our action.
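The case analysis above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the utilities and the `if_refuted` parameter are hypothetical and not part of the original problem, and the coin is taken to land 'bonk' with probability 0.999999 and 'cookie' with the remainder. The sketch cases on the coin and refuses to produce an expected value until the caller says what happens in the branch where the choice contradicts the coin.

```python
from fractions import Fraction

# Coin bias: 'bonk' with probability 0.999999, 'cookie' with the remainder.
P_COIN = {"bonk": Fraction(999999, 1000000), "cookie": Fraction(1, 1000000)}

# Hypothetical utilities, chosen only for illustration.
UTILITY = {"cookie": 1, "bonk": -1}


def p_refuted(choice):
    """Probability that this choice contradicts 'your choice matches the coin'."""
    return sum(p for coin, p in P_COIN.items() if coin != choice)


def expected_value(choice, if_refuted=None):
    """Case on the coin and sum over the cases.

    Where the choice matches the coin, the problem statement fixes the outcome.
    Where it does not, the statement is refuted, and the payoff has to be
    supplied by the caller via `if_refuted` (an assumed extra rule).
    """
    total = Fraction(0)
    for coin, p in P_COIN.items():
        if choice == coin:
            total += p * UTILITY[choice]
        elif if_refuted is None:
            raise ValueError(
                f"choosing {choice!r} when the coin says {coin!r} refutes "
                "the problem statement; no payoff is specified"
            )
        else:
            total += p * if_refuted
    return total
```

Calling `expected_value("cookie")` with no `if_refuted` raises, and so does `expected_value("bonk")`: whichever way the coin came up, one of the two actions refutes the statement, so neither action can be evaluated until the refuted branch is given a value.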

Joe Carlsmith:

Note that this corresponds to an agent that 1-boxes in Newcomb's problem, but 2-boxes in the transparent Newcomb's problem. I don't know of anyone who seriously advocates for that theory, but it's a self-consistent middle-ground.

EDT 1-boxes in Newcomb's, but 2-boxes in transparent Newcomb's, no?
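A rough sketch of why a naive EDT calculation comes out this way, in Python. The payoffs (the conventional $1,000,000 and $1,000) and the 0.99 predictor accuracy are assumptions for illustration; the discussion does not state them. The function names are hypothetical too, and this is a toy model of EDT rather than a full treatment.

```python
BIG, SMALL = 1_000_000, 1_000  # conventional payoffs, assumed here
ACCURACY = 0.99                # hypothetical predictor accuracy


def edt_newcomb(accuracy=ACCURACY):
    """Opaque box: the action is evidence about what the box contains."""
    ev_one = accuracy * BIG                # one-boxing suggests the box is full
    ev_two = (1 - accuracy) * BIG + SMALL  # two-boxing suggests it is empty
    return "one-box" if ev_one > ev_two else "two-box"


def edt_transparent(box_is_full):
    """Transparent box: the contents are already observed, so the action
    carries no further evidence about them in this toy model."""
    contents = BIG if box_is_full else 0
    ev_one = contents
    ev_two = contents + SMALL
    return "one-box" if ev_one > ev_two else "two-box"
```

Under these assumptions `edt_newcomb()` picks one-boxing, while `edt_transparent` picks two-boxing whether the box is seen to be full or empty.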

You can almost surely get them to pay you to not reveal information.

Agree, I feel like avoiding this is one of the key points of being "updateless." E.g., because you're able to act as you would've committed to acting prior to learning the information, it's fine to learn it. Also agree re: exploitable via blackmail (e.g. EDT's XOR blackmail problems).

one thing that's wrong with this sort of thinking is that it's hallucinating that the driver's accuracy is under your (decision algorithm's) control. It isn't (and I suspect that the mistake can be money-pumped).

Flagging that I still feel confused about this, and it feels like it rhymes a bit with stuff about ‘can you control the base rate of lesions’ in smoking lesion that I discuss in the post. (I expect you want to say no, and that this is connected to why you want to smoke in smoking lesion — but in cases where your smoking is genuinely evidence that you’ve got the lesion, I’m not sure this is the right verdict.) I'm wondering if there's something generally weird going on in terms of "having a problem-set-up" that can be violated or not.

the fact that I mixed in a little uncertainty, doesn't mean that you can now make your choice without knowing what happens if you render the problem statement false!

Cool, this helps give me a sense of where you're coming from. In particular, even if the predictor isn't always accurate, sounds like you want to interpret “I’m in the city and successfully don’t pay” as having some probability of rendering the problem-statement false, as opposed to being certain to put you in the worlds where the predictor was wrong.

Nate Soares:

EDT 1-boxes in Newcomb's, but 2-boxes in transparent Newcomb's, no?

You're right, I should have thrown in some extra things that rule out EDT. I think that thing refuses XOR blackmail, 1-boxes in Newcomb's problem, and 2-boxes in transparent Newcomb's? (Though I haven't checked.) Which is the sort of theory that, like, only locals would consider, and I don't know any local who takes it seriously, on account of the exploitability and reflective inconsistency and stuff.

I don't have the smoking lesion problem mentally loaded up (I basically think it's just a confused problem statement), but my cached thought is that I give the One True Rescuing of that problem in the "The Smoking Lesion Problem" section of https://arxiv.org/pdf/1710.05060.pdf :-p. And I agree with the diagnosis that there's generally something weird going on when the problem set-up can be violated.

In particular, even if the predictor isn't always accurate, sounds like you want to interpret “I’m in the city and successfully don’t pay” as having some probability of rendering the problem-statement false, as opposed to being certain to put you in the worlds where the predictor was wrong.

Yep! With the justification being that (a) you obviously need to do this when things are certain, and (b) there shouldn't be some enormous change in your behavior when we replace "certain" with "with probability 1 − ε". Doubly so on account of how you should be able to reason by cases.

Like, if you buy that shit is weird when you can certainly render the problem statement false, and if you buy that either you should be able to reason by cases or you shouldn't have some giant discontinuity at literal certitude, then you're basically funneled into believing that you have to consider (when at the ATM) that failing to pay could render the whole set-up false, at which point you need some extra rule for how to reason in that case.

Where CDT says "assume you live and don't pay" and LDT says "assume you die in the desert", and both agree that the rest of the choice is determined given how you respond to the literal contradiction in the flatly contradictory case.

At which point it's my turn to assert that CDT is flat-out metaphysically wrong, because it's hallucinating that flat contradictions are relevantly possible.
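To make the two rules for the flatly contradictory case concrete, here is a minimal Python sketch of the choice at the ATM with a perfectly accurate driver. The utilities are hypothetical numbers picked for illustration, and `choose_at_atm` is an assumed helper, not anything from the discussion. The only thing that differs between the two calls is the value assigned to the contradictory branch, "in the city and not paying".

```python
# Hypothetical utilities, for illustration only.
U_DIE = -1_000_000  # left to die in the desert
U_PAY = -1_000      # rescued, then pays the driver
U_STIFF = 0         # rescued, then keeps the money

# Each rule says what to assume when refusing to pay contradicts the set-up.
CDT_RULE = U_STIFF  # "assume you live and don't pay"
LDT_RULE = U_DIE    # "assume you die in the desert"


def choose_at_atm(contradiction_value):
    """Compare paying against refusing, where refusing is valued by
    whichever rule is supplied for the contradictory case."""
    ev_pay = U_PAY
    ev_refuse = contradiction_value
    return "pay" if ev_pay > ev_refuse else "refuse"
```

With these numbers `choose_at_atm(CDT_RULE)` returns `"refuse"` and `choose_at_atm(LDT_RULE)` returns `"pay"`, so the rest of the choice follows from how the contradiction is resolved.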

Finally, a minor note: I think the twin clone prisoner's dilemma is sufficient to kill CDT. But if you want to kill it extra dead, you might be interested in the fact that you can turn CDT into a money pump whenever you have a predictor that's more accurate than chance, using some cleverness and the fact that you can expand CDT's action space by also offering it contracts that pay out in counterfactuals that are less possible than CDT pretends they are.

Joe Carlsmith: Sounds interesting — is this written up anywhere?

Nate Soares: Maybe in the Death in Damascus paper? Regardless, my offhand guess is that the result is due to Ben Levinstein so if it's not in that paper then it might be in some other paper of Ben's.

Joe Carlsmith: Thanks again for this! I do hope you publish — I'd like to be able to cite your comments in future.