Dweomite


I agree that figuring out what you "should have" precommitted can be fraught.

One possible response to that problem is to set aside some time to think about hypotheticals and figure out *now* what precommitments you would like to make, instead of waiting for those scenarios to actually happen. So the perspective is "actual you, at this exact moment".

I sometimes suspect you could view MIRI's decision theories as an example of this strategy.

Alice: Hey, Bob, have you seen this "Newcomb's problem" thing?

Bob: Fascinating. As we both have unshakable faith in CDT, we can easily agree that two-boxing is correct if you are surprised by this problem, but that you should precommit to one-boxing if you have the opportunity.

Alice: I was thinking--now that we've realized this, why not precommit to one-boxing right now? You know, just in case. The premise of the problem is that Omega has some sort of access to our actual decision-making algorithm, so in principle we can precommit just by deciding to precommit.

Bob: That seems unobjectionable, but not very useful in expectation; we're very unlikely to encounter this exact scenario. It seems like what we really ought to do is make a precommitment for the whole class of problems of which Newcomb's problem is just one example.

Alice: Hm, that seems tricky to formally define. I'm not sure I can stick to the precommitment unless I understand it rigorously. Maybe if...

--Alice & Bob do a bunch of math, and eventually come up with a decision strategy that looks a lot like MIRI's decision theory, all without ever questioning that CDT is absolutely philosophically correct?--

Possibly it's not that simple; I'm not confident that I appreciate all the nuances of MIRI's reasoning.


Suppose you run your twins scenario, and the twins both defect. You visit one of the twins to discuss the outcome.

Consider the statement: "If you had cooperated, your twin would also have cooperated, and you would have received $1M instead of $1K." I think this is formally provable, given the premises.

Now consider the statement: "If you had cooperated, your twin would still have defected, and you would have received $0 instead of $1K." I think this is **also** formally provable, given the premises, because we have assumed a deterministic AI that we already know will defect given this particular set of inputs. Any statement that begins "if you had cooperated..." is assuming a contradiction, from which literally anything is formally provable.
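To make the contradiction concrete, here's a toy sketch (my own construction, not anything from the post): the twin is a deterministic function, and we have already observed that it defects on this input.

```python
def twin(observed_input):
    # Stand-in for the copied AI: a fixed, deterministic program.
    # By assumption, we already know it defects on this input.
    return "defect"

# Both copies receive the same input, so there is exactly one outcome.
outcome = (twin("game"), twin("game"))
assert outcome == ("defect", "defect")

# The counterfactual "if this twin had cooperated on this input" asks us
# to assume twin("game") == "cooperate", which contradicts the program
# above. From a contradiction, anything follows (ex falso quodlibet),
# which is why *both* counterfactual statements come out "provable".
```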

You say in the post that only the cooperate-cooperate and defect-defect outcomes are on the table, because cooperate-defect is impossible by the scenario's construction. I think that cooperate-cooperate and defect-defect aren't **both** on the table, either. Only one of those outcomes is consistent with the AI program that you already copied. If we can say you don't need to worry about cooperate-defect because it's impossible by construction, then in precisely what sense are cooperate-cooperate and defect-defect both still "possible"?

I feel like most people have a mental model for deterministic systems (billiard balls bouncing off each other, etc.) and a separate mental model for agents. If you can get your audience to invoke both of these models at once, you have probably instantiated in their minds a combined model with some latent contradiction in it. Then, by leading your audience down a specific path of reasoning, you can use that latent contradiction to prove essentially whatever you want.

(To give a simple example, I've often seen people ask variations of "does (some combinatorial game) have a 50/50 win rate if both sides play optimally?" A combinatorial game, played optimally, has only one outcome, which must occur 100% of the time; but non-mathematicians often fail to notice this, and apply their usual model of "agents playing a game" even though the question constrained the "agents" to optimal play.)
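As a concrete illustration of that parenthetical (my own toy example): single-pile Nim where you take 1 or 2 stones and taking the last stone wins. Under optimal play, the winner is a pure function of the starting position, so there is no "win rate" to speak of.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def first_player_wins(stones):
    """True iff the player to move wins with optimal play."""
    if stones == 0:
        return False  # no move available: the player to move has lost
    # The mover wins iff some move leaves the opponent in a losing position.
    return any(not first_player_wins(stones - take)
               for take in (1, 2) if take <= stones)

# Every optimally-played game from a given position has the same winner,
# 100% of the time.
assert first_player_wins(3) is False  # multiples of 3 are losing positions
assert first_player_wins(4) is True
```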

I notice this post uses a lot of phrases like "it actually works" and "try it yourself" when talking about the twins example. Unless there's been a recent breakthrough in mind uploading that I haven't heard about, this wording implies empirical confirmation that I'm pretty confident you don't have (and can't get).

If you were forced to express your hypothetical scenarios in computer source code, instead of informal English descriptions, I think it would probably be pretty easy to run some empirical tests and see which strategies actually get better outcomes. But I don't know, and I suspect you don't know, how to "faithfully" represent any of these examples as source code. This leaves me suspicious that perhaps all the interesting results are just confusions, rather than facts about the universe.
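Here is a rough sketch of what "express the scenario as source code" might look like (my construction; the payoff for the unreachable defect-against-cooperator cell is my assumption, and the hard part, faithfully encoding "Omega reads your algorithm", is exactly what a sketch like this glosses over). The agent's source *is* its strategy, and the twin is a literal copy:

```python
PAYOFFS = {  # (my choice, twin's choice) -> my payoff, in dollars
    ("cooperate", "cooperate"): 1_000_000,
    ("cooperate", "defect"):    0,
    ("defect",    "cooperate"): 1_001_000,  # assumed; unreachable anyway
    ("defect",    "defect"):    1_000,
}

def play_twins(strategy):
    """Both twins run the same code, so both make the same choice."""
    mine, twins = strategy(), strategy()
    return PAYOFFS[(mine, twins)]

# Under this encoding, only the diagonal outcomes ever occur:
assert play_twins(lambda: "cooperate") == 1_000_000
assert play_twins(lambda: "defect") == 1_000
```

Whether this encoding is "faithful" is precisely the point in dispute: it bakes in the assumption that your choice and your twin's choice are the same variable.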

Rather than talking about reversibility, can this situation be described just by saying that the probability of certain opportunities is zero? For example, if John and David somehow know in advance that no one will ever offer them pepperoni in exchange for anchovies, then the maximum amount of probability mass that can be shifted from mushrooms to pepperoni by completing their preferences happens to be zero. This doesn't need to be a physical law of anchovies; it could just be a characteristic of their trade partners.

But in this hypothetical, their preferences are effectively no longer strongly incomplete--or at least, their trade policy is no longer strongly incomplete. Since we've assumed away the edge between pepperoni and anchovies, we can (vacuously) claim that John and David will collectively accept 100% of the (non-existent) trades from anchovies to pepperoni, and it becomes possible to describe their trade policy as being a utility maximizer. (Specifically, we can say anchovies = mushrooms because they won't trade between them, and say pepperoni > mushrooms because they will trade mushrooms for pepperoni. The original problem was that this implies that pepperoni > anchovies, which is false in their preferences, but it is now (vacuously) true in their trade policy if such opportunities have probability zero.)
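A quick check of that representability claim, using an encoding I made up: assign utilities anchovies = mushrooms = 0 and pepperoni = 1, and have the policy accept a trade iff it strictly increases utility. This matches the described behavior on every trade that occurs with nonzero probability.

```python
utility = {"anchovies": 0, "mushrooms": 0, "pepperoni": 1}

def accepts(have, offered):
    """Trade policy: accept iff the offered pizza strictly improves utility."""
    return utility[offered] > utility[have]

assert accepts("mushrooms", "pepperoni")      # they do trade mushrooms -> pepperoni
assert not accepts("anchovies", "mushrooms")  # equal utility: no trade either way
assert not accepts("mushrooms", "anchovies")

# accepts("anchovies", "pepperoni") returns True, which contradicts their
# actual preferences -- but if that offer has probability zero, the
# contradiction is never exercised, so the utility description holds
# (vacuously) over everything that actually happens.
```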