New Comment
14 comments, sorted by Click to highlight new comments since:

Son-of-CDT would probably make the wrong choices for Newcomblike problems that its parent, CDT, was "born into." This is because it has no incentive to actually make Son-of-CDT make the right choices in any dilemma that it already being confronted with. One example of a Newcomblike problem which we are all born into is multiverse-wide Prisoner's Dilemmas.

Why do you say "probably"? If there exists an agent that doesn't make those wrong choices you're describing, and if the CDT agent is capable of making such an agent, why wouldn't the CDT agent make an agent that makes the right choices?

It could just be that you have a preference for CDT, as you wrote "Son-of-CDT is the agent with the source code such that running that source code is the best way to convert [energy + hardware + actuators + sensors] into utility." This is not true if you consider logical counterfactuals. But if you were only concerned about affecting the future via analyzing causal counterfactuals, then what you wrote would be accurate.

Personally, I think FDT performs better, not simply because I'd want to precommit to being FDT, but instead because I think it is better philosophically to consider logical counterfactuals rather than causal counterfactuals.

You're taking issue with my evaluating the causal consequences of our choice of what program to run in the agent rather than the logical consequences? These should be the same in practice when we make an AGI, since we're not in some weird decision problem at the moment, so far as I can tell. Or if you think I'm missing something, what are the non-causal, logical consequences of building a CDT AGI?

what are the non-causal, logical consequences of building a CDT AGI?

As stated elsewhere in these comments, I think multiverse cooperation is pretty significant and important. And of course, I am also just concerned with normal Newcomblike dilemmas which might occur around the development of AI, when we can actually run its code to predict its behavior. On the other hand, there seems to me to be no upside to running CDT rather than FDT, conditional on us solving all of the problems with FDT.

I said probably because CDT could self modify into an FDT agent (if that was desirable) but it could also modify into some other agent that took different choices than an FDT agent. Consider if CDT was born into a twin prisoner's dilemma and was allowed to self modify. I don't see a reason why it would self modify into FDT and therefore cooperate in that instance.

Side note: I think the term "self-modify" confuses us. We might as well say that agent's don't self-modify; all they can do is cause other agents to come into being and shut themselves off.

The CDT agent will obviously fall prey to the problems that CDT agents face while it is active (like twin prisoner's dilemma), but after a short period of time, it won't matter how it behaves. Some better agent will be created and take over from there.

Finally, if you think an FDT agent will perform very well in this world, then you should also expect Son-of-CDT to look a lot like an FDT agent.

We might as well say that agent's don't self-modify; all they can do is cause other agents to come into being and shut themselves off.

I agree this is helpful to imagine.

The CDT agent will obviously fall prey to the problems that CDT agents face while it is active (like twin prisoner's dilemma), but after a short period of time, it won't matter how it behaves.

It depends on the scope of the dilemma you are concerned with. As in the case of multiverse cooperation, the CDT agent will never leave it, nor will any of its successors. So, if we built a CDT agent, we could never obtain utility from causally disjoint areas of the multiverse, except by accident. If you hold the view (as I do) that the multiverse contains most of the potential value we could create, this could be a catastrophic loss of utility!

Ah. I agree that this proposal would not optimize causally inaccessible areas of the multiverse, except by accident. I also think that nothing we do optimizes causally inaccessible areas of the multiverse, and we could probably have a long discussion about that, but putting a pin in that,

Let's take things one at a time. First, let's figure out how to not destroy the real world, and then if we manage that, we can start thinking about how to maximize utility in logically possible hypothetical worlds, which we are unable to have any causal influence on.

Regarding the longer discussion, and sorry if this below my usual level of clarity: what do we have at our disposal to make counterfactual worlds with low utility inconsistent? Well, all that we humans have at our disposal is choices about actions. One can play with words, and say that we can choose not just what to do, but also who to be, and choosing who to be (i.e. editing our decision procedure) is supposed by some to have logical consequences, but I think that's a mistake. 1) Changing who we are is an action like any other. Actions don't have logical consequences, just causal consequences. 2) We might be changing which algorithm our brain executes, but we are not changing the output of any algorithm itself, the latter possibility being the thing with supposedly far-reaching (logical) consequences on hypothetical worlds outside of causal contact. In general, I'm pretty bearish on the ability of humans to change math.

Consider the CDT person who adopts FDT. They are probably interested in the logical consequences of the fact their brain in this world outputs certain actions. But no mathematical axioms have changed along the way, so no propositions have changed truth value. The fact that their brain now runs a new algorithm implies that (the math behind) physics ended up implementing that new algorithm. I don't see how it implies much else, logically. And I think the fact that no mathematical axioms have changes supports that intuition quite well!

The question of which low-utility worlds are consistent/logically possible is a property of Math. All of math follows from axioms. Math doesn't change without axioms changing. So if you have ambitions of rendering low-utility world inconsistent, I guess my question is this: which axioms of Math would you like to change and how? I understand you don't hope to causally affect this, but how could you even hope to affect this logically? (I'm struggling to even put words to that; the most charitable phrasing I can come up with, in case you don't like "affect this logically", is "manifest different logic", but I worry that phrasing is Confused.) Also, I'm capitalizing Math there because this whole conversation involves being Platonists about math, where Math is something that really exists, so you can't just invent a new axiomatization of math and say the world is different now.

I also think that nothing we do optimizes causally inaccessible areas of the multiverse

If that's the case, then I assume that you defect in the twin prisoner's dilemma. After all, under your reasoning, your action is independent of your twin because you are causally disjoint. This is true even despite the fact that you are both running identical decision procedures.

Now, if it's the case that you would defect in the twin prisoner's dilemma, I recommend writing up a more general critique of FDT or even EDT, because both recommend cooperating. That would probably better reflect the heart of your view on decision theory.

First, let's figure out how to not destroy the real world, and then if we manage that, we can start thinking about how to maximize utility in logically possible hypothetical worlds, which we are unable to have any causal influence on.

I agree that starting with the assumption of avoiding catastrophe is good, but when we could quite literally lose almost all the available value that we could potentially create by switching to CDT, don't you think that's at least worth looking into? On the flip side, I don't yet see why creating a CDT agent avoids catastrophe better than FDT.

The question of which low-utility worlds are consistent/logically possible is a property of Math. All of math follows from axioms. Math doesn't change without axioms changing.

I agree :). This is the problem with theories of counterpossible reasoning. However, it's not clear that this is more of a problem for FDT than for CDT. After all, CDT evaluates causal counterfactuals, which are just as much a fiction as logical counterfactuals. Physics is either random or deterministic (unless we are mistaken about reality), and in neither case are there real mind independent counterfactuals. Whether or not you take an action is just a fact about the environment.

So, there is no greater problem for FDT; it's just a different problem, and perhaps not even very different. Which is not to say that it's not a big issue -- that's why MIRI is working on it.

If that's the case, then I assume that you defect in the twin prisoner's dilemma.

I do. I would rather be someone who didn't. But I don't see path to becoming that person without lobotomizing myself. And it's not a huge concern of mine, since I don't expect to encounter such a dilemma. (Rarely am I the one pointing out that a philosophical thought experiment is unrealistic. It's not usually the point of thought experiments to be realistic--we usually only talk about them to evaluate the consequences of different positions. But it is worth noting here that I don't see this as a major issue for me.) I haven't written this up because I don't think it's particularly urgent to explain to people why I think CDT is correct over FDT. Indeed, in one view, it would be cruel of me to do so! And I don't think it matters much for AI alignment.

Don't you think that's at least looking into?

This was partly why I decided to wade into the weeds, because absent a discussion of how plausible it is that we could affect things non-causally, yes, one's first instinct would be that we should look at least into it. And maybe, like, 0.1% of resources directed toward AI Safety should go toward whether we can change Math, but honestly, even that seems high. Because what we're talking about is changing logical facts. That might be number 1 on my list of intractable problems.

After all, CDT evaluates causal counterfactuals, which are just as much a fiction as logical counterfactuals.

This is getting subtle :) and it's hard to make sure our words mean things, but I submit that causal counterfactuals are much less fictitious than logical counterfactuals! I submit that it is less extravagant to claim we can affect this world than it is to claim that we can affect hypothetical worlds with which we are not in causal contact. No matter what action I pick, math stays the same. But it's not the case that no matter what action I pick, the world stays the same. (In the former case, which action I pick could in theory tell us something about what mathematical object the physical universe implements, but it doesn't change math.) In both cases, yes, there is only one action that I do take, but assuming we can reason both about causal and logical counterfactuals, we can still talk sensibly about the causal and logical consequences of picking actions I won't in fact end up picking. I don't have a complete answer to "how should we define causal/logical counterfactuals" but I don't think I need to for the sake of this conversation, as long as we both agree that we can use the terms in more or less the same way, which I think we are successfully doing.

I don't yet see why creating a CDT agent avoids catastrophe better than FDT.

I think running an aligned FDT agent would probably be fine. I'm just arguing that it wouldn't be any better than running a CDT agent (besides for the interim phase before Son-of-CDT has been created). And indeed, I don't think any new decision theories will perform any better than Son-of-CDT, so it doesn't seem to me to be a priority for AGI safety. Finally, the fact that no FDT agent has actually been fully defined certainly weighs in favor of just going with a CDT agent.

If that’s the case, then I assume that you defect in the twin prisoner’s dilemma.

I do. I would rather be someone who didn’t. But I don’t see path to becoming that person without lobotomizing myself.

You could just cooperate, without taking such drastic measures, no?

I jumped off a small cliff into a lake once, and when I was standing on the rock, I couldn't bring myself to jump. I stepped back to let another person go, and then I stepped onto the rock and jumped immediately. I might be able to do something similar.

But I wouldn't be able to endorse such behavior while reflecting on it if I were in that situation, given my conviction that I am unable to change math. Indeed, I don't think it would be wise of me to cooperate in that situation. What I really mean when I say that I would rather be someone who cooperated in a twin prisoners dilemma is "conditioned the (somewhat odd) hypothetical that I will at some point end up in a high stakes twin prisoner's dilemma, I would rather it be the case that I am the sort of person who cooperates", which is really saying that I would rather play a twin prisoner's dilemma game against a cooperator than against a defector, which is just an obvious preference for a favorable event to befall me rather than an unfavorable one. In similar news, conditioned on my encountering a situation in the future where somebody checks to see if am I good person, and if I am, they destroy the world, then I would like to become a bad person. Conditioned on my encountering a situation in which someone saves the world if I am devout, I would like to become a devout person.

If I could turn off the part of my brain that forms the question "but why should I cooperate, when I can't change math?" that would be a path to becoming a reliable cooperator, but I don't see a path to silencing a valid argument in my brain without a lobotomy (short of possibly just cooperating really fast without thinking, and of course without forming the doubt "wait, why am I trying to do this really fast without thinking?").

I think it's worth pointing out that I agree that you can't change math. I don't think I can change math. Yet, I would still cooperate. The whole thing about whether we can literally change math is missing a crux. Thankfully, logical counterfactuals are not construed in such a silly way.

This is similar to the debate over whether free will exists when physics is deterministic. "You can't change the future. It is already fixed..." the poor soul said, before walking off a cliff.