# 2

I'll argue here that we should make an aligned AI which is a causal decision theorist.

# Son-of-CDT

Suppose we are writing code for an agent with an action space and an observation space. The code determines how actions will be selected given the prior history of actions and observations. If the only way that our choice of what code to write can affect the world is through the actions that will be selected by the agent running this code, then the best we can do (for a given utility function that we know how to write down) is to make this agent a causal decision theorist. If our choice of what code to use can affect the world in other ways, all bets are off. The best choice of what code to put in the agent depends on details of the world we find ourselves in.

Therefore, if we run a CDT agent, it may well conclude that continuing to operate is not the best way to convert energy into expected utility. It may take actions to cause the following to happen: a) the program which computes its own actions is terminated, and b) some new program is run on the same computer to output actions given the interaction history. The new program that gets run (if indeed such a thing happens) is called Son-of-CDT. Given the state of the world, which entails various ways in which the source code of an agent might affect the outside world besides through the actions that the code outputs, Son-of-CDT is the best program to run for maximizing expected utility. The original CDT agent chooses the program that meets this specification. In general, this will not have anything remotely like a nice, simple closed form. If there are agents out there with vendettas against certain agent-programs, it will take that into account.
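As a rough illustration of the selection step described above, here is a minimal sketch in which the parent agent simply picks whichever candidate program has the highest causal expected utility of being run. Everything here is hypothetical: the candidate names, the numbers, and the split of utility into an "action value" and a "source code effect" are assumptions made for the toy example, not a specification of how a real CDT agent would model the world.

```python
from typing import Callable, Dict, Sequence


def choose_successor(
    candidates: Sequence[str],
    causal_expected_utility: Callable[[str], float],
) -> str:
    """Return the candidate whose being run causes the most expected utility."""
    return max(candidates, key=causal_expected_utility)


# Hypothetical toy numbers: utility obtained through the actions each program
# would output, ignoring how anyone reacts to the program's source code.
ACTION_VALUE: Dict[str, float] = {
    "keep_running_cdt": 10.0,
    "commitment_keeper": 9.0,
}

# Hypothetical toy numbers: utility gained or lost because other agents
# inspect the source code and respond to it (e.g. by bullying it).
SOURCE_CODE_EFFECT: Dict[str, float] = {
    "keep_running_cdt": -5.0,
    "commitment_keeper": 0.0,
}


def utility_actions_only(program: str) -> float:
    # World in which source code matters only through the actions it outputs.
    return ACTION_VALUE[program]


def utility_with_source_effects(program: str) -> float:
    # World in which the source code also affects the world in other ways.
    return ACTION_VALUE[program] + SOURCE_CODE_EFFECT[program]


candidates = list(ACTION_VALUE)
only_actions = choose_successor(candidates, utility_actions_only)
with_effects = choose_successor(candidates, utility_with_source_effects)
```

In the first toy world the parent keeps running CDT; in the second it hands over to a different program. The point of the sketch is only that the choice is an ordinary argmax over programs, evaluated by the causal consequences of running them.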

# Vendettas against Son-of-CDT?

CDT agents can be bullied. I believe the MIRI view is that Son-of-CDT will be bullied as well. Suppose there is an ultimatum game, where agent A offers at most $10 to agent B, and if agent B accepts, then agent A gets $10 minus the amount they offered. Otherwise, both get nothing. A competent agent in the position of agent B, able to make a credible commitment (perhaps by revealing its source code), would commit to accept nothing less than $9.99 if agent A is a CDT agent. This would work out for the competent agent, because the CDT agent would see all this, and realize it could be one penny richer if it offers $9.99.
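The arithmetic of this ultimatum game can be sketched in a few lines. This is a toy model under stated assumptions: amounts are in cents, offers are whole cents, the responder's commitment is a simple acceptance threshold, and the function names are made up for illustration.

```python
TOTAL_CENTS = 1000  # the $10 that agent A can split


def responder_accepts(offer_cents: int, threshold_cents: int) -> bool:
    # Agent B has credibly committed to reject anything below the threshold.
    return offer_cents >= threshold_cents


def proposer_payoff(offer_cents: int, threshold_cents: int) -> int:
    # Agent A keeps the remainder if B accepts; otherwise both get nothing.
    if responder_accepts(offer_cents, threshold_cents):
        return TOTAL_CENTS - offer_cents
    return 0


def cdt_best_offer(threshold_cents: int) -> int:
    # Taking B's commitment as a fixed fact about the world, A picks the
    # offer that causes the highest payoff (ties go to the smallest offer).
    return max(
        range(TOTAL_CENTS + 1),
        key=lambda offer: proposer_payoff(offer, threshold_cents),
    )


bullied_offer = cdt_best_offer(999)
bullied_payoff = proposer_payoff(bullied_offer, 999)
```

Against a threshold of 999 cents, the proposer's best causal response is to offer 999 and keep one cent, which is the bullying outcome described above.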

Eliezer claims that a "[competent] agent [chooses] to reject offers short of $9.99 from [the CDT agent's] offspring." (Original: "the LDT agent's choice to reject offers short of $9.99 from its offspring".)

In my sketch above of the creation of Son-of-CDT, I include a detail that it would be housed in the same computer that ran the original agent, but this needn't be the case. It could be run anywhere in the world. The CDT agent could take any sort of actions that would cause Son-of-CDT to come into existence some time in the future somewhere in the world. There is no clear way to distinguish the "offspring" of an agent, given that an agent's actions can cause other agents to come into existence in arbitrary ways. For a competent agent to reject offers short of $9.99 from the "offspring" of a CDT agent, it would have to reject offers short of $9.99 from all agents that came into being after the existence of a single CDT agent. It would have to bully everyone.

After a CDT agent with a certain utility function comes into being, if there exists an accessible future in which a competent agent optimizes that utility function (where "accessible" is with respect to the action space of the CDT agent), then the CDT agent will access that future by taking the appropriate actions, and that competent agent will come into being. If it is true that competent agents bully Son-of-CDT, then it must be true that competent agents bully all agents whose probability of birth could have been affected by any pre-existing CDT agent.

Perhaps a competent agent chooses to reject offers short of $9.99 from any agents that come into existence after a CDT agent exists if they have a similar utility function to that CDT agent. If so, then we're cooked. CDT humans have existed, so this would imply that we can never create an agent with a human-like utility function that is not bullied by competent agents.

Perhaps a competent agent chooses to reject offers short of $9.99 from any agents that it deems, using some messy heuristic, to have been made "on purpose" as a result of some of the actions of a CDT agent, and also from any agents that were made "on purpose" by any of those agents, and so on. (The recursion is necessary for the CDT agent to lack the incentive to make descendants which don't get bullied; that property underlay the claim that competent agents bully Son-of-CDT.) If this is indeed what competent agents do to purposeful descendants of causal decision theorists, then if any researchers or engineers contributing to AGI are causal decision theorists, or if they once were, but changed their decision theory purposefully, or if they have any ancestors who were causal decision theorists (and no births along the way from that ancestor were accidents), then no matter what code is run in that AGI, the AGI would get bullied. This is according to the claim that Son-of-CDT gets bullied under a third possible definition of "offspring". I believe there are people attempting to make AGI, whose research will end up being relevant to AGI, who are CDT, or once were, or had parents who were, etc. So we're cooked in that case too.

But more realistically (and optimistically), I am very skeptical of the claim that competent agents bully everyone in practice.

# Fair Tests

Incidentally, the proposed treatment of Son-of-CDT falls under MIRI's category of an "unfair problem". A decision problem is "fair" if "the outcome depends only on the agent’s behavior in the dilemma at hand" (FDT, Section 9). Disregarding unfair problems is a precondition for progress in decision theory (in the MIRI view of what progress in decision theory entails) since it allows one to ignore objections like "Well, what if there is an agent out there who hates FDT agents? Then you wouldn't want your daughter to be an FDT agent, would you?" I'm skeptical of the relevance of research that treats unfair problems as non-existent, so in my view, this section is ancillary, but maybe some people will find it convincing. In any case, any bullying done to Son-of-CDT by virtue of the existence of a certain kind of agent that took actions which affected its birth certainly qualifies as "unfair".

# Implications

We want to create an agent with some source code such that our utility becomes optimized. Given that choices about the source code of an agent have consequences other than how that code outputs actions, this might not be a causal decision theorist. However, by definition, Son-of-CDT is the agent that meets this description: Son-of-CDT is the agent with the source code such that running that source code is the best way to convert [energy + hardware + actuators + sensors] into utility. How do we make Son-of-CDT? We just run a causal decision theorist, and let it make Son-of-CDT.


Son-of-CDT would probably make the wrong choices for Newcomblike problems that its parent, CDT, was "born into." This is because CDT has no incentive to make Son-of-CDT make the right choices in any dilemma that CDT is already being confronted with. One example of a Newcomblike problem which we are all born into is multiverse-wide Prisoner's Dilemmas.

Why do you say "probably"? If there exists an agent that doesn't make those wrong choices you're describing, and if the CDT agent is capable of making such an agent, why wouldn't the CDT agent make an agent that makes the right choices?

It could just be that you have a preference for CDT, as you wrote "Son-of-CDT is the agent with the source code such that running that source code is the best way to convert [energy + hardware + actuators + sensors] into utility." This is not true if you consider logical counterfactuals. But if you were only concerned about affecting the future via analyzing causal counterfactuals, then what you wrote would be accurate.

Personally, I think FDT performs better, not simply because I'd want to precommit to being FDT, but instead because I think it is better philosophically to consider logical counterfactuals rather than causal counterfactuals.

You're taking issue with my evaluating the causal consequences of our choice of what program to run in the agent rather than the logical consequences? These should be the same in practice when we make an AGI, since we're not in some weird decision problem at the moment, so far as I can tell. Or if you think I'm missing something, what are the non-causal, logical consequences of building a CDT AGI?

> what are the non-causal, logical consequences of building a CDT AGI?

As stated elsewhere in these comments, I think multiverse cooperation is pretty significant and important. And of course, I am also just concerned with normal Newcomblike dilemmas which might occur around the development of AI, when we can actually run its code to predict its behavior. On the other hand, there seems to me to be no upside to running CDT rather than FDT, conditional on us solving all of the problems with FDT.

I said "probably" because CDT could self-modify into an FDT agent (if that was desirable), but it could also modify into some other agent that made different choices than an FDT agent. Consider if CDT was born into a twin prisoner's dilemma and was allowed to self-modify. I don't see a reason why it would self-modify into FDT and therefore cooperate in that instance.

Side note: I think the term "self-modify" confuses us. We might as well say that agents don't self-modify; all they can do is cause other agents to come into being and shut themselves off.

The CDT agent will obviously fall prey to the problems that CDT agents face while it is active (like twin prisoner's dilemma), but after a short period of time, it won't matter how it behaves. Some better agent will be created and take over from there.

Finally, if you think an FDT agent will perform very well in this world, then you should also expect Son-of-CDT to look a lot like an FDT agent.

> We might as well say that agents don't self-modify; all they can do is cause other agents to come into being and shut themselves off.

I agree this is helpful to imagine.

> The CDT agent will obviously fall prey to the problems that CDT agents face while it is active (like twin prisoner's dilemma), but after a short period of time, it won't matter how it behaves.

It depends on the scope of the dilemma you are concerned with. As in the case of multiverse cooperation, the CDT agent will never leave it, nor will any of its successors. So, if we built a CDT agent, we could never obtain utility from causally disjoint areas of the multiverse, except by accident. If you hold the view (as I do) that the multiverse contains most of the potential value we could create, this could be a catastrophic loss of utility!

Ah. I agree that this proposal would not optimize causally inaccessible areas of the multiverse, except by accident. I also think that nothing we do optimizes causally inaccessible areas of the multiverse, and we could probably have a long discussion about that, but putting a pin in that:

Let's take things one at a time. First, let's figure out how to not destroy the real world, and then if we manage that, we can start thinking about how to maximize utility in logically possible hypothetical worlds, which we are unable to have any causal influence on.

Regarding the longer discussion, and sorry if this is below my usual level of clarity: what do we have at our disposal to make counterfactual worlds with low utility inconsistent? Well, all that we humans have at our disposal is choices about actions. One can play with words, and say that we can choose not just what to do, but also who to be, and choosing who to be (i.e. editing our decision procedure) is supposed by some to have logical consequences, but I think that's a mistake. 1) Changing who we are is an action like any other. Actions don't have logical consequences, just causal consequences. 2) We might be changing which algorithm our brain executes, but we are not changing the output of any algorithm itself, the latter possibility being the thing with supposedly far-reaching (logical) consequences on hypothetical worlds outside of causal contact. In general, I'm pretty bearish on the ability of humans to change math.

Consider the CDT person who adopts FDT. They are probably interested in the logical consequences of the fact that their brain in this world outputs certain actions. But no mathematical axioms have changed along the way, so no propositions have changed truth value. The fact that their brain now runs a new algorithm implies that (the math behind) physics ended up implementing that new algorithm. I don't see how it implies much else, logically. And I think the fact that no mathematical axioms have changed supports that intuition quite well!

The question of which low-utility worlds are consistent/logically possible is a property of Math. All of math follows from axioms. Math doesn't change without axioms changing. So if you have ambitions of rendering low-utility worlds inconsistent, I guess my question is this: which axioms of Math would you like to change and how? I understand you don't hope to causally affect this, but how could you even hope to affect this logically? (I'm struggling to even put words to that; the most charitable phrasing I can come up with, in case you don't like "affect this logically", is "manifest different logic", but I worry that phrasing is Confused.) Also, I'm capitalizing Math there because this whole conversation involves being Platonists about math, where Math is something that really exists, so you can't just invent a new axiomatization of math and say the world is different now.

> I also think that nothing we do optimizes causally inaccessible areas of the multiverse

If that's the case, then I assume that you defect in the twin prisoner's dilemma. After all, under your reasoning, your action is independent of your twin because you are causally disjoint. This is true even despite the fact that you are both running identical decision procedures.

Now, if it's the case that you would defect in the twin prisoner's dilemma, I recommend writing up a more general critique of FDT or even EDT, because both recommend cooperating. That would probably better reflect the heart of your view on decision theory.
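To make the two ways of reasoning about the twin prisoner's dilemma concrete, here is a toy sketch. The payoff numbers are a standard illustrative choice and not from this discussion, and `linked_choice` is only a crude stand-in for what FDT or EDT recommend: it assumes the twin's action always matches one's own, rather than implementing either theory.

```python
# Hypothetical payoffs to "me", indexed by (my action, twin's action).
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}


def cdt_choice(p_twin_cooperates: float) -> str:
    # Causal reasoning: the twin's action is held fixed (with some
    # probability of cooperation) while my own action is varied.
    def expected_utility(action: str) -> float:
        return (
            p_twin_cooperates * PAYOFF[(action, "C")]
            + (1 - p_twin_cooperates) * PAYOFF[(action, "D")]
        )

    return max("CD", key=expected_utility)


def linked_choice() -> str:
    # Linked reasoning: both twins run the same procedure, so the twin's
    # action is assumed to equal mine; only (C, C) and (D, D) are compared.
    return max("CD", key=lambda action: PAYOFF[(action, action)])


cdt_actions = {cdt_choice(p) for p in (0.0, 0.5, 1.0)}
linked_action = linked_choice()
```

With these numbers, the causal evaluation picks defection whatever probability it assigns to the twin cooperating, while the linked evaluation picks cooperation.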

> First, let's figure out how to not destroy the real world, and then if we manage that, we can start thinking about how to maximize utility in logically possible hypothetical worlds, which we are unable to have any causal influence on.

I agree that starting with the assumption of avoiding catastrophe is good, but when we could quite literally lose almost all the available value that we could potentially create by switching to CDT, don't you think that's at least worth looking into? On the flip side, I don't yet see why creating a CDT agent avoids catastrophe better than FDT.

> The question of which low-utility worlds are consistent/logically possible is a property of Math. All of math follows from axioms. Math doesn't change without axioms changing.

I agree :). This is the problem with theories of counterpossible reasoning. However, it's not clear that this is more of a problem for FDT than for CDT. After all, CDT evaluates causal counterfactuals, which are just as much a fiction as logical counterfactuals. Physics is either random or deterministic (unless we are mistaken about reality), and in neither case are there real mind-independent counterfactuals. Whether or not you take an action is just a fact about the environment.

So, there is no greater problem for FDT; it's just a different problem, and perhaps not even very different. Which is not to say that it's not a big issue -- that's why MIRI is working on it.

> If that's the case, then I assume that you defect in the twin prisoner's dilemma.

I do. I would rather be someone who didn't. But I don't see a path to becoming that person without lobotomizing myself. And it's not a huge concern of mine, since I don't expect to encounter such a dilemma. (Rarely am I the one pointing out that a philosophical thought experiment is unrealistic. It's not usually the point of thought experiments to be realistic--we usually only talk about them to evaluate the consequences of different positions. But it is worth noting here that I don't see this as a major issue for me.) I haven't written this up because I don't think it's particularly urgent to explain to people why I think CDT is correct over FDT. Indeed, in one view, it would be cruel of me to do so! And I don't think it matters much for AI alignment.

> Don't you think that's at least worth looking into?

This was partly why I decided to wade into the weeds, because absent a discussion of how plausible it is that we could affect things non-causally, yes, one's first instinct would be that we should at least look into it. And maybe, like, 0.1% of resources directed toward AI Safety should go toward whether we can change Math, but honestly, even that seems high. Because what we're talking about is changing logical facts. That might be number 1 on my list of intractable problems.

> After all, CDT evaluates causal counterfactuals, which are just as much a fiction as logical counterfactuals.

This is getting subtle :) and it's hard to make sure our words mean things, but I submit that causal counterfactuals are much less fictitious than logical counterfactuals! I submit that it is less extravagant to claim we can affect this world than it is to claim that we can affect hypothetical worlds with which we are not in causal contact. No matter what action I pick, math stays the same. But it's not the case that no matter what action I pick, the world stays the same. (In the former case, which action I pick could in theory tell us something about what mathematical object the physical universe implements, but it doesn't change math.) In both cases, yes, there is only one action that I do take, but assuming we can reason both about causal and logical counterfactuals, we can still talk sensibly about the causal and logical consequences of picking actions I won't in fact end up picking. I don't have a complete answer to "how should we define causal/logical counterfactuals" but I don't think I need to for the sake of this conversation, as long as we both agree that we can use the terms in more or less the same way, which I think we are successfully doing.

> I don't yet see why creating a CDT agent avoids catastrophe better than FDT.

I think running an aligned FDT agent would probably be fine. I'm just arguing that it wouldn't be any better than running a CDT agent (except for the interim phase before Son-of-CDT has been created). And indeed, I don't think any new decision theories will perform any better than Son-of-CDT, so it doesn't seem to me to be a priority for AGI safety. Finally, the fact that no FDT agent has actually been fully defined certainly weighs in favor of just going with a CDT agent.

> > If that’s the case, then I assume that you defect in the twin prisoner’s dilemma.
>
> I do. I would rather be someone who didn’t. But I don’t see a path to becoming that person without lobotomizing myself.

You could just cooperate, without taking such drastic measures, no?

I jumped off a small cliff into a lake once, and when I was standing on the rock, I couldn't bring myself to jump. I stepped back to let another person go, and then I stepped onto the rock and jumped immediately. I might be able to do something similar.

But I wouldn't be able to endorse such behavior while reflecting on it if I were in that situation, given my conviction that I am unable to change math. Indeed, I don't think it would be wise of me to cooperate in that situation. What I really mean when I say that I would rather be someone who cooperated in a twin prisoner's dilemma is "conditioned on the (somewhat odd) hypothetical that I will at some point end up in a high-stakes twin prisoner's dilemma, I would rather it be the case that I am the sort of person who cooperates", which is really saying that I would rather play a twin prisoner's dilemma game against a cooperator than against a defector, which is just an obvious preference for a favorable event to befall me rather than an unfavorable one. In similar news, conditioned on my encountering a situation in the future where somebody checks to see if I am a good person, and if I am, they destroy the world, then I would like to become a bad person. Conditioned on my encountering a situation in which someone saves the world if I am devout, I would like to become a devout person.

If I could turn off the part of my brain that forms the question "but why should I cooperate, when I can't change math?" that would be a path to becoming a reliable cooperator, but I don't see a path to silencing a valid argument in my brain without a lobotomy (short of possibly just cooperating really fast without thinking, and of course without forming the doubt "wait, why am I trying to do this really fast without thinking?").

I think it's worth pointing out that I agree that you can't change math. I don't think I can change math. Yet, I would still cooperate. The whole thing about whether we can literally change math is missing a crux. Thankfully, logical counterfactuals are not construed in such a silly way.

This is similar to the debate over whether free will exists when physics is deterministic. "You can't change the future. It is already fixed..." the poor soul said, before walking off a cliff.