by jaek

# 9

I'm writing this for blog day at MSFP. I thought about a lot of things here like category theory, the 1-2-3 conjecture and Paul Christiano's agenda. I want to start by thanking everyone for having me and saying I had a really good time. At this point I intend to go back to thinking about the stuff I was thinking about before MSFP (random matrix theory). But I learned a lot and I'm sure some of it will come to be useful. This blog is about (my confusion of) decision theory.

Before the workshop I hadn't read much besides Eliezer's paper on FDT and my impression was that it was mostly a good way of thinking about making decisions and at least represented progress over EDT and CDT. After thinking more carefully about some canonical thought experiments I'm no longer sure. I suspect many of the concrete thoughts which follow will be wrong in ways that illustrate very bad intuitions. In particular I think I am implicitly guided by non-example number 5 of an aim of decision theory in Wei Dai's post on the purposes of decision theory. I welcome any corrections or insights in the comments.

## The Problem of Decision Theory

First I'll talk about what I think decision theory is trying to solve. Basically I think decision theory is the theory of how one should[1] decide on an action after one already understands: The actions available, the possible outcomes of actions, the probabilities of those outcomes and the desirability of those outcomes. In particular the answers to the listed questions are only adjacent to decision theory. I sort of think answering all of those questions is in fact harder than the question posed by decision theory. Before doing any reading I would have naively expected that the problem of decision theory, as stated here, was trivial but after pulling on some edge cases I see there is room for a lot of creative and reasonable disagreement.

A lot of the actual work in decision theory is the construction of scenarios in which ideal behavior is debatable or unclear. People choose their own philosophical positions on what is rational in these hairy situations and then construct general procedures for making decisions which they believe behave rationally in a wide class of problems. These constructions are a concrete version of formulating properties one would expect an ideal decision theory to have.

One such property is that an ideal decision theory shouldn't choose to self modify in some wide vaguely defined class of "fair" problems. An obviously unfair problem would be one in which the overseer gives CDT $10 and any other agent$0. One of my biggest open questions in decision theory is where this line between fair and unfair problems should lie. At this point I am not convinced any problem where agents in the environment have access to our decision theory's source code or copies of our agent are fair problems. But my impression from hearing and reading what people talk about is that this is a heretical position.

## Newcomb's Problem

Let's discuss Newcomb's problem in detail. In this problem there are two boxes one of which you know contains a dollar. In the other box an entity predicting your action may or may not put a million dollars. They put a million dollars if and only if they predict you will only take one box. What do you do if the predictor is 99 percent accurate? How about if it is perfectly accurate? What if you can see the content of the boxes before you make your decision?

An aside on why Newcomb's problem seems important: It is sort of like a prisoner's dilemma. To see the analogy imagine you're playing a classical prisoner's dilemma against a player who can reliably predict your action and then chooses to match it. Newcomb's problem seems important because prisoner's dilemmas seem like simplifications of situations which really do occur in real life. The tragedy of prisoner dilemmas is that game theory suggests you should defect but the real world seems like it would be better if people cooperated.

Newcomb's problem is weird to think about because the predictor and agent's behaviors are logically connected but not causally. That is, if you tell me what the agent does or what the predictor predicts as an outside observer I can guess what the other does with high probability. But once the predictor predicts the agent could still take either option and flip flopping won't flip flop the predictor. Still one may argue you should one box because being a one boxer going into the problem means you will likely get more utility. I disagree with this view and see Newcomb's problem as punishing rational agents.

If Newcomb's problem is ubiquitous and one imagines an agent walking down the street constantly being Newcombed it is indeed unfortunate if they are doomed to two box. They'll end up with far fewer dollars. But this thought experiment is missing an important part of real world detail in my view. How the predictors predict the agents behavior. There are three possibilities:

• The predictors have a sophisticated understanding of the agent's inner workings and use it to simulate the agent to high fidelity.
• The predictors have seen many agents like our agent doing problems like this problem and use this to compute a probability of our agent's choice and compare it to a decision threshold.
• The predictor has been following the behavior of our agent and uses this history to assign its future behavior a probability.

In the third bullet the agent should one box if they predict they are likely to be Newcombed often[2]. In the second bullet they should one box if they predict that members of their population will be Newcombed often and they derive more utility from the extra dollars their population will get than the extra dollar they could get for themselves. I have already stated I see the first bullet as an unfair problem.

My big complaint with mind reading is that there just isn't any mind reading. All my understandings of how people behave comes from observing how they behave in general, how the human I'm trying to understand behaves specifically, whatever they have explicitly told me about their intentions and whatever self knowledge I have I believe is applicable to all humans. Nowhere in the current world do people have to make decisions under the condition of being accurately simulated.

Why then do people develop so much decision theory intended to be robust in the presence of external simulators? I suppose its because there's an expectation that this will be a major problem in the future which should be solved philosophically before it is practically important. Mind reading could become important to humans if mind surveillance because possible and deployed. I don't think such a thing is possible in the near term or likely even in the fullness of time. But I also can't think of any insurmountable physical obstructions so maybe I'm too optimistic.

Mind reading is relevant to AI safety because whatever AGI is created will likely be a program on a computer somewhere which could reason its program stack is fully transparent or its creators are holding copies of it for predictions.

## Conclusion

Having written that last paragraph I suddenly understand why decision theory in the AI community is the way it is. I guess I wasn't properly engaging with the premises of the thought experiment. If one actually did tell me I was about to do a Newcomb experiment I would still two box because knowing I was in the real world I wouldn't really believe that an accurate predictor would be deployed against me. But an AI can be practically simulated and what's more can reason that it is just a program run by a creator that could have created many copies of it.

I'm going to post this anyway since it's blog-day and not important-quality-writing day but I'm not sure this blog has much of a purpose anymore.

1. This may read like I'm already explicitly guided by the false purpose Wei Dai warned against. My understanding is that the goal is to understand ideal decision making. Just not for the purposes of implementation. ↩︎

2. I don't really know anything but I imagine the game theory of reputation is well developed ↩︎

# 9

Mentioned in
New Comment
One of my biggest open questions in decision theory is where this line between fair and unfair problems should lie.

I think the current piece that points at this question most directly is Success-First Decision Theories by Preston Greene.

At this point I am not convinced any problem where agents in the environment have access to our decision theory's source code or copies of our agent are fair problems. But my impression from hearing and reading what people talk about is that this is a heretical position.

It seems somewhat likely to me that agents will be reasoning about each other using access to source code fairly soon (if just human operators evaluating whether or not to run intelligent programs, or what inputs to give to those programs). So then the question is something like: "what's the point of declaring a problem unfair?", to which the main answer seems to be "to spend limited no free lunch points." If I perform poorly on worlds that don't exist in order to perform better on worlds that do exist, that's a profitable trade.

I disagree with this view and see Newcomb's problem as punishing rational agents.
...
My big complaint with mind reading is that there just isn't any mind reading.

One thing that seems important (for decision theories implemented by humans or embedded agents, as distinct from decision theories implemented by Cartesian agents) is whether or not the decision theory is robust to ignorance / black swans. That is, if you bake into your view of the world that mind reading is impossible, then you can be durably exploited by any actual mind reading (whereas having some sort of ontological update process or low probability on bizarre occurrences allows you to only be exploited a finite number of times).

But note the connection to the earlier bit--if something is actually impossible, then it feels costless to give up on it in order to perform better in the other worlds. (My personal resolution to counterfactual mugging, for example, seems to rest on an underlying belief that it's free to write off logically inconsistent worlds, in a way that it's not free to write off factually inconsistent worlds that could have been factually consistent / are factually consistent in a different part of the multiverse.)

I think people make decisions based on accurate models of other people all the time. I think of Newcomb's problem as the limiting case where Omega has extremely accurate predictions, but that the solution is still relevant even when "Omega" is only 60% likely to guess correctly. A fun illustration of a computer program capable of predicting (most) humans this accurately is the Aaronson oracle.

Well, there are other problems besides Newcomb. Something like UDT can be motivated by simulations, or amnesia, or just multiple copies of the AI trying to cooperate with each other. All these lead to pretty much the same theory, that's why it's worth thinking about.

Thanks for your comment. I'll look into those other problems.

I'm going to post this anyway since its blog-day and not important-quality-writing day but I'm not sure this blog has much of a purpose anymore.

I liked the characterization of decision theory and the comment that the problem naively seems trivial from this perspective. Also liked the description of Newcomb's problem as a version of the prisoners dilemma. So it totally had a purpose!

I have already stated I see the third bullet as an unfair problem.

Should this be "the first bullet"?

Fixed

As you may know, CDT has a lot of fans in academia. It might be interesting to consider what they have to say about Newcomb's Problem (and other supposed counter-examples to CDT).

In "The Foundations of Causal Decision Theory", James Joyce argues that Newcomb's Problem is "unfair" on the grounds that it treats EDT and CDT agents differently. An EDT agent is given two good choices ($1,000,000 and$1,001,000) whereas a CDT agent is given two bad choices ($0 and$1,000). If you wanted to represent Newcomb's Problem as a Markov Decision Process then you would have to put EDT and CDT agents in different MDPs. Lo and behold, the EDT agent gets more money, but this is (according to Joyce) just because it is given an unfair advantage. Hence Newcomb's Problem isn't really too different from the obviously unfair "decision" problem you gave above, the unfairness is just obfuscated. The fact that EDT outperforms CDT in a situation in which EDT agents are unconditionally given more money than CDT agents is not an interesting objection to CDT, and so Newcomb's Problem is not an interesting objection to CDT (according to Joyce).

It might be worth thinking about this argument. Note that this argument operates at the level of individual decision problems, and doesn't say anything about whether its worth taking into account the possibility that different sorts of agents might tend end up in different sorts of situations. It also presumes a particular way of answering the question of whether two decision problems are "the same" problem.

I also want to note that you don't need perfect predictors, or anything even close to that, to create Newcomblike situations. Even if the Predictor's accuracy is only somewhat better than a coin flip this is sufficient to make the causal expected utility different from the evidential expected utility. The key property is that which action you take constitutes evidence about the state of the environment, which can happen in many ways.