Prerequisites: Provability logic, Updateless decision theory


  • Introduction
  • Formal setting
  • Main claim describing halting agent behavior
  • Example
  • Proof of main claim
  • Corollary on uniqueness of UDT
  • Discussion of assumptions


(This post was going to be a description of my paper UDT with known search order, but instead it obsoletes that paper.)

Consider the following argument that UDT is the unique decision procedure that performs optimally in simple universes.

Suppose an agent , with a provability oracle for statements like "if does action , then utility results," doesn't play chicken for some possible action . In other words, the agent is not committed to taking action whenever it sees that provably does not take action . Then this fact would be visible to ; using Lob's theorem, we could prove that actually does not take action . This argument looks like it is independent of the universe, so in some simple universes where action would give a large reward, actually fails to take . Therefore optimal agents must diagonalize against actions that they provably don't take.

This post will carry through a precise version of this argument. First we'll analyze exactly how agents behave when they only have to worry about finitely many possible outcomes. Then we'll apply this to the case where there are unboundedly many possible outcomes.

The busy reader could read just the description of the formal setting, and then skip to the corollary towards the end. The only crucial thing for the statement of the corollary is the notion of an agent that halts for every "possible" consequences oracle. However, the analysis in the main claim gives more information; namely, it says how to compute the behavior of any halting agent in any universe, as long as that universe is transparent in a sense to be defined later.

Formal setting

A universe is any algorithm that outputs a utility in the set . A universe is parameterized by the choice of an agent , and it may make calls to some level of halting oracle to compute what does. The set of possible utilities may be infinite, and we assume that our preference ordering on is the ordering on . An agent is an algorithm (usually written ) that selects some action from the finite set . The universe can observe the behavior of , but the agent only gains information about through its oracle, which depends on .

The agent makes calls to a consequences oracle $C : A \times U \to {0,1} $, which is always implicitly dependent on and . We'll work in , which can talk about and via quining. So, the oracle call returns 1 if , and returns 0 otherwise.

We make a transparency assumption on the universe: This is a strong assumption, but we are trying to prove that UDT is \emph{required} for optimal behavior in simple universes. We could just as well assume that the universe is a simple map , i.e. simply checks the action of and then looks up a corresponding utility in a table. The transparency assumption above is all we need to analyze the behavior of our agents, but we will see that UDT is necessary for optimal behavior even in the simple universes.

For each action , if for some , then , and in fact for all . By the transparency of , for each we have either for all , or else for exactly one . So, we define a consequence profile for to be any map such that The interpretation of is that for only, and the interpretation of is that for all ; consequence profiles encode all the information that might contain. We say that a consequence profile describes an oracle if for all , ( for only ), and ( for all ).

A consequence profile is full if is not in its range, i.e. it gives an actual utility for each action. Two consequence profiles and agree if there is no such that ; in other words, they agree if they assign the same utility to an action whenever they both assign an actual utility to that action, rather than assigning . Given and , the transparency assumption lets us pick a (possibly non-unique) full consequence profile, which we will denote by .

We require that halts as long as halts. We also require that halts given any consequence profile for and . Such an agent is called a halting agent.

Main claim describing halting agent behavior

We begin with the case where the algorithm only ever mentions finitely many utilities. Fix , and obtain a full consequence profile for and . It is provable in that some that agrees with describes , and so determines 's action. So for any such , we can write to mean that outputs if it is given an oracle described by . By the assumption that is a halting agent, is defined, and by completeness, can see this fact for any specific .

Claim: Define a sequence of consequence profiles that agree with , and a sequence of actions. First set for all , and set . Given and , define , i.e. is except that it maps to . Define to be . Then it is the case that , where is the least index such that for some , . Furthermore, describes for and . (The and do not depend on the choice of , but we will use in the proof.)


(This section is skippable.)

Before we prove the claim, an example is perhaps in order. Consider a universe and an agent with , and fix a full consequence profile for and . What we are looking at here is the strategy that follows, given the information it might receive from the oracle . (The boxes and bullets are just markers for us to refer to.) The table is meant to be a cube, where the boxed () is the lower left corner closest to us, and the boxed is the upper right corner farthest from us. Each entry in this cube gives the action for some that agrees with . The -th dimension corresponds to whether or not .

For each action , might see either just , or else for all . The up-down dimension corresponds to whether or not holds for all ; the back-front dimension corresponds to whether or not holds for all ; and the left-right dimension is for . So, the boxed indicates that will perform action if returns 1 for every query; the to the left of the boxed indicates that does if returns 1 for every query about or , and returns 0 for if ; the boxed indicates that does if returns 0 for if , for ; and so on.

Then the claim says that we start at the boxed , with , and follow the bullets. First ; then ; and finally takes us to . The conclusion of the claim is that does , since that is the first entry in the bulleted descent that repeats an action. (This is desirable behavior, as we can choose to be the with the greatest . The claim says that in this situation, describes the true oracle , so this will be optimal.)

If we replaced with , the claim says that does , as that is the first repeated action. In this case, we could have reached the conclusion pretty easily: by the previous discussion, knows that does the action dictated by this table. So can prove that if $\square_{\textsf{PA}}\ulcorner {\cal A}[{\cal U}] \ne a_1 \wedge {\cal A}[{\cal U}] \ne a_2 \urcorner $, then does some action in the back right of the cube. Both of those actions are , so does not do or . Hence and by Lob's theorem, Running the same argument again, we see that in fact does .

Proof of main claim

Recall the claim: we have a sequence of actions and a sequence of consequence profiles that agree with . We chose , for all , and . The claim says that for the least such that for some , and the oracle is described by . Note that, by definition, for , and for .

Proof of claim. First we prove, by induction on starting at zero, the following statement: This is vacuous for . So suppose we have the hypothesis for some . We want to show that Reasoning in , suppose that , and suppose also that .

Let be the smallest number such that . By our suppositions, . By the inductive hypothesis, then, . This tells us everything we need to know about to deduce what does; we know that is described by . Therefore the action is . In particular, .

We have shown that This argument goes through if everything is "one level deeper," i.e. wrapped in an additional (or, we can apply completeness of and distributivity of over $\to $). This gives By formalized Lob's theorem, we have the desired conclusion: Now, let be as in the claim, so it is the least index such that for some . Reasoning again in , suppose that . Then . Using the statement just proved and the previous argument about what can look like, this gives that . Since all such actions are in , this shows that .

We just proved that so by Lob's theorem, Running the same argument again, we know that , and we know that is described by for some . But for , and . So by soundness, no such can describe . Hence describes and . QED

Corollary on uniqueness of UDT

Let us consider universes with infinite---there are infinitely many possible outcomes. (Each universe only has finitely many accessible outcomes; we are looking at a set of universes that are allowed to have outcomes in , so the agent will have to ask about unboundedly many pairs.) We restrict to universes that are very simple: they are just maps from to that are easy to reason about. Recall that we assumed that agents are algorithms, modulo their oracles. In other words, for any consequence profile, they eventually halt given an oracle described by that profile. Now we give a corollary of the above claim, stating that UDT is unique among halting agents that are optimal for simple universes.

Corollary. Suppose a halting agent is optimal for all universes that implement a simple function . Then implements UDT, in the sense that checks if each action is provably not taken; takes any such action; and otherwise takes the action with the highest provable consequence. The order in which actions are diagonalized against can only depend on the actual consequences of actions that have already been checked. (So there is some action that always diagonalizes against; then for each , there is an action such that whenever , diagonalizes against ; for each , there is an action such that whenever and , diagonalizes against ; and so on.)

Proof. Fix some agent with . We will show that either fails on some simple universe , or else is UDT in the above sense. Suppose that for some simple that implements the function , the oracle of is not full, meaning that for some , for all . So , where is the index given by the main claim such that .

Notice that the behavior dictated by the claim does not depend on the "true" consequences of all the actions in ; we can show that the action happens just knowing what does for , . So let's modify to obtain a simple universe implementing , where is an action for which , and is some utility larger than any number in the range of . Now , and so is suboptimal in a simple universe.

This characterizes the behavior of any optimal : must play chicken against each action in turn, and then choose the action with the highest unique consequence. Here, "playing chicken" means that , even though .

When choosing which action to diagonalize against next, must choose only based on the provable consequences of the actions it has already diagonalized against; this is just saying that is well-defined. If does not vary this choice, then there is a fixed enumeration of such that plays chicken against each in order, in every universe. This recovers the usual formulation of UDT.

Discussion of assumptions

The assumption that must halt can be slightly relaxed; it only needs to halt for each . However, this might still be too restrictive. It is possible, as far as I know, that there is an agent that doesn't halt given some of those "possible" oracles, and yet performs optimally on simple universes (because none of those oracles ever actually happen, or when they do the default action is the correct one).

The transparency assumption on the universe doesn't weaken the conclusion about uniqueness of UDT, but it prevents applying the analysis of oracle-agent behavior to multi-agent situations, which are the most interesting universes. If there are multiple agents, it can be impossible to prove any consequences, because the other agent also has a provability oracle. Then can't even prove that the other agent's oracle won't always return 1 (because is inconsistent), so it can't in general prove what happens in the universe.

[Tangent: Restricting to an oracle for statements of the form is meant to model an agent that has an oracle for expected utility of actions. Allowing an abstract decision procedure to have unrestricted oracles (or rather, any access to the code for ) makes it difficult (for me) to define UDT in a reasonable way. For example, say there are two agents, and , both using UDT. When considers what strategy would be best for UDT to follow, it decides that it would be a great idea if any UDT that finds itself in the shoes of immediately donates all its resources to . This agent is delusional. One way to fix this problem is to make a decision procedure be a function of only the expected utility of actions (or some other sort of information mentioning only the relationship between actions and utility.) The decision procedure should be abstracted from the utility function, so it should be abstracted away from the concrete world model.]

Personal Blog


New Comment