Prerequisites: Provability logic, Updateless decision theory

Outline

Introduction
Formal setting
Main claim describing halting agent behavior
Example
Proof of main claim
Corollary on uniqueness of UDT
Discussion of assumptions

Introduction

(This post was going to be a description of my paper UDT with known search order, but instead it obsoletes that paper.)

Consider the following argument that UDT is the unique decision procedure that performs optimally in simple universes.

Suppose an agent $A$, with a provability oracle for statements like "if $A$ does action $a$, then utility $u$ results," doesn't play chicken for some possible action $a$. In other words, the agent is not committed to taking action $a$ whenever it sees that $A$ provably does not take action $a$. Then this fact would be visible to $PA$; using Löb's theorem, we could prove that $A$ actually does not take action $a$. This argument looks like it is independent of the universe, so in some simple universes where action $a$ would give a large reward, $A$ actually fails to take $a$. Therefore optimal agents must diagonalize against actions that they provably don't take.

This post will carry through a precise version of this argument. First we'll analyze exactly how agents behave when they only have to worry about finitely many possible outcomes. Then we'll apply this to the case where there are unboundedly many possible outcomes.

The busy reader could read just the description of the formal setting, and then skip to the corollary towards the end. The only crucial thing for the statement of the corollary is the notion of an agent that halts for every "possible" consequences oracle. However, the analysis in the main claim gives more information; namely, it says how to compute the behavior of any halting agent in any universe, as long as that universe is transparent in a sense to be defined later.

Formal setting

A universe $U [A]$ is any algorithm that outputs a utility in the set $U \subset N$ . A universe is parameterized by the choice of an agent $A$ , and it may make calls to some level of halting oracle to compute what $A$ does. The set $U$ of possible utilities may be infinite, and we assume that our preference ordering on $U$ is the ordering $<$ on $N$ . An agent is an algorithm $A [U]$ (usually written $A$ ) that selects some action from the finite set $A$ . The universe can observe the behavior of $A$ , but the agent only gains information about $U$ through its oracle, which depends on $U$ .

The agent $A$ makes calls to a consequences oracle $C : A \times U \to \{0, 1\}$, which is always implicitly dependent on $A$ and $U$. We'll work in $PA$, which can talk about $U [A]$ and $A [U]$ via quining. So, the oracle call $C (a, u)$ returns 1 if $PA ⊢ A [U] = a \to U [A] = u$, and $C (a, u)$ returns 0 otherwise.

We make a transparency assumption on the universe: $\forall a \in A \; \exists u \in U : PA ⊢ A [U] = a \to U [A] = u$. This is a strong assumption, but we are trying to prove that UDT is required for optimal behavior in simple universes. We could just as well assume that the universe is a simple map $A \to U$, i.e. $U$ simply checks the action of $A$ and then looks up a corresponding utility in a table. The transparency assumption above is all we need to analyze the behavior of our agents, but we will see that UDT is necessary for optimal behavior even in these simple universes.

For each action $a$, if $C (a, u_{1}) = C (a, u_{2}) = 1$ for some $u_{1} \neq u_{2}$, then $PA ⊢ A \neq a$, and in fact $C (a, u) = 1$ for all $u \in U$. By the transparency of $U$, for each $a \in A$ we have either $C (a, u) = 1$ for all $u \in U$, or else $C (a, u) = 1$ for exactly one $u$. So, we define a consequence profile for $A, U$ to be any map $γ : A \to U \cup \{⊥\}$ such that $\forall a \in A, u \in U : γ (a) = u \Rightarrow C (a, u) = 1$. The interpretation of $γ (a) = u$ is that $C (a, u) = 1$ for $u$ only, and the interpretation of $γ (a) = ⊥$ is that $C (a, u) = 1$ for all $u \in U$; consequence profiles encode all the information that $C$ might contain. We say that a consequence profile $γ$ describes an oracle $C$ if for all $a \in A$, $γ (a) = u \Rightarrow$ ($C (a, u) = 1$ for only $u$), and $γ (a) = ⊥ \Rightarrow$ ($C (a, u) = 1$ for all $u$).

A consequence profile is full if $⊥$ is not in its range, i.e. it gives an actual utility for each action. Two consequence profiles $γ$ and $δ$ agree if there is no $a \in A$ such that $⊥ \neq γ (a) \neq δ (a) \neq ⊥$ ; in other words, they agree if they assign the same utility to an action whenever they both assign an actual utility to that action, rather than assigning $⊥$ . Given $A$ and $U$ , the transparency assumption lets us pick a (possibly non-unique) full consequence profile, which we will denote by $σ$ .
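To make these definitions concrete, here is a minimal sketch of consequence profiles as data. It assumes (this is not from the post) that actions are strings, utilities are integers, and $⊥$ is represented by `None`; the names `Profile`, `is_full`, and `agree` are illustrative only.

```python
from typing import Dict, Optional

Action = str
Profile = Dict[Action, Optional[int]]  # None stands for ⊥ (assumed encoding)


def is_full(gamma: Profile) -> bool:
    """A profile is full if it gives an actual utility for every action."""
    return all(u is not None for u in gamma.values())


def agree(gamma: Profile, delta: Profile) -> bool:
    """Profiles agree if they never assign two different actual utilities."""
    return all(
        gamma[a] is None or delta[a] is None or gamma[a] == delta[a]
        for a in gamma
    )


sigma = {"a0": 5, "a1": 7, "a2": 3}        # a full profile
gamma = {"a0": 5, "a1": None, "a2": None}  # only a0 has a unique consequence
delta = {"a0": 4, "a1": 7, "a2": None}     # conflicts with sigma on a0
```

Here `gamma` agrees with `sigma` because it is silent wherever it holds `None`, while `delta` does not, because both assign an actual utility to `a0` and those utilities differ.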

We require that $U [A]$ halts as long as $A$ halts. We also require that $A$ halts given any consequence profile $γ$ for $A$ and $U$ . Such an agent is called a halting agent.

Main claim describing halting agent behavior

We begin with the case where the algorithm $A$ only ever mentions finitely many utilities. Fix $U$ , and obtain a full consequence profile $σ$ for $A$ and $U$ . It is provable in $PA$ that some $γ$ that agrees with $σ$ describes $C$ , and so $γ$ determines $A$ 's action. So for any such $γ$ , we can write $A (γ) = a$ to mean that $A$ outputs $a$ if it is given an oracle described by $γ$ . By the assumption that $A$ is a halting agent, $A (γ)$ is defined, and by $Σ_{1}$ completeness, $PA$ can see this fact for any specific $γ$ .

Claim: Define a sequence $⟨ γ_{i} ⟩$ of consequence profiles that agree with $σ$ , and a sequence $⟨ a_{i} ⟩$ of actions. First set $γ_{0} (a) = ⊥$ for all $a$ , and set $a_{0} = A (γ_{0})$ . Given $γ_{n}$ and $a_{n}$ , define $γ_{n + 1} = γ_{n} [a_{n} \mapsto σ (a_{n})]$ , i.e. $γ_{n + 1}$ is $γ_{n}$ except that it maps $a_{n}$ to $σ (a_{n})$ . Define $a_{n + 1}$ to be $A (γ_{n + 1})$ . Then it is the case that $A = a_{k}$ , where $k$ is the least index such that for some $i < k$ , $a_{i} = a_{k}$ . Furthermore, $γ_{k}$ describes $C$ for $U$ and $A$ . (The $γ_{i}$ and $a_{i}$ do not depend on the choice of $σ$ , but we will use $σ$ in the proof.)
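The construction in the claim can be traced mechanically. The following is a sketch under stated assumptions: an agent is modeled as a plain Python function from a consequence profile to an action (standing in for $A (γ)$), $⊥$ is encoded as `None`, and `trace_claim`, `chicken_agent`, and `stubborn_agent` are hypothetical names introduced for illustration. The code only computes the sequences $⟨ γ_{i} ⟩$ and $⟨ a_{i} ⟩$; it does not model provability.

```python
from typing import Callable, Dict, List, Optional, Tuple

Action = str
Profile = Dict[Action, Optional[int]]  # None stands for ⊥ (assumed encoding)
Agent = Callable[[Profile], Action]


def trace_claim(agent: Agent, sigma: Dict[Action, int]) -> Tuple[Action, Profile, List[Action]]:
    """Return (a_k, γ_k, [a_0, ..., a_k]) for the least k with a repeat."""
    gamma: Profile = {a: None for a in sigma}  # γ_0 maps every action to ⊥
    seen: List[Action] = []
    while True:
        a = agent(gamma)                       # a_n = A(γ_n)
        if a in seen:                          # first repeated action: stop
            return a, gamma, seen + [a]
        seen.append(a)
        gamma = {**gamma, a: sigma[a]}         # γ_{n+1} = γ_n[a_n ↦ σ(a_n)]


def chicken_agent(gamma: Profile) -> Action:
    """Takes any action whose entry is ⊥ (in a fixed order), else the best."""
    for a in sorted(gamma):
        if gamma[a] is None:
            return a
    return max(gamma, key=lambda a: gamma[a])


def stubborn_agent(gamma: Profile) -> Action:
    """Ignores the oracle entirely."""
    return "a0"


sigma = {"a0": 5, "a1": 7, "a2": 3}
best, gamma_k, path = trace_claim(chicken_agent, sigma)
stuck, gamma_s, path_s = trace_claim(stubborn_agent, sigma)
```

The loop runs at most $| A | + 1$ times, since each pass without a repeat adds a new action to `seen`. For `chicken_agent` the descent visits every action and ends at the full profile; for `stubborn_agent` it stops after one step with only `a0` resolved.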

Example

(This section is skippable.)

Before we prove the claim, an example is perhaps in order. Consider a universe $U$ and an agent $A$ with $| A | = 3$, and fix a full consequence profile $σ$ for $A$ and $U$. What we are looking at here is the strategy that $A$ follows, given the information it might receive from the oracle $C$.

$\begin{matrix} a_{2} & \boxed{a_{0}} ∙ \\ a_{0} & a_{1} ∙ \\ a_{1} & a_{1} \\ \boxed{a_{x}} ∙ & a_{2} ∙ \end{matrix}$

(The boxes and bullets are just markers for us to refer to.) The table is meant to be a cube, where the boxed $a_{x}$ ($x \in \{0, 1, 2\}$) is the lower left corner closest to us, and the boxed $a_{0}$ is the upper right corner farthest from us. Each entry in this cube gives the action $A (γ)$ for some $γ$ that agrees with $σ$. The $i$-th dimension corresponds to whether or not $γ (a_{i}) = ⊥$.

For each action $a$, $A$ might see either just $C (a, σ (a)) = 1$, or else $C (a, u) = 1$ for all $u$. The up-down dimension corresponds to whether or not $C (a_{0}, u)$ holds for all $u$; the back-front dimension corresponds to whether or not $C (a_{1}, u)$ holds for all $u$; and the left-right dimension is for $a_{2}$. So, the boxed $a_{0}$ indicates that $A$ will perform action $a_{0}$ if $C$ returns 1 for every query; the $a_{2}$ to the left of the boxed $a_{0}$ indicates that $A$ does $a_{2}$ if $C$ returns 1 for every query about $a_{0}$ or $a_{1}$, and returns 0 for $(a_{2}, u)$ if $u \neq σ (a_{2})$; the boxed $a_{x}$ indicates that $A$ does $a_{x}$ if $C$ returns 0 for $(a_{i}, u)$ if $u \neq σ (a_{i})$, for $i \in \{0, 1, 2\}$; and so on.

Then the claim says that we start at the boxed $a_{0}$ , with $γ_{0} \equiv ⊥$ , and follow the bullets. First $γ_{1} (a_{0}) = σ (a_{0})$ ; then $γ_{2} (a_{1}) = σ (a_{1})$ ; and finally $γ_{3} (a_{2}) = σ (a_{2})$ takes us to $a_{x}$ . The conclusion of the claim is that $A$ does $a_{x}$ , since that is the first entry in the bulleted descent that repeats an action. (This is desirable behavior, as we can choose $a_{x}$ to be the $a_{i}$ with the greatest $σ (a_{i})$ . The claim says that in this situation, $γ_{3} = σ$ describes the true oracle $C$ , so this $a_{i}$ will be optimal.)

If we replaced $a_{1} ∙$ with $a_{0} ∙$, the claim says that $A$ does $a_{0}$, as that is the first repeated action. In this case, we could have reached the conclusion pretty easily: by the previous discussion, $PA$ knows that $A$ does the action dictated by this table. So $PA$ can prove that if $□_{PA} ┌ A [U] \neq a_{1} \wedge A [U] \neq a_{2} ┐$, then $A$ does some action in the back right of the cube. Both of those actions are $a_{0}$, so $A$ does not do $a_{1}$ or $a_{2}$. Hence $PA ⊢ □_{PA} ┌ A [U] \notin \{a_{1}, a_{2}\} ┐ \to A [U] \notin \{a_{1}, a_{2}\}$, and by Löb's theorem, $PA ⊢ A [U] \notin \{a_{1}, a_{2}\}$. Running the same argument again, we see that $A$ in fact does $a_{0}$.
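The bulleted descent through the cube, and the variant just described, can be replayed with a small lookup table. This is only a sketch: the table is keyed by the set of actions $a$ with $γ (a) = ⊥$, the utilities in `SIGMA` are made up, and the cells that lie off the bulleted path are filled with an arbitrary placeholder, since by the claim they are never consulted. The names `cube` and `descend` are hypothetical.

```python
ACTIONS = ("a0", "a1", "a2")
SIGMA = {"a0": 1, "a1": 3, "a2": 2}   # hypothetical utilities
A_X = max(SIGMA, key=SIGMA.get)       # choose a_x as the best action


def cube(second_entry):
    """Strategy table keyed by the set of actions a with γ(a) = ⊥."""
    return {
        frozenset(ACTIONS): "a0",               # boxed a_0: every query returns 1
        frozenset({"a0", "a1"}): "a2",          # entry left of the boxed a_0
        frozenset({"a1", "a2"}): second_entry,  # first bullet after the boxed a_0
        frozenset({"a2"}): "a2",                # second bullet
        frozenset(): A_X,                       # boxed a_x: full profile
    }


def descend(table):
    """Follow the bullets until an action repeats; return it with the path."""
    unknown, visited = set(ACTIONS), []
    while True:
        # Off-path cells are never reached here; "a0" is just a placeholder.
        action = table.get(frozenset(unknown), "a0")
        if action in visited:
            return action, visited + [action]
        visited.append(action)
        unknown.discard(action)  # the next profile records σ(action)


original = descend(cube("a1"))  # the cube as drawn
variant = descend(cube("a0"))   # the a_1 bullet replaced by a_0
```

In the original cube the descent passes through all three actions before repeating at $a_{x}$, while in the variant it stops after a single step at $a_{0}$.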

Proof of main claim

Recall the claim: we have a sequence $⟨ a_{i} ⟩$ of actions and a sequence $⟨ γ_{i} ⟩$ of consequence profiles that agree with $σ$ . We chose $γ_{0} (a) \equiv ⊥$ , $a_{i} = A (γ_{i})$ for all $i$ , and $γ_{n + 1} = γ_{n} [a_{n} \mapsto σ (a_{n})]$ . The claim says that $A = a_{k}$ for the least $k$ such that $a_{i} = a_{k}$ for some $i < k$ , and the oracle $C$ is described by $γ_{k}$ . Note that, by definition, $γ_{k} (a_{i}) = σ (a_{i})$ for $i < k$ , and $γ_{k} (a_{i}) = ⊥$ for $i \geq k$ .

Proof of claim. First we prove, by induction on $k$ starting at zero, the following statement: $\forall j < k : PA ⊢ □ ┌ A [U] \neq a_{j} ┐ \to □ ┌ A [U] \in \{a_{i} ∣ i < j\} ┐$. This is vacuous for $k = 0$. So suppose we have the hypothesis for some $k > 0$. We want to show that $PA ⊢ □ ┌ A [U] \neq a_{k} ┐ \to □ ┌ A [U] \in \{a_{i} ∣ i < k\} ┐$. Reasoning in $PA$, suppose that $□ ┌ A [U] \neq a_{k} ┐$, and suppose also that $□ ┌ A [U] \in \{a_{i} ∣ i < k\} ┐$.

Let $j$ be the smallest number such that $□ ┌ A [U] \neq a_{j} ┐$. By our suppositions, $j < k$. By the inductive hypothesis, then, $\forall a \notin \{a_{i} ∣ i < j\} : □ ┌ A [U] \neq a ┐$. This tells us everything we need to know about $C$ to deduce what $A$ does; we know that $C$ is described by $γ_{j}$. Therefore the action is $A (γ_{j}) = a_{j}$. In particular, $A [U] \in \{a_{i} ∣ i < k\}$.

We have shown that $PA ⊢ □ ┌ A [U] \neq a_{k} ┐ \to (□ ┌ A [U] \in \{a_{i} ∣ i < k\} ┐ \to A [U] \in \{a_{i} ∣ i < k\})$. This argument goes through if everything is "one level deeper," i.e. wrapped in an additional $□$ (or, we can apply $Σ_{1}$ completeness of $PA$ and distributivity of $□$ over $\to$). This gives $PA ⊢ □ ┌ A [U] \neq a_{k} ┐ \to □ ┌ □ ┌ A [U] \in \{a_{i} ∣ i < k\} ┐ \to A [U] \in \{a_{i} ∣ i < k\} ┐$. By formalized Löb's theorem, we have the desired conclusion: $PA ⊢ □ ┌ A [U] \neq a_{k} ┐ \to □ ┌ A [U] \in \{a_{i} ∣ i < k\} ┐$.

Now, let $k$ be as in the claim, so it is the least index such that $a_{i} = a_{k}$ for some $i < k$. Reasoning again in $PA$, suppose that $□ ┌ A [U] \in \{a_{i} ∣ i < k\} ┐$. Then $\forall a \notin \{a_{i} ∣ i < k\} : □ ┌ A [U] \neq a ┐$. Using the statement just proved and the previous argument about what $C$ can look like, this gives that $\exists i \leq k : A [U] = A (γ_{i})$. Since all such actions $A (γ_{i})$ are in $\{a_{i} ∣ i < k\}$, this shows that $A [U] \in \{a_{i} ∣ i < k\}$.

We just proved that $PA ⊢ □ ┌ A [U] \in \{a_{i} ∣ i < k\} ┐ \to A [U] \in \{a_{i} ∣ i < k\}$, so by Löb's theorem, $PA ⊢ A [U] \in \{a_{i} ∣ i < k\}$. Running the same argument again, we know that $\exists i \leq k : A [U] = A (γ_{i})$, and we know that $C$ is described by $γ_{i}$ for some $i \leq k$. But for $i < k$, $γ_{i} (a_{i}) = ⊥$ and $A (γ_{i}) = a_{i}$. So by soundness, no such $γ_{i}$ can describe $C$. Hence $γ_{k}$ describes $C$ and $A [U] = A (γ_{k}) = a_{k}$. QED
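For reference, here is a sketch of the two standard forms of Löb's theorem that the proof invokes, together with how the formalized form closes the inductive step. The letters $P$ and $Q$ are labels introduced only for this summary; $Q$ abbreviates the sentence $A [U] \in \{a_{i} ∣ i < k\}$.

```latex
% Löb's theorem, for any sentence P:
\text{If } PA \vdash \Box\ulcorner P \urcorner \to P \text{, then } PA \vdash P.

% Formalized Löb's theorem:
PA \vdash \Box\ulcorner \Box\ulcorner P \urcorner \to P \urcorner \to \Box\ulcorner P \urcorner.

% Inductive step, with Q := (A[U] \in \{a_i \mid i < k\}):
PA \vdash \Box\ulcorner A[U] \neq a_k \urcorner \to \Box\ulcorner \Box\ulcorner Q \urcorner \to Q \urcorner
\quad\text{and}\quad
PA \vdash \Box\ulcorner \Box\ulcorner Q \urcorner \to Q \urcorner \to \Box\ulcorner Q \urcorner
\quad\Longrightarrow\quad
PA \vdash \Box\ulcorner A[U] \neq a_k \urcorner \to \Box\ulcorner Q \urcorner.
```

The last line is just the two implications chained together, which is the conclusion of the inductive step.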

Corollary on uniqueness of UDT

Let us consider universes with $| U |$ infinite, i.e. there are infinitely many possible outcomes. (Each universe only has finitely many accessible outcomes; we are looking at a set of universes that are allowed to have outcomes in $N$, so the agent will have to ask $C$ about unboundedly many $(a, u)$ pairs.) We restrict to universes that are very simple: they are just maps from $A$ to $U$ that are easy to reason about. Recall that we assumed that agents are algorithms, modulo their oracles. In other words, for any consequence profile, they eventually halt given an oracle described by that profile. Now we give a corollary of the above claim, stating that UDT is unique among halting agents that are optimal for simple universes.

Corollary. Suppose a halting agent $A$ is optimal for all universes that implement a simple function $A \to U$ . Then $A$ implements UDT, in the sense that $A$ checks if each action is provably not taken; takes any such action; and otherwise takes the action with the highest provable consequence. The order in which actions are diagonalized against can only depend on the actual consequences of actions that have already been checked. (So there is some action $a$ that $A$ always diagonalizes against; then for each $u \in U$ , there is an action $a_{u}$ such that whenever $C (a, u) = 1$ , $A$ diagonalizes against $a_{u}$ ; for each $u, v \in U$ , there is an action $a_{u v}$ such that whenever $C (a, u) = 1$ and $C (a_{u}, v) = 1$ , $A$ diagonalizes against $a_{u v}$ ; and so on.)

Proof. Fix some agent $A$ with $| A | = n$ . We will show that either $A$ fails on some simple universe $U$ , or else $A$ is UDT in the above sense. Suppose that for some simple $U$ that implements the function $f : A \to U$ , the oracle $C$ of $A [U]$ is not full, meaning that for some $a \in A$ , $C (a, u) = 1$ for all $u \in U$ . So $k < n$ , where $k$ is the index given by the main claim such that $A [U] = A (γ_{k}) = a_{k}$ .

Notice that the behavior dictated by the claim does not depend on the "true" consequences of all the actions in $A - \{a_{i} ∣ i \leq k\}$; we can show that the action $a_{k}$ happens just knowing what $A$ does for $γ_{i}$, $i \leq k$. So let's modify $U$ to obtain a simple universe $U^{'}$ implementing $f^{'} = f [a \mapsto u^{'}]$, where $a$ is an action for which $PA ⊢ A [U] \neq a$, and $u^{'}$ is some utility larger than any number in the range of $f$. Now $A [U^{'}] \neq a$, and so $A$ is suboptimal in a simple universe.
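The modification of $U$ into $U^{'}$ can be illustrated with a toy agent. This is a sketch, assuming that a simple universe is represented directly by its table $f$ (which then serves as the full profile $σ$), that $⊥$ is encoded as `None`, and that the agent's behavior is computed by the descent from the main claim. The names `lazy_agent` and `trace_claim` are hypothetical.

```python
def trace_claim(agent, sigma):
    """Descent from the main claim: returns (a_k, γ_k)."""
    gamma = {a: None for a in sigma}  # γ_0 maps every action to ⊥
    seen = []
    while True:
        a = agent(gamma)
        if a in seen:
            return a, gamma
        seen.append(a)
        gamma = {**gamma, a: sigma[a]}


def lazy_agent(gamma):
    """Plays chicken against a0 and a1 only, then takes the best known action."""
    for a in ("a0", "a1"):            # never diagonalizes against a2
        if gamma[a] is None:
            return a
    known = {a: u for a, u in gamma.items() if u is not None}
    return max(known, key=known.get)


f = {"a0": 5, "a1": 7, "a2": 3}                 # a simple universe, as a table
action, gamma_k = trace_claim(lazy_agent, f)

# Actions still mapped to ⊥ in γ_k are the ones provably not taken.
unchecked = [a for a, u in gamma_k.items() if u is None]

# f' = f[a ↦ u'] with u' larger than anything in the range of f.
f_prime = {**f, unchecked[0]: max(f.values()) + 1}
action_prime, _ = trace_claim(lazy_agent, f_prime)
```

The descent never looks at the entry for `a2`, so the agent takes the same action in both universes and misses the larger reward in the second one.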

This characterizes the behavior of any optimal $A$ : $A$ must play chicken against each action in turn, and then choose the action with the highest unique consequence. Here, "playing chicken" means that $A (γ_{i}) = a_{i}$ , even though $γ_{i} (a_{i}) = ⊥$ .

When choosing which action to diagonalize against next, $A$ must choose only based on the provable consequences of the actions it has already diagonalized against; this is just saying that $A (γ_{i})$ is well-defined. If $A$ does not vary this choice, then there is a fixed enumeration $\{a_{0}, a_{1}, \dots, a_{n - 1}\}$ of $A$ such that $A$ plays chicken against each $a_{i}$ in order, in every universe. This recovers the usual formulation of UDT.
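As an illustration of an agent whose diagonalization order does vary, here is a sketch in which the next target depends only on the utilities already seen. The selection rule (index by the sum of known utilities) is an arbitrary assumption chosen for the example, as are the names `adaptive_udt` and `outcome`; $⊥$ is encoded as `None`, and a simple universe is represented by its table $f$.

```python
from itertools import product

ACTIONS = ("a0", "a1", "a2")


def adaptive_udt(gamma):
    """Chicken against every ⊥ action; the order depends on seen utilities."""
    known = {a: u for a, u in gamma.items() if u is not None}
    unknown = [a for a in ACTIONS if gamma[a] is None]
    if unknown:
        # Hypothetical rule: the sum of known utilities selects the next target.
        return unknown[sum(known.values()) % len(unknown)]
    return max(known, key=known.get)


def outcome(agent, f):
    """Descent from the main claim in the simple universe with table f."""
    gamma = {a: None for a in f}
    seen = []
    while True:
        a = agent(gamma)
        if a in seen:
            return a, seen
        seen.append(a)
        gamma = {**gamma, a: f[a]}


# Exhaustively check optimality on all tables with utilities in {0, 1, 2, 3}.
all_optimal = all(
    f[outcome(adaptive_udt, f)[0]] == max(f.values())
    for values in product(range(4), repeat=len(ACTIONS))
    for f in [dict(zip(ACTIONS, values))]
)

order_even = outcome(adaptive_udt, {"a0": 0, "a1": 5, "a2": 2})[1]
order_odd = outcome(adaptive_udt, {"a0": 1, "a1": 5, "a2": 2})[1]
```

The first target is always `a0`, since nothing is known yet, and the second target switches between `a1` and `a2` depending on the utility found for `a0`. In every table checked the agent still ends at an action of maximal utility.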

Discussion of assumptions

The assumption that $A$ must halt can be slightly relaxed; it only needs to halt for each $γ_{i}$ . However, this might still be too restrictive. It is possible, as far as I know, that there is an agent that doesn't halt given some of those "possible" oracles, and yet performs optimally on simple universes (because none of those oracles ever actually happen, or when they do the default action is the correct one).

The transparency assumption on the universe doesn't weaken the conclusion about uniqueness of UDT, but it prevents applying the analysis of oracle-agent behavior to multi-agent situations, which are the most interesting universes. If there are multiple agents, it can be impossible to prove any consequences, because the other agent also has a provability oracle. Then $A$ can't even prove that the other agent's oracle won't always return 1 (as it would if $PA$ were inconsistent, which $PA$ cannot rule out), so it can't in general prove what happens in the universe.

[Tangent: Restricting to an oracle for statements of the form $PA ⊢ A [U] = a \to U [A] = u$ is meant to model an agent that has an oracle for expected utility of actions. Allowing an abstract decision procedure to have unrestricted oracles (or rather, any access to the code for $U$ ) makes it difficult (for me) to define UDT in a reasonable way. For example, say there are two agents, $A_{1}$ and $A_{2}$ , both using UDT. When $A_{1}$ considers what strategy would be best for UDT to follow, it decides that it would be a great idea if any UDT that finds itself in the shoes of $A_{2}$ immediately donates all its resources to $A_{1}$ . This agent is delusional. One way to fix this problem is to make a decision procedure be a function of only the expected utility of actions (or some other sort of information mentioning only the relationship between actions and utility.) The decision procedure should be abstracted from the utility function, so it should be abstracted away from the concrete world model.]
