Improving the modal UDT optimality result

AI ALIGNMENT FORUM

I recently posted about an optimality result for modal UDT, which shows that for every modal decision problem $\vec P(\vec a)$, there is a closed modal formula $φ$ such that the version of modal UDT that searches for proofs in $\mathrm{PA} + φ$ will perform optimally on $\vec P(\vec a)$.

Paul commented on this post and suggested a stronger version: for every modal decision theory $\vec T(\vec u)$ and every provably extensional modal decision problem $\vec P(\vec a)$, modal UDT will do at least as well on $\vec P(\vec a)$ as $\vec T(\vec u)$ does, provided it uses a proof system that can prove which action $\vec T(\vec u)$ chooses on this decision problem and which outcome it obtains as a result. In this post, I give a detailed proof of this.

Prerequisite: An optimality result for modal UDT, and the prerequisites therein.

Let $(\vec A^{(T)}, \vec U^{(T)})$ be the fixed point of $\vec T(\vec u)$ and $\vec P(\vec a)$, and recall my notation $\vec χ^{(i)}$ for the sequence of formulas which has $⊤$ as its $i$'th entry and $⊥$ as all of its other entries (with the length of the sequence being clear from context). Now suppose that $φ$ is a true closed formula in the language of $\mathrm{GL}$ such that

$$\mathrm{GL} ⊢ φ \to ((\vec A^{(T)} \leftrightarrow \vec χ^{(i_T^*)}) \land (\vec U^{(T)} \leftrightarrow \vec χ^{(j_T^*)}));$$

that is, $\mathrm{GL} + φ$ proves that $\vec T(\vec u)$ chooses action $i_T^*$ and achieves the outcome $j_T^*$ as a result. (By saying that $φ$ is a "true" formula we mean that its translation to the language of arithmetic is true about the standard natural numbers: $\mathbb{N} ⊨ φ$.) It's ok to talk about $\vec A^{(T)}$ and $\vec U^{(T)}$ in the definition of $i_T^*$ and $j_T^*$ because fixed points are unique (up to provable equivalence), and $\mathrm{GL}$ can prove that the fixed point is in fact a fixed point.
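As a small illustration of the $\vec χ^{(i)}$ notation, here is what it looks like for sequences of length three. The length three is chosen only for this example, and the unpacking in the comment assumes the usual componentwise reading of $\leftrightarrow$ between sequences.

```latex
% Assumption for illustration: sequences of length 3.
\vec\chi^{(1)} = (\top, \bot, \bot), \qquad
\vec\chi^{(2)} = (\bot, \top, \bot), \qquad
\vec\chi^{(3)} = (\bot, \bot, \top).
% Under the componentwise reading, \vec A^{(T)} \leftrightarrow \vec\chi^{(2)} unpacks to
(A^{(T)}_1 \leftrightarrow \bot) \land (A^{(T)}_2 \leftrightarrow \top) \land (A^{(T)}_3 \leftrightarrow \bot).
```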

My claim, then, is that $\vec{\mathrm{UDT}}^{(φ)}(\vec u)$ will perform at least as well as $\vec T(\vec u)$ on the decision problem $\vec P(\vec a)$; that is, the outcome it achieves will be ranked $\leq j_T^*$.

Intuitively, this is straightforward. $\vec{\mathrm{UDT}}^{(φ)}(\vec u)$ searches through all pairs $(j, i)$ of outcomes and actions in lexicographical order, until it finds a pair such that it can prove that if it takes action $i$, it will achieve outcome $j$; as soon as it finds such a pair, it takes action $i$. (This is justified because it searches outcomes best-first, so it takes an action that leads to as good an outcome as it's able to prove it can get.) So if it can prove that taking action $i_T^*$ will lead to outcome $j_T^*$, it will either take that action and get that outcome, or there's some pair $(j, i) < (j_T^*, i_T^*)$ such that it takes action $i$ and obtains outcome $j \leq j_T^*$. (Remember that outcomes are numbered from best to worst.)
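The search order described above can be sketched in a few lines of Python. This is only a toy model, not the modal definition: `can_prove` is a hypothetical stand-in for "the proof system proves that action $i$ implies outcome $j$" (here just a lookup in a finite set), and `default_action` is an assumed fallback for the case where no proof is found.

```python
from typing import Callable


def udt_search(num_actions: int, num_outcomes: int,
               can_prove: Callable[[int, int], bool],
               default_action: int) -> int:
    """Toy best-first search over (outcome, action) pairs.

    can_prove(i, j) is a hypothetical oracle for
    "taking action i provably leads to outcome j".
    """
    # Outcomes are numbered from best (1) to worst, so putting j in the
    # outer loop visits pairs (j, i) in lexicographic order, best outcome first.
    for j in range(1, num_outcomes + 1):
        for i in range(1, num_actions + 1):
            if can_prove(i, j):
                return i  # stop at the first pair with a proof
    # Assumed fallback when no pair has a proof.
    return default_action


# Toy oracle: proofs exist for "action 2 -> outcome 3" and "action 1 -> outcome 2".
proofs = {(2, 3), (1, 2)}
chosen = udt_search(2, 3, lambda i, j: (i, j) in proofs, default_action=2)
```

In this toy run the search visits $(1,1)$, $(1,2)$, then $(2,1)$ in $(j, i)$ order and stops there, so it picks action 1 with outcome 2 rather than action 2 with the worse outcome 3. This mirrors the argument: a proof available for some pair guarantees the search stops at that pair or at an earlier one.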

Let's go through the details of showing that it actually works out that way.

Let's write $(\vec A, \vec U)$ for the fixed point of $\vec{\mathrm{UDT}}^{(φ)}(\vec u)$ with $\vec P(\vec a)$. The part in our argument that we need to check carefully is that UDT will in fact stop at the pair $(j_T^*, i_T^*)$ if it hasn't already stopped before that; i.e.,

$$\mathrm{GL} ⊢ φ \to (\mathrm{UDT}_{i_T^*}^{(φ)}(\vec U) \to U_{j_T^*}).$$

If this is satisfied, then we're done: we know that there will be some pair $(j^*, i^*) \leq (j_T^*, i_T^*)$ such that $\vec{\mathrm{UDT}}^{(φ)}(\vec U)$ outputs $i^*$ and such that

$$\mathrm{GL} ⊢ φ \to (\mathrm{UDT}_{i^*}^{(φ)}(\vec U) \to U_{j^*}),$$

and hence, since (a) $\mathrm{GL}$ is sound on $\mathbb{N}$, (b) $\mathbb{N} ⊨ φ$, and (c) by assumption, $\mathbb{N} ⊨ \mathrm{UDT}_{i^*}^{(φ)}(\vec U)$, it follows that $\mathbb{N} ⊨ U_{j^*}$, i.e., $\vec{\mathrm{UDT}}^{(φ)}(\vec U)$ achieves the outcome $j^* \leq j_T^*$. Thus, let's check that $\mathrm{GL}$ does indeed prove $φ \to (\mathrm{UDT}_{i_T^*}^{(φ)}(\vec U) \to U_{j_T^*})$.

To do so, we make use of provable extensionality, that is, of the fact that

$$\mathrm{GL} ⊢ (\vec a \leftrightarrow \vec b) \to (\vec P(\vec a) \leftrightarrow \vec P(\vec b)).$$

Since, as a modal decision theory, $\vec{\mathrm{UDT}}^{(φ)}$ is a p.m.e.e. (provably mutually exclusive and exhaustive) sequence, $\mathrm{GL}$ proves that $\mathrm{UDT}_{i_T^*}^{(φ)}(\vec U)$ implies $\vec{\mathrm{UDT}}^{(φ)}(\vec U) \leftrightarrow \vec χ^{(i_T^*)}$, i.e., $\vec A \leftrightarrow \vec χ^{(i_T^*)}$. Hence, together with provable extensionality, we obtain

$$\mathrm{GL} ⊢ \mathrm{UDT}_{i_T^*}^{(φ)}(\vec U) \to (\vec U \leftrightarrow \vec P(\vec χ^{(i_T^*)}))$$

(since $\mathrm{GL} ⊢ \vec P(\vec A) \leftrightarrow \vec U$ by definition of $\vec U$).

But on the other hand, recall our initial assumption that

$$\mathrm{GL} ⊢ φ \to ((\vec A^{(T)} \leftrightarrow \vec χ^{(i_T^*)}) \land (\vec U^{(T)} \leftrightarrow \vec χ^{(j_T^*)}));$$

again by provable extensionality, this implies

$$\mathrm{GL} ⊢ φ \to ((\vec P(\vec A^{(T)}) \leftrightarrow \vec P(\vec χ^{(i_T^*)})) \land (\vec U^{(T)} \leftrightarrow \vec χ^{(j_T^*)})),$$

and since $\mathrm{GL} ⊢ \vec P(\vec A^{(T)}) \leftrightarrow \vec U^{(T)}$ by definition of $\vec U^{(T)}$, this simplifies to

$$\mathrm{GL} ⊢ φ \to (\vec P(\vec χ^{(i_T^*)}) \leftrightarrow \vec χ^{(j_T^*)}).$$

But together with our earlier result, this implies

$$\mathrm{GL} ⊢ (φ \land \mathrm{UDT}_{i_T^*}^{(φ)}(\vec U)) \to (\vec U \leftrightarrow \vec χ^{(j_T^*)}),$$

which is equivalent to

$$\mathrm{GL} ⊢ φ \to (\mathrm{UDT}_{i_T^*}^{(φ)}(\vec U) \to U_{j_T^*}),$$

as desired.
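The last two steps of this argument are purely propositional once they are read one component at a time. As a sanity check, they can be sketched in Lean 4. This is only an abstraction under stated assumptions: the names `phi`, `D`, `U`, `Pchi`, `chij` and `Uj` are hypothetical propositional placeholders for $φ$, $\mathrm{UDT}_{i_T^*}^{(φ)}(\vec U)$, one component of $\vec U$, the matching components of $\vec P(\vec χ^{(i_T^*)})$ and $\vec χ^{(j_T^*)}$, and $U_{j_T^*}$. Provability in $\mathrm{GL}$ is replaced by plain implication, so the modal content is not modeled.

```lean
-- Combining the two intermediate results (one component of the sequences).
theorem combine (phi D U Pchi chij : Prop)
    (h1 : D → (U ↔ Pchi)) (h2 : phi → (Pchi ↔ chij)) :
    phi ∧ D → (U ↔ chij) :=
  fun ⟨hphi, hD⟩ => (h1 hD).trans (h2 hphi)

-- At index j_T^*, the matching component of χ is ⊤, so U_j itself follows.
theorem extract (phi D Uj : Prop)
    (h : phi ∧ D → (Uj ↔ True)) : phi → D → Uj :=
  fun hphi hD => (h ⟨hphi, hD⟩).mpr trivial
```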
