From halting oracles to modal logic

AI ALIGNMENT FORUM
AF

From halting oracles to modal logic — AI Alignment Forum

When reading Vladimir's recent post on defining UDT through modal logic, I had to think a bit about his actual definition of the modal formula(s) corresponding to UDT, because it was phrased in terms of an algorithm doing things instead of an actual modal formula. Then I remembered having worked through the correspondence in the context of modal agents, and it became clear what was going on. I think Vladimir's approach is really interesting and I want to refer to it in future posts, so I thought I'd write a tutorial, since it seems likely that others will have the same problem.

Like Vladimir, let's start with the simplest case of an agent that chooses between two actions, in a world with two different possible outcomes (one of which is preferred to the other). Let's write the world as a universe program, in the usual style of proof-based UDT; for example:

  def U():
    if A() == a1:
      return 10
    else:
      return 5

And, in the style of UDT with a halting oracle, let's write our agent as a program with access to a halting oracle, which it uses to figure out whether certain sentences are provable (in Peano Arithmetic, let's say):

  def A():
    if   Provable("A() == a1  -->  U() == 10"):
      return a1
    elif Provable("A() == a2  -->  U() == 10"):
      return a2
    return a2

(The reference to $A ()$ and $U ()$ inside the quotes is by quining.) Our agent tries to get utility 10 by trying all possible actions in order; for each action, it sees whether it can prove that this action leads to the desired outcome, and if so, it takes that action; if it doesn't find a good action, it randomly takes the last one.

Given a programming language with access to a halting oracle, there's a relation $R u n (m, n)$ such that if $m$ is the Gödel number of a program in this language, then $N ⊨ R u n (m, n)$ if and only if this program halts and returns $n$ . (The definition of $R u n (m, n)$ is very similar to the one for programming languages without a halting oracle.) We implicitly used this fact in the definition of $A ()$ : the statement " $A () = a_{1} \to U () = 10$ " really means " $R u n (A, a_{1}) \to R u n (U, 10)$ ".

Now, by looking at the definition of $U ()$ , we can show that $P A ⊢ [U () = 10] \leftrightarrow [A () = a_{1}],$ and by looking at the definition of $A ()$ , we can show that $P A ⊢ [A () = a_{1}] \leftrightarrow □ ┌ [A () = a_{1}] \to [U () = 10] ┐ .$ If we ignore the internal structure of $[U () = 10]$ and $[A () = a_{1}]$ , and treat these as propositional variables, both of the statements above look like formulas of modal logic: $u \leftrightarrow a$ and $a \leftrightarrow □ ┌ a \to u ┐$ .

We can write this as $a \leftrightarrow F (a, u)$ and $u \leftrightarrow G (a)$ , where $F (a, u)$ and $G (a)$ are the modal formulas $F (a, u) \equiv □ ┌ a \to u ┐$ and $G (a) \equiv a$ .

So far, the modal logic thing has been an analogy, but now we can observe that the two displayed statements above imply that $[A () = a_{1}]$ is a fixed point of the formula $F (a, G (a))$ ; that is, $P A ⊢ [A () = a_{1}] \leftrightarrow F ([A () = a_{1}], G ([A () = a_{1}])) .$ Now, since $F (a, G (a))$ is a fully modalized formula (meaning that all occurrences of its propositional variable, $a$ , are inside boxes), this means that we can apply results from provability logic: in particular, there is a modal formula that's a fixed point of $F (a, G (a))$ ---a closed formula $φ$ of Gödel-Löb provability logic ( $G L$ ) such that $G L ⊢ φ \leftrightarrow F (φ, G (φ))$ ---and moreover, $P A ⊢ [A () = a_{1}] \leftrightarrow φ$ .

It's straight-forward to check that $G L ⊢ ⊤ \leftrightarrow □ ┌ ⊤ \to ⊤ ┐$ . (Recall that $G L$ proves any closed formula in the language of modal logic if and only if $P A$ proves its translation to the language of arithmetic, so we prove the above statement by arguing: $P A$ can clearly show that $P A$ proves $⊤ \to ⊤$ , and hence that $□ ┌ ⊤ \to ⊤ ┐$ is equivalent to $⊤$ .) Hence, it follows that $P A ⊢ [A () = a_{1}] \leftrightarrow ⊤$ .

So the general idea for specifying modal formulas through algorithms with halting oracles is as follows:

Write a program in a language with access to a halting oracle, which uses that oracle to check the provability of relatively simple statements---which may, however, refer to the output of the program.
By inspection of the source code of this program, show in $P A$ that its output is a fixed point of a certain fully modalized formula of modal logic.
Show in modal logic that a certain closed formula $φ$ is also a fixed point of this.
Conclude that $P A$ shows $φ$ to be equivalent to the complicated formula that describes the output of your program.

Before looking at how this applies to Vladimir's actual definition of UDT, there's one complication to add: allowing more than two actions and more than two outcomes. To do so, instead of talking just about the single statement $[A () = a_{1}]$ , we can talk about the $m$ statements $[A () = a_{1}], \dots, [A () = a_{m}]$ , and instead of talking about just $[U () = u_{1}]$ (with $u_{1} = 10$ in our example), we'll talk about $[U () = u_{1}], \dots, [U () = u_{n}]$ .

Now we describe the behavior of the agent program by $m$ statements of the form $P A ⊢ [A () = a_{i}] \leftrightarrow F_{i} ([A () = a_{1}], \dots, [A () = a_{m}], [U () = u_{1}], \dots, [U () = u_{n}])$ and the behavior by the universe program by $n$ statements of the form $P A ⊢ [U () = u_{j}] \leftrightarrow G_{j} ([A () = a_{1}], \dots, [A () = a_{m}]),$ where the $F_{i} (a_{1}, \dots, a_{m}, u_{1}, \dots, u_{n})$ are fully modalized formulas, and the $G_{j} (a_{1}, \dots, a_{m})$ are arbitrary formulas of provability logic. Intuitively, The fact that the $F_{i}$ are modalized corresponds to the fact that agents can't run the universe they're in, even if they have its source code---they have to use abstract reasoning (such as proofs) to figure out what the consequences of different actions are. The universe program, on the other hand, can just run the agent and see what it does, which corresponds to being able to refer to it outside modal operators.

I hope it won't be too confusing that I'm using $a_{i}$ both to denote an action and the corresponding propositional variable, and similarly for $u_{i}$ . From now on, I'll abbreviate $a_{1}, \dots, a_{m}$ by $\to a$ and $u_{1}, \dots, u_{n}$ by $\to u$ (deviating from Vladimir's use of overbars in order to keep overbars available for splicing natural numbers into arithmetic formulas).

Again, the place that the formulas $F_{i} (\to a, \to u)$ and $G_{j} (\to a)$ are coming from is just by looking at what the source code of these programs does and writing down the modal formula that expresses what that source does. That said, we're not just looking for formulas that in some intuitive sense correspond to what the program does; we're looking for formulas such that the displayed statements above are provable in $P A$ .

By the fixed point theorem for provability logic, we then know that we can find formulas $φ_{i}$ such that $G L ⊢ φ_{i} \leftrightarrow F_{i} (\to φ, G_{1} (\to φ), \dots, G_{n} (\to φ))$ , where $\to φ$ abbreviates $φ_{1}, \dots, φ_{m}$ ; and moreover, we know that then, $P A ⊢ [A () = a_{i}] \leftrightarrow φ_{i}$ . (To apply the fixed point theorem, we need to use the fact that $F (\to a, G_{1} (\to a), \dots, G_{n} (\to a))$ is fully modalized, which it clearly is.) Thus, we can carry out the same steps as in the two-action, two-outcome case above.

Now let's revisit Vladimir's definition of the agent program UDT, suitably adapted to my notation here and to my convention of making the $a_{i}$ mutually exclusive:

There are finitely many sentences of the form "if such-and such $a_{i}$ holds, then such-and-such $u_{j}$ holds". Find all such sentences that are provable. If no such sentences were found, choose an arbitrary action---e.g., make $F_{1}$ true and all other $F_{i}$ false.
From all pairs $(a_{i}, u_{j})$ found in the previous step, choose the $a_{i}$ whose $u_{j}$ is highest in our preference ordering. If there are multiple such $a_{i}$ , choose any one, e.g. the one with the lowest index.
For each $k$ , define $F_{k}$ to be true if $k = i$ (where $i$ is the index of the $a_{i}$ we chose in the previous step).

To cash this out, first we think of this definition as an algorithm with a halting oracle:

  def UDT():
    pairs = []
    for i in range(m):
      for j in range(n):
        if Provable("UDT() == a_i  -->  U() = u_j"):
          pairs.append((i,j))
    
    if len(pairs) == 0:
      return a_1
    else:
      find the pair (i,j) in pairs with the largest j
      return the corresponding a_i

Now we try to find $F_{i}$ such that we can prove in PA that $[U D T () = a_{i}]$ is equivalent to $F_{i} ([U D T () = a_{1}], \dots, [U D T () = a_{m}], [U () = u_{1}], \dots, [U () = u_{n}])$ . This is somewhat tedious, but possible: for example, we can:

enumerate every subset $X$ of the set of all pairs $(i, j)$ with $i \in {1, \dots, m}$ and $j \in {1, \dots, n}$
for every such subset $X$ , write down the modal formula $H_{X} (\to a, \to u)$ which is the conjunction of $□ ┌ a_{i} \to u_{j} ┐$ , for all $(i, j) \in X$ , and $\neg □ ┌ a_{i} \to u_{j} ┐$ , for all $(i, j) \notin X$ ;
let $F_{i} (\to a, \to u)$ be the disjunction of $H_{X} (\to a, \to u)$ for every $X$ such that the above algorithm returns $i$ if the Provable(...) condition is evaluated according to $X$ . (Since there are only finitely many $X$ , we can just figure this out by running the code, interpreting the call to Provable(...) according to each $X$ .)

You could certainly write the $F_{i}$ more compactly than this, but the above version shows the idea.

Having done all this, not only do we have a better understanding of how to translate Vladimir's specification into a formula of modal logic, we also have an idea of what to do with it: Given a universe program $U ()$ for which we can show statements of the form $P A ⊢ [U () = u_{j}] \leftrightarrow G_{j} ([U D T () = a_{1}], \dots, [U D T () = a_{n}]),$ we can now find formulas $φ_{i}$ such that $G L ⊢ φ_{i} \leftrightarrow F_{i} (\to φ, G_{1} (\to φ), \dots, G_{n} (\to φ))$ , and can then conclude that $φ_{i}$ tells us whether $U D T ()$ ---the program with the halting oracle---will output $a_{i}$ , given the universe $U ()$ .

This may be rather daunting, given the complicated definition of $F_{i} (\to a, \to u)$ from above, but finding the fixed point formulas $φ_{i}$ and evaluating their truth value are both decidable (unless I'm making a mistake?), so at least in principle (and hopefully in practice), we can write a program that figures out the answer for us.

(I'm not sure whether this post will be helpful or interesting to anybody---feedback appreciated...)