The odd counterfactuals of playing chicken

In this post, I examine an odd consequence of "playing chicken with the universe", as used in proof-based UDT. Let's say that our agent uses PA, and that it has a provability oracle, so that if it doesn't find a proof, there really isn't one. In this case, one way of looking at UDT is to say that it treats the models of PA as impossible possible worlds: UDT thinks that taking action $a$ leads to utility $u$ iff the universe program $U ()$ returns $u$ in all models $M$ in which $A ()$ returns $a$ . The chicken step ensures that for every $a$ , there is at least one model $M$ in which this is true. But how? Well, even though in the "real world", $N$ , the sentence $┌ ¯ A () \neq ¯ a ┐$ isn't provable---that is, $N ⊨ \neg □ ┌ ¯ ¯ A () \neq ¯ ¯ a ┐$ ---there are other models $M$ such that $M ⊨ □ ┌ ¯ ¯ A () \neq ¯ ¯ a ┐$ , and in these models, the chicken step can make $A ()$ output $a$ .

In general, the only "impossible possible worlds" in which $A () = a$ are models $M$ according to which it is provable that $A () \neq a$ . In this post, I show that this odd way of constructing the counterfactual "what would happen if I did $a$ " can cause problems for modal UDT and the corresponding notion of third-party counterfactuals.

To simplify things, I'll consider modal UDT in this post. (The fact that our definition of modal UDT doesn't have an explicit chicken-playing step will make some later arguments a little subtler than they otherwise would have been, but it's not really a problem since it still does something very similar implicitly.)

I'll consider a version of the 5-and-10 problem, which has two actions ( $1, 2$ ) and three outcomes ( $1, 2, 3$ , ordered from best to worst by the convention I've adopted in my posts on modal UDT). If the agent takes action $2$ , it always gets outcome $2$ . If the agent takes action $1$ , it gets the optimal outcome $1$ , unless it's provable that it doesn't take action $1$ , in which case it gets the worst outcome, $3$ .

Of course, in the real world, if it's provable that the agent doesn't take a particular action, then it really doesn't take that action. Hence, if the agent takes action $1$ in the real world, it will receive the optimal outcome $1$ , so intuitively, it's best to take action $1$ .

But modal UDT, as we'll see, takes action $2$ instead. This is self-consistent, because in the only "impossible possible worlds" in which it takes action $1$ , it does so because of (implicitly) playing chicken; that is, it only takes action $1$ because, in these worlds, it's "provable" that it doesn't take action $1$ , implying that it receives the worst possible outcome in all of these worlds.

We model the above set-up as a modal decision problem, that is, as three formulas $P_{1} (a_{1}, a_{2})$ , $P_{2} (a_{1}, a_{2})$ and $P_{3} (a_{1}, a_{2})$ , such that $P_{j} (a_{1}, a_{2})$ is true if an agent obtains outcome $j$ . Here, $a_{1}$ and $a_{2}$ are propositional variables; $a_{i}$ is interpreted as meaning that the agent takes action $i$ . Thus, we can define $\to P (\to a)$ as follows:

$\begin{matrix} P_{1} (a_{1}, a_{2}) & : \leftrightarrow a_{1} \land \neg □ \neg a_{1} P_{2} (a_{1}, a_{2}) & : \leftrightarrow a_{2} P_{3} (a_{1}, a_{2}) & : \leftrightarrow a_{1} \land □ \neg a_{1} \end{matrix}$

Here's a different possible reading. You may know the diamond operator, $◊ φ$ , which is an abbreviation for $\neg □ \neg φ$ . We have $N ⊨ □ ϕ$ if $ϕ$ is provable, that is, iff it is true in all models of PA; we have $N ⊨ ◊ ϕ$ if it isn't provable that not $ϕ$ , i.e., if there is some model of PA in which $ϕ$ is true. (The relation between $□$ and $◊$ is like the relation between $\forall$ and $\exists$ ---we have $N ⊨ □ φ$ iff for all models $M$ , $M ⊨ φ$ , and we have $N ⊨ \neg □ \neg φ$ or $N ⊨ ◊ φ$ if not for all models $M$ not $M ⊨ φ$ , which is equivalent to saying that there exists a model $M$ such that $M ⊨ φ$ .) So the box can be read as "true in all models", whereas the diamond can be read "true in some models" or "possible". Using the diamond operator, we can rewrite the decision problem as

$\begin{matrix} P_{1} (a_{1}, a_{2}) & \leftrightarrow a_{1} \land ◊ a_{1} P_{2} (a_{1}, a_{2}) & \leftrightarrow a_{2} P_{3} (a_{1}, a_{2}) & \leftrightarrow a_{1} \land \neg ◊ a_{1} \end{matrix}$

So the agent gets the good outcome $1$ if it chooses action $1$ and it "was possible" for it to choose action $1$ , whereas it gets the bad outcome $3$ if it chooses action $1$ but it "wasn't possible" for it to choose this action.

For example, let's consider the agent which just takes action $1$ without any further reasoning, which we represent by substituting the pair $(⊤, ⊥)$ for the pair of variables $(a_{1}, a_{2})$ . By substitution, $P_{1} (⊤, ⊥)$ is equivalent to $⊤ \land \neg □ \neg ⊤$ , which is equivalent to $\neg □ ⊥$ , which states that PA is consistent ("no contradiction is provable"). Since $N ⊨ \neg □ ⊥$ , it follows that "in the real world", the agent $(⊤, ⊥)$ obtains the best possible outcome $1$ .

It turns out that modal UDT, on the other hand, will choose action $2$ and receive a suboptimal payoff. Moreover, this is true even if we consider modal UDT with a stronger proof system than PA, as described in An optimality result for modal UDT. This means, of course, that the assumptions of the optimality theorem aren't satisfied; this isn't too surprising, because the decision problem refers to the agent's action inside boxes (see Patrick's recent post for a much simpler example of non-optimality when the decision problem refers to the agent inside boxes).

Let's write $(A_{1}, A_{2})$ , or $\to A$ for short, for the two formulas describing the action of modal UDT on our decision problem, and let's write $(U_{1}, U_{2}, U_{3})$ for the formulas describing the outcome it obtains; that is, $U_{j}$ is equivalent to $P_{j} (A_{1}, A_{2})$ . (Technically, this means that $(\to A, \to U)$ is the fixed point of $\to P (\to a)$ with $\to U D T (\to u)$ ; refer to the optimality post and the references therein for technical details.)

I claim that $G L ⊢ A_{1} \to □ \neg A_{1}$ ; that is: we can show that if UDT takes action $1$ , then it isn't provable that it takes action $1$ . Which, of course, implies that $G L ⊢ A_{1} \to U_{3}$ , since $U_{3}$ is equivalent to $A_{1} \land □ \neg A_{1}$ ; UDT concludes that taking action $1$ will lead to the worst possible outcome, $3$ . We can also think of this in terms of possible worlds: In all models $M$ of PA (all "impossible possible worlds", the way that modal UDT evaluates counterfactuals) in which our agent takes action $1$ , i.e., in all models satisfying $M ⊨ A_{1}$ , we also have $M ⊨ U_{3}$ .

All of the above implies that "in the real world", UDT must take action $2$ , i.e., $N ⊨ A_{2}$ . The most direct way to conclude this, once I've proven my claim that $G L ⊢ A_{1} \to □ \neg A_{1}$ , is by a quick proof of contradiction: Suppose that the agent took action $1$ ( $N ⊨ A_{1}$ ). Then by my claim, it would be provable that it doesn't take action $1$ ( $N ⊨ □ \neg A_{1}$ ). But this would mean that PA is inconsistent; contradiction.

I'll now proceed to prove my claim. I guess you may not be surprised if I tell you I'll do so by Löb's theorem; that is, I'll actually prove that $G L ⊢ □ (A_{1} \to □ \neg A_{1}) \to (A_{1} \to □ \neg A_{1})$ .

This may look like an opaque collection of boxes, but actually there's some interpretation. First of all, $□ (A_{1} \to □ \neg A_{1})$ implies $□ (A_{1} \to U_{3})$ , by taking the argument I made in the previous section and carrying out inside the boxes: If we have $A_{1}$ and $A_{1} \to □ \neg A_{1}$ , then we have $A_{1} \land \neg A_{1}$ , which is the same as $P_{3} (A_{1}, A_{2})$ , which is equivalent to $U_{3}$ .

So it's enough if we can prove $G L ⊢ □ (A_{1} \to U_{3}) \to (A_{1} \to □ \neg A_{1})$ , or equivalently, $G L ⊢ [A_{1} \land □ (A_{1} \to U_{3})] \to □ \neg A_{1}$ . Now consider the way modal UDT operates:

For all outcomes $j$ from 1 to 3:
- For all actions $i$ from 1 to 2:
  - If you can prove $A_{i} \to U_{j}$ , return $i$ .
If you're still here, return a default action.

Consider, furthermore, that we clearly have $G L ⊢ □ (A_{2} \to U_{2})$ (by definition, we have $G L ⊢ U_{2} \leftrightarrow A_{2}$ , which yields $G L ⊢ A_{2} \to U_{2}$ and hence $G L ⊢ □ (A_{2} \to U_{2})$ ). Thus, the search will stop at $i = j = 2$ at the latest, and by using this fact, GL can show that the only way $A_{1}$ could be true is if we have $□ (A_{1} \to U_{1})$ or $□ (A_{1} \to U_{2})$ . But if either of these is true, and we also have $□ (A_{1} \to U_{3})$ , then it follows that $□ \neg A_{1}$ (since $(U_{1}, U_{2}, U_{3})$ are provably mutually exclusive, i.e., only one of them can be true).

This proves $G L ⊢ [A_{1} \land □ (A_{1} \to U_{3})] \to □ \neg A_{1}$ , which concludes the proof of my claim.

But is this really so bad? Sure, modal UDT performs suboptimally on this decision problem, but we already know that every agent behaves badly on its "evil" decision problems, for example, and these problems are a lot simpler to reason about than what I've discussed here.

Moreover, the proof in this post only works for modal UDT using PA as its proof system, not for variants using proof systems stronger than PA. (You could define an analogous decision problem for these stronger proof systems, but without changing the decision problem, the proof only works for $P A$ -based UDT.) Now, the optimality result for modal UDT deals with the "evil" decision problems by saying that for every provably extensional decision problem, there's some $n_{0}$ such that modal UDT using $P A + n$ , $n \geq n_{0}$ , performs optimally on this decision problem. This result probably doesn't apply to the decision problem I've discussed in this post, because that's probably not provably extensional (though I haven't actually tried to prove this); but we might still say, why care if $P A$ -based UDT fails on some specific decision problem? We already knew that you need to use stronger proof systems sometimes.

The real reason to consider this problem becomes clear when we consider the logical third-person counterfactuals corresponding to modal UDT.

First of all, note that the optimality picture looks quite different when using logical counterfactuals, which ask "what would have happened if this agent had chosen a different action", rather than the physical counterfactuals from the optimality result, which ask "what would happen if you replaced the agent by a different agent".

Consider the "evil" decision problems: Fix a two-action agent ${\to A}^{'} \equiv (A_{1}^{'}, A_{2}^{'})$ . The corresponding evil decision problem, ${\to P}^{'} (\to a) \equiv (P_{1}^{'} (a_{1}, a_{2}), P_{2}^{'} (a_{1}, a_{2}))$ , compares the action the agent $\to a$ is taking against the action taken by ${\to A}^{'}$ ; if they're different, it rewards the agent $\to a$ , if they're the same, it punishes $\to a$ . Clearly, every agent gets punished on its own evil decision problem, even though there's a very simple agent which gets rewarded on the same problem.

But that's judging the agent through physical counterfactuals. When we consider logical counterfactuals, we don't compare the agent against other agents; rather, we ask whether this agent could have done better by taking a different action. From that perspective, the evil problems don't seem so problematic: It's intuitively clear, and our proof-based third-person counterfactuals agree, that if an agent returned a different action, it would still be punished by its evil decision problem.

It's true that we can't expect PA-based UDT to be optimal, even from the perspective of logical third-party counterfactuals. After all, consider a decision problem that rewards action $1$ iff PA is consistent, and action $2$ otherwise. In this case, modal UDT will not be able to prove anything of the form $A_{i} \to U_{j}$ , and will end up taking its default action, which could end up being the bad action $2$ .

However, from the perspective of proof-based third-party counterfactuals, this is acceptable. In the language of my previous post, we have "ambiguous counterfactuals": we can't figure out what outcome each action leads to, and so we throw up our hands and make peace with the fact that our agent may not behave optimally in any sense. I showed in the previous post that modal UDT is "optimal" in a certain sense whenever a universe is "fully informative", in the sense that for every action $i$ there's a unique $j$ for which it can prove $A_{i} \to U_{j}$ , or at least "sufficiently informative", which is a slight weakening of this condition. We only have reason to expect modal UDT to behave well when it's in a sufficiently informative universe; the decision problem that rewards $1$ iff PA is consistent doesn't satisfy this condition (if we use PA as our proof system), and so we don't expect PA-based modal UDT to do well on in the first place. The fact that the decision problem fails to be sufficiently informative alerts us to the fact that we shouldn't expect optimality.

But now consider the decision problem discussed in this post! This decision problem is fully informative---we obviously have $G L ⊨ A_{2} \to U_{2}$ ; we have $G L ⊬ \neg A_{2}$ , because we've shown that the agent actually takes action $2$ (i.e., $N ⊨ A_{2}$ ), and $G L$ is sound; we've shown $G L ⊢ A_{1} \to U_{3}$ ; and we have $G L ⊬ \neg A_{1}$ , because otherwise, $A_{1} \to U_{1}$ would be provable and our agent would take action $1$ . By the proof in the previous post, this implies that modal UDT performs "optimally", according to the notion of optimality defined there---that is, it acts optimally if you believe that taking action $1$ leads to the bad outcome $3$ , as examining provability in PA would lead you to believe. So unlike the problem discussed in the previous paragraph, in this case the third-party counterfactuals don't alert us to the fact that something is going wrong, and instead lead us to expect optimality---and yet we get arguably suboptimal behavior.

So the problem I want to point out in this post is that both modal UDT and the proof-based third-party counterfactuals defined in my previous post use a problematic notion of counterfactuals, related to the implicit chicken-playing step; in this case, this leads them to believe that taking action $1$ leads to the bad outcome $3$ , and to believe that they know what they're talking about (we're in the fully informative case, where for every action we seem to know exactly what outcome it would lead to), even though intuitively, it seems like taking action $1$ should be thought of as leading to the optimal outcome $1$ .

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

3

The odd counterfactuals of playing chicken

3