x

Third-person counterfactuals — AI Alignment Forum

2

Third-person counterfactuals

by Benya_Fallenstein

3rd Feb 2015

7 min read

2

Counterfactuals

Personal Blog

Third-person counterfactuals

1Benya_Fallenstein

0Benya_Fallenstein

New Comment

4 comments, sorted by

Click to highlight new comments since: Today at 9:28 PM

[-]orthonormal11y10

Typo on indices, should be:

then GL $⊢ \neg A_{2}$ would imply that the agent would take action 2 (because $\neg A_{2}$ implies $A_{2} \to U_{1}$ )

Also, isn't your example also fully informative, since if GL $⊢ \neg A_{2}$ , then GL also proves true and spurious counterfactuals about $A_{2}$ ?

[-]Benya_Fallenstein11y10

Fixed---thanks, Patrick!

Regarding the example, I earlier defined "action $i$ leads to outcome $j$ " to mean the conjunction of $G L ⊢ A_{i} \to U_{j}$ and $G L ⊬ \neg A_{i}$ ; i.e., we check for spurious counterfactuals before believing that $G L ⊢ A_{i} \to U_{j}$ tells us something about what action $i$ leads to, and we only consider ourselves "fully informed" in this sense if we have non-spurious information for each $i$ . (Of course, my follow-up post is about how that's still unsatisfactory; the reason to define this notion of "fully informative" so explicitly was really to be able to say more clearly in which sense we intuitively have a problem even when we've ruled out the problems of ambiguous counterfactuals and not enough counterfactuals.)

[-]orthonormal11y10

Ah! I'd failed to propagate that somehow.

Given that we're using PA+n to defeat evil problems, the true modal definition of "action $i$ leads to outcome $j$ " might be something like "there exists a closed formula $ϕ$ such that $N ⊨ ϕ$ , GL $⊢ ϕ \to (A_{i} \to U_{j})$ , and GL $⊬ ϕ \to \neg A_{i}$ ". But that's an unnecessary complication for this post.

[-]Benya_Fallenstein11y00

Yeah, that sounds good! Of course, by the Kripke levels argument, it's sufficient to consider $ϕ$ 's of the form $\neg □^{n} ⊥$ . And we might want to have a separate notion of " $i$ leads to $j$ at level $n$ ", which we can actually implement in a finite modal formula. This seems to suggest a version of modal UDT that tries to prove things in PA, then if that has ambiguous counterfactuals (i.e., it can't prove $A_{i} \to U_{j}$ for any $j$ ) we try PA+1 and so on up to some finite $n$ ; then we can hope these versions of UDT approximate optimality according to your revised version of " $i$ leads to $j$ " as $n \to \infty$ . Worth working out!

More from Benya_Fallenstein

Curated and popular this week

4

If you're thinking about the counterfactual world where you do X in the process of deciding whether to do X, let's call that a first-person counterfactual. If you're thinking about it in the process of deciding whether another agent A should have done X instead of Y, let's call that a third-person counterfactual. The definition of, e.g., modal UDT uses first-person counterfactuals, but when we try to prove a theorem showing that modal UDT is "optimal" in some sense, then we need to use third-person counterfactuals.

UDT's first-person counterfactuals are logical counterfactuals, but our optimality result evaluates UDT by using physical third-party counterfactuals: it asks, would another agent have done better, not, would a different action by the same agent have lead to a better outcome? The former is easier to analyze, but the latter seems to be what we really care about. Nate's recent post on "global UDT" points towards turning UDT into a notion of third-party counterfactuals, and describes some problems. In this post, I'll give a fuller UDT-based notion of logical third-party counterfactuals, which at least fails visibly (returns an error) in the kinds of cases Nate describes. However, in a follow-up post I'll give an example where this definition returns a non-error value which intuitively seems wrong.

Before I start, a historical side note: When Kenny Easwaran visited us for two days and we proved the UDT optimality result, the reason we decided to physical counterfactuals wasn't actually that we thought these were the better kind of counterfactuals. Rather, we actually thought explicitly about the problem of physical vs. logical third-person counterfactuals on the first morning of Kenny's visit, and decided to look at the physical counterfactuals case because it seemed easier to reason about. Which turned out to be a great decision, because---to our surprise---we very quickly ended up proving the first version of what later became the modal UDT optimality result!

But today, let's talk about logical counterfactuals. As Nate points out in his Global UDT post, there's a sort of duality between first-person and third-person counterfactuals---given a good third-person notion of counterfactuals, you can try to turn it into a first-person notion by writing an agent that evaluates actions according to it, and given a first-person notion you can try to turn it into a third-person notion. So is there a way to turn, say, the first-person counterfactuals of modal UDT into a way to evaluate what would have happened in a certain universe if a certain agent had taken a different action?

Nate's post describes an algorithm, GlobalUDT(U,A), which tries to tell you what agent $A ()$ should have done in order to achieve the best outcome in universe $U ()$ . Here, I want to ask a more intermediate question: What would have happened if $A ()$ had chosen a different action? Of course, we can then say that the agent should have taken the action that leads to the best possible outcome in this sense, but one advantage of my proposal is that it sometimes says, "I don't know what would have happened in that case"; in particular, in the cases Nate discusses in his post, my proposal would say that it doesn't have an answer, rather than giving a wrong answer. (However, in a follow-up post I'll show that there are cases in which my proposal gives an intuitively incorrect answer.)

So here's my proposal. Suppose that $\to A$ is an $m$ -action agent, that is, a "provably mutually exclusive and exhaustive" (p.m.e.e.) sequence of $m$ closed modal formulas $(A_{1}, \dots, A_{m})$ , where $A_{i}$ is interpreted as "the agent takes action $m$ ". "P.m.e.e." means that it's provable that exactly one of the $m$ formulas is true. Similarly, $(A_{1}, \dots, A_{m})$ is an $n$ -outcome universe, i.e., a p.m.e.e. sequence $(U_{1}, \dots, U_{n})$ where $U_{j}$ means "the $j$ 'th-best outcome obtains".

We say that, according to this notion of counterfactuals, action $i$ leads to outcome $j$ if (i) $G L ⊢ A_{i} \to U_{j}$ , and (ii) $G L ⊬ \neg A_{i}$ . So for every $i$ , there are three possible cases:

If there's exactly one $j$ such that $G L ⊢ A_{i} \to U_{j}$ , then we say that action $i$ leads to outcome $j$ .
If $G L ⊢ \neg A_{i}$ , then we don't know what would have happened if the agent had taken action $i$ , because we "don't have enough counterfactuals": there is no model of PA in which $A_{i}$ is true (we can think of the models of PA as the "impossible possible worlds" we use to evaluate the impact of different actions). In particular, this is the case if we have both $G L ⊢ A_{i} \to U_{j}$ and $G L ⊢ A_{i} \to U_{j^{'}}$ , for $j \neq j^{'}$ , since this implies $G L ⊢ \neg A_{i}$ by the assumption that $U_{j}$ and $U_{j^{'}}$ are provably mutually exclusive.
If there's no $j$ such that $G L ⊢ A_{i} \to U_{j}$ , then we don't know what would have happened if the agent had taken action $i$ , because we have "ambiguous counterfactuals": there are some distinct $j$ and $j^{'}$ such that there's a model of PA in which $A_{i} \land U_{j}$ , and a different model in which $A_{i} \land U_{j^{'}}$ . (We know that there is a model in which $A_{i}$ is true, because otherwise we'd have $G L ⊢ \neg A_{i}$ , which would imply $G L ⊢ A_{i} \to U_{j}$ for every $j$ .)

Now, for example, if we consider Nate's example of an agent that has three possible actions, but always takes the third one (i.e., $\to A \equiv (A_{1}, A_{2}, A_{3}) \equiv (⊥, ⊥, ⊤)$ ), then it's clear that our third-person counterfactuals will not fail silently, but rather give the reasonable answer that it's hard to say what outcome the agent would have achieved if it had returned a different value: for example, say that $U_{14} \equiv ⊤ \land (⊥ \lor \neg ⊤)$ ; are some of the $⊤$ 's and $⊥$ 's in the definition of this universe invocations of the agent? Which ones? We might hope that there's a notion of third-party counterfactuals which can answer questions like this about the real world, but presumably it would need to make more use of the more complicated structure of the real universe; as posed, the question doesn't seem to have a good answer.

But when we apply this notion to modal UDT, it returns a non-error answer sufficiently often to allow us to show an at least superficially sensible (if rather trivial!) optimality result.

Let's say that a pair of $(\to A, \to U)$ is "fully informative" if every $i$ leads to some $j$ according to our notion of counterfactuals. Then, given a fully informative pair, we can say that $\to A$ is optimal (according to our notion of counterfactuals!) iff the outcome that $\to A$ 's actual action leads to is optimal among the outcomes achievable by any of the available actions.

Now it's rather straight-forward to see that modal UDT is optimal, in this sense, on a universe $\to U$ whenever the pair $(\to U D T (\to U), \to U)$ is fully informative. Recall the way that modal UDT works:

For every outcome $j = 1$ to $n$ (from best to worst):
- For every action $i = 1$ to $m$ :
  - If $□ (A_{i} \to U_{j})$ , then take action $i$ .
If you're still here, take some default action.

Clearly, in the fully informative case, this algorithm will take the optimal action (in the sense we use here): Suppose that $j$ is optimal, and $i$ leads to $j$ . The search will not find a proof of an implication $A_{i^{'}} \to U_{j^{'}}$ with $j^{'} < i^{'}$ , because then $A_{i^{'}} \to U_{j^{'}}$ wouldn't be optimal according to our definition; and the search will terminate when considering the pair $(j, i)$ at the latest; so modal UDT will return some action $i^{*}$ for which $j^{'} < i^{'}$ .

I'd like to say that this covers all the cases in which we would expect modal UDT to be optimal, but unfortunately that's not quite the case. Suppose that there are two actions, $A_{1}$ and $A_{2}$ , and two outcomes, $U_{1}$ and $U_{2}$ . In this case, it's consistent that $i = 1$ leads to $j = 1$ , but we don't have enough counterfactuals about $i = 2$ , that is, $G L ⊢ \neg A_{2}$ (implying that $(\to A, \to U)$ isn't fully informative). This is because modal UDT doesn't have an explicit "playing chicken" step that would make it take action $A_{2}$ if it can prove that it doesn't take this action. Now, if we did not have $G L ⊢ A_{1} \to U_{1}$ , then $G L ⊢ \neg A_{2}$ would imply that the agent would take action $2$ (because $\neg A_{2}$ implies $A_{2} \to U_{1}$ ), which would lead to a contradiction (the agent takes an action that it provably doesn't take), so we can rule out that case; but the case of $G L ⊢ A_{1} \to U_{1}$ plus $G L ⊢ \neg A_{2}$ is consistent.

So let's say that a pair $G L ⊢ A_{1} \to U_{1}$ is "sufficiently informative" if either it's fully informative or if there is some action $i$ such that $G L ⊢ A_{i} \to U_{1}$ . Then we can say that $\to A$ is optimal if either (i) $(\to A, \to U)$ is fully informative and $\to A$ is optimal in the sense discussed earlier, or (ii) $N ⊨ U_{1}$ , that is, the agent actually obtains the best possible outcome. With these definitions, we can show that modal UDT is optimal whenever $(\to U D T (\to U), \to U)$ is sufficiently informative.

The reasoning is simple. In the fully informative case, our earlier proof works. In the other case, there's some $i$ such that $G L ⊢ A_{i} \to U_{1}$ , so the agent's search is certainly going to stop when it considers $A_{i} \to U_{1}$ at the latest; in other words, it's going to stop at some $i^{*} \leq i$ such that $G L ⊢ A_{i^{*}} \to U_{1}$ , and the agent is going to output that action $i^{*}$ ; i.e., we'll have $N ⊨ A_{i^{*}}$ . But since GL is sound, we also have $N ⊨ A_{i^{*}} \to U_{1}$ , and hence $N ⊨ U_{1}$ , showing optimality in the extended sense.

It's not surprising that modal UDT is "optimal" in this sense, of course! Nevertheless, as a conceptual tool, it seems useful to have this definition of logical third-person counterfactuals, to complement the first-person notion of modal UDT.

However, my not-so-secret agenda for going through this in detail is that in a follow-up post, I'll show that there are universes $\to U$ such that $(\to U D T (\to U), \to U)$ is fully informative, but UDT still does the intuitively incorrect thing---because the notion of counterfactuals (and hence the notion of optimality) I've defined in this post doesn't agree with intuition as well as we'd like. This failure turns out to be clearer in the context of the third-person counterfactuals described in this post than in modal UDT's first-person ones.