# 13

The following is a critique of the idea of logical counterfactuals. The idea of logical counterfactuals has appeared in previous agent foundations research (especially at MIRI): here, here. "Impossible possible worlds" have been considered elsewhere in the literature; see the SEP article for a summary.

I will start by motivating the problem, which also gives an account for what a logical counterfactual is meant to be.

Suppose you learn about physics and find that you are a robot. You learn that your source code is "A". You also believe that you have free will; in particular, you may decide to take either action X or action Y. In fact, you take action X. Later, you simulate "A" and find, unsurprisingly, that when you give it the observations you saw up to deciding to take action X or Y, it outputs action X. However, you, at the time, had the sense that you could have taken action Y instead. You want to be consistent with your past self, so you want to, at this later time, believe that you could have taken action Y at the time. If you could have taken Y, then you do take Y in some possible world (which still satisfies the same laws of physics). In this possible world, it is the case that "A" returns Y upon being given those same observations. But, the output of "A" when given those observations is a fixed computation, so you now need to reason about a possible world that is logically incoherent, given your knowledge that "A" in fact returns X. This possible world is, then, a logical counterfactual: a "possible world" that is logically incoherent.

To summarize: a logical counterfactual is a notion of "what would have happened" had you taken a different action after seeing your source code, and in that "what would have happened", the source code must output a different action than what you actually took; hence, this "what would have happened" world is logically incoherent.

It is easy to see that this idea of logical counterfactuals is unsatisfactory. For one, no good account of them has yet been given. For two, there is a sense in which no account could be given; reasoning about logically incoherent worlds can only be so extensive before running into logical contradiction.

To extensively refute the idea, it is necessary to provide an alternative account of the motivating problem(s) which dispenses with the idea. Even if logical counterfactuals are unsatisfactory, the motivating problem(s) remain.

I now present two alternative accounts: counterfactual nonrealism, and policy-dependent source code.

## Counterfactual nonrealism

According to counterfactual nonrealism, there is no fact of the matter about what "would have happened" had a different action been taken. There is, simply, the sequence of actions you take, and the sequence of observations you get. At the time of taking an action, you are uncertain about what that action is; hence, from your perspective, there are multiple possibilities.

Given this uncertainty, you may consider material conditionals: if I take action X, will consequence Q necessarily follow? An action may be selected on the basis of these conditionals, such as by determining which action results in the highest guaranteed expected utility if that action is taken.

This is basically the approach taken in my post on subjective implication decision theory. It is also the approach taken by proof-based UDT.

The material conditionals are ephemeral, in that at a later time, the agent will know that they could only have taken a certain action (assuming they knew their source code before taking the action), due to having had longer to think by then; hence, all the original material conditionals will be vacuously true. The apparent nondeterminism is, then, only due to the epistemic limitation of the agent at the time of making the decision, a limitation not faced by a later version of the agent (or an outside agent) with more computation power.

This leads to a sort of relativism: what is undetermined from one perspective may be determined from another. This makes global accounting difficult: it's hard for one agent to evaluate whether another agent's action is any good, because the two agents have different epistemic states, resulting in different judgments on material conditionals.

A problem that comes up is that of "spurious counterfactuals" (analyzed in the linked paper on proof-based UDT). An agent may become sure of its own action before that action is taken. Upon being sure of that action, the agent will know the material implication that, if they take a different action, something terrible will happen (this material implication is vacuously true). Hence the agent may take the action they were sure they would take, making the original certainty self-fulfilling. (There are technical details with how the agent becomes certain having to do with Löb's theorem).

The most natural decision theory resulting in this framework is timeless decision theory (rather than updateless decision theory). This is because the agent updates on what they know about the world so far, and considers the material implications of themselves taken a certain action; these implications include logical implications if the agent knows their source code. Note that timeless decision theory is dynamically inconsistent in the counterfactual mugging problem.

## Policy-dependent source code

A second approach is to assert that one's source code depends on one's entire policy, rather than only one's actions up to seeing one's source code.

Formally, a policy is a function mapping an observation history to an action. It is distinct from source code, in that the source code specifies the implementation of the policy in some programming language, rather than itself being a policy function.

Logically, it is impossible for the same source code to generate two different policies. There is a fact of the matter about what action the source code outputs given an observation history (assuming the program halts). Hence there is no way for two different policies to be compatible with the same source code.

Let's return to the robot thought experiment and re-analyze it in light of this. After the robot has seen that their source code is "A" and taken action X, the robot considers what would have happened if they had taken action Y instead. However, if they had taken action Y instead, then their policy would, trivially, have to be different from their actual policy, which takes action X. Hence, their source code would be different. Hence, they would not have seen that their source code is "A".

Instead, if the agent were to take action Y upon seeing that their source code is "A", their source code must be something else, perhaps "B". Hence, which action the agent would have taken depends directly on their policy's behavior upon seeing that the source code is "B", and indirectly on the entire policy (as source code depends on policy).

We see, then, that the original thought experiment encodes a reasoning error. The later agent wants to ask what would have happened if they had taken a different action after knowing their source code; however, the agent neglects that such a policy change would have resulted in seeing different source code! Hence, there is no need to posit a logically incoherent possible world.

The reasoning error came about due to using a conventional, linear notion of interactive causality. Intuitively, what you see up to time t depends only on your actions before time t. However, policy-dependent source code breaks this condition. What source code you see that you have depends on your entire policy, not just what actions you took up to seeing your source code. Hence, reasoning under policy-dependent source code requires abandoning linear interactive causality.

The most natural decision theory resulting from this approach is updateless decision theory, rather that timeless decision theory, as it is the entire policy that the counterfactual is on.

## Conclusion

Before very recently, my philosophical approach had been counterfactual nonrealism. However, I am now more compelled by policy-dependent source code, after having analyzed it. I believe this approach fixes the main problem of counterfactual nonrealism, namely relativism making global accounting difficult. It also fixes the inherent dynamic inconsistency problems that TDT has relative to UDT (which are related to the relativism).

I believe the re-analysis I have provided of the thought experiment motivating logical counterfactuals is sufficient to refute the original interpretation, and thus to de-motivate logical counterfactuals.

The main problem with policy-dependent source code is that, since it violates linear interactive causality, analysis is correspondingly more difficult. Hence, there is further work to be done in considering simplified environment classes where possible simplifying assumptions (including linear interactive causality) can be made. It is critical, though, that the linear interactive causality assumption not be used in analyzing cases of an agent learning their source code, as this results in logical incoherence.