# 3

Counterfactuals
Personal Blog

Updateless decision theory was informally defined by Wei Dai in terms of logical conditional expected utility, where the condition corresponds to an algorithm (the agent) producing a given output (action or policy). This kind of conditional expected values can be formalised by optimal predictors. However, since optimal predictor systems which are required to apply optimal predictors to decision theory generally have random advice, we need counterfactuals well-defined for random algorithms i.e. algorithms that produce different outputs with different probabilities depending on internal coin tosses. We propose to define these counterfactuals by a generalization of the notion of conditional expected utility which amounts to linear regression of utility with respect to the probabilities of different outputs in the space of "impossible possible worlds." We formalise this idea by introducing "relative optimal predictors," prove the analogue of the conditional probability formula (which takes matrix form) and uniqueness theorems.

## Motivation

We start by explaining the analogous construction in classical probability theory and proceed to defining the logical counterpart in the Results section.

Consider a probability measure on some space, a random variable representing utility, a finite set representing possible actions and another random variable taking values in and satisfying representing the probabilities of taking different actions. For a deterministic algorithm, takes values allowing defining conditional expected utility as

In the general case, it is tempting to consider

where stands for the semidirect product of with , the latter regarded as a Markov kernel with target . However, this would lead to behavior similar to EDT since conditioning by is meaningful even for a single "world" (i.e. completely deterministic and ). Instead, we select that minimizes (we regard elements of as column vectors so is a row vector). This means has to satisfy the matrix equation

The solution to this equation is only unique when is non-degenerate. This corresponds to requiring positive probability of the condition for usual conditional expected values. In case takes values in , is the usual conditional expected value.

# Notation

We will suppress the indices associated with word ensembles and bischemes whenever possible to reduce clutter. When we do show explicitly, evaluation of functions will be written as rather than as before.

The concept of "proto-error space" is renamed to "error space" whereas the previous definition of "error space" is now called "radical error space".

We slightly extend the definition of "distributional estimation problem" as follows.

# Definition 1

A distributional estimation problem is a pair where is a word ensemble and is bounded.

The corresponding notion of optimal predictor is the following.

# Definition 2

Fix an space of rank 2 and a distributional estimation problem. Consider a -valued -bischeme taking bounded values. is called an -optimal predictor for when for any -valued -bischeme , there is s.t.

The theory of optimal predictors remains intact with these new definitions (although some care is wish to consider reflective systems with range instead of ).

The following concept of "orthogonal predictor" is often more convenient than "optimal predictor", since we no longer require our error spaces to be radical.

# Definition 3

Fix an error space of rank 2 and a distributional estimation problem. Consider a -valued -bischeme taking bounded values.

is called an -orthogonal predictor for when for any -valued -bischeme s.t. factors as , there is s.t.

The two concepts are tightly related due to the following

# Theorem 1

Fix an error space of rank 2 and a distributional estimation problem. Then any -orthogonal predictor for is an -optimal predictor for . Conversely any -optimal predictor for is an -orthogonal predictor for .

We skip the proof since it is almost identical to the orthogonality lemma.

## Results

The proofs of the results are found in the Appendix.

# Definition 4

A relative estimation problem is a quadruple where is a finite set, bounded and bounded.

# Definition 5

Fix an error space of rank 2. Consider a relative estimation problem and a -valued -bischeme. is called an -optimal predictor for when for any -valued -bischeme s.t. factors as , there is s.t.

The appearance of the term might seem as making the condition substantially weaker. However, we the condition is still sufficient strong to produce the uniqueness theorem which appears as Theorem 1 below. Also, the novelty disappears in the corresponding orthogonality condition.

# Definition 6

Fix an error space of rank 2. Consider a relative estimation problem and a -valued -bischeme. is called an -orthogonal predictor for when for any -valued -bischeme s.t. factors as , there is s.t.

# Theorem 2

Fix an error space of rank 2 and a relative estimation problem. Then any -orthogonal predictor for is an -optimal predictor for . Conversely any -optimal predictor for is an -orthogonal predictor for .

# Definition 7

Consider an error space space of rank . A set is called -negligible when .

# Note 1

The -negligible sets form a set-theoretic ideal on .

All finite sets are -negligible (thanks to the the condition ).

# Definition 8

Fix an error space of rank 2. Consider a finite set, a word ensemble, and -valued -bischemes. We say is -similar to relative to (denoted ) when there is an -negligible set and s.t.

# Note 2

It's impossible to remove from the definition since doesn't have to be bounded.

# Theorem 3

Fix an error space of rank 2. Consider a relative estimation problem and -orthogonal predictors for . Then .

Theorem 4 below is the orthogonal predictor analogue of the matrix equation from the previous section.

# Proposition 1

Consider an error space of rank , and . Then is an error space.

# Theorem 4

Fix an error space of rank 2. Consider a relative estimation problem, an -valued -bischeme, a -valued bischeme, and . Assume that for any , is always invertible and the operator norm of is at most . Assume further that is an -orthogonal predictor for (componentwise) and is a -orthogonal predictor for (componentwise). Define the -valued bischeme by . Then, is a -orthogonal predictor for .

# Note 3

It is always possible to choose to be symmetric in which case the operator norm of is the inverse of the lowest absolute value of an eigenvalue of .

Finally, we extend the stability result for usual conditional probabilities to the matrix setting.

# Definition 9

Fix an error space of rank 2. Consider a finite set, a word ensemble and -valued -bischemes. We say is -similar to relative to (denoted ) when there is an -negligible set and s.t.

# Theorem 5

Fix an error space of rank 2. Consider and , a relative estimation problem, a positive-definite -orthogonal predictor for and , -orthogonal predictors for . Assume that for any , the lowest eigenvalue of is always at least and . Then, .

The Corollary below shows that in simple situations these counterfactuals produce the values we expect.

# Definition 10

Fix an space of rank 2 and a distributional estimation problem. Consider a -valued -bischeme taking bounded values. is called an -perfect predictor for when .

# Corollary

Fix an error space of rank 2. Consider and , a relative estimation problem and bounded s.t. . Suppose is a -perfect predictor for . Suppose is a positive-definite -orthogonal predictor for s.t. for any , the lowest eigenvalue of is always at least . Suppose is a -orthogonal predictor for . Then, .

# Proof of Theorem 2

Consider an -optimal predictor for . Consider a -valued -bischeme with . For any and s.t. can be bounded by a polynomial (and hence its number of digits is logarithmic), we can define and the -valued -bischeme with and . We have s.t.

By choosing we get

Take where is a polynomial s.t. . We get

Conversely, suppose an -orthogonal predictor for . Consider a -valued -bischeme with . We have

# Proof of Theorem 3

Since is orthogonal, we have s.t.

Since is orthogonal, we have s.t.

It follows that

Define . so is -negligible. We get

# Proof of Proposition 1

Given

Given

Given polynomial s.t.

# Proof of Theorem 4

We know that and

Consider a -valued -bischeme s.t. factors as . We get

Since is a -orthogonal predictor for , there is s.t.

Since is an -orthogonal predictor for , there is s.t.

is bounded and the operator norm of is at most , therefore there is s.t. and

Putting everything together we get

# Proof of Theorem 5

According to Theorem 3, there is and which is -negligible s.t.

On the other hand, since is an -orthogonal predictor for , there is s.t.

Since , there is s.t.

Combining the two together, we conclude that