An environment for studying counterfactuals — AI Alignment Forum

Summary: I introduce a decision theory framework where successful agents are those with good counterfactuals.

Motivation

The problem of logical counterfactuals is how to define probabilities $P (ϕ | A () = a)$ when $A () = a$ is known to be false. (I'll ignore more general counterfactuals $P (ϕ | ψ)$ in this post.)

The theory of logical induction provides a joint distribution over sentences, so the problem becomes: How do you condition on $A () = a$ when $A () = a$ has negligible probability?

Exploration tries to solve this by making sure that $A () = a$ never has negligible probability. But it doesn't work in problems like Agent Simulates Predictor that contain predictors who can't tell when the agent explores.

A better solution is early exploration, which uses an early stage of the logical inductor to do exploration. But then the later stages of the inductor know that $A () = a$ is false, and we're back where we started.

I'm going to describe an environment that captures these features of the problem — it's got reflection, early exploration, counterfactuals, and a Bayesian update that stands in for the evolution of a logical inductor.

Informal definition

The agent outputs counterfactual distributions $p (U = u | A = a)$. This determines an expected utility for each action. Most of the time, an action is chosen for the agent that maximizes this expected utility. But a small fraction of the time, an exploration action is chosen instead.

The agent receives an observation $O$ as input, from which it can infer whether exploration will occur. The agent also receives a prior $P$ as input, and this prior accurately reflects the behavior of the agent as a function of $O$ and $P$. (This uses a fixed-point theorem.)

If action $a$ is chosen, then the counterfactual $p (U = u | A = a)$ is factual; the rest are counterfactual. We judge an agent according to how accurate its factual counterfactual is, in addition to how much utility it gets.

Here's an agent that does okay in this environment: It adopts $P$ as its epistemic state and ignores $O$. Because of exploration, it can compute counterfactuals by conditioning. This agent does okay but not great, since it ignores $O$.
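
As a minimal sketch of this conditioning agent, assume a finite set of utility values and a prior stored as a dictionary keyed by `(e, o, a, u)` tuples. The representation, the toy numbers, and the name `counterfactuals_by_conditioning` are illustrative assumptions, not part of the post:

```python
from collections import defaultdict

def counterfactuals_by_conditioning(prior, actions):
    """Return {a: {u: P(U=u | A=a)}} by conditioning the prior on A=a, ignoring O."""
    mass = defaultdict(float)                         # accumulates P(A=a)
    joint = defaultdict(lambda: defaultdict(float))   # accumulates P(A=a, U=u)
    for (e, o, a, u), p in prior.items():
        mass[a] += p
        joint[a][u] += p
    result = {}
    for a in actions:
        if mass[a] == 0.0:
            # Without exploration this can happen, and conditioning is undefined.
            raise ZeroDivisionError(f"P(A={a!r}) = 0; conditioning is undefined")
        result[a] = {u: p / mass[a] for u, p in joint[a].items()}
    return result

# Toy prior (hypothetical): eps = 0.1, two actions, one observation "o".
# Without exploration ("*") the agent takes action 1; utility equals the action.
prior = {
    ("*", "o", 1, 1.0): 0.90,
    (0, "o", 0, 0.0): 0.05,
    (1, "o", 1, 1.0): 0.05,
}
cf = counterfactuals_by_conditioning(prior, actions=[0, 1])
```

Every action has positive probability here only because of the exploration entries; drop them and the function raises.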

You could try to make a better agent as follows: Adopt $P$ as a prior and then do a Bayesian update on $O$ . But now you've inferred whether exploration occurs, so some actions have probability zero, and it's not clear how to compute counterfactuals.
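
To spell out the difficulty in symbols, suppose (as an illustrative special case) that the observation $o$ reveals $E = *$ with certainty and that $a^*$ is the unique maximizer of expected utility. Then

$$P (A = a \mid O = o) = \begin{cases} 1 & \text{if } a = a^* \\ 0 & \text{if } a \neq a^* \end{cases} \qquad \text{and} \qquad P (U = u \mid O = o, A = a) = \frac{P (U = u, A = a \mid O = o)}{P (A = a \mid O = o)},$$

so for every $a \neq a^*$ the ratio is $0 / 0$ and conditioning gives no answer.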

If you find a good agent for this environment, you'll probably have learned something about making good counterfactuals.

Formal definition

A decision problem consists of a tuple of random variables. Each letter below names both a random variable and the space it takes values in:

$A$ is a finite set of actions.
$E = A \cup \{*\}$ determines whether the agent explores. If $E = *$, no exploration takes place.
$O$ is a finite set of observations.
$U = \mathbb{R}$ is the space of utilities.
$P = Δ (E \times O \times A \times U)$ is the space of distributions over the above variables.
$R = \{0, 1\}^{\infty}$ represents an infinite source of random bits that the agent can use.
$C = Δ (U)^{A}$ is the agent's output, representing a counterfactual distribution $p (U = u | A = a)$ for each action.

along with some likelihoods:

$P (E = a) = \frac{ε}{| A |}$ for all $a \in A$, and $P (E = *) = 1 - ε$, for some small exploration probability $ε > 0$.
Likelihoods $P (O = o | E = e)$, depending on the problem.
i.i.d. uniform distributions on each bit of $R$.
$A$ is mostly determined by $E$ and $C$ as follows: If $E = *$, then $A = \operatorname{argmax}_{a} \mathbb{E}_{C (a)} [U]$. (If there is a tie, $A$ is undetermined.) Otherwise, $A = E$.
A distribution over $U$ conditional on each value of $O$ and $A$, depending on the problem.
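
The rules for $E$ and $A$ can be sketched as follows. This assumes finite-support counterfactuals stored as `{u: probability}` dictionaries and uses the string `"*"` for the no-exploration value; the function names are hypothetical:

```python
import random

def expected_utility(dist):
    """Expected utility of a finite distribution {u: probability}."""
    return sum(u * p for u, p in dist.items())

def sample_exploration(actions, eps, rng):
    """Sample E: each action with probability eps/|A|, '*' with probability 1 - eps."""
    if rng.random() < eps:
        return rng.choice(actions)
    return "*"

def choose_action(e, counterfactuals):
    """Apply the rule for A: follow E when exploring, else maximize expected utility.

    Returns None on a tie, mirroring 'A is undetermined' in the text.
    """
    if e != "*":
        return e
    scores = {a: expected_utility(d) for a, d in counterfactuals.items()}
    best = max(scores.values())
    winners = [a for a, s in scores.items() if s == best]
    return winners[0] if len(winners) == 1 else None

# Hypothetical counterfactuals C for two actions.
C = {0: {0.0: 1.0}, 1: {0.0: 0.5, 1.0: 0.5}}
greedy = choose_action("*", C)   # no exploration: the argmax action
forced = choose_action(0, C)     # exploration overrides the argmax
```

Ties are left unresolved here on purpose, since the tiebreaker is only pinned down later by the fixed-point conditions.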

An agent is a function $P \times O \times R \to C$. A decision problem and an agent together almost determine a joint distribution over all the variables. What's missing is $P$ and tiebreakers for $A$. These are determined by finding a fixed point satisfying:

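If $\mathbb{P}$ is the resulting marginal distribution over $E \times O \times A \times U$, then $\mathbb{P} = P$; that is, the prior handed to the agent agrees with the distribution the agent actually induces.
For each $o \in O$ and $r \in R$, there is a distribution over the set $\operatorname{argmax}_{a} \mathbb{E}_{C (a)} [U]$ such that $A$ is sampled from that distribution.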
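
The first condition can be checked numerically on a toy problem. The sketch below assumes finite utilities, a deterministic agent that ignores its random bits, and no ties, so it does not model the tiebreaking distributions of the second condition. All names and numbers are hypothetical:

```python
def conditioning_agent(prior, o):
    """Adopt the prior, ignore o, and return {a: P(U | A=a)} by conditioning."""
    mass, joint = {}, {}
    for (e, _o, a, u), p in prior.items():
        mass[a] = mass.get(a, 0.0) + p
        joint.setdefault(a, {})
        joint[a][u] = joint[a].get(u, 0.0) + p
    return {a: {u: p / mass[a] for u, p in us.items()} for a, us in joint.items()}

def induced_joint(agent, prior, p_explore, p_obs, p_util, actions):
    """Joint over (e, o, a, u) induced when the agent is handed `prior`.

    p_explore: {e: P(E=e)}; p_obs: {(o, e): P(O=o | E=e)};
    p_util: {(o, a): {u: P(U=u | O=o, A=a)}}.
    """
    joint = {}
    for e, pe in p_explore.items():
        for (o, e2), po in p_obs.items():
            if e2 != e or po == 0.0:
                continue
            c = agent(prior, o)
            if e == "*":   # no exploration: take the expected-utility maximizer
                a = max(actions, key=lambda x: sum(u * p for u, p in c[x].items()))
            else:          # exploration: the action is forced
                a = e
            for u, pu in p_util[(o, a)].items():
                key = (e, o, a, u)
                joint[key] = joint.get(key, 0.0) + pe * po * pu
    return joint

def is_fixed_point(agent, prior, *problem, tol=1e-9):
    """True if the joint induced by the agent matches the prior it was given."""
    induced = induced_joint(agent, prior, *problem)
    keys = set(prior) | set(induced)
    return all(abs(prior.get(k, 0.0) - induced.get(k, 0.0)) <= tol for k in keys)

# Toy problem: eps = 0.1, one observation "o", utility equals the action taken.
actions = [0, 1]
p_explore = {"*": 0.9, 0: 0.05, 1: 0.05}
p_obs = {("o", "*"): 1.0, ("o", 0): 1.0, ("o", 1): 1.0}
p_util = {("o", 0): {0.0: 1.0}, ("o", 1): {1.0: 1.0}}
problem = (p_explore, p_obs, p_util, actions)

good = {("*", "o", 1, 1.0): 0.9, (0, "o", 0, 0.0): 0.05, (1, "o", 1, 1.0): 0.05}
bad = {("*", "o", 0, 0.0): 0.9, (0, "o", 0, 0.0): 0.05, (1, "o", 1, 1.0): 0.05}
good_ok = is_fixed_point(conditioning_agent, good, *problem)
bad_ok = is_fixed_point(conditioning_agent, bad, *problem)
```

In this toy problem the prior `good` predicts the action the agent actually takes, while `bad` predicts action 0 even though the agent, given `bad`, would pick action 1.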

(I might prove the existence of a fixed point in a comment.)

We'll informally say that an agent does well on a decision problem if, for every fixed point, the following are true:

$\mathbb{E} [U]$ is high.
The factual counterfactual is accurate: say, it's close, in total variation distance $δ$, to the marginal over $U$ conditional on the true action, meaning that $\mathbb{E} [δ (C (a), P (\cdot | O = o, A = a)) | O = o, A = a]$ is small.
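
For distributions with finite support, the total variation distance used in the accuracy criterion is half the sum of absolute differences in probability. A minimal sketch, with distributions stored as `{outcome: probability}` dictionaries (an assumed representation):

```python
def total_variation(p, q):
    """Total variation distance between finite distributions {outcome: probability}."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

# Hypothetical example: the agent's factual counterfactual versus the true
# conditional distribution over utilities for the action that was taken.
predicted = {0.0: 0.2, 1.0: 0.8}
actual = {1.0: 1.0}
error = total_variation(predicted, actual)   # approximately 0.2
```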

Future work

I have an idea for defining an optimal agent for every decision problem in this family; I'll explore that in another post.

Once we find a general solution, we'd ideally transfer it to the setting of logical induction, and then we'd have logical counterfactuals.