Summary: Take the argmax of counterfactual expected utility, using a thin counterfactual which itself maximizes a priori expected utility.

Followup to: An environment for studying counterfactuals, Counterfactuals, thick and thin
Replacement for: Logical counterfactuals and differential privacy

In An environment for studying counterfactuals, I described a family of decision problems that elicit counterfactuals from an agent. In this post I'll redefine that family using reflective oracles, and define an optimal agent. I was aided in this work by discussions with Alex Appell, James Brooks, Abram Demski, Linda Linsefors, and Alex Mennen.

The problem

Here's the (not quite) Bayes net for the problem. I changed it a little from the last post, to make the thin counterfactuals more explicit:

$\xymatrix{& E \ar[dr] \ar@/_3pc/[ddddr] \ar@/1pc/@{-->}[ddl] & & S \ar[dl] \ar@/^3pc/[dddddl] \& & O \ar[dd] \ar@//@{-->}[dll] & \\bullet \ar@{-->}[r] & P \ar[rd] & & \& & C \ar[d] & \& & A \ar[d] \ar@/^/@{-->}[uull] & \& & U \ar@/^1pc/@{-->}[uuull] &}$

$E$ controls exploration and $S$ is a hidden state. The observation $O$ is isomorphic as a random variable to $E \times S$ , but the particular isomorphism $^κ$ used by the problem is unknown to the agent.

The agent is given $O$ and a prior $P$ ; it outputs a set $C$ of counterfactual expected utilities, one for each action. The agent's action $A$ is equal to the argmax of these expectations, unless it's overridden by $E$ . The action and hidden state determine the utility $U$ .

The dotted lines mean that $P$ is equal to the marginal over $E \times O \times A \times U$ . The machinery of reflective oracles resolves the circularity.

Now I'll define this distribution in detail, using Python-like pseudocode. I ought to use a probabilistic programming language, but I don't know any of those. Note that if you have a reflective oracle, you can compute expectations of distributions in constant time, and you can compute argmaxes over sets of real numbers, with ties broken in an undetermined way. Real numbers and distributions can be represented by programs.

$e n v i r o n m e n t ()$

The function $e n v i r o n m e n t ()$ takes an agent and outputs a sample from a joint distribution over $E \times O \times A \times U$ . The agent is a function that takes the environment and an observation and stochastically returns a set of counterfactual expectations, which is itself encoded as a function from actions to real numbers.

The environment samples $s$ from a uniform distribution. Usually $e$ is equal to the symbol $*$ , meaning no exploration; but with probability $ε$ it's a uniformly random action. It computes $o$ using the secret isomorphism $^κ$ . It calls the agent to get its counterfactual expectations. It argmaxes over those expectations to get the action, unless $e$ overrides that. Finally, $u$ is sampled from a problem-dependent likelihood function that depends on $a$ and $s$ .

define $e n v i r o n m e n t (a g)$ :

sample $s \sim U n i f (S)$

sample $e \sim ε U n i f (A) + (1 - ε) δ_{*}$

$o :=^κ (e, s)$

$c := a g (c u r r y (e n v i r o n m e n t, a g), o)$

if $e \in A$ : $a := e$

else: sample $a \sim {a r g m a x}_{a} c (a)$

sample $u \sim U | A = a, S = s$

return $(e, o, a, u)$

The agent

The agent's strategy is to choose a thin counterfactual that maximizes its a priori expected utility. We'll define a couple helper functions first.

$c f a c t ()$

The function $c f a c t ()$ takes a prior $P \in Δ (E \times O \times A \times U)$ , a thin counterfactual $κ : E \times S \approx O$ , an observation $o \in O$ , and an action $a \in A$ , and deterministically outputs the expected utility conditional on $o$ and counterfactually conditional on $a$ , taking the expectation with respect to $P$ and the counterfactual with respect to $κ$ .

This is how to compute a thin counterfactual distribution: First condition on $O = o$ and use $κ$ to infer a posterior on $S$ . Then forget $o$ but retain your posterior on $S$ . Then condition on $A = a$ . The reason this works is that $S$ is independent of $E$ , so there's still a chance for $A = a$ by exploration.

define $c f a c t (P, κ, o, a)$ :

$s = {p r}_{S} κ^{- 1} (o)$

return $(E_{P} [U | O = κ (a, s)] P_{P} (E = a) +$

$E_{P} [U | O = κ (*, s), A = a] P_{P} (A = a | O = κ (*, s)) P_{P} (E = *)) /$

$(P_{P} (E = a) + P_{P} (A = a | O = κ (*, s)) P_{P} (E = *))$

$u t i l i t y ()$

This function takes a prior and a thin counterfactual, and samples from the marginal on $U$ that would result if the agent used that thin counterfactual. The previous sentence uses the word "would", so you might think we need another thin counterfactual to make sense of it; but our agent is a program, and programs come with their own "thick" notion of counterfactual.

The calculation goes like this: Sample $e$ and $o$ . If $e$ makes us explore, then that determines the action and we can sample a utility conditional on $e$ and $o$ . Otherwise, we assume the action is chosen to maximize $c f a c t ()$ .

define $u t i l i t y (P, κ)$ :

sample $e, o \sim P$

if $e \in A$ : return $u \sim P | E = e, O = o$

else: return ${m a x}_{a} c f a c t (P, κ, o, a)$

$a g e n t ()$

The agent is given a prior and an observation, and outputs the expected utility conditional on $A = a$ , for every $a$ . This output is encoded in a program that takes an action and outputs a real number. The output may be stochastic because $a g e n t ()$ contains an argmax which may be stochastic in case of ties.

The true thin counterfactual $^κ : E \times S \approx O$ determines a projection $O \to E$ . The agent can deduce this projection by analyzing the prior: For every $o$ , $P_{P} (E = e | O = o) = 0$ for all but one value of $e$ .

The argmax is taken over all isomorphisms $κ : E \times S \approx O$ that commute with projecting to $E$ . I won't bother writing that out in the pseudocode:

define $a g e n t (P, o)$ :

$κ := {a r g m a x}_{κ} E_{P} [u t i l i t y (P, κ)]$

return $c u r r y (c f a c t, P, κ, o)$

Optimality theorems

One of the counterfactual expectations that the agent emits is factual. It's reasonable to ask whether that one is accurate.

Theorem: If $U$ is bounded by $U_{m a x}$ and $E = *$ , then with probability $\geq 1 - \sqrt{ε}$ , if we sample $a \sim A$ , $s \sim S$ , $o \sim O$ , we have

$∣ ∣ ∣ a g e n t (P, o) (a) - E [U | A = a, S = s] ∣ ∣ ∣ < 4 U_{m a x} \sqrt{ε}$

Proof: $a g e n t (P, o) (a) = c f a c t (P, κ, o, a)$ for some $κ$ . That equals

$(E [U | O = κ (a, s)] P (E = a) + E [U | O = κ (*, s), A = a] P [A = a | O = κ (*, s)) P (E = *)) / D$

$= (E [U | O = κ (a, s)] \frac{ε}{| A |} + E [U | O = o, A = a] P [A = a | O = o) (1 - ε)) / D$

where $s = {p r}_{S} κ^{- 1} (o)$ and $D = \frac{ε}{| A |} + P (A = a | O = o) (1 - ε)$ . With a little algebra, that gives us

$∣ ∣ ∣ a g e n t (P, o) (a) - E [U | A = a, S = s] ∣ ∣ ∣ < \frac{2 U_{m a x} ε}{ε + (1 - ε) | A | P (A = a | O = o)}$

By the following lemma, with probability $\geq 1 - \sqrt{(} ε)$ , we have $P (A = a | O = o) > \frac{\sqrt{ε}}{| A |}$ . A little more algebra concludes the proof. $■$

Lemma: Sample $x, y$ from a joint distribution over $X \times Y$ . With probability $\geq 1 - ε$ , we have $P (X = x | Y = y) > \frac{ε}{| X |}$ .

Now we'll prove that $a g e n t ()$ gets an optimal amount of utility. First, a couple lemmas.

Lemma: $E [U] = {m a x}_{κ} E [u t i l i t y (P, κ)]$

Proof: By the definition of $e n v i r o n m e n t ()$ ,

$E [U] = Σ_{a} E [U | E = a] P (E = a) + (1 - ε) E_{o | e = *} {m a x}_{a} a g e n t (P, o) (a)$

By definition of $a g e n t ()$ , this is

$= \frac{ε}{| A |} Σ_{a} E [U | E = a] + (1 - ε) E_{o | e = *} E_{κ} {m a x}_{a} c f a c t (P, κ, o, a)$

where the expectation $E_{κ}$ is taken over the output distribution of ${a r g m a x}_{κ} E [u t i l i t y (P, κ)]$ . By linearity of expectation,

$= E_{κ} [\frac{ε}{| A |} Σ_{a} E [U | E = a] + (1 - ε) E_{o | e = *} {m a x}_{a} c f a c t (P, κ, o, a)]$

This is just

$= E_{κ} [u t i l i t y (P, κ)]$

$= {m a x}_{κ} E [u t i l i t y (P, κ)] ■$

Recall that $^κ$ is the secret isomorphism $E \times S \approx O$ . The next lemma says that $u t i l i t y (P,^κ)$ represents the highest utility that an agent can get:

Lemma: $E [u t i l i t y (P,^κ)] = \frac{ε}{| A |} Σ_{a} E [U | A = a] + (1 - ε) E [U | A = {a r g m a x}_{a} E [U | A = a, S = s]]$

Proof: Using the fact that

$E [U | O =^κ (a, s)] = E [U | A = a, S = s]$

and

$E [U | O =^κ (*, s), A = a] = E [U | A = a, S = s]$

and the definition of $c f a c t ()$ , we have

$c f a c t (P,^κ, o, a) = E [U | A = a, S = s]$ .

Plugging that in to the definition of $u t i l i t y ()$ , we get

$u t i l i t y (P,^κ) = \frac{ε}{| A |} Σ_{a} E [U | A = a] + (1 - ε) E {m a x}_{a} E [U | A = a, S = s]$

$= \frac{ε}{| A |} Σ_{a} E [U | A = a] + (1 - ε) E [U | A = {a r g m a x}_{a} E [U | A = a, S = s]] ■$

Finally, this theorem says that the agent gets the highest possible utility:

Theorem: $E [U] = \frac{ε}{| A |} Σ_{a} E [U | A = a] + (1 - ε) E [U | A = {a r g m a x}_{a} E [U | A = a, S = s]]$

Proof: Call the right-hand side $U_{o p t}$ . It's easy to see that $E [U] \leq U_{o p t}$ . By the previous two lemmas, we also have

$E [U] = {m a x}_{κ} E [u t i l i t y (P, κ)] \geq E [u t i l i t y (P,^κ)] = U_{o p t} ■$

What about Troll Bridge?

Troll Bridge is a problem that punishes exploration. You can't represent it in this framework because the action screens off exploration. But if you modified the framework to implement Troll Bridge, the agent would fail. (It would be afraid to cross the bridge.)

I suspect the solution to Troll Bridge is to use continuous exploration: Explore at every timestep of a logical inductor. If Troll Bridge punishes exploration at timestep $n$ , then exploration at $n - 1$ succeeds. I haven't worked this out yet.

What about counterfactual mugging?

This agent doesn't get counterfactually mugged. (This is bad.) This agent only exhibits one of the two types of updatelessness.

Next steps

The notion of "thin counterfactual" used here is more specific than it needs to be. I'd like to generalize it.

Once I do that, I might be able to formulate the symmetric Prisoner's Dilemma. I'm not sure how this agent does on that problem.

I also want to implement this agent in the setting of logical induction. I'm imagining traders getting licenses to participate in thin counterfactual markets, and the licenses prohibit them from learning the price of $A = a$ . This will also involve some kind of continuous exploration.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

4

Counterfactuals and reflective oracles

4

The problem

$e n v i r o n m e n t ()$

The agent

$c f a c t ()$

$u t i l i t y ()$

$a g e n t ()$

Optimality theorems

What about Troll Bridge?

What about counterfactual mugging?

Next steps

4

Counterfactuals and reflective oracles

4

The problem

environment()

The agent

cfact()

utility()

agent()

Optimality theorems

What about Troll Bridge?

What about counterfactual mugging?

Next steps

$e n v i r o n m e n t ()$

$c f a c t ()$

$u t i l i t y ()$

$a g e n t ()$