Logical counterfactuals and differential privacy — AI Alignment Forum

Edit: This article has major flaws. See my comment below.

This idea was informed by discussions with Abram Demski, Scott Garrabrant, and the MIRIchi discussion group.

Summary

For a logical inductor $P$, define logical counterfactuals by

$P_{n} (ϕ | ψ) := \sum_{y} P_{k} (ϕ | ψ \land Y = y) P_{n} (Y = y)$

for a suitable $k < n$ and a random variable $Y$ independent of $ψ$ with respect to $P_{k}$. Using this definition, one can construct agents that perform well in ASP-like problems.

Motivation

Recall the Agent Simulates Predictor problem:

$U_{n} = 10^{6} P_{n - 1} (A_{n} = 1) + 10^{3} \mathbb{1} (A_{n} = 2)$

Naively, we want to solve this by argmaxing:

$A_{n} = \operatorname{argmax}_{a} E_{n} [U_{n} | A_{n} = a]$

Hopefully, $P_{n} (A_{n} = 1) \approx 1$, $P_{n - 1} (A_{n} = 1) \approx 1$, and $E_{n} [U_{n} | A_{n} = 1] \approx 10^{6}$. Also, two-boxing should be less attractive than one-boxing:

$E_{n} [U_{n} | A_{n} = 2] \approx 10^{3}$

However, if we make this well-defined with $ε$-exploration, we'll get

$E_{n} [U_{n} | A_{n} = 2] \approx 10^{6} + 10^{3}$

and then the agent will two-box, contradiction. Instead we'd like to use predictable exploration and set

$E_{n} [U_{n} | A_{n} = 2] := E_{k} [U_{n} | A_{n} = 2]$

for $k$ small enough that the right-hand side is sensible. Let's see how.
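The failure of $ε$-exploration described above comes down to a small piece of arithmetic, which can be checked with a minimal sketch. It assumes a hypothetical exploration rate `epsilon` and a predictor whose belief `predictor_belief` is the same whether or not the current run is an exploration step. The function name `asp_utility` and the numbers are illustrative and do not come from the post.

```python
from fractions import Fraction


def asp_utility(predictor_prob_one_box, action):
    """U_n = 10^6 * P_{n-1}(A_n = 1) + 10^3 * 1(A_n = 2)."""
    return 10**6 * predictor_prob_one_box + 10**3 * (1 if action == 2 else 0)


# Hypothetical exploration rate (an assumption for illustration).
epsilon = Fraction(1, 100)
# The predictor expects one-boxing except on rare exploration steps, and it
# cannot tell in advance which steps those are.
predictor_belief = 1 - epsilon

value_one_box = asp_utility(predictor_belief, 1)
value_two_box = asp_utility(predictor_belief, 2)
print(value_one_box, value_two_box)
```

Because the predictor's belief is held fixed across both actions, two-boxing always looks better by exactly $10^{3}$, which is what pushes the naive agent toward two-boxing.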

Predictable exploration

Choose $k ≪ n$ so that $P_{k} (A_{n} = 2) ≫ 0$. Our agent decides whether to explore at stage $k$, and uses its beliefs at stage $k$ as a substitute for counterfactuals:

$\begin{aligned}
\text{explore}_{0} &:= P_{k} (\text{explore}_{0}) < ε \\
\text{explore}_{1} &:= \begin{cases} 1 & \text{if } P_{k} (\text{explore}_{1} = 1) < \frac{1}{2} \\ 2 & \text{otherwise} \end{cases} \\
\forall a \quad E_{n} (ϕ | A = a) &:= E_{k} (ϕ | A = a \land \text{explore}_{0}) \quad \text{if } P_{n} (A = a) < δ \\
A_{n} &:= \begin{cases} \text{explore}_{1} & \text{if } \text{explore}_{0} \\ \operatorname{argmax}_{a} E_{n} [U_{n} | A_{n} = a] & \text{otherwise} \end{cases}
\end{aligned}$

Here $ε, δ$ are small positive numbers. It's easy to see that, under reasonable assumptions, this agent 1-boxes on Agent Simulates Predictor. But it can't use the full strength of $P_{n}$ in its counterfactual reasoning, and this is a problem.

Differential privacy

To illustrate the problem, add a term to the utility function that sometimes rewards two-boxing:

$\begin{aligned}
U_{n} &= 10^{6} P_{n - 1} (A_{n} = 1) + 10^{3} \mathbb{1} (A_{n} = 2) + 10^{6} \mathbb{1} (A_{n} = 2 \land X_{n - 1}) \\
X_{n - 1} &:= P_{n - 1} (X_{n - 1}) < \frac{1}{2}
\end{aligned}$

The agent should two-box if and only if $X$. Assuming that's the case, and $P_{n - 1}$ knows this, we have:

So if $\neg X_{n - 1}$, two-boxing is the more attractive option, which is a contradiction. (I'm rounding $ε$ to zero for simplicity.)

The problem is that the counterfactual has to rely on $P_{k}$'s imperfect knowledge of $X_{n - 1}$. We want to combine $P_{k}$'s ignorance of $\text{explore}_{0}$ with $P_{n}$'s knowledge of $X_{n - 1}$.

If $X$ is independent of $A$ conditioned on $\text{explore}_{0}$ with respect to $P_{k}$, then we can do this:

$\begin{aligned}
E_{k} [U | A = a \land \text{explore}_{0}] &= \sum_{x} E_{k} [U | A = a \land \text{explore}_{0} \land X = x] P_{k} (X = x | A = a \land \text{explore}_{0}) \\
&= \sum_{x} E_{k} [U | A = a \land \text{explore}_{0} \land X = x] P_{k} (X = x | \text{explore}_{0})
\end{aligned}$

Then replace $P_{k} (X = x | \text{explore}_{0})$ with $P_{n} (X = x | \text{explore}_{0})$:

$E_{n} [U | A = a \land \text{explore}_{0}] := \sum_{x} E_{k} [U | A = a \land \text{explore}_{0} \land X = x] P_{n} (X = x | \text{explore}_{0})$

This is more accurate than $E_{k} [U | A = a \land \text{explore}_{0}]$, and unbiased.
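The re-weighting in the definition above can be illustrated with a minimal sketch: the stage-$k$ conditional expectations are kept fixed, and only the distribution over $X$ is swapped for the sharper stage-$n$ one. The function name `reweighted_expectation` and every number below are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def reweighted_expectation(cond_exp_k, dist_over_x):
    """Return sum_x E_k[U | A=a, explore_0, X=x] * P(X=x | explore_0).

    cond_exp_k:   maps each value x to the stage-k conditional expectation.
    dist_over_x:  maps each value x to its probability (stage k or stage n).
    """
    if abs(sum(dist_over_x.values()) - 1) > 1e-9:
        raise ValueError("probabilities over X must sum to 1")
    return sum(cond_exp_k[x] * p for x, p in dist_over_x.items())


# Hypothetical stage-k conditional expectations of U for one fixed action a.
cond_exp_k = {True: 2_001_000, False: 1_000}

# Stage k is ignorant of X; stage n (hypothetically) knows that X holds.
p_k = {True: Fraction(1, 2), False: Fraction(1, 2)}
p_n = {True: Fraction(1), False: Fraction(0)}

stage_k_estimate = reweighted_expectation(cond_exp_k, p_k)
stage_n_estimate = reweighted_expectation(cond_exp_k, p_n)
print(stage_k_estimate, stage_n_estimate)
```

With these toy numbers the stage-$k$ mixture averages over both values of $X$, while the re-weighted estimate uses the later belief about $X$ and leaves the counterfactual conditioning on $A = a$ at stage $k$.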

If $X$ is not independent of $A$ conditional on $\text{explore}_{0}$, we can introduce an auxiliary variable and construct a version of $X$ that is independent. This construction is a solution to the following differential privacy problem: Make a random variable $Y$ that is a function of $X$ and independent randomness, maximizing the conditional mutual information $H (X; Y | A)$, subject to the constraint that $A$ is independent of $Y$. Using the identity

$H (X | A) = H (X; Y | A) + H (X | A Y)$

we see that the maximum is attained when $H (X | A Y) = 0$, which means that $X$ is a function of $A$ and $Y$.
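The identity and the zero-entropy condition can be checked numerically on small finite distributions. The sketch below is only a sanity check under stated assumptions: the joint distributions `joint` and `joint_function` are made-up toy examples keyed by `(a, x, y)`, and the helper names are hypothetical. The conditional mutual information is computed directly from its definition, so the identity is not built into the check.

```python
from collections import defaultdict
from math import isclose, log2


def marginal(joint, idx):
    """Marginalize a joint distribution keyed by tuples onto positions idx."""
    out = defaultdict(float)
    for key, p in joint.items():
        out[tuple(key[i] for i in idx)] += p
    return dict(out)


def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def cond_entropy(joint, target, given):
    """H(target | given) = H(given, target) - H(given)."""
    return entropy(marginal(joint, given + target)) - entropy(marginal(joint, given))


def cond_mutual_info(joint):
    """Conditional mutual information of X and Y given A, from the definition."""
    p_a = marginal(joint, (0,))
    p_ax = marginal(joint, (0, 1))
    p_ay = marginal(joint, (0, 2))
    total = 0.0
    for (a, x, y), p in joint.items():
        if p > 0:
            total += p * log2(p * p_a[(a,)] / (p_ax[(a, x)] * p_ay[(a, y)]))
    return total


# Toy joint distribution over (a, x, y); the numbers are arbitrary.
joint = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.15,
    (1, 0, 0): 0.05, (1, 0, 1): 0.20, (1, 1, 0): 0.15, (1, 1, 1): 0.10,
}
lhs = cond_entropy(joint, (1,), (0,))
rhs = cond_mutual_info(joint) + cond_entropy(joint, (1,), (0, 2))

# Toy case where X = A xor Y, with Y uniform and independent of A.
joint_function = {(a, a ^ y, y): 0.25 for a in (0, 1) for y in (0, 1)}
residual = cond_entropy(joint_function, (1,), (0, 2))
```

In the second toy distribution $X$ is a function of $A$ and $Y$, so the residual entropy is zero and all of $H (X | A)$ is carried by $Y$.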

Now here's the construction of $Y$:

Let $\mathcal{X}$ be the finite set of possible values of $X$, and let $\mathcal{A}$ be the finite set of possible values of $A$. We'll iteratively construct a set $\mathcal{Y}$ and define a random variable $Y$ taking values in $\mathcal{Y}$. To start with, let $\mathcal{Y} = \emptyset$.

Now choose

$(a, x) := \operatorname{argmin}_{\substack{a \in \mathcal{A},\ x \in \mathcal{X} \\ P (X = x, Y \notin \mathcal{Y} | A = a) > 0}} P (X = x, Y \notin \mathcal{Y} | A = a)$

and for each $a^{'} \in \mathcal{A} \setminus \{a\}$, choose some $f (a^{'}) \in \mathcal{X}$ such that $P (X = f (a^{'}), Y \notin \mathcal{Y} | A = a^{'}) > 0$. Then make a random binary variable $T_{a^{'}}$ such that

$P (T_{a^{'}} \land X = f (a^{'}) \land Y \notin \mathcal{Y} | A = a^{'}) = P (X = x \land Y \notin \mathcal{Y} | A = a)$

This is possible because, by the minimality of $(a, x)$, the right-hand side is at most $P (X = f (a^{'}), Y \notin \mathcal{Y} | A = a^{'})$. Then let $y$ be the event defined by

$(X = x \land Y \notin \mathcal{Y} \land A = a) \lor \bigvee_{a^{'} \in \mathcal{A} \setminus \{a\}} (T_{a^{'}} \land X = f (a^{'}) \land Y \notin \mathcal{Y} \land A = a^{'})$

and add $y$ to $\mathcal{Y}$. After repeating this process $| \mathcal{X} | | \mathcal{A} |$ times, we are done.
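For ordinary finite probability distributions, the iterative construction above can be sketched in a few lines. This is a minimal illustration under assumptions, not the logical-inductor version: the conditional distributions are supplied as exact `Fraction` values, each value $y$ is represented as a pair `(mass, assignment)` instead of an event, and the loop stops as soon as all probability mass is assigned. The function name `build_independent_variable` and the example distribution are hypothetical.

```python
from fractions import Fraction


def build_independent_variable(cond):
    """cond[a][x] = P(X = x | A = a), given as exact Fractions.

    Returns a list of atoms (mass, assignment). Each atom is one value y of Y:
    P(Y = y | A = a) = mass for every a, and X = assignment[a] when
    A = a and Y = y.
    """
    for a, dist in cond.items():
        if sum(dist.values()) != 1:
            raise ValueError(f"P(X | A = {a!r}) does not sum to 1")
    # remaining[a][x] plays the role of P(X = x, Y not yet assigned | A = a).
    remaining = {a: dict(dist) for a, dist in cond.items()}
    atoms = []
    while True:
        positive = [(m, a, x) for a, dist in remaining.items()
                    for x, m in dist.items() if m > 0]
        if not positive:
            return atoms
        # The pair (a, x) with the smallest positive remaining mass.
        mass, a_min, x_min = min(positive, key=lambda t: t[0])
        assignment = {a_min: x_min}
        for a_other, dist in remaining.items():
            if a_other != a_min:
                # Any x with positive remaining mass works as f(a'); by
                # minimality it has at least `mass` left to give.
                assignment[a_other] = next(x for x, m in dist.items() if m > 0)
        for a, x in assignment.items():
            remaining[a][x] -= mass
        atoms.append((mass, assignment))


# Hypothetical conditional distributions of X given A.
cond = {
    1: {"T": Fraction(1, 2), "F": Fraction(1, 2)},
    2: {"T": Fraction(1, 4), "F": Fraction(3, 4)},
}
atoms = build_independent_variable(cond)
```

Each atom has the same conditional probability under every value of $A$, so the resulting variable is independent of $A$, and within an atom the value of $X$ is determined by $A$. Every iteration exhausts at least one pair $(a, x)$, which is why the number of atoms never exceeds the number of such pairs.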

We can do this with a logical inductor as well. In general, to get a sentence $T$ such that $P_{k} (T \land B | C) \approx p$, take $T := P_{k} (T \land B | C) < p \land B \land C$.

Now given random variables $U$ and $A$, and some informative sentences $ϕ_{1}, \dots, ϕ_{ℓ}$, let $X \in \{T, F\}^{ℓ}$ be the random variable encoding the values of $ϕ_{1}, \dots, ϕ_{ℓ}$. The above construction works approximately and conditional on $\text{explore}_{0}$ to give us a random variable $Y$ that is approximately independent of $A$ conditional on $\text{explore}_{0}$ with respect to $P_{k}$. Now we define

$E_{n} [U | A = a] := \sum_{y} E_{k} [U | A = a \land \text{explore}_{0} \land Y = y] P_{n} (Y = y | \text{explore}_{0})$

whenever $P_{n} (A = a) < δ$.

This succeeds on the problem at the beginning of this section: Assume $A_{n} = 2 \leftrightarrow X_{n - 1}$, and assume that $P_{n - 1}$ knows this. Then:

$\begin{aligned}
P_{n - 1} (A_{n} = 1) &= \frac{1}{2} \\
\neg X_{n - 1} \to E_{n} [U_{n} | A_{n} = 1] &= \frac{1}{2} 10^{6} \\
\neg X_{n - 1} \to E_{n} [U_{n} | A_{n} = 2] &= E_{k} [U_{n} | A_{n} = 2 \land \text{explore}_{0} \land \neg X_{n - 1}] = 10^{3} \\
X_{n - 1} \to E_{n} [U_{n} | A_{n} = 2] &= \frac{1}{2} 10^{6} + 10^{3} + 10^{6} \\
X_{n - 1} \to E_{n} [U_{n} | A_{n} = 1] &= E_{k} [U_{n} | A_{n} = 1 \land \text{explore}_{0} \land X_{n - 1}] = 10^{6}
\end{aligned}$

which does not lead to contradiction. In fact, there are agents like this that do at least as well as any constant agent:

Theorem

Let $U_{n} (P, A)$ be a utility function defined with metasyntactic variables $n$, $P$, and $A$. It must be computable in polynomial time as a function of $A$, $P_{f_{i} (n)} (A = a)$, and $X := P_{f_{i} (n)} (X) < p$, where the $f_{i}$ can be any polytime functions that don't grow too slowly and that satisfy $f_{i} (n) < n$. Then there exists a logical inductor $P$ such that for every $n$, there exist $k < n$, $ε, δ > 0$, and a pseudorandom variable $Y$ such that the agent $A$ defined below performs at least as well on $U_{n}$ as any constant agent, up to a margin of error that approaches $0$ as $n \to \infty$:

$\begin{aligned}
\text{explore}_{0} &:= P_{k} (\text{explore}_{0}) < ε \\
\text{explore}_{1} &:= \begin{cases} a_{1} & \text{if } P_{k} (\text{explore}_{1} = a_{1}) < \frac{1}{ℓ}; \text{ else} \\ a_{2} & \text{if } P_{k} (\text{explore}_{1} = a_{2}) < \frac{1}{ℓ}; \text{ else} \\ \vdots \\ a_{ℓ} & \text{otherwise} \end{cases} \\
\forall a \quad E_{n} (ϕ | A = a) &:= \sum_{y} E_{k} (ϕ | A = a \land \text{explore}_{0} \land Y = y) P_{n} (Y = y | \text{explore}_{0}) \quad \text{if } P_{n} (A = a) < δ \\
A_{n} &:= \begin{cases} \text{explore}_{1} & \text{if } \text{explore}_{0} \\ \operatorname{argmax}_{a} E_{n} [U_{n} | A_{n} = a] & \text{otherwise} \end{cases}
\end{aligned}$

Proof sketch

Choose $k$ smaller than the strength parameter of the weakest predictor in $U_{n}$. If $a_{n}$ is the best constant policy for $U_{n}$, assume $A_{n} = a_{n}$. Since $P_{n}$ can compute $U_{n}$, our agent's factual estimate $E_{n} [U_{n} | A_{n} = a_{n}]$ is accurate, and the counterfactual estimate $E_{n} [U_{n} | A_{n} = a^{'}]$ for $a^{'} \neq a_{n}$ is an accurate estimate of the utility assigned to the constant policy $a^{'}$, as long as we make $Y$ rich enough. So the agent will choose $a_{n}$. Thus we have an implication of the form "if $P$ believes $A_{n} = a_{n}$, then $A_{n} = a_{n}$ is true", and so we can create a logical inductor $P$ that always believes that $A_{n} = a_{n}$ for every $n$ by adding a trader with a large budget that bids up the price of $A_{n} = a_{n}$.

Isn't this just UDTv2?

This is much less general than UDTv2. If you like, you can think of this as an agent that at time $k$ chooses a program to run, and then runs that program at time $n$, except the program always happens to be "argmax over this kind of counterfactual".

Also, it doesn't do policy selection.

Next steps

Instead of handing the agent a pseudorandom variable $Y$ that captures everything important, I'd like to have traders inside a logical inductor figure out what $Y$ should be on their own.

Also, I'd rather not have to hand the agent an optimal value of $k$.

Also, I hope that these counterfactuals can be used to do policy selection and win at counterfactual mugging.