UDT in the Land of Probabilistic Oracles — AI Alignment Forum

UDT in the Land of Probabilistic Oracles — AI Alignment Forum

In my previous post, I explained how a set of programs predicting each other using a probabilistic oracle and taking utility-maximizing actions will naturally play a Nash equilibrium. However, this seems to be suspiciously CDT-like. A program will, for example, betray in a Nash equilibrium when playing against itself.

Naturally, probabilistic oracle machines are a good way to formalize Newcomblike problems (since we have prediction), so we can explore alternative decision theories in this framework. Assume we have some probabilistic world program $W (O)$ that runs the world and calls the oracle (which is not considered part of the world), returning an outcome. There may be (probabilistic) agent programs embedded in this world program (as is common in formalizations of UDT). Different agents value outcomes differently: define $V_{i} (w)$ to be the utility player $i$ assigns to the outcome $W (O) = w$ . I am using $V_{i}$ instead of $U_{i}$ to differentiate this utility function from the game theoretic utility function in the previous post.

Assume that player $i$ 's output is $a_{i} (O) = O (E_{i}, 1 / 2)$ for some machine $E_{i}$ . This is a convenient assumption and it might be the case that this is enough, but I haven't actually proven this.

Here's an example of player 1 playing Prisoner's dilemma against player 2: $W (O) = let x = O (E_{1}, 1 / 2), y = O (E_{2}, 1 / 2) in (x, y)$ $U_{1} (x, y) = x - 10 y$ $U_{2} (x, y) = y - 10 x$ Note that $x$ and $y$ are the results of 2 independent calls to $E_{1}$ . We assume that action 1 is defection and action 0 is cooperation. As a side note, perhaps I should be using monadic do-notation to show that these calls are independent, but this is simple enough that it doesn't seem necessary.

Here's an example of player 1 playing Newcomb's problem: $W (O) = let x = O (E_{1}, 1 / 2), y = O (E_{1}, 1 / 2) in (x, y)$ $U_{1} (x, y) = 1000 x + 1000000 (1 - y)$ Here, action 1 is 2-boxing and action 0 is 1-boxing. $x$ is the player's actual action while $y$ is Omega's prediction of the player's action. Note the strong resemblance between Newcomb's problem and playing prisoner's dilemma against yourself.

How might agents in this environment act to maximize utility? In the previous post, we had a quantity we wanted to estimate, $E [\frac{U_{i} (a_{- i}, 1) - U_{i} (a_{- i}, 0) + 1}{2}]$ . We had $P (E_{i} = 1)$ equal to this expectation, so $O (E_{i}, 1 / 2)$ will return a utility-maximizing action for player $i$ . The quantities $U_{i} (a_{- i}, 1)$ and $U_{i} (a_{- i}, 0)$ are causal counterfactuals, but maybe we want to replace them with other kinds of counterfactuals.

To do this, we could define masked oracles $Mask (s, p, q, O)$ for state $s$ , probabilities $p$ and $q$ , and oracle $O$ as follows:

$s$ $P (Mask (s, p, q, O) (s_{2}, p_{2}) 1) = P (O (s_{2}, p_{2}) = 1) for s \neq s_{2} \lor p \neq p_{2}$

This is the same as $O$ except that, when given the query $(s, p)$ , it returns 1 with probability $q$ . Note that $Mask (s, p, q, O)$ might not be a reflective oracle even if $O$ is.

For a UDT-style decision theory, we would like to define the machine $E_{i}$ to compute a utility difference between oracular counterfactuals (similar to logical counterfactuals):

$P (E_{i} (O) = 1) = E [\frac{V_{i} (M (Mask (E_{i}, 1 / 2, 1, O))) - V_{i} (M (Mask (E_{i}, 1 / 2, 0, 0))) + 1}{2}]$

In other words, we try masking the oracle to make our agent $O (E_{i}, 1 / 2)$ return either always 1 or always 0, and then simulate the world twice with these different oracles and get the expected utility difference. It is not hard to confirm that this reasoner succeeds in cooperating on the prisoner's dilemma with itself and one-boxing in Newcomb's problem. However, since the masked oracle need not be reflective, it will not propagate any consequences of the change we made, and we get a few counterintuitive results, such as not cooperating in a prisoner's dilemma with a slightly different but logically equivalent program.

There's also an issue with Newcomb's problem when we add a step of indirection to Omega's prediction: $W (O) = let x = O (E_{1}, 1 / 2), y = O (a_{1}, 1 / 2) in (x, y)$ $U_{1} (x, y) = 1000 x + 1000000 (1 - y)$

Since the counterfactual only modifies $O (E_{1}, 1 / 2)$ and not $O (a_{1}, 1 / 2)$ , the player in this case will conclude that 2-boxing is correct.

So, maybe we want to get a modified version $O^{'}$ of $O$ so that $O^{'} (s, p) = q$ , but $O^{'}$ is also reflective, and is probably mostly similar to $O$ when possible. However, we have no guarantee that $O^{'}$ actually exists. For example, if $M$ is a machine that always returns 1, we cannot have a reflective oracle in which $O^{'} (M, 1 / 2) = 0$ .

Perhaps we can construct our $E_{i}$ machines so that such an oracle always exists. I'm not sure what approach makes sense here. This seems similar to the problem that playing chicken with the universe is meant to solve, but I don't know how well this translates to a context based on probabilistic oracles rather than logic.

If we succeeded in specifying more reasonable modified oracles, we might end up with a nice alternative formalism of UDT that can be computed using Nash equilibria.