A putative new idea for AI control; index here.

Other posts in the series: Introduction, Double decrease, Pre-existence deals, Full decision algorithms, Breaking acausal trade, Trade in different types of utility functions, Being unusual, and Summary.

We're going to have to look at the issue of universal, pre-existence trade agreements - the idea that agents should abide by some sort of idealised trade agreement based on a universal prior of possible agent values, without updating on the fact that they happen to exist as agents.

The arguments for it seem superficially similar to the arguments for preferring FDT/UDT/TDT over CDT, but are actually quite different. In brief, there is no profit, advantage, or benefit for an existing agent to commit to a trade agreement with non-existing agents.

Choosing a better decision theory

Newcomb, as usual

Causal decision theory gets the Newcomb problem wrong.

There are many ways to see that CDT gets this wrong (for a start, it ends up with less money than other agents), but one of the most damning is that CDT is not stable - it will self-modify into a sort of timeless agent, if it can do so, precisely to triumph on Newcomb-like problems.

It does this in a clunky and odd way: it will one-box on Newcomb problems where Omega does its prediction after learning of CDT's change, but not on Newcomb problems where Omega made its decision in the past.
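
As a toy illustration of this asymmetry, the sketch below compares a self-modified CDT agent with an agent that one-boxes regardless of timing. The payoff amounts (the usual 1,000,000 and 1,000 dollars), the perfectly accurate Omega, and every function name are assumptions made for the example; none of them come from the post.

```python
# Toy model: amounts, names and the perfect predictor are assumptions.
BIG, SMALL = 1_000_000, 1_000

def payoff(one_boxes: bool, predicted_one_box: bool) -> int:
    """Money received given the agent's choice and Omega's prediction."""
    opaque_box = BIG if predicted_one_box else 0
    return opaque_box if one_boxes else opaque_box + SMALL

def modified_cdt_one_boxes(prediction_time: float, modification_time: float) -> bool:
    """Self-modified CDT: one-box only if Omega predicts after the modification."""
    return prediction_time > modification_time

def fdt_one_boxes(prediction_time: float, modification_time: float) -> bool:
    """FDT-style agent: one-box whenever the prediction was made."""
    return True

def earnings(policy, prediction_time: float, modification_time: float = 0.0) -> int:
    """Assume Omega predicts perfectly, so the prediction equals the choice."""
    choice = policy(prediction_time, modification_time)
    return payoff(one_boxes=choice, predicted_one_box=choice)

if __name__ == "__main__":
    for name, policy in [("modified CDT", modified_cdt_one_boxes),
                         ("FDT", fdt_one_boxes)]:
        print(name,
              "| past prediction:", earnings(policy, -1.0),
              "| future prediction:", earnings(policy, 1.0))
```

Under these assumptions, the two agents differ only when Omega's prediction precedes the modification, which is where the self-modified CDT agent still two-boxes.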

The reason for its clunkiness is that CDT, being causal, doesn't allow for correlations between its own decision and causally anterior events.

Functional decision theory, which has a richer theory of correlations, would spot that correlation, and, assuming it had somehow ended up with a CDT-like decision module, would immediately change that into an FDT-compatible one that would one-box on all Newcomb problems.

I had another argument for the weakness of CDT (and some forms of UDT), namely that it shouldn't make a difference whether an agent was simulated or just predicted in a non-simulation manner. Keep this argument in mind, because I think it's wrong.

A mugging both counterfactual and surprising

Suppose an agent is using the FDT understanding of correlations and a CDT decision module, and is suddenly confronted with the Counterfactual mugging.

If the agent had time to think of the problem ahead of time, they would have self-modified into a full FDT agent. But now it is too late: they already know that Omega's coin came up tails. There may be symmetry or "rationality" or other philosophical arguments for paying the $100, but, unlike the Newcomb problem, there is no profit argument for doing so.

The full FDT theory of correlations shows that, in this instance, behaving like an FDT agent is the wrong decision.
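
The gap between the two situations can be made concrete with a small expected-value sketch. It assumes the usual statement of the counterfactual mugging (a fair coin and a 10,000 dollar reward on heads for agents who would pay on tails); only the 100 dollar payment appears in this post, and the function names are hypothetical.

```python
# Toy model: the reward size and the fair coin are assumptions taken from
# the usual statement of the problem, not from this post.
COST, REWARD, P_HEADS = 100, 10_000, 0.5

def value_before_flip(pays_on_tails: bool) -> float:
    """Expected money for a policy fixed before the coin is flipped."""
    heads = REWARD if pays_on_tails else 0  # Omega rewards predicted payers
    tails = -COST if pays_on_tails else 0
    return P_HEADS * heads + (1 - P_HEADS) * tails

def value_after_tails(pays: bool) -> float:
    """Money for an agent that first meets the problem already knowing 'tails'."""
    return -COST if pays else 0

if __name__ == "__main__":
    print("before flip:", value_before_flip(True), "vs", value_before_flip(False))
    print("after tails:", value_after_tails(True), "vs", value_after_tails(False))
```

In this sketch, committing to pay is worth more than refusing when evaluated before the flip, while paying is a pure loss for an agent whose first contact with the problem comes after it has learned the coin came up tails.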

Or consider a slight variant of the mugging: replace "the coin was tails" with "$17 \times 19 = 323$". Now, there is a brief moment of uncertainty about the product of those numbers; in that brief moment of uncertainty, could an FDT agent with a grasp of logical uncertainty precommit to paying Omega anyway?

Time's up; the uncertainty has passed; $17 \times 19$ is indeed $323$. It seems clear that the optimal course of action is to check this, rather than precommitting under the original uncertainty. Of course, there's nothing special about using logical uncertainty here; we could simply use empirical uncertainty, of the kind that's easy to check or that we can even check in our minds (there are lots of questions about your favourite books or shows that you can answer, but only after a little bit of thought).

Similarly, if the very first time the agent even thinks about the counterfactual mugging is the moment when Omega tells it the true fact that the coin came up tails, then it shouldn't ignore that fact. It lucked out, and there's no reason to undo that lucking out. There is no profit to doing so (I found a similar argument in this post).

Insurance sold too late

The argument above only works because the agent hasn't yet fully transformed their decision theory. In general, self-modifying to become an agent that pays the counterfactual mugger is the right decision, and any reasonable agent will do that early.

But there is a special case where the agent cannot self-modify early, and that is when it doesn't exist.

What if Omega approached you and said "there was a 50-50 chance that you had not been born as you with your current values $V$, but as someone else with [radically different and stupid value set $V'$]"?

Do you have any reason to follow $V'$ in any way? Not from the profit motive, the one that transforms CDT into a weird CDT-FDT hybrid, or transforms an FDT correlation user into an FDT decision theorist.

If both agents had been simulated before existence, and they could have negotiated, then a deal could have been reached. But note that it is profitable for your simulation to agree to that deal, but it is never profitable for you to do so. If you were unsure whether you were the simulation or not, then it may be profitable to agree.

But, absent that simulation, from the earliest moment of your actual existence, it is never profitable to agree to that deal. Never-existence insurance always arrives too late.
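
A minimal sketch of this point, under loudly stated assumptions: the model below measures everything in $V$-utility relative to refusing the deal, and the parameters `p_simulation`, `gain_if_simulated` and `cost_if_real` are hypothetical quantities introduced only for illustration.

```python
# Toy model: every parameter is a hypothetical illustration, not from the post.
def value_of_agreeing(p_simulation: float,
                      gain_if_simulated: float,
                      cost_if_real: float) -> float:
    """Expected V-utility of agreeing to the deal, relative to refusing.

    p_simulation:      credence that you are the pre-existence simulation
    gain_if_simulated: V-utility the deal secures if you are that simulation
    cost_if_real:      V-utility given up to V' if you actually exist
    """
    return p_simulation * gain_if_simulated - (1 - p_simulation) * cost_if_real

if __name__ == "__main__":
    # Certain that you actually exist: agreeing is a pure cost.
    print(value_of_agreeing(0.0, gain_if_simulated=10.0, cost_if_real=1.0))
    # Genuinely unsure whether you are the simulation: agreeing may pay.
    print(value_of_agreeing(0.5, gain_if_simulated=10.0, cost_if_real=1.0))
```

With no credence in being the simulation, the value of agreeing is negative for any positive cost, however large the hypothetical gain; it can only turn positive once that credence is above zero.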

And even if there were some abstract reason that an FDT agent "should" agree to that deal, humans are not idealised FDT agents, so we don't have to; hence neither do the agents that we create. Note also that, since there is no profit, there is no reason that an FDT agent that didn't agree to the deal would ever want to change its mind.

The usual exceptions

Of course, things are different if there were a large network of agents committed to the universal pre-existence utility, and committed to enforcing agreement with it. But in that situation, it becomes potentially profitable again to follow that agreement, so the change in behaviour does not undermine the argument.

If modal realism were true, and there existed worlds in which we had never existed, and we cared about what happened in those worlds, then we would want to reach such deals - but that is just classical acausal trade. There's no philosophical distinction between different worlds and causally disconnected parts of our world.
