Pragmatic FDT, and predictors as game theory

Stuart_Armstrong

Decision theory is back in fashion (defining fashion as "one good post on a good EA blog"). Bentham's Bulldog (BB) has published a case against FDT (functional decision theory), contrasting rationalist enthusiasm with academic scepticism: "Academic decision theorists don't like the theory. The number of academic decision theorists who adopt it could be counted on one hand by someone missing four of their fingers."

I am, just barely, a published academic decision theorist, so you can keep a small finger to count me too. My position is that, though FDT may have problems with its definitions and under-definedness, we can build defined variants that achieve what FDT attempted to.

I want to do two things in this post. First, sketch a "pragmatic" version of FDT designed to sidestep the theoretical pitfalls that Will MacAskill and Wolfgang Schwarz identify. Second, take a closer look at what predictors actually do, and argue that whenever they make counterfactual predictions, decision theory shades into game theory -- which explains why EDT/TDT/UDT/FDT can look irrational in the odd branch. It's the old debate of "should you pay the blackmailer", dressed up in predictor garb.

Pragmatic FDT

MacAskill and BB both press on the difficulty of saying, formally, whether two algorithms are "the same." Rather than solving that, I'm going to retreat and declare victory. I won't define whether two algorithms are the same in any abstract sense, and I'll ignore logical counterfactuals and counterpossibles entirely. Instead I say that two algorithms are equivalent if the equivalence can be built:

p-FDT: a pragmatic FDT agent decides in four steps:
1. Baseline. Compute the CDT ^[1] action and its expected utility. This is the default.
2. Search. Look for likely-true isomorphisms $\phi$ between the agent's own decision process and parts of the world, using the ordinary tools we humans use to judge when two algorithms compute the same function (laid out below).
3. Evaluate. For each candidate $\phi$, find the input-output map $f$ that maximises expected utility, on the assumption that choosing $f$ as the agent's own decision map also sets $\phi \circ f \circ \phi^{-1}$ (the isomorphic process out in the world). Weight by the probability $p$ that $\phi$ is true: with probability $p$, the world responds as the isomorphism dictates; with probability $1-p$, it behaves as the causal baseline says and the agent has merely played into a world that ignores it.
4. Adopt or default. Call $\phi$ exploitable if its best $f$ beats the baseline in expected utility. Adopt the highest expected-utility exploitable $\phi$ found; if none exist, take the baseline CDT action.
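The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the names `Candidate` and `p_fdt` are hypothetical, the search step is taken as already done, and because the toy problem has a single input, the input-output map reduces to a single action. The Newcomb payoffs (1,000,000 and 1,000), the causal credence of 0.5 that box B is full, and the credence of 0.99 in the isomorphism are all assumed for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    """A candidate isomorphism phi (all fields are illustrative assumptions)."""
    name: str
    prob_true: float                          # credence p that phi really holds
    utility_if_true: Callable[[str], float]   # utility of map f when the world mirrors f


def p_fdt(
    actions: List[str],
    causal_utility: Callable[[str], float],
    candidates: List[Candidate],
) -> Tuple[str, float, Optional[str]]:
    """Return (chosen action, expected utility, name of adopted phi or None)."""
    # Step 1 (baseline): the CDT action and its expected utility.
    baseline = max(actions, key=causal_utility)
    best = (baseline, causal_utility(baseline), None)

    # Step 2 (search) is assumed done: `candidates` holds what the search found.
    for phi in candidates:
        # Step 3 (evaluate): mix the "phi is true" world with the causal baseline.
        def expected_utility(action: str, phi: Candidate = phi) -> float:
            return (phi.prob_true * phi.utility_if_true(action)
                    + (1 - phi.prob_true) * causal_utility(action))

        action = max(actions, key=expected_utility)
        # Step 4 (adopt or default): keep phi only if it beats the best so far,
        # which starts out as the CDT baseline.
        if expected_utility(action) > best[1]:
            best = (action, expected_utility(action), phi.name)
    return best


# Toy Newcomb problem with assumed payoffs and credences.
BOX_B, BOX_A = 1_000_000, 1_000
PRIOR_FULL = 0.5  # assumed causal credence that box B is full


def causal_utility(action: str) -> float:
    return PRIOR_FULL * BOX_B + (BOX_A if action == "two-box" else 0)


simulator = Candidate(
    name="simulator",
    prob_true=0.99,
    utility_if_true=lambda a: BOX_B if a == "one-box" else BOX_A,
)

ACTIONS = ["one-box", "two-box"]
print(p_fdt(ACTIONS, causal_utility, []))           # no phi found: CDT default
print(p_fdt(ACTIONS, causal_utility, [simulator]))  # exploitable phi found
```

With no candidate the sketch falls back to two-boxing; with the simulator candidate it one-boxes, because the mixed expected utility beats the baseline. A candidate with a very low credence would be evaluated and then discarded in the same way.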

Here, what's an isomorphism between functions? Suppose we have an invertible map $\phi$ between (sets of) inputs of functions $f$ and $g$, and between (sets of) their outputs. The two functions are isomorphic if $g = \phi \circ f \circ \phi^{-1}$. Think of $\phi$ as relabelling inputs and outputs: it says that, up to relabelling, $f$ and $g$ are the same thing.

Take Will's calculators:

consider two calculators. The first calculator is like calculators we are used to. The second calculator is from a foreign land: it's identical except that the numbers it outputs always come with a negative sign ('–') in front of them when you'd expect there to be none, and no negative sign when you expect there to be one. Are these calculators running the same algorithm or not?

Will's answer is that it depends on how you interpret the '–', and there's no fact of the matter. Under p-FDT we don't need one. The calculators are plainly isomorphic: $\phi$ maps identical inputs to identical inputs, and adds or removes the minus sign on the outputs. Up to that relabelling they compute the same thing. Whatever the foreign calculator was intended to do, it runs an algorithm isomorphic to the standard one, and that's all we'll ever need.
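The calculator case is simple enough to check mechanically. The sketch below is illustrative only: it assumes both calculators do nothing but integer addition, and the function names (`standard_calc`, `foreign_calc`, `conjugate`) are hypothetical. It builds the relabelled version of the standard calculator and compares it with the foreign one over a small sample of inputs.

```python
from itertools import product


def standard_calc(a: int, b: int) -> int:
    """The familiar calculator, restricted here to addition (an assumption)."""
    return a + b


def foreign_calc(a: int, b: int) -> int:
    """The foreign calculator: same sums, with the sign flipped on display."""
    return -(a + b)


def phi_inputs_inverse(inputs):
    """phi is the identity on inputs, so its inverse is too."""
    return inputs


def phi_outputs(output: int) -> int:
    """phi adds or removes the minus sign on outputs."""
    return -output


def conjugate(f):
    """Build phi o f o phi^{-1}: undo the input relabelling, run f, relabel the output."""
    def g(*inputs):
        return phi_outputs(f(*phi_inputs_inverse(inputs)))
    return g


# Compare the relabelled standard calculator with the foreign one on a sample.
sample = list(product(range(-5, 6), repeat=2))
is_isomorphic_on_sample = all(
    conjugate(standard_calc)(a, b) == foreign_calc(a, b) for a, b in sample
)
print(is_isomorphic_on_sample)
```

The check only covers the sampled inputs, which matches the pragmatic spirit: the isomorphism is established over the region that was actually examined.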

Note that $\phi$ operates on sets of inputs and outputs, so it's also an abstraction. Every way of typing "2+2=" counts as the same input, whether the calculator user is standing on their head, declaiming opera, or has sat down on the calculator in their back pocket. Every "4" (or "–4") counts as the same output, though each corresponds to quadrillions of atoms on the screen in subtly different, moment-to-moment-changing positions. This is the same trick underlying all of computer science (formal abstractions of messy physical processes) and all statistical reasoning. Abstractions are used because they're useful.

A useful abstraction need not cover all situations. Some numbers are neither even nor odd; the South Pole has no time-zone; wood has no boiling point. So maybe the first calculator has a key that the foreign one lacks, and vice versa. Maybe the user could smash the first calculator on the floor and jump up and down on its ruins, and this has no clear isomorphism to the second calculator.

So $\phi$ need not be total -- it need not map from or to all inputs and all outputs. And it need not be maximally complex; indeed minimal isomorphisms are often the most useful. When playing the Prisoner's Dilemma against an identical copy of yourself, there is a strong isomorphism between every detail of both your thoughts and actions, but all you really need to know is "we will both cooperate, or both defect".

Exploitable isomorphisms

By design, $\phi$ must be exploitable, so that the agent gains from acting on its existence. In the standard Newcomb problem, that is certainly the case -- making Omega believe the agent will one-box is of great value (the cost being that the agent will actually one-box). So the search for $\phi$ is not a search for some abstract equivalence between, say, your brain and a roiling cloud of dust or my brain and the US economy. It needs to be an exploitable isomorphism, where the agent can understand the inputs and outputs and how changes to the input-output map affect the world and hence its own utility. No-one has yet proposed a plausible exploitable isomorphism between my brain and the US economy, or reasons to think that one exists ^[2].

For a limited agent there's an extra caveat: the agent must actually be able to implement the winning $f$. In Parfit's Hitchhiker, it's easy to see there is an isomorphism between 'appear fully trustworthy' and 'get saved by the driver'. But maybe that move is beyond the human hitchhiker. Maybe instead the best isomorphism has an output which is 'become genuinely willing to pay, by focusing on gratitude towards the driver', because that output is actually implementable.

How we identify likely-true isomorphisms

So how do we actually find these things? We already have a whole toolkit, and it's worth laying it out, because the reassuring point is that none of it is new metaphysics ^[3]. It's the ordinary business of deciding when two processes compute the same function, run at whatever level of rigour the stakes demand.

Roughly from cheapest-and-most-certain to most expensive-and-least-certain:

Identity and near-identity. We can identify an agent with a near-exact copy of itself, and with a faithful simulation of itself. These are the easy cases: the isomorphism is transparent and barely needs checking.
Same code, different substrate. Two identical pieces of code running on different machines compute the same function, as long as the abstraction doesn't leak. Overflow, timing, hardware faults, and side-channels are all ways they can leak; known problems, with known ways of taking them into account.
Different code, same task. Two different implementations -- a bubble sort and a quicksort, two chess engines that always pick the same move, a compiled and an interpreted version of one program -- can be isomorphic at the level of the input-output map we care about, even when their internals differ wildly.
Coarse behavioural equivalence. Sometimes we only need a fragment of the map. Two negotiators from the same culture may be isomorphic just over "how they respond to a lowball offer," and nowhere else; two thermostats built by rival firms agree on "switch the heating on below the setpoint" while sharing no circuitry. A partial isomorphism over the decision-relevant slice is enough.
Black-box testing. When we can't see inside, we probe. Feed a wide variety of inputs from across the input space, and specifically try to make the two processes diverge -- hunt for the input where they come apart. If we can't find one after honest effort, we provisionally treat them as isomorphic over the region we tested.
Vetting a claimed predictor. If Omega claims a great track record, we check it -- including with randomised trials, to rule out the possibility that Omega is riding a superficial correlation (the gene) rather than tracking our decision process (the simulation). If someone claims to see the future, we subject them to highly sceptical investigation; if someone claims merely to read intent, moderately sceptical investigation. Throughout, we ask whether the purported isomorphism is compatible with everything else we know about the world.
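The black-box testing step in the list above can be sketched as a small divergence hunt. This is a hedged illustration, not a recipe: `find_divergence`, `bubble_sort`, and `dedup_sort` are hypothetical names, the input generator (short lists of small integers) is an arbitrary assumption, and a fixed seed is used only to make the run repeatable.

```python
import random
from typing import Callable, List, Optional


def bubble_sort(xs: List[int]) -> List[int]:
    """A plain bubble sort, standing in for 'different code, same task'."""
    out = list(xs)
    for i in range(len(out)):
        for j in range(len(out) - 1 - i):
            if out[j] > out[j + 1]:
                out[j], out[j + 1] = out[j + 1], out[j]
    return out


def dedup_sort(xs: List[int]) -> List[int]:
    """Looks like a sort, but silently drops duplicates."""
    return sorted(set(xs))


def find_divergence(
    f: Callable[[List[int]], List[int]],
    g: Callable[[List[int]], List[int]],
    trials: int = 1000,
    seed: int = 0,
) -> Optional[List[int]]:
    """Probe f and g with random inputs; return an input where they differ, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-3, 3) for _ in range(rng.randint(0, 6))]
        if f(xs) != g(xs):
            return xs
    return None


# Two honest sorts: no divergence found, so treat them as isomorphic
# over the tested region (and only provisionally).
print(find_divergence(bubble_sort, sorted))

# A near-miss: agrees on many inputs, comes apart on lists with repeats.
print(find_divergence(bubble_sort, dedup_sort))
```

A `None` result is evidence about the tested region only; a wider or more adversarial input generator would be needed before leaning on the isomorphism when stakes are high.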

We use all the tools that human reasoning and trained common sense make available, and we'll need them: in toy models a simple formal check suffices, but in the messy world, identifying useful isomorphisms is a task of arbitrarily high complexity. Often the agent will find none, and default to CDT. That's not a failure of the theory. Even a useful, true isomorphism may simply be beyond the p-FDT agent's ability to find -- and if the agent assumed it could always find one (if one existed) it could walk straight into contradictions ^[4].

Application to standard problems

In Newcomb with a simulator: Omega predicts the agent by running a simulation of the agent's decision process. Here $\phi$ maps the agent's decision to the simulation's decision, because the simulation is its decision process under relabelling. So the map is exploitable: the agent one-boxes, and thereby the simulation one-boxed, and thereby the box is full. Standard Newcomb, for fun and profit.

In the gene version of Newcomb, Omega predicts the agent by checking whether it carries a gene that correlates 99.9% with two-boxing. Now the only candidate $\phi$ would map its decision to its gene, but such a $\phi$ can't be constructed (or validated) with the methods above. We'd need a scenario where we saw the gene change depending on the agent's decision. So, CDT, and two-boxing, and hope to have the right gene.

In a Prisoner's Dilemma against a copy: the obvious $\phi$ is exploitable (mutual cooperation is better than mutual defection), so the agent switches to cooperate. In a Stag Hunt against a copy where the default is already Stag: $\phi$ exists but isn't exploitable; it doesn't give a higher utility. In smoker's lesion: no plausible $\phi$, so nothing even theoretically exploitable, so p-FDT is causal and smokes ^[5]. And where other agents try to extort it through predictional reasoning: it declines to act on the isomorphism and defaults back to CDT.

Discontinuity across a spectrum of predictors

MacAskill worries that FDT has an embarrassing discontinuity:

What if the 'predictor' is a very unsophisticated agent that doesn't even understand the implications of what they're doing? [...] For FDT, there will be some point of sophistication at which the agent moves from simply being a conduit for a causal process to instantiating the right sort of algorithm, and suddenly FDT will switch from recommending two-boxing to recommending one-boxing.

It's worse than that -- the switch can happen in several places, in different directions, depending on small changes in the setup. But that's exactly what p-FDT predicts, because the switch just is the point at which an exploitable $\phi$ appears (or changes). Walk up the spectrum:

A nationality-based predictor. Say Scots tend to one-box and the English to two-box, and Omega predicts on nationality alone. If nationality is fixed, there's no $\phi$ (nationality isn't something the input-output map selects) so p-FDT two-boxes ^[6]. And why do the Scots one-box? If it's because they run FDT-ish algorithms and Omega reads the algorithm-identity rather than the decision, then FDT should notice the prediction tracks identity, not policy, and two-box anyway. A Scot who keeps one-boxing here is simply mistaken: modelling this predictor as cleverer than it is. As Scots wise up, they two-box, and reap the best outcome of all: predicted to one-box, actually two-boxing.

A shrewd human predictor. Now Omega is a perceptive person with a good gut sense for who'll one- or two-box. There's a real connection between the agent's decision process and the prediction -- but gut instinct is limited. What p-FDT would like to signal is "I'll one-box if you're sharp enough to read that I will, and two-box otherwise." That's hard to communicate implicitly, though not impossible between people who know each other well. Usually: two-box, predicted correctly.

Omega proper. Raise the predictor to a genuine simulator. Now $\phi$ is trivial to see; p-FDT one-boxes, and so does the simulation.

Omega vs. a sharper agent. Now raise the p-FDT agent's intelligence too, enough to reliably detect whether they're inside the simulation. The optimal map becomes "if simulated, one-box; if real, two-box," which extracts the maximum.

So the verdict flips from two-box to one-box as we climb, and flips back at the top. Both ends of the spectrum two-box, for different reasons. The "sharp switch" isn't a glitch in FDT's metaphysics, it's p-FDT correctly tracking where an exploitable isomorphism exists. Throughout, the p-FDT agent is doing one thing: hunting for a $\phi$ that convinces the predictor it'll one-box, while also keeping an eye out for a way to actually two-box.
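The first flip can be given a back-of-envelope threshold. The symbols here are assumptions introduced for illustration: $M$ is the content of the opaque box when full, $K$ the content of the transparent box, $q < 1$ the agent's causal credence that the opaque box is full, and $p$ its credence that the isomorphism is true. Applying the evaluate and adopt steps of p-FDT under those assumptions:

```latex
\begin{align*}
EU_{\text{baseline}} &= qM + K
  && \text{(CDT two-boxes)} \\
EU_{\phi}(\text{one-box}) &= pM + (1-p)\,qM \\
EU_{\phi}(\text{two-box}) &= pK + (1-p)(qM + K) = K + (1-p)\,qM \;\le\; EU_{\text{baseline}} \\
EU_{\phi}(\text{one-box}) > EU_{\text{baseline}}
  &\iff p\,M(1-q) > K
  \iff p > \frac{K}{M(1-q)}
\end{align*}
```

With the conventional stakes (M = 1,000,000 and K = 1,000) and q = 1/2, the threshold works out to p > 0.002. On this sketch the switch sits wherever the agent's credence in the isomorphism crosses that line, and two-boxing under the isomorphism never beats the baseline.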

No advanced counterpossible theory required

Instead of a theory of counterfactual and counterpossible worlds, we've substituted a specification of what the agent can be seen to 'control' (its input-output map) and practical ways for finding isomorphisms which allow it to exploit that control. A pragmatic approach, with no deep philosophical theories of impossible worlds needed. ^[7]

Predictors, counterfactuals, and game theory

Going in a different direction, and looking at Newcomb problems in general: predictors change decision theory, and not necessarily in the obvious ways. There are two kinds of predictor, with different implications:

A straight predictor knows what will happen and doesn't visibly change the scenario on the basis of its prediction (it may change it invisibly). This is classical Newcomb: Omega predicts, acts on it silently, and is generally right. Straight predictors do two things: they let you play a turn-based game out of order, and they wreck CDT (see the appendix). The out-of-order effect is unmysterious -- there's no puzzle in "choose, then Omega fills box B to match" -- and, notably, it isn't the rearrangement that breaks CDT.

A counterfactual predictor knows what would have happened -- what you'd have done in a scenario that may not be the real one. This covers the Counterfactual Mugging, Transparent Newcomb, Parfit's Hitchhiker, and the rest. ^[8] And counterfactual predictors do something new: they import game theory into decision theory.

In game theory, consider the Ultimatum Game: the proposer offers a split, the responder accepts or rejects (reject and nobody gets anything). Responders reject lopsided offers, so proposers learn to offer fairer ones. The proposer is deciding on a counterfactual prediction of the responder -- "if I get greedy, they'll reject."

A counterfactual predictor is really just another player whose "action-following-prediction" is a best response; you can always rewrite it as a utility-maximiser and get the same behaviour.
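The Ultimatum Game reasoning can be sketched as a one-line best response to a counterfactual prediction. The numbers are assumptions for illustration (a pie of 10 units, a responder who rejects anything below 3), and the function names are hypothetical.

```python
PIE = 10  # assumed size of the pie, in whole units


def responder_accepts(offer: int, threshold: int) -> bool:
    """The responder's policy: reject any offer below its threshold."""
    return offer >= threshold


def proposer_payoff(offer: int, threshold: int) -> int:
    """What the proposer keeps, given its prediction of the responder's policy."""
    return PIE - offer if responder_accepts(offer, threshold) else 0


def best_offer(threshold: int) -> int:
    """Best response to the predicted policy: 'if I get greedy, they'll reject.'"""
    return max(range(PIE + 1), key=lambda offer: proposer_payoff(offer, threshold))


print(best_offer(threshold=3))  # responder who rejects lopsided offers
print(best_offer(threshold=0))  # responder who accepts anything
```

The responder who would reject low offers never actually has to reject one here: the proposer's prediction of that off-path behaviour is what moves the offer from 0 to 3.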

Take, for instance, the Bomb thought experiment. Here, the agent chooses Left or Right; Right costs $100 but is always safe; if the predictor (tiny error rate) predicted Right it put a deadly bomb in Left, otherwise Left is safe. ^[9] So far this is straight-predictor Newcomb. The twist is the note: the predictor tells the agent that it predicted Right and therefore did put a bomb in Left. If the note is taken to be accurate, the setup needs a counterfactual predictor -- because a straight predictor can't leave an informative note here at all. ^[10]

If the note is informative, then Bomb maps cleanly onto a ransomware scenario. The extortionist [predictor] targets a company [agent]. It can encrypt the company's data [place a bomb] or not. The company can pay $100 [go Right] and recover its data [Right is safe], or refuse and eat a large loss [go Left in the presence of a bomb]. But the extortionist also bears costs if the company refuses -- the wasted hack, law enforcement, bad publicity -- so it predicts the company first. Predict "pay" [Right] → hack and leave a note. Predict "refuse" [Left] → don't bother.

The only thing I've added to Bomb is the fact that the extortionist also bears costs if the company refuses. That was added to give the predictor a reason not to hack a refuser [not to put a bomb if the simulated agent goes Left].

Typically a CDT agent pays (goes Right; note and bomb appear in Left), and an FDT agent refuses (goes Left; no note, no bomb). But the extortionist isn't perfect. Once in a trillion trillion times it mispredicts, and an FDT agent sees the note with a real bomb behind it -- and walks into it, because a no-pay policy is exactly what buys the good outcome in the other 999,999,999,999,999,999,999,999 cases. That lone bad branch isn't the agent being irrational; it's the price of a policy, in a setting that's game theory rather than decision theory. And games routinely trade a loss in one branch for gains across the rest. That's the point I'm really after: the moment a predictor goes counterfactual, you're playing a game, and game-shaped verdicts should stop surprising us.
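The policy-level comparison can be made explicit with exact arithmetic. In this sketch the error rate (one in a trillion trillion) and the 100-dollar cost of Right come from the setup above, while `BOMB_LOSS`, the dollar-equivalent disutility of walking into the bomb, is an assumed figure chosen purely for illustration.

```python
from fractions import Fraction

ERROR_RATE = Fraction(1, 10**24)  # one misprediction in a trillion trillion
COST_RIGHT = 100                  # Right always costs 100 and is always safe
BOMB_LOSS = 10**9                 # ASSUMPTION: dollar-equivalent loss from the bomb


def policy_value(policy: str) -> Fraction:
    """Expected value of committing to a policy before the prediction is made."""
    if policy == "Right":
        # Paying is safe whatever was predicted.
        return Fraction(-COST_RIGHT)
    # Policy "Left": the bomb is only there when the predictor mispredicted Right.
    return -ERROR_RATE * BOMB_LOSS


# The refusing policy stays ahead until the bomb's disutility reaches this figure.
break_even_loss = COST_RIGHT / ERROR_RATE

print(policy_value("Left"), policy_value("Right"), break_even_loss)
```

Under these assumptions the Left policy costs a vanishingly small amount in expectation while the Right policy costs 100 every time; the single bad branch is where that small expected cost is actually paid.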

Conclusion

So where does this leave FDT? Its critics are right that, as stated, it leans on counterpossible reasoning and an undefined notion of algorithmic identity. But those are failures of formulation, not of the underlying idea. p-FDT keeps the idea -- some correlations between your decision process and the rest of the world are yours to steer -- and swaps the metaphysics for engineering: a correlation matters to your decision exactly when the isomorphism behind it can be built, validated, and exploited. Where no such isomorphism can be built, p-FDT just is CDT, and different agents, with different isomorphism-finding abilities, will legitimately decide differently.

And when the exploitable isomorphism runs through a counterfactual predictor, you're no longer doing decision theory alone; this is game theory land. The apparently insane verdicts (walking into the bomb, refusing the blackmailer) are the familiar game-theoretic price of a winning policy, encountered in its losing branch. Critics judge the branch; defenders judge the policy; and I don't think the word "rational" settles which is correct. But that dispute is one of game theory's oldest -- whether to honour a commitment it no longer pays to honour -- and not some new pathology invented by rationalist decision theorists.

Appendix: CDT can't believe in predictors

I hadn't appreciated how badly CDT does around predictors, or why. It isn't that the predictor acts first. Run Newcomb with the predictor acting later -- the agent locks in its choice, then Omega runs the prediction and fills or empties box B, then the agent gets its reward. Logically, the same algorithm run earlier or later gives the same answer, so an agent would be insane to think that the timing matters.

The CDT agent is not insane in that way: it expects that the prediction algorithm will give the same answer whenever it is run. But it doesn't model the algorithm as correlated with its decision in either case: that's because the prediction isn't causally downstream of the action, even when it runs (temporally) later.

Picture Omega running three things: the prediction before the CDT agent's choice, an identical prediction after it, and finally a direct look at CDT's actual choice. A CDT agent cannot model these three as giving the same answer. And it cannot learn that they do, no matter how often it watches this happen. It simply can't credit the existence of reliable predictors of itself. Though it's perfectly happy to believe in predictors of other agents.

This isn't just an informal observation. Oesterheld and Conitzer (2021), writing in the thoroughly academic Philosophical Quarterly, construct a scenario where a CDT agent facing a reliable predictor voluntarily takes a bet that loses money in expectation -- in a single decision -- and then extend it to a diachronic Dutch book. An agent that cannot credit predictors of itself isn't merely stubborn; it's a money pump.

A CDT agent will much sooner believe in time travel than in someone who can predict it.

Why CDT? Because it is defined in a way that EDT is not. People are still arguing as to what EDT agents do in various situations, while CDT behaviour is often agreed on. Moreover, the TDT/UDT/FDT family is in part designed to fix the problems with CDT; using CDT as a baseline means that the more advanced methods only apply when they actually find ways to improve on CDT. ↩︎
Though if someone does identify one, please do let me know. ↩︎
Nor is this a lonely project. Formalising "my decision process is legibly correlated with that process over there", without any metaphysics of algorithmic identity, is exactly what the program equilibrium literature does. Tennenholtz (2004), building on an idea of Howard (1988), has players submit programs that can read each other's source code -- with cooperation initially resting on exact code identity, the fragility the later papers repair; Barasz et al. (2014) build "modal agents" whose cooperation survives the agents' code not being identical, an isomorphism-shaped result if ever there was one; Critch (2019) extends the trick to resource-bounded agents, and Oesterheld (2019) makes the equilibria robust in a different direction. Over in game theory proper, Halpern and Pass study translucent players -- players who believe that switching strategies may be visible to, and change the strategies of, their opponents. None of these authors is an FDT adherent, so BB's finger-count of adopters may stand -- but the formal machinery FDT was groping toward is being built in peer-reviewed venues, by more hands than one. ↩︎
Either Gödel-style (the setup would be something where finding the isomorphism would be equivalent to proving your own consistency), Löb-style (finding the isomorphism is equivalent to proving you take an inferior action), or Russell-style (the isomorphism exists if and only if you can't find it). ↩︎
I think the smoking lesion problem does EDT dirty. In that problem, we know that smoking and cancer are just correlated by a genetic lesion, but the EDT agent doesn't. It's easy to get an agent to behave badly if you conceal crucial information from it! And if you don't know about the lesion, then the correlation is prima facie evidence you should avoid smoking. Which turned out to be the right decision in the real world; EDT was being sensible, given the information it had, and ultimately correct. ↩︎
Could the English make a fortune by faking a Scottish accent? Only if the predictor is dim enough to be fooled by it; in which case the accent has become the predictive variable, standing in for the nationality that the predictor can no longer reliably read. ↩︎
Oh, ok, here's a sketch of a theory. Counterfactual and counterpossible reasoning asks what would happen under different decisions we might take; it analyses what happens, for each decision. But because we will only actually take one decision, all but one of those analyses have a false premise: assuming a decision not actually taken. Push that assumption far enough and you will hit a contradiction, and by the principle of explosion you can then deduce anything, which will likely produce a nonsense decision.

The traditional fix is ontological: build a nearest "as close as possible" world to reason about, which is itself consistent. I prefer an epistemic fix: don't let the agent push its reasoning to the point of explosion. p-FDT's rigidity -- a fixed formalism, a fixed reading of the agent's input-output function, fixed standards for when an isomorphism counts as "likely true" -- is there precisely to keep the agent's exploration inside the region that won't explode. In a sense CDT does the same thing with its do(X) operator: by severing X from its causes, it avoids confronting the implications of assuming X when X needn't hold. But CDT pays too high a price for that: it cannot grapple with the existence of predictors (see Appendix). p-FDT pushes much further, but has its own failure modes. Once it has found a $\phi$, p-FDT acts effectively like it has a $do(f, \phi \circ f \circ \phi^{-1})$ operator. An Omega that strikes straight at that -- say, an Omega that rewards it for choosing its second best rather than its best -- will cause it to fail.

Want even more speculative theory? Ok, let's go wild. There's no such thing as causation, only correlations exist. A causal relationship X to Y is a correlation where we say that "X could plausibly have had another value, and Y would also be changed" while saying "Y could plausibly have had another value, without changing X". I flip the switch (X) and the lights go on (Y). When I don't flip the switch, the lights don't go on; and when someone else flips the switch, the lights go on (Y can happen without X happening). Of course, this narrative is complicated by all sorts of caveats -- there needs to be electricity, a non-burnt out light bulb -- and a lot of induction and grouping together of similar situations.

Thus counterfactuals don't really exist; we have taken different correlational observations, and formalised a statement like "I could have not flipped the switch" by comparing it with similar situations. So formally defined counterfactuals just don't exist.

We can still do almost all causal reasoning, but, philosophically, there is no causation, no counterfactuals, and counterfactual worlds are purely incorrect models. But decision theories that rely on there being an actual separation between causation and correlation, and on counterfactuals meaning something in a strong sense, will break if you push them too far. I'm hoping a new theory will be able to resolve this issue properly. ↩︎
A counterfactual predictor needn't run multiple counterfactuals, and its prediction scenario may turn out to be the real one. In Transparent Newcomb, if you one-box the prediction scenario was real; if you two-box it wasn't. It's the potential gap between the prediction scenario and the real one that makes the predictor counterfactual. ↩︎
Minor rant: unless the size of the payoff is the point (Pascal's Mugging, dust specks vs torture), I dislike thought experiments where one reward dwarfs the other. Bomb weighs a lethal explosion against $100. Schwarz weighs ruin against paying $1 to a blackmailer -- "of course you should pay!" Eliezer once weighed a 10%-effective asteroid deflector against a possibly-100%-effective one. Cranking the stakes just tempts us to take the safe option out of fear or expedience, which muddies the intuition it's meant to isolate. ↩︎
A straight predictor doesn't visibly change the scenario, and the note is visible. So, to include the note, the predictor would have to have composed the note before running the simulation -- but its bomb decision depends on that simulation's outcome. So the note's contents can carry no information about whether the bomb is there. To make the note informative one needs a counterfactual predictor: e.g. one that models the agent in the presence of the note, and leaves note-and-bomb if the agent would go Right, neither if it would go Left. ↩︎