## AI ALIGNMENT FORUMAF

The procrastination paradox is isomorphic to well-founded recursion. In the reasoning, the fourth step, "whether or not I press the button, the next agent or an agent after that will press the button" is an invalid proof-step; it's shown that there is an inductive steps ending at the conclusion, but not that that chain has a base case.

This can only happen when the relation between an agent and its successor is not well-founded. If there is any well-founded relation between agents and their successors - either because they're in a finite universe, or because the first agent picked a well-founded relation and build that in - then the button will eventually get pushed.

# 0

The procrastination paradox relates to the following problem:

1. There are infinite time steps (one for each natural number).
2. For each time step, there is an agent.
3. Each agent may press the button or not.
4. Each agent will get utility 1 if and only if it or a later agent presses the button.

The paradox is that the following reasoning process leads to the button never getting pressed:

• My rule (and the rule for all future agents) is that, if I can prove that the button will be pressed in the future, then I will not press the button, and otherwise I will. This rule (appears to) maximizes utility.
• The next agent uses this rule, so the next agent will only fail to press the button if it can prove that the button will be pressed by some later agent.
• I trust the next agent's proof system.
• Therefore, whether or not I press the button, the next agent or an agent after that will press the button.
• Therefore, if I don't press the button, my utility is 1.
• Therefore, my rule says I will not press the button.

But the same reasoning can be used for every time step! No agent will press the button, and all will trust that some future agent will. The flaw here is that we constructed a sequence of logical systems (one for each agent), each of which considers the next one sound.

We can formalize the procrastination paradox using reflective oracles instead of logic. Suppose each agent is a reflective CDT agent. Define a machine to represent whether agent presses the button (for some natural number ), and define a machine to represent whether agent or a later agent presses the button:

The first line states that the agent must press the button if , and may do anything otherwise. In the second line, , and is a way of sampling a bit with the same distribution , as defined in this post:

So we call our oracle on the pair (), and throw a fair coin.

• If the coin comes up heads and the oracle says “false” (the probability of returning 1 is smaller than 0.5), we output a zero.
• If the coin lands heads and the oracle says “true”, we output a one.
• If the coin lands tails and the oracle says “false”, we call our oracle on (,0.25) and repeat the process; if the coin lands tails and the oracle says “true”, we call the oracle on (,0.75) and repeat.

Observe that an assignment of probabilities to each (from which we can determine an consistent with these probabilities) is consistent if and only if each . This is because if any , then would also be less than 1, which implies , which implies .

Although a reflective oracle must assign for all , we have no restrictions on given this, so it is consistent for the oracle to say that no agent presses the button, but the button gets pressed eventually. Therefore, a sequence of reflective CDT agents reasoning in this fashion may choose to never press the button!

As reflective oracles were derived from Paul's probabilistic logic, we would expect this proof to resemble the proof that Paul's logic fails the procrastination paradox.
The proofs are not exactly analogous (specifically, the proof for Paul's logic does not use recursion to define the statement that the button is pressed in the future), but they are similar. Perhaps if we can solve the problem in the simpler case with reflective oracles, we can adapt the solution to talk about Paul's logic. For example, Benja suggested that we could restrict utility functions to be a continuous function of the actions (in that sufficiently late actions can only have a small effect on the resulting utility), and then prove an optimality result for reflective oracles that depends on continuity.