Vladimir Nesov


Oracle predictions don't apply to non-existent worlds

IMO, either the Oracle is wrong, or the choice is illusory

This is similar to determinism vs. free will, and suggests the following example. The Oracle proclaims: "The world will follow the laws of physics!". But in the counterfactual where an agent takes a decision that won't actually be taken, the fact of taking that counterfactual decision contradicts the agent's cognition following the laws of physics. Yet we want to think about the world within the counterfactual as if the laws of physics are followed.

Oracle predictions don't apply to non-existent worlds

No, I suspect it's a correct ingredient of counterfactuals, one I haven't seen discussed before, not an error restricted to a particular decision theory. There is no contradiction in considering each of the counterfactuals as having a given possible decision made by the agent and satisfying the Oracle's prediction, as the agent doesn't know that it won't make this exact decision. And if it does make this exact decision, the prediction is going to be correct, just like the possible decision indexing the counterfactual is going to be the decision actually taken. Most decision theories allow explicitly considering different possible decisions, and adding correctness of the Oracle's prediction into the mix doesn't seem fundamentally different in any way; it's similarly sketchy.

Oracle predictions don't apply to non-existent worlds

Specifically, when you say "It's only guaranteed to be correct on the actual decision", why does the agent not know what "correct" means for the decision?

The agent knows what "correct" means: correctness of a claim is defined for the possible worlds that the agent is considering while making its decision (which by local tradition we confusingly call "counterfactuals" collectively, even though one of them is generated by the actual decision and isn't contrary to any fact).

In the post, Chris_Leong draws attention to the point that since the Oracle knows which possible world is actual, nothing forces its prediction to be correct on the other possible worlds that the agent foolishly considers, not knowing that they are contrary to fact. And my point in this thread is that, despite this uncertainty, it seems like we have to magically stipulate correctness of the Oracle on all possible worlds in the same way that we already magically stipulate the possibility of making different decisions in different possible worlds, and this analogy might cast some light on the nature of this magic.

Oracle predictions don't apply to non-existent worlds

No, this is an important point: the agent normally doesn't know the correctness scope of the Oracle's prediction. It's only guaranteed to be correct on the actual decision, and can be incorrect in all other counterfactuals. So if the agent knows the boundaries of the correctness scope, they may play chicken and render the Oracle wrong by enacting the counterfactual where the prediction is false. And if the agent doesn't know the boundaries of the prediction's correctness, how are they to make use of it in evaluating counterfactuals?
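A toy model may make the chicken move concrete. It is a minimal sketch under assumptions that are not in the original: there are two hypothetical decisions labeled `A` and `B`, the agent's policy is a function from the delivered prediction to the decision actually taken, and a prediction counts as correct in the actual world only if the policy, given that prediction, enacts it. The policy names below are illustrative only.

```python
from typing import Callable

# Hypothetical decision labels; the original discussion doesn't fix any.
DECISIONS = ("A", "B")

# A policy maps the prediction the agent is shown to the decision it actually takes.
Policy = Callable[[str], str]


def correct_predictions(policy: Policy) -> list[str]:
    """Predictions that would come true if delivered: those p with policy(p) == p."""
    return [p for p in DECISIONS if policy(p) == p]


def comply(prediction: str) -> str:
    # Does whatever was predicted, so every prediction is self-fulfilling.
    return prediction


def always_a(prediction: str) -> str:
    # Ignores the prediction entirely.
    return "A"


def chicken(prediction: str) -> str:
    # Knows the prediction's correctness scope and enacts the world where it is false.
    return "B" if prediction == "A" else "A"


for policy in (comply, always_a, chicken):
    print(policy.__name__, correct_predictions(policy))
```

In this toy model `comply` admits both predictions, `always_a` admits only `A`, and `chicken` admits none, so any prediction delivered to a chicken-playing agent comes out false.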

It seems that the way to reason about this is to stipulate correctness of the prediction in all counterfactuals, even though it's not necessarily correct in all counterfactuals, in the same way as the agent's decision that is being considered is stipulated to be different in different counterfactuals, even though the algorithm forces it to be the same. So it's a good generalization of the problem of formulating counterfactuals: it moves the intervention point from the agent's own decisions to the correctness of powerful predictors' claims. These claims act on the counterfactuals generated by the agent's own decisions, not on the counterfactuals generated by delivery of possible claims, so it's not about merely treating predictors as agents; it's a novel setup.

Oracle predictions don't apply to non-existent worlds

When the Oracle says "The taxi will arrive in one minute!", you may as well grab your coat.

Oracle predictions don't apply to non-existent worlds

It's not actually more general; it's about a somewhat different point. The more general statement could use some sort of notion of relative actuality, pointing at the possibly counterfactual world determined by the decision made in the world where the prediction was delivered. That world is distinct from the even more counterfactual worlds where the prediction was delivered but the decision differed from what it would relative-actually be had the prediction been delivered, and from the worlds where the prediction was not delivered at all.

If the prediction is not actually delivered, then it only applies to that intermediately-counterfactual world, and not to the more counterfactual alternatives where the prediction was still delivered, nor to the less counterfactual situation where the prediction is not delivered. Saying that the prediction applies to the world where it's delivered is liable to be interpreted as including the more-counterfactual worlds, but it doesn't have to apply there; it only applies to the relatively-actual world. So your original framing has a part necessary for saying this carefully that my framing didn't include, and replacing it with my framing discards this correct detail. The Oracle's prediction only has to apply to the "relatively-actual" world where the prediction is delivered.

Oracle predictions don't apply to non-existent worlds

Consider the variant where the Oracle demands a fee of 100 utilons after delivering the prediction, which you can't refuse. Then the winning strategy is going to be about ensuring that the current situation is counterfactual, so that in actuality you won't have to pay the Oracle's fee, because the Oracle wouldn't be able to deliver a correct prediction.
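The fee variant can be sketched as a toy model under assumed rules that the original doesn't spell out: the Oracle delivers a prediction, and collects the 100-utilon fee, only when some prediction would come true given the agent's policy; otherwise nothing is delivered and nothing is paid. The decisions themselves are assumed to be worth nothing, so the fee is the only thing at stake, and the labels `A`/`B` and function names are hypothetical.

```python
from typing import Callable

FEE = 100  # utilons, charged only after a prediction is delivered
DECISIONS = ("A", "B")  # hypothetical decision labels


def actual_utility(policy: Callable[[str], str]) -> int:
    """Utility in the actual world under the assumed delivery rule."""
    deliverable = [p for p in DECISIONS if policy(p) == p]
    if deliverable:
        # Some prediction is self-consistent, so the Oracle can be correct:
        # it delivers and charges the fee.
        return -FEE
    # No prediction would come true, so the delivery scenario stays counterfactual.
    return 0


def follow(prediction: str) -> str:
    # Enacts whatever is predicted.
    return prediction


def defy(prediction: str) -> str:
    # Enacts the opposite of whatever is predicted.
    return "B" if prediction == "A" else "A"


print(actual_utility(follow))
print(actual_utility(defy))
```

Under these assumptions `follow` ends up paying (utility -100) while `defy` pays nothing (utility 0), because defying makes the situation where the prediction is delivered counterfactual.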

The Oracle's prediction only has to apply to the world that is. It doesn't have to apply to worlds that are not.

The Oracle's prediction only has to apply to the world where the prediction is delivered. It doesn't have to apply to the other worlds. The world where the prediction is delivered can be the world that is not, and another world can be the world that is.

Can you control the past?

One way of noticing the Son-of-CDT issue dxu mentioned is thinking of CDT as not just being unable to control the events outside the future lightcone, but as not caring about the events outside the future lightcone. So even if it self-modifies, it's not going to accept tradeoffs between the future and not-the-future of the self-modification event, as that would involve changing its preference (and somehow reinventing preference for the events it didn't care about just before the self-modification event).

With time, CDT continually becomes numb to events outside its future and loses parts of its values. Self-modifying to Son-of-CDT stops further loss, but doesn't reverse past loss.

Can you control the past?

An agent's policy determines how its instances act, but in general it also determines which instances exist, and that motivates thinking of the agent as the algorithm channeled by its instances rather than as one of the instances controlling the others, or as all instances controlling each other. For example, in Newcomb's problem, you might be sitting inside the box with the $1M, and if you two-box, you have never existed. Grandpa decides to only have children if his grandchildren one-box. Or consider copies in distant rooms, numbered (on the outside) 1 to 5, writing integers on blackboards, with only the rooms whose number differs from the written integer by at most 1 being occupied. In the occupied rooms, the shape of the digits is exactly the same, but the choice of the integer determines which (if any) of the rooms are occupied. You may carefully write a 7, and all rooms are empty.
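The blackboard example is small enough to enumerate. This is a minimal sketch, assuming every instance runs the same algorithm and therefore writes the same integer, so the whole policy is that one integer; the function name `occupied_rooms` is hypothetical.

```python
ROOMS = range(1, 6)  # rooms numbered 1 to 5 on the outside


def occupied_rooms(written: int) -> list[int]:
    """Rooms holding an instance, given the integer every instance writes.

    A room is occupied only if its number differs from the written integer by at most 1.
    """
    return [room for room in ROOMS if abs(room - written) <= 1]


for n in (1, 3, 6, 7):
    print(n, occupied_rooms(n))
```

Writing 3 occupies rooms 2 to 4, writing 6 leaves only room 5, and writing 7 leaves every room empty, even though the digits look the same in whichever rooms are occupied.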

If you are the algorithm, which algorithm are you, and which instances are running you? Unfortunate policy decisions, such as thinking too much, can sever control over some instances, as in ASP (Agent Simulates Predictor), or when an instance retracts too much knowledge (UDT-style) and the resulting algorithm then has to examine too many possible states of knowledge or possible observations, grasping at a wider scope but losing traction, because the instances can no longer channel such an algorithm. Decisions of some precursor algorithm may even determine which successor algorithm an instance is running, not just which policy a fixed algorithm executes, in which case identifying with the instance is even less coherent than if it can merely cease to exist.

paulfchristiano's Shortform

The point is that in order to be useful, a prediction/reasoning process should contain mesa-optimizers that perform decision making similar in a value-laden way to what the original humans would do. The results of the predictions should be determined by decisions of the people being predicted (or of people sufficiently similar to them), in the free-will-requires-determinism/you-are-part-of-physics sense. The actual cognitive labor of decision making needs to in some way be an aspect of the process of prediction/reasoning, or it's not going to be good enough. And in order to be safe, these mesa-optimizers shouldn't be systematically warped into something different (from a value-laden point of view), and there should be no other mesa-optimizers with meaningful influence in there. This just says that prediction/reasoning needs to be X-and-only-X in order to be safe. Thus the equivalence. Prediction of exact imitation in particular is weird because in that case the similarity measure between prediction and exact imitation is implied not to be value-laden, which it might have to be in order for the prediction to be both X-and-only-X and efficient.

This is only unimportant if X-and-only-X is the likely default outcome of predictive generalization, so that not paying attention to it won't result in failure, but nobody knows whether this is the case.

The mesa-optimizers in the prediction/reasoning that are similar to the original humans are what I mean by efficient imitations (whether X-and-only-X or not). They are not themselves the predictions of the original humans (or of exact imitations), which might well not be present as explicit parts of the design of reasoning about the process of reflection as a whole; instead, they are the implicit decision makers that determine what the conclusions of the reasoning say, and they are much more computationally efficient (as aspects of cheaper reasoning) than exact imitations. At the same time, if they are similar enough in a value-laden way to the originals, there is no need for better predictions, much less for exact imitation: the prediction/reasoning is itself the imitation we'd want to use, without any reference to an underlying exact process. (In a story simulation, there are no concrete states of the world, only references to states of knowledge, yet there are mesa-optimizers who are the people inhabiting it.)

If prediction is to be value-laden, with value defined by reflection built out of that same prediction, the only sensible way to set this up seems to be as a fixpoint of an operator that maps (states of knowledge about) values to (states of knowledge about) values-on-reflection, computed by making use of the argument values to do value-laden efficient imitation. But if this setup is not performed correctly, then even if it's set up at all, we are probably going to get bad fixpoints, as happens with bad Nash equilibria and the like. And if it is performed correctly, then it might be much more sensible to allow an AI to influence what happens within the process of reflection more directly than merely by making systematic distortions in predicting/reasoning about it, so hypothetical processes of reflection wouldn't need the isolation from the AI's agency that normally makes them safer than the actual process of reflection.
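As a generic numeric illustration of why a fixpoint setup can land on different fixpoints depending on how it is started (this is not a model of values or reflection), here is plain fixpoint iteration on a made-up operator on [0, 1], f(x) = 3x² - 2x³, which has three fixpoints: 0, 1/2 and 1. The operator and the helper `iterate_to_fixpoint` are hypothetical choices for the sketch.

```python
from typing import Callable


def iterate_to_fixpoint(f: Callable[[float], float], x0: float,
                        tol: float = 1e-12, max_steps: int = 1000) -> float:
    """Apply f repeatedly from x0 until successive values differ by less than tol."""
    x = x0
    for _ in range(max_steps):
        nxt = f(x)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    raise RuntimeError("no convergence within max_steps")


def operator(x: float) -> float:
    # Made-up operator with fixpoints at 0, 1/2 and 1; the one at 1/2 is unstable.
    return 3 * x**2 - 2 * x**3


for start in (0.4, 0.5, 0.6):
    print(start, iterate_to_fixpoint(operator, start))
```

Starting below 1/2 the iteration settles at 0, starting above it settles at 1, and only an exact start at 1/2 stays there, so which fixpoint the procedure reports is decided by the setup rather than by the operator alone.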
