Abram Demski

20

Yeah, in hindsight I realize that my iterated mugging scenario only communicates the intuition to people who already have it. The Lizard World example seems more motivating.

20

You can do exploration, but the problem is that (unless you explore into non-fixed-point regions, violating epistemic constraints) your exploration can never confirm the existence of a fixed point which you didn't previously believe in. However, I agree that the situation is analogous to the handstand example, assuming it's true that you'd never try the handstand. My sense is that the difficulties I describe here are "just the way it is" and only count against FixDT in the sense that we'd be happier with FixDT if somehow these difficulties weren't present.

I think your idea for how to find repulsive fixed-points could work if there's a trader who can guess the location of the repulsive point exactly rather than approximately, and has the wealth to precisely enforce that belief on the market. However, the wealth of that trader will act like a martingale; there's no reliable profit to be made (even on average) by enforcing this fixed point. Therefore, such a trader will go broke eventually. On the other hand, attractive fixed points allow profit to be made (on average) by approximately guessing their locations.

Repulsive points effectively "drain willpower".

20

I think so, yes, but I want to note that my position is consistent with nosy-neighbor hypotheses not making sense. A big part of my point is that there's a lot of nonsense in a broad prior. I think it's hard to rule out the nonsense without learning. If someone thought nosy neighbors *always* 'make sense', it could be an argument against my whole position. (Because that person might be just fine with UDT, thinking that my nosy-neighbor 'problems' are *just* counterfactual muggings.)

Here's an argument that nosy neighbors can make sense.

For values, as I mentioned, a nosy-neighbors hypothesis is a value system which cares about what happens in many different universes, not just the 'actual' universe. For example, a utility function which assigns some value to statements of mathematics.

For probability, a nosy-neighbor is like the Lizard World hypothesis mentioned in the post: it's a world where what happens there depends a lot on what happens in *other* worlds.

I think what you wrote about staples vs paperclips nosy-neighbors is basically right, but maybe it can 'make more sense' if we rephrase it: "I (actual me) value paperclips being produced in the counterfactual(-from-my-perspective) world where I (counterfactual me) don't value paperclips."

Anyway, whether or not it makes intuitive sense, it's mathematically fine. The idea is that a world will contain facts that are a good lens into alternative worlds (such as facts of Peano Arithmetic), which utility hypotheses / probabilistic hypotheses can care about. So although a hypothesis is only mathematically defined as a function of worlds where it holds, it "sneakily" depends on stuff that goes on in other worlds as well.

20

I disagree with this framing. Sure, if you have 5 different cakes, you can eat some and have some. But for any particular cake, you can't do both. Similarly, if you face 5 (or infinitely many) identical decision problems, you can choose to be updateful in some of them (thus obtaining useful Value of Information, that increases your utility in some worlds), and updateless in others (thus obtaining useful strategic coherence, that increases your utility in other worlds). The fundamental dichotomy remains as sharp, and it's misleading to imply we can surmount it. It's great to discuss, given this dichotomy, which trade-offs we humans are more comfortable making. But I've felt this was obscured in many relevant conversations.

I don't get your disagreement. If your view is that you can't eat one cake and keep it too, and my view is that you can eat some cakes and keep other cakes, isn't the obvious conclusion that these two views are compatible?

I would also argue that you can slice up a cake and keep some slices but eat others (this corresponds to mixed strategies), but this feels like splitting hairs rather than getting at some big important thing. My view *is mainly about* iterated situations (more than one cake).

Maybe your disagreement would be better stated in a way that didn't lean on the cake analogy?

My point is that the theoretical work you are shooting for is so general that it's closer to "what sorts of AI designs (priors and decision theories) should always be implemented", rather than "what sorts of AI designs should humans in particular, in this particular environment, implement".

And I think we won't gain insights on the former, because there are no general solutions, due to fundamental trade-offs ("no-free-lunches").

I think we could gain many insights on the latter, but that the methods better fit for that are less formal/theoretical and way messier/"eye-balling"/iterating.

Well, one way to continue this debate would be to discuss the concrete promising-ness of the pseudo-formalisms discussed in the post. I think there are some promising-seeming directions.

Another way to continue the debate would be to discuss theoretically whether theoretical work can be useful.

It sort of seems like your point is that theoretical work always needs to be predicated on simplifying assumptions. I agree with this, but I don't think it makes theoretical work useless. My belief is that we should continue working to make the assumptions more and more realistic, but the 'essential picture' is often preserved under this operation. (EG, Newtonian gravity and general relativity make most of the same predictions in practice. Kolmogorov axioms vindicated a lot of earlier work on probability theory.)

20

This was very thought-provoking, but unfortunately I still think this crashes head-on with the realization that, a priori and in full generality, we can't differentiate between safe and unsafe updates. Indeed, why would we expect that no one will punish us by updating on "our own beliefs" or "which beliefs I endorse"? After all, that's just one more part of reality (without a clear boundary separating it).

I'm comfortable explicitly assuming this isn't the case for nice clean decision-theoretic results, so long as it looks like the resulting decision theory also handles this possibility 'somewhat sanely'.

It sounds like you are correctly explaining that our choice of prior will be, in some important sense, arbitrary: we can't know the correct one in advance, we always have to rely on extrapolating contingent past observations.

But then, it seems like your reaction is still hoping that we can have our cake and eat it: "I will remain uncertain about which beliefs I endorse, and only later will I update on the fact that I am in this or that reality. If I'm in the Infinite Counterlogical Mugging... then I will just eventually change my prior because I noticed I'm in the bad world!". But then again, why would we think this update is safe? That's just not being updateless, and losing out on the strategic gains from not updating.

My thinking is more that we should accept the offer finitely many times or some fraction of the times, so that we reap some of the gains from updatelessness while also 'not sacrificing too much' in particular branches.

That is: in this case at least it seems like there's concrete reason to believe we can have some cake and eat some too.

Since a solution doesn't exist in full generality, I think we should pivot to more concrete work related to the "content" (our particular human priors and our particular environment) instead of the "formalism".

This content-work seems primarily aimed at discovering and navigating actual problems similar to the decision-theoretic examples I'm using in my arguments. I'm more interested in gaining insights about what sorts of AI designs humans should implement. IE, the specific decision problem I'm interested in doing work to help navigate is the tiling problem.

20

You're right, I was overstating there. I don't think it's probable that everything cancels out, but a more realistic statement might be something like "if UDT starts with a broad prior which wasn't designed to address this concern, there will probably be many situations where its actions are more influenced by alternative possibilities (delusional, from our perspective) than by what it knows about the branch that it is in".

20

Yeah, I expect the Lizard World argument to be the more persuasive argument for a similar point. I'm thinking about reorganizing the post to make it more prominent.

20

Let's frame it in terms of value learning.

**Naive position:** UDT can't be combined with value learning, since UDT doesn't learn. If we're not sure whether puppies or rainbows are what we intrinsically value, but rainbows are easier to manufacture, then the superintelligent UDT will tile the universe with rainbows instead of puppies because that has higher expectation according to the prior, regardless of evidence it encounters suggesting that puppies are what's more valuable.

**Cousin_it's reply:** There's puppy-world and rainbow-world. In rainbow-world, tiling the universe with rainbows has 100 utility, and tiling the universe with puppies has 0 utility. In puppy-world, tiling the universe with puppies has 90 utility (because puppies are harder to maximize than rainbows), but rainbows have 0 utility.

The UDT agent gets to observe which universe it is in, although it has a 50-50 prior on the two. There are four policies:

- Always make puppies: this has a 50% chance of a utility of 90, and otherwise yields zero.
  - EV: 45

- Always make rainbows: 50% chance of utility 100, otherwise zero.
  - EV: 50

- Make puppies in rainbow world; make rainbows in puppy world.
  - EV: 0

- Make puppies in puppy world, make rainbows in rainbow world.
  - EV: 95

The highest EV is to do the obvious value-learning thing; so, there's no problem.
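The four expected values above can be checked by brute force. The following is only a minimal sketch of that arithmetic, not anything from the original discussion: the names `PRIOR`, `UTILITY`, `expected_value`, and `all_policies` are hypothetical, and a policy is modelled simply as a map from the observed world to the thing the agent manufactures there.

```python
from itertools import product

WORLDS = ("puppy", "rainbow")
ACTIONS = ("puppies", "rainbows")

# 50-50 prior over which world the agent is in (from the example).
PRIOR = {"puppy": 0.5, "rainbow": 0.5}

# Utility of tiling with a given thing in a given world (from the example).
# Each hypothesis only scores what happens in its own world.
UTILITY = {
    ("puppy", "puppies"): 90,
    ("puppy", "rainbows"): 0,
    ("rainbow", "puppies"): 0,
    ("rainbow", "rainbows"): 100,
}


def expected_value(policy):
    """Prior-weighted utility of a policy mapping world -> action."""
    return sum(PRIOR[w] * UTILITY[(w, policy[w])] for w in WORLDS)


def all_policies():
    """Enumerate every map from observed world to action (four in total)."""
    return [dict(zip(WORLDS, acts)) for acts in product(ACTIONS, repeat=len(WORLDS))]


best = max(all_policies(), key=expected_value)

for policy in all_policies():
    print(policy, expected_value(policy))
```

Under these assumptions, `best` is the policy that makes puppies in puppy-world and rainbows in rainbow-world, matching the claim that the updateless optimum coincides with ordinary value learning here.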

**Fixing the naive position:** Some hypotheses will "play nice" like the example above, and updateless value learning will work fine.

However, there are some versions of "valuing puppies" and "valuing rainbows" which value puppies/rainbows *regardless of which universe the puppies/rainbows are in*. This only requires that there's some sort of embedding of counterfactual information into the sigma-algebra which the utility functions are predicated on. For example, if the agent has beliefs about PA, these utility functions could check for the number of puppies/rainbows in arbitrary computations. This mostly won't matter, because the agent doesn't have any control over arbitrary computations; but some of the computations contemplated in Rainbow Universe will be good models of Puppy Universe. Such a rainbow-value-hypothesis will value policies which create rainbows over puppies *regardless of which branch they do it in*.

These utility functions are called "nosy neighbors" because they care about what happens in other realities, not just their own.

Suppose the puppy hypothesis and the rainbow hypothesis are both nosy neighbors. I'll assume they're nosy enough that they value puppies/rainbows in other universes exactly as much as in their own. There are four policies:

- Always make puppies: 50% chance of being worthless, if the rainbow hypothesis is true. 50% chance of getting 90 for making puppies in puppy-universe, plus 90 more for making puppies in rainbow-universe.
  - EV: 90

- Always make rainbows: 50% worthless, 50% worth 100 + 100.
  - EV: 100

- Make puppies in rainbow universe, rainbows in puppy universe: 50% a value of 90, 50% a value of 100.
  - EV: 95

- Puppies in puppy universe, rainbows in rainbow universe: again 50% a value of 90, 50% a value of 100.
  - EV: 95

In the presence of nosy neighbors, the naive position is vindicated: UDT doesn't do "value learning".
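The nosy-neighbor numbers can be reproduced the same way. This is a rough sketch under the assumption stated above, that each hypothesis values its favored object equally in every universe; the names `VALUE`, `nosy_utility`, and `expected_value` are hypothetical. The only change from the well-behaved case is that a hypothesis sums over what the policy does in *all* universes, not just its own.

```python
from itertools import product

WORLDS = ("puppy", "rainbow")
ACTIONS = ("puppies", "rainbows")

# 50-50 prior over which value hypothesis is true (from the example).
PRIOR = {"puppy": 0.5, "rainbow": 0.5}

# What each hypothesis thinks a universe tiled with a given thing is worth.
VALUE = {
    "puppy": {"puppies": 90, "rainbows": 0},
    "rainbow": {"puppies": 0, "rainbows": 100},
}


def nosy_utility(hypothesis, policy):
    """A nosy hypothesis scores the policy's output in every universe."""
    return sum(VALUE[hypothesis][policy[w]] for w in WORLDS)


def expected_value(policy):
    """Prior-weighted nosy utility of a policy mapping universe -> action."""
    return sum(PRIOR[h] * nosy_utility(h, policy) for h in WORLDS)


def all_policies():
    """Enumerate every map from observed universe to action."""
    return [dict(zip(WORLDS, acts)) for acts in product(ACTIONS, repeat=len(WORLDS))]


best = max(all_policies(), key=expected_value)

for policy in all_policies():
    print(policy, expected_value(policy))
```

With these numbers `best` is "always make rainbows": the policy's value no longer depends on matching the action to the observed universe, so observation stops mattering.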

The argument is similar for the case of 'learning the correct prior'. The problem is that if we start with a broad prior over possible priors, then there can be nosy-neighbor hypotheses which spoil the learning. These are hard to rule out, because it is hard to rule out simulations of other possible worlds.

30

Here's a different way of framing it: if we *don't* make this assumption, is there some *useful generalization of UDT* which emerges? Or, having not made this assumption, are we stuck in a quagmire where we can't really say anything useful?

I think about these sorts of 'technical assumptions' needed for nice DT results as "sanity checks":

- I think we need to make several significant assumptions like this in order to get nice theoretical DT results.
- These nice DT results won't precisely apply to the real world; however, they do show that the DT being analyzed *at least behaves sanely when it is in these 'easier' cases.*
- So it seems like the natural thing to do is prove tiling results, learning results, etc under the necessary technical assumptions, with *some* concern for how restrictive the assumptions are (broader sanity checks being better), and then also, check whether behavior is "at least somewhat reasonable" in other cases.

So if UDT fails to tile when we remove these assumptions, but at least appears to choose its successor in a reasonable way given the situation, this would count as a success.

Better, of course, if we can find the more general DT which tiles under weaker assumptions. I do think it's quite plausible that UDT needs to be generalized; I just expect *my* generalization of UDT will still need to make an assumption which rules out *your* counterexample to UDT.

This is the fundamental obstacle according to me, so, unfortunate that I haven't successfully communicated this yet.

Perhaps I could suggest that you try to prove your intuition here?