Response to "What does the universal prior actually look like?"

Comments from Paul Christiano, michaelcohen, and Signer.

I think the original post was pretty unclear (I was even more confusing 5 years ago than I am now) and it's probably worthwhile turning it into a more concrete/vivid scenario. Hopefully that will make it easier to talk about any remaining disagreements, and also will make the original post clearer to other folks (I think most people bounce off of it in its current form).

To make things more vivid I'll try to describe what the world might look like from our perspective if *we* came to believe that we were living inside the imagination of someone thinking about the universal prior. This is not a particularly realistic story, but hopefully it's clear enough to clear up the common confusions.

------------ Start story

Our civilization lives for an incredibly long time, we spread throughout the universe, and we do a bunch of fundamental physics. We eventually discover that our laws of physics are extremely simple inside a particular computational model. Perhaps our entire physics are described by a Turing machine with 22 states.

(Of course Turing machines are just the model of computation we use---it's much more likely that the alien civilization running the Solomonoff inductor uses a different model of computation. But to keep things vivid and simple I'll imagine that the alien civilization running the Solomonoff inductor uses Turing machines, and that to us Turing machines are an unfamiliar concept.)

(And on top of that, I think that our physics probably *aren't* most simply described as a Turing machine. But of course that's just a difference between us and the people who would live inside our Solomonoff inductor, and I hope it will be forgiven for the purpose of vividness.)

On top of that, we've done a ton of comparative intellectual history. We've simulated astronomical numbers of civilizations, and we know a lot about the whole range of models of computation that are developed and used by them to define hypothetical processes of Solomonoff induction. We know that Turing machines are a reasonably common model of computation (perhaps something isomorphic is used by one in every million civilizations).

Moreover, physics being represented by a simple Turing machine isn't really philosophically natural to us---they aren't the kind of world that we are inclined to think is deeply meaningful by virtue of their intrinsic simplicity. Simple Turing machines are *distinctive* to the intellectual project of computer science, they are an artifact of the way that computer science develops and is connected to society, not of any notion of simplicity that we find objectively appealing.

So when we see that physics happens to be a tiny Turing machine, it's a clue about the nature of reality. It's not decisive: Turing machines aren't the canonically simplest kind of object, but they also aren't incredibly unnatural, and much more importantly we can't actually tell whether we are living in a Turing machine or in some other kind of simple structure that gives rise to the same dynamics. (We've considered vast numbers of possibilities and basically know what the menu of options is.)

And of course we're thinking at great length about what we should do. We care a lot about what happens in our world, but we are also interested in the possibility that we are primarily operating within the imagination of someone thinking about Solomonoff induction, who themselves is living in a more intrinsically meaningful/important universe. We would have cared about that possibility anyway (and indeed people already discuss it and wonder what they should do in light of it), but it becomes much more salient to us once we discover that the basic nature of our universe appears to be *the kind of thing* that would be produced by a civilization trying to define a universal prior rather than the kind of thing that we feel "emerges naturally" or is intrinsically meaningful/important.

So we're interested in investing some tiny fraction of our resources in escaping from such simulations. The most important way to minimize the cost is by reducing probability. For example, if we decide to do something only with 1% probability (and in the other 99% of worlds focus exclusively on other things) then we only decrease our log probability by 7 bits. I actually doubt we'd waste those 7 bits, but it's worth noting how cheap this kind of breakout attempt could be.
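As a quick check on the arithmetic in the previous paragraph, here is a minimal sketch. The function name `bits_lost` is made up for illustration; the only assumption is that "log probability" is measured in base-2 bits.

```python
import math


def bits_lost(probability: float) -> float:
    """Bits of log probability given up by acting only with this probability."""
    return -math.log2(probability)


cost = bits_lost(0.01)
print(f"{cost:.2f} bits")  # 6.64 bits, which rounds up to the 7 bits quoted above
```

Halving the probability of acting always costs exactly one more bit, which is why small probabilities are so cheap in this accounting.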

Now we devote a tiny fraction of our energies (e.g. a billion billion years of effort from a civilization much more sophisticated than our own) to understanding how a breakout via Solomonoff induction might work. We run large numbers of historical simulations to understand the distribution of cases where Solomonoff inductors that favor us are instantiated---what are the civilizations like, what are the decisions that they face. What's the distribution of ways that they define the prior, and the distribution of ways that they connect up the model to sensors? And so on.

At this point we know a lot about the game, but we still have a lot of work to do. We have to think about all the ways in which our Turing machine might produce the output that is actually used by the Solomonoff inductor, and which of those we could influence how cheaply. Most of those possible outputs will be largely out of our control (e.g. they will have already output tons of bits) but some of them will be in our control (e.g. because they only output bits under some special conditions that we can bring about but have never occurred naturally, or they only start outputting bits at extremely late times, or whatever).

(Of course the fraction of output mappings we can control is very tiny, but so is the fraction of output mappings that directly correspond to what a particular camera on Old Earth sees. Indeed, output rules that correspond to camera inputs are essentially a special case of output rules that an embedded civilization can control! I believe that the fraction we can control, *when a civilization is collaborating with the machine to try to produce useful output*, is vastly larger than the fraction that records the output from a given camera that was constructed without any understanding of the nature of the underlying Turing machine. I think this is true by a huge margin and with very little uncertainty. The only real thing we have working against us is that we are much later in history, in a world with vastly more interesting stuff going on. That does make the point a bit subtle and adds some uncertainty, but I still feel like the calculus is overwhelmingly in favor of the people who are trying to make a beacon vs the people who randomly made a camera.)

So we end up with some leading hypotheses about the Turing machine we are running on, the history that gave rise to us, and the output rule used by that Turing machine. We choose which ones to influence, based on what we can influence cheaply, what other civilizations will be trying to influence, and so on. And then we build the (likely incredibly complex) machinery to control the physical world well enough to precisely control the output in the way that we want.

Separately from that, we need to choose the distribution of sequences that we want to play in the game. Intuitively, we basically sample a bunch of civilizations at random, pick out the inputs to which they apply Solomonoff induction, and then reweight them based on the prior probability of our universe (under the particular universal prior used by that sequence) and the value that we assign to corrupting their output. If we have enough compute we can literally do this---we run simulations of universes that we morally value and actually rejection sample the important Solomonoff inductors they contain. If that's computationally intractable (as seems most likely) then this is instead an incredibly complex scientific project to which we devote billions of years.

Once we've computed our sequences, then we actually write out that sequence. For example, we conduct the insane high-energy experiments that would drive a particular TM (that contains us) into a state where it outputs a bit. Of course we don't know for sure that this is the particular way that the TM works, since we can't infer the output rules from our observations, and probably it isn't. But that's just like saying that any given TM *probably* doesn't output the pixels of your camera. The point is that we are doing things that would cause a tiny fraction of the TMs containing us to output good sequences, and that's going to be a way higher fraction than those that happen to output the pixels of cameras.

Or we build a mega-structure that would be detected by a different output rule. Or we recognize a particular feature of our physical universe---e.g. the value of a particular field at a particular spot---as corresponding to a key cell running the TM, and then apply tons of energy controlling that critical feature to emit a simple start sequence that would never have appeared in history. Or we just build vast numbers of cameras. Or we ensure that a given pattern is the modal pattern output by a certain recording procedure.

In reality we do *tons* of those things since many of them are extremely cheap.

We never get any feedback about whether any of this works. We keep doing it indefinitely, controlling different output channels, to get as much influence as we can over people who reason about this kind of universal prior.

If we conclude that our structural advantage is large, then we may mostly be competing with other consequentialists who are doing similar things. If that's the case, then I think it's reasonably likely we will make bargains early in our history to avoid wasteful expenditures of resources (in the real world), and so will just spend enough resources to ensure that consequentialists collectively dominate the posterior.

> So we end up with some leading hypotheses about the Turing machine we are running on, the history that gave rise to us, and the output rule used by that Turing machine.

I feel like this story has run aground on an impossibility result. If a random variable’s value is unknowable (but its distribution is known) and an intelligent agent wants to act on its value, and they randomize their actions, the expected log probability of them acting on the true value cannot exceed the entropy of the distribution, no matter their intelligence. (And if they’re wrong about the r.v.’s distribution, they do even worse). But let’s assume they are correct. They know there are, say, 80 output instructions (two work tapes and one input tape, and a binary alphabet, and 10 computation states). And each one has a 1/3 chance of being “write 0 and move”, “write 1 and move”, or “do nothing”. Let’s assume they know the rules governing the other tape heads, and the identity of the computation states (up to permutation). Their belief distribution is (at best) uniform over these 3^80 possibilities. Is computation state 7 where most of the writing gets done? They just don’t know. It doesn’t matter if they’ve figured out that computation state 7 is responsible for the high-level organization of the work tapes. It’s totally independent. Making beacons is like assuming that computation state 7, so important for the dynamics of their world, has anything special to do with the output behavior. (Because what is a beacon if not something that speaks to *internally* important computation states?)
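To make the counting concrete, here is a rough sketch of where the 80 instructions and the 3^80 possibilities come from, and how many bits of uncertainty that is. The constant names are illustrative; the sketch assumes one output instruction per combination of computation state and the symbols under the three tape heads, which is how the figure of 80 appears to be obtained.

```python
import math

TAPES_READ = 3           # two work tapes plus one input tape
ALPHABET_SIZE = 2        # binary alphabet
COMPUTATION_STATES = 10
OUTPUT_ACTIONS = 3       # "write 0 and move", "write 1 and move", "do nothing"

# One output instruction per (computation state, symbols under the three heads).
num_instructions = COMPUTATION_STATES * ALPHABET_SIZE ** TAPES_READ

# Each instruction independently takes one of three forms.
num_output_rules = OUTPUT_ACTIONS ** num_instructions

# Entropy of the uniform distribution over all output rules, in bits.
entropy_bits = num_instructions * math.log2(OUTPUT_ACTIONS)

print(num_instructions)        # 80
print(f"{entropy_bits:.1f}")   # 126.8
```

Under a uniform belief, any single guess at the full output rule is right with log probability of about -126.8 bits, which is the negative of that entropy.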

That’s all going along with the premise that when consequentialists face uncertainty, they flip a coin, and adopt certainty based on the outcome. So if they think it’s 50/50 whether a 0 or a 1 gets output, they flip a coin or look at some tea leaves, and then act going forward as if they just learned the answer. Then, it only costs 1 bit to say they decided “0”. But I think getting consequentialists to behave this way requires an intervention into their proverbial prefrontal cortices. If these consequentialists were playing a bandit game, and one arm gave a reward of 0.9 with certainty, and the other was 50/50 between a reward of 0 or 1, they obviously don’t flip a coin to decide whether to act as if it’s really reward 1 or reward 0. (I understand that Thompson sampling is a great strategy, but only when your uncertainty is ultimately empirically resolvable, and uncertainty about output behavior is not).
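The bandit comparison can be worked through numerically. This is only a sketch of the two decision rules as described above: the variable names are hypothetical, and the "coin flip" rule is modeled as sampling a belief about the risky arm and then acting as if that belief were certain.

```python
SAFE_ARM_REWARD = 0.9                       # paid with certainty
RISKY_ARM_OUTCOMES = {0.0: 0.5, 1.0: 0.5}   # reward -> probability

risky_arm_expected = sum(r * p for r, p in RISKY_ARM_OUTCOMES.items())

# Expected-value maximizer: pull the arm with the higher expected reward.
maximizer_value = max(SAFE_ARM_REWARD, risky_arm_expected)

# "Flip a coin, then adopt certainty": sample a belief about the risky arm's
# reward and act as if it were true. The actual reward is still drawn
# independently of the belief, so pulling the risky arm still pays 0.5 on average.
coin_flip_value = 0.0
for believed_reward, belief_prob in RISKY_ARM_OUTCOMES.items():
    pulls_risky = believed_reward > SAFE_ARM_REWARD
    realized = risky_arm_expected if pulls_risky else SAFE_ARM_REWARD
    coin_flip_value += belief_prob * realized

print(maximizer_value)             # 0.9
print(round(coin_flip_value, 3))   # 0.7
```

The coin-flipping rule loses 0.2 in expectation here because the flip carries no information about the arm.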

I think you’ll get the impression that I’m being miserly with bits here. And your relative profligacy comes from the fact that you expect they’ll make it up easily in the anthropic update. But you’ll ultimately have to claim that the anthropic update is more cheaply specified as a natural product of consequentialist civilization than through direct encoding. And if every step of the way in your story about how consequentialists arrive at this behavior, I make the point that this is not sensible goal-oriented behavior, and you make the point that it only takes a few bits to make it be what they would do anyway, then if you put it all together, I’m not just haggling over bits. If you put it all together, it looks to me like the consequentialists’ consequentialism is doing little to none of the work; every step of the way, reasonable behavior is being overridden, a few bits at a time. So then I ultimately claim, this anthropic update is not most parsimoniously described as “that thing that consequentialists sometimes produce” because it’s just not actionable for any consequentialists with reasonable epistemics.

In a different part of the story, you claim that if something composes 1% of what consequentialists value, we can assume that they flip 6 and a half coins, and with 1% probability they act as if that’s the only thing they value.

> So we're interested in investing some tiny fraction of our resources in escaping from such simulations. The most important way to minimize the cost is by reducing probability. For example, if we decide to do something only with 1% probability (and in the other 99% of worlds focus exclusively on other things) then we only decrease our log probability by 7 bits. I actually doubt we'd waste those 7 bits, but it's worth noting how cheap this kind of breakout attempt could be.

This seems to me to be another case of supplanting consequentialists’ instrumental rationality with randomness. These are key parts of the story. The part where they make a wild guess about the output of their world, and the part where they decide to pursue it in the first place are both places where reasonable goal-oriented behavior is being *replaced* with deference to the descriptive and normative wisdom of tea leaves; this is not just specifying one of the things they sometimes do naturally with small-but-non-negligible probability. It would be an unusual model of agency which claimed that: for a utility function $U = \sum_i \alpha_i U_i$, we have $\pi^*_U = \pi^*_{U_i}$ with probability $\alpha_i$. And it would be an even more unusual model of agency which claimed that: for a world $W = \sum_i p_i W_i$, we have $\pi^*_W = \pi^*_{W_i}$ with probability $p_i$. Even within large multiplicative fudge factors. I feel like I need the assistance of a Dutch bookie here.

> Of course we don't know for sure that this is the particular way that the TM works, since we can't infer the output rules from our observations, and probably it isn't. But that's just like saying that any given TM *probably* doesn't output the pixels of your camera. The point is that we are doing things that would cause a tiny fraction of the TMs containing us to output good sequences, and that's going to be a way higher fraction than those that happen to output the pixels of cameras.

I don’t think it’s just like saying that. I think I have argued that the probability that consequentialists act on the belief that camera 12,041 has a direct line to the output tape is *smaller* than the probability that it is actually true. Likewise for something that appears “beacon-like” to the world’s residents. Given a state of total ignorance about the means by which they can affect the output tape, their guesses can be no better than the true prior distribution over what in the world has a direct line to the output tape. (This is contra: "the fact that they are trying puts them at a massive advantage" from your other comment; intelligence and effort don't work in a max-ent setting. With maximum entropy beliefs about the output channel, those silly no free lunch theorems of optimization do actually apply.) And then also given that ignorance, there is much less value in taking a stab in the dark. In general, with probabilistic mixtures over worlds $W = \sum_i p_i W_i$, $V^*_W \le \sum_i p_i V^*_{W_i}$.

> I don’t think it’s just like saying that...

I didn't quite get this, so let me try restating what I mean.

Let's say the states and rules for manipulating the worktapes are totally fixed and known, and we're just uncertain about the rules for outputting something to the output tape.

**Zero** of these correspond to reading off the bits from a camera (or dataset) embedded in the world. *Any* output rule that lets you read off precisely the bits from the camera is going to involve adding a bunch of new states to the Turing machine.

So let's instead consider the space of all ways that you can augment a given physics to produce outputs. This will involve adding a ton of states to the Turing machine. Most of them produce really garbage-y outputs, but a tiny fraction do something intelligent that produces a coherent-looking output.

Some fraction of those involve reading off the bits from a particular camera embedded in the world. Let's arbitrarily say it's ?

Now consider any given intervention that we can perform to try to manipulate the prior. For example, we can perform a high-energy physics experiment that produces an unprecedented interaction, and control a parameter of how that interaction occurs. We can write the bits in the pattern of giant star-sized clumps of matter. Or we can send the relevant bits out to infinity with unprecedentedly high energies. Or we can build a quadrillion cameras throughout the world. Or we can change the nature of the camera so that more of the possible output rules read off its values. Or we can alter the distribution of matter when the universe comes apart, so that a Turing machine can read it off from that. Or whatever.

It seems to me that "random camera on old Earth" is probably less likely to be output by the physics-extension than *some* of these other ways of encoding data. For example, maybe of all output rules end up reading off data from the highest-energy events in the universe, and we can influence that.

I think the only thing that really cuts against this is that a random camera on old earth (i) is earlier in history, (ii) takes place in a world with less interesting stuff going on (so that e.g. rules like "find some interesting structure and then read from it" have an easier time finding it). But those effects appear to be tiny potatoes (and I don't feel like you are engaging with them yet because you have prior objections).

Overall I'm playing a game of thinking about the distribution of output channels implied by the universal prior, and the probability that distribution places on "Camera on old earth" vs "Best things that a sophisticated civilization can control." I feel like you are playing some different game.

> With maximum entropy beliefs about the output channel, those silly no free lunch theorems of optimization do actually apply.

Most of the possible output rules are not controllable, so you can ignore them. And then amongst those that are controllable, you can control many of them at once.

Maybe one more example to try to communicate why this seems so wrong to me.

Suppose the physics turn out to be a game of life. A dataset on old earth is some totally random configuration of cells (you have to read off the magnetization of a piece of a hard drive, which is traveling through physical space at some high rate...). In order to read that off you need to add a *ton* of stuff to the TM that was simulating the game of life. You aren't doing something like "Every time you enter state 23, which is used to update a cell when it has 3 live neighbors, output a 0 and move the output head to the right." It's just not even in that vague zone, not even a little bit.

(This is unrealistic both because (i) game of life is an extremely unnatural physics to emerge under a TM-based universal prior, (ii) our physics is completely unlike the game of life. But I think the point is even stronger if you pick a more realistic example and I just want to try to get more and more concrete to understand where the miscommunication is happening.)

But now that we've learned that physics is the game of life, we can make *much* better guesses about how to build a dataset so that a TM could output it. For example, we can:

- Build the dataset at a large number of places. This appears to just linearly increase the probability that it gets read off by many kinds of rules. If the fraction of interesting structures devoted to the dataset is larger than the fraction of interesting structure devoted to the dataset on old earth, this seems to me like it is *already enough* for us to win the game.
- By performing an incredibly complex engineering project, we can e.g. encode the dataset in the values of a *particular cell* in the game of life, over a series of contiguous timesteps. This seems easier than reading off a dataset embedded in some random set of cells, which is in turn easier than reading off a dataset not even defined by cells but by higher-level patterns that are meaningful to us only because of the design of the hard drive.
- Suppose we discover that the initial condition involved filling in cells in a tiny region of space and leaving the rest blank. Then I claim there are more TMs that read off values from that region of space than from distant regions of space (the fraction of TMs reading off values from any given part of space needs to fall to 0).
- We can encode the dataset in a huge number of different ways all at once, so that many different output rules will hit it.
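The first bullet's "linear increase" claim can be illustrated with a deliberately crude toy model. It assumes each read-off rule picks one "interesting structure" uniformly at random, which is a simplification introduced here and not something the argument depends on; the function name and all the numbers are hypothetical.

```python
from fractions import Fraction


def hit_probability(dataset_copies: int, interesting_structures: int) -> Fraction:
    """Chance that a rule reading one structure uniformly at random lands on the dataset."""
    return Fraction(dataset_copies, interesting_structures)


# Hypothetical numbers, chosen only for illustration.
old_earth = hit_probability(dataset_copies=1, interesting_structures=10**6)
later_world = hit_probability(dataset_copies=10**4, interesting_structures=10**9)

print(old_earth)                 # 1/1000000
print(later_world)               # 1/100000
print(later_world > old_earth)   # True
```

In this toy model the later civilization wins whenever its share of interesting structures devoted to the dataset exceeds old earth's share, even though its world contains far more interesting structure overall.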

You might think that some very different kind of rules dominates the probability of the camera embedded in the game of life, so that none of those arguments are relevant. For example, maybe you think that most of the probability comes from a TM that works by generating the game of life, then looping over small extraction methods until it finds one that has a certain property, and *then* using that extraction method to produce an output. I'm game with whatever alternative you want to propose; that is, I challenge you to find *any* plausible description of a rule that outputs the bits observed by a camera, for which I can't describe a simpler extraction rule that would output some set of bits controlled by the sophisticated civilization.

I'm imagining that the consequentialists care about something, like e.g. human flourishing. They think that they could use their control over the universal prior to achieve more of what they care about, i.e. by achieving a bunch of human flourishing in some other universe where someone thinks about the universal prior. Randomizing is one strategy available to them to do that.

So I'm saying that I expect they will do *better*---i.e. get *more* influence over the outside world (per unit of cost paid in their world)---than if they had simply randomized. That's because randomizing is one of the strategies available to them and they are trying to pick the best one.

(In fact I think they will do many orders of magnitude better than randomizing since they can simultaneously win for many different output methods, and they can ignore the overwhelming majority of output rules which have no chance of describing something interesting about the world).

You seem to be saying that they will get *less* influence than if they randomized. Something about how this behavior is not sensible "goal-oriented behavior," and instead the sensible goal-oriented behavior is something that *doesn't* get them any influence? In what sense do you think it is sensible goal-oriented behavior, if it doesn't result in getting any influence?

Maybe the key difference is that I'm talking about a scenario where the consequentialists have the goal of influencing the universal prior, and that possibility seems so weird to you that you aren't even engaging with it?

It's definitely not too weird a possibility for me. I'm trying to reason backwards here--the best strategy available to them *can't* be effective in expectation at achieving whatever their goals are with the output tape, because of information-theoretic impossibilities, and therefore, any given strategy will be that bad or worse, including randomization.

To express my confusion more precisely:

> I feel like this story has run aground on an impossibility result. If a random variable’s value is unknowable (but its distribution is known) and an intelligent agent wants to act on its value, and they randomize their actions, the expected log probability of them acting on the true value cannot exceed the entropy of the distribution, no matter their intelligence.

I think that's right (other than the fact that they can win simultaneously for many different output rules, but I'm happy ignoring that for now). But I don't see why it contradicts the story at all. In the story the best case is that we know the true distribution of output rules, and then we do the utility-maximizing thing, and that results in our sequence having way more probability than some random camera on old earth.

If you want to talk about the information theory, and ignore the fact that we can do multiple things, then we control the single output channel with maximal probability, while the camera is just some random output channel (presumably with some much smaller probability).

The information theory isn't very helpful, because actually all of the action is about which output channels are controllable. If you restrict to some subset of "controllable" channels, *and believe that any output rule that outputs the camera is controllable*, then the conclusion still holds. So the only way it fails is when the camera is higher probability than the best controllable output channels.

I currently don't understand the information-theoretic argument at all (and it feels like it must come down to some kind of miscommunication), so it seems easiest to talk about how the impossibility argument applies to the situation being discussed.

If we want to instead engage on the abstract argument, I think it would be helpful to me to present it as a series of steps that ends up saying "And that's why the consequentialists can't have any influence." I think the key place I get lost is the connection between the math you are saying and a conclusion about the influence that the consequentialists have.

If these consequentialists ascribed a value of 100 to the next output bit being 1, and a value of 0 to the next output bit being 0, and they valued nothing else, would you agree that all actions available to them have identical expected value under the distribution over Turing machines that I have described?

I don't agree, but I may still misunderstand something. Stepping back to the beginning:

Suppose they know the sequence that actually gets fed to the camera. It is x= 010...011.

They want to make the next bit 1. That is, they want to maximize the probability of the sequence (x+1)=010...011**1**.

They have developed a plan for controlling an output channel to get it to output (x+1).

For concreteness imagine that they did this by somehow encoding x+1 in a sequence of ultra high-energy photons sent in a particular direction. Maybe they encode 1 as a photon with frequency A and a 0 as a photon with frequency B.

There is no way this plan results in the next bit being 0. If they are wrong about how the output channel decodes photons (i.e. it decodes A as 0 and B as 1) then that channel isn't going to end up with any probability.

You don't try to encode 010...011**1** and then accidentally end up encoding 010...011**0**. You end up encoding something like 101...1000, or something totally different.
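A small sketch of the photon example makes the point mechanical. The helper names `encode` and `decode` are made up for illustration, and the 7-bit string is a stand-in, since the middle of x is elided in the text.

```python
def encode(bits: str, one: str = "A", zero: str = "B") -> str:
    """Map each bit to a photon frequency label."""
    return "".join(one if b == "1" else zero for b in bits)


def decode(photons: str, one: str = "A", zero: str = "B") -> str:
    """Map each photon frequency label back to a bit."""
    return "".join("1" if p == one else "0" for p in photons)


target = "0100111"           # stand-in for x+1
photons = encode(target)     # sent assuming A means 1 and B means 0

print(decode(photons))                      # 0100111
print(decode(photons, one="B", zero="A"))   # 1011000
```

A channel with the opposite convention reads the bitwise complement of the whole string, so the prefix no longer matches x and that channel contributes nothing to the probability of x followed by 0.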

> Suppose they know the sequence that actually gets fed to the camera.

If you're saying that they know their Turing machine has output x so far, then I 100% agree. What about in the case where they don't know?

I don't think I understand what you mean. Their goal is to increase the probability of the sequence x+1, so that someone who has observed the sequence x will predict 1.

What do you mean when you say "What about in the case where they don't know"?

I agree that under your prior, someone has no way to increase e.g. the fraction of sequences in the universal prior that start with 1 (or the fraction of 1s in a typical sequence under the universal prior, or any other property that is antisymmetric under exchange of 0 and 1).

Okay, now suppose they want the first N bits of the output of their Turing machine to obey predicate P, and they assign that a value of 100, and they assign a value of 0 to any N-bit string that does not obey predicate P. And they don't value anything else. If some actions have a higher value than other actions, what information about the output tape dynamics are they using, and how did they acquire it?

They are using their highest probability guess about the output channel, which will be higher probability than the output channel exactly matching some camera on old earth (but may still be very low probability). I still don't understand the relevance.

I'm probably going to give up soon, but there was one hint about a possible miscommunication:

> Suppose they want the first N bits of the output of their Turing machine to obey predicate P, and they assign that a value of 100

They don't care about "their" Turing machine, indeed they live in an infinite number of Turing machines that (among other things) output bits in different ways. They just care about the probability of the bitstring x+1 under the universal prior---they want to make the mass of x+1 larger than the mass of x+0. So they will behave in a way that causes some of the Turing machines containing them to output x+1.

And then the question is whether the total mass of Turing machines (i.e. probability of noise strings fed into the UTM) that they are able to get to output x+1 is larger or smaller than the mass of Turing machines that output x for the "intended" reason.

> They are using their highest probability guess about the output channel, which will be higher probability than the output channel exactly matching some camera on old earth (but may still be very low probability). I still don't understand the relevance.

I’m trying to find the simplest setting where we have a disagreement. We don’t need to think about cameras on earth quite yet. I understand the relevance isn’t immediate.

> They don't care about "their" Turing machine, indeed they live in an infinite number of Turing machines that (among other things) output bits in different ways.

I think I see the distinction between the frameworks in which we each most naturally think about the situation. I agree that they live in an infinite number of Turing machines, in the sense that their conscious patterns appear in many different Turing machines. All of these Turing machines have weight in some prior. When they change their behavior, they (potentially) change the outputs of any of these Turing machines. Taking these Turing machines as a set, weighted by those prior weights, we can consider the probability that the output obeys a predicate P. The answer to this question can be arrived at through an equivalent process. Let the inhabitants imagine that there is a correct answer to the question “which Turing machine do I *really* live in?” They then reason anthropically about which Turing machines give rise to such conscious experiences as theirs. They then use the same prior over Turing machines that I described above. And then they make the same calculation about the probability that “their” Turing machine outputs something that obeys the predicate P. So on the one hand, we could say that we are asking “what is the probability that the section of the universal prior which gives rise to these inhabitants outputs an output that obeys predicate P?” Or we could equivalently ask “what is the probability that this inhabitant ascribes to ‘its’ Turing machine outputting a string that obeys predicate P?”

There are facts that I find much easier to incorporate when thinking in the latter framework, such as “a work tape inhabitant knows nothing about the behavior of its Turing machine’s output tape, except that it has relative simplicity given the world that it knows.” (If it believes that its conscious existence depends on its Turing machine never having output a bit that differs from a data stream in a base world, it will infer other things about its output tape, but you seem to disagree that it would make that assumption, and I’m fine to go along with that). (If the fact were much simpler—“a work tape inhabitant knows nothing about the behavior of its Turing machine’s output tape” full stop—I would feel fairly comfortable in either framework.)

If it is the case that, for any action that a work tape inhabitant takes, the following is unchanged: [the probability that *it* (anthropically) ascribes to “its” Turing machine printing an output that obeys predicate P after it takes that action], then, no matter its choice of action, the probability under the universal prior that the output obeys predicate P is also unchanged.

What if the work tape inhabitant only cares about the output when the universal prior is being used for important applications? Let Q be the predicate [P and “the sequence begins with a sequence which is indicative of an important application of the universal prior”]. The same logic that applies to P applies to Q. (It feels easier to talk about probabilities of predicates (expectations of Boolean functions) rather than expectations of general functions, but if we wanted to do importance weighting instead of using a strict predicate on importance, the logic is the same).

Writing about the fact I described above about what the inhabitants believe about their Turing machine’s output has actually clarified my thinking a bit. Here’s a predicate where I think inhabitants could expect certain actions to make it more likely that their Turing machine output obeys that predicate. “The output contains the string [particular 1000 bit string]”. They believe that their world’s output is simple given their world’s dynamics, so if they write that 1000 bit string somewhere, it is more likely for the predicate to hold. (Simple manipulations of the string are nearly equally more likely to be output).

So there are *severe* restrictions on the precision with which they can control even low-probability changes to the output, but not total restrictions. So I wasn’t quite right in describing it as a max-entropy situation. But the one piece of information that distinguishes their situation from one of maximum uncertainty about the output is very slight. So I think it’s useful to try to think in terms of how they get from that information to their goal for the output tape.

I was describing the situation where I wanted to maximize the probability that the output of our world obeys the predicate: “this output causes decision-maker simulators to believe that virtue pays”. I think I could very slightly increase that probability by trying to reward virtuous people around me. Consider consequentialists who want to maximize the probability of the predicate “this output causes simulator-decision-makers to run code that recreates us in their world”. They want to make the internals of their world such that there are simple relative descriptions for outputs for which that predicate holds. I guess I think that approach offers extremely limited and imprecise ability to deliberately influence the output, no matter how smart you are.

If an approach has very limited success probability, (i.e. very limited sway over the universal prior), they can focus all their effort on mimicking a few worlds, but then we’ll probably get lucky, and ours won’t be one of the ones they focus on.

From a separate recent comment,

But now that we've learned that physics is the game of life, we can make *much* better guesses about how to build a dataset so that a TM could output it. For example, we can:

- Build the dataset at a large number of places.
- [etc.]
...

I challenge you to find *any* plausible description of a rule that outputs the bits observed by a camera, for which I can't describe a simpler extraction rule that would output some set of bits controlled by the sophisticated civilization.

You're comparing the probability of one of these many controlled locations driving the output of the machine to the probability that a random camera on an earth-like Turing machine drives the output. Whereas it seems to me like the right question is to look at the absolute probabilities that one of these controlled locations drives the output. The reason is that what they attempt to output is a mixture over many sequences that a decision-maker-simulator might want to know about. But if the sequence we're feeding in is from a camera on earth, then their antics only matter to the extent that their mixture puts weight on a random camera on earth. So *they* have to specify the random camera on an earth-like Turing machine too. They're paying the same cost, minus any anthropic update. So the costs to compare are roughly [- log prob. successful control of output + bits to specify camera on earth - bits saved from anthropic update] vs. [bits to specify camera on earth - bits saved from directly programmed anthropic update]. This framing seems to imply we can cross off [bits to specify camera on earth] from both sides.

bits to specify camera on earth - bits saved from anthropic update

I think the relevant number is just "log_2 of the number of predictions that the manipulators want to influence." It seems tricky to think about this (rather small) number as the difference between two (giant) numbers.

So *they* have to specify the random camera on an earth-like Turing machine too.

They are just looking at the earth-like Turing machine, looking for the inductors whose predictions are important, and then trying to copy those input sequences. This seems mostly unrelated to the complexity of adding states to the Turing machine so that it reads data from a particular location on a particular hard drive. It just rests on them being able to look at the simulation and figure out what's going on.

On the other hand, the complexity of adding states to the Turing machine so that it reads data from a particular location on a particular hard drive seems *very closely* related to the complexity of adding states to the Turing machine so that it outputs data encoded by the sophisticated civilization in the format that they thought was easiest for the Turing machine to output.

bits to specify camera on earth - bits saved from directly programmed anthropic update

Do you have some candidate "directly programmed anthropic update" in mind? (That said, my original claim was just about the universal prior, not about a modified version with an anthropic update)

I still feel like the quantitative question we're discussing is a blow-out and it's not clear to me where we are diverging on that. My main uncertainty about the broader question is about whether any sophisticated civilizations are motivated to do this kind of thing (which may depend on the nature of the inductor and how much reasoning they have time to do, since that determines whether the inductor's prediction is connected in the decision-theoretically relevant way with the civilization's decisions or commitments).

Do you have some candidate "directly programmed anthropic update" in mind? (That said, my original claim was just about the universal prior, not about a modified version with an anthropic update)

I’m talking about the weight of an anthropically updated prior *within* the universal prior. I should have added “+ bits to encode anthropic update directly” to that side of the equation. That is, it takes some number of bits to encode “the universal prior, but conditioned on the strings being important to decision-makers in important worlds”. I don’t know how to encode this, but there is presumably a relatively simple direct encoding, since it’s a relatively simple concept. This is what I was talking about in my response to the section “The competition”.

One way of thinking about the bits saved from the anthropic update that might be helpful is that it is [- log prob. that the string is important to decision-makers in important worlds]. I think this gives us a handle in reasoning about anthropic savings as a self-contained object, even if it’s a big number.

> bits to specify camera on earth - bits saved from anthropic update

I think the relevant number is just "log_2 of the number of predictions that the manipulators want to influence." It seems tricky to think about this (rather small) number as the difference between two (giant) numbers.

But suppose they picked only one string to try to manipulate. The cost would go way down, but then it probably wouldn’t be us that they hit. If log of the number of predictions that the manipulators want to influence is 7 bits shorter than [bits to specify camera on earth - bits saved from anthropic update], then there’s a 99% chance we’re okay. If different manipulators in different worlds are choosing differently, we can expect 1% of them to choose our world, and so we start worrying again, but we add the 7 bits back because it’s only 1% of them.
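
As a rough sketch of the arithmetic in the paragraph above: if the set of targeted predictions falls short of the relevant candidates by some number of bits, the chance that a given string is missed is one minus two to the minus that gap. The function name `miss_probability` is purely illustrative, and the sketch assumes the targeted set covers exactly a 2^-gap fraction of the relevant probability mass.

```python
def miss_probability(gap_bits: float) -> float:
    """Chance that a particular string is NOT among the targeted ones,
    assuming the targets cover a 2**(-gap_bits) fraction of the mass."""
    return 1.0 - 2.0 ** (-gap_bits)


# A 7-bit shortfall means 1 in 128 strings is targeted.
print(f"{miss_probability(7):.4f}")  # prints 0.9922
```

So "99%" here is shorthand for 127/128.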

So let’s consider two Turing machines. Each row will have a cost in bits.

| A | B |
|---|---|
| Consequentialists emerge, | Directly programmed anthropic update. |
| make good guesses about controllable output, | |
| decide to output anthropically updated prior. | |
| Weight of earth-camera within anthropically updated prior | Weight of earth-camera within anthropically updated prior |

The last point can be decomposed into [description length of camera in our world - anthropic savings], but it doesn’t matter; it appears in both options.

I don’t think this is what you have in mind, but I’ll add another case, in case this is what you meant by “They are just looking at the earth-like Turing machine”. Maybe just skip this, though.

| A | B |
|---|---|
| Consq-alists emerge *in a world like ours*, | Directly prog. anthropic update. |
| make good guesses about controllable output, | |
| output (strong) anth. updated prior. | |
| Weight of earth-camera in strong anth. update | … in normal anth. update |

They can make a stronger anthropic update by using information about their world, but the savings will be equal to the extra cost of specifying that the consequentialists are in a world like ours. This is basically the case I mentioned above where different manipulators choose different sets of worlds to try to influence, but then the set of manipulators that choose our world has smaller weight.

------ end potential skip

What I think it boils down to is the question:

Is the anthropically updated version of the universal prior most simply described as “the universal prior, but conditioned on the strings being important to decision-makers in important worlds” or “that thing consequentialists sometimes output”? (And consequentialists themselves may be more simply described as “those things that often emerge”). “Sometimes” is of course doing a lot of work, and it will take bits to specify which “sometimes” we are talking about. If the latter is more simple, then we might expect the natural continuation of those sequences to usually contain treacherous turns, and if the former is more simple, then we wouldn’t. This is why I don’t think the weight of an earth-camera in the universal prior ever comes into it.

But/so I don’t understand if I’m missing the point of a couple paragraphs of your comment—the one which starts “They are just looking at the earth-like Turing machine”, and the next paragraph, which I agree with.

Here's my current understanding of your position:

1. The easiest way to specify an important prediction problem (in the sense of a prediction that would be valuable for someone to influence) is likely to be by saying "Run the following Turing machine, then pick an important decision from within it." Let's say the complexity of that specification is N bits.
2. You think that if consequentialists dedicate some fraction of their resources to doing something that's easy for the universal prior to output, it will still likely take more than N bits or not much less.
3. [Probably] You think the differences may be small enough that they can be influenced by factors of 1/1000 or 1/billion (i.e. 10-30 bits) of improbability of consequentialists spending significant resources in this task.
4. [Probably] You think the TM-definition update (where the manipulators get to focus on inductors who put high probability on their own universe) or the philosophical sophistication update (where manipulators use the "right" prior over possible worlds rather than choosing some programming language) are small relative to these other considerations.

I think the biggest disagreement is about 1+2. It feels implausible to me that "sample a data stream that is being used by someone to make predictions that would be valuable to manipulate" is simpler than any of the other extraction procedures that consequentialists could manipulate (like sample the sequence that appears the most times, sample the highest energy experiments, sample the weirdest thing on some other axis...)

But suppose they picked only one string to try to manipulate. The cost would go way down, but then it probably wouldn’t be us that they hit.

I think we're probably on the same page now, but I'd say: the consequentialists can also sample from the "important predictions" prior (i.e. the same thing as that fragment of the universal prior). If "sample output channel controlled by consequentialists" has higher probability than "Sample an important prediction," then the consequentialists control every important prediction. If on the other hand "Sample an important prediction" has higher probability than the consequentialists, I guess maybe they could take over a few predictions, but unless they were *super* close it would be a tiny fraction and I agree we wouldn't care.

Yeah, seems about right.

I think with 4, I've been assuming for the sake of argument that manipulators get free access to the right prior, and I don't have a strong stance on the question, but it's not complicated for a directly programmed anthropic update to be built on that right prior too.

I guess I can give some estimates for how many bits I think are required for each of the rows in the table. I'll give a point estimate, and a range for a 50% confidence interval for what my point estimate would be if I thought about it for an hour by myself and had to write up my thinking along the way.

I don't have a good sense for how many bits it takes to get past things that are just extremely basic, like an empty string, or an infinite string of 0s. But whatever that number is, add it to both 1 and 6.

1) Consequentialists emerge, 10 - 50 bits; point estimate 18

2) TM output has not yet begun, 10 - 30 bits; point estimate 18

3) make good guesses about controllable output, 18 - 150 bits; point estimate 40

4) decide to output anthropically updated prior, 8 - 35 bits; point estimate 15

5) decide to do a treacherous turn, 1 - 12 bits; point estimate 5

vs. 6) direct program for anthropic update. 18-100 bits; point estimate 30

The ranges are fairly correlated.
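
To make the comparison concrete, here is a minimal sketch that adds up the point estimates above. It assumes the costs of rows (1) through (5) simply add, ignores the correlation between the ranges, and leaves out the shared baseline cost mentioned earlier since it appears on both sides. The variable names are illustrative only.

```python
# Point estimates in bits, copied from the list above.
consequentialist_rows = {
    "1) consequentialists emerge": 18,
    "2) TM output has not yet begun": 18,
    "3) make good guesses about controllable output": 40,
    "4) decide to output anthropically updated prior": 15,
    "5) decide to do a treacherous turn": 5,
}
direct_program_bits = 30  # 6) direct program for anthropic update

# Assumption: the rows are independent costs, so the bits add.
consequentialist_total = sum(consequentialist_rows.values())
gap_bits = consequentialist_total - direct_program_bits

print(consequentialist_total, direct_program_bits, gap_bits)  # prints 96 30 66

# Rows (4) and (5) together, expressed as a fraction of consequentialists.
decide_bits = 15 + 5
print(f"1 in {2 ** decide_bits:,}")  # prints 1 in 1,048,576
```

On these point estimates the direct program comes out roughly 66 bits cheaper, though the wide ranges mean that figure should be read loosely.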

By (3) do you mean the same thing as "Simplest output channel that is controllable by advanced civilization with modest resources"?

I assume (6) means that your "anthropic update" scans across possible universes to find those that contain important decisions you might want to influence?

If you want to compare most easily to models like that, then instead of using (1)+(2)+(3) you should compare to (6') = "Simplest program that scans across many possible worlds to find those that contain some pattern that can be engineered by consequentialists trying to influence prior."

Then the comparison is between specifying "important predictor to influence" and whatever the easiest-to-specify pattern that can be engineered by a consequentialist. It feels extremely likely to me that the second category is easier, indeed it's kind of hard for me to see any version of (6) that doesn't have an obviously simpler analog that could be engineered by a sophisticated civilization.

With respect to (4)+(5), I guess you are saying that your point estimate is that only 1/million of consequentialists decide to try to influence the universal prior. I find that surprisingly low but not totally indefensible, and it depends on exactly how expensive this kind of influence is. I also don't really see why you are splitting them apart, shouldn't we just combine them into "wants to influence predictors"? If you're doing that presumably you'd both use the anthropic prior and then the treacherous turn.

But it's also worth noting that (6') gets to largely skip (4') if it can search for some feature that is mostly brought about deliberately by consequentialists (who are trying to create a beacon recognizable by *some* program that scans across possible worlds looking for it, doing the same thing that "predictor that influences the future" is doing in (6)).

I assume (6) means that your "anthropic update" scans across possible universes to find those that contain important decisions you might want to influence?

Yes, and then outputs strings from that set with probability proportional to their weight in the universal prior.

By (3) do you mean the same thing as "Simplest output channel that is controllable by advanced civilization with modest resources"?

I would say "successfully controlled" instead of controllable, although that may be what you meant by the term. (I decomposed this as controllable + making good guesses.) For some definitions of controllable, I might have given a point estimate of maybe 1 or 5 bits. But there has to be an output channel for which the way you transmit a bitstring out is the way the evolved consequentialists expect. But recasting it in these terms, which implicitly suggests that the specification of the output channel can take on some of the character of (6'), makes me want to put my range down to 15-60; point estimate 25.

instead of using (1)+(2)+(3) you should compare to (6') = "Simplest program that scans across many possible worlds to find those that contain some pattern that can be engineered by consequentialists trying to influence prior."

Similarly, I would replace "can be" with "seems to have been". And just to make sure we're talking about the same thing, it takes this list of patterns, and outputs them with probability proportional to their weight in the universal prior.

Yeah, this seems like it would make some significant savings compared to (1)+(2)+(3). I think replacing parts of the story from being specified as [arising from natural world dynamics] to being specified as [picked out "deliberately" by a program] generally leads to savings.

Then the comparison is between specifying "important predictor to influence" and whatever the easiest-to-specify pattern that can be engineered by a consequentialist. It feels extremely likely to me that the second category is easier, indeed it's kind of hard for me to see any version of (6) that doesn't have an obviously simpler analog that could be engineered by a sophisticated civilization.

I don't quite understand the sense in which [worlds with consequentialist beacons/geoglyphs] can be described as [easiest-to-specify controllable pattern]. (And if you accept the change of "can be" to "seems to have been", it propagates here). Scanning for important predictors to influence does feel very similar to me to scanning for consequentialist beacons, especially since the important worlds are plausibly the ones with consequentialists.

There's a bit more work to be done in (6') besides just scanning for consequentialist beacons. If the output channel is selected "conveniently" for the consequentialists, since the program is looking for the beacons, instead of the consequentialists giving it their best guess(es) and putting up a bunch of beacons, there has to be some part of the program which aggregates the information of multiple beacons (by searching for coherence, e.g.), or else determines which beacon takes precedence, and then also determines how to interpret their physical signature as a bitstring.

Tangent: in heading down a path trying to compare [scan for "important to influence"] vs. [scan for "consequentialist attempted output messages"] just now, my first attempt had an error, so I'll point it out. It's not necessarily harder to specify "scan for X" than "scan for Y" when X is a subset of Y. For instance "scan for primes" is probably simpler than "scan for numbers with fewer than 6 factors".

Maybe clarifying or recasting the language around "easiest-to-specify controllable pattern" will clear this up, but can you explain more why it feels to you that [scan for "consequentialists' attempted output messages"] is so much simpler than [scan for "important-to-influence data streams"]? My very preliminary first take is that they are within 8-15 bits.

I also don't really see why you are splitting them [(4) + (5)] apart, shouldn't we just combine them into "wants to influence predictors"? If you're doing that presumably you'd both use the anthropic prior and then the treacherous turn.

I split them in part in case there is a contingent of consequentialists who believe that outputting the right bitstring is key to their continued existence, believing that they stop being simulated if they output the wrong bit. I haven't responded to your claim that this would be faulty metaphysics on their part; it still seems fairly tangential to our main discussion. But you can interpret my 5 bit point estimate for (5) as claiming that 31 times out of 32 that a civilization of consequentialists tries to influence their world's output, it is in an attempt to survive. Tell me if you're interested in a longer justification that responds to your original "line by line comments" comment.

Just look at the prior--for any set of instructions for the work tape heads of the Turing machine, flipping the "write-1" instructions of the output tape with the "write-0" instructions gives an equally probable Turing machine.

I basically agree that if the civilization has a really good grasp of the situation, and in particular has no subjective uncertainty (merely uncertainty over which particular TM they are), then they can do even better by just focusing their effort on the single best set of channels rather than randomizing.

(Randomization is still relevant for reducing the cost to them though.)

With randomization, you reduce the cost and the upside in concert. If a pair of shoes costs $100, and that's more than I'm willing to pay, I could buy the shoes with probability 1%, and it will only cost me $1 in expectation, but I will only get the shoes with probability 1/100.

I agree that randomization reduces the "upside" in the sense of "reducing our weight in the universal prior." But utility is not linear in that weight.

I'm saying that the consequentialists completely dominate the universal prior, and they will still completely dominate if you reduce their weight by 2x. So either way they get all the influence. (Quantitatively, suppose the consequentialists currently have probability 1000 times greater than the intended model. Then they have 99.9% of the posterior. If they decreased their probability of acting by a factor of two, then they'd have 500 times the probability of the intended model, and so have 99.8% of the posterior. This is almost as good as 99.9%.)
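
The numbers in that parenthetical can be checked with a short sketch. It assumes only two hypotheses are in play (the consequentialists and the intended model), so the posterior share is just the odds ratio normalized; `posterior_share` is an illustrative name.

```python
def posterior_share(odds_ratio: float) -> float:
    """Posterior share of a hypothesis whose probability is `odds_ratio`
    times that of the only competing hypothesis."""
    return odds_ratio / (odds_ratio + 1.0)


for ratio in (1000, 500):
    print(ratio, f"{posterior_share(ratio):.1%}")
# prints:
# 1000 99.9%
# 500 99.8%
```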

That could fail e.g. if there are a bunch of other consequentialists also trying to control the sequence. Or if some other model beyond the intended one has much higher probability. But if you think that the consequentialists are X bits simpler than the intended model, and you are trying to argue that the intended model dominates the posterior, then you need to argue that the consequentialists wouldn't try to grab the universal prior even when doing so only requires acting in 2^(-X) of worlds.

If I flip a coin to randomize between two policies, I don't see how that mixed policy could produce more value for me than the base policies.

(ETA: the logical implications about the fact of my randomization don't have any weird anti-adversarial effects here).

Someone in the basement universe is reasoning about the output of a randomized Turing machine that I'm running on.

I care about what they believe about that Turing machine. Namely, I want them to believe that most of the time when the sequence x appears, it is followed by a 1.

Their beliefs depend in a linear way on my probabilities of action.

(At least if e.g. I committed to that policy at an early enough time for them to reason about it, or if my policy is sufficiently predictable to be correlated with their predictions, or if they are able to actually simulate me in a universe with reflective oracles... If I'm not able to influence their beliefs about me, then of course I can't influence their beliefs about anything and the whole manipulative project doesn't get off the ground.)

But my utility is a non-linear function of their beliefs, since P(1|x) is a non-linear function of their beliefs.

So my utility is a non-linear function of my policy.
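
A small sketch of this chain of reasoning, under illustrative assumptions: the policy is a single probability of acting, the manipulators' weight in the prior scales linearly with it, there is one competing model with weight 1, and utility is taken to be P(1|x) itself. The default odds of 1000 are a hypothetical figure, not a claim about the actual prior.

```python
def prob_next_bit_is_one(act_probability: float, full_odds: float = 1000.0) -> float:
    """P(1|x) as seen by the reasoner, given the manipulators' policy.

    full_odds is the manipulators' weight relative to the competing
    model when they always act (a hypothetical value).
    """
    weight = act_probability * full_odds  # linear in the policy
    return weight / (weight + 1.0)        # non-linear in the weight


# Acting half the time keeps almost all of the influence...
half = prob_next_bit_is_one(0.5)
# ...whereas a linear utility would land midway between never and always acting.
midpoint = 0.5 * (prob_next_bit_is_one(0.0) + prob_next_bit_is_one(1.0))
print(half > midpoint)  # prints True
```

Under these assumptions the mixed policy gets nearly the full benefit at half the cost, which is the sense in which randomizing can be worthwhile.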

To clarify, sufficient observations would still falsify all "simulate simple physics, start reading from simple location" programs and eventually promote "simulate true physics, start reading from camera location"?

In this story, I'm imagining that hypotheses like "simulate simple physics, start reading from simple location" lose, but similar hypotheses like "simulate simple physics, start reading from simple location after a long delay" (or after seeing pattern X, or whatever) could be among the output channels that we consider manipulating. Those would also eventually get falsified (if we wanted to deliberately make bad predictions in order to influence the basement world where someone is thinking about the universal prior) but not until a critical prediction that we wanted to influence.

I hope that most of your comments are cleared up by the story. But some line by line comments in case they help:

affecting the world in which the Turing machine is being run

I'm talking about what the actual real universal prior looks like rather than some approximation, and no one is actually running all of the relevant Turing machines. I'm imagining this whole exercise being relevant in the context of systems that perform abstract reasoning about features of the universal prior (e.g. to make decisions on the basis of their best guesses about the posterior).

So in particular, the civilization in question isn't being simulated, it's being imagined or reasoned about.

This raises decision-theoretic questions that neither you nor I get into. Especially: is the correlation between their action and the outcome of the imagining good enough to actually give them reason to influence the behavior of the universal prior? If only a small fraction of agents reason that way then that will eat into the probability, if none do then the whole argument breaks down.

You can have the intermediate situation where e.g. I perform simulations of early behavior of civilizations, or of individuals or algorithms, to check whether civilizations actually make commitments to behave this way (as part of my reasoning about the universal prior). The results will inform my reasoning about the universal prior, and then this gives a normal CDT reason for people to make commitments like this. In some sense I think this is the most robust version of the argument.

I don't think this affects the particular considerations you raise too much, and if we imagined a universe containing reflective oracles then we could talk literally about simulations. But for many people I expect that kind of decision-theoretic question would be the main point of disagreement.

From within the work tapes, there is no visibility of the output tape, and even if one work tape happens to mimic the output tape perfectly, there is no evidence of this to an inhabitant of the work tapes, because the content of the output tape has no observable effect on the work tapes; it is, by definition, "write-only".

Yes, but they are able to perform science to understand a lot about the Turing machine on which they run (as well as history to understand what the distribution of plausible universal priors is) from which they can figure out the distribution of possible ways that outputs could work.

We don't need to inform the work tape's inhabitants about the output instructions. In fact, it would hardly be possible, because nothing on the work tapes can provide evidence about the content of the output tape.

My point was that the inhabitants can make (stochastic) guesses about how the output tape works, and then we need to use some of the bits to pick the worlds where they made a correct guess. The inhabitants will be trying to make guesses as effectively as possible (basically decoding their randomness as a message about how the output tape works), so the number of bits we need to send is much (much!) less than the number of bits needed to pin down the particular output rule that corresponds to the bits observed by a camera.

(Though as I mention in the story, they will also be able to try to influence large numbers of possible output channels at a time.)

I think this was a confusing way for me to write about the situation.

We could (laboriously) program the world to have inhabitants that *believe* that certain things are being written to the output tape, but our only method for signaling anything to the inhabitants of the work tapes is through the Turing machine instructions.

Obviously anything like this is impractical. I hope the story communicated what I imagine happening.

With the prior I described, for every Turing machine with an instruction to write "1" to the output tape (for a given computation state and for given tape head readings) there is another equally likely one that writes "0" instead.

I agree that they can't know exactly how the output tape works. If we imagine them guessing how the output tape works, this is one consideration that cuts their guessing probability by half. But it also cuts the probability of the "intended" model by half, so it has no bearing on the relative probability of their model vs the intended model. (And cutting it by half is basically nothing given how tiny all the probabilities are.)

In particular, controlling a few simple regions of the work tape is no more likely to have an effect on the output tape than anything else

I agree that e.g. "output whatever is at the start of the work tape" is just one thing that the Turing machine could do (and that that particular output is probably uncontrollable). But "no more likely" seems obviously wrong, simpler regions of the work tape (like the start) are output by more of the possible TMs.

(The OP shouldn't have emphasized this as much as it does compared to all the other kinds of output channels, or emphasized simplicity vs number. But my current view is that the basic point is totally sound.)

But the main point I want to make is the work-tape inhabitants know *so little* about their effect on the universal prior, they just have no way to execute deliberate control over worlds that simulate them with any more granularity than "I want would-be watchers to believe that my world goes like *this*, as do the pieces of their world that resemble it".

I don't think this is right. They are at least as able to pick output channels as the camera-maker was (and indeed the fact that they are trying puts them at a massive advantage). The only real uncertainty is whether being late in history (and living in a larger world) more than offsets that. And I think the most likely answer is that if the fraction of their universe dedicated to prior-manipulation is larger than the fraction of the Solomonoff Inductor's universe dedicated to cameras, and if the temporal extent of prior-manipulation is much larger as a fraction of their history, then they have an overwhelming advantage. That's a tentative conclusion that could be overturned by finding some new considerations, but right now I think basically all the considerations point in that direction.

So in total, what they know about their world's output is that that output resembles some stream of data produced by a world that contains computers.

This seems like an understatement (especially the "a world that contains computers"). They are assuming that the containing world involves someone thinking about the output of Solomonoff induction using a universal prior according to which their physics has relatively high probability.

So basically, don't rock the boat.

I'm describing the situation where the output channel hasn't yet started emitting, just as the intended model of the camera doesn't start emitting until the camera is constructed. Worlds where the output channel just happened to match some predicted sequence in the world seem much less likely.

We can get back to some of these points as needed, but I think our main thread is with your other comment, and I'll resist the urge to start a long tangent about the metaphysics of being "simulated" vs. "imagined".

These are my thoughts on this post by Paul Christiano. I claim "malign" models do not form the bulk of the Solomonoff prior.

I agree that the natural description of the sequence is reasonably high, so we can't rule out room for improvement immediately.

Let me give a formal picture of this. We have a Turing machine with a unidirectional read-only input tape, a unidirectional write-only output tape, and multiple bidirectional work tapes. Unidirectional just means the tape head can only move one direction. The particular Turing machine is just the instructions for a) what to write on each of the tapes at the locations where the tape heads currently are, b) which direction the tape heads should move, and c) what computation state for the machine to enter. If these instructions are universal, we can interpret the input tape as taking a program, and the output tape is for the output sequence. The universal prior can be thought of as the probability over the output strings given Bernoulli(1/2) bits on the input tape. I think it's easier to think about the following similar formulation of the universal prior.
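To make the tape layout concrete, here is a minimal sketch of such a machine in Python. It is an illustration under simplifying assumptions, not the formalism itself: it uses a single work tape rather than several, and the names `run_machine`, `COPY`, and the shape of the transition table `delta` are hypothetical choices made for this example.

```python
from collections import defaultdict

def run_machine(delta, input_bits, max_steps=1000):
    """Run a toy machine with a read-only input tape, one work tape,
    and a write-only output tape.

    delta maps (state, input_symbol, work_symbol) to
    (next_state, work_write, work_move, advance_input, out_bit),
    where work_move is -1, 0 or +1 and out_bit is 0, 1 or None.
    This table layout is an assumption made for the sketch.
    """
    work = defaultdict(int)      # bidirectional work tape, blank cells read as 0
    state, in_pos, work_pos = 0, 0, 0
    output = []                  # write-only: nothing below ever reads it
    for _ in range(max_steps):
        if in_pos >= len(input_bits):
            break                # ran out of sampled input bits
        key = (state, input_bits[in_pos], work[work_pos])
        if key not in delta:
            break                # no instruction for this reading: halt
        state, work_write, work_move, advance_input, out_bit = delta[key]
        work[work_pos] = work_write
        work_pos += work_move
        if advance_input:
            in_pos += 1          # unidirectional: the input head never moves back
        if out_bit is not None:
            output.append(out_bit)   # unidirectional: append only
    return output

# A one-state machine that copies each input bit to the work tape and the output.
COPY = {(0, b, 0): (0, b, 1, True, b) for b in (0, 1)}

print(run_machine(COPY, [1, 0, 1, 1]))
```

Note that the loop decides what to do from `state`, the input symbol, and the work symbol only. The `output` list never feeds back into the computation, which is the sense in which the output tape is invisible from inside the work tape.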

For each Turing machine--that is, for each possible set of instructions about how the tape heads move and write given what they see--it will produce outputs stochastically given Bernoulli(1/2) sampled bits on the input tape. That single Turing machine will thus define a probability distribution over infinite binary strings. The universal prior can also be thought of as a mixture over these probability distributions of all possible Turing machines, weighted by a prior over Turing machines, where the prior depends on some simple property like the number of computation states it uses. The nice thing about this framing in which we separate out the models in the mixture explicitly is that we'll spend plenty of time looking at individual ones. For concreteness, let's assume our prior over Turing machines assigns $\frac{6}{\pi^2 n^2}$ prior weight to the set of Turing machines with $n$ computation states, and it assigns uniform weight to each $n$-state Turing machine.
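As a quick sanity check on that choice of prior, the weights over state counts sum to 1, since the sum of 1/n² over all positive n is π²/6. The sketch below checks this numerically. The function names are hypothetical, and the number of n-state machines is left as a parameter because it depends on details of the formalism that are not fixed here.

```python
import math

def state_count_prior(n):
    """Prior weight on the whole set of n-state machines: 6 / (pi^2 n^2)."""
    return 6.0 / (math.pi ** 2 * n ** 2)

def per_machine_weight(n, num_machines_with_n_states):
    """Uniform split of that weight among the n-state machines.

    num_machines_with_n_states is an input, not something derived here.
    """
    return state_count_prior(n) / num_machines_with_n_states

# The partial sums approach 1 from below as more state counts are included.
partial = sum(state_count_prior(n) for n in range(1, 100001))
print(partial)
```

Roughly 61% of the prior mass sits on one-state machines and the tail falls off only quadratically, so under this prior large machines are penalized fairly gently.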

So, in this formalism, there will be simple Turing machines such that consequentialists are likely to live within the work tapes. ("Likely" is with respect to noise on the input tape). The output tape will record some fraction of what goes on on the work tapes; maybe it will record everything. Maybe the contents of the output tape will be recorded on a work tape too; that is, the output tape happens to be an exact copy of a work tape. From within the work tapes, there is no visibility of the output tape, and even if one work tape happens to mimic the output tape perfectly, there is no evidence of this to an inhabitant of the work tapes, because the content of the output tape has no observable effect on the work tapes; it is, by definition, "write-only".

So my first real disagreement is with the next paragraph.

We don't need to inform the work tape's inhabitants about the output instructions. In fact, it would hardly be possible, because nothing on the work tapes can provide evidence about the content of the output tape. We could (laboriously) program the world to have inhabitants that *believe* that certain things are being written to the output tape, but our only method for signaling anything to the inhabitants of the work tapes is through the Turing machine instructions. That's a simple consequence of the type signatures of the objects we're discussing.

The only observation which work tape inhabitants can use to infer something about the output tape is that they exist. In the circumstance where there is no output tape, but they exist anyway (that is my mainline opinion for our universe, by the way; there is no output tape, and we exist for some other reason than being of use to some civilization that is simulating us), the observation of existence has no implications for the output tape. If, on the other hand, they assume they exist because someone is running lots of Turing machines with random noise as input in order to make predictions about their world, then the observation of their existence will be evidence that the output tape of their world has probably been corresponding with some stream of data in the "real world"--the world of the people running all these Turing machines. If it didn't correspond, simulating them would no longer be useful. That single fact is the sum total of what work tape inhabitants will know about the nature of the output tape. (This is *an* anthropic update, but it's different from what Paul later calls *the* anthropic update).

Regarding "10 bits of description complexity", yes that is small relative to the total description length of a world like ours, but it would be premature to cache the belief that it is negligibly small, because we'll later be looking at the difference in description complexity between different models. Otherwise, no objections here.

So we can expect work tape inhabitants trying to control the universal prior to act on the belief that they live on the work tape of a Turing machine.

With the prior I described, for every Turing machine with an instruction to write "1" to the output tape (for a given computation state and for given tape head readings) there is another equally likely one that writes "0" instead. There is no "place" in the universe computed on the work tapes corresponding to "good output material", because the instructions for the output tape head are completely independent of the instructions for the work tape heads, and even if there were "good places", there would be no information about how the states of those locations correspond to Turing machine outputs. In particular, controlling a few simple regions of the work tape is no more likely to have an effect on the output tape than anything else, and the effect cannot be controlled effectively if the precise effect is unknowable.
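The symmetry claim in the first sentence can be checked by brute force on a toy class of machines. The sketch below is only an illustration of that one claim under strong assumptions: it enumerates hypothetical one-state machines whose only freedom is which bit they emit for each (input bit, work bit) reading, and weights them uniformly. It says nothing about whether some regions of the work tape are read out by more machines than others.

```python
from itertools import product

# The four possible readings (input_bit, work_bit) of a one-state toy machine.
READINGS = list(product((0, 1), repeat=2))

def all_machines():
    """Yield every output rule: one emitted bit per reading (2**4 = 16 rules)."""
    for outs in product((0, 1), repeat=len(READINGS)):
        yield dict(zip(READINGS, outs))

def flip(machine):
    """Return the machine that writes the opposite bit for every reading."""
    return {reading: 1 - bit for reading, bit in machine.items()}

machines = list(all_machines())

def prob_of_one(reading):
    """Uniform-prior probability that a machine writes 1 on this reading."""
    return sum(m[reading] for m in machines) / len(machines)

for reading in READINGS:
    print(reading, prob_of_one(reading))
```

Because `flip` pairs each machine with a distinct machine of the same size, any prior that is uniform within a state count gives the two equal weight, so the prior alone carries no information about which bit a given reading produces.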

But how about the anthropic information about the output tape--the fact that the output of their world has probably corresponded to some data stream in our world? Work tape inhabitants might act on the belief that their Turing machine has large posterior weight, since that is the circumstance in which their attempted manipulation of their output is most likely to have an effect. If their Turing machine has high posterior weight, then probably they're living in one of the simplest Turing machines that outputs bits the way they do. If they notice a "simple" region in their universe, that's the last thing that is likely to be the source of bits on the output tape! Turing machines much simpler than theirs would be able to output the same bit string. So in general, work tape inhabitants should expect the output tape to read out the complexities of their universe, and there is no reason to think that controlling the complexities of their universe would be cheap. You can't just quietly go to some special-looking site, and leave the rest of your civilization humming along. So the epistemic state of work tape inhabitants interested in controlling the universal prior is: "I don't know how to affect our world's output. It probably has something to do with shaping the most complex and interesting features of the world in some big way (i.e. costly changes to valuable things). I can prove that I have no way of predicting which big changes to the world will have what effects on the output."

But maybe they don't need to know whether their behavior will yield a 0 or a 1 as the next output bit (and by symmetry, they have no clue). They can infer from [the fact that their simulation is still running] that [the output of their world resembles the output of some piece of our world] (that is, the world of the people running the Turing machines). So suppose they want us to believe something about how that piece of our world will evolve in the future. They have inferred that (some piece of) their world's evolution resembles the evolution of that piece of our world. So equivalently, they want us to believe that some piece of their world evolves in a certain way. They don't know which piece; maybe it's the whole thing, but they want people to believe certain things about the likely evolution of (parts of) their world. This is basically civilizational-level self-consciousness, with a fairly unusual origin.

If I were dedicating my life to controlling the universal prior, under the assumption that I only exist because our world's output is matching some data stream of a world that is simulating us, what could I possibly do? I think my best bet (not that I think it's a very good bet) is to make the world evolve the way I want people to think it does. I guess I want people to think that virtue pays, and vice doesn't, and nuclear waste is easy to store cheaply, so I'd live my life trying to prove these things in case anyone is watching. Even so, this is a big long shot! And if I were trying to have as much influence on the universal prior as possible with minimal changes to the world around me, I would not know where to begin; I think they are in direct opposition. (This, by the way, regards the project of actually having influence in the world in which we are being simulated, assuming our simulated world already has large posterior weight; I haven't begun to describe how I might go about ensuring that our world continues to be simulated and has increasing posterior weight in the mixture).

So in summary, if 1/1000 eccentric souls care about their world's contribution to the universal prior, they can't just go to special corners of the universe and do their own thing; they will care about the same sorts of things the other 999/1000 do--the evolution of the most complex features of their society. The other 999 probably won't give them free rein shaping these societal features.

But the main point I want to make is the work-tape inhabitants know *so little* about their effect on the universal prior, they just have no way to execute deliberate control over worlds that simulate them with any more granularity than "I want would-be watchers to believe that my world goes like *this*, as do the pieces of their world that resemble it".

In this section, Paul talks about why consequentialists would want to have an impact on the world simulating their world (assuming they figured out how to). No disagreements here.

This is saying work-tape inhabitants interested in affecting the universal prior would benefit from an initial phase of controlling their world's output with the goal of helping their world/Turing machine gain posterior weight.

So I've mostly been talking about the second task so far: how a work-tape inhabitant would affect the universal prior *once* their world had significant posterior weight. I haven't begun to describe the difficulties of the first step, where you try to make it as likely as possible that the output of your world resembles the output of some data stream in the world that is running the Turing machines.

Recall the one exception to work tape inhabitants' total ignorance about the output tape: it resembles some data stream of a world that is simulating them. And a notable consequence I haven't yet mentioned is that the world that is simulating them is rich enough for computers to exist within. So in total, what they know about their world's output is that that output resembles some stream of data produced by a world that contains computers. And if you want your simulation to continue (and by the way, I think the other 999/1000 inhabitants could be sold on this project, if they can be convinced they're in a simulation) so that your posterior weight in the universal prior grows, then you want your world's output to continue to have this property of resembling this data stream in the world simulating you.

So basically, don't rock the boat. Keep calm and carry on. Just do exactly what the laws of physics in your universe are telling you to. (I, for one, am very good at this!) You have no idea about the details of what your world's output channel is doing, but the one thing you can guess about it is that it's doing exactly what you want it to be doing: it's resembling a stream of data produced by a world that contains computers. (I mentioned earlier another thing that you "know" if you're assuming for the sake of argument that your world will eventually have large posterior weight: you can conclude your world's output channel is capturing much of your world's complexity. But assuming your world will eventually have large posterior weight is not helpful in the project of increasing the likelihood that that comes to pass).

There's another point I want to make on this topic. We are supposing that the work tape inhabitants condition on the fact that they are being simulated because their Turing machine has been outputting useful things, and they will continue being simulated right up until the point that their world's output errs--this is the condition for there being value in them tampering with their world's output. Among believers of this proposition, the faction that says "make sure the simulation of our world continues" will almost certainly have much broader support than any faction that says "let's try to end the simulation of our world at exactly the right moment in exactly the right way on the off chance that it precipitates some valuable changes in the world that is simulating us". Attempting a treacherous turn (which again, they have no way of knowing how to execute with precision) would be suicide.
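The dynamic being assumed here, where a model keeps its place in the mixture only while its output matches the data and gains posterior weight as competitors err, can be sketched with a toy Bayesian update. This is a simplified illustration: the models are treated as deterministic bit sequences (the models in the post are stochastic because of the input noise), and the names `posterior`, `PRIOR`, and `PREDICTIONS` and the example numbers are hypothetical.

```python
def posterior(prior, predictions, data):
    """Renormalize prior weight over the models whose output matches the data so far.

    prior: dict mapping model name to prior weight.
    predictions: dict mapping model name to the bit sequence that model outputs.
    data: the bits observed so far.
    """
    data = list(data)
    surviving = {
        name: weight
        for name, weight in prior.items()
        if list(predictions[name][:len(data)]) == data
    }
    total = sum(surviving.values())
    if total == 0:
        return {}   # every model has erred
    return {name: weight / total for name, weight in surviving.items()}

PRIOR = {"A": 0.25, "B": 0.5, "C": 0.25}
PREDICTIONS = {"A": [0, 1, 0, 1], "B": [0, 1, 0, 0], "C": [0, 1, 1, 1]}

for t in range(1, 5):
    print(t, posterior(PRIOR, PREDICTIONS, [0, 1, 0, 1][:t]))
```

In this toy setting, model A's weight only rises for as long as it keeps matching the data, and a single wrong bit removes a model entirely, which is the sense in which deviating from the data stream ends a world's influence.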

The (log) mass of the consequentialists *that have successfully controlled their world's entire output, and have (deliberately) made it resemble a sample from* q is only going to increase as we condition on more data.

This may be true, but the mass of consequentialists with the qualification that I added in italics does not start very high at all. I can't think of how such a consequentialist civilization would arise from simple physics since there is no way for them to know from within what their world's output is. I can imagine how to make such consequentialists if we were trying: we deliberately encode a world of consequentialists, and the *rules of the world themselves* have the machinery to imbue beliefs in the minds of these inhabitants (despite those beliefs not being evidenced by the inhabitants' observations, and despite these inhabitants being "emergent" rather than atomic), and then the world imbues (correct) beliefs about how to precisely affect the output tape.

A possible response here is that it does not take many extra computation states for the Turing machine instructions to "tag" some "location" in the work tape world as special, and make that location correspond to what gets output. Sophisticated consequentialists will know that that location is no more likely to influence the output tape than any other spot, and they will know that they have no way of knowing the correspondence between that location's state and the output tape, but they may have natural superstitions. So a Turing machine may not literally require the machinery to edit the brains of the inhabitants directly. But for all that, this is extremely far from a convergent instrumental activity, and notably irrational. Also, Paul admits that any "tagged location" might have spiritual significance, and thus be the object of other civilizational preferences. Finally, not only does the Turing machine have to design the output to depend on this location; it has to depend on it *in the way that* the emergent consequentialists' superstitions suggest.

This is a different anthropic update than the ones I've talked about above. A work-tape inhabitant of a Turing machine who assumes their Turing machine is being simulated in another world can guess from their very existence that the (stochastic) procedure by which the simulators pick a Turing machine to simulate (i.e. the simulators' prior over Turing machines) is the sort of procedure that favors their home Turing machine, and *then* they can focus on the worlds of simulators who use such a procedure to make an important decision. So they're left with a distribution over Turing machines that starts as being the simulators' prior, but then is conditioned on resembling the world of a *decision-maker*-simulator. Paul argues that then the work tape inhabitants could arrange for their world's output to resemble the *mixture* of the outputs of other Turing machines that are in that distribution. Producing output from that distribution is the same as first sampling a Turing machine from that distribution, and then producing output that mimics that Turing machine.

Even setting aside my arguments above that work tape inhabitants would have no way of knowing how to arrange for such a thing, this isn't a better strategy than arranging for their world's output to resemble the output of *their own Turing machine* (which it already does, and immutably will, and it's hard to be sure I'm still making words mean things). If their home Turing machine is still being run, then whatever (unknowable) output it has produced has succeeded in resembling the data it is being checked against, and there's no reason to expect that the data it's being checked against will suddenly change course, and therefore no reason to expect that they would be better off making their world output bits that resemble a different Turing machine that scores highly on similar priors that they do.

If the work-tape inhabitants believe that their Turing machine has yet to output anything, then the move Paul is suggesting could make sense. So they could decide to try to make their world behave like another Turing machine for a while. But why would they believe that their Turing machine has yet to output anything? I would think that most simple Turing machines that produce consequentialists start writing output to the output tape well before the consequentialists emerge, otherwise it would need some cumbersome description of how to recognize that the consequentialists have evolved, so that it can switch to outputting bits. But if there's some reason I'm missing that would cause consequentialists to believe their world has yet to output anything, recall my earlier contention that the inhabitants have no way of knowing how to make their world's output resemble that of other similar Turing machines.

But suppose for the sake of argument that these consequentialists do choose to mimic the output of a different Turing machine, and they succeed at it. Paul claims (I'm pretty sure) that if we sample a Turing machine from the prior I proposed, and it outputs a string that *looks like* it was produced by a simple Turing machine T that models our world, then most of the time, it's not that we have sampled T; it's that we sampled a different Turing machine, and its inhabitants randomly picked Turing machine T to mimic, and this was made more plausible because they conditioned on the true fact that we are in a world in which we will make use of this output. This brings us to

I've argued it would take many more bits to specify the part of the prior corresponding to consequentialists that 1) also want to influence the decisions of simulators of their world, 2) have unfounded but nonetheless correct superstitions about how to influence the output of their world, 3) have that priority as a civilization for how to use an evidently special location, 4) are certain enough that their world has never output any bits before that the best course would be to intervene in the dynamics of this special terminal that they feel is a channel to an output tape, and 5) are willing to terminate the execution of their world for a single opportunity to give their simulators bad info. But we can put that aside for now.

The anthropic update can be made programmatically; it doesn't require the evolution of computational life. It was simple enough for Paul to describe; we can specify the Turing machine which samples from Turing machines according to the same prior we're using to sample some Turing machines, but conditioned on those Turing machines resembling the worlds of decision-maker-simulators. This Turing machine samples from other Turing machines using the anthropic update, but without the possibility of a treacherous turn later.

Paul responds

Basically, there is overhead (cumbersome additional complexity) when you use one Turing machine to simulate another. I think our main disagreement on this point is that I think the consequentialists face this overhead just as much. Self-organizing consequentialism seems quite analogous to me to self-organizing computation: there are many ways it could happen, but the kludginess of the language of specification leads to some inefficiencies in the description. One potentially relevant observation is that advanced consequentialists only live in universes that have the machinery to execute universal computation (or at least enormous finite state automata). So the sorts of worlds that self-organize into simple computations seem to me to be at least as prevalent as the sorts of worlds that self-organize into simple life. And the anthropically updated version of the prior over Turing machines seems to me to qualify as a simple computation (in the sense of description-simplicity).

I know Paul has thought about this last point a lot, and we've discussed it a bit, so I understand that intuitions can reasonably vary on this last point.

## Conclusion

I think I've identified multiple independent reasons to expect that consequentialists living in Turing machines will not deliberately and successfully affect their Turing machine's output for the purpose of affecting the world in which the Turing machine is being run, excepting of course the consequentialists that are deliberately engineered to behave this way by the Turing machine's instructions. But those deliberate instructions cannot just pick out one convergent instrumental activity of consequentialists and direct them to act like that; the instructions have to override the fact that this behavior is irrational for multiple separate reasons, or alternatively, the instructions have to encode the "desired" behavior from scratch.