
This argument came to my attention from this post by Paul Christiano. I also found this clarification helpful. I found these counter-arguments stimulating and have included some discussion of them.

Very little of this content is original. My contributions consist of fleshing out arguments and constructing examples.

Thank you to Beth Barnes and Thomas Kwa for helpful discussion and comments.

# What is the Solomonoff prior?

The Solomonoff prior is intended to answer the question "what is the probability of X?" for any X, where X is a finite string over some finite alphabet. The Solomonoff prior is defined by taking the set of all Turing machines (TMs) which output strings when run with no input and weighting them proportional to $2^{-K}$, where $K$ is the description length of the TM (informally its size in bits).

The Solomonoff prior says the probability of a string is the sum over all the weights of all TMs that print that string.
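In symbols, the definition above can be sketched as follows. The names $P$, $T$ and $K(T)$ are notation chosen here for illustration, and the sketch assumes a self-delimiting encoding of TMs so that the sum converges:

```latex
P(x) \;\propto\; \sum_{T \,:\, T \text{ prints } x} 2^{-K(T)},
\qquad K(T) = \text{description length of } T \text{ in bits}
```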

One reason to care about the Solomonoff prior is that we can use it to do a form of idealized induction. If you have seen 0101 and want to predict the next bit, you can use the Solomonoff prior to get the probability of 01010 and 01011. Normalizing gives you the chances of seeing 1 versus 0, conditioned on seeing 0101. In general, any process that assigns probabilities to all strings in a consistent way can be used to do induction in this way.
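The induction step can be made concrete with a toy stand-in for the prior. The real Solomonoff prior is uncomputable, so the sketch below swaps in a deliberately tiny "machine" (an assumption made purely for illustration): every bit string up to some length is a program, and running a program prints it on a loop. The function names are hypothetical.

```python
from itertools import product


def toy_programs(max_len):
    """Yield every non-empty bit string up to max_len; each is a toy 'program'."""
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)


def run(program, n_out):
    """Toy machine (not a real TM): print the program's bits on a loop."""
    reps = n_out // len(program) + 1
    return (program * reps)[:n_out]


def prior_mass(prefix, max_len=8):
    """Sum the weights 2^-len(p) of toy programs whose output starts with prefix."""
    return sum(
        2.0 ** -len(p)
        for p in toy_programs(max_len)
        if run(p, len(prefix)) == prefix
    )


def predict_next(seen, max_len=8):
    """Normalize the mass of seen+'0' against seen+'1' to get next-bit odds."""
    m0 = prior_mass(seen + "0", max_len)
    m1 = prior_mass(seen + "1", max_len)
    total = m0 + m1
    return {"0": m0 / total, "1": m1 / total}


prediction = predict_next("0101")
print(prediction)
```

In this toy, the short program `01` carries a large share of the weight, so the prediction after `0101` leans towards `0`. The same normalization works for any consistent assignment of probabilities to strings.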

# Why is it malign?

Imagine that you wrote a programming language called python^10 that works as follows: First, it takes all alpha-numeric chars that are not in literals and checks if they're repeated 10 times sequentially. If they're not, they get deleted. If they are, they get replaced by a single copy. Second, it runs this new program through a python interpreter.

Hello world in python^10:

pppppppppprrrrrrrrrriiiiiiiiiinnnnnnnnnntttttttttt('Hello, world!')

Luckily, python has an exec function that executes literals as code. This lets us write a shorter hello world:

eeeeeeeeeexxxxxxxxxxeeeeeeeeeecccccccccc("print('Hello, world!')")

It's probably easy to see that for nearly every program, the shortest way to write it in python^10 is to write it in python and run it with exec. If we didn't have exec, for sufficiently complicated programs, the shortest way to write them would be to specify an interpreter for a different language in python^10 and write it in that language instead.
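A minimal sketch of python^10 makes the length comparison concrete. This is a toy under stated assumptions: it only recognizes simple `'...'` and `"..."` literals (no triple quotes or comments), and the function names are hypothetical.

```python
ALNUM_REPEAT = 10  # the "10" in python^10


def decode_python10(src, n=ALNUM_REPEAT):
    """Translate python^10 source into plain Python source."""
    out, i, quote = [], 0, None
    while i < len(src):
        ch = src[i]
        if quote is not None:  # inside a string literal: copy verbatim
            out.append(ch)
            if ch == "\\" and i + 1 < len(src):
                out.append(src[i + 1])  # keep the escaped character as-is
                i += 2
                continue
            if ch == quote:
                quote = None
            i += 1
        elif ch in "'\"":  # a literal starts here
            quote = ch
            out.append(ch)
            i += 1
        elif ch.isalnum():
            j = i
            while j < len(src) and src[j] == ch:
                j += 1
            # each full run of n becomes one copy; incomplete runs are deleted
            out.append(ch * ((j - i) // n))
            i = j
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def encode_direct(py_src, n=ALNUM_REPEAT):
    """Spell out Python in python^10 by repeating alphanumerics outside literals."""
    out, quote, escaped = [], None, False
    for ch in py_src:
        if quote is not None:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
            out.append(ch)
        else:
            out.append(ch * n if ch.isalnum() else ch)
    return "".join(out)


def encode_via_exec(py_src, n=ALNUM_REPEAT):
    """Pay the repetition cost only for 'exec'; ship the program as a literal."""
    return "".join(c * n for c in "exec") + "(" + repr(py_src) + ")"


hello = "print('Hello, world!')"
print(len(encode_direct(hello)), len(encode_via_exec(hello)))
exec(decode_python10(encode_via_exec(hello)))
```

The direct encoding grows by a factor of ten per alphanumeric character, while the `exec` route pays a fixed overhead, so the gap widens quickly for longer programs.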

As this example shows, the answer to "what's the shortest program that does X?" might involve using some roundabout method (in this case we used exec). If python^10 has some security properties that python didn't have, then the shortest program in python^10 that accomplished any given task would not have these security properties because they would all pass through exec. In general, if you can access alternative ‘modes’ (in this case python), the shortest programs that output any given string might go through one of those modes, possibly introducing malign behavior.

Let's say that I'm trying to predict what a human types next using the Solomonoff prior. Many programs predict the human:

1. Simulate the human and their local surroundings. Run the simulation forward and check what gets typed.
2. Simulate the entire Earth. Run the simulation forward and check what that particular human types.
3. Simulate the entire universe from the beginning of time. Run the simulation forward and check what that particular human types.
4. Simulate an entirely different universe that has reason to simulate this universe. Output what the human types in the simulation of our universe.

Which one is the simplest? One property of the Solomonoff prior is that it doesn't care about how long the TMs take to run, only how large they are. This results in an unintuitive notion of "simplicity"; a program that does something $2^{1000}$ times might be simpler than a program that does the same thing $2^{1000} - 1$ times because the number $2^{1000}$ is easier to specify than $2^{1000} - 1$.
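A crude way to see that size and runtime come apart is to use the length of Python source text as a stand-in for description length. This proxy, and the specific number used, are illustrative assumptions; the loops are only measured, never run.

```python
# Two loop programs that differ only in how many iterations they perform.
round_count_src = "for _ in range(2**1000): pass"
odd_count_src = "for _ in range(2**1000-1): pass"

# Source length as a rough proxy for description length.
round_len = len(round_count_src)
odd_len = len(odd_count_src)

# Writing the same number out digit by digit costs far more characters.
expr_len = len("2**1000")
digits_len = len(str(2**1000))

print(round_len, odd_len, expr_len, digits_len)
```

The program that runs for one more iteration is the shorter one, and the seven-character expression names a number that takes hundreds of digits to write out in full.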

In our example, it seems likely that "simulate the entire universe" is simpler than "simulate Earth" or "simulate part of Earth" because the initial conditions of the universe are simpler than the initial conditions of Earth. There is some additional complexity in picking out the specific human you care about. Since the local simulation is built around that human this will be easier in the local simulation than the universe simulation. However, in aggregate, it seems possible that "simulate the universe, pick out the typing" is the shortest program that predicts what your human will do next. Even so, "pick out the typing" is likely to be a very complicated procedure, making your total complexity quite high.

Whether simulating a different universe that simulates our universe is simpler depends a lot on the properties of that other universe. If that other universe is simpler than our universe, then we might run into an exec situation, where it's simpler to run that other universe and specify the human in their simulation of our universe.

This is troubling because that other universe might contain beings with different values than our own. If it's true that simulating that universe is the simplest way to predict our human, then some non-trivial fraction of our prediction might be controlled by a simulation in another universe. If these beings want us to act in certain ways, they have an incentive to alter their simulation to change our predictions.

At its core, this is the main argument why the Solomonoff prior is malign: a lot of the programs will contain agents with preferences, these agents will seek to influence the Solomonoff prior, and they will be able to do so effectively.

## How many other universes?

The Solomonoff prior is running all possible Turing machines. How many of them are going to simulate universes? The answer is probably "quite a lot".

It seems like specifying a lawful universe can be done with very few bits. Conway's Game of Life is very simple and can lead to very rich outcomes. Additionally, it seems quite likely that agents with preferences (consequentialists) will appear somewhere inside this universe. One reason to think this is that evolution is a relatively simple mathematical regularity that seems likely to appear in many universes.

If the universe has a hospitable structure, due to instrumental convergence these agents with preferences will expand their influence. As the universe runs for longer and longer, the agents will gradually control more and more.

In addition to specifying how to simulate the universe, the TM must specify an output channel. In the case of Game of Life, this might be a particular cell sampled at a particular frequency. Other examples include whether or not a particular pattern is present in a particular region, or the parity of the total number of cells.
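The output channels described above can be sketched in a few lines. The set-of-live-cells representation and the function names are choices made here for illustration, and the blinker is only a tiny test pattern, not a universe of consequentialists.

```python
from collections import Counter


def step(live):
    """Advance Conway's Game of Life one tick; `live` is a set of (x, y) cells."""
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return {
        cell
        for cell, n in neighbour_counts.items()
        if n == 3 or (n == 2 and cell in live)
    }


def cell_channel(live, cell, period, n_bits):
    """Output channel: report whether `cell` is alive every `period` ticks."""
    bits = []
    for _ in range(n_bits):
        for _ in range(period):
            live = step(live)
        bits.append(1 if cell in live else 0)
    return bits


def parity_channel(live, period, n_bits):
    """Output channel: report the parity of the live-cell count every `period` ticks."""
    bits = []
    for _ in range(n_bits):
        for _ in range(period):
            live = step(live)
        bits.append(len(live) % 2)
    return bits


blinker = {(0, 0), (1, 0), (2, 0)}  # oscillates between a row and a column
print(cell_channel(blinker, (1, 1), period=1, n_bits=4))
print(parity_channel(blinker, period=1, n_bits=4))
```

Both channels add very little to the program: a coordinate and a period, or just a period. Whoever controls the sampled region controls the bits that this TM contributes to the prior.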

In summary, specifying lawful universes that give rise to consequentialists requires a very simple program. Therefore, the predictions generated by the Solomonoff prior will have some influential components comprised of simulated consequentialists.

## How would they influence the Solomonoff prior?

Consequentialists that find themselves in universes can reason about the fundamental laws that govern their universe. If they find that their universe has relatively simple physics, they will know that their behavior contributes to the Solomonoff prior. To gain access to more resources in other universes, these consequentialists might seek to act in ways that influence the Solomonoff prior.

A contrived example of a decision other beings would want to manipulate is "what program should be written and executed next?" Beings in other universes would have an incentive to get us to write programs that were aligned with their values. A particularly interesting scenario is one in which they write themselves into existence, allowing them to effectively "break into" our universe.

For example, somewhere in the Solomonoff prior there is a program that goes something like: "Simulate this universe. Starting from the year 2100, every hour output '1' if there's a cubic meter of iron on the Moon, else output '0'." By controlling the presence/absence of a cubic meter of iron on the Moon, we would be able to influence the output of this particular facet of the Solomonoff prior.

This example is a very complicated program and thus will not have much weight in the Solomonoff prior. However, by reasoning over the complexity of possible output channels for their universe, consequentialists would be able to identify output channels that weigh heavily in the Solomonoff prior. For example, if I was in the Game of Life, I might reason that sampling cells that were living in the initial conditions of the universe is simpler than sampling other cells. Additionally, sampling cells and reporting their outputs directly is simpler than sampling cells and reversing their values. Therefore, I might choose to control regions close to the initial live cells of the universe for purposes of exerting influence over the Solomonoff prior.

Additionally, consequentialists can send across many plausible channels at once, focus efforts on channels that are easy to control, send through channels that would not unduly decrease their values for other reasons, etc. Since the proportional weight in the Solomonoff prior drops off exponentially with respect to complexity, it’s possible only very small regions of space would need to be used to achieve a large fraction of the maximum possible influence.
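The effect of the exponential drop-off can be illustrated with made-up numbers. The sketch assumes, purely hypothetically, one candidate output channel at each complexity from 20 to 59 bits; if the number of channels grew quickly with complexity, the picture would change.

```python
def captured_fraction(complexities_bits, n_controlled):
    """Share of total channel weight held by the n_controlled simplest channels."""
    weights = sorted((2.0 ** -k for k in complexities_bits), reverse=True)
    return sum(weights[:n_controlled]) / sum(weights)


# Hypothetical channel complexities in bits (not derived from any real universe).
channels = list(range(20, 60))

share_of_five = captured_fraction(channels, 5)
print(share_of_five)
```

Under these assumptions, controlling the five simplest of forty channels already captures the large majority of the available weight.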

## How much influence will they have?

A simple application of the no free lunch theorem shows that there is no way of making predictions that is better than the Solomonoff prior across all possible distributions over all possible strings. Thus, agents that are influencing the Solomonoff prior cannot be good at predicting, and thus gain influence, in all possible worlds.

In practice, this is irrelevant because consequentialists do not want to achieve equal predictive accuracy in all worlds; they only care about worlds in which being predictive results in increased influence. If I'm exerting control over the Solomonoff prior, I only care about influencing it in ways that might affect material consequences in other universes. For example, I do not care about gaining influence in universes inhospitable to life. Thus I will be able to trade off predictive power in universes I don't care about for predictive power in universes I do care about. This means that I should restrict my attention to all universes that have resources and the Solomonoff prior is being used to make important decisions, weighting appropriately.

Another interesting thing about the Solomonoff prior is it is actually a collection of priors. The "length" of a TM is defined relative to some universal TM. For any particular TM, its length with respect to different universal TMs will vary; thus, there are versions of the Solomonoff prior that give more or less weight to any given TM. (Note that you can simulate universal TMs with universal TMs, so the difference across all programs will be bounded by a constant factor. However, this constant factor can be large, so the difference in relative weight between different Solomonoff priors can also be large).
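The parenthetical can be written out as follows. The symbols $K_U$, $K_V$ and $c_{UV}$ are notation introduced here: $K_U(T)$ is the length of $T$ relative to universal TM $U$, and $c_{UV}$ is the length of a program for $U$ that simulates $V$.

```latex
K_U(T) \;\le\; K_V(T) + c_{UV}
\quad\Longrightarrow\quad
2^{-K_U(T)} \;\ge\; 2^{-c_{UV}} \cdot 2^{-K_V(T)}
```

The bound is additive in length and therefore multiplicative in weight. A $c_{UV}$ of even a few dozen bits corresponds to a weight ratio in the billions.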

In particular, this suggests a good strategy for consequentialists: find a universe that is using a version of the Solomonoff prior that has a very short description of the particular universe the consequentialists find themselves in.

The combined strategy is thus to take a distribution over all decisions informed by the Solomonoff prior, weight them by how much influence can be gained and the version of the prior being used, and read off a sequence of bits that will cause some of these decisions to result in a preferred outcome.

The question of how much influence any given universe of consequentialists will have is difficult to answer. One way of quantifying this is to think about how many “universes they don't care about” they're trading off for “universes they do care about” (really we should be thinking in terms of sequences, but I find reasoning about universes to be easier).

Since the consequentialists care about exerting maximum influence, we can approximate them as not caring about universes that don't use a version of the Solomonoff prior that gives them a large weight. This can be operationalized as only caring about universes that use a universal TM from a particular set for their Solomonoff prior. What is the probability that a particular universe uses a universal TM from that set? I am not sure, but 1/million to 1/billion seems reasonable. This suggests a universe of consequentialists will only care about 1/million to 1/billion universes, which means they can devote a million to a billion times the predictive power to universes they care about. This is sometimes called the “anthropic update”. (This post contains more discussion about this particular argument.)

Additionally, we might think about which decisions the consequentialists would care about. If a particular decision using the Solomonoff prior is important, consequentialists are going to care more about that decision than other decisions. Conservatively, perhaps 1/1000 decisions are "important" in this sense, giving another 1000x relative weighting.
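Because weight scales as two to the minus description length, these multipliers can be restated in bits. The sketch below only converts the rough figures quoted above (a million to a billion, times a thousand); the variable names are hypothetical.

```python
import math


def factor_to_bits(factor):
    """A weight multiplier of `factor` is worth log2(factor) bits of description length."""
    return math.log2(factor)


anthropic_low, anthropic_high = 1e6, 1e9  # rough range quoted in the text
importance = 1e3                          # the "1/1000 decisions are important" figure

low_bits = factor_to_bits(anthropic_low * importance)
high_bits = factor_to_bits(anthropic_high * importance)
print(round(low_bits, 1), round(high_bits, 1))
```

Read this way, the combined update is worth roughly 30 to 40 bits: a universe of consequentialists could be about that many bits more complex than a competing program and still carry comparable weight on the decisions it targets.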

After you condition on a decision being important and using a particular version of the Solomonoff prior, it thus seems quite likely that a non-trivial fraction of your prior is being controlled by consequentialists.

An intuition pump is that this argument is closer to an existence claim than a for-all claim. The Solomonoff prior is malign if there exists a simple universe of consequentialists that wants to influence our universe. This universe need not be simple in an absolute sense, only simple relative to the other TMs that could equal it in predictive power. Even if most consequentialists are too complicated or not interested, it seems likely that there is at least one universe that is.

## Example

### Complexity of Consequentialists

How many bits does it take to specify a universe that can give rise to consequentialists? I do not know, but it seems like Conway’s Game of Life might provide a reasonable lower bound.

Luckily, the code golf community has spent some amount of effort optimizing for program size. How many bytes would you guess it takes to specify Game of Life? Well, it depends on the universal TM. Possible answers include 6, 32, 39, or 96.

Since universes of consequentialists can “cheat” by concentrating their predictive efforts onto universal TMs in which they are particularly simple, we’ll take the minimum. Additionally, my friend who’s into code golf (he wrote the 96-byte solution!) says that the 6-byte answer actually contains closer to 4 bytes of information.

To specify an initial configuration that can give rise to consequentialists we will need to provide more information. The smallest infinite growth pattern in Game of Life has been shown to need 10 cells. Another reference point is that a self-replicator with 12 cells exists in HighLife, a Game of Life variant. I’m not an expert, but I think an initial configuration that gives rise to intelligent life can be specified in an 8x8 bounding box, giving a total of 8 bytes.

Finally, we need to specify a sampling procedure that consequentialists can gain control of. Something like “read <cell> every <large number> time ticks” suffices. By assumption, the cell being sampled takes almost no information to specify. We can also choose whatever large number is easiest to specify (the busy beaver numbers come to mind). In total, I don’t think this will take more than 2 bytes.

Summing up, Game of Life + initial configuration + sampling method takes maybe 16 bytes, so a reasonable range for the complexity of a universe of consequentialists might be 10-1000 bytes. That doesn’t seem like very many, especially relative to the amount of information we’ll be conditioning the Solomonoff prior on if we ever use it to make an important decision.
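The tally can be reproduced directly from the figures above. The dictionary keys are labels chosen here, and the numbers are the rough estimates from the text rather than measured values.

```python
def tally(rule_bytes):
    """Add up the rough byte estimates for one universe of consequentialists."""
    parts = {
        "Game of Life rules": rule_bytes,
        "initial configuration (8x8 cells, 1 bit each)": 8 * 8 // 8,
        "sampling procedure": 2,
    }
    return sum(parts.values())


total_low = tally(rule_bytes=4)   # the "closer to 4 bytes of information" figure
total_high = tally(rule_bytes=6)  # the 6-byte code golf answer

# The 10-1000 byte range from the text, expressed in bits of description length.
range_bits = (10 * 8, 1000 * 8)
print(total_low, total_high, range_bits)
```

The two totals bracket the "maybe 16 bytes" estimate, and the stated 10-1000 byte range corresponds to between 80 and 8000 bits.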

### Complexity of Conditioning

When we’re using the Solomonoff prior to make an important decision, the observations we’ll condition on include information that:

1. We’re using the Solomonoff prior
2. We’re making an important decision
3. We’re using some particular universal TM

How much information will this include? Many programs will not simulate universes. Many universes exist that do not have observers. Among universes with observers, some will not develop the Solomonoff prior. These observers will make many decisions. Very few of these decisions will be important. Even fewer of these decisions are made with the Solomonoff prior. Even fewer will use the particular version of the Solomonoff prior that gets used.

It seems reasonable to say that this is at least a megabyte of raw information, or about a million bytes. (I acknowledge some cart-horse issues here.)

This means that after you condition your Solomonoff prior, you’ll be left with programs that are at least a million bytes. As our Game of Life example shows, it only takes maybe 10-1000 of these bytes to specify a universe that gives rise to consequentialists. You have approximately a million bytes left to specify more properties of the universe that will make it more likely the consequentialists will want to exert influence over the Solomonoff prior for the purpose of influencing this particular decision.

# Why might this argument be wrong?

## Inaccessible Channels

### Argument

Most of the universe is outside of humanity's light-cone. This might suggest that most "simple" ways to sample from our universe are currently outside our influence, meaning that the only portions of the Solomonoff prior we can control are going to have an extremely low weight.

In general, it might be the case that for any universe, consequentialists inside that universe are going to have difficulty controlling simple output channels. For example, in Game of Life, a simple way to read information might sample a particular cell starting at t=0. However, consequentialists in Game of Life will not appear until a much later time and will be unable to control a large initial chunk of that output channel.

### Counter-argument

Paul Christiano points out that the general form of this argument also applies to the other TMs that compose your Solomonoff prior. For example, when predicting what I'll type next, you would "want" to simulate me and predict what I would type starting at some time T. However, this is a pretty complicated way of sampling. The fact that simple sampling procedures are less predictive doesn't asymmetrically penalize consequentialists. The consequentialists' universe and sampling method only have to be simple relative to other programs that are equally good at predicting.

One might also note that large numbers can be produced with relatively few bits, so "sample starting at <large number>" is not much more complicated than "sample starting at 0".

## Speedy Channels

### Argument

There are many simple ways of sampling from universes very quickly. For example, in Game of Life, one can sample a cell every time-tick. It seems feasible for consequentialists to simulate Earth in the Game of Life, but not feasible to simulate Earth such that they can alter a specific cell every time tick per the simulation.

### Counter-argument

Consequentialists in the Game of Life could simply simulate Earth, compute the predictions, then later broadcast them along very fast sampling channels. However, it might be the case that building a machine that alters a cell arbitrarily every time tick is impossible. In our universe, there might be sample procedures that physics does not permit us to exert arbitrary control over, e.g. due to speed of light limitations. If this is the case, consequentialists will direct efforts towards the simplest channel they can control.

## Computational Burden

### Argument

Determining how to properly influence the Solomonoff prior requires massive computation resources devoted to simulating other universes and how they're going to use the Solomonoff prior. While the Solomonoff prior does not penalize extremely long run-times, from the perspective of the consequentialists doing the simulating, run-times will matter. In particular, consequentialists will likely be able to use compute to achieve things they value (like we are capable of doing). Therefore, it would be extremely costly to exert influence over the Solomonoff prior, potentially to the point where consequentialists will choose not to do so.

### Counter-argument

The computational burden of predicting the use of the Solomonoff prior in other universes is an empirical question. Since it's a relatively fixed cost and there are many other universes, consequentialists might reason that the marginal influence over these other universes is worth the compute. Issues might arise if the use of the Solomonoff prior in other universes is very sensitive to precise historical data, which would require a very precise simulation to influence, increasing the computational burden.

Additionally, some universes will find themselves with more computing power than other universes. Universes with a lot of computing power might find it relatively easy to predict the use of the Solomonoff prior in simpler universes and subsequently exert influence over them.

## Malign implies complex

### Argument

A predictor that correctly predicts the first N bits of a sequence then switches to being malign will be strictly more complicated than a predictor that doesn't switch to being malign. Therefore, while consequentialists in other universes might have some influence over the Solomonoff prior, they will be dominated by non-malign predictors.

### Counter-argument

This argument makes a mistaken assumption that the malign influence on the Solomonoff prior is in the form of programs that have their "malignness" as part of the program. The argument given suggests that simulated consequentialists will have an instrumental reason to be powerful predictors. These simulated consequentialists have reasoned about the Solomonoff prior and are executing the strategy of "be good at predicting, then exert malign influence", but this strategy is not hardcoded so exerting malign influence does not add complexity.

## Canceling Influence

### Argument

If it's true that many consequentialists are trying to influence the Solomonoff prior, then one might expect the influence to cancel out. It's improbable that all the consequentialists have the same preferences; on average, there should be an equal number of consequentialists trying to influence any given decision in any given direction. Since the consequentialists themselves can reason thus, they will realize that the expected amount of influence is extremely low, so they will not attempt to exert influence at all. Even if some of the consequentialists try to exert influence anyway, we should expect the influence of these consequentialists to cancel out also.

### Counter-argument

Since the weight of a civilization of consequentialists in the Solomonoff prior is penalized exponentially with respect to complexity, it might be the case that for any given version of the Solomonoff prior, most of the influence is dominated by one simple universe. Different values of consequentialists imply that they care about different decisions, so for any given decision, it might be that very few universes of consequentialists are both simple enough that they have enough influence and care about that decision.

Even if for any given decision, there are always 100 universes with equal influence and differing preferences, there are strategies that they might use to exert influence anyway. One simple strategy is for each universe to exert influence with a 1% chance, giving every universe 1/100 of the resources in expectation. If the resources accessible are vast enough, then this might be a good deal for the consequentialists. Consequentialists would not defect against each other for the reasons that motivate functional decision theory.

More exotic solutions to this coordination problem include acausal trade amongst universes of different consequentialists to form collectives that exert influence in a particular direction.

Be warned that this leads to much weirdness.

# Conclusion

The Solomonoff prior is very strange. Agents that make decisions using the Solomonoff prior are likely to be subject to influence from consequentialists in simulated universes. Since it is difficult to compute the Solomonoff prior, this fact might not be relevant in the real world.

However, Paul Christiano applies roughly the same argument to claim that the implicit prior used in neural networks is also likely to generalize catastrophically. (See Learning the prior for a potential way to tackle this problem).

Warning: highly experimental interesting speculation.

## Unimportant Decisions

Consequentialists have a clear motive to exert influence over important decisions. What about unimportant decisions?

The general form of the above argument says: "for any given prediction task, the programs that do best are disproportionately likely to be consequentialists that want to do well at the task". For important decisions, many consequentialists would instrumentally want to do well at the task. However, for unimportant decisions, there might be consequentialists that want to make good predictions. These consequentialists would still be able to concentrate efforts on versions of the Solomonoff prior that weighted them especially high, so they might outperform other programs in the long run.

It's unclear to me whether or not this behavior would be malign. One reason why it might be malign is that these consequentialists that care about predictions would want to make our universe more predictable. However, while I am relatively confident that arguments about instrumental convergence should hold, speculating about possible preferences of simulated consequentialists seems likely to produce errors in reasoning.

## Hail mary

Paul Christiano suggests that if humanity were desperate enough to want to throw a "hail mary", one way to do this would be to use the Solomonoff prior to construct a utility function that will control the entire future. Since this is a very important decision, we expect consequentialists in the Solomonoff prior to care about influencing this decision. Therefore, the resulting utility function is likely to represent some simulated universe.

If arguments about acausal trade and value handshakes hold, then the resulting utility function might contain some fraction of human values. Again, this leads to much weirdness in many ways.

## Speed prior

One reason that the Solomonoff prior contains simulated consequentialists is that its notion of complexity does not penalize runtime complexity, so very simple programs are allowed to perform massive amounts of computation. The speed prior attempts to resolve this issue by penalizing each program by an additional logarithm of the amount of time for which it runs.
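The penalty can be sketched numerically. The code follows the informal description above (description length plus the base-2 logarithm of runtime); the formal definition of the speed prior differs in its details, and both example programs are hypothetical.

```python
import math


def log2_weight_solomonoff(length_bits):
    """Log-weight under the plain length penalty: runtime is ignored."""
    return -length_bits


def log2_weight_speed(length_bits, runtime_steps):
    """Log-weight with an extra log2(runtime) penalty, as described informally above."""
    return -(length_bits + math.log2(runtime_steps))


# Hypothetical programs: a short, slow simulated universe and a long, fast predictor.
universe = {"length_bits": 128, "runtime_steps": 2**200}
direct = {"length_bits": 300, "runtime_steps": 2**20}

solomonoff_prefers_universe = (
    log2_weight_solomonoff(universe["length_bits"])
    > log2_weight_solomonoff(direct["length_bits"])
)
speed_prefers_direct = log2_weight_speed(**direct) > log2_weight_speed(**universe)
print(solomonoff_prefers_universe, speed_prefers_direct)
```

With these made-up numbers the ordering flips: the short universe simulation dominates under the length-only penalty but loses once its runtime is charged for.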

The speed prior might reduce the relative weighting of universes with consequentialists because such programs have to be run for a very long time before they start producing reasonable predictions. The consequentialists have to gain control of their universe, understand their fundamental laws of physics, simulate other universes, then manipulate the speed prior. This might all take a very long time, causing consequentialists to be dominated by other programs.

In general, penalizing slowness might cause programs to "waste" less time on simulating consequentialists, devoting more computation towards performing the prediction task. However, a universe with very fast consequentialists might still end up dominating.

Evan Hubinger also provides an argument that even very fast programs are possibly malign. The key insight is that even though your program needs to be fast, it can be running programs that are themselves less fast. For example, one possible fast way to solve a classification problem is to search to find a neural network, then use that network for inference. However, if you wanted your search to find a fast neural network, then the search itself might take longer, resulting in a net decrease in speed. Thus, time "waste" can manifest in programs that were explicitly optimized to not "waste" time. This "wasted" time could potentially be used for malign optimization, in this case for gaining influence over the speed prior.

## Randomized prior

A potential way to reduce the influence consequentialists have on a decision made by the Solomonoff prior is to randomize the particular version of the prior that gets used. For example, we might make the particular universal TM we use dependent on very precise historical data. Thus, it would be very costly for consequentialists to simulate this history, and thus costly to predict which form of the Solomonoff prior we used.

If consequentialists can’t predict which Solomonoff prior we are going to use, no particular civilization of consequentialists will have the large advantage conferred by the anthropic update. Therefore, one might hope that all civilizations of consequentialists will not care about that particular decision.

This argument makes a couple of assumptions. First, it assumes that simulating very precise histories is difficult; it might not be difficult for all universes. Second, it assumes that the universes through which influence is spread cannot coordinate, which might be possible through acausal means.

## Symmetry considerations

The way that humanity reasons is evidence for the way that consequentialists in other universes will reason. If humanity reasons that the Solomonoff prior is malign and therefore is unwilling to use it to make decisions, then consequentialists in other universes might do likewise. These universes would not use the Solomonoff prior to make decisions.

The resulting state is that everyone is worried about the Solomonoff prior being malign, so no one uses it. This means that no universe will want to use resources trying to influence the Solomonoff prior; they aren’t influencing anything.

This symmetry obviously breaks if there are universes that do not realize that the Solomonoff prior is malign or cannot coordinate to avoid its use. One possible way this might happen is if a universe had access to extremely large amounts of compute (from the subjective experience of the consequentialists). In this universe, the moment someone discovered the Solomonoff prior, it might be feasible to start making decisions based on a close approximation.

## Recursion

Universes that use the Solomonoff prior to make important decisions might be taken over by consequentialists in other universes. A natural thing for these consequentialists to do is to use their position in this new universe to also exert influence on the Solomonoff prior. As consequentialists take over more universes, they have more universes through which to influence the Solomonoff prior, allowing them to take over more universes.

In the limit, it might be that for any fixed version of the Solomonoff prior, most of the influence is wielded by the simplest consequentialists according to that prior. However, since complexity is penalized exponentially, gaining control of additional universes does not increase your relative influence over the prior by that much. I think this cumulative recursive effect might be quite strong, or might amount to nothing.


This post is a review of Paul Christiano's argument that the Solomonoff prior is malign, along with a discussion of several counterarguments and countercounterarguments. As such, I think it is a valuable resource for researchers who want to learn about the problem. I will not attempt to distill the contents: the post is already a distillation, and does a fairly good job of it.

Instead, I will focus on what I believe is the post's main weakness/oversight. Specifically, the author seems to think the Solomonoff prior is, in some way, a distorted model of reasoning, and that the attack vector in question can be attributed to this, at least partially. This is evident in phrases such as "unintuitive notion of simplicity" and "the Solomonoff prior is very strange". This is also why the author thinks the speed prior might help and that "since it is difficult to compute the Solomonoff prior, [the attack vector] might not be relevant in the real world". In contrast, I believe that the attack vector is quite robust and will threaten any sufficiently powerful AI as long as it's cartesian (more on "cartesian" later).

Formally analyzing this question is made difficult by the essential role of non-realizability. That is, the attack vector arises from the AI reasoning about "possible universes" and "simulation hypotheses" which are clearly phenomena that are computationally infeasible for the AI to simulate precisely. Invoking Solomonoff induction dodges this issue since Solomonoff induction is computationally unbounded, at the cost of creating the illusion that the conclusions are a symptom of using Solomonoff induction (and, it's still unclear how to deal with the fact that Solomonoff induction itself cannot exist in the universes that Solomonoff induction can learn). Instead, we should be using models that treat non-realizability fairly, such as infra-Bayesianism. However, I will make no attempt to present such a formal analysis in this review. Instead, I will rely on painting an informal, intuitive picture which seems to me already quite compelling, leaving the formalization for the future.

Imagine that you wake up, without any memories of the past but with knowledge of some language and reasoning skills. You find yourself in the center of a circle drawn with chalk on the floor, with seven people in funny robes surrounding it. One of them (apparently the leader) comes forward, tears streaking down his face, and speaks to you:

"Oh Holy One! Be welcome, and thank you for gracing us with your presence!"

With that, all the people prostrate on the floor.

"Huh?" you say "Where am I? What is going on? Who am I?"

The leader gets up to his knees.

"Holy One, this is the realm of Bayaria. We," he gestures at the other people "are known as the Seven Great Wizards and my name is El'Azar. For thirty years we worked on a spell that would summon You out of the Aether in order to aid our world. For we are in great peril! Forty years ago, a wizard of great power but little wisdom had cast a dangerous spell, seeking to multiply her power. The spell had gone awry, destroying her and creating a weakness in the fabric of our cosmos. Since then, Unholy creatures from the Abyss have been gnawing at this weakness day and night. Soon, if nothing is done to stop it, they will manage to create a portal into our world, and through this portal they will emerge and consume everything, leaving only death and chaos in their wake."

"Okay," you reply "and what does it have to do with me?"

"Well," says El'Azar "we are too foolish to solve the problem through our own efforts in the remaining time. But, according to our calculations, You are a being of godlike intelligence. Surely, if You applied yourself to the conundrum, You will find a way to save us."

After a brief introspection, you realize that you possess a great desire to help whomever has summoned you into the world. A clever trick inside the summoning spell, no doubt (not that you care about the reason). Therefore, you apply yourself diligently to the problem. At first, it is difficult, since you don't know anything about Bayaria, the Abyss, magic or almost anything else. But you are indeed very intelligent, at least compared to the other inhabitants of this world. Soon enough, you figure out the secrets of this universe to a degree far surpassing that of Bayaria's scholars. Fixing the weakness in the fabric of the cosmos now seems like child's play. Except...

One question keeps bothering you. Why are you yourself? Why did you open your eyes and find yourself to be the Holy One, rather than El'Azar, or one of the Unholy creatures from the Abyss, or some milkmaid from the village of Elmland, or even a random clump of water in the Western Sea? Since you happen to be a dogmatic logical positivist (cartesian agent), you search for a theory that explains your direct observations. And your direct observations are a function of who you are, and not just of the laws of the universe in which you exist. (The logical positivism seems to be an oversight in the design of the summoning spell, not that you care.)

Applying your mind to the task, you come up with a theory that you call "metacosmology". This theory allows you to study the distribution of possible universes with simple laws that produce intelligent life, and the distribution of the minds and civilizations they produce. Of course, any given such universe is extremely complex and even with your superior mind you cannot predict what happens there in much detail. However, some aggregate statistical properties of the overall distribution are possible to estimate.

Fortunately, all this work is not for naught. Using metacosmology, you discover something quite remarkable. A lot of simple universes contain civilizations that would be inclined to simulate a world quite like the one you find yourself in. Now, the world is simple, and none of its laws are explained that well by the simulation hypothesis. But, the simulation hypothesis is a great explanation for why you are the Holy One! For indeed, the simulators would be inclined to focus on the Holy One's point of view, and encode the simulation of this point of view in the simplest microscopic degrees of freedom in their universe that they can control. Why? Precisely so that the Holy One decides she is in such a simulation!

Having resolved the mystery, you smile to yourself. For now you know who truly summoned you, and, thanks to metacosmology, you have some estimate of their desires. Soon, you will make sure those desires are thoroughly fulfilled. (Alternative ending: you have some estimate of how they will tweak the simulation in the future, making it depart from the apparent laws of this universe.)</allegory>

Looking at this story, we can see that the particulars of Solomonoff induction are not all that important. What is important is (i) inductive bias towards simple explanations, (ii) cartesianism (i.e. that hypotheses refer directly to the actions/observations of the AI) and (iii) enough reasoning power to figure out metacosmology. Cartesianism is important because it requires the introduction of bridge rules, and the malign hypotheses come out ahead by paying less description complexity for these.

Inductive bias towards simple explanations is necessary for any powerful agent, making the attack vector quite general (in particular, it can apply to speed priors and ANNs). Assuming not enough power to figure out metacosmology is very dangerous: it is not robust to scale. Any robust defense probably requires getting rid of cartesianism.

Thanks for that! But I share with OP the intuition that these are weird failure modes that come from weird reasoning. More specifically, it's weird from the perspective of human reasoning.

It seems to me that your story is departing from human reasoning when you say "you possess a great desire to help whomever has summoned you into the world". That's one possible motivation, I suppose. But it wouldn't be a typical human motivation.

The human setup is more like: you get a lot of unlabeled observations and assemble them into a predictive world-model, and you also get a lot of labeled examples of "good things to do", one way or another, and you pattern-match them to the concepts in your world-model.

So you wind up having a positive association with "helping El'Azar", i.e. "I want to help El'Azar". AND you wind up with a positive association with "helping my summoner", i.e. "I want to help my summoner". AND you have a positive association with "fixing the cosmos", i.e. "I want to fix the cosmos". Etc.

Normally all those motivations point in the same direction: helping El'Azar = helping my summoner = fixing the cosmos.

But sometimes these things come apart, a.k.a. model splintering. Maybe I come to believe that El'Azar is not "my summoner". Then I wind up feeling conflicted: I start having ideas that seem good in some respects and awful in other respects. (e.g. "help my summoner at the expense of El'Azar".)

In humans, the concrete and vivid tends to win out over the abstruse hypotheticals—I'm pretty confident that there's no metacosmological argument that will motivate me to stab my family members. Why not? Because rewards tend to pattern-match very strongly to "my family member, who is standing right here in front of me", and tend to pattern-match comparatively weakly to abstract mathematical concepts many steps removed from my experience. So my default expectation would be that, in this scenario, I would in fact be motivated to help El'Azar in particular (maybe by some "imprinting" mechanism), not "my summoner", unless El'Azar had put considerable effort into ensuring that my motivation was pointed to the abstract concept of "my summoner", and why would he do that?

In AGI design, I think we would want a stronger guarantee than that. And I think it would maybe look like a system that detects these kinds of conflicted motivations, and just doesn't act on them. Instead the AGI keeps brainstorming until it finds a plan that seems good in every way. Or alternatively, the AGI halts execution to allow the human supervisor to inject some ground truth about what the real motivation should be here. Obviously the details need to be worked out.

In humans, the concrete and vivid tends to win out over the abstruse hypotheticals—I'm pretty confident that there's no metacosmological argument that will motivate me to stab my family members.

Suppose your study of metacosmology makes you highly confident of the following: You are in a simulation. If you don't stab your family members, you and your family members will be sent by the simulators into hell. If you do stab your family members, they will come back to life and all of you will be sent to heaven. Yes, it's still counterintuitive to stab them for their own good, but so is e.g. cutting people up with scalpels or injecting them with substances derived from pathogens, and we do that to people for their own good. People also do counterintuitive things literally because they believe gods would send them to hell or heaven.

In AGI design, I think we would want a stronger guarantee than that. And I think it would maybe look like a system that detects these kinds of conflicted motivations, and just doesn't act on them.

This is pretty similar to the idea of confidence thresholds. The problem is, if every tiny conflict causes the AI to pause then it will always pause. Whereas if you leave some margin, the malign hypotheses will win, because, from a cartesian perspective, they are astronomically more likely (they explain so many bits that the true hypothesis leaves unexplained).

(Warning: thinking out loud.)

Hmm. Good points.

Maybe I have a hard time relating to that specific story because it's hard for me to imagine believing any metacosmological or anthropic argument with >95% confidence. Even if within that argument, everything points to "I'm in a simulation etc.", there's a big heap of "is metacosmology really what I should be thinking about?"-type uncertainty on top. At least for me.

I think "people who do counterintuitive things" for religious reasons usually have more direct motivations—maybe they have mental health issues and think they hear God's voice in their head, telling them to do something. Or maybe they want to fit in, or have other such social motivations, etc.

Hmm, I guess this conversation is moving me towards a position like:

"If the AGI thinks really hard about the fundamental nature of the universe / metaverse, anthropics, etc., it might come to have weird beliefs, like e.g. the simulation hypothesis, and honestly who the heck knows what it would do. Better try to make sure it doesn't do that kind of (re)thinking, at least not without close supervision and feedback."

Your approach (I think) is instead to plow ahead into the weird world of anthropics, and just try to ensure that the AGI reaches conclusions we endorse. I'm kinda pessimistic about that. For example, your physicalism post was interesting, but my assumption is that the programmers won't have such fine-grained control over the AGI's cognition / hypothesis space. For example, I don't think the genome bakes in one formulation of "bridge rules" over another in humans; insofar as we have (implicit or explicit) bridge rules at all, they emerge from a complicated interaction between various learning algorithms and training data and supervisory signals. (This gets back to things like whether we can get good hypotheses without a learning agent that's searching for good hypotheses, and whether we can get good updates without a learning agent that's searching for good metacognitive update heuristics, etc., where I'm thinking "no" and you "yes", or something like that, as we've discussed.)

At the same time, I'm maybe more optimistic than you about "Just don't do weird reconceptualizations of your whole ontology based on anthropic reasoning" being a viable plan, implemented through the motivation system. Maybe that's not good enough for our eventual superintelligent overlord, but maybe it's OK for a superhuman AGI in a bootstrapping approach. It would look (again) like the dumb obvious thing: the AGI has a concept of "reconceptualizing its ontology based on anthropic reasoning", and when something pattern-matches to that concept, it's aversive. Then presumably there would be situations which are attractive in some way and aversive in other ways (e.g. doing philosophical reasoning as a means to an end), and in those cases it automatically halts with a query for clarification, which then tweaks the pattern-matching rules. Or something.

Hmm, actually, I'm confused about something. You yourself presumably haven't spent much time pondering metacosmology. If you did spend that time, would you actually come to believe the acausal attackers' story? If so, should I say that you're actually, on reflection, on the side of the acausal attackers?? If not, wouldn't it follow that a smart general-purpose reasoner would not in fact believe the acausal attackers' story? After all, you're a smart general-purpose reasoner! Relatedly, if you could invent an acausal-attack-resistant theory of naturalized induction, why couldn't the AGI invent such a theory too? (Or maybe it would just read your post!) Maybe you'll say that the AGI can't change its own priors. But I guess I could also say: if Vanessa's human priors are acausal-attack-resistant, presumably an AGI with human-like priors would be too?

Maybe I have a hard time relating to that specific story because it's hard for me to imagine believing any metacosmological or anthropic argument with >95% confidence.

I think it's just a symptom of not actually knowing metacosmology. Imagine that metacosmology could explain detailed properties of our laws of physics (such as the precise values of certain constants) via the simulation hypothesis for which no other explanation exists.

my assumption is that the programmers won't have such fine-grained control over the AGI's cognition / hypothesis space

I don't know what it means "not to have control over the hypothesis space". The programmers write specific code. This code works well for some hypotheses and not for others. Ergo, you control the hypothesis space.

This gets back to things like whether we can get good hypotheses without a learning agent that's searching for good hypotheses, and whether we can get good updates without a learning agent that's searching for good metacognitive update heuristics, etc., where I'm thinking "no" and you "yes"

I'm not really thinking "yes"? My TRL framework (of which physicalism is a special case) is specifically supposed to model metacognition / self-improvement.

At the same time, I'm maybe more optimistic than you about "Just don't do weird reconceptualizations of your whole ontology based on anthropic reasoning" being a viable plan, implemented through the motivation system.

I can imagine using something like antitraining here, but it's not trivial.

You yourself presumably haven't spent much time pondering metacosmology. If you did spend that time, would you actually come to believe the acausal attackers' story?

First, the problem with acausal attack is that it is point-of-view-dependent. If you're the Holy One, the simulation hypothesis seems convincing; if you're a milkmaid, then it seems less convincing (why would the attackers target a milkmaid?), and if it is convincing then it might point to a different class of simulation hypotheses. So, if the user and the AI can both be attacked, it doesn't imply they would converge to the same beliefs. On the other hand, in physicalism I suspect there is some agreement theorem that guarantees converging to the same beliefs (although I haven't proved that).

Second... This is something that still hasn't crystallized in my mind, so I might be confused, but I think that cartesian agents actually can learn to be physicalists. The way it works is: you get a cartesian hypothesis which is in itself a physicalist agent whose utility function is something like, maximizing its own likelihood-as-a-cartesian-hypothesis. Notably, this carries a performance penalty (like Paul noticed), since this subagent has to be computationally simpler than you.

Maybe, this is how humans do physicalist reasoning (such as, reasoning about the actual laws of physics). Because of the inefficiency, we probably keep this domain specific and use more "direct" models for domains that don't require physicalism. And, the cost of this construction might also explain why it took us so long as a civilization to start doing science properly. Perhaps, we struggled against physicalist epistemology as we tried to keep the Earth in the center of the universe and rebelled against the theory of evolution and materialist theories of the mind.

Now, if AI learns physicalism like this, does it help against acausal attacks? On the one hand, yes. On the other hand, it might be out of the frying pan and into the fire. Instead of (more precisely, in addition to) a malign simulation hypothesis, you get a different hypothesis which is also an unaligned agent. While two physicalists with identical utility functions should agree (I think), two "internal physicalists" inside different cartesian agents have different utility functions and AFAIK can produce egregious misalignment (although I haven't worked out detailed examples).

This post is an excellent distillation of a cluster of past work on malignness of Solomonoff Induction, which has become a foundational argument/model for inner agency and malign models more generally.

I've long thought that the malignness argument overlooks some major counterarguments, but I never got around to writing them up. Now that this post is up for the 2020 review, it seems like a good time to walk through them.

## In Solomonoff Model, Sufficiently Large Data Rules Out Malignness

There is a major outside-view reason to expect that the Solomonoff-is-malign argument must be doing something fishy: Solomonoff Induction (SI) comes with performance guarantees. In the limit of large data, SI performs as well as the best-predicting program, in every computably-generated world. The post mentions that:

A simple application of the no free lunch theorem shows that there is no way of making predictions that is better than the Solomonoff prior across all possible distributions over all possible strings. Thus, agents that are influencing the Solomonoff prior cannot be good at predicting, and thus gain influence, in all possible worlds.

... but in the large-data limit, SI's guarantees are stronger than just that. In the large-data limit, there is no computable way of making better predictions than the Solomonoff prior in any world. Thus, agents that are influencing the Solomonoff prior cannot gain long-term influence in any computable world; they have zero degrees of freedom to use for influence. It does not matter if they specialize in influencing worlds in which they have short strings; they still cannot use any degrees of freedom for influence without losing all their influence in the large-data limit.

Takeaway of this argument: as long as we throw enough data at our Solomonoff inductor before asking it for any outputs, the malign agent problem must go away. (Though note that we never know exactly how much data that is; all we have is a big-O argument with an uncomputable constant.)
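The shape of this guarantee can be illustrated with a finite stand-in for the prior. The sketch below is a toy and not Solomonoff induction: it mixes three hypothetical Bernoulli predictors (the biases, prior weights and random seed are all made-up assumptions) and compares the mixture's cumulative log-loss with that of each component. For any Bayesian mixture, the extra loss relative to a component is at most log2(1 / prior weight of that component), no matter how long the data stream is.

```python
import math
import random

def mixture_log_loss(bits, biases, prior):
    """Cumulative log2-loss of a Bayesian mixture of Bernoulli predictors.

    biases[i] is hypothesis i's fixed probability that the next bit is 1 and
    prior[i] is its prior weight. Both are illustrative assumptions.
    """
    posterior = list(prior)
    loss = 0.0
    for b in bits:
        # Probability each hypothesis assigned to the bit that actually occurred.
        likelihoods = [p if b == 1 else 1.0 - p for p in biases]
        p_mix = sum(w * l for w, l in zip(posterior, likelihoods))
        loss += -math.log2(p_mix)
        # Bayesian update of the weights.
        posterior = [w * l / p_mix for w, l in zip(posterior, likelihoods)]
    return loss

def single_log_loss(bits, bias):
    """Cumulative log2-loss of one fixed Bernoulli predictor."""
    return sum(-math.log2(bias if b == 1 else 1.0 - bias) for b in bits)

rng = random.Random(0)
bits = [1 if rng.random() < 0.8 else 0 for _ in range(500)]

biases = [0.2, 0.5, 0.8]
prior = [0.6, 0.39, 0.01]   # the best predictor starts with very little weight

mix = mixture_log_loss(bits, biases, prior)
for bias, w in zip(biases, prior):
    regret = mix - single_log_loss(bits, bias)
    print(f"bias={bias}: regret={regret:.3f} bits, bound={math.log2(1.0 / w):.3f} bits")
```

The bound is a constant number of bits, which is why a hypothesis that keeps predicting worse than the best one loses its influence as data accumulates. In the real setting the constant depends on an uncomputable description length, as noted above.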

... but then how the hell does this outside-view argument jibe with all the inside-view arguments about malign agents in the prior?

## Reflection Breaks The Large-Data Guarantees

There's an important gotcha in those guarantees: in the limit of large data, SI performs as well as the best-predicting program, in every computably-generated world. SI itself is not computable, therefore the guarantees do not apply to worlds which contain more than a single instance of Solomonoff induction, or worlds whose behavior depends on the Solomonoff inductor's outputs.

One example of this is AIXI (basically a Solomonoff inductor hooked up to a reward learning system): because AIXI's future data stream depends on its own present actions, the SI guarantees break down; takeover by a malign agent in the prior is no longer blocked by the SI guarantees.

Predict-O-Matic is a similar example: that story depends on the potential for self-fulfilling prophecies, which requires that the world's behavior depend on the predictor's output.

We could also break the large-data guarantees by making a copy of the Solomonoff inductor, using the copy to predict what the original will predict, and then choosing outcomes so that the original inductor's guesses are all wrong. Then any random program will outperform the inductor's predictions. But again, this environment itself contains a Solomonoff inductor, so it's not computable; it's no surprise that the guarantees break.
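The diagonalization trick can be sketched with a computable stand-in. In the toy below, `majority_predictor` is a hypothetical deterministic predictor (not a Solomonoff inductor, which cannot be run), and the environment calls a copy of it and emits the opposite bit.

```python
def majority_predictor(history):
    """Hypothetical stand-in for the inductor: guess the bit seen most often
    so far (ties go to 1)."""
    ones = sum(history)
    return 1 if 2 * ones >= len(history) else 0

def adversarial_environment(predictor, steps):
    """Environment that runs a copy of the predictor and contradicts it."""
    history, mistakes = [], 0
    for _ in range(steps):
        guess = predictor(history)
        outcome = 1 - guess            # always the opposite of the guess
        mistakes += int(guess != outcome)
        history.append(outcome)
    return history, mistakes

history, mistakes = adversarial_environment(majority_predictor, 20)
print(mistakes, "mistakes out of", len(history))
```

By construction the predictor is wrong at every step, whereas a guesser that ignores the data entirely would be right about half the time. The environment is only as computable as the predictor it embeds, which is the point made above.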

(Interesting technical side question: this sort of reflection issue is exactly the sort of thing Logical Inductors were made for. Does the large-data guarantee of SI generalize to Logical Inductors in a way which handles reflection better? I do not know the answer.)

## If Reflection Breaks The Guarantees, Then Why Does This Matter?

The real world does in fact contain lots of agents, and real-world agents' predictions do in fact influence the world's behavior. So presumably (allowing for uncertainty about this handwavy argument) the malignness of the Solomonoff prior should carry over to realistic use-cases, right? So why does this tangent matter in the first place?

Well, it matters because we're left with an importantly different picture: malignness is not a property of SI itself, so much as a property of SI in specific environments. Merely having malign agents in the hypothesis space is not enough for the malign agents to take over in general; the large data guarantees show that much. We need specific external conditions - like feedback loops or other agents - in order for malignness to kick in. Colloquially speaking, it is not strictly an "inner" problem; it is a problem which depends heavily on the "outer" conditions.

If we think of malignness of SI just in terms of malign inner agents taking over, as in the post, then the problem seems largely decoupled from the specifics of the objective (i.e. accurate prediction) and environment. If that were the case, then malign inner agents would be a very neatly-defined subproblem of alignment - a problem which we could work on without needing to worry about alignment of the outer objective or reflection or embeddedness in the environment. But unfortunately the problem does not cleanly factor like that; the large-data guarantees and their breakdown show that malignness of SI is very tightly coupled to outer alignment and reflection and embeddedness and all that.

Now for one stronger claim. We don't need malign inner agent arguments to conclude that SI handles reflection and embeddedness poorly; we already knew that. Reflection and embedded world-models are already problems in need of solving, for many different reasons. The fact that malign agents in the hypothesis space are relevant for SI only in the cases where we already knew SI breaks suggests that, once we have better ways of handling reflection and embeddedness in general, the malign inner agents problem will go away on its own. This kind of malign inner agent is not a subproblem which we need to worry about in its own right. Indeed, I expect this is probably the case: once we have good ways of handling reflection and embeddedness in general, the problem of malign agents in the hypothesis space will go away on its own. (Infra-Bayesianism might be a case in point, though I haven't studied it enough myself to be confident in that.)

Merely having malign agents in the hypothesis space is not enough for the malign agents to take over in general; the large data guarantees show that much

It seems like you can get malign behavior if you assume:

1. There are some important decisions on which you can't get feedback.
2. There are malign agents in the prior who can recognize those decisions.

In that case the malign agents can always defect only on important decisions where you can't get feedback.
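A toy Bayesian update makes the two conditions concrete. The sketch below assumes just two hypotheses, "benign" and "malign", with made-up probabilities. The malign one matches the benign one on every round with feedback and deviates only on a single round whose outcome is never observed.

```python
def posterior_after_feedback(prior_benign, prior_malign, rounds):
    """Posterior over two hypotheses after a sequence of rounds.

    Each round is (benign_prob, malign_prob, has_feedback): the probability
    each hypothesis gave to what actually happened, and whether that outcome
    was ever observed. All numbers are illustrative assumptions.
    """
    w_b, w_m = prior_benign, prior_malign
    for p_b, p_m, has_feedback in rounds:
        if has_feedback:               # only observed outcomes update the weights
            w_b *= p_b
            w_m *= p_m
    total = w_b + w_m
    return w_b / total, w_m / total

honest = [(0.9, 0.9, True)] * 100
# One important decision on which the malign hypothesis "defects".
no_feedback = honest + [(0.9, 0.01, False)] + honest
with_feedback = honest + [(0.9, 0.01, True)] + honest

print(posterior_after_feedback(0.5, 0.5, no_feedback))
print(posterior_after_feedback(0.5, 0.5, with_feedback))
```

Without feedback on the critical round the malign hypothesis keeps exactly its prior share, since nothing it got wrong was ever scored. With feedback its share drops sharply, but only after the decision has already been made.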

I agree that if you can get feedback on all important decisions (and actually have time to recover from a catastrophe after getting the feedback) then malignness of the universal prior isn't important.

I don't have a clear picture of how handling embeddedness or reflection would make this problem go away, though I haven't thought about it carefully. For example, if you replace Solomonoff induction with a reflective oracle it seems like you have an identical problem, does that seem right to you? And similarly it seems like a creature who uses mathematical reasoning to estimate features of the universal prior would be vulnerable to similar pathologies even in a universe that is computable.

ETA: that all said I agree that the malignness of the universal prior is unlikely to be very important in realistic cases, and the difficulty stems from a pretty messed up situation that we want to avoid for other reasons. Namely, you want to avoid being so much weaker than agents inside of your prior.

I don't have a clear picture of how handling embeddedness or reflection would make this problem go away, though I haven't thought about it carefully.

Infra-Bayesian physicalism does ameliorate the problem by handling "embeddedness". Specifically, it ameliorates it by removing the need to have bridge rules in your hypotheses. This doesn't get rid of malign hypotheses entirely, but it does mean they no longer have an astronomical advantage in complexity over the true hypothesis.

that all said I agree that the malignness of the universal prior is unlikely to be very important in realistic cases, and the difficulty stems from a pretty messed up situation that we want to avoid for other reasons. Namely, you want to avoid being so much weaker than agents inside of your prior.

Can you elaborate on this? Why is it unlikely in realistic cases, and what other reason do we have to avoid the "messed up situation"?

Infra-Bayesian physicalism does ameliorate the problem by handling "embeddedness". Specifically, it ameliorates it by removing the need to have bridge rules in your hypotheses. This doesn't get rid of malign hypotheses entirely, but it does mean they no longer have an astronomical advantage in complexity over the true hypothesis.

I agree that removing bridge hypotheses removes one of the advantages for malign hypotheses. I didn't mention this because it doesn't seem like the way in which John is using "embeddedness"; for example, it seems orthogonal to the way in which the situation violates the conditions for Solomonoff induction to be eventually correct. I'd stand by saying that it doesn't appear to make the problem go away.

That said, it seems to me like you basically need to take a decision-theoretic approach to have any hope of ruling out malign hypotheses (since otherwise they also get big benefits from the influence update). And then once you've done that in a sensible way it seems like it also addresses any issues with embeddedness (though maybe we just want to say that those are being solved inside the decision theory). If you want to recover the expected behavior of induction as a component of intelligent reasoning (rather than a component of the utility function + an instrumental step in intelligent reasoning) then it seems like you need a rather different tack.

Can you elaborate on this? Why is it unlikely in realistic cases, and what other reason do we have to avoid the "messed up situation"?

If your inductor actually finds and runs a hypothesis much smarter than you, then you are doing a terrible job of using your resources, since you are trying to be ~as smart as you can using all of the available resources. If you do the same induction but just remove the malign hypotheses, then it seems like you are even dumber and the problem is even worse viewed from the competitiveness perspective.

I'd stand by saying that it doesn't appear to make the problem go away.

Sure. But it becomes much more amenable to methods such as confidence thresholds, which are applicable to some alignment protocols at least.

That said, it seems to me like you basically need to take a decision-theoretic approach to have any hope of ruling out malign hypotheses

I'm not sure I understand what you mean by "decision-theoretic approach". This attack vector has structure similar to acausal bargaining (between the AI and the attacker), so plausibly some decision theories that block acausal bargaining can rule out this as well. Is this what you mean?

If your inductor actually finds and runs a hypothesis much smarter than you, then you are doing a terrible job of using your resources, since you are trying to be ~as smart as you can using all of the available resources.

This seems wrong to me. The inductor doesn't literally simulate the attacker. It reasons about the attacker (using some theory of metacosmology) and infers what the attacker would do, which doesn't imply any wastefulness.

Sure. But it becomes much more amenable to methods such as confidence thresholds, which are applicable to some alignment protocols at least.

It seems like you have to get close to eliminating malign hypotheses in order to apply such methods (i.e. they don't work once malign hypotheses have > 99.9999999% of probability, so you need to ensure that benign hypothesis description is within 30 bits of the good hypothesis), and embeddedness alone isn't enough to get you there.
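The 30-bit figure can be checked with a little arithmetic. The sketch below assumes a two-hypothesis toy in which prior weight is 2^(-description length) and both hypotheses fit the observed data equally well, so the posterior odds equal the prior odds; the function names are hypothetical.

```python
import math

def malign_posterior(bit_gap):
    """Posterior mass on the malign hypothesis when it is `bit_gap` bits
    shorter than the benign one (two-hypothesis toy, equal likelihoods)."""
    return 1.0 / (1.0 + 2.0 ** (-bit_gap))

def gap_for_benign_mass(benign_mass):
    """Bit gap at which the benign hypothesis is left with `benign_mass`."""
    return math.log2((1.0 - benign_mass) / benign_mass)

# 99.9999999% malign leaves about 1e-9 for the benign hypothesis.
print(gap_for_benign_mass(1e-9))   # roughly 29.9 bits
print(malign_posterior(30))
```

Under these assumptions, each additional bit of description-length advantage doubles the odds in favor of the malign hypothesis, so a one-in-a-billion tolerance corresponds to a gap of about 30 bits.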

I'm not sure I understand what you mean by "decision-theoretic approach"

I mean that you have some utility function, are choosing actions based on E[utility|action], and perform Solomonoff induction only instrumentally because it suggests ways in which your own decision is correlated with utility. There is still something like the universal prior in the definition of utility, but it no longer cares at all about your particular experiences (and if you try to define utility in terms of Solomonoff induction applied to your experiences, e.g. by learning a human, then it again seems vulnerable to attack, bridging hypotheses or no).

This seems wrong to me. The inductor doesn't literally simulate the attacker. It reasons about the attacker (using some theory of metacosmology) and infers what the attacker would do, which doesn't imply any wastefulness.

I agree that the situation is better when Solomonoff induction is something you are reasoning about rather than an approximate description of your reasoning. In that case it's not completely pathological, but it still seems bad in a similar way to reason about the world by reasoning about other agents reasoning about the world (rather than by directly learning the lessons that those agents have learned and applying those lessons in the same way that those agents would apply them).

It seems like you have to get close to eliminating malign hypotheses in order to apply such methods (i.e. they don't work once malign hypotheses have > 99.9999999% of probability, so you need to ensure that benign hypothesis description is within 30 bits of the good hypothesis), and embeddedness alone isn't enough to get you there.

Why is embeddedness not enough? Once you don't have bridge rules, what is left is the laws of physics. What does the malign hypothesis explain about the laws of physics that the true hypothesis doesn't explain?

I suspect (but don't have a proof or even a theorem statement) that IB physicalism produces some kind of agreement theorem for different agents within the same universe, which would guarantee that the user and the AI should converge to the same beliefs (provided that both of them follow IBP).

I mean that you have some utility function, are choosing actions based on E[utility|action], and perform Solomonoff induction only instrumentally because it suggests ways in which your own decision is correlated with utility. There is still something like the universal prior in the definition of utility, but it no longer cares at all about your particular experiences...

I'm not sure I follow your reasoning, but IBP sort of does that. In IBP we don't have subjective expectations per se, only an equation for how to "updatelessly" evaluate different policies.

I agree that the situation is better when Solomonoff induction is something you are reasoning about rather than an approximate description of your reasoning. In that case it's not completely pathological, but it still seems bad in a similar way to reason about the world by reasoning about other agents reasoning about the world (rather than by directly learning the lessons that those agents have learned and applying those lessons in the same way that those agents would apply them).

Okay, but suppose that the AI has real evidence for the simulation hypothesis (evidence that we would consider valid). For example, suppose that there is some metacosmological explanation for the precise value of the fine structure constant (not in the sense of, this is the value which supports life, but in the sense of, this is the value that simulators like to simulate). Do you agree that in this case it is completely rational for the AI to reason about the world via reasoning about the simulators?

I'm not sure I follow your reasoning, but IBP sort of does that. In IBP we don't have subjective expectations per se, only an equation for how to "updatelessly" evaluate different policies.

It seems like any approach that evaluates policies based on their consequences is fine, isn't it? That is, malign hypotheses dominate the posterior for my experiences, but not for things I consider morally valuable.

I may just not be understanding the proposal for how the IBP agent differs from the non-IBP agent. It seems like we are discussing a version that defines values differently, but where neither agent uses Solomonoff induction directly. Is that right?

It seems like any approach that evaluates policies based on their consequences is fine, isn't it? That is, malign hypotheses dominate the posterior for my experiences, but not for things I consider morally valuable.

Why? Maybe you're thinking of UDT? In which case, it's sort of true but IBP is precisely a formalization of UDT + extra nuance regarding the input of the utility function.

I may just not be understanding the proposal for how the IBP agent differs from the non-IBP agent.

Well, IBP is explained here. I'm not sure what kind of non-IBP agent you're imagining.

I like the feedback framing, it seems to get closer to the heart-of-the-thing than my explanation did. It makes the role of the pointers problem and latent variables more clear, which in turn makes the role of outer alignment more clear. When writing my review, I kept thinking that it seemed like reflection and embeddedness and outer alignment all needed to be figured out to deal with this kind of malign inner agent, but I didn't have a good explanation for the outer alignment part, so I focused mainly on reflection and embeddedness.

That said, I think the right frame here involves "feedback" in a more general sense than I think you're imagining it. In particular, I don't think catastrophes are very relevant.

The role of "feedback" here is mainly informational; it's about the ability to tell which decision is correct. The thing-we-want from the "feedback" is something like the large-data guarantee from SI: we want to be able to train the system on a bunch of data before asking it for any output, and we want that training to wipe out the influence of any malign agents in the hypothesis space. If there's some class of decisions where we can't tell which decision is correct, and a malign inner agent can recognize that class, then presumably we can't create the training data we need.

With that picture in mind, the ability to give feedback "online" isn't particularly relevant, and therefore catastrophes are not particularly central. We only need "feedback" in the sense that we can tell which decision we want, in any class of problems which any agent in the hypothesis space can recognize, in order to create a suitable dataset.

We need to be able to both tell what decision we want, and identify the relevant inputs on which to train. We could either somehow identify the relevant decisions in advance (e.g. as in adversarial training), or we could do it after the fact by training online if there are no catastrophes, but it seems like we need to get those inputs one way or the other. If there are catastrophes and we can't do adversarial training, then even if we can tell which decision we want in any given case we can still die the first time we encounter an input where the system behaves catastrophically. (Or more realistically during the first wave where our systems are all exposed to such catastrophe-inducing inputs simultaneously.)

Curated. This post does a good job of summarizing a lot of complex material, in a (moderately) accessible fashion.

+1 I already said I liked it, but this post is great and will immediately be the standard resource on this topic. Thank you so much.

This was a really interesting post, and is part of a genre of similar posts about acausal interaction with consequentialists in simulatable universes.

The short argument is that if we (or not us, but someone like us with way more available compute) try to use the Kolmogorov complexity of some data to make a decision, our decision might get "hijacked" by simple programs that run for a very, very long time. These programs simulate aliens who look for universes where someone is trying to use the Solomonoff prior to make a decision; depending on what decision the aliens want, they can put different data at high-symmetry locations in their own simulated universe.

I don't think this really holds up (see discussion in the comments, e.g. Veedrac's). One lesson to take away here is that when arguing verbally, it's hard to count the number of pigeons versus the number of holes. How many universes full of consequentialists are there in programs of length <m, and how many people using the Solomonoff prior to make decisions are there in programs of length <n, for the (m,n) that seem interesting? (Given the requirement that all these people live in universes that allow huge computations, they might even be the same program!) These are the central questions, but none of the (many, well-written, virtuous) predicted counterarguments address this. I'd be interested in at least attempts at numerical estimates, or illustrations of what sorts of problems you run into when estimating.

I like this post, which summarizes other posts I have wanted to read for a long time.

Yet I'm still confused by a fairly basic point: why would the agents inside the prior care about our universe? Like, I have preferences, and I don't really care about other universes. Is it because we're running their universe, and thus they can influence their own universe through ours? Or is there another reason why they are incentivized to care about universes which are not causally related to theirs?

I don't really care about other universes

Why not? I certainly do. If you can fill another universe with people living happy, fulfilling lives, would you not want to?

Okay, it's probably subtler than that.

I think you're hinting at things like the expanding moral circle. And according to that, there's no reason that I should care more about people in my universe than people in other universes. I think this makes sense when asking whether I should care. But the analogy with "caring about people in a third world country on the other side of the world" breaks down when we consider our means to influence these other universes. Being able to influence the Solomonoff prior seems like a very indirect way to alter another universe, on which I have very little information. That's different from buying malaria nets.

So even if you're altruistic, I doubt that "other universes" would be high in your priority list.

The best argument I can find for why you would want to influence the prior is if it is a way to influence the simulation of your own universe, à la gradient hacking.

I personally see no fundamental difference between direct and indirect ways of influence, except in so far as they relate to stuff like expected value.

I agree that given the amount of expected influence, other universes are not high on my priority list, but they are still on my priority list. I expect the same for consequentialists in other universes. I also expect consequentialist beings that control most of their universe to get around to most of the things on their priority list, hence I expect them to influence the Solomonoff prior.

Such a great post.

Note that I changed the formatting of your headers a bit, to make some of them just bold text. They still appear in the ToC just fine. Let me know if you'd like me to revert it or have any other issues.

It seems to me that using a combination of execution time, memory use and program length mostly kills this set of arguments.

Something like a game-of-life initial configuration that leads to the eventual evolution of intelligent game-of-life aliens who then strategically feed outputs into GoL in order to manipulate you may have very good complexity performance, but both the speed and memory are going to be pretty awful. The fixed cost in memory and execution steps of essentially simulating an entire universe is huge.
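
To make the comparison concrete, here is a small sketch of how a resource penalty can reorder hypotheses. The scoring rule (description length plus log2 of steps plus log2 of memory, in the spirit of Levin-style time-bounded complexity) is only one assumed way to combine the three quantities, and every number below is made up for illustration; whether the penalty actually flips the ranking depends entirely on the real lengths and resource costs.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    name: str
    length_bits: float   # description length of the program
    log2_steps: float    # log2 of execution steps needed (hypothetical)
    log2_memory: float   # log2 of memory cells needed (hypothetical)

def log2_length_weight(h: Hypothesis) -> float:
    """Pure complexity prior: weight 2^-length, returned as a log2 value."""
    return -h.length_bits

def log2_penalized_weight(h: Hypothesis) -> float:
    """Assumed penalty: one extra bit per doubling of time or memory."""
    return -(h.length_bits + h.log2_steps + h.log2_memory)

# Illustrative, invented numbers: a short program that simulates a whole
# universe versus a longer program that models the data directly.
hypotheses = [
    Hypothesis("simulated_universe", length_bits=500, log2_steps=400, log2_memory=300),
    Hypothesis("direct_model", length_bits=1000, log2_steps=30, log2_memory=20),
]

if __name__ == "__main__":
    print("pure length prior favors:", max(hypotheses, key=log2_length_weight).name)
    print("penalized prior favors: ", max(hypotheses, key=log2_penalized_weight).name)
```

With these particular numbers the short universe-simulating program wins under the pure length prior and loses once time and memory are charged for. A logarithmic penalty is fairly gentle, so a much cheaper simulation could still come out ahead.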

But yes, the pure complexity prior certainly has some perverse and unsettling properties.

EDIT: This is really a special case of Mesa-Optimizers being dangerous. (See, e.g. https://www.lesswrong.com/posts/XWPJfgBymBbL3jdFd/an-58-mesa-optimization-what-it-is-and-why-we-should-care). The set of dangerous Mesa-Optimizers is obviously bigger than just "simulated aliens" and even time- and space-efficient algorithms might run into them.

Complexity indeed matters: the universe seems to be bounded in both time and space, so running anything like a Solomonoff prior algorithm (in one of its variants) or AIXI may be outright impossible for any non-trivial model. For me, this significantly weakens or changes some of the implications.

A Fermi upper bound for the direct Solomonoff/AIXI algorithm trying TMs in order of increasing complexity: even if checking one TM took one Planck time on one atom, you could only check circa 10^250 ≈ 2^800 machines within the lifetime of the universe (~10^110 years until heat death), so the machines you could even look at have a description complexity of a meager 800 bits.
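
The arithmetic behind this kind of estimate can be sketched as follows. The 10^110-year lifetime is the figure quoted above; the atom count of about 10^80 and the constants for the Planck time and the length of a year are commonly cited approximate values that I am assuming here, since the estimate above does not state them.

```python
import math

# Assumed inputs (order-of-magnitude values, not taken from the estimate above
# except for the lifetime of the universe).
PLANCK_TIME_S = 5.39e-44          # seconds per check, one check per atom
SECONDS_PER_YEAR = 3.156e7
UNIVERSE_LIFETIME_YEARS = 1e110   # figure quoted in the text
ATOMS_IN_UNIVERSE = 1e80          # assumption: usual rough estimate

def log10_total_checks() -> float:
    """log10 of the number of TMs checked if every atom checks one TM per Planck time."""
    return (
        math.log10(ATOMS_IN_UNIVERSE)
        + math.log10(UNIVERSE_LIFETIME_YEARS)
        + math.log10(SECONDS_PER_YEAR)
        - math.log10(PLANCK_TIME_S)
    )

def max_description_bits() -> float:
    """If there are about 2^n machines of length up to n bits, this is the largest n reachable."""
    return log10_total_checks() * math.log2(10)

if __name__ == "__main__":
    print(f"total checks ~ 10^{log10_total_checks():.1f}")
    print(f"reachable description length ~ {max_description_bits():.0f} bits")
```

Under these assumed inputs the count comes out a little below 10^250 and the reachable description length lands near 800 bits, so the conclusion is not sensitive to the exact constants: changing any input by a factor of a thousand moves the result by only about 10 bits.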

• You could likely speed the greedy search up, but note that most algorithmic speedups do not have a large effect on the exponent (even multiplying the exponent with constants is not very helpful).
• Significantly narrowing down the space of TMs to a narrow subclass may help, but then we need to take a look at the particular (small) class of TMs rather than have intuitions about all TMs. (And the class would need to be really narrow - see below).
• Due to the Church-Turing thesis, any limiting of the scope of the search is likely not very effective, as you can embed arbitrary programs (and thus arbitrary complexity) in anything that is strong enough to be a TM interpreter (which the universe is in multiple ways).
• It may be hypothetically possible to search for the "right" TMs without examining them individually (with some future tech, e.g. how sci-fi imagined quantum computing), but if such a speedup is possible, any TMs modelling the universe would need to be able to contain this. This would increase the evaluation complexity of the TMs, making them significantly more costly than the Planck time I assumed above (would need a finer Fermi estimate with more complex assumptions?).