Some problems with making induction benign, and approaches to them

jessicata

The universal prior is malign. I'll talk about a sequence of problems that cause it to be malign, and possible solutions.

(this post came out of conversations with Scott, Critch, and Paul)

Here's the basic setup of how an inductor might be used. At some point humans use the universal prior to make an important prediction that influences our actions. Say that humans construct an inductor, give it a sequence of bits $s$ from some sensor, and ask it to predict the next $k$ bits. Those $k$ bits are actually (in a way inferable from $s$) going to be generated by a human in a sealed room who thinks for a year. After the human thinks for a year, the $k$ bits are fed into the inductor. This way, the humans can use the inductor to predict what the human who thinks for a year will say ahead of time. (This probably isn't the best way to use an inductor, but it serves as a good demonstration.)

The anthropic update is a problem for Solomonoff induction

The original problem Paul talks about in his post is that consequentialist agents can make an "anthropic update" on the fact that the string $s$ is fed into a Solomonoff inductor with a specific prior, while Solomonoff induction can't make this update on its own. Interestingly, this problem with Solomonoff induction is also a problem with sufficiently large convolutional neural networks (capable of simulating e.g. the Game of Life).

Possibly, this could be solved by inventing a naturalized variant of Solomonoff induction that makes this anthropic update.

Simulation warfare is a problem for naturalized induction

Suppose alien civilizations devote some percentage of their resources to (a) coming up with plausible values for the prefix $s$ and (b) constructing naturalized inductors that are fed the string $s$ followed by $k$ bits such that the aliens would be able to take over the humans' system if these bits were the prediction of the humans' inductor. (For example, these $k$ bits could be a manipulative message).

Under this supposition, a large percentage of naturalized inductors in the universe are of this form. In particular, if the aliens are good at guessing $s$, then a large percentage of naturalized inductors that are fed a string starting with the prefix $s$ will be of this form. So the naturalized inductor used by the humans will, under reasonable anthropic theories, predict a message chosen by some alien civilization with high probability.
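The counting argument above can be sketched with a toy calculation. This is only an illustration under a simple "count the copies" anthropic assumption; the function name `alien_fraction` and the counts used are hypothetical and not part of the original argument.

```python
from fractions import Fraction


def alien_fraction(natural_inductors: int, alien_inductors: int) -> Fraction:
    """Fraction of inductors fed a string starting with s that are alien-built.

    Assumes (hypothetically) that every inductor fed the prefix s counts
    equally, which is the simple counting view of the anthropic update.
    """
    total = natural_inductors + alien_inductors
    if total == 0:
        raise ValueError("need at least one inductor")
    return Fraction(alien_inductors, total)


# Hypothetical counts: one natural Earth, a thousand alien-built inductors
# that were fed the same prefix s.
print(alien_fraction(1, 1000))  # 1000/1001
```

Under this counting view, the humans' inductor should put almost all of its probability on the alien-chosen continuation as soon as the aliens build many more inductors than there are natural ones.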

Simulations can be replaced with zoos

Plausibly, there's some way to tell that the aliens' simulations of humans are "not real". But what if the "simulations" are actually real physical planets? Since there are many galaxies, it would not be hard to use some of them to create "zoos" similar to planets that life naturally evolves on. The aliens can deploy nanobots in the zoos to influence what the human in the box says. This way, on most physical Earths, the naturalized inductor will be fed $k$ bits chosen by aliens when these bits are used to make an important decision.

Does importance-weighting rescue naturalized induction?

Maybe we should somehow distinguish "natural" Earths from "zoo" Earths. One way they are distinguished is that an individual "natural" Earth will have more influence over the universe than an individual "zoo" Earth. So maybe there is a sense in which we can make an importance-weighted naturalized inductor, which assigns higher weight to naturalized inductors used to make decisions that influence a large part of the universe.

Designing a good importance metric requires knowing which parts of the universe are important a priori. For example, if some regions of space are more human-habitable than other regions, then inductors near the more-habitable regions should be assigned more weight. Unfortunately, if the importance metric is "incorrect", then the aliens may exploit this fact by e.g. giving zoo Earths access to a region of space that is actually unimportant but which the importance metric thinks is important. (At some level, this is kind of like acausal trade: doing acausal trade with aliens can be a bad idea without knowing ahead of time what parts of space and other resources you value). Getting the importance metric right before even being able to do induction seems like a pretty hard problem. I'm pessimistic about this, and so was Paul when I talked to him about this.

Does a bound on computational requirements help?

Intuitively, it would be nice if the inductor just focused on modeling the local phenomenon of the human in the box, rather than positing aliens. It shouldn't be necessary to think about aliens to predict humans well; all the ideal importance-weighting should add up to normality.

Maybe we can use something like the speed prior to do this. Paul discusses this idea here. Roughly, this will penalize the "zoo Earth" hypothesis on the basis that it posits a large surrounding context.
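One standard way to make "penalize computation" concrete is a Levin-style cost that adds the logarithm of the runtime to the description length. The symbols $\ell(p)$ (length of program $p$ in bits) and $t(p)$ (its running time) are assumptions introduced here, and this is only an illustration, not necessarily the exact definition of the speed prior or the variant Paul has in mind:

$$w(p) \;\propto\; 2^{-\left(\ell(p) + \log_2 t(p)\right)} \;=\; \frac{2^{-\ell(p)}}{t(p)}$$

Under a weighting like this, each doubling of a hypothesis's runtime costs it one extra bit of penalty, so a hypothesis that explicitly simulates a large surrounding universe is disfavored relative to one that only models the local phenomenon.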

I don't expect the speed prior itself to work for this. In the regular Solomonoff induction case (rather than the naturalized induction case), the "zoo Earth" hypothesis doesn't have to explicitly represent the universe; it can instead do abstract reasoning about the probability that Earth is in a zoo. Such abstract reasoning can be both compact and efficient (e.g. it just says "observations are produced by running some algorithm similar to logical inductors on this physics"). It's not clear how "abstract reasoning" hypotheses fit into naturalized inductors, but it would be pretty weird if this combination solved the problem.

Paul points out that neural networks without weight sharing are probably benign; in some sense this is kind of like a severe penalty on additional computation (as each computation step in a neural network without weight sharing requires setting its parameters to the right value). Maybe it's useful to think about this case more in order to see if the same type of argument could apply to variants of neural networks that have weight sharing.

If I consider it likely that aliens created copies of me which are just like Earth-me but are going to see something completely different in the next hour, then it seems entirely rational for me to seriously consider the possibility that I'm not on Earth (and that therefore I am going to see weird things in the next hour). On the other hand, as you correctly observe in the part about importance weighting, if Earth-me has a much better chance of having large impact than the other copies, then I should behave as if I am Earth-me. This doesn't require defining an importance weighting by hand. It is enough that the agent is a consequentialist with the correct utility function.

The above reasoning doesn't really solve the problem, but rather moves it to a different place. How do we construct a consequentialist with the correct utility function? IMO, it is plausible that this can be solved using something like IRL. However, then we fall into the trap again. In IRL, the utility function is inferred by observing a "teacher" agent. If the aliens can pervert the agent's predictions concerning the teacher agent, they can pervert the utility function.

I think it is useful to think of the problem as having two tiers: In the first tier, we need to make sure the posterior probability of the correct hypothesis is in the same ballpark as the probabilities of the malicious hypotheses. In the second tier, we need to correctly deal with uncertainty assuming both the correct and the malicious hypotheses appear with non-negligible weights in the posterior.

To address the first tier, we need something like an anthropic update. Defining the anthropic update is tricky but we can address it indirectly by (i) allowing the agent to use its own source code with low weight in the complexity count; this way hypotheses of the form "look for a pointer in spacetime where this source code exists" become much simpler and maybe (ii) providing models of physics or even the agent's bridge rules that again can be used without a large complexity penalty.

To address the second tier, we can try to create a version of IRL that extracts instrumental values. That is, consider the agent's beliefs about the teacher's behavior at time $t$. For some values of $t$, the agent has high certainty because both the correct and the malicious hypotheses coincide. For other values of $t$, these hypotheses diverge and uncertainty results. Importantly, the latter case cannot happen for all values of $t$ all the time, since each time the teacher's behavior on a "problematic" $t$ is observed, the malicious hypothesis is penalized. Presumably, the attackers will design the malicious hypothesis to diverge from the correct hypothesis only for sufficiently late values of $t$, so that we cannot mount a defense just by having a large time span of passive observation. Now, imagine that you are running IRL while constraining the time discount function so that times with high uncertainty are strongly discounted. I consider it plausible that such a procedure can learn the instrumental goals of the teacher for the time span in which uncertainty is low. Optimizing for these instrumental goals should lead to desirable behavior (modulo other problems that are orthogonal to this acausal attack).
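A minimal sketch of what "discount times with high uncertainty" might look like for two hypotheses, assuming binary teacher actions. Everything here is hypothetical: the function `discount_weights`, the `sharpness` parameter, and the exponential form of the discount are illustrative choices, not a worked-out version of the proposal.

```python
import math


def discount_weights(pred_correct, pred_malicious, posterior_correct, sharpness=10.0):
    """Per-time-step weights that shrink where the two hypotheses disagree.

    pred_correct[t] and pred_malicious[t] are each hypothesis's probability
    that the teacher takes action 1 at time t (hypothetical inputs).
    posterior_correct is the current posterior weight on the correct hypothesis.
    """
    weights = []
    for p, q in zip(pred_correct, pred_malicious):
        # Posterior-mixture probability of action 1 at this time step.
        mix = posterior_correct * p + (1 - posterior_correct) * q
        # Posterior-weighted spread of the two predictions around the mixture:
        # zero when the hypotheses coincide, large when they diverge.
        spread = posterior_correct * abs(p - mix) + (1 - posterior_correct) * abs(q - mix)
        # Assumed exponential discount; any decreasing function would do.
        weights.append(math.exp(-sharpness * spread))
    return weights


# The hypotheses coincide early and diverge only for late t.
pred_correct = [0.9, 0.9, 0.9, 0.9]
pred_malicious = [0.9, 0.9, 0.1, 0.1]
weights = discount_weights(pred_correct, pred_malicious, posterior_correct=0.5)
```

Here the early time steps keep full weight, while the late, contested ones are discounted to nearly zero, so an IRL procedure using these weights would mostly fit the teacher's behavior on the span where the posterior is confident.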

Agree that IRL doesn't solve this problem (it just bumps it to another level).

The second tier thing sounds a lot like KWIK learning. I think this is a decent approach if we're fine with only learning instrumental goals and are using a bootstrapping procedure.
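For readers unfamiliar with KWIK ("knows what it knows") learning, here is a minimal sketch of the enumeration-style learner for a finite, deterministic hypothesis class. The class name `EnumerationKWIK` and the toy hypotheses are hypothetical, and the sketch assumes the true hypothesis is in the class and that `None` is never a real label.

```python
from typing import Callable, Hashable, Iterable, Optional

Hypothesis = Callable[[Hashable], Hashable]


class EnumerationKWIK:
    """Predicts only when every surviving hypothesis agrees."""

    def __init__(self, hypotheses: Iterable[Hypothesis]) -> None:
        self.version_space = list(hypotheses)

    def predict(self, x: Hashable) -> Optional[Hashable]:
        outputs = {h(x) for h in self.version_space}
        if len(outputs) == 1:
            return outputs.pop()
        return None  # "I don't know": the hypotheses disagree on x

    def observe(self, x: Hashable, y: Hashable) -> None:
        # Keep only hypotheses consistent with the observed label.
        self.version_space = [h for h in self.version_space if h(x) == y]


# Toy stand-ins: the two hypotheses coincide for t < 5 and diverge afterwards.
correct = lambda t: 0
malicious = lambda t: 0 if t < 5 else 1

learner = EnumerationKWIK([correct, malicious])
print(learner.predict(3))  # 0
print(learner.predict(7))  # None
learner.observe(7, 0)      # the teacher's true behavior at t = 7 is revealed
print(learner.predict(8))  # 0
```

Each "I don't know" followed by an observation eliminates at least one hypothesis, which mirrors the point that the malicious hypothesis is penalized whenever the teacher's behavior at a "problematic" $t$ is observed.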

KWIK learning is definitely related in the sense that we want to follow a "conservative" policy that is risk averse w.r.t. its uncertainty regarding the utility function, which is similar to how KWIK learning doesn't produce labels about which it is uncertain. Btw, do you know which of the open problems in the Li-Littman-Walsh paper are solved by now?

I don't know which open problems have been solved.