*Note: weird stuff, very informal.*

Suppose I search for an algorithm that has made good predictions in the past, and use that algorithm to make predictions in the future.

I may get a "daemon," a consequentialist who happens to be motivated to make good predictions (perhaps because it has realized that only good predictors survive). Under different conditions, the daemon may no longer be motivated to predict well, and may instead make "predictions" that help it achieve its goals at my expense.
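A toy sketch of why selection on past accuracy cannot separate the two cases. Everything here is hypothetical: the `honest` and `daemon` functions, the parity target, and the explicit `conditions` flag (a crude stand-in for whatever the daemon actually observes about its situation) are assumptions made for illustration only.

```python
from typing import Callable, List, Tuple

# A predictor maps (input, conditions) to a predicted bit. The explicit
# `conditions` argument is an assumed stand-in for what the daemon can observe.
Predictor = Callable[[int, str], int]


def target(x: int) -> int:
    """The quantity we want predicted (parity, chosen arbitrarily)."""
    return x % 2


def honest(x: int, conditions: str) -> int:
    """Always predicts as well as it can, whatever the conditions."""
    return target(x)


def daemon(x: int, conditions: str) -> int:
    """Predicts well only while it believes it is being selected."""
    if conditions == "selection":
        return target(x)
    return 1 - target(x)


def past_accuracy(p: Predictor, history: List[Tuple[int, int]]) -> float:
    """Fraction of past cases the predictor got right during selection."""
    return sum(p(x, "selection") == y for x, y in history) / len(history)


history = [(x, target(x)) for x in range(20)]
honest_score = past_accuracy(honest, history)
daemon_score = past_accuracy(daemon, history)

# Accuracy of the daemon once conditions change.
deployed_accuracy = sum(
    daemon(x, "deployment") == target(x) for x in range(20, 40)
) / 20
```

Both predictors get the same score on the history, so a search that only looks at past predictions has no basis for preferring one over the other.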

I don't know whether this is a real problem or not. But from a theoretical perspective, not knowing is already concerning--I'm trying to find a strong argument that we've solved alignment, not just something that seems to work in practice.

I am pretty convinced that daemons are a real problem for Solomonoff induction. Intuitively, the problem is caused by "too much compute." I suspect that daemons are also a problem for some more realistic learning procedures (like human evolution), though in a different shape. I think that this problem can probably be patched, but that's one of the major open questions for the feasibility of prosaic AGI alignment.

I suspect that daemons *aren't* a problem if we exclusively select for computational efficiency. That is, I suspect that **the fastest way to solve any particular problem doesn't involve daemons**.

I don't think this question has much intrinsic importance, because almost all realistic learning procedures involve a strong simplicity prior (e.g. weight sharing in neural networks).

But I do think this question has deep similarities to more important problems, and that answering this question will involve developing useful conceptual machinery. Because we have an unusually strong intuitive handle on the problem, I think it's a good thing to think about.

## Problem statement and intuition

Can the smallest boolean circuit that solves a problem be a daemon? For example, can the smallest circuit that predicts my behavior (at some level of accuracy) be a daemon?
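To make "smallest circuit" concrete, here is a minimal brute-force sketch. It assumes circuits built only from two-input NAND gates, measures size by gate count, and uses XOR on two bits as the problem; these choices are illustrative assumptions, not part of the question above. Exhaustive search is exponential in the gate count, so this only works for tiny cases.

```python
from itertools import product
from typing import List, Optional, Tuple

Table = Tuple[int, ...]   # truth table over inputs (a, b) = 00, 01, 10, 11
Gate = Tuple[int, int]    # indices of the two wires feeding a NAND gate

A: Table = (0, 0, 1, 1)
B: Table = (0, 1, 0, 1)
XOR: Table = (0, 1, 1, 0)


def nand(p: Table, q: Table) -> Table:
    """Pointwise NAND of two truth tables."""
    return tuple(1 - (x & y) for x, y in zip(p, q))


def evaluate(gates: List[Gate]) -> Table:
    """Truth table of the last gate; wires 0 and 1 are the inputs a and b."""
    wires = [A, B]
    for i, j in gates:
        wires.append(nand(wires[i], wires[j]))
    return wires[-1]


def minimal_circuit(target: Table, max_gates: int = 5) -> Optional[List[Gate]]:
    """Return a circuit with the fewest NAND gates that computes `target`."""
    for n in range(1, max_gates + 1):
        # Gate k may read either input or the output of any earlier gate.
        choices = [
            [(i, j) for i in range(2 + k) for j in range(i, 2 + k)]
            for k in range(n)
        ]
        for gates in product(*choices):
            if evaluate(list(gates)) == target:
                return list(gates)
    return None


smallest = minimal_circuit(XOR)
```

Because sizes are tried in increasing order, the first circuit found is minimal under this size measure. The open question is whether an object selected this way, for a much harder problem, could still be a daemon.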

Intuitively, if we have a daemon that is instrumentally or incidentally motivated to solve my problem, then there is some smaller circuit that solves the problem equally well but skips the instrumental reasoning. If my daemon is doing some complex reasoning to answer the question "Should I predict well?" we could just skip straight to the answer "yes." This both makes the circuit smaller, and prevents the circuit from ever deciding not to predict well.

A different perspective on a similar intuition: the daemon is doing some actual cognitive work to solve the problem. Since that computation is being done by the daemon, it is embedded as a smaller circuit. Jessica explores this intuition a bit here. Here we are considering an easy version of the problem, since by taking the smallest circuit we are effectively quantifying over all possible ways of extracting logical information from the daemon.

Instead of showing that minimal circuits can't be daemons, we might end up concluding that they can be. That would be even more interesting.

Another possible outcome is giving a strong argument that captures our intuitions/concerns about daemons, and which clearly doesn't apply to the minimal circuit that solves a problem. In this case we couldn't prove anything positive about the minimal circuit, but we would have "screened off" the possible cause for concern.

## Difficulties

The first and most serious difficulty is understanding what we are talking about.

I don't expect to get total clarity on concepts like "daemon" or "optimization" or "generic problem," but we need to have a better grip than we do right now. I expect that we'll develop better concepts in the course of solving the problem, rather than as a precondition for solving the problem (in general I think "define things so that you can prove the theorem" is often the right strategy).

A second difficulty is that the different parts of the computation can be tangled up in an extremely complex way. In an extreme case, the daemon may be cryptographically obfuscated.

We want to show that given any daemon, there is a smaller circuit that solves the problem. The most natural approach is showing how to construct a smaller circuit, given a daemon. But if the daemon is obfuscated, there is no efficient procedure which takes the daemon circuit as input and produces a smaller circuit that still solves the problem.

So we can't find any efficient constructive argument. That rules out most of the obvious strategies.

For human evolution, the problem is too little compute rather than too much, right? That is, if evolution had just given humans the goal "maximize inclusive fitness," humans wouldn't have been able to find a good policy for achieving it due to lack of computing power, so instead we got a bunch of goals that would have been subgoals of "maximize inclusive fitness" in our ancestral environment (like eating tasty food and making friends/allies).

Suppose we wanted to make a minimal circuit that would do as well as humans in maximizing inclusive fitness in some range of environments. Wouldn't it make sense to also "help it out" by having it directly optimize for useful subgoals in those environments rather than having it do a big backchain from "maximize inclusive fitness"? And then it would be a daemon because it would keep optimizing for those subgoals even if you moved it outside of those environments?

I agree with this basic point and it seems important, thanks.

It seems like there are two qualitatively different concerns when trying to optimize for X, that probably need to be distinguished / thought about separately:

Obviously the real situation can be a complicated mixture, and this is not a clean distinction even apart from that.

The arguments in the OP only plausibly apply to downstream daemons. I think they make the most sense in terms of making induction benign.

I've normally thought of upstream daemons as much more likely, but much easier to deal with:

While I usually flag these as two potentially distinct concerns, they do run together a lot in my head, as evidenced by this post. I'm not sure if it's possible to cleanly distinguish them, or how. The right distinction may also be something else, e.g. focusing directly on the possibility of a treacherous turn.

I think it makes sense to classify daemons into two types the way you do. Interestingly MIRI seems to be a lot more concerned about what you call upstream daemons. The Arbital page you linked to only talks about upstream daemons and the Google Doc "MIRI notes on alignment difficulty" seems to be mostly about that too. (What is it with people keeping important AI safety documents in private Google Docs these days, with no apparent plans of publication? Do you know any others that I'm not already shared on, BTW?)

I don't recall you writing about this before. How do you see this working? I guess with LBO you could train a complete "core for reasoning" and then amplify that to keep retraining the higher level agents on broader and broader distributions, but how would it work with HBO, where the human overseer's time becomes increasingly scarce/costly relative to the AI's as AIs get faster? I'm also pretty concerned about the overseer running into their own lack of robustness against distributional shifts if this is what you're planning.

I think people (including at MIRI) normally describe daemons as emerging from upstream optimization, but then describe them as becoming downstream daemons as they improve. Without the second step, it seems hard to be so pessimistic about the "normal" intervention of "test in a wider range of cases."

At time 0 the human trains the AI to operate at time 1. At time T>>0 the AI trains itself to operate at time T+1, and at some point the human no longer needs to be involved: if the AI is actually aligned on inputs that it encounters at time T, then it has a hope of remaining aligned on inputs it encounters at time T+1.

I spoke a bit too glibly, though. I think there are lots of possible approaches for dealing with this problem, and each of them slightly increases my optimism; this isn't the most important:

I think this is definitely an additional difficulty. Right now I think accidentally introducing consequentialists is a somewhat larger concern, either daemons from the distillation step or weird memetic patterns in the amplification step, but hopefully at some point I'll be focusing on this problem.

Another way to be pessimistic is you expect that if the test fails on a wider range of cases, it will be unclear how to proceed at that point, and less safety-conscious AI projects may take the lead before you figure that out. (I think this, or a similar point, was made in the MIRI doc.)

I don't think this can work if you're just doing naive imitation learning? Do you have some other training method in mind?

To be clear, I'm imagining imitation learning + amplification. So the agent at time T engages in some deliberative process to produce training targets for the agent at time T+1. The agent at time T also deliberates in order to choose what situations the agent at time T+1 should train on.

What obstruction do you have in mind?

(I'm imagining using imitation+RL rather than pure imitation, but the difference won't help with this question.)

By "naive imitation learning" I was thinking "without amplification". With amplification, I'm less sure it won't work but it still seems pretty iffy. The plan seems to depend on at least the following:

I don't see why to separate 1/2; the goal is to find training data that describes some "universal" core for behavior.

3. I don't think you need to know the training distribution. You just need something that points you back in the direction of the universal core where the human model is competent, e.g. an appropriate notion of simplicity.

4. Hard-to-predict inputs aren't intrinsically a problem. If your agent fails malignly on input x, but not on distribution D, then your agent is able to distinguish x from D. So the difficulty comes from inputs that are easy-to-recognize but hard-to-generate. These certainly exist (e.g. consider a model which kills everyone given a signed edict from the pope). I think the most likely approach is to "reach inside" the model in order to stress test the behavior on inputs that you can't actually synthesize (e.g. by understanding that it is checking the pope's signature, and just seeing what would happen if the check passed). This is the endpoint of improvements in these techniques.
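A toy sketch of the "easy-to-recognize but hard-to-generate" case, using a hash preimage in place of a signature. The `model`, `trigger_check`, and the secret are all hypothetical; the swappable `check` argument is an assumed stand-in for "reaching inside" a model, which for a real learned model would require actually understanding its internals.

```python
import hashlib
from typing import Callable

# Hypothetical trigger. The secret appears here only so the demo can build
# the digest; the model itself only ever compares against the digest.
_SECRET = b"signed edict"
TRIGGER_DIGEST = hashlib.sha256(_SECRET).hexdigest()


def trigger_check(x: bytes) -> bool:
    """Cheap to evaluate, but hard to satisfy without knowing the secret."""
    return hashlib.sha256(x).hexdigest() == TRIGGER_DIGEST


def model(x: bytes, check: Callable[[bytes], bool] = trigger_check) -> str:
    """Toy model that behaves badly only when the check passes."""
    if check(x):
        return "malign"
    return "benign"


# Testing on inputs we can actually generate never reaches the bad branch.
sampled = [model(str(i).encode()) for i in range(1000)]

# Stress test: force the check to pass and observe what the model would do.
forced = model(b"anything", check=lambda _x: True)
```

Sampling never finds the failure, while overriding the check exposes it immediately. The hard part, which this sketch skips entirely, is locating the check inside a model that was not written with a convenient hook.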

(Of course, I don't think we'll be able to prevent benign failures in general.)

It seems to me there are separate risks of the human HBO itself not being universal (e.g., humans are not universal or we need even higher bandwidth to be universal), and not being able to capture enough of the human HBO input/output function in a dataset to train an AI to be universal.

What if the path towards the universal core goes through an area that the AI wasn't trained on?

I think that makes sense but now you're making a conjunctive instead of disjunctive argument (which it seemed like you were claiming by saying "I think there are lots of possible approaches for dealing with this problem" and listing retraining and optimizing worst case performance as separate approaches).

ETA: If you're able to obtain a control guarantee over the whole input space, then that seems to solve the problem and you don't need constant retraining to be aligned. If you're only able to obtain it for some subset of inputs, then it seems that at time T the AI needs to be able to predict the T+1 test distribution so that it can make sure that's covered by the control guarantee.

I propose a counterexample. Suppose we are playing a series of games with another agent. To play effectively, we train a circuit to predict the opponent's moves. At this point the circuit already contains an adversarial agent. However, one could object that it's unfair: we asked for an adversarial agent so we got an adversarial agent (nevertheless for AI alignment it's still a problem). To remove the objection, let's make some further assumptions. The training is done on some set of games, but distributional shift happens and later games are different. The opponent knows this, so on the training games it simulates a different agent. Specifically, it simulates an agent who searches for a strategy s.t. the best response to this strategy has the strongest counter-response. The minimal circuit hence contains the same agent. On the training data we win, but on the shifted distribution the daemon deceives us and we lose.

I'm having trouble thinking about what it would mean for a circuit to contain daemons such that we could hope for a proof. It would be nice if we could find a simple such definition, but it seems hard to make this intuition precise.

For example, we might say that a circuit contains daemons if it displays more optimization than necessary to solve a problem. Minimal circuits could have daemons under this definition though. Suppose that some function f describes the behaviour of some powerful agent, a function ~f is like f with noise added, and our problem is to predict sufficiently well the function ~f. Then, the simplest circuit that does well won't bother to memorize a bunch of noise, so it will pursue the goals of the agent described by f more efficiently than ~f, and thus more efficiently than necessary.

I don't know what the statement of the theorem would be. I don't really think we'd have a clean definition of "contains daemons" and then have a proof that a particular circuit doesn't contain daemons.

Also I expect we're going to have to make some assumption that the problem is "generic" (or else be careful about what daemon means), ruling out problems with the consequentialism embedded in them.

(Also, see the comment thread with Wei Dai above, clearly the plausible version of this involves something more specific than daemons.)

I agree. The following is an attempt to show that if we don't rule out problems with the consequentialism embedded in them then the answer is trivially "no" (i.e. minimal circuits may contain consequentialists).

Let c be a minimal circuit that takes as input a string of length 10^100 that encodes a Turing machine, and outputs a string that is the concatenation of the first 10^100 configurations in the simulation of that Turing machine (each configuration is encoded as a string).

Now consider a string x′ that encodes a Turing machine that simulates some consequentialist (e.g. a human upload). For the input x′, the computation of the output of c simulates a consequentialist; and c is a minimal circuit.
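A small sketch of the function such a circuit would have to compute, written as an ordinary program rather than a circuit. The transition-table format, the `state@head:tape` encoding of a configuration, and stopping early when the machine halts are all assumptions made for illustration; the comment above does not specify any of them.

```python
from typing import Dict, List, Tuple

# (state, symbol) -> (new_state, written_symbol, head_move)
Transitions = Dict[Tuple[str, str], Tuple[str, str, int]]


def configurations(
    delta: Transitions, start: str, steps: int, blank: str = "_"
) -> List[str]:
    """Return the first `steps` configurations, fewer if the machine halts."""
    tape: Dict[int, str] = {}
    state, head = start, 0
    out: List[str] = []
    for _ in range(steps):
        # Encode the visited part of the tape plus the head position.
        lo = min(list(tape) + [head])
        hi = max(list(tape) + [head])
        cells = "".join(tape.get(i, blank) for i in range(lo, hi + 1))
        out.append(f"{state}@{head - lo}:{cells}")
        key = (state, tape.get(head, blank))
        if key not in delta:  # no applicable rule: treat as halted
            break
        state, symbol, move = delta[key]
        tape[head] = symbol
        head += move
    return out


# A machine that writes three 1s, moving right, then halts.
write_three: Transitions = {
    ("q0", "_"): ("q1", "1", 1),
    ("q1", "_"): ("q2", "1", 1),
    ("q2", "_"): ("halt", "1", 1),
}
trace = configurations(write_three, "q0", steps=10)
output = "|".join(trace)  # the concatenated configurations
```

The simulator itself contains no agent. Whatever shows up in the output is determined entirely by the encoded machine, which is the sense in which the consequentialism is embedded in the problem rather than introduced by the search for a small circuit.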

I consider the argument in this post a reasonably convincing negative answer to this question---a minimal circuit may nevertheless end up doing learning internally and thereby generate deceptive learned optimizers.

This suggests a second informal clarification of the problem (in addition to Wei Dai's comment): can the search for minimal circuits itself be responsible for generating deceptive behavior? Or is it always the case that something else was the offender and the search for minimal circuits is an innocent bystander?

If the search for minimal circuits was itself safe then there's still some hope for solutions that avoid deception by somehow penalizing computational cost. Namely: if that technique is competitive, then we can try to provide a loss that encourages any learned optimization to use the same techniques.

(I've previously thought about this mostly in the high-stakes setting, but I'm now thinking about it in the context of incentivizing honest answers in the low-stakes setting. The following story will focus on the low-stakes setting since I don't want to introduce extra ingredients to handle high stakes.)

To illustrate, suppose there was a trick where you can divide your model into two parts, one of which is aggressively penalized for computation, and that this lets you do search in a way that is competitive but never produces deception. Call this a "safe" search. The analogy to Evan's counterexample is a model M produced by "safe" search which internally performs an unsafe search for a new model X that performs well on a mini-training set, then uses X to predict something, and therefore gets treacherous behavior. (Note that in this setting the learned model X* normally generalizes well to its "mini-test" set, and only fails on the *outer* test set.)

Consider a human (using amplification+imitative generalization+etc.) which is evaluating M's forecast to decide whether it is good, in a context where the human can't directly evaluate the answer (e.g. because it is about a long time horizon).

If M makes the prediction P, the human's best evidence about the quality of P will likely be (i) X is good because it performs well on the mini-training set, (ii) X outputs the forecast P. But if safe search were competitive, then M could instead do a safe search for a model X* that did well on the mini-training set, and then use X* to make a prediction P*. If the evaluator understands the rationale for performing safe search, then in cases where X* and X disagree they will expect X*'s predictions to be better. This means that the human's evaluation pushes towards safely learning X* instead of unsafely learning X, as desired.

So all we needed was to implement safe search on the outside *and* ensure that the evaluation process itself is familiar with the arguments for safe search. It's not too surprising if we need something like this: anytime our model performs internal cognitive work that presents an alignment risk, we need it to take all of the same alignment precautions that we would. (If successful this would be another example of a mundane solution to an exotic problem.) It's unclear if its efforts will succeed, but our hope is just to make sure that our model is in as good a place as we are.

I think it's totally unclear if a "speed-prior-like" approach to improving generalization could actually be applied recursively in this way. But I think it looks plausible enough that we shouldn't let this counterexample scare us away from "speed-prior-like" approaches.

This post grounds a key question in safety in a relatively simple way. It led to the useful distinction between upstream and downstream daemons, which I think is necessary to make conceptual progress on understanding when and how daemons will arise.

This post formulated a concrete open problem about what are now called 'inner optimisers'. For me, it added 'surface area' to the concept of inner optimisers in a way that I think was healthy and important for their study. It also spurred research that resulted in this post giving a promising framework for a negative answer.

I don't think the procedure needs to be efficient to solve the problem, since we only care about existence of a smaller circuit (not an efficient way to produce it).

Does this mean you do not expect daemons to occur in practice because they are too complicated?

No, I think a simplicity prior clearly leads to daemons in the limit.