"Stable self-improvement" seems to be a primary focus of MIRI’s work. As I understand it, the problem is "How do we build an agent which rationally pursues some goal, is willing to modify itself, and with very high probability continues to pursue the same goal after modification?"

The key difficulty is that it is impossible for an agent to formally "trust" its own reasoning, i.e. to believe that "anything that I believe is true." Indeed, even the natural concept of "truth" is logically problematic. But without such a notion of trust, why should an agent even believe that its own continued existence is valuable?

I agree that there are open philosophical questions concerning reasoning under logical uncertainty, and that reflective reasoning highlights some of the difficulties. But I am not yet convinced that stable self-improvement is an especially important problem; I think it would be handled correctly by a human-level reasoner as a special case of decision-making under logical uncertainty. This suggests that (1) it will probably be resolved en route to human-level AI, and (2) it can probably be "safely" delegated to a human-level AI. I would prefer for energy to be used on other aspects of the AI safety problem.


Consider an agent A which shares our values and is able to reason "as well as we are"---for any particular empirical or mathematical quantity, A's estimate of its expectation is as good as ours. For notational convenience, suppose that A's preferences are the same as "our" preferences, and let U be the associated utility function.

Now suppose that A is thinking about an outcome including the existence of an agent B. (Perhaps B is a new AI that A is considering designing; perhaps B is a version of A that has made some further observations; whatever.) We'd like the agent to evaluate this outcome on its merits. It should think about how good the existence of B is. If B also maximizes U, then A should correctly understand that B's existence will tend to be good.

The expected value of U conditioned on this outcome is just another empirical quantity. If A is as good at estimation as humans, then it won't predictably over- or under-estimate this quantity. And so it will weigh B's existence correctly when considering the consequences of its actions.

So if we really had a “human-level” reasoner in the sense I assumed at the outset, our problem would be solved. There are a number of reasons to think the problem might be important anyway. I haven’t seen any of these arguments fleshed out in much detail, and for the most part I am skeptical.

#Self-modification requires high confidence

If we anticipate a long sequence of ever-more-powerful AI's, then we might want to be very sure that each change is really an improvement. There are two sides to this concern.

First is the idea that an AI might not exercise sufficient caution when designing a successor. But if the AI has well-calibrated beliefs and shares our values, then by construction it will make the appropriate tradeoffs between reliability and efficiency. So I don't take this concern very seriously.

Second is the concern that, if the required confidence is very high, then it might be very difficult to be confident enough to go ahead with a proposed AI design. In this scenario, an AI might correctly realize that it should not make any risky changes; but this restriction might introduce unacceptable efficiency losses. While the "good guys" proceed cautiously, competitors will race ahead (allowing their systems' values to change over time).

On this view, by working out these issues farther in advance we can save some time for the “good guys,” or push research in a direction which makes their task easier.

But this problem can be straightforwardly delegated to machine intelligences. An AI with human-level reasoning is also human-level at assessing the reliability of a system or engineering a highly-reliable system. We are left with a quantitative question: how useful is doing this work in advance, rather than doing it as it is needed?

My intuition is "not very valuable," and I don't think anyone has yet made a strong argument for the other side. I think the main disagreement is whether this is an extra-especially hard problem. Briefly, I think:

  1. The biggest concerns are from design errors that would conceal themselves (presumably because the resulting AI becomes motivated to conceal them). It's hard to make changes that are bug-free, but it's probably much easier to avoid this very specific kind of bug, which (1) leaves the program still behaving almost exactly as intended, but (2) subtly changes what it's optimizing.

  2. You can achieve relatively high confidence using normal empirical methods. Bugs arise in unanticipated and untested situations, but those aren't mostly the kinds of bugs you are worried about (see #1). It is unlikely that an unanticipated situation would cause bugs for several different supervisory systems at once.

  3. Previous versions of a system can literally watch their successors operate and ensure that new changes function as intended. New systems can be deployed in such a way that deception is very hard to conceal (even the cost-benefit analysis necessary to decide when to deceive is presumably the result of some calculation which can be monitored).

  4. Before AI is doing most of the AI design work, bugs are not catastrophic. Once AI is doing most of the AI design work, you don't have to make many more changes before you have massively increased your AI design capabilities (obsoleting your earlier work). For example, 10 improvements of +30% efficiency (which is a relatively small improvement for a significant software change) result in a >13x total improvement. So talking about "billions" of changes is misleading; in the most extreme case we should maybe talk about "tens" of changes.

  5. Applying changes in batches rather than continuously doesn't really matter to productivity. Waiting until you have assembled a +30% productivity increase before deploying a package of changes reduces your speed by less than 25%, possibly much less.

  6. Just because a change can be decomposed as many smaller changes doesn't mean its probability of failure is overwhelming. Normal software often fails, but each change is composed of a large number of small features. Moreover, the failures of many related changes are highly correlated.

  7. If a proposed change results in only a small productivity benefit, then it is particularly easy for the original system to provide end-to-end oversight for the improved version, e.g. listening to its arguments and evaluating its final product. And there should be very few changes with big productivity benefits before you have obsoleted your original investment.

  8. Bugs are most challenging in a security setting, where adversaries search for improbable corner cases—when your buffer overflows, you don’t normally expect it to overwrite your code in a subtle way, unless you are facing an adversary. There is generally no adversary in AI design, and I don’t know of any promising approaches to AI safety that involve hardening software against even an implicit adversary. The closest you get is the AI’s adversarial desire to maximize its goals; I don’t think this is relevantly similar, but I’m happy to have an argument about it.

  9. There are many pairs of “nearby” agents such that one is “friendly” and the other “unfriendly.” In these cases it might be easy to make a programming misstep that takes you from one category to the other. But this is nearly irrelevant; we get to choose what AI we design, and in particular we can focus on designs that are robustly good (and which are separated from adversarial deceivers by a chasm of non-functional or obviously malignant intermediates).
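
The arithmetic behind points 4 and 5 is easy to check. The sketch below is only an illustration: the function names are hypothetical, and the batching estimate assumes one simple reading of point 5, namely that the worst case is running at the old speed for the entire time a +30% batch is being assembled.

```python
def compound_speedup(gain_per_change: float, num_changes: int) -> float:
    """Total speedup from applying `num_changes` multiplicative improvements."""
    return (1.0 + gain_per_change) ** num_changes


def worst_case_batching_slowdown(batch_gain: float) -> float:
    """Fraction of speed lost by running at the old speed instead of the
    improved speed (assumed worst case while a batch is being assembled)."""
    return 1.0 - 1.0 / (1.0 + batch_gain)


# Figures taken from points 4 and 5: ten changes of +30% each, batches of +30%.
total = compound_speedup(0.30, 10)
slowdown = worst_case_batching_slowdown(0.30)
print(f"ten +30% changes: {total:.2f}x")
print(f"worst-case batching slowdown: {slowdown:.1%}")
```

Under these assumptions the total comes to roughly 13.8x and the slowdown to about 23%, consistent with the ">13x" and "less than 25%" figures above.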

I could talk more about this, but I don't have a good enough handle on the arguments I should be addressing.

#Standards of reasoning can't be outsourced

When we design an AI, we are (at least implicitly) specifying what kind of reasoning it considers “valid.” If we get the answer wrong, for example by leaving out some important pattern of reasoning X that humans accept, then the problem might be permanent: our first AI thinks that accepting X is an error, so it designs successors that also reject X. In principle, the result could be a system which is an effective reasoner but which is not able to reason about its own behavior.

I’m skeptical. An effective reasoner interested in making empirical predictions will tend to (provisionally) accept whatever patterns of reasoning lead to correct predictions. This can include the laws of arithmetic just as well as it can include natural laws. (See my writeup of this view here.) If some pattern of reasoning X is important for making accurate predictions then an effective AI will accept X, at least as suggestive evidence. If this is how human reasoning works, then any sufficiently effective reasoner would recover the same patterns of reasoning.

I would be surprised if, in contrast to this view, human brains were simply constituted to automatically accept certain rules of logic. Certainly the history of logic and mathematics suggests that rules of reasoning are subject to debate, and are developed to fit the empirical facts. And even if human brains are wired to reason logically, they were produced by natural selection (which most definitely wasn't).

There may be a remaining problem in understanding how a system can learn to treat a pattern of reasoning as a useful source of evidence, and more generally where human logical reasoning comes from. I think these are interesting questions, but (1) they are quite distant from the current approach to stable self-improvement, and (2) I suspect they have to be resolved to produce human-level reasoners.

Even more clear is that humans don't have any kind of axiomatic "self-trust;" they trust themselves, to the extent they do, based on empirical observations of their own trustworthiness. This brings us to…

#Self-improvement highlights open questions about reasoning

In some sense self-modification is just a special case of reasoning under logical uncertainty. But we don't understand reasoning under logical uncertainty in general; reflective reasoning might be a productive challenge problem for thinking about logical uncertainty. I agree with this, but it doesn’t seem to be the motivation for MIRI’s research program.

For example, I think that the attitude an agent has towards its own judgments should be similar to the attitude it has towards the views of wise peers in general. These views aren’t characterized by strong, monotonic forms of trust (such that if a peer says X, I believe X regardless of what other evidence is available). Instead, I view my peers’ judgment as evidence—potentially strong evidence, depending on how much I trust them—which might be overturned by new considerations.
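
One way to picture this non-monotonic kind of trust is as an ordinary Bayesian update in odds form. The sketch below is only an illustration under made-up assumptions: the prior and both likelihood ratios are hypothetical numbers, and `bayes_update` is a name introduced here, not something from the discussion above.

```python
def bayes_update(prior: float, likelihood_ratio: float) -> float:
    """Posterior probability after evidence with the given likelihood ratio
    P(evidence | X) / P(evidence | not X), computed in odds form."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


prior = 0.5  # hypothetical starting credence in X

# A trusted peer asserts X: treated as evidence favoring X 4:1 (assumed).
after_peer = bayes_update(prior, 4.0)

# Later, independent evidence against X arrives at 16:1 (assumed).
after_counterevidence = bayes_update(after_peer, 1.0 / 16.0)

print(after_peer, after_counterevidence)
```

The peer's statement raises the credence in X, but it does not lock it in: sufficiently strong later evidence pulls the credence back below the starting point.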

On this perspective, Gödelian difficulties don’t seem like especially problematic cases. If I learn that I assign the sentence X = “I assign this sentence less than 50% probability” a probability of 49%, then I change my mind and start believing X. Similarly, if I learn “I assign Y a probability of 49%; also Y” then I change my mind and start believing Y. Though I trust myself, if I learn further evidence it can screen off my previous beliefs. Sometimes, knowing exactly what I believe can be further evidence in and of itself, screening off the content of those beliefs. (See here for a discussion of my views.)

It’s still unclear how an agent learns that some source is trustworthy (just like it’s unclear how it learns that most ravens are black). And it would be great to understand how the output of a reasoning process can constitute evidence (as a deductive fact, rather than an inductive generalization or as an axiom). But these are rather different questions, which one would attack with rather different techniques.

In contrast, techniques for “licensing” the creation of successors are not promising on this perspective, unless they correspond to actual improvements in an agent’s predictions.

6 comments

Paul, do you have a list of "other aspects of the AI safety problem" that you think should be prioritized higher?

I would be curious to see more thoughts on this from people who have thought more than I have about stable/reliable self-improvement/tiling. Broadly speaking, I am also somewhat skeptical that it's the best problem to be working on now. However, here are some considerations in favor:

It seems plausible to me that an AI will be doing most of the design work before it is a "human-level reasoner" in your sense. The scenario I have in mind is a self-improvement cycle by a machine specialized in CS and math, which is either better than humans at these things, or is changing too rapidly for humans to effectively help it. This would create what Bostrom has called (in private correspondence) a "competence gap", where the AI can and does self-improve, but may not solve the tiling problem or balance risk the way we would have liked it to. In this case, being able to solve this problem for it directly is helpful.

30% efficiency improvement seems quite large, even for major software changes, in machine learning. I'm not sure how much this affects your overall point.

On the value of work now vs. later, I would probably try to determine this mostly by thinking about how much this work will help us grow interest in the area among people who will wield useful skills and influence later. So far, work on the Löbian obstacle has been pretty good on this metric (if you count it as partially responsible for attracting Benja and Nate, attention from mathematicians, its importance to past workshops, Nik Weaver, etc.).

I'll very quickly remark that I think that the competence gap is indeed the main issue. If we imagine an AI built to a level where it was as smart as all the mathematicians who could work on the problem in advance, but able to do the same work faster, which didn't use any self-improvement along the way, and it was otherwise within a Friendliness framework that well-decided its preferences among what decision framework would control whatever stability framework it invented, then clearly there's no advantage to trying to do the work in advance. But I think the competence gap is much larger than that zero level.

Note that we care about the gap between {Ability to design powerful AI} and {Ability to design powerful AI that will do what the original AI wants}. I think the main difference is that you see the second one as a super-hard problem. I don't see it as a super-hard problem, especially if we have already successfully built one AI that does what we want. I tried to flesh out this disagreement in the post.

I do see a gap as plausible, since I expect capabilities to be uneven and who knows what will come first.

But it would be surprising if an AI was good at figuring out what other AI's would be effective, but wasn't able to understand that it was itself effective, since presumably these other AI's would be quite similar to itself, and would be leveraging the same insights. The concern seems to be the case where the AI understands why it is able to do so much cool stuff, but is not able to understand why it is motivated to do the right cool stuff (and can't figure it out, despite the motivation to do so and the availability of human explainers who do understand).

To me this scenario seems unlikely. I assume you have a different picture than I do.

I think the main disagreement is about whether it's possible to get an initial system which is powerful in the ways needed for your proposal and which is knowably aligned with our goals; some more about this in my reply to your post, which I've finally posted, though there I mostly discuss my own position rather than Eliezer's.

I enjoyed this and found it to be a surprising deconstruction of the goal of provably safe self-modification.

I think there is also a more general thrust toward reflectively consistent AI architectures, which has been quite fruitful in highlighting open problems. This could be justified in terms of self-modification (and probably has been in most cases), but also might stand on its own as a reasonable desideratum.

I'm not fully convinced on the "standards of reasoning can't be outsourced" point.

As things stand, I don't think there is a plausible story for how an AI which started out having uncertainty over theories in 1st-order logic (as has been discussed fairly extensively) could later come to conceive of the standard model for the natural numbers, and other such concepts which lack a finite or R.E. axiomatization in 1st-order logic (or in any effective logic). This is just Skolem's paradox.

The best story which I can surmise is that the axioms of set theory may be accepted on pragmatic grounds (they allow convenient description of many useful entities). This would then allow the existence and uniqueness of the standard model to be proved relative to those axioms.

Actually, this isn't so bad; I think I habitually give this explanation too little credit.

I'm concerned, though; my feeling is that there should be something more resolving Skolem's paradox (a difference in how we perform probabilistic reasoning for 2nd-order entities as opposed to 1st-order). If there is something more, it seems possible that an AI would miss it (view it as human irrationality).