Stable self-improvement as a research problem

Paul, do you have a list of "other aspects of the AI safety problem" that you think should be prioritized higher?

I would be curious to see more thoughts on this from people who have thought more than I have about stable/reliable self-improvement/tiling. Broadly speaking, I am also somewhat skeptical that it's the best problem to be working on now. However, here are some considerations in favor:

It seems plausible to me that an AI will be doing most of the design work before it is a "human-level reasoner" in your sense. The scenario I have in mind is a self-improvement cycle by a machine specialized in CS and math, which is either better than humans at these things, or is changing too rapidly for humans to effectively help it. This would create what Bostrom has called (in private correspondence) a "competence gap", where the AI can and does self-improve, but may not solve the tiling problem or balance risk the way we would have liked it to. In this case, being able to solve this problem for it directly is helpful.

A 30% efficiency improvement seems quite large in machine learning, even for a major software change. I'm not sure how much this affects your overall point.

On the value of work now vs. later, I would probably try to determine this mostly by thinking about how much this work will help us grow interest in the area among people who will wield useful skills and influence later. So far, work on the Löbian obstacle has been pretty good on this metric (if you count it as partially responsible for attracting Benja and Nate, attention from mathematicians, its importance to past workshops, Nik Weaver, etc.).

Eliezer Yudkowsky (11y)

I'll very quickly remark that I think that the competence gap is indeed the main issue. If we imagine an AI built to a level where it was as smart as all the mathematicians who could work on the problem in advance, but able to do the same work faster, which didn't use any self-improvement along the way, and it was otherwise within a Friendliness framework that well-decided its preferences among what decision framework would control whatever stability framework it invented, then clearly there's no advantage to trying to do the work in advance. But I think the competence gap is much larger than that zero level.

paulfchristiano (11y)

Note that we care about the gap between {Ability to design powerful AI} and {Ability to design powerful AI that will do what the original AI wants}. I think the main difference is that you see the second one as a super-hard problem. I don't see it as a super-hard problem, especially if we have already successfully built one AI that does what we want. I tried to flesh out this disagreement in the post.

I do see a gap as plausible, since I expect capabilities to be uneven and who knows what will come first.

But it would be surprising if an AI was good at figuring out what other AIs would be effective, but wasn't able to understand that it itself was effective, since presumably these other AIs would be quite similar to itself, and would be leveraging the same insights. The concern seems to be the case where the AI understands why it is able to do so much cool stuff, but is not able to understand why it is motivated to do the right cool stuff (and can't figure it out, despite the motivation to do so and the availability of human explainers who do understand).

To me this scenario seems unlikely. I assume you have a different picture than I do.

Benya_Fallenstein (11y)

I think the main disagreement is about whether it's possible to get an initial system which is powerful in the ways needed for your proposal and which is knowably aligned with our goals; some more about this in my reply to your post, which I've finally posted, though there I mostly discuss my own position rather than Eliezer's.

abramdemski (11y)

I enjoyed this and found it to be a surprising deconstruction of the goal of provably safe self-modification.

I think there is also a more general thrust toward reflectively consistent AI architectures, which has been quite fruitful in highlighting open problems. This could be justified in terms of self-modification (and probably has been in most cases), but also might stand on its own as a reasonable desideratum.

I'm not fully convinced on the "standards of reasoning can't be outsourced" point.

As things stand, I don't think there is a plausible story for how an AI which started out having uncertainty over theories in 1st-order logic (as has been discussed fairly extensively) could later come to conceive of the standard model for the natural numbers, and other such concepts which lack a finite or R.E. axiomatization in 1st-order logic (or in any effective logic). This is just Skolem's paradox.

The best story which I can surmise is that the axioms of set theory may be accepted on pragmatic grounds (they allow convenient description of many useful entities). This would then allow the existence and uniqueness of the standard model to be proved relative to those axioms.

Actually, this isn't so bad; I think I habitually give this explanation too little credit.

I'm concerned, though; my feeling is that there should be something more resolving Skolem's paradox (a difference in how we perform probabilistic reasoning for 2nd-order entities as opposed to 1st-order). If there is something more, it seems possible that an AI would miss it (view it as human irrationality).
