On unfixably unsafe AGI architectures

Steven Byrnes

There's loads of discussion on ways that things can go wrong as we enter the post-AGI world. I think an especially important one for guiding current research is:

Maybe we'll know how to build unfixably unsafe AGI, but can't coordinate not to do so.

As a special case, I will suggest that we might have an x-risk-level accident as the culmination of a series of larger and larger accidents.

(This is an extreme case of what John Maxwell (following Nate Soares) calls an alignment roadblock.)

I'm sure this has been discussed before, but it sometimes seems to slip through the cracks in recent discussions, where I sometimes see an implicit assumption that x-risk-level catastrophic accidents will not happen if we have ample warning in the form of minor accidents. On this theory, we should think only about (1) fast takeoff, (2) deceptive systems (such as Paul Christiano's "influence-seekers") that pretend to be beneficial until it's too late to stop them, (3) researchers being reckless due to race dynamics, and (4) other problems that are not "accidents" per se. But even if we avoid all those problems, and thus get ample experience in the form of minor accidents, I don't think that's necessarily enough.

1. Is there such a thing as an "unfixably unsafe AGI"?

By "unfixable", I mean that to solve the problem, we need to massively backtrack and take a different path to AGI (see Appendix) ... or that a safer AGI architecture simply doesn't exist.

By "unsafe", I mean ... well, I'm not really sure what this term should mean. Is it "less unsafe than the non-AGI status quo humanity on fast-forward" (a low bar!), or "the most safe that's technologically possible" (an almost impossibly high bar!), or some absolute metric like "<X% chance of extinction" for some X? It's your choice, readers! As your safety standards get lower, the existence of "unfixably unsafe AGI" becomes less likely, but a bigger problem if it does happen.

To keep things concrete, let's have in mind an example failure mode, goal instability under learning and reflection: The AGI will have an internal concept of (for example) "doing what the human overseer would want", and this concept will develop and churn as the agent develops better understanding of people and the world. (See "Ontological crisis".) At some point, according to this failure mode, the internal goals / constraints / etc. may fall out of alignment with the safe, benevolent, corrigible behavior we want.

Is this failure mode plausible? If so, would it really be "unfixable" (within a certain approach to AGI)? Well, I don't know! Maybe, maybe not. As far as I know, it can't be ruled out.

Also, without directly solving the problem, there are plenty of possible indirect solutions—boxing, supervisory systems, transparency, etc. etc. But we don't know that any of them will work reliably, and it's possible that they will work only by limiting the system's capability, and then there's still a coordination problem (we can change the topic to "we know how to build this unboxed AGI, and can't coordinate not to do so").

(Again, let's keep this in mind as a running example—but note that there are other possible examples too.)

2. Is it possible that we will know how to build this unfixably unsafe AGI, but can't coordinate not to do so?

I think this is especially plausible if:

2A: There's very little work to do to run this AGI, e.g. there is well-documented open-source code that runs on commodity hardware.

I think this would eventually become true with very high probability (by default); thus a key goal would be to discover the problem as early as possible, when there are still many person-years of R&D left to do.

2B: The arguments that the AGI is unfixably unsafe are complex and uncertain (or, even worse, we don't have such arguments).

In our running example, it is probably impossible to think on the object level about every possible way in which an intelligence might re-conceptualize "doing what the overseer wants me to do" as it continuously learns and reflects. And maybe meta-level "reasoning about reasoning" can't conclude anything useful.

We can hope that, in the course of learning how to build an AGI, we will get insight into the "goal stability upon learning & reflection" problem, but this does not seem guaranteed by any means. For example, humans do not have goal stability, and if we reverse-engineer human brain algorithms, they won't magically start having goal stability. And as I've learned more nuts-and-bolts details about how human brain algorithms work in the past year, I don't feel like it's helped me all that much to better understand this problem, or to find and verify solutions.

2C: Relatedly, given a proposed approach to solve the problem, there is no easy, low-risk way to see whether it works.

In our running example, a proposed solution may merely delay the problem instead of solving it: maybe the AGI still has a goal instability problem, but it hasn't learned enough and reflected enough for it to manifest yet.

2D: A safer AGI architecture doesn't exist, or requires many years of development and many new insights.

Here, an important consideration is how early the development paths diverged between our unfixably unsafe AGI and the safer alternative. Can we keep most of the code and make a small change, or do we have to go back and develop a fundamentally different type of AGI from scratch? See Appendix for more on this.

Summary: A possible story of coordination failure

If most or all of these things are true, the coordination problem seems hopelessly unsolvable to me. Countless actors around the world would be well aware of the transformative potential of the technology, and able to have a go. Not everyone is risk-averse: imagine people saying "This is the only way we can stop climate change and save the planet, we have to try!!" Many will have superficially plausible ideas about how to solve the safety problem, and critics won't have air-tight, legible arguments that the ideas will not work. Even as a series of worse and worse AGI accidents occur, with out-of-control AGIs self-replicating around the internet etc., a few people will keep trying to fix the unfixable AGI, seeing this as the only path to get this slow-rolling catastrophe under control (while actually making it worse). Even hypothetical ideal rational altruists might have a go with a design they know is a long-shot, if they believe that others will keep trying with even less plausible ideas.
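To make the "countless actors" point slightly more concrete, here is a toy calculation, not a forecast. It assumes, purely for illustration, that each of n actors independently decides to run the AGI with the same probability p; under that assumption the chance that at least one of them goes ahead is 1 - (1 - p)^n. The function name and the numbers are hypothetical and are not taken from anywhere else in this post.

```python
def prob_someone_tries(n_actors: int, p_each: float) -> float:
    """Chance that at least one of n independent actors has a go.

    Toy assumption: every actor tries independently with probability p_each.
    """
    if n_actors < 0:
        raise ValueError("n_actors must be non-negative")
    if not 0.0 <= p_each <= 1.0:
        raise ValueError("p_each must be in [0, 1]")
    # Probability that nobody tries is (1 - p)^n; take the complement.
    return 1.0 - (1.0 - p_each) ** n_actors


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the trend.
    for n in (1, 10, 100, 1000):
        print(n, round(prob_someone_tries(n, 0.01), 3))
```

Even when each actor is individually unlikely to try, the result climbs toward 1 as the number of capable actors grows. Real actors are of course not independent (they watch each other, and accidents change minds in both directions), so this is only a sketch of why "most people will hold back" is not the same as "nobody will try".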

Even if there is an international treaty, it would seem to be utterly unenforceable, especially given the existence of secret government labs, leakers / cyber-espionage, and grillions of GPUs, CPUs, and FPGAs off the grid around the world. I think this is true today and will continue to be true for the foreseeable future.^[1]

So, if we have an unfixably unsafe AGI scenario in which the factors 2A-2D are all unfavorable, it just seems utterly hopeless to me. (If anyone has ideas, I'm very interested to hear them!) Instead, I would say the priority is to do technical safety work well in advance, to not get stuck in that kind of situation. I'm very interested in other people's thoughts on this.

Appendix: My list of early-branching paths to AGI

I find that there are a number of grand visions for what AGI will look like and how we'll get there, and these involve years or decades of substantially non-overlapping R&D. (Of course some of these have some overlap.) This is why I think AGI safety work is urgent, even if AGI were centuries away—because it will inform us about which of these paths is more or less promising. Then we can build the AGI that's best, and not just wait and see which R&D program happens to reach the finish-line first.

So here's my little list. I doubt all of them are technically feasible R&D paths that yield very different AGIs at the end, but I'm pretty sure some of them are.

Massively improved brain-computer interfaces (Elon Musk, Ray Kurzweil)
Whole-brain emulation
Make a non-agential world-model-building AGI and probe it using interpretability tools (Chris Olah)
Debate (OpenAI)
IDA (OpenAI)
Understand and copy brain algorithms (Vicarious, Numenta). Within this category, we could copy just the intelligence part (neocortex), or we could also copy emotions etc.
Comprehensive AI Services
There is a general spectrum between how much of the AGI is conventional computer code versus ML models—after all, any specific thing that can be learned can also (given enough engineer-hours) be hand-coded.
System that talks to humans and helps them reason better (David Ferrucci)
Maybe whatever MIRI is doing in their undisclosed research program (involving Haskell I guess)?
In prosaic AI, models can be trained by RL, versus supervised learning versus self-supervised (predictive) learning versus recursive reward modeling, etc.

I'm sure I'm leaving stuff out. I'm curious to what extent other people see many parallel paths to AGI, as I do, versus thinking only one path is really plausible, or that the paths will converge at the end, or that the paths mostly overlap, or some other opinion.

[1] I guess in principle maybe someday there could be a world government that institutes the Nick Bostrom "freedom tag", but I can't see how that would actually come to pass.
