Part of the “Intro to brain-like-AGI safety” post series.
(If you’re already an AGI safety expert, you can probably skip this short post—I don’t think anything here is new, or too specific to brain-like AGIs.)
In the previous post, I talked about “the alignment problem” for brain-like AGIs. Two points are worth emphasizing: (1) the alignment problem for brain-like AGIs is currently unsolved (just like the alignment problem for any other type of AGI), and (2) solving it would be a giant leap towards AGI safety.
That said, “solving AGI alignment” is not exactly the same as “solving AGI safety”. This post is about how the two may come apart, at least in principle.
As a reminder, here’s the terminology:
Thus, these are two different things. And my goal in this post is to describe how they may come apart:
To skip to the final answer, my takeaway is that, although it is not technically correct to say “AGI alignment is necessary and sufficient for AGI safety”, it’s damn close to correct, at least for the brain-like AGIs we’re talking about in this series.
This is the case where an AGI is aligned (i.e., trying to do things that its designers had intended for it to try to do), but still causes catastrophic accidents. How?
One example: maybe, as designers, we didn’t think carefully about what we had intended for the AGI to do. John Wentworth gives a hypothetical example here: humans ask the AGI for a nuclear fusion power plant design, but they neglect to ask the follow-up question of whether the same design makes it much easier to make nuclear weapons.
Another example: maybe the AGI is trying to do what we had intended for it to try to do, but it screws up. For example, maybe we ask the AGI to build a new, better successor AGI that is still well-behaved and aligned. But the AGI messes up: it makes a successor AGI with the wrong motivations, and the successor gets out of control and kills everyone.
I don’t have much to say in general about alignment-without-safety. But I guess I’m modestly optimistic that, if we solve the alignment problem, then we can muddle our way through to safety. After all, if we solve the alignment problem, then we’ll be able to build AGIs that are sincerely trying to help us, and the first thing we can use them for is to ask them for help clarifying exactly what they should be doing and how, thus hopefully avoiding failure modes like those above.
That said, I could be wrong, and I’m certainly happy for people to keep thinking hard about the non-alignment aspects of safety.
Conversely, there are various ideas for how to make an AGI safe without needing to make it aligned. They all seem hard or impossible to me. But hey, perfect alignment seems hard or impossible too. I’m in favor of keeping an open mind, and using multiple layers of protection. I’ll go through some possibilities here (this is not a comprehensive list):
The idea here is to put an AGI in a box, with no internet access, no actuators, etc. We can unplug the AGI whenever we want. Even if the AGI has dangerous motivations, who cares? What harm could it possibly do? Oh, umm, it could send out radio signals using its RAM. So we also need a Faraday cage. Hopefully there’s nothing else we forgot!
Actually, I am quite optimistic that people could make a leakproof AGI box if they really tried. I love bringing up Appendix C of Cohen, Vellambi, Hutter (2020), which has an awesome box design, complete with air-tight seals and Faraday cages and laser interlocks and so on. Someone should totally build that. When we’re not using it for AGI experiments, we can loan it to movie studios as a prison for supervillains.
A different way to make a leakproof AGI box is using homomorphic encryption. This has the advantage of being provably leakproof (I think), but the disadvantage of dramatically increasing the amount of compute required to run the AGI algorithm.
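To give a flavor of the “computing on encrypted data” trick, here is a minimal, deliberately insecure toy of the Paillier cryptosystem. (Caveat: Paillier is only *additively* homomorphic; running an arbitrary program like an AGI would require fully homomorphic encryption, which is vastly more expensive. This sketch just shows the basic idea: whoever runs the computation manipulates ciphertexts without ever holding the decryption key, and even a single encrypted addition costs large modular exponentiations — hence the compute blowup.)

```python
import math
import random

# Toy Paillier cryptosystem with tiny, insecure parameters (illustration only).
p, q = 11, 13
n = p * q                       # public modulus
n2 = n * n
g = n + 1                       # standard generator choice
lam = math.lcm(p - 1, q - 1)    # private key

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)  # private key component

def encrypt(m):
    """Encrypt message m (0 <= m < n) under the public key (n, g)."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    """Decrypt ciphertext c using the private key (lam, mu)."""
    return (L(pow(c, lam, n2)) * mu) % n

# Homomorphic property: multiplying ciphertexts adds the plaintexts.
# The party doing this multiplication never needs the private key.
c1, c2 = encrypt(5), encrypt(7)
assert decrypt((c1 * c2) % n2) == 12
```

Note that adding two 8-bit numbers here took several modular exponentiations over n²; scaling that overhead up to every operation of a large learning algorithm is the “dramatically increasing the amount of compute” problem mentioned above.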
What’s the problem with boxing? Well, we made the AGI for a reason. We want to use it to do things.
For example, something like the following could be perfectly safe:
Yes, that would be safe! But not useful! Nobody is going to spend gazillions of dollars to do that.
Instead, for example, maybe we’ll have a human interact with the AGI through a text terminal, asking questions, making requests, etc. The AGI may print out blueprints, and if they look good, we’ll follow them. Oops. Now our box has a giant gaping security hole—namely, us! (See the AI-box experiment.)
So I don’t see any path from “boxing” to “solving the AGI safety problem”.
That said, “won’t solve the AGI safety problem” is different from “literally won’t help at all, not even a little bit on the margin”. I do think boxing can help on the margin. In fact, I think it’s a terrible idea to put an AGI on an insecure OS that also has an unfiltered internet connection—especially early in training, when the AGI’s motivations are still in flux. I for one am hoping for a gradual culture shift in the machine learning community, such that eventually “Let’s train this new powerful model on an air-gapped server, just in case” is an obviously reasonable thing to say and do. We’re not there yet. Someday!
In fact, I would go further. We know that a learning-from-scratch AGI will have some period of time when its motivations and goals are unpredictable and possibly dangerous. Unless someone thinks of a bootstrapping approach, we’re going to need a secure sandbox in which the infant-AGI can thrash about without causing any real damage, until such time as our motivation-sculpting systems have made it corrigible. There would be a race between how fast we can refine the AGI’s motivations, versus how quickly the AGI can escape the sandbox—see previous post (Section 10.5.4.2). Thus, making harder-to-escape sandboxes (that are also user-friendly and full of great features, such that future AGI developers will actually choose to use them rather than less-secure alternatives) seems like a useful thing to do, and I endorse efforts to accelerate progress in this area.
But regardless of that progress, we would still need to solve the alignment problem.
Let’s say we fail to solve the alignment problem, so we’re not sure about the AGI’s plans and intentions, and we’re concerned about the possibility that the AGI may be trying to trick or manipulate us.
One way to tackle this problem is to ensure that the AGI has no idea that we humans exist and are running it on a computer. Then it won’t try to trick us, right?
As one example along those lines, we can make a “mathematician AGI” that knows about the universe of math, but knows nothing whatsoever about the real world. See Thoughts on Human Models for more along these lines.
I see two problems:
By the way, another idea in this vicinity is putting the AGI in a virtual sandbox environment, and not telling it that it’s in a virtual sandbox environment (further discussion). This seems to me to have both of the same two problems as above, or at least one of them, depending on the detailed setup. Interestingly, some humans spend inordinate amounts of time pondering whether they themselves are running in a virtual sandbox environment, in the absence of any direct evidence whatsoever. Surely a bad sign! That said, doing tests of an AGI in a virtual sandbox is still almost definitely a good idea, as mentioned in the previous section. It doesn’t solve the whole AGI safety problem, but we still ought to do it.
We humans have an intuitive notion of the “impact” of a course of action. For example, removing all the oxygen from the atmosphere is a “high-impact action”, whereas making a cucumber sandwich is a “low-impact action”.
There’s a hope that, even if we can’t really control an AGI’s motivations, maybe we can somehow restrict the AGI to “low-impact actions”, and thus avoid catastrophe.
Defining “low impact” winds up being quite tricky. See Alex Turner’s work for one approach. Rohin Shah suggests that there are three desiderata that seem to be mutually incompatible: “objectivity (no dependence on [human] values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do some useful things)”. If that’s right, then clearly we need to throw out objectivity. One place we may wind up is something like AGIs that try to follow human norms, for example.
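To make the flavor of these proposals concrete, here is a toy sketch of an impact penalty, loosely inspired by attainable-utility-preservation-style ideas: the agent’s task reward is reduced in proportion to how much an action changes what the agent *could* achieve (relative to doing nothing), averaged over some set of auxiliary value functions. All names and numbers below are hypothetical illustrations, not Turner’s actual formulation.

```python
# Toy sketch: penalize "high-impact" actions by how much they change the
# agent's attainable value across auxiliary goals, relative to a no-op.
# Hypothetical simplification, not an implementation of any published method.

def aup_penalty(q_aux, state, action, noop="noop"):
    """Average change in attainable auxiliary value versus doing nothing."""
    return sum(abs(q[state][action] - q[state][noop]) for q in q_aux) / len(q_aux)

def shaped_reward(reward, q_aux, state, action, lam=1.0):
    """Task reward minus a scaled impact penalty."""
    return reward - lam * aup_penalty(q_aux, state, action)

# Two auxiliary value functions over one state: a drastic action changes
# what's attainable a lot; a mundane one barely changes it.
q_aux = [
    {"s0": {"flood": 9.0, "sandwich": 1.1, "noop": 1.0}},
    {"s0": {"flood": 0.0, "sandwich": 0.9, "noop": 1.0}},
]
```

Here the drastic action incurs a large penalty and the cucumber-sandwich-style action a tiny one, so the shaped objective steers toward low impact even if the raw task reward slightly favored the drastic plan. Note how this illustrates Rohin Shah’s point: the choice of auxiliary value functions (and of λ) smuggles in judgments about which changes matter, so full objectivity is hard to retain.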
From my perspective, I find these ideas intriguing, but the only way I can see them working in a brain-like AGI is to implement them via the motivation system. I imagine that the AGI would follow human norms because it wants to follow human norms. So this topic is absolutely worth keeping in mind, but for my purposes, it’s not a separate topic from alignment, but rather an idea about what motivation we should be trying to put into our aligned AGIs.
There’s an appealing intuition, dating back at least to this 2012 post by Holden Karnofsky, that maybe there’s an easy solution: just make AIs that aren’t “trying” to do anything in particular, but instead are more like “tools” that we humans can use.
While Holden himself changed his mind and is now a leading advocate of AGI safety research, the idea of non-agentic AI lives on. Prominent advocates of this approach include Eric Drexler (see his “Comprehensive AI Services”, 2019), and people who think that large language models (e.g. GPT-3) are on the path to AGI (well, not all of those people, it’s complicated).
As discussed in this reply to the 2012 post, we shouldn’t take for granted that “tool AI” would make all safety problems magically disappear. Still, I suspect that tool AI would help with safety for various reasons.
I’m skeptical of “tool AI” for a quite different reason: I don’t think such systems will be powerful enough. Just like the “mathematician AGI” in Section 11.3.2 above, I think a tool AI would be a neat toy, but it wouldn’t help solve the big problem—namely, that the clock is ticking until some other research group comes along and makes an agentic AGI. See my discussion here for why I think that agentic AGIs will be able to come up with creative new ideas and inventions in a way that non-agentic AGIs can’t.
But also, this is a series on brain-like AGI. Brain-like AGI (as I’m using the term) is definitely agentic. So non-agentic AI is off-topic for this series, even if it were a viable option.
Thus, I consider safety and alignment to be quite close, and that’s why I’ve been talking about AGI motivation and goals so frequently throughout this series.
The next three posts will talk about possible paths to alignment. Then I’ll close out the series with my wish-list of open problems, and how to get involved.
As described in a footnote of the previous post, be warned that not everyone defines “alignment” exactly as I’m doing here.
By this definition of “safety”, if an evil person wants to kill everyone, and uses AGI to do so, that still counts as successful “AGI safety”. I admit that this sounds rather odd, but I believe it follows standard usage from other fields: for example, “nuclear weapons safety” is a thing people talk about, and this thing notably does NOT include the deliberate, authorized launch of nuclear weapons, despite the fact that the latter would not be “safe” for anyone, not by any stretch of the imagination. Anyway, this is purely a question of definitions and terminology. The problem of people deliberately using AGI towards dangerous ends is a real problem, and I am by no means unconcerned about it. I’m just not talking about it in this particular series. See Post #1, Section 1.2.
A more problematic case would be if we can align our AGIs such that they’re trying to do a certain thing we want, but only for some things, and not others. Maybe it turns out that we know how to make AGIs that are trying to solve a certain technological problem without destroying the world, but we don’t know how to make AGIs that are trying to help us reason about the future and about our own values. If that happened, my proposal of “ask the AGIs for help clarifying exactly what those AGIs should be doing and how” wouldn’t work.
For example, can we initialize the AGI’s world-model from a pre-existing human-legible world model like Cyc, instead of from scratch? I dunno.
At first glance, I think there’s a plausible case that language models like GPT-3 are more “tools” than “agents”—that they’re not really “trying” to do anything in particular, in a way that’s analogous to how RL agents are “trying” to do things. (Note that GPT-3 is trained by self-supervised learning, not RL.) At second glance, it’s more complicated. For one thing, if GPT-3 is currently calculating what Person X will say next, does GPT-3 thereby temporarily “inherit” the “agency” of Person X? Could simulated-Person-X figure out that they are being simulated in GPT-3, and hatch a plot to break out?? Beats me. For another thing, even if RL is in fact a prerequisite to “agency” / “trying”, there are already lots of researchers hard at work stitching together language models with RL algorithms.
Anyway, my claim in Section 11.3.4 is that there’s no overlap between (A) “systems that are sufficiently powerful to solve ‘the big problem’” and (B) “systems that are better thought of as tools rather than agents”. Whether language models are (or will be) in category (A) is an interesting question, but orthogonal to this claim, and I don’t plan to talk about it in this series.