NB: I doubt any of this is very original. In fact, it's probably right there in the original Friendly AI writings and I've just forgotten where. Nonetheless, I think this is something worth exploring lest we lose sight of it.
Consider the following argument:
This means that, if you buy this argument, huge swaths of the AI design space are off limits for building aligned AI, and many proposals are consequently doomed to fail. Some examples of such doomed approaches:
So what options are left?
Not building AI is probably not a realistic option unless industrial civilization collapses. And so far we don't seem to be making progress on creating Friendly AI. That just leaves bootstrapping to alignment.
If I'm honest, I don't like it. I'd much rather have the guarantee of Friendly AI. Alas, if we don't know how to build it, and if we're in a race against folks who will build unaligned superintelligent AI if aligned AI is not created first, bootstrapping seems the only realistic option we have.
This puts me in a strange place with regard to how I think about things like HCH, debate, IRL, and CIRL. On the one hand, they might be ways to bootstrap to something that's aligned enough to use to build Friendly AI. On the other, they might overshoot in terms of capabilities, we probably wouldn't even realize we'd overshot, and then we'd suffer an existential catastrophe.
One way we might avoid this is by being more careful about how we frame attempts to build aligned AI and being clear about whether they are targeting "strong", perfect alignment like Friendly AI or "weak", optimization-based alignment like HCH. I think this would help us avoid confusion in a few places:
It also seems like it would clear up some of the debates we fall into around various alignment techniques. Plenty of digital ink has been spilled trying to suss out whether, say, debate would really give us alignment or whether it's too dangerous to even attempt, and I think a lot of this could have been avoided if we thought of debate as a weak alignment technique we might use to bootstrap strong alignment.
Hopefully this framing is useful. As I say, I don't think it's very original, and I think I've read a lot of this framing expressed in comments and buried in articles and posts, so hopefully it's boring rather than controversial. Despite this, I can't recall seeing it crisply laid out as it is above, and I think there's value in that.
Let me know what you think.
Reminds me of a quote from this Paul Christiano post: "It's a solution built to last (at most) until all contemporary thinking about AI has been thoroughly obsoleted... I don't think there is a strong case for thinking much further ahead than that."
as a weak alignment technique we might use to bootstrap strong alignment.
Yes, it also reminded me of Christiano's approach of amplification and distillation.
Thanks both! I definitely had the idea that Paul had mentioned something similar somewhere but hadn't made it a top-level concept. I think there are similar echoes in how Eliezer talked about seed AI in the early Friendly AI work.
Planned summary for the Alignment Newsletter:
This post distinguishes between three kinds of "alignment":

1. Not building an AI system at all,
2. Building Friendly AI that will remain perfectly aligned for all time and capability levels,
3. _Bootstrapped alignment_, in which we build AI systems that may not be perfectly aligned but are at least aligned enough that we can use them to build perfectly aligned systems.

The post argues that optimization-based approaches can't lead to perfect alignment, because there will always eventually be Goodhart effects.
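To make the Goodhart step concrete, here is a toy simulation of regressional Goodhart (an illustrative sketch of my own, not code from the post; the names `V` and `U` are invented for the example): an optimizer selects candidates on a proxy score U that is merely correlated with the true value V, and the harder it selects, the more the proxy overstates what it actually gets.

```python
import numpy as np

rng = np.random.default_rng(0)

# True value V and an observable proxy U = V + noise: correlated, not identical.
n = 1_000_000
V = rng.normal(size=n)      # true value of each candidate (unobserved)
U = V + rng.normal(size=n)  # proxy score the optimizer actually selects on

for label, q in [("mild (top 10%)", 0.90), ("hard (top 0.01%)", 0.9999)]:
    selected = U >= np.quantile(U, q)
    print(f"{label}: mean proxy = {U[selected].mean():.2f}, "
          f"mean true value = {V[selected].mean():.2f}")
```

Under mild selection the proxy overstates the true value by about 1.2 standard deviations; under hard selection the gap more than doubles. The pattern generalizes: more optimization pressure on a fixed proxy widens the divergence, which is one sense in which "there will always eventually be Goodhart effects."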
Looks good to me! Thanks for planning to include this in the AN!
I'm still holding out hope for jumping straight to FAI :P Honestly I'd probably feel safer switching on a "big human" than a general CIRL agent that models humans as Boltzmann-rational.
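For context on that last point: "Boltzmann-rational" standardly refers to the noisily rational human model common in IRL/CIRL work, where the human picks actions with probability exponential in their value (β here is an assumed rationality parameter, not something from this thread):

$$\pi_H(a \mid s) = \frac{\exp\big(\beta\, Q(s, a)\big)}{\sum_{a'} \exp\big(\beta\, Q(s, a')\big)}$$

β = 0 gives uniformly random behavior and β → ∞ gives perfect rationality. If real humans deviate from this model systematically rather than noisily, a CIRL agent conditioning on it can become confidently wrong about the human's values, which seems to be the danger being pointed at here.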
Though on the other hand, does modern ML research already count as trying to use UFAI to learn how to build FAI?
Seems like it probably does, but only incidentally.
I instead tend to view ML research as the background over which alignment work is now progressing. That is, we're in a race against capabilities research that we have little power to stop, so our best bets are either that capabilities turn out to be nearing the flat top of an S-curve, buying us some time, or that those capabilities can be safely turned to helping us solve alignment.
I do think there's something interesting about a direction not considered in this post: intelligence enhancement of humans and human emulations (ems) as a means of working on alignment. But realistically, current projections of AI capability timelines suggest they're unlikely to have much opportunity for impact.