NB: I doubt any of this is very original. In fact, it's probably right there in the original Friendly AI writings and I've just forgotten where. Nonetheless, I think this is something worth exploring lest we lose sight of it.
Consider the following argument:
This means that, if you buy this argument, huge swaths of the AI design space are off limits for building aligned AI, and many proposals are consequently doomed to fail. Some examples of such doomed approaches:
So what options are left?
Not building AI is probably not a realistic option unless industrial civilization collapses. And so far we don't seem to be making progress on creating Friendly AI. That just leaves bootstrapping to alignment.
If I'm honest, I don't like it. I'd much rather have the guarantee of Friendly AI. Alas, if we don't know how to build it, and if we're in a race against folks who will build unaligned superintelligent AI if aligned AI is not created first, bootstrapping seems the only realistic option we have.
This puts me in a strange place with regard to how I think about things like HCH, debate, IRL, and CIRL. On the one hand, they might be ways to bootstrap to something that's aligned enough to use to build Friendly AI. On the other, they might overshoot in terms of capabilities, we probably wouldn't even realize we'd overshot, and then we'd suffer an existential catastrophe.
One way we might avoid this is by being more careful about how we frame attempts to build aligned AI and being clear about whether they are targeting "strong", perfect alignment like Friendly AI or "weak", optimization-based alignment like HCH. I think this would help us avoid confusion in a few places:
It also seems like it would clear up some of the debates we fall into around various alignment techniques. Plenty of digital ink has been spilled trying to suss out whether, say, debate would really give us alignment or whether it's too dangerous to even attempt, and I think a lot of this could have been avoided if we thought of debate as a weak alignment technique we might use to bootstrap strong alignment.
Hopefully this framing is useful. As I say, I don't think it's very original, and I think I've read a lot of this framing expressed in comments and buried in articles and posts, so hopefully it's boring rather than controversial. Despite this, I can't recall seeing it crisply laid out as it is above, and I think there's value in that.
Let me know what you think.
Reminds me of a quote from this Paul Christiano post: "It's a solution built to last (at most) until all contemporary thinking about AI has been thoroughly obsoleted... I don't think there is a strong case for thinking much further ahead than that."
as a weak alignment technique we might use to bootstrap strong alignment.
Yes, it also reminded me of Christiano's approach of amplification and distillation.
Thanks both! I definitely had the idea that Paul had mentioned something similar somewhere but hadn't made it a top-level concept. I think there are similar echoes in how Eliezer talked about seed AI in the early Friendly AI work.
Planned summary for the Alignment Newsletter:
This post distinguishes between three kinds of "alignment":

1. Not building an AI system at all,
2. Building Friendly AI that will remain perfectly aligned for all time and capability levels,
3. _Bootstrapped alignment_, in which we build AI systems that may not be perfectly aligned but are at least aligned enough that we can use them to build perfectly aligned systems.

The post argues that optimization-based approaches can't lead to perfect alignment, because there will always eventually be Goodhart effects.
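To make the Goodhart step concrete, here is a toy simulation of regressional Goodhart (an illustrative sketch of my own, not code from the post; the names `V` and `U` are invented for the example): an optimizer selects candidates on a proxy score U that is merely correlated with the true value V, and the harder it selects, the more the proxy overstates what it actually gets.

```python
import numpy as np

rng = np.random.default_rng(0)

# True value V and an observable proxy U = V + noise: correlated, not identical.
n = 1_000_000
V = rng.normal(size=n)      # true value of each candidate (unobserved)
U = V + rng.normal(size=n)  # proxy score the optimizer actually selects on

for label, q in [("mild (top 10%)", 0.90), ("hard (top 0.01%)", 0.9999)]:
    selected = U >= np.quantile(U, q)
    print(f"{label}: mean proxy = {U[selected].mean():.2f}, "
          f"mean true value = {V[selected].mean():.2f}")
```

Under mild selection the proxy overstates the true value by about 1.2 standard deviations; under hard selection the gap more than doubles. The pattern generalizes: more optimization pressure on a fixed proxy widens the divergence, which is one sense in which "there will always eventually be Goodhart effects."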
Looks good to me! Thanks for planning to include this in the AN!
I'm still holding out hope for jumping straight to FAI :P Honestly I'd probably feel safer switching on a "big human" than a general CIRL agent that models humans as Boltzmann-rational.
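For context on that last point: "Boltzmann-rational" standardly refers to the noisily rational human model common in IRL/CIRL work, where the human picks actions with probability exponential in their value (β here is an assumed rationality parameter, not something from this thread):

$$\pi_H(a \mid s) = \frac{\exp\big(\beta\, Q(s, a)\big)}{\sum_{a'} \exp\big(\beta\, Q(s, a')\big)}$$

β = 0 gives uniformly random behavior and β → ∞ gives perfect rationality. If real humans deviate from this model systematically rather than noisily, a CIRL agent conditioning on it can become confidently wrong about the human's values, which seems to be the danger being pointed at here.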
Though on the other hand, does modern ML research already count as trying to use UFAI to learn how to build FAI?
Seems like it probably does, but only incidentally.
I instead tend to view ML research as the background over which alignment work is now progressing. That is, we're in a race against capabilities research that we have little power to stop, so our best bets are either that capabilities turn out to be nearing the flat top of an S-curve, buying us some time, or that those capabilities can be safely turned to helping us solve alignment.
I do think there's something interesting about a direction not considered in this post: intelligence enhancement of humans and human emulations (ems) as a means of working on alignment. But realistically, current projections of AI capability timelines suggest they're unlikely to have much opportunity for impact.