I sometimes hear people asking: “What is the plan for avoiding a catastrophe from misaligned AI?”
This post gives my working answer to that question - sort of. Rather than a plan, I tend to think of a playbook.1
Below I’ll briefly recap my overall picture of what success might look like (with links to other things I’ve written on this), then discuss four key categories of interventions: alignment research, standards and monitoring, successful-but-careful AI projects, and information security. For each, I’ll lay out:
Overall, I feel like there is a pretty solid playbook of helpful interventions - any and all of which can improve our odds of success - and that working on those interventions is about as much of a “plan” as we need for the time being.
The content in this post isn’t novel, but I don’t think it’s already-consensus: two of the four interventions (standards and monitoring; information security) seem to get little emphasis from existential risk reduction communities today, and one (successful-but-careful AI projects) is highly controversial and seems often (by this audience) assumed to be net negative.
Many people think most of the above interventions are doomed, irrelevant or sure to be net harmful, and/or that our baseline odds of avoiding a catastrophe are so low that we need something more like a “plan” to have any hope. I have some sense of the arguments for why this is, but in most cases not a great sense (at least, I can’t see where many folks’ level of confidence is coming from). A lot of my goal in posting this piece is to give such people a chance to see where I’m making steps that they disagree with, and to push back pointedly on my views, which could change my picture and my decisions.
As with many of my posts, I don’t claim personal credit for any new ground here. I’m leaning heavily on conversations with others, especially Paul Christiano and Carl Shulman.
I’ve written a number of nearcast-based stories of what it might look like to suffer or avoid an AI catastrophe. I’ve written a hypothetical “failure story” (How we might stumble into AI catastrophe); two “success stories” that assume good decision-making by key actors; and an outline of how we might succeed with “minimal dignity.”
The essence of my picture has two phases:
Here I’ll discuss the potential impact of both small and huge progress on each of the four major categories of interventions.
For more detail on interventions, see Jobs that can help with the most important century;
What AI companies can do today to help with the most important century; and How major governments can help with the most important century.
How a small improvement from the status quo could nontrivially improve our odds. I think there are various ways we could “get lucky” such that aligning at least the first human-level-ish AIs is relatively easy, and such that relatively small amounts of progress make the crucial difference.
How a big enough success at the intervention could put us in a very good position, even if the other three interventions are going poorly. The big win here would be some alignment (or perhaps threat assessment) technique that is both scalable (works even for systems with far-beyond-human capabilities) and cheap (can be used by a given AI lab without having to pay a large “alignment tax”). This seems pretty unlikely to be imminent, but not impossible, and it could lead to a world where aligned AIs heavily outnumber misaligned AIs (a key hope).
Concerns and reservations. Quoting from a previous piece, three key reasons people give for expecting alignment to be very hard are:
AI systems could quickly become very powerful relative to their supervisors, which means we have to confront a harder version of the alignment problem without first having human-level-ish aligned systems.
I think it’s certainly plausible this could happen, but I haven’t seen a reason to put it at >50%.
To be clear, I expect an explosive “takeoff” by historical standards. I want to give Tom Davidson’s analysis more attention, but it implies that there could be mere months between human-level-ish AI and far more capable AI (but that could be enough for a lot of work by human-level-ish AI).
One key question: to the extent that we can create a feedback loop with AI systems doing research to improve hardware and/or software efficiency (which then increases the size and/or capability of the “automated workforce,” enabling further research ...), will this mostly be via increasing the number of AIs or by increasing per-AI capabilities? There could be a feedback loop with human-level-ish AI systems exploding in number, which seems to present fewer (though still significant) alignment challenges than a feedback loop with AI systems exploding past human capability.11
It’s arguably very hard to get even human-level-ish capabilities without ambitious misaligned aims. I discussed this topic at some length with Nate Soares - notes here. I disagree with this as a default (though, again, it’s plausible) for reasons given at that link.
Expecting “offense-defense” asymmetries (as in this post) such that we’d get catastrophe even if aligned AIs greatly outnumber misaligned ones. Again, this seems plausible, but not the right default guess for how things will go, as discussed at the end of the previous section.
How a small improvement from the status quo could nontrivially improve our odds. Imagine that:
I think this kind of situation would be a major improvement over the status quo, if only via incentives for top AI labs to move more carefully and invest more energy in alignment. Even a squishy, gameable standard, accompanied by mostly-theoretical possibilities of future regulation and media attention, could add to the risks (bad PR, employee dissatisfaction, etc.) and general pain of scaling up and releasing models that can’t be shown to be safe.
This could make it more attractive for companies to do their best with less capable models while making serious investments in alignment work (including putting more of the “results-oriented leadership effort” into safety - e.g., “We really need to make better alignment progress, where are we on that?” as opposed to “We have a big safety team, what more do you want?”). And it could create a big financial “prize” for anyone (including outside of AI companies) who comes up with an attractive approach to alignment.
How a big enough success at the intervention could put us in a very good position, even if the other three interventions are going poorly. A big potential win is something like:
Concerns and reservations. A common class of concerns is along the lines of, “Any plausible standards would be squishy/gameable”; I think this is significantly true, but squishy/gameable regulations can still affect behavior a lot.9
Another concern: standards could end up with a dynamic like “Slowing down relatively cautious, high-integrity and/or law-abiding players, allowing less cautious players to overtake them.” I do think this is a serious risk, but I also think we could easily end up in a world where the “less cautious” players have trouble getting top talent and customers, which does some combination of slowing them down and getting them to adopt standards of their own (perhaps weaker ones, but which still affect their speed and incentives). And I think the hope of affecting regulation is significant here.
I think there’s a pretty common misconception that standards are hopeless internationally because international cooperation (especially via treaty) is so hard. But there is precedent for the US enforcing various things on other countries via soft power, threats, cyberwarfare, etc. without treaties or permission, and in a high-stakes scenario it could do quite a lot of this.
Conflict of interest disclosure: my wife is co-founder and President of Anthropic and owns significant equity in both Anthropic and OpenAI. This may affect my views, though I don't think it is safe to assume specific things about my takes on specific AI labs due to this.10
How a small improvement from the status quo could nontrivially improve our odds. If we just imagine an AI lab that is even moderately competitive on capabilities while being substantially more concerned about alignment than its peers, such a lab could:
How a big enough success at the intervention could put us in a very good position, even if the other three interventions are going poorly. If an AI lab ends up with a several-month “lead” on everyone else, this could enable huge amounts of automated alignment research, threat assessment (which could create very strong demonstrations of risk in the event that automated alignment research isn’t feasible), and other useful tasks with initial human-level-ish systems.
Concerns and reservations. This is a tough one. AI labs can do ~unlimited amounts of harm, and it currently seems hard to get a reliable signal from a given lab’s leadership that it won’t. (Up until AI systems are actually existentially dangerous, there’s ~always an argument along the lines of “We need to move as fast as possible and prioritize fundraising success today, to stay relevant so we can do good later.”) If you’re helping an AI lab “stay in the race,” you had better have done a good job deciding how much you trust leadership, and I don’t see any failsafe way to do that.
That said, it doesn’t seem impossible to me to get this right-ish (e.g., I think today’s conventional wisdom about which major AI labs are “good actors” on a relative basis is neither uninformative (in the sense of rating all labs about the same) nor wildly off), and if you can, it seems like there is a lot of good that can be done by an AI lab.
I’m aware that many people think something like “Working at an AI lab = speeding up the development of transformative AI = definitely bad, regardless of potential benefits,” but I’ve never seen this take spelled out in what seems like a convincing way, especially since it’s pretty easy for a lab’s marginal impact on speeding up timelines to be small (see above).
I do recognize a sense in which helping an AI lab move forward with AI development amounts to “being part of the problem”: a world in which lots of people are taking this action seems worse than a world in which few-to-none are. But the latter seems off the table, not because of Molochian dynamics or other game-theoretic challenges, but because most of the people working to push forward AI simply don’t believe in and/or care about existential risk ~at all (and so their actions don’t seem responsive in any sense, including acausally, to how x-risk-concerned folks weigh the tradeoffs). As such, I think “I can’t slow down AI that much by staying out of this, and getting into it seems helpful on balance” is a prima facie plausible argument that has to be weighed on the merits of the case rather than dismissed with “That’s being part of the problem.”
I think helping out AI labs is the trickiest and highest-downside intervention on my list, but it seems quite plausibly very good in many cases.
How a small improvement from the status quo could nontrivially improve our odds. It seems to me that the status quo in security is rough (more), and I think a small handful of highly effective security people could have a very large marginal impact. In particular, it seems like it is likely feasible to make it at least difficult and unreliable for a state actor to steal a fully-developed powerful AI system.
How a big enough success at the intervention could put us in a very good position, even if the other three interventions are going poorly. I think this doesn’t apply so much here, except for a somewhat far-fetched case in which someone develops (perhaps with assistance from early powerful-but-not-strongly-superhuman AIs) a surprisingly secure environment that can contain even misaligned AIs significantly (though probably not unboundedly) more capable than humans.
Concerns and reservations. My impression is that most people who aren’t excited about security think one of these things:
I disagree with #2 for reasons given here (I may write more on this topic in the future).
I disagree with #1 as well.
After drafting this post, I was told that others had been making this same distinction and using this same term in private documents. I make no claim to having come up with it myself! ↩
Phase 1 in this analysis ↩
Phase 2 in this analysis ↩
I think there are ways things could go well without any particular identifiable “pivotal act”; see the “success stories” I linked for more on this. ↩
“Safe actors” corresponds to “cautious actors” in this post. I’m using a different term here because I want to include the possibility that actors are safe mostly due to luck (slash cheapness of alignment) rather than caution per se. ↩
The latter, more dangerous possibility seems more likely to me, but it seems quite hard to say. (There could also of course be a hybrid situation, as the number and capabilities of AI grow.) ↩
In the judgment of an auditor, and/or an internal evaluation that is stress-tested by an auditor, or simply an internal evaluation backed by the risk that inaccurate results will result in whistleblowing. ↩
I.e., given access to its own weights, it could plausibly create thousands of copies of itself with tens of millions of dollars at their disposal, and make itself robust to an attempt by a few private companies to shut it down. ↩
A comment from Carl Shulman on this point that seems reasonable: "A key difference here seems to be extremely rapid growth, where year on year effective compute grows 4x or more. So a defector with 1/16th the resources can produce the same amount of danger in 1-2 years, sooner if closer to advanced AGI and growth has accelerated. The anti-nuclear and anti-GMO movements cut adoption of those technologies by more than half, but you didn't see countries with GMO crops producing all the world's food after a few years, or France making so much nuclear power that all electricity-intensive industries moved there.
For regulatory purposes you want to know if the regulation can block an AI capabilities explosion. Otherwise you're buying time for a better solution like intent alignment of advanced AI, and not very much time. That time is worthwhile, because you can perhaps get alignment or AI mind-reading to work in an extra 3 or 6 or 12 months. But the difference with conventional regulation interfering with tech is that the regulation is offsetting exponential growth; exponential regulatory decay only buys linear delay to find longer-term solutions.
There is a good case that extra months matter, but it's a very different case from GMO or nuclear power. [And it would be far more to the credit of our civilization if we could do anything sensible at scale before the last few months or years.]" ↩
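The arithmetic behind Shulman's point can be made explicit with a small sketch (the 4x/year growth figure and the 1/16th resource ratio come from the quote above; the function and its name are mine, for illustration only): under exponential growth, a fixed multiplicative handicap buys only a fixed delay proportional to the logarithm of the handicap.

```python
import math

def catchup_years(resource_ratio: float, annual_growth: float) -> float:
    """Years for a defector starting with 1/resource_ratio of the
    leader's effective compute to reach the leader's current level,
    assuming the defector's effective compute grows by a factor of
    `annual_growth` per year."""
    return math.log(resource_ratio) / math.log(annual_growth)

# With 4x/year effective-compute growth (the figure in the quote),
# a defector with 1/16th the resources reaches today's frontier in
# log(16)/log(4) = 2 years.
print(catchup_years(16, 4))

# Halving a defector's resources via regulation buys only
# log(2)/log(4) = 0.5 years of delay - exponential growth offsets
# a multiplicative handicap with merely logarithmic (linear-in-time) cost.
print(catchup_years(2, 4))
```

This is why the footnote contrasts AI with GMO crops or nuclear power: cutting adoption in half there changed the long-run equilibrium, whereas here it buys months, not a permanent lead.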
We would still be married even if I disagreed sharply with Anthropic’s strategy. In general, I rarely share my views on specific AI labs in public. ↩
One way that things could go wrong, not addressed by this playbook: AI may differentially accelerate intellectual progress in a wrong direction - in other words, create opportunities for humanity to make serious mistakes (by accelerating technological progress) faster than the wisdom to make the right choices (philosophical progress). Specific to the issue of misalignment, suppose we get aligned human-level-ish AI, but it is significantly better at speeding up AI capabilities research than the kinds of intellectual progress needed to continue to minimize misalignment risk, such as (next generation) alignment research and coordination mechanisms between humans, human-AI teams, or AIs aligned to different humans.
I think this suggests the intervention of doing research aimed at improving the philosophical abilities of the AIs that we'll build. (Aside from misalignment risk, it would help with many other AI-related x-risks that I won't go into here, but which collectively outweigh misalignment risk in my mind.)
I agree that this is a major concern. I touched on some related issues in this piece.
This post focused on misalignment because I think readers of this forum tend to be heavily focused on misalignment, and in this piece I wanted to talk about what a playbook might look like assuming that focus (I have pushed back on this as the exclusive focus elsewhere).
I think somewhat adapted versions of the four categories of intervention I listed could be useful for the issue you raise, as well.