Stephen Casper, scasper@mit.edu. Thanks to Rose Hadshar and Daniel Dewey for some discussions and feedback. 

The goal of this post is to sort through some questions involving the difficulty of avoiding perpetual risk from transformative AI. Feedback in the comments is welcome!

TL;DR

Getting AI to go well means that at some point in time, the acute period of risk posed by the onset of transformative AI must end. Ending that period will require establishing a regime for transformative AI that is exclusive, benevolent, stable, and successful at alignment. This post argues that this type of regime may be very difficult to establish and that this will largely be an AI governance problem. If so, this gives an argument for emphasizing work on AI safety challenges other than just alignment.

Intro

Often, at least in conversation, we talk about how we need “a solution” to AI safety. And maybe AI safety is a problem that has a once-and-for-all solution. If we can avoid doom, maybe we can use aligned AI to solve our key challenges, avoid all other X-risks, and set ourselves on a sustainable course for the future. This seems possible, and it would be great. Thinking in these terms dates back at least to Nick Bostrom’s Superintelligence. Bostrom discusses the possibility that, just as extinction is an attractor state, realizing our cosmic endowment may be one as well. If we can get AI right just once, maybe that’s the last key challenge we need to solve.

From Superintelligence:

However, achieving immortality would mean a lot of things would have to go right. Establishing a regime that avoids perpetual risk will be complex, and it will definitely involve more than just figuring out how to align AI systems with our goals. 

Four requirements for avoiding perpetual risk

Suppose that highly transformative AI (TAI) technologies are someday developed and that they are powerful enough to do catastrophically dangerous things. I will refer to whatever set of institutions controls the TAI, and could cause catastrophic risks with it, as the TAI regime. The regime could consist of human institutions, AI institutions, or both.

To avoid perpetual AI risk, four things need to be true about the TAI regime. And to the extent that any are false, the probability of perpetual risk will rapidly increase. Note that this only applies within some sort of cosmic sphere of influence. 

Exclusivity: There will almost certainly exist bad actors who would cause major risks if they were able to create and deploy TAI without being stopped at some point in the process. (The same is true of nuclear weapons, for example.) To avoid this, the regime needs to be effectively closed to those who would cause havoc if they joined it.

Benevolence: The TAI regime needs to be one that will try to lock in a stable and good future instead of pursuing more evil or myopic goals.

Stability: The regime needs to be stable over time and not be overthrown or degenerate into an unaligned one.  

Success at alignment: The TAI regime needs to solve outer alignment and inner alignment sufficiently well to ensure that their TAI does not itself cause major risks.  

It’s also worth adding that resilience against disasters would be very valuable if risks become reality.
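
As a rough illustration of why any one weak requirement matters so much, here is a toy calculation. It is only a sketch: it assumes the four requirements hold independently, and the probabilities are made-up placeholders rather than estimates. The function name `p_avoid_perpetual_risk` is hypothetical and introduced only for this example.

```python
from math import prod


def p_avoid_perpetual_risk(p_exclusive, p_benevolent, p_stable, p_aligned):
    """Toy model: chance that all four requirements hold at once.

    Assumes (unrealistically) that the four requirements are independent,
    so the joint probability is just the product of the individual ones.
    """
    return prod([p_exclusive, p_benevolent, p_stable, p_aligned])


# Placeholder numbers chosen only for illustration, not estimates.
all_strong = p_avoid_perpetual_risk(0.9, 0.9, 0.9, 0.9)
one_weak = p_avoid_perpetual_risk(0.9, 0.9, 0.9, 0.5)

print(f"all four requirements at 0.9: {all_strong:.4f}")
print(f"one requirement drops to 0.5: {one_weak:.4f}")
```

Under these assumptions, even four fairly likely requirements jointly hold only about two thirds of the time, and weakening a single one cuts the joint probability by the same factor regardless of how strong the others are.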

Each of these four may be difficult 

Exclusivity:

Developing TAI may offer substantial first-mover advantages that give the first members of the regime a lot of influence. And there is some precedent for sectors that are very hard for new actors to break into and compete in, such as the search engine industry. But exclusivity still seems hard.

  • The research and development ecosystems for AI are competitive. Due in part to arms races and copycatting, there is a lot of precedent for different groups making similar breakthroughs at the same time. For example, OpenAI’s DALL-E 2, Google’s Imagen, Midjourney’s Midjourney, Meta’s Make-A-Video, and NVIDIA’s eDiff-I were all released between April and November of this year. It may be difficult for the earliest developers of powerful TAI to outcompete and exclude others right on their tail.
  • The mechanisms by which the members of a nascent TAI regime could exclude competitors might be difficult to use, illegal, or both. Espionage is illegal. Antitrust legislation might make it hard to simply outcompete competitors. And if a potential new regime member is in a different part of the world than the existing members, it may be especially hard to exclude them.
  • If the regime is not unipolar, standard coordination problems and tragedies of the commons may arise involving the question of who should pay for exclusionary actions. 

Benevolence:

Extreme power might make one more inclined to be altruistic, but this seems to be a tenuous hope at best. 

  • Being primarily altruistic and trying to make the world a good, non-risky place to live in doesn’t seem to be the norm among powerful companies or people. Precedent suggests that if it is possible to make money off of something, some company will want to do it even if it is risky for the planet. The fossil fuel industry is an example of this. And even having immense amounts of wealth and power may not drive entities to benevolence. For example, this isn’t the case most of the time among the billionaires of the world who already have more than enough money to buy everything a single person could need or want. 
  • Members of the AI safety community are not a majority among AI researchers or developers. 
  • The alignment tax might be expensive, so there may be a strong disincentive to pay it, and those who are willing to do so may be uncompetitive. 

Stability:

Structures such as constitutions that create checks and balances for regimes seem good and probably necessary for a tenable TAI regime. But things could go wrong even if the TAI regime is well-structured. 

  • Corruption happens. Most governments and big companies develop it. There’s nothing like having a substantial amount of control over the world to incentivize internal power grabs.
  • Regimes can be overhauled or overthrown. Powerful competitors, uprisings, or armies might make this likely. The longest-lasting continuous government in world history was the Republic of Venice, which lasted 1,100 years and underwent substantial evolution over that period.

Success at alignment:

These troubles are simple to state.

  • Outer alignment is hard. 
  • Inner alignment is hard. 

Hopefully though, containment, corrigibility, and off-switches can temper the risks of failures here. 

What might this mean for AI safety work?

Consider a taxonomy of AI safety strategies that groups them into three types.

Strategy 1: Making it easier to make safer AI. The whole field of AI alignment is the key example of this. But this also includes governance strategies that promote safer work or establish healthy norms in the research and development ecosystems.

Strategy 2: Making it harder to make less safe AI. Examples include establishing regulatory agencies, auditing companies, auditing models, creating painful bureaucracy around building risky AI systems, influencing hardware supply chains to slow things down, and avoiding arms races.

Strategy 3: Responding to problems as they arise. There might be a decent amount of time to act between the creation of an X-risky AI and extinction from it. This is especially true if the extinction happens via a cascade of globally-destabilizing events. It would probably be hard for TAI systems to gain influence over the world for some of the same reasons it’s hard for people, companies, and countries to do the same. The world seems too big, and too able to adapt and fight back, for the path to extinction to be super short. Given this, some strategies to make us more resilient might be very useful, including giving governments powers to rapidly detect and respond to firms doing risky things with TAI, hitting killswitches involving global finance or the internet, cybersecurity, and generally being more resilient to catastrophes as a global community.

Note that we could also consider a fourth category: meta work that aims to build good paradigms and healthy institutions. But this isn’t direct work, so I'll only mention it here on the side.

Non-alignment aspects of AI safety are key!

Strategy 1 only addresses ensuring that the TAI regime is successful at alignment. Strategy 2 is key for exclusivity and benevolence. And Strategy 3 is useful for stability and resilience against disasters if we end up in a regime of significant risk.

I think this is important to bear in mind because the most common and most interesting types of AI safety work, the ones it is easiest to nerd-snipe researchers with, seem to fall into Strategy 1. But Strategies 2 and 3 may be at least as important to work on, if not more. As someone who works on problems in Strategy 1, one thing I am currently thinking about is whether I should work more toward 2 and 3. These seem to be mostly (but not entirely) governance-related problems that are relatively neglected. I’d appreciate feedback and discussion in the comments!


 

Comments (2)

I'm a bit skeptical about calling this an "AI governance" problem. This sounds more like "governance" or maybe "existential risk governance": if future technologies make irreversible destruction increasingly easy, how can we govern the world to avoid certain eventual doom?

Handling that involves political challenges, fundamental tradeoffs, institutional design problems, etc., but I don't think it's distinctive to risks posed by AI, don't think that a solution necessarily involves AI, don't think it's right to view "access to TAI" as the only or primary lever of political power to prevent destructive acts, and I'm not convinced that this problem should be addressed by a community focused on AI in particular.

It seems good for people to think about the general long-term challenge as well as to think about the concrete possible destructive technologies on the horizon, in case there is narrower work that can help mitigate the risks they pose and thereby delay the need to implement a general solution. But in some sense this is just "delaying the inevitable."

I wrote some of my thoughts on this relationship in Handling destructive technology.

One potential difference is that I don't see TAI as automatically posing a catastrophic risk. Alignment itself could pose a catastrophic risk. But if we resolve that, then I think we get some (unknown) amount of subjective time until the next thing goes wrong, which might be AI enabling access to destructive physical technology or might be something more conceptually gnarly. The further off that next risk is, the more political change is likely to happen in the interim.

This is an interesting point. But I'm not convinced, at least immediately, that this isn't likely to be largely a matter of AI governance. 

There is a long list of governance strategies that aren't specific to AI that can help us handle perpetual risk. But there is also a long list of strategies that are. I think that all of the things I mentioned under Strategy 2 have AI-specific examples:

establishing regulatory agencies, auditing companies, auditing models, creating painful bureaucracy around building risky AI systems, influencing hardware supply chains to slow things down, and avoiding arms races.

And I think that some of the things I mentioned for Strategy 3 do too:

giving governments powers to rapidly detect and respond to firms doing risky things with TAI, hitting killswitches involving global finance or the internet, cybersecurity, and generally being more resilient to catastrophes as a global community.

So ultimately, I won't make claims about whether avoiding perpetual risk is mostly an AI governance problem or mostly a more general governance problem, but there are certainly a bunch of AI-specific things in this domain. I also think they might be a bit neglected relative to some of the Strategy 1 stuff.