Nearcast-based "deployment problem" analysis

Holden Karnofsky

When thinking about how to make the best of the most important century, two “problems” loom large in my mind:

The AI alignment problem: how to build AI systems that perform as intended, and avoid a world run by misaligned AI.
The AI deployment problem (briefly discussed here): the question of how and when to (attempt to) build and deploy powerful AI systems, under conditions of uncertainty about how safe they will be and how close others are to deploying powerful AI of their own.

This piece is part of a series in which I discuss what both problems might look like under a nearcast: trying to answer key strategic questions about transformative AI, under the assumption that key events (e.g., the development of transformative AI) will happen in a world that is otherwise relatively similar to today's.

A previous piece discussed the alignment problem; this one discusses the deployment problem.

I’m using the scenario laid out in the previous post, in which a major AI company (“Magma,” following Ajeya’s terminology) has good reason to think that it can develop transformative AI very soon (within a year), using what Ajeya calls “human feedback on diverse tasks” (HFDT) - and has some time (more than 6 months, but less than 2 years¹) to set up special measures to reduce the risks of misaligned AI before there’s much chance of someone else deploying transformative AI. I discuss what Magma would ideally do in this situation.

I’m also introducing another hypothetical actor in this scenario, “IAIA²”: an organization, which could range from a private nonprofit to a treaty-backed international agency, that tracks³ transformative AI projects and takes actions to censure or shut down dangerous ones, as well as doing other things where a central, neutral body (as opposed to an AI company) can be especially useful. (More on IAIA below.)

I’m going to discuss what Magma’s and IAIA’s major goals and priorities should be in the “nearcast” situation I’m contemplating; a future piece will go through what a few stylized success stories might look like. I’ll be bracketing discussion of the details of how Magma can reduce the risk that its own AI systems are misaligned (since I discussed that previously), and focusing instead on what Magma and IAIA should be looking to do before and after they achieve some level of confidence in Magma’s systems’ alignment.

I focus on Magma and IAIA for concreteness and simplicity (not because I expect there to be only two important actors, but because my takes on what most actors should be doing can be mostly inferred from how I discuss these two). I sometimes give more detail on Magma, because IAIA is a bit more speculative and unlike actors that exist today.

My discussion will be very high-level and abstract. It leaves a lot of room for variation in the details, and it doesn’t pin down how Magma and IAIA should prioritize between possible key activities - this is too sensitive to details of the situation. Nonetheless, I think this is more specific than previous discussions of the deployment problem, and for one who accepts this broad picture, it implies a number of things about what we should be doing today. I’ll discuss these briefly in the final section, and more in a future post.

Summary of the post (bearing in mind that within the nearcast, I’m using present tense and not heavily flagging uncertainty):

I’ll first give a bit more information on the hypothetical setting of this nearcast (specifically, on the addition of IAIA to the scenario discussed previously).
I’ll break this scenario up into three stylized “phases,” even though in practice I think the boundaries between them could be fuzzy.
- “Phase 1” refers to the period of time when there aren’t yet dramatic new (safe) capabilities available to the world via highly powerful (e.g., transformative) AI systems. In this phase, Magma believes itself to be close to developing transformative AI systems, but has not yet done so - and/or has not yet deployed such AI systems because it can’t be confident enough that they’re aligned. A major goal is “try to get some AI system(s) to be both highly powerful (to the point where they could qualify as transformative) and reliably aligned.”
- “Phase 2” refers to the period of time after Magma has succeeded in getting some AI system to be both highly powerful (e.g., transformative) and reliably aligned - but there is still a major threat of other, less cautious actors around the world possibly deploying powerful misaligned AI. In this phase, Magma and IAIA focus on reducing that risk, hopefully with help from powerful technologies that didn’t exist in Phase 1.
- “Phase 3” comes in once Magma and IAIA have succeeded at this, so there is a very low risk, globally, of anyone deploying misaligned AI systems. Now the main risks come from things like human misuse of powerful AI systems that behave as their human users intend.
- A frame I think many readers will find more familiar: “Phase 1 is before the alignment problem has been solved; Phase 2 starts when the alignment problem has been solved by one actor, for the particular transformative AI systems they’re using; Phase 3 starts when misaligned AI is broadly not a threat anymore globally (even from incautious actors).” A footnote explains why I disprefer this framing,⁴ but I’ve included it in case it helps readers understand what I’m getting at.
In “Phase 1” - before both-transformative-and-aligned AI systems - major priorities should include the following:
- Magma should be centrally focused on increasing the odds that its systems are aligned, as discussed in a previous post. It should also be prioritizing internal security (both to prevent its AI systems from using security exploits and to prevent exfiltration of critical information, especially its AI systems' weights); exploring deals with other companies to reduce “racing” pressure (among other benefits); and producing “public goods” that can help actors worldwide reduce their level of risk (e.g., evidence about whether misaligned AI is a real risk and about what alignment methods are/aren’t working).
- IAIA can be working on monitoring AI companies (with permission and help in the case where IAIA is a nonprofit, with legal backing in the case where it is e.g. a regulatory body); ensuring that companies developing potentially transformative AI systems have good safety practices, good information security and good information sharing practices; and helping disseminate the sorts of “public goods” noted above. (You could think of this as “getting other AI projects to act like Magma.”) IAIA should ideally be in a position to detect cases where changes are needed, and to enforce such changes by censuring or even shutting down labs that don’t take sufficient alignment measures.
- Both Magma and IAIA should be operating with the principle of selective information sharing in mind - e.g., sharing some information with cautious actors but not with incautious ones. In particular, information that is likely to accelerate other actors should be treated cautiously, while information that is primarily useful for making AI systems safer can be shared more liberally.
In “Phase 2” - as aligned-and-transformative AI systems become available - major priorities should continue to include the above, as well as a number of additional tactics for reducing risks from other actors (briefly noted in a previous piece):
- Magma and IAIA should be deploying aligned AI systems, in partnership with governments and via commercial and nonprofit means, that can contribute to defense/deterrence/hardening. For example, aligned AI systems could be (a) finding and patching security vulnerabilities that misaligned AI systems would otherwise exploit (both as part of government programs and via e.g. commercial antivirus programs); (b) raising alerts when they detect signs of dangerous actions by misaligned AIs or AI-assisted humans.
- Magma should be developing ever-better (which includes being cheaper and easier) approaches to aligning AI systems, as well as generating other insights about how to handle the situation as a whole, which can be offered to IAIA and other actors throughout the world. IAIA should be incorporating new technological capabilities and alignment evaluation methods into its efforts to ensure that anyone developing potentially transformative AI is taking strong safety measures.
- Magma should be continuing to improve its AI systems’ capabilities, so that Magma’s aligned systems continue to be more capable than others’ (potentially less safe) systems. It should then be helping IAIA to take full advantage of any ways in which these highly capable systems might be helpful (e.g., for tracking and/or countering dangerous projects).
- Note that not all of these activities are necessarily feasible - it depends on exactly what sorts of capabilities Magma’s aligned AI systems have.
It may turn out that, absent government involvement, other actors look set to deploy powerful unsafe systems whose harm can’t be stopped or contained even with the help of the best safe systems. In this case, IAIA - with help from Magma and other AI companies - should take more drastic actions (and/or recommend that governments take these actions), such as:
- Clamping down on AI development (generally, or in particular dangerous settings).
- To the extent feasible and needed, credibly threatening to employ (or if necessary employing) powerful technologies that could help enforce regulatory agreements, e.g. via resource accumulation, detecting violations of the regulatory framework, military applications, etc.
At some point (“Phase 3”), the risk of a world run by misaligned AI hopefully falls to very low levels. At this point, it’s likely that many actors are using advanced, aligned AI systems.
- From there, the general focus becomes working toward a world in which humans are broadly more capable and more inclined to prioritize the good of all beings across the world and across time.
- From the perspective of Magma (and other companies) and IAIA, this could include helping to empower particular governments and institutions, as well as developing “merge” technologies to greatly increase human capabilities and options.
I’ll briefly run through some implications that seem to follow if my above picture is accepted, largely to highlight the ways in which (despite being vague in many respects) my picture is implying nontrivial things. Future pieces will go into more detail about implications for today’s world.

One more note before I go into more detail: this post generally focuses on an end goal of advanced technology being safe and broadly available - by default leaving the world’s governance relations mostly as they are (e.g., same governments overseeing the same populations), and figuring that improving on those is a task for the world as a whole rather than for Magma or AI systems specifically. This is a way of avoiding a number of possible distractions, and hopefully laying out a vision that a large number of parties can agree would be acceptable, even if they don’t find it ideal.

The roles of Magma and IAIA in the scenario

This post mostly uses the same scenario laid out previously, in which a major AI company (“Magma,” following Ajeya’s terminology) has good reason to think that it can develop transformative AI very soon (within a year), using what Ajeya calls “human feedback on diverse tasks” (HFDT) - and has some time (more than 6 months, but less than 2 years) to set up special measures to reduce the risks of misaligned AI before there’s much chance of someone else deploying transformative AI.

I’m also introducing another hypothetical actor, “IAIA⁵”: an organization, which could range from a private nonprofit to a treaty-backed international agency, that tracks⁶ transformative AI projects and takes actions to censure or shut down dangerous ones, as well as doing other things where a central, neutral body (as opposed to an AI company) can be especially useful. (Some more details on the specifics of what IAIA can do, and what sort of “power” it might have, in a footnote.⁷)

I assume throughout this post that both Magma and IAIA are “good actors” - doing what they can (in actuality, not just in intention) to achieve a positive long-run outcome for humanity - and that they see each other this way. I think assuming this relatively rosy setup is the right way to go for purposes of elucidating lots of potential strategies that might be possible. But Magma-IAIA relations could end up being less trusting and more complicated than what I portray here. In that case, Magma may end up being cautious about how it approaches IAIA, and sticking more (where necessary) to the actions that don’t require IAIA’s cooperation; conversely, IAIA may end up acting adversarially toward Magma.

As noted above, I focus on Magma and IAIA for concreteness and simplicity (not because I expect there to be only two important actors, but because my takes on what most actors should be doing can be mostly inferred from how I discuss these two).

Phase 1: before transformative AI that can safely help with “inaction risk”

I previously wrote about Magma’s “predicament” as it becomes clear that transformative AI could be developed shortly:

Magma is essentially navigating action risk vs. inaction risk:

Action risk. Say that Magma trains extremely powerful AI systems … The risk here is that (per the previous section) Magma might unwittingly train the systems to pursue some unintended goal(s), such that once the systems are able to find a path to disempowering humans and taking control of all of their resources, they do so.

So by developing and deploying transformative AI, Magma may bring about an existential catastrophe for humanity …

Inaction risk. Say that Magma’s leadership decides: “We don’t want to cause an existential catastrophe; let’s just not build AI advanced enough to pose that kind of risk.” In this case, they should worry that someone else will develop and deploy transformative AI, posing a similar risk (or arguably a greater risk -- any company/coalition that chooses to deploy powerful AI when Magma doesn’t may be less careful than Magma overall).

Magma’s goals: alignment, security, deals with other companies, producing public goods.

Magma’s central goal during this phase should be along the lines of my previous piece: working toward AI systems that are both transformative and safe, and that thus might be especially helpful for reducing “inaction risk.” (Phase 2 starts when such AI systems are available, and one can think of the goal of Phase 1 as getting to Phase 2).
Prioritizing internal security. This could matter for both (a) preventing incautious actors from “stealing” Magma’s AI systems (e.g., by stealing the weights or information about training processes); and (b) containing not-yet-aligned AI systems Magma is developing. (As AI systems advance, Magma will likely need to increase its use of AI for security.)
Deals with other companies. Magma might be able to reduce some of the pressure to “race” by making explicit deals with other companies doing similar work on developing AI systems, up to and including mergers and acquisitions (but also including more limited collaboration and information sharing agreements).
- Benefits of such deals might include (a) enabling freer information sharing and collaboration; (b) being able to prioritize alignment with less worry that other companies are incautiously racing ahead; (c) creating incentives (e.g., other labs’ holding equity in Magma) to cooperate rather than compete; and thus (d) helping Magma get more done (more alignment work, more robustly staying ahead of other key actors in terms of the state of its AI systems).
- These sorts of deals could become easier to make once Magma can establish itself as being likely to lead the way on developing transformative AI (compared to today, when my impression is that different companies have radically different estimates of which companies are likely to end up being most relevant in the long run).
Producing public goods that can help other actors better understand and reduce risks from misaligned AI, e.g.:
- Evidence about the size of the risk from misaligned AI. This could include demonstrations of AI systems’ exploiting security holes (or not doing so when some would expect them to), manipulating supervisors, etc.
- Information that other actors can use to reduce misalignment risk and fix security holes. See my previous piece for how Magma might produce this information.
- Offering trainings, briefings, etc.

IAIA’s goals: monitoring, encouraging good safety practices, encouraging good information sharing practices (including prioritizing security), sharing/disseminating public goods.

Monitoring for signs of transformative/potentially dangerous AI. Ideally IAIA would have a strong sense of the state of every major AI development project; the case that any given project is close to transformative or otherwise highly dangerous AI (this could include AI that simply speeds up AI development at first, leading to transformative AI later); and what sorts of alignment measures and safety testing are taking place in each case. Some of this could be accomplished via voluntary cooperation with a monitoring arrangement. For both voluntary and (in the case of a more formally empowered IAIA) enforced monitoring, there could be major policy, logistical and technological challenges, which I won’t go further into here but which IAIA could be working to address.
Trying to ensure that companies’ safety measures are sufficient and that they aren’t preparing to deploy potentially dangerous AI. IAIA’s measures here could include (a) discussions, statements and other “soft” (non-legally-backed) pressure; (b) “carrot” incentives (offering endorsements that can help companies recruit top talent, access more resources,⁸ etc.); (c) recommending (or even mandating as part of some regulatory framework) that AI companies be penalized or even shut down if they aren’t complying with audits, taking appropriate safety measures, etc.; (d) opportunistically looking for and working on mutually beneficial deals between particular parties, technologies that could facilitate these deals, etc.⁹
Trying to ensure that companies developing powerful AI systems are prioritizing information security and selective information sharing. As discussed below, companies should be sharing information in ways that reduce risks (e.g., “public goods”) but not indiscriminately. As with the previous point, there are a variety of ways that IAIA might try to ensure good practices here.
Serving as a hub for sharing/disseminating public goods, technical assistance, etc. IAIA could be helping to disseminate public goods such as the ones Magma is creating (above); conducting trainings; aggregating lessons learned from many companies and sharing them widely; helping to coordinate deals between different AI companies; etc.

If IAIA suspects that someone could imminently deploy systems leading to a global catastrophe, it should consider drastic actions, discussed below.

A crucial theme for Magma and IAIA: selective information sharing. Certain classes of information sharing can increase risks (e.g., if an incautious actor got access to the weights for powerful but unaligned Magma models, such that they could deploy them themselves without much further effort); others can decrease risk (e.g., insights about where likely security holes and misalignment risks come from, which could cause even incautious actors to change their training setup and patch holes).

Both Magma and IAIA should be deliberate about the pros and cons of sharing information with different parties. For example, they might want frameworks for sharing more information with cautious actors than with incautious (or otherwise dangerous¹⁰) ones. I expect that in many cases, “information about how to avoid misaligned AI” and “information about how to produce powerful AI” will overlap; so will “information about the size of the risk” and “information about how powerful AI systems are getting.” It would be best if Magma and IAIA could sometimes (based on case-by-case analysis) share this sort of “dual-use” information with other AI labs that are more “cautious” (in the sense that they’re more likely to make major efforts to reduce the risk of misaligned AI) without necessarily making it public. This leaves open how Magma and IAIA are to define and determine which actors count as sufficiently “cautious.”

Phase 2: as aligned-and-transformative AI systems become available

Hopefully, at some point it becomes possible to be confident that some of Magma’s AI systems are both very powerful and unlikely to cause catastrophe via misalignment. (A previous piece discussed what this might look like.)

This could open up new ways of reducing “inaction risk” (the risk that others deploy powerful, misaligned systems), in addition to the key actions from Phase 1 (which Magma and IAIA should be continuing and perhaps intensifying).

Magma and IAIA should both be working to deploy AI systems that can reduce the risk that other actors cause a catastrophe.

AI systems could be deployed toward the following (these were briefly mentioned previously):

Alignment.
- Magma can use safe systems to align still-more-powerful systems, in the hopes of maintaining its position as the developer with the most advanced systems.
- Magma can use safe systems to develop ever-more-robust techniques for alignment (this could include entirely new methods, as well as simply better versions of the measures discussed previously). These could then be disseminated by both Magma and IAIA as “public goods” as outlined in the previous section.
- IAIA might require (within a voluntary or enforced regime) something along the lines of: “Systems with capability level ___ need to do at least as well in alignment testing as if they were trained with the best alignment techniques developed by Magma (or anyone else).”
Security.
- Magma can offer (commercial or free) products for finding and patching cybersecurity holes (and other exploits).
- IAIA may be able to require security audits and patches for a variety of actors, including government actors, taking advantage of advanced AI systems’ security-improving capabilities.
Basic defense/deterrence/hardening (with aligned AI systems deployed throughout the economy, misaligned AI systems would have to “compete with” them on a variety of fronts to gain power).
- Magma can offer (commercial or free) products for a variety of other economic uses, which could create competition for any misaligned or otherwise dangerous¹¹ systems that might be out there (e.g., making it harder for developers of unsafe systems - and perhaps misaligned systems themselves - to make money).
- It might be possible to restrict access to products like these to states that are in compliance with a broader regulatory regime (e.g., states allowing IAIA to monitor for, and potentially shut down, non-compliant AI development projects).
Detecting and obstructing misaligned (or otherwise dangerous) AI systems.
- Magma can use its AI systems to look through data it has access to (whether via its products or because the data is public) for signs of activity by misaligned AI systems, which can then be exposed; it can offer help with using its aligned systems to obstruct activity by these systems.
- IAIA may (if it has enough authority) be able to conduct more thorough monitoring, by requiring “monitor AIs” be granted the ability to monitor for, and perhaps even obstruct, misaligned AIs in certain contexts. (This could be tough to work out the details of.)
Better enforcing agreements. IAIA may be able to use advanced, aligned AI systems to better monitor and enforce terms such as “Nobody should be developing AI models of capability level __ without passing __ safety tests” - as well as terms such as “Nobody should be deploying even aligned AI systems in illegal or otherwise highly dangerous ways.” (Note that since Phase 1 and before, IAIA has hopefully been developing the ability to do this as well as possible with or without help from advanced, aligned AI systems.)
Better understanding of the overall situation. Magma can try to develop “advisor AI” systems that can give both Magma and IAIA general help with the strategic situation - new ideas, insights about where the biggest future dangers are likely to come from (including regarding the “offense-defense balance” mentioned immediately below), analysis of terms on which different AI labs should collaborate/merge, etc.

Another important aspect of Phase 2 is trying to prevent misuse of aligned-and-transformative AI systems. I think this topic is relatively unexplored compared to reducing alignment risk, and without much to draw from, I’ll only discuss it briefly. Two key paths to reducing misuse risk could be:

Continuing to work on making advanced systems ever more capable (subject to their remaining safe), and ensuring that they are deployed mostly (or entirely) in contexts where they’re (a) subject to appropriate regulation/oversight; (b) ultimately under the jurisdiction of governments that will not use them for e.g. totalitarian ends. Magma can focus on making systems ever more capable, if it believes they’ll be deployed mostly in such contexts; IAIA (or perhaps some other body that is focused on misuse rather than alignment) can focus on developing regulations and recommendations to governments, companies, etc. to prevent AI systems from being sold, licensed, etc. in particularly dangerous contexts.
Trying to develop AI systems that have resistance to misuse “baked in.” For example, AI systems might be designed to resist helping users with illegal actions, and even with actions that are legal but considered dangerous or bad for some other reason (this could include abusive uses by governments, uses by individuals that make illegal or dangerous activity easier, etc.)¹² (The degree to which an AI system “restricts” what it can help with might vary by who’s using it, just as the kinds of weapons one can buy depend on whether one can pass a background check, whether one has an official law enforcement or military role, etc.) IAIA (or, again, another agency) can work on developing and enforcing standards for what sorts of behaviors AI systems should resist; Magma can work on designing its AI systems to do so.

A key overall hope here is that actors such as Magma can (a) roll out powerful but safe AI systems before more dangerous¹³ actors can deploy comparably advanced (and potentially less safe) systems; (b) build a substantial advantage for these systems, as the fact that they’re seen as non-dangerous leads to wide rollouts and a lot of resources for them; (c) use such systems to help with research toward still-more-advanced systems, maintaining the aggregate advantage of relatively cautious actors and their safe AI systems over more dangerous actors and their more dangerous AI systems.

That hope might not work out, for a number of reasons:

As discussed in a previous piece, alignment measures that work at first may become less effective as systems become more advanced, so it could be hard to maintain an advantage for safe systems as safety becomes more difficult to ensure.
Perhaps the “offense-defense balance” is unfavorable - that is, perhaps in a world with several times as many resources behind safe AI systems as dangerous AI systems, the dangerous AI systems would still do catastrophic damage. This could mean that as incautious or otherwise bad actors get closer to deploying dangerous AI systems, Magma and IAIA can’t rely on safer systems’ early lead and resource advantage, and need to find a way to stop ~all dangerous deployments.
Perhaps dangerous actors are simply putting more resources into, and pulling ahead on, AI development.

If it looks like things are going this way, IAIA and Magma should pursue more drastic measures, as discussed next.

Drastic measures

In either Phase 1 or Phase 2, Magma and/or IAIA might come to believe that it’s imperative to suppress deployment of dangerous AI systems worldwide - and that the measures described above can’t accomplish this.

In this case, IAIA might recommend (or authorize/mandate, depending on the full scope of its authority¹⁴) that governments around the world suppress AI development and/or deployment (as they have in the past for dangerous technologies such as nuclear, chemical and biological weapons, as well as e.g. chlorofluorocarbons) - by any means necessary, including via credible threat of military intervention, by cyberattack, etc.¹⁵ Magma might also advocate for such things, though presumably with less sway than an effective version of IAIA would have.

For this kind of scenario, it seems important that whoever is approaching key governments for this kind of intervention should have a good sense - ideally informed by strong pre-existing relationships - of how to approach them in a way that will lead to good outcomes (and not merely to some reaction like “We’d better deploy the most advanced AI systems possible before our rivals do”).

If advanced AI systems are capable of developing powerful advanced technologies - such as highly advanced surveillance, weapons, persuasion techniques, etc. - they could be used to help governments suppress deployment of dangerous systems. This would hopefully involve deploying systems that are highly likely to be safe, but it’s imaginable that Magma and IAIA should advocate for taking a real risk of deploying catastrophically misaligned AI rather than stand by as other actors deploy their systems.¹⁶ I think the latter should be a true last resort.

Phase 3: low-misalignment-risk period

A major goal of Phases 1 and 2 was to get to the point where there’s no longer significant worry about a world run by misaligned AI.

By this time, it’s possible that the world is already very unfamiliar from today’s vantage point. There may have been several rounds of “using powerful aligned AI systems to help build even more powerful aligned systems”; there may now be many (very powerful by today’s standards) AI systems in wide use, and developed by a number of actors; drastic government action may have been taken.

The world of Phase 3 faces enormous challenges. Among other things, I expect some people (and some governments) could be looking to deploy advanced technologies to seize power from others and perhaps lock in bad worlds.

It’s possible that technology will advance rapidly enough, and be destabilizing enough, that there are multiple actors making credible attempts to become an entrenched global hegemon. In such a situation:

Private actors such as Magma might focus on helping some particular beneficial coalition gain (or solidify) its dominance of global affairs, as well as lobbying this coalition to govern in a way that’s conducive to a maximally flourishing future.
IAIA (if its mandate goes beyond reducing misalignment risk) might be trying to do something like “brokering a peaceful, enforceable ‘compromise’ agreement for global governance (this could include agreeing that different parts of the world will be governed by different actors).”

Alternatively, it’s also possible that this Phase 3 world will be largely like today’s in key respects: a number of powerful governments, mostly respecting each other’s borders and governing on their own. In this world, I think the ideal focus of most AI-involved actors is to push toward AI systems’ being used to help humans become broadly more capable and more inclined to prioritize the good of all beings across the world and across time. This could include:

Designing AI systems (and advocating that they be designed) to be maximally helpful for goals like becoming more the sort of person one wishes to be.
To the extent one has a key role in developing the frontier of AI capabilities, trying to use these to advantage governments and other actors that are relatively positive forces in the world.
Following a “rowing” model: developing technologies that (at least if deployed responsibly) could increase humans’ options and capabilities, which would hopefully lead to moral progress as well (as has arguably happened in the past). As an example, I’ve written about how digital people could mean a world much better or much worse than today’s - depending on how exactly the rollout goes (example).

Implications

Future pieces will go into detail about what I think this whole picture implies for what the most helpful actions are today.

Here, I want to briefly run through some implications that seem to follow if my above picture is accepted, largely to highlight the ways in which (despite being vague in many respects) my picture is “sticking its neck out” and implying nontrivial things.

If we figure that something like the above guidelines (and “success stories” I’ll outline in a future piece) give us our best shot at a good outcome, this implies that:

Working toward the creation of something like IAIA should arguably have started already (although I don’t personally think it would be a good idea to push for an actual regulatory body now). Many of the key levers above - particularly monitoring for dangerous AI projects and pressuring them to improve or getting them censured or shut down, as well as advocating to governments for (or requiring) drastic measures - would go much better with a strong, effective IAIA. Steps toward that end today might include:
- Developing frameworks for what it might look like to monitor AI projects globally (technically, logistically, politically).
- Developing frameworks for how to tell how dangerous an AI system is (how powerful it is, how likely it is to be aligned vs. misaligned).
- Developing key talent, e.g. people who have experience doing things like “aggregating information from multiple AI labs about the state of their research, and distilling it into useful information that could potentially inform both AI labs and governments; ideally building credibility with both.”
- Other actions could imaginably be taken as well (e.g., trying to create an actual IAIA with some degree of formal policy authority in particular countries), but my view is that this is unlikely to be helpful before there’s more progress on the above.
As I argued previously, putting a lot of extra effort into reducing misalignment risk could be very important. This isn’t just about developing approaches to alignment; it’s substantially about choosing to prioritize and invest in already-known approaches to alignment such as accurate reinforcement, adversarial training, AI checks and balances and rigorous testing. So the amount of effort leading AI labs are ready to put in - even while facing commercial pressures and risks that others will deploy first, and not having definitive evidence of how significant misalignment risk is - could be crucial.
A dynamic I think is central in the most optimistic scenarios (such as the “success stories” I’ll outline in a future piece) is that cautious actors have the most advanced systems, and time to de-risk them before incautious actors deploy comparably advanced systems. Actions that lead to a proliferation of (and lots of resources for) incautious actors would (in my opinion) make success quite a bit harder to picture.
This means that projects seeking to develop powerful AI need to walk a sort of tightrope between being on the cutting edge of AI capabilities and investing a lot in things that don’t directly help with this, but could reduce misalignment risk. Simultaneously, governments need to walk a tightrope between avoiding the deployment of dangerous systems and heading off this threat from other countries.
It generally seems like we need some sort of counter-force to normal market dynamics. If companies are exclusively focused on commercialization, and racing each other to deploy the most impressive products possible, then I think there’s a substantial risk that they end up with “safe enough for commercialization, but ultimately catastrophic” AI systems. Furthermore, some companies doing this could make it harder for everyone to invest heavily in measures to reduce risk of misaligned AI (since such investment could cause them to “fall behind” in terms of having the most advanced systems). In a world where we don’t have a strong IAIA, this could mean that AI companies need to do a lot of unusual things to avoid a race to commercialize.
Selective information sharing could be extremely important. AI labs should probably be building a number of relevant - and not “default” - capabilities today, such as:
- Extremely strong information security, probably including significant red-teaming.
- Legal frameworks for sharing information with trusted parties (including competitors) but not the general public.
- Internal frameworks for restricting the dissemination of new capabilities, especially when they are exciting and could lead to hype and entry of incautious actors.
Governments may need to take drastic actions, e.g. heavily regulating the use of AI - although I doubt it would be productive for them to do so today. To the extent it’s possible to start laying the groundwork for such things without immediately advocating them,¹⁷ this could be important.
Outreach and advocacy, especially to competitors and governments, could be extremely important for AI labs (especially in the absence of a strong IAIA). AI labs should probably be building relevant capacities and relationships today; they should know which parties are particularly likely to be cautious and trustworthy, and should know whom they’d approach (with expectations of a fair hearing and a reasonable response) if they came to believe that transformative AI was around the corner.
Particular applications of AI systems seem especially important for reducing risk of misaligned AI, and AI labs should potentially be focusing on these sorts of applications today. These include finding and patching security holes, pointing out unintended behaviors from other AI systems, and helping with general alignment research. Demonstrating unintended behaviors at an early stage (e.g., via training on underhanded C) could also be valuable.
The testing part of the AI development process seems extremely high-stakes and easy to get wrong. I think the “success stories” I outline in a future piece will illustrate where I’m coming from here: the situation looks a lot better if Magma’s systems pass key safety tests on the first try.
Designing AI systems to resist misuse could be important, and the kind of thing AI labs ought to be investing in on both technical fronts (e.g., training AI systems not to provide help with dangerous activities) and other fronts (e.g., defining “dangerous activities”). I believe this work interests some of today’s AI labs, but it seems to get little attention in a lot of discussions about AI risk.¹⁸
The situation overall looks extremely challenging, with lots of different threats and conflicting incentives that the world doesn’t seem to have strong frameworks for handling today. Key actors need to balance “action risk” (deploying dangerous systems) against “inaction risk” (falling behind as others do so), and need to contend simultaneously with alignment risk and misuse risk.

Overall, my picture stands in significant contrast to what I perceive as a somewhat common view that alignment is purely a technical problem, “solvable” by independent researchers. In my picture, there are a lot of moving parts and a lot of room for important variation in how leading AI labs behave, beyond just the quality of the schemes they generate for reducing misalignment risk. My picture also stands in significant contrast to another common view, which I might summarize as “We should push forward with exciting AI applications as fast as possible and deploy them as widely as possible.” In my view, the range of possible outcomes is wide - as is the range of important inputs along technical, strategic, corporate and political dimensions.

Thanks to Paul Christiano, Allan Dafoe, Daniel Kokotajlo, Jade Leung, Cullen O'Keefe, Carl Shulman, Nate Soares and especially Luke Muehlhauser for particularly in-depth comments on drafts.

Notes

This doesn’t mean the whole situation discussed in this post plays out in a span of 6 months to 2 years. It just means that there isn’t much chance of someone deploying comparably transformative systems to Magma’s first transformative systems within that amount of time. Much of this piece has Magma making attempts to “stay ahead” of others, such that the scenario could take longer to play out. ↩
A hypothetical International AI Agency (name inspired by IAEA). Pronunciation guide here. ↩
Monitoring would be with permission and assistance in the case where IAIA is a private nonprofit, i.e., in this case AI companies would be voluntarily agreeing to be monitored. ↩
I don’t like the framing of “solving” “the” alignment problem. I picture something like “Taking as many measures as we can (see previous post) to make catastrophic misalignment as unlikely as we can for the specific systems we’re deploying in the specific contexts we’re deploying them in, then using those systems as part of an ongoing effort to further improve alignment measures that can be applied to more-capable systems.” In other words, I don’t think there is a single point where the alignment problem is “solved”; instead I think we will face a number of “alignment problems” for systems with different capabilities. (And I think there could be some systems that are very easy to align, but just not very powerful.) So I tend to talk about whether we have “systems that are both aligned and transformative” rather than whether the “alignment problem is solved.” ↩
A hypothetical International AI Agency (name inspired by IAEA). Pronunciation guide here. ↩
Monitoring would be with permission and assistance in the case where IAIA is a private nonprofit, i.e., in this case AI companies would be voluntarily agreeing to be monitored. ↩
There’s a wide variety of possible powers for IAIA. For most of this post, I tend to assume that it is an agency designed for flexibility and adaptiveness, not required or enabled to execute any particular formal scheme along the lines of “If verifiable event X happens, IAIA may/must take pre-specified action Y.”
Instead, IAIA’s central tool is its informal legitimacy. It has attracted top talent and expertise, and when it issues recommendations, the recommendations are well-informed, well-argued, and commonly seen as something governments should follow by default.
In the case where IAIA has official recognition from governments or international bodies, there may be various formal provisions that make it easier for governments to quickly take IAIA’s recommendations (e.g., Congressional pre-authorizations for the executive branch to act on formal IAIA recommendations). ↩
E.g., it’s imaginable that large compute providers could preferentially provide compute to IAIA-endorsed organizations. ↩
One example (also mentioned in a later footnote): key AI-relevant chips could have features to enable others to monitor their utilization, or even to shut them down in some circumstances, and parties could make deals giving each other access to these mechanisms (somewhat in the spirit of the Treaty on Open Skies). ↩
E.g., actors who seem likely to use any aligned AI systems for dangerous purposes. ↩
E.g., aligned AI systems that people are trying to use for illegal and/or highly dangerous activities. ↩
For example, if the “offense-defense balance” is such that an individual might be able to ask an AI system to design powerful weapons with which they could successfully threaten governments, AI systems might be trained not to help with this sort of goal. There is a nonobvious line to be drawn here, because AI systems shouldn’t necessarily e.g. refuse to help individuals work on developing better clean energy technology, which could be relevant for weapons development.
This line doesn’t have to be drawn algorithmically - it could be based on human judgments about what sorts of AI assistance constitute “helping with illegal activity or excessive power gain” - but who gets to make those judgments, and how they make them, is still a hairy area with a lot of room for debate and judgment calls. ↩
Whether due to less caution about alignment, or for other reasons ↩
See previous section for some discussion of how exactly IAIA’s authority might work. ↩
One speculative possibility could be for IAIA and others to push for key AI-relevant chips to have features enabling others to monitor their utilization, or even to shut them down in some circumstances. ↩
The basic reasoning might be: “The systems we have carry a real chance of causing global catastrophe, but if we stand by, others will deploy systems that are even more likely to.” I think it’s worth having a high bar for making such a call, as a given actor might naturally be biased toward thinking the world is better off with them acting first. ↩
I think there are quite a few things one can do to “lay the groundwork” for future policy changes; some of them are gestured at in this Open Philanthropy blog post from 2013. I expect a given policy change to be much easier if many of the pros and cons have already been analyzed, the details have already been worked out, and there are a number of experts working in governments and at government-advising organizations (e.g., think tanks) who can give good advice on it; all of these are things that can be worked on in advance. ↩
One casual conversation I had with an AI researcher implied that training AI systems to refuse dangerous requests could be relatively easy, but it also seems relatively easy (by default) for others to train this behavior back out via fine-tuning. It might be interesting to explore AI system designs that would be hard to use in unintended ways without hugely expensive re-training. ↩
