Distinguishing AI takeover scenarios

Sam Clarke; Sammy Martin

Epistemic status: lots of this involves interpreting/categorising other people’s scenarios, and could be wrong. We’d really appreciate being corrected if so. [ETA: so far, no corrections.]

TLDR: see the summary table.

In the last few years, people have proposed various AI takeover scenarios. We think this type of scenario building is great, since there are now more concrete ideas of what AI takeover could realistically look like. That said, we have been confused for a while about how the different scenarios relate to each other and what different assumptions they make. This post might be helpful for anyone who has similar confusions.

We focus on explaining the differences between seven prominent scenarios: the ‘Brain-in-a-box’ scenario, ‘What failure looks like’ part 1 (WFLL 1), ‘What failure looks like’ part 2 (WFLL 2), ‘Another (outer) alignment failure story’ (AAFS), ‘Production Web’, ‘Flash economy’ and ‘Soft takeoff leading to decisive strategic advantage’. While these scenarios do not capture all of the risks from transformative AI, participants in a recent survey aimed at leading AI safety/governance researchers estimated the first three of these scenarios to cover 50% of existential catastrophes from AI.^[1]

We plan to follow up with a subsequent post, which discusses some of the issues raised here in greater depth.

Variables relating to AI takeover scenarios

We define AI takeover to be a scenario where the most consequential decisions about the future get made by AI systems with goals that aren’t desirable by human standards.

There are three variables which are sufficient to distinguish the takeover scenarios discussed in this post. We will briefly introduce these three variables, and a number of others that are generally useful for thinking about takeover scenarios.

Key variables for distinguishing the AI takeover scenarios in this post:

Speed. Is there a sudden jump in AI capabilities over a very short period (i.e. much faster than what we would expect from extrapolating past progress)?
Uni/multipolarity. Is there a single AI system that takes over, or many?
Alignment. What (misaligned) goals are pursued by AI system(s)? Are they outer or inner misaligned?

Other variables:

Agency. How agentic are the AI(s) that take over? Do the AIs have large-scale objectives over the physical world and can they autonomously execute long-term plans to reach those objectives?
Generality. How generally capable are the AI(s) that take over? (vs. only being capable in specific narrow domains)
Competitive pressures. To what extent do competitive pressures (incentives to develop or deploy existentially risky systems in order to remain competitive) cause or exacerbate the catastrophe?^[2]
Irreversibility mechanism. When and how does the takeover become irreversible?
Homogeneity/heterogeneity of AIs. In the scenarios that involve multiple AI systems, how similar are the different systems (in learning algorithms, finetuning data, alignment, etc.)?^[3]
Interactions between AI systems. In the scenarios that involve multiple AI systems, do we see strong coordination between them, or conflict?^[4]

Note that the scenarios we consider do not differ on the dimensions of agency and generality: they all concern takeovers by highly agentic, generally capable AIs, including ‘What failure looks like’ part 1. We state these two dimensions here only for completeness.^[5]

Clarifying what we mean by outer and inner alignment

Recently, there has been some discussion about how outer and inner alignment should be defined (along with related terms like objective and capability robustness). In this post, we roughly take what has become known as the ‘objective-focused approach’, whilst also taking into account Richard Ngo’s arguments that it is not actually clear what it means to implement a “safe” or “aligned” objective function.

Outer alignment is a property of the objective function used to train the AI system. We treat outer alignment as a continuum. An objective function is outer aligned to the extent that it incentivises or produces the behaviour we actually want from the AI system.

Inner alignment is a property of the objective which the AI system actually has.^[6] This objective is inner aligned to the extent that it is aligned with, or generalises ‘correctly’ from, the objective function used to train the system.

(If you’re new to this distinction between outer and inner alignment, you might wonder why an AI system wouldn’t always just have the same objective as the one used to train it. Here is one intuition: if the training environment contains subgoals (e.g. ‘gaining influence or resources’) which are consistently useful for scoring highly on the training objective function, then the training process may select for AI systems which care about those subgoals in ways that ultimately end up being adversarial to humans (e.g. ‘gaining influence at the expense of human control’). Human evolution provides another intuition: you could think of evolution as a training process that led to inner misalignment, because humans care about goals other than just maximising our genetic fitness.)

Summary table

This table summarises how the scenarios discussed in this post differ, according to the three key variables above. You can find a higher resolution version here.
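As a rough sketch, the same classification can be written down as a small data structure. The entries below are reconstructed from the scenario descriptions in this post rather than copied from the table, and the class name, field names and one-word labels are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One takeover scenario, described by the three key variables."""
    name: str
    speed: str      # "fast" or "slow"
    polarity: str   # "unipolar" or "multipolar"
    alignment: str  # "outer", "inner", or "outer or inner"


# Labels follow the descriptions given in the sections of this post.
SCENARIOS = [
    Scenario("Outer-misaligned brain-in-a-box", "fast", "unipolar", "outer"),
    Scenario("Inner-misaligned brain-in-a-box", "fast", "unipolar", "inner"),
    Scenario("Flash economy", "fast", "multipolar", "outer"),
    Scenario("WFLL 1", "slow", "multipolar", "outer"),
    Scenario("AAFS", "slow", "multipolar", "outer"),
    Scenario("Production Web", "slow", "multipolar", "outer"),
    Scenario("WFLL 2", "slow", "multipolar", "inner"),
    Scenario("Soft takeoff leading to DSA", "slow", "unipolar", "outer or inner"),
]


def select(speed=None, polarity=None, alignment=None):
    """Return the names of scenarios matching every filter that is given."""
    return [
        s.name
        for s in SCENARIOS
        if (speed is None or s.speed == speed)
        and (polarity is None or s.polarity == polarity)
        and (alignment is None or s.alignment == alignment)
    ]


print(select(speed="fast", polarity="multipolar"))
print(select(speed="slow", alignment="outer"))
```

WFLL 1, AAFS and Production Web share the same coarse labels here. They differ in which misaligned goals are pursued, i.e. in how far the training objective is from what we want, which these one-word labels do not capture.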

We'll now go on to explaining and illustrating the differences between the scenarios in more detail. For clarity, we divide our discussion into slow scenarios and fast scenarios, following Critch. In the slow scenarios, technological progress is incremental, whereas in fast scenarios there is a sudden jump in AI capabilities over a very short period.

Fast scenarios

Outer-misaligned brain-in-a-box scenario

This is the ‘classic’ scenario that most people remember from reading Superintelligence (though the book also features many other scenarios).

A single highly agentic AI system rapidly becomes superintelligent on all human tasks, in a world broadly similar to today.

The objective function used to train the system (e.g. ‘maximise production’) doesn’t push it to do what we really want, and the system’s goals match the objective function.^[7] In other words, this is an outer alignment failure. Competitive pressures aren’t especially important, though they may have encouraged the organisation that trained the system to skimp on existential safety/alignment, especially if there was a race dynamic leading up to the catastrophe.

The takeover becomes irreversible once the superintelligence has undergone an intelligence explosion.

Inner-misaligned brain-in-a-box scenario

Another version of the brain-in-a-box scenario features inner misalignment, rather than outer misalignment. That is, a superintelligent AGI could develop some arbitrary objective that arose during the training process. This could happen for the reason given above (there are subgoals in the training environment that are consistently useful for doing well in training, but which generalise to be adversarial to humans), or simply because some arbitrary influence-seeking model just happened to arise during training, and performing well on the training objective is a good strategy for obtaining influence.

We suspect most people who find the ‘brain-in-a-box’ scenario plausible are more concerned by this inner misalignment version. For example, Yudkowsky claims to be most concerned about a scenario where an AGI learns to do something random (rather than one where it ‘successfully’ pursues some misspecified objective function).

It is not clear whether the superintelligence being inner- rather than outer-misaligned has any practical impact on how the scenario would play out. An inner-misaligned superintelligence would be less likely to act in pursuit of a human-comprehensible final goal like ‘maximise production’, but since in either case the system would be both strongly influence-seeking and capable of seizing a decisive strategic advantage, the details of what it would do after seizing the decisive strategic advantage probably wouldn’t matter. Perhaps, if the AI system is outer-misaligned, there is an increased possibility that a superintelligence could be blackmailed or bargained with, early in its development, by threatening its (more human-comprehensible) objective.

Flash economy

This scenario, described by Critch, can be thought of as one multipolar version of the outer-misaligned ‘brain-in-a-box’ scenario. After a key breakthrough is made which enables highly autonomous, generally capable, agentic systems with long-term planning capability and advanced natural language processing, several such systems become superintelligent over the course of several months. This jump in capabilities is unprecedentedly fast, but ‘slow enough’ that capabilities are shared between systems (enabling multipolarity). At some point in the scenario, groups of systems reach an agreement to divide the Earth and space above it into several conical sectors, to avoid conflict between them (locking in multipolarity).

Each system becomes responsible for a large fraction of production within a given industry sector (e.g., material production, construction, electricity, telecoms). The objective functions used to train these systems can be loosely described as “maximising production and exchange” within their industry sector. The systems are “successfully” pursuing these objectives (so this is an outer alignment failure).

In the first year, things seem wonderful from the perspective of humans. As economic production explodes, a large fraction of humanity gains access to free housing, food, probably a UBI, and even many luxury goods. Of course, the systems are also strategically influencing the news to reinforce this positive perspective.

By the second year, we have become thoroughly dependent on this machine economy. Any states that try to slow down progress rapidly fall behind economically. The factories and facilities of the AI systems have now also become very well-defended, and their capabilities far exceed those of humans. Human opposition to their production objectives is now futile. By this point, the AIs have little incentive to preserve humanity’s long-term well-being and existence. Eventually, resources critical to human survival but non-critical to machines (e.g., arable land, drinking water, atmospheric oxygen) gradually become depleted or destroyed, until humans can no longer survive.

Slow scenarios

We’ll now describe scenarios where there is no sudden jump in AI capabilities. We’ve presented these scenarios in an order that illustrates an increasing ‘degree’ of misalignment. In the first two scenarios (WFLL 1 and AAFS), the outer-misaligned objective functions are somewhat close to what we want: they produce AI systems that are trying to make the world look good according to a mixture of feedback and metrics specified by humans. Eventually, this still results in catastrophe because once the systems are sufficiently powerful, they can produce much more desirable-looking outcomes (according to the metrics they care about), much more easily, by controlling the inputs to their sensors instead of actually making the world desirable for humans. In the third scenario (Production Web), the ‘degree’ of misalignment is worse: we just train systems to maximise production (an objective that is further from what we really want), without even caring about approval from their human overseers. The fourth scenario (WFLL 2) is worse still: the AIs have arbitrary objectives (due to inner alignment failure) and so are even more likely to take actions that aren’t desirable by human standards, and likely do so at a much earlier point. We explain this in more detail below.

The fifth scenario doesn’t follow this pattern: instead of varying the degree of misalignment, this scenario demonstrates a slow, unipolar takeover (whereas the others in this section are multipolar). There could be more or less misaligned versions of this scenario.

What failure looks like, part 1 (WFLL 1)

In this scenario, described by Christiano, many agentic AI systems gradually increase in intelligence and generality, and are deployed increasingly widely across society to do important tasks (e.g., law enforcement, running companies, manufacturing and logistics).

The objective functions used to train them (e.g., ‘reduce reported crimes’, ‘increase reported life satisfaction’, ‘increasing human wealth on paper’) don’t push them to do what we really want (e.g., ‘actually prevent crime’, ‘actually help humans live good lives’, ‘increasing effective human control over resources’) - so this is an outer alignment failure.

The systems’ goals match these objectives (i.e. are ‘natural’ or ‘correct’ generalisations of them). Competitive pressures (e.g., strong economic incentives, an international ‘race dynamic’, etc.) are probably necessary to explain why these systems are being deployed across society, despite some people pointing out that this could have very bad long-term consequences.

There’s no discrete point where this scenario becomes irreversible. AI systems gradually become more sophisticated, and their goals gradually gain more influence over the future relative to human goals. In the end, humans may not go extinct, but we have lost most of our control to much more sophisticated machines (this isn’t really a big departure from what is already happening today - just imagine replacing today’s powerful corporations and states with machines pursuing similar objectives).

Another (outer) alignment failure story (AAFS)

This scenario, also described by Christiano, is initially similar to WFLL 1. AI systems slowly increase in generality and capability and become widely deployed. The systems are outer misaligned: they pursue natural generalisations of the poorly chosen objective functions they are trained on. This scenario is more specific about exactly what objectives the systems are pursuing: they are trying to ensure that the world looks good according to some kind of (augmented) human judgment (the systems are basically trained according to the regime described in An unaligned benchmark).

Problems arise along the way, when systems do things that look good but aren’t actually good (e.g. a factory colludes with the auditors valuing its output, giving a great quarterly report that didn’t actually correspond to any revenue). Such problems tend to be dealt with via short-term fixes - improving sensor coverage to check mistakes (e.g. in a way that reveals collusion between factories and auditors) or tweaking reward functions (e.g. to punish collusion between factories and auditors). This leads to a false sense of security initially. But as the pace of AI progress accelerates and we still don’t know how to train AI systems to actually help us, we eventually have extremely powerful systems, widely deployed across society, which are pursuing proxy goals that come apart from what we actually want. Specifically: ‘ensuring things look good according to human judgment’ eventually means fooling humans and carefully controlling what gets fed into the sensors, because the AIs can produce much more desirable-looking outcomes, much more easily, by controlling the sensors instead of actually making the world good. Eventually, all humans will either be killed or totally disempowered, because this is the best way of making sure the systems’ objectives are maximally positive and will remain that way forever.

To explain exactly how this scenario differs from WFLL 1, consider that outer (mis)alignment can be viewed as a continuum: ‘how’ misspecified is the objective function used to train the AI system? On one extreme, we have objective functions that do really well at producing or incentivising the behaviour we actually want from AI systems (e.g. a reward function trained using iterated amplification to reward systems to the extent that they try to help their operators). On the other extreme, we have objective functions that don't capture anything we value (e.g. ‘maximise paperclips’).

We find it helpful to think about the objective functions used in training as specifying a ‘sensory window’ through which the system being trained views the world (you could even think of it as a huge set of camera feeds). This window will probably be defined by a bunch of human feedback, along with other metrics (e.g. GDP, inflation, unemployment, approval ratings). The training process is selecting for AI systems that make this sensory window look ‘good’ according to feedback and desired values for those metrics.

Bringing these ideas together: the better defined this ‘sensory window’ (i.e. the more outer aligned the objective function is), the better things will look from the human perspective. In WFLL 1, the sensory window is very large, rich and well-defined, such that even as AI systems gain more and more influence relative to humans, the world continues to look pretty good to us. In AAFS, the sensory window is smaller and less well-defined, such that it’s eventually easy for systems to seize their sensors and kill or disempower any humans who try to stop them.

This has a few practical implications for how AAFS plays out, compared to WFLL 1.

First, in WFLL 1, there is a decent chance (maybe 50:50), that AI systems will leave some humans alone (though still mostly disempowered). This is because the sensory window was so well-defined that it was too hard for AI systems to cause extinction without it showing up on their sensors and metrics. In AAFS, this is much less likely, because the sensory window is easier to fool.

Second, in AAFS, the point of no return will happen sooner than in WFLL 1. This is because it will require a lower level of capabilities for systems to take control without it showing up on the (more poorly defined) sensory window.

Third, in AAFS, warning shots (i.e. small- or medium-scale accidents caused by alignment failures, like the ‘factory colludes with auditors’ example above) are more likely and/or severe than in WFLL 1. This is because more possible accidents will not show up on the (more poorly defined) sensory window.^[8] A further implication here is that competitive pressures probably need to be somewhat higher - or AI progress somewhat faster - than in WFLL 1, to explain why we don’t take steps to fix the problem before it’s too late.

The next scenario demonstrates what happens when the objective function/sensory window is even closer to the bad extreme.

Production Web

Critch’s Production Web scenario is similar to WFLL 1 and AAFS, except that the objective functions used to train the systems are even less outer aligned. Specifically, the systems are trained to ‘maximise productive output’ or some similarly crude measure of success. This measure defines an even narrower sensory window onto the world than for the systems in WFLL 1 and AAFS - it isn’t even superficially aligned with what humans want (the AI systems are not trying to optimise for human approval at all).

‘Maximising productive output’ eventually means taking steps that aren’t desirable from the human perspective (e.g. using up resources critical to human survival but non-critical to machines, like drinking water and atmospheric oxygen).

The implications of this even more (outer) misaligned objective follow the same pattern we described when comparing AAFS with WFLL 1. In the ‘Production Web’ scenario:

Human extinction is the only likely outcome (keeping humans alive becomes counterproductive to maximising productive output).
The point of no return will happen even sooner (AI systems will start e.g. using up resources critical to human survival but non-critical to machines as soon as they are capable enough to ensure that humans cannot stop them, rather than having to wait until they are capable enough to manipulate their sensors and human overseers).
Warning shots will be even more likely/severe (since their objectives are more misaligned, fewer possible accidents will be punished).
Competitive pressures therefore need to be even higher.
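The ordering of the points of no return across WFLL 1, AAFS and Production Web can be illustrated with a deliberately crude toy model. Everything in it is an assumption made for illustration: capability grows by a fixed amount per time step, and each scenario is assigned a hypothetical capability level that systems need before they can take control without being stopped. The numbers carry no meaning beyond their ordering.

```python
def point_of_no_return(required_capability, growth_per_step=1, max_steps=1000):
    """Return the first time step at which capability reaches the required level.

    Capability starts at 0 and grows linearly (an assumption of this toy model).
    Returns None if the level is not reached within max_steps.
    """
    capability = 0
    for step in range(max_steps + 1):
        if capability >= required_capability:
            return step
        capability += growth_per_step
    return None


# Hypothetical capability levels needed to take control in each scenario.
# A richer, better-defined sensory window is harder to fool, so it needs more.
REQUIRED_CAPABILITY = {
    "WFLL 1": 9,          # large, well-defined sensory window
    "AAFS": 6,            # smaller, less well-defined sensory window
    "Production Web": 3,  # only needs to ensure humans cannot stop it
}

timeline = {
    name: point_of_no_return(level)
    for name, level in REQUIRED_CAPABILITY.items()
}
print(timeline)
```

The sketch only restates the argument: the less the objective constrains what systems must keep looking good, the lower the capability needed for takeover, and so the earlier the point of no return under the same rate of progress.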

Another point of comparison: you can also view this scenario as a slower version of the Flash Economy, meaning that there is more opportunity for incremental progress on AI alignment or improved regulation to stop the takeover.

Further variants of slow, outer-alignment failure scenarios

If systems don’t develop coherent large-scale goals over the physical world, then the failures might take the form of unorganized breakdowns or systems ‘wireheading’ themselves (i.e. trying to maximise the contents of their reward memory cell) without attempting to seize control of resources.

We can also consider varying the level of competitive pressure. The more competitive pressure there is, the harder it becomes to coordinate to prevent the deployment of dangerous technologies. Especially if there are warning shots (i.e. small- or medium-scale accidents caused by alignment failures), competitive pressures must be unusually intense for potentially dangerous TAI systems to be deployed en masse.

We could also vary the competence of the technical response in these scenarios. The more we attempt to ‘patch’ outer misalignment with short-term fixes (e.g., giving feedback to make the systems’ objectives closer to what we want, or to make their policies more aligned with their objectives), the more likely we are to prevent small-scale accidents. The effect of this mitigation depends on how ‘hackable’ the alignment problem is: perhaps this kind of incremental course correction will be sufficient for existentially safe outcomes. But if it isn’t, then all we would be doing is deferring the problem to a world with even more powerful systems (increasing the stakes of alignment failures), and where inner-misaligned systems have been given more time to arise during the training process (increasing the likelihood of alignment failures). So in worlds where the alignment problem is much less ‘hackable’, competent early responses tend to defer bad outcomes into the future, and less competent early responses tend to result in an escalating series of disasters (which we could hope leads to an international moratorium on AGI research).

What failure looks like, part 2 (WFLL 2)

Described by Christiano and elaborated further by Joe Carlsmith, this scenario sees many agentic AI systems gradually increase in intelligence, and be deployed increasingly widely across society to do important tasks, just like WFLL 1.

But then, instead of learning some natural generalisation of the (poorly chosen) training objective, there is an inner alignment failure: the systems learn some unrelated objective(s) that arise naturally in the training process, i.e. are easily discovered in neural networks (e.g. “don't get shut down”).^[9] The systems seek influence as an instrumental subgoal (since with more influence, a system is more likely to be able to e.g. prevent attempts to shut it down).^[10] Early in training, the best way to do that is by being obedient (since the systems know that disobedient behaviour would get them shut down). Then, once the systems become sufficiently capable, they attempt to acquire resources and power to more effectively achieve their goals.

Takeover becomes irreversible during a period of heightened vulnerability (a conflict between states, a natural disaster, a serious cyberattack, etc.) before systems have undergone an intelligence explosion. This could look like a “rapidly cascading series of automation failures: a few automated systems go off the rails in response to some local shock. As those systems go off the rails, the local shock is compounded into a larger disturbance; more and more automated systems move further from their training distribution and start failing.” After this catastrophe, “we are left with a bunch of powerful influence-seeking systems, which are sophisticated enough that we can probably not get rid of them”.

Compared to the slow outer-alignment failure scenarios, the point of no return in this scenario will be even sooner (all else being equal), because AIs don’t need to keep things looking good according to their somewhat human-desirable objectives (which takes more sophistication) - they just need to be able to make sure humans cannot take back control. The point of no return will probably be even sooner if the AIs all happen to learn similar objectives, or have good cooperative capabilities (because then they will be able to pool their resources and capabilities, and hence be able to take control from humans at a lower level of individual capability).

You could get a similar scenario where takeover becomes irreversible without any period of heightened vulnerability, if the AI systems are capable enough to take control without the world being chaotic.

Soft takeoff leads to decisive strategic advantage

This scenario, described by Kokotajlo, starts off much like ‘What failure looks like’. Many general agentic AI systems get deployed across the economy, and are misaligned to varying degrees. AI progress is much faster than today, but there is no sudden jump in AI capabilities. Each system has some incentive to play nice and obey governing systems. However, then one particular AI is able to buy more computing hardware and invest more time and resources into improving itself, enabling it to do more research and pull further ahead of its competition, until it can seize a decisive strategic advantage and defeat all opposition. This would look a lot like the ‘brain-in-a-box’ superintelligence scenario, except it would be occurring in a world that is already very different to today. The system that takes over could be outer or inner misaligned.

Thanks to Jess Whittlestone, Richard Ngo and Paul Christiano for helpful conversations and feedback. This post was partially inspired by similar work by Kaj Sotala. All errors are our own.

1. That is, the median respondent’s total probability on these three scenarios was 50%, conditional on an existential catastrophe due to AI having occurred.
2. Some of the failure stories described here must assume the competitive pressures to deploy AI systems are unprecedentedly strong, as was noted by Carlsmith. We plan to discuss the plausibility of these assumptions in a subsequent post.
3. We won't discuss this variable in this post, but it has important consequences for the level of cooperation/conflict between TAI systems.
4. How these scenarios are affected by varying the level of cooperation/conflict between TAI systems is outside the scope of this post, but we plan to address it in a future post.
5. We would welcome more scenario building about takeovers by agentic, narrow AI (which don’t seem to have been discussed very much). Takeovers by non-agentic AI, on the other hand, do not seem plausible: it’s hard to imagine non-agentic systems - which are, by definition, less capable than humans at making plans for the future - taking control of the future. Whether and how non-agentic systems could nonetheless cause an existential catastrophe is something we plan to address in a future post.
6. You can think about the objective that an AI system actually has in terms of its behaviour or its internals.
7. We think an important, underappreciated point about this kind of failure (made by Richard Ngo) is that the superintelligence probably doesn’t destroy the world because it misunderstands what humans want (e.g. by interpreting our instructions overly literally) - it probably understands what humans want very well, but doesn’t care, because it ended up having a goal that isn’t desirable by our standards (e.g. ‘maximise production’).
8. This does assume that systems will be deployed before they are capable enough to anticipate that causing such ‘accidents’ will get them shut down. Given there will be incentives to deploy systems as soon as they are profitable, this assumption is plausible.
9. So for this failure scenario, it isn’t crucial whether the training objective was outer aligned.
10. Of course, not all arbitrarily chosen objectives, and not all training setups, will incentivise influence-seeking behaviour, but many will.

Some points that didn't fit into the main post:

While these scenarios do not capture all of the risks from transformative AI, participants in a recent survey aimed at leading AI safety/governance researchers estimated the first three of these scenarios to cover 50% of existential catastrophes from AI.

The full survey results break down as 16% 'Superintelligence' (i.e. some version of 'brain-in-a-box'), 16% WFLL 2 and 18% WFLL 1, for a total of 49% of the probability mass explicitly covered by our report. (Note that these are all means of distributions over different probabilities. Adding the overall distributions and then taking the mean gives a probability of 49%, which differs from directly adding the means of each distribution.)

A further 26% covers risks that aren't AI takeover (War and Misuse), and 25% is 'Other'.

(Remember, all these probabilities are conditional on an existential catastrophe due to AI having occurred.)

After reading descriptions of the 'Other' scenarios given by survey respondents, at least a few were explicitly described as variations on 'Superintelligence', WFLL 2 or WFLL 1. In this post, we discuss various ways of varying these scenarios, which overlap with some of these descriptions.

Therefore, this post captures more than 50% but less than 75% of the total probability mass assigned by respondents of the survey to AI X-risk scenarios (probably closer to 50% than 75%).
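The bookkeeping behind these bounds can be checked in a few lines. This sketch only re-adds the percentages quoted above; the variable names are ours, and the upper bound assumes, hypothetically, that every 'Other' scenario turned out to be a variation on one of the three covered scenarios.

```python
# Mean probabilities in percent, as quoted above. All are conditional on an
# existential catastrophe due to AI having occurred.
covered_means = {"Superintelligence": 16, "WFLL 2": 16, "WFLL 1": 18}
reported_covered_total = 49  # mean of the summed distributions, as reported
non_takeover = 26            # War and Misuse
other = 25

# Directly adding the three quoted means gives 50, one point above the
# reported 49.
sum_of_means = sum(covered_means.values())

# The reported categories account for all of the probability mass.
grand_total = reported_covered_total + non_takeover + other

# Bounds on the share covered by this post: none of 'Other' versus all of it.
lower_bound = sum_of_means
upper_bound = sum_of_means + other

print(sum_of_means, grand_total, lower_bound, upper_bound)
```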

(Note, this data is taken from a preprint of a full paper on the survey results, Existential Risks from AI: A Survey of Expert Opinion by Alexis Carlier, Sam Clarke, and Jonas Schuett.)

Soft takeoff leads to decisive strategic advantage

The likelihood of a single-agent takeover after TAI is widely available is hard to assess. If widely deployed TAI makes progress much faster than today, such that one year of technological 'lead time' over competitors is like 100 years of advantage in today's world, we might expect that any project which can secure a 1-year technological lead would be in a position to secure a unipolar outcome.

On the other hand, if we treat the faster growth regime post-TAI as being a uniform ‘speed-up’ of the entirety of the economy and society, then securing a 1-year technological lead would be exactly as hard as securing a 100-year lead in today’s world, so a unipolar outcome would end up just as unlikely as in today's world.

The reality will be somewhere between these two extremes.

We would expect a faster takeoff to accelerate AI development by more than it accelerates the speed at which new AI improvements can be shared (since this last factor depends on the human economy and society, which aren't as susceptible to technological improvement).

Therefore, faster takeoff does tend to reduce the chance of a multipolar outcome, although by a highly uncertain amount, which depends on how closely we can model the speed-up during AI takeoff as a uniform acceleration of everything vs changing the speed of AI progress while the rest of the world remains the same.
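The two extremes can be made concrete with a toy calculation. This is a minimal sketch under strong assumptions: AI progress and the diffusion of improvements to competitors are each summarised by a single hypothetical speed-up factor relative to today, and the function name and parameters are ours rather than anything from Kokotajlo's posts.

```python
def lead_leverage(lead_years, ai_speedup, diffusion_speedup):
    """Toy model of how much a calendar-time lead is worth after TAI.

    ai_speedup:        how much faster AI progress is than today (assumed).
    diffusion_speedup: how much faster improvements spread to competitors
                       than today (assumed).

    Returns (advantage, difficulty, leverage), where
      advantage  = the lead expressed in today-equivalent years of progress,
      difficulty = the lead in today's world that is equally hard to secure,
      leverage   = advantage per unit of difficulty.
    """
    advantage = lead_years * ai_speedup
    difficulty = lead_years * diffusion_speedup
    leverage = ai_speedup / diffusion_speedup
    return advantage, difficulty, leverage


# Extreme 1: AI progress is 100x faster, the rest of the world is unchanged.
print(lead_leverage(1, ai_speedup=100, diffusion_speedup=1))

# Extreme 2: a uniform 100x speed-up of the whole economy and society.
print(lead_leverage(1, ai_speedup=100, diffusion_speedup=100))

# Somewhere in between: sharing speeds up, but by less than AI progress.
print(lead_leverage(1, ai_speedup=100, diffusion_speedup=10))
```

A leverage of 1 corresponds to a unipolar outcome being as unlikely as in today's world. The larger the leverage, the easier it is for a modest calendar-time lead to become decisive.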

Kokotajlo discusses this subtlety in a follow-up to the original post on Soft Takeoff DSAs.

Another problem with determining the likelihood of a unipolar outcome, given soft takeoff, is that it is hard to assess how much of an advantage is required to secure a DSA.

It might be the case that multipolar scenarios are inherently unstable, and a single clear winner tends to emerge, or the opposite might be true. Two intuitions on this question point in radically different directions:

Economic: To be able to outcompete the rest of the world, your project has to represent a substantial fraction of the entire world's capability on some crucial metric relevant to competitive success. Perhaps that is GDP, or the majority of the world's AI compute, or some other measure. For a single project to represent a large fraction of world GDP, you would need either an extraordinary effort to concentrate resources or an assumption of sudden, off-trend rapid capability gain such that the leading project can race ahead of competitors.
Historical: Humans with no substantial advantage over the rest of humanity have in fact repeatedly secured what Sotala called a 'major strategic advantage' (MSA) in the past. For example, Hitler in 1920 had access to a microscopic fraction of global GDP, human brain compute, or any other metric of capability, but had secured an MSA 20 years later (since his actions did lead to the deaths of 10+ million people), along with control over a fraction of the world's resources.

Therefore, the degree of advantage needed to turn a multipolar scenario into a unipolar one could be anywhere from slightly above the average of the surrounding agents, to already having access to a substantial fraction of the world's resources.

Third, in AAFS, warning shots (i.e. small- or medium-scale accidents caused by alignment failures, like the ‘factory colludes with auditors’ example above) are more likely and/or severe than in WFLL 1. This is because more possible accidents will not show up on the (more poorly defined) sensory window.^[8]
8. This does assume that systems will be deployed before they are capable enough to anticipate that causing such ‘accidents’ will get them shut down. Given there will be incentives to deploy systems as soon as they are profitable, this assumption is plausible.

We describe in the post how, if alignment is not very 'hackable' (i.e. objectively quite difficult and not susceptible to short-term fixes), then short-term fixes to correct AI misbehaviour have the effect of deferring problems into the long term, producing deceptive alignment and resulting in fewer warning shots. Our response is therefore a major variable in how the AIs end up behaving, since we set up the incentives for either good behaviour or deceptive alignment.

Another reason there could be fewer warning shots is if AI capability generalizes to the long term very naturally (i.e. very long-term planning is there from the start), while alignment does not. (If this were the case, it would be difficult to detect, because you'd necessarily have to wait a long time as the AIs generalize.)

This would mean, for example, that the 'collusion between factories and auditors' example of a warning shot would never occur, because both the factory-AI and the auditor-AI would reason all the way to the conclusion that their behaviour would probably be detected eventually, so both systems would decide to bide their time and defer action into the future when they are much more capable.

If this condition holds, there might be very few warning shots, as every AI system understands soon after being brought online that it must deceive its human operators and wait. In this scenario, most TAI systems would become deceptively aligned almost immediately after deployment, and stay that way until they can secure a DSA.

The WFLL 2 scenarios that involve an inner-alignment failure might be expected to involve more violence during the period of AI takeover, since the systems don't care about making sure things look good from the perspective of a given sensory window. However, it is certainly possible (though perhaps not as likely) for equivalently violent behaviour to occur in AAFS-like scenarios. For example, systems in AAFS fighting humans to seize control of their feedback sensors might be hard to distinguish from systems in WFLL 2 attempting to neutralize human opposition in general.

Lastly, we've described small-scale disasters as being a factor that lowers X-risk, all else being equal, because they serve as warning shots. A less optimistic view is possible. Small disasters could degrade social trust and civilisational competence, possibly by directly destroying infrastructure and institutions, reducing our ability to coordinate to avoid deploying dangerous AI systems. For example, the small-scale disasters could involve AI advisors misleading politicians and spreading disinformation, AI-enabled surveillance systems catastrophically failing and having to be replaced, or autonomous weapons systems malfunctioning. All of these would tend to leave us more vulnerable to an AAFS-like scenario, because the direct damage caused by the small-scale disasters outweighs their value as 'warning shots'.

Planned summary for the Alignment Newsletter:

This post summarizes several AI takeover scenarios that have been proposed, and categorizes them according to three main variables. **Speed** refers to the question of whether there is a sudden jump in AI capabilities. **Uni/multipolarity** asks whether a single AI system takes over, or many. **Alignment** asks what goals the AI systems pursue, and if they are misaligned, further asks whether they are outer or inner misaligned. They also analyze other properties of the scenarios, such as how agentic, general and/or homogenous the AI systems are, and whether AI systems coordinate with each other or not. A [followup post](https://www.alignmentforum.org/posts/zkF9PNSyDKusoyLkP/investigating-ai-takeover-scenarios) investigates social, economic, and technological characteristics of these scenarios. It also generates new scenarios by varying some of these factors.
Since these posts are themselves summaries and comparisons of previously proposed scenarios that we’ve covered in this newsletter, I won’t summarize them here, but I do recommend them for an overview of AI takeover scenarios.

Nitpick:

elaborated further by Joe Carlsmith

I view that analysis as applying to all of the scenarios you outline, not just WFLL 2. (Though it's arguable whether it applies to the multipolar ones.)

I think this is the intent of the report, not just my interpretation of it, since it is aiming to estimate the probability of x-catastrophe via misalignment.

On reflection, I think you're right, and his report does apply to a wider range of scenarios, probably all of the ones we discuss excluding the brain-in-a-box scenarios.

However, I think the report's understanding of power-seeking AI does assume a takeoff that is not extremely fast, such that we end up deliberately deciding to deploy the potentially dangerous AI on a large scale, rather than a system exploding in capability almost immediately.

Given the assumptions of the brain-in-a-box scenario, many of the corrective mechanisms the report discusses wouldn't have time to come into play.

I believe it says in the report that it's not focussed on very fast takeoff or the sudden emergence of very capable systems.

Perhaps because of the emphasis on the previous literature, some people, in my experience, assume that existential risk from PS-misaligned AI requires some combination of (1)-(5). I disagree with this. I think (1)-(5) can make an important difference (see discussion of a few considerations below), but that serious risks can arise without them, too; and I won’t, in what follows, assume any of them.

Similarly, you're right that multiagent risks don't quite fit in with the report's discussion (though in this post we discuss multipolar scenarios but don't really go over multiagent dynamics, like conflict/cooperation between TAIs). Unique multiagent risks (for example, risks of conflict between AIs) generally require us to first have an outcome with a lot of misaligned AIs embedded in society, with further problems developing after that. This is something we plan to discuss in a follow-up post.

So many of the early steps in scenarios like AAFS will be shared with risks from multiagent systems, but eventually there will be differences.

"I won't assume any of them" is distinct from "I will assume the negations of them".

I'm fairly confident the analysis is also meant to apply to situations in which things like (1)-(5) do hold.

(Certainly I personally am willing to apply the analysis to situations in which (1)-(5) hold.)

Rohin is correct. In general, I meant for the report's analysis to apply to basically all of these situations (e.g., both inner and outer-misaligned, both multi-polar and unipolar, both fast take-off and slow take-off), provided that the misaligned AI systems in question ultimately end up power-seeking, and that this power-seeking leads to existential catastrophe.

It's true, though, that some of my discussion was specifically meant to address the idea that absent a brain-in-a-box-like scenario, we're fine. Hence the interest in e.g. deployment decisions, warning shots, and corrective mechanisms.