Investigating AI Takeover Scenarios

by Samuel Dylan Martin, 17th Sep 2021


Epistemic status: lots of this involves interpreting/categorising other people’s scenarios, and could be wrong. We’d really appreciate being corrected if so.

TLDR: see the summary table.

This post was written with the help of Sam Clarke.

In the last few years, people have proposed various AI takeover scenarios. We think this type of scenario building is great, since there are now more concrete ideas of what AI takeover could realistically look like. That said, we have been confused for a while about what different assumptions are made when outlining each scenario. This post investigates these assumptions, and might be useful for anyone interested in the plausibility of scenarios like ‘What failure looks like’ or Production Web.

This post builds on our previous post on how to distinguish AI takeover scenarios. Here, we discuss variable social, economic and technological characteristics of the worlds described in each of seven takeover scenarios. These characteristics are:

  • Crucial decisions: the specific (human) decisions necessary for takeover
  • Competitive Pressures: the strength of incentives to deploy AI systems despite the dangers they might pose
  • Takeover capabilities: how powerful the systems executing the takeover are
  • Hackability of alignment: the difficulty of correcting misaligned behaviour through incremental fixes

We begin by explaining why we investigated these particular properties of AI takeover scenarios: they are characteristics along which slow scenarios (which describe loss of control to AI occurring over years) and fast scenarios (which involve AIs gaining capability rapidly over a much shorter period) differ quite a lot (see ‘Different assumptions between slow and fast scenarios’). In particular, slow scenarios make stronger assumptions about competitive pressures but weaker assumptions about takeover capabilities, compared to fast scenarios.

In sharing this post, we want to reveal assumptions of AI takeover scenarios that might not be obvious; understanding these assumptions is essential for predicting which risks are most serious.

Therefore, in the ‘Takeover Characteristics’ section, we present (our interpretation of) the 7 AI takeover scenarios discussed in our original post from the perspective of the four characteristics this post discusses, in the form of a table.

In the following ‘Discussion of Scenarios’ section we elaborate on the information in this table - describing in detail the nature of the crucial decisions made, the competitive pressures in play, the key capabilities of the AI system(s) and the ‘hackability’ of alignment in each of the seven scenarios.

Because we have identified new characteristics of AI takeover, we have been able to come up with new takeover scenarios by considering all the ways these characteristics might vary. Some of these are described in ‘New Scenarios’.

Finally, in the section on ‘Discussion of Characteristics’, we describe each of the four characteristics of takeover in more depth, discuss how they interact with each other and evaluate some arguments about what values they will likely take.

This post builds on previous work investigating these questions. Joe Carlsmith’s report on power-seeking AI discussed deployment decisions and the role of competitive pressures in AI takeover scenarios in general (sections 5,6). Kaj Sotala’s report on disjunctive scenarios for AI takeover investigated competitive pressures and crucial decisions, primarily as they pertained to ‘brain-in-a-box’ scenarios (several of the scenarios we discuss here had not been devised when that report was written).

 

AI Takeover Scenarios

Our original post on Distinguishing AI takeover scenarios examined seven proposed ways that agentic AI systems with values contrary to those of humans could seize control of the future. These scenarios are summarized briefly below, and we will use the names given here to refer to them.

The links will take you to the more detailed descriptions of each scenario from our first post, including a discussion of uni/multipolarity, speed of takeoff and type of misaligned behaviour.

Fast scenarios

Outer-misaligned brain-in-a-box scenario

This is the ‘classic’ scenario that most people remember from reading Superintelligence: A single highly agentic AI system rapidly becomes superintelligent on all human tasks, in a world broadly similar to that of today. The objective function used to train the system (e.g. ‘maximise production’) doesn’t push it to do what we really want, and the system’s goals match the objective function.

Inner-misaligned brain-in-a-box scenario

Another version of the brain-in-a-box scenario features inner misalignment, rather than outer misalignment. That is, a superintelligent AGI could form some arbitrary objective that arose during the training process. 

Flash economy

A multipolar version of the outer-misaligned ‘brain-in-a-box’ scenario, with many powerful AIs. Groups of systems reach an agreement to divide the Earth and space above it into several conical sectors, to avoid conflict between them (this locks in multipolarity).
 

Slow scenarios

What failure looks like, part 1 (WFLL 1)

Many agentic AI systems gradually increase in intelligence and generality, and are deployed increasingly widely across society to do important tasks (e.g., law enforcement, running companies, manufacturing and logistics). The objective functions used to train them (e.g., ‘reduce reported crimes’, ‘increase reported life satisfaction’, ‘increase human wealth on paper’) don’t push them to do what we really want. There’s no discrete point where this scenario becomes irreversible. AI systems gradually become more sophisticated, and their goals gradually gain more influence over the future relative to human goals. 

Another (outer) alignment failure story (AAFS)

This scenario, also described by Christiano, is initially similar to WFLL 1. AI systems slowly increase in generality and capability and become widely deployed. The systems are outer misaligned: they pursue natural generalisations of the poorly chosen objective functions they are trained on. Problems arise along the way, when systems do things that look good but aren’t actually good. Specifically: ‘ensuring things look good according to human judgment’ eventually means fooling humans and carefully controlling what gets fed into the sensors, because the AIs can produce much more desirable-looking outcomes, much more easily, by controlling the sensor feedback given to human operators instead of actually making the world good. Eventually, all humans will either be killed or totally disempowered, because this is the best way of making sure the systems’ objectives are maximally positive and will remain that way forever.

Production Web

Critch’s Production Web scenario is similar to WFLL 1 and AAFS, except that the objective functions used to train the systems are more severely outer misaligned. Specifically, the systems are trained to ‘maximise productive output’ or another similarly crude measure of success. 

What failure looks like, part 2 (WFLL 2)

Described by Christiano and elaborated further by Joe Carlsmith, this scenario sees many agentic AI systems gradually increase in intelligence, and be deployed increasingly widely across society to do important tasks, just like WFLL 1.

But then there is an inner alignment failure rather than an outer alignment failure. The systems learn an objective unrelated to the training objective. The objective they follow will be one that is easily discoverable by neural networks (e.g., ‘don’t get shut down’), as it arises naturally in the training process. The systems seek influence as an instrumental subgoal. Takeover becomes irreversible during a period of heightened vulnerability (a conflict between states, a natural disaster, a serious cyberattack, etc.) before systems have undergone an intelligence explosion. This could look like a “rapidly cascading series of automation failures”.

Soft takeoff leads to decisive strategic advantage

This scenario, described by Kokotajlo, starts off much like ‘What failure looks like’. Unlike in ‘What failure looks like’, in this scenario one AI is able to buy more computing hardware and invest more time and resources into improving itself, enabling it to do more research and pull far ahead of its competition. Eventually, it can seize a decisive strategic advantage and defeat all opposition. 

Different assumptions between slow and fast scenarios

The starting point for our investigation is the following observation: fast ‘brain-in-a-box’ scenarios assume that takeover probably cannot be prevented after the misaligned Transformative Artificial Intelligence (TAI) is deployed (due to very rapid capability gain), but the ‘slow scenarios’ involve an extended period where misaligned AIs are deployed, incremental improvements to alignment are attempted and, in some cases, warning shots (small-scale disasters that indicate that AI is unsafe) happen.

Therefore, the slow scenarios have to provide an explanation as to why many actors persist in deploying this dangerous technology over several years. These social/economic assumptions can be thought of as substituting for the assumption of very fast progress that was key to the fast scenarios - the rapid capability gain with no time to respond is replaced by a slower capability gain and an ineffective response.

If the slow scenarios capture reality better than the fast scenarios, then systems will be deployed deliberately and will initially be given power rather than seizing power. This means both that the systems won’t be so obviously dangerous that the misbehaviour is noticed early on and that there is still misalignment later on. Carlsmith:

The question, then, isn’t whether relevant actors will intentionally deploy systems that are already blatantly failing to behave as they intend. The question is whether the standards for good behavior they apply during training/testing will be adequate to ensure that the systems in question won’t seek power in misaligned ways on any inputs post-deployment.

Just from this initial observation, we know that there are several differences in the assumptions of slow and fast scenarios that go beyond just technical factors or overall results like whether the outcome is unipolar or multipolar. This led us to investigate exactly how particular slow and fast scenarios differ in the broader set of assumptions they make.

 

Takeover Characteristics

Our initial table of characteristics for AI takeover scenarios discussed the primary and overt characteristics of a takeover - whether they were unipolar or multipolar, whether they involved rapid capability gain or slow capability gain, and how and why the AI systems were misaligned. In this post, we present a table of secondary characteristics of AI takeover scenarios - factors that influence these primary characteristics or depend on them in various ways.

The characteristics of AI takeover can be divided into, first, social and economic factors (crucial decisions and competitive pressures) and, second, technical factors (takeover capabilities and alignment ‘hackability’).

Crucial decisions and competitive pressures are two ways of looking at the preconditions for an AI takeover scenario. The first is a local view, focussing on particular mistaken decisions (e.g. around deploying a dangerous AI). The second is a broad view, focussing on the presence of perverse economic or political incentives. These two overlap - bad decisions are made in response to perverse competitive pressures, and competitive pressures can lessen or intensify because of key decisions about oversight or regulation. 

Takeover capabilities and Alignment ‘hackability’ are assumptions each scenario makes about the competence of the AIs which take over and how difficult it is to align them using short term, case-by-case fixes. There are complicated relationships between the assumptions you make about these technological questions and the assumptions you make about social factors. Roughly speaking, the weaker the competitive pressures and the more competently crucial decisions are made, the more capable the AIs have to be and the harder (less ‘hackable’) alignment has to be for disaster to occur. However, note that if hackability is very low, we might have enough warning shots to avoid developing dangerous AI in the first place. These relationships are discussed in more detail in the section on Discussion of Characteristics.

Table

This table presents our best guess of what the crucial decisions, degree and cause of competitive pressures, assumed capabilities for AI takeover and hackability (effectiveness of short-term fixes) in different takeover scenarios are. In the following section we then discuss each scenario from these perspectives. You may want to refer back to our first summary table.

| Scenario | Crucial Decisions (identifiable decisions made by humans that lead to takeover) | Competitive Pressures (strength and nature of incentives to deploy AI) | Takeover Capabilities (what capabilities the AIs employ to execute takeover) | Alignment ‘Hackability’ (extent to which short-term fixes are sufficient for aligning systems on all inputs which they will in fact receive) |
|---|---|---|---|---|
| Outer-misaligned brain-in-a-box ‘superintelligence’ scenario; Inner-misaligned brain-in-a-box scenario | Choose to develop TAI; (If not released deliberately: Choose to deploy TAI) | (Race dynamic may be present in leadup to TAI development) | Rapid capability gain; ability to seize DSA or major advantage over the rest of the world from ~nothing; if not released deliberately: has to escape | Irrelevant (no time for fixes) |
| Flash Economy | Choose to develop TAI; choose to release system open-source / share research | Enough to allow initial deployment of the TAI systems | Ability to seize DSA or major advantage over the rest of the world from strong starting point | Could be fairly high, not much time for fixes |
| What failure looks like, part 1 (WFLL 1) | Choose to develop TAI; choose to automate systems on a large scale; inadequate response to warning signs | Incentives to keep deploying AI; some pressure to fix small errors | Irrelevant, loss of control occurs without takeover | Moderate |
| Another (outer) alignment failure story (AAFS) | Choose to develop TAI; choose to automate systems on a large scale; inadequate response to warning signs and small disasters | Incentives to keep deploying AI; significant pressure to fix small errors | Ability to seize DSA or major advantage over the rest of the world from strong starting point | Lower than WFLL 1 |
| Production Web | Choose to develop TAI; choose to automate systems on a large scale; inadequate response to warning signs and small disasters | Strong incentives to keep deploying AI; no real pressure to fix small errors | Ability to seize DSA or major advantage over the rest of the world from strong starting point | Similar to WFLL 1 |
| What failure looks like, part 2 (WFLL 2) | Choose to develop TAI; choose to automate systems on a large scale; inadequate response to warning signs and escalating series of disasters | Strong incentives to keep deploying AI | Ability to seize DSA or major advantage over the rest of the world after some weakening event | Low |
| Soft Takeoff leading to DSA | Choose to develop TAI; government or research group centralises research effort and achieves strong lead | Race dynamic | Ability to seize DSA or major advantage over the rest of the world from resources of initial project | (Low enough that whatever is tried during testing of system fails) |

 

 

Discussion of Scenarios

Here we discuss each of the seven scenarios in depth from the perspective of crucial decisions, competitive pressures, takeover capabilities and alignment hackability. The links on each heading take you to a full description of the original scenario in our previous post.

Outer-misaligned brain-in-a-box scenario/Inner-misaligned brain-in-a-box scenario

In ‘brain-in-a-box’ scenarios, the main crucial decisions occur early on and involve development (and possibly voluntary deployment) of the first and only TAI, with the assumption that once this TAI is deployed it’s game over. Depending on the anticipated level of capability, the system might also be capable of talking its way into being deployed during testing or escaping its testing environment, or else might be voluntarily deployed. This particular critical decision - the choice to deploy systems - was discussed by Sotala in depth.

As well as anticipated economic benefit, the systems could be voluntarily released for unethical reasons - terrorism, criminal profit, ideological motives or a last-ditch mutually assured destruction attempt.

Competitive pressures to allow the AI to proliferate despite the risks it poses aren’t that important, because after deployment, the AI rapidly completes its takeover and there is no chance for opposition. A race dynamic due to anticipated economic benefit or military power may well be present, and might explain why the system got developed in the first place, but unlike with the slow scenarios there aren’t noticeable competitive pressures explaining how the AI takes over after release. Alignment ‘hackability’ also doesn’t become an issue - there’s no time to incrementally correct the system because it increases in capability too quickly.

Flash economy

The scenario unfolds quickly once the requisite jump in capability has been made (over a few months), but unlike the Brain-in-a-box scenarios, there are multiple highly capable systems in the world. Crucially, the breakthroughs required to create the ‘distributed autonomous organisations’ (highly capable TAIs in this scenario) have to either be leaked or shared (e.g. open-sourced, or shared between particular companies) rapidly, so that the technology isn’t monopolised by one group leading to a DSA.

The agents - ‘distributed autonomous organisations’ - proliferate quickly after the required technology is developed. Because of the extreme speed with which the agents proliferate, the large benefit they deliver early on, and their decentralised nature, there are strong incentives against interference by government and regulation (competitive pressures).

The agents do execute a takeover once they have built up their own infrastructure (takeover capabilities), but they aren’t capable of executing a takeover immediately after being deployed. Lastly, because of how fast the scenario unfolds and the fact that the agents are mostly left alone, alignment might be fairly hackable and corrections easy to apply. However, as with the outer-misaligned ‘brain-in-a-box’ scenario, once the systems are released there’s just no opportunity to coordinate and actually do this: even if some systems are controlled with incremental improvements, many either escape human attention or avoid human interference through regulatory capture or the economic benefit they deliver.

What failure looks like, part 1 (WFLL 1)

In WFLL 1 there are fewer crucial decisions. AI systems gradually increase in capability and are used throughout the economy. Therefore, there has to be no concerted effort to prevent this sort of heavy automation of the economy (so a lack of restrictive regulation or litigation), but otherwise there are few identifiable specific decisions that need to be made. Competitive pressures, mainly arising from the direct economic benefit the systems provide and their benefit to stakeholders, are quite strong. In this scenario, a fraction of people are aware that things are proceeding along a dangerous path, yet AI deployment continues. However, there aren’t many visible small-scale disasters, so competitive pressures needn’t be exceptionally strong (i.e. sufficient to maintain deployment even in the face of warning shots).

The systems don’t execute an overt takeover at any point, so the required capability for takeover is effectively nil - they are just delegated more and more power until humanity loses control of the future. There also aren’t many obvious disasters as things proceed, and the final result of the scenario doesn’t necessarily involve human extinction. Since the systems don’t end up so egregiously misaligned that they execute a violent takeover, there is some, probably intermittent, effort to incrementally fix systems as they malfunction. Therefore, the ‘hackability’ of AI alignment in this scenario is neither very high (otherwise we wouldn’t lose control eventually), nor very low (in which case the systems would end up egregiously misaligned and execute a violent takeover, definitely resulting in extinction) - the alignment problem has an “intermediate” level of hackability.

Another (outer) alignment failure story (AAFS)

AAFS is subtly different from WFLL 1 in several key ways. The crucial decisions are the same as WFLL 1, except that this scenario specifies there are many early warning signs of misaligned behaviour - small scale disasters that do come to public attention (e.g. a factory colludes with the auditors valuing its output, giving a great quarterly report that didn’t actually correspond to any revenue), but the response to these accidents is always incremental patches and improvements to oversight rather than blanket bans on automation or rethinking our overall approach to AI development. Competitive pressures are somewhat strong, with direct economic benefit and benefit to shareholders again playing key roles in explaining why we persist in deploying dangerous systems.

However, the scenario also specifies that there are many, varied attempts at incremental improvements to TAI systems in response to each failure. Since these attempts are a key part of the story (unlike WFLL 1) but the result is worse than in WFLL 1 (definite extinction), the scenario assumes that alignment ‘hackability’ is lower than in WFLL 1 (also see Paul’s comment that this scenario is one where ‘the alignment problem is somewhat harder than I expect’). This also means that the scenario assumes competitive pressures are weaker than in WFLL 1, as there is much more coordination around attempting to patch mistakes, compared to WFLL 1 (see Paul’s comment that this scenario is one where ‘society handles AI more competently than I expect’). However, while there are more attempts at reining in AI than in WFLL 1, the competitive pressures aren’t reduced by enough to prevent eventual AI takeover.

Lastly, this scenario does feature a takeover executed by systems that physically and violently seize control of their sensors and feedback mechanisms - the takeover capabilities must therefore include cyberoffense and possibly control of drones or advanced nanotechnology, not primarily effective persuasion tools and other ‘quiet’ means. 

However, unlike the brain-in-a-box scenarios, the AI systems are already highly embedded in the economy when they take over, so they start from a much stronger position than brain-in-a-box AIs, including control of lots of physical resources, factories and drones. Therefore, the technological capabilities required for takeover are lower.

Production Web

Production web is similar to AAFS in terms of crucial decisions, except that the systems that proliferate in production web gain their large-scale goals without much deliberate planning or testing at all (agentic systems with narrow goals like fulfilling a specific market niche knit together into a ‘production web’ by themselves). The competitive pressures, primarily from economic benefit and benefit delivered to stakeholders, must be very strong for this process to proceed (stronger than in AAFS/WFLL 1) despite the fact that it occurs over multiple years and with obvious signs that humanity is losing control of the situation. Regulatory capture and benefit to stakeholders are emphasised as reasons why the development of the production web is not halted, but there is less focus on the ambiguity of the situation, compared to WFLL 1 (since the outcome is much more obviously disastrous in Production Web).

Alignment ‘Hackability’ is similar to AAFS - in both cases, incremental fixes work for a while and produce behaviour that is at least beneficial in the short term. The difference is that because competitive pressures are stronger in Production Web, compared to AAFS, there is less effort put into incremental fixes and so systems end up going off the rails much sooner.

Like AAFS, the takeover occurs when the systems are already highly embedded in the world economy, but probably occurs earlier and with a somewhat lower barrier to success, since the systems don’t need to seize control of sensors to ensure that things continue to ‘look good’. Otherwise, the takeover route is similar to AAFS, though the story emphasises resources being consumed and humanity going extinct as a side effect, rather than systems seizing control of their sensors and oversight systems.

What failure looks like, part 2 (WFLL 2)

WFLL 2 involves an inner alignment failure, so setting up the training in ways that disincentivise power-seeking behaviour will be very hard, as by specification power-seeking is a strong attractor state. Therefore hackability is low. This has various other effects on the scenario. The crucial decisions probably involve a greater neglect of potential risks than in WFLL 1, especially because the warning shots and small-scale failure modes in WFLL 2 are more likely to take the form of violent power-seeking behaviour rather than comparatively benign mistakes (like auditor-AIs and factory-AIs colluding).

The competitive pressures have to be strong, to explain why systems keep getting deployed despite the damage they have already inflicted.

Christiano describes the takeover as occurring at a point of heightened vulnerability - both because this is a Schelling point where different systems can coordinate to strike, and because the minimum level of capability required for a takeover is lower. Since the systems will execute a takeover at the first opportunity and during a period of heightened vulnerability (and will therefore be attempting takeover much earlier), the required capabilities for takeover are lower in this scenario, compared to AAFS/Production Web.

Soft takeoff leads to decisive strategic advantage

Soft takeoff leading to decisive strategic advantage (DSA) has an extra assumption on top of the preconditions for AAFS/WFLL 1/Production Web - that one particular research group is able to secure significant lead time over competitors, such that it can defeat both humanity and rival AIs. Given this assumption, what’s going on in the rest of the world, whether the other AI systems are aligned or not, is irrelevant.

The leading project is probably motivated by a strategic race for military or economic dominance, since it has secured enough resources to dominate the rest of the world. The required takeover capability is very high as the system is competing against other transformative AI systems, although not quite as high as in the ‘brain-in-a-box’ scenario, as this leading project starts out with a lot of resources. Alignment cannot be hackable enough that the leading project is able to successfully align the AI system in the development time it has, but otherwise the exact level of ‘hackability’ is underdetermined.

 

New Scenarios

Here, we present some scenarios devised by varying one or more of the takeover characteristics.
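
One way to picture this procedure is as a walk over a small grid of settings. The sketch below is purely illustrative: the three coarse levels and the dictionary keys are assumptions made for the example, not a scale defined anywhere in the scenarios themselves. It enumerates every combination of the four characteristics and then filters for the settings that define ‘Compounding Disasters’ (very high competitive pressures, incompetent crucial decisions, very low hackability), leaving takeover capabilities free.

```python
from itertools import product

# Hypothetical coarse levels; the post does not define a numeric scale.
LEVELS = ("low", "moderate", "high")

# The four characteristics discussed in this post (key names are illustrative).
CHARACTERISTICS = (
    "crucial_decisions_competence",
    "competitive_pressures",
    "takeover_capabilities",
    "alignment_hackability",
)


def enumerate_settings():
    """Return every combination of levels as a dict keyed by characteristic."""
    return [
        dict(zip(CHARACTERISTICS, combo))
        for combo in product(LEVELS, repeat=len(CHARACTERISTICS))
    ]


settings = enumerate_settings()

# Settings matching the stated assumptions of 'Compounding Disasters';
# takeover capabilities are deliberately left unconstrained in this filter.
compounding = [
    s for s in settings
    if s["competitive_pressures"] == "high"
    and s["crucial_decisions_competence"] == "low"
    and s["alignment_hackability"] == "low"
]
```

Most of the combinations in such a grid will not correspond to a coherent story; the grid is only a prompt for asking which corners have not yet been described.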

Soft takeoff and decisive strategic advantage by narrow AI

We devised this scenario by setting the ‘takeover capabilities’ to a very low value - the barrier to AI takeover is low.

This scenario is similar to ‘Soft takeoff leads to decisive strategic advantage’, except that the single system which takes over is not that much more capable than its rivals. Rather, it simply has a single good trick that enables it to subvert and take control of the rest of the world. Its takeover capability might be exceptionally good manipulation techniques, specific deadly technology, or cyberoffensive capability, any of which could allow the system to exploit other AIs and humans. This removes the assumption that a lot of research effort will need to be concentrated to achieve a DSA, and replaces it with an assumption that there is some unique vulnerability in human society which a narrow system can exploit. Implicit in this scenario is the assumption that generally capable AI is not needed to take on an extraordinary research effort to find this vulnerability in human society.

Compounding Disasters

We devised this scenario by assuming the competitive pressures are very high, crucial decisions are very incompetent and ‘hackability’ is very low.

This scenario is similar to AAFS, with TAI systems widely being deployed, pursuing goals that are okay proxies for what humans actually want, and demonstrating some misbehaviour. However, instead of the small-scale failures taking the form of relatively benign ‘warning shots’ that lead to (failed) attempts to hack AI systems to prevent future errors, the small-scale disasters cause a large amount of direct damage. For example, an AI advisor misleads the government, leading to terrible policy mistakes and a collapse of trust, or autonomous weapon systems go rogue and attack cities before being taken out. The result of this is a compounding series of small disasters that rapidly spiral out of control, rather than attempted patches staving off disaster for a while before a single sudden AI takeover. In the end, the AI takeover occurs at a period of heightened vulnerability brought about by previous medium-sized AI-related disasters. Therefore, AI systems in this scenario need not be as competent as in AAFS or even WFLL 2 to take over. Alignment may be easily hackable in this situation, but so much damage has been done by early, agentic, narrow AIs that no such fixes are attempted.

Automated War

A situation rather like the AAFS scenario plays out, where the economy becomes dependent on AI, and we lose control of much key infrastructure. Capable, agentic AI systems are built which do a good job of representing and pursuing the goals of their operators (inner and outer aligned). These are deployed on a large scale and used to control armies of drones and automatic factories, as well as the infrastructure needed for surveillance, for the purposes of defending countries.

However, there are key flaws in the design of the AI systems that only become apparent after they are in a position to act relatively independent of human feedback. At that point, flaws in their ability to both model each other and predict their own chances of winning potential contests over valuable resources lead to arms races and ultimately destructive wars that the AIs have precommitted to pursue.

This scenario probably involves a stronger role for military competition, instead of just economic competition, and also involves a particular kind of (non-intent) alignment failure - systems failing to behave correctly in multiagent situations (along with an intent alignment failure that means the systems can’t just be told to stand down when they start going against the interests of their operators).

From the perspective we are taking in this post, there need to be particular crucial decisions made (automation of military command and control), as well as strong military competitive pressures and a likely race dynamic. Alignment is not very hackable, for a specific reason - the multiagent flaw in AIs is not easy to detect in testing or soon after deployment.

Failed Production Web

The preconditions for Production Web play out as described in that scenario, where agentic AI systems each designed to fill specific market niches attempt to integrate together. However, due to either specific defects in modelling other AIs or inner misalignment, the systems are constantly seeking ways to exploit and defraud each other. These attempts eventually result in AI systems physically attacking each other, resulting in a chaotic war that kills humans as a side effect. This is similar to ‘Automated War’, but with different competitive pressures. There is less of a focus on strategic competition and more of a focus on economic competition, and the scenario requires similar assumptions to Production Web about very strong competitive pressures.

 

Discussion of Characteristics 

We have seen how different scenarios involve varied critical decisions, stronger or weaker assumptions about competitive pressures, a lower or higher threshold for takeover or different levels of alignment hackability. How plausible are these features of the scenarios?

Below, we discuss the four characteristics we have identified and, for some, give an assessment of the reasons why you might expect them to be at one extreme or another (crucial decisions made unusually competently/incompetently, very strong/very weak competitive pressures to deploy AI systems, a low/high bar for AIs to be capable enough to take over, easy/hard alignment ‘hackability’).

Crucial Decisions

In all the scenarios discussed, we can identify certain decisions which governments and companies must make. Most obviously, research into agentic AI has to be pursued for long enough to produce significant results, and this would have to include a lack of oversight and no decisions to halt research in the face of significant risk. Some scenarios also involve cases where AIs that obviously pose a risk are deliberately released for some reason.

A scenario is less plausible if many crucial decisions must all be made wrongly for the scenario to come about. A scenario is more plausible if varying whether actors make the wrong choice at many stages of TAI development doesn’t change whether the scenario happens.
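As a rough illustration of why the number of required decisions matters, here is a minimal sketch. It assumes, purely for illustration, that each crucial decision is made wrongly with some fixed probability and that decisions are independent of each other; neither assumption comes from the scenarios themselves, and the function names and the 0.5 figure are hypothetical.

```python
from math import prod


def p_all_wrong(p_wrong_each):
    """Chance that every listed decision is made wrongly.

    Illustrative assumption: decisions are independent, so the
    individual probabilities simply multiply.
    """
    return prod(p_wrong_each)


def p_any_wrong(p_wrong_each):
    """Chance that at least one listed decision is made wrongly.

    A rough stand-in for a scenario that is robust to most choices:
    it only needs a single mistake somewhere along the way.
    """
    return 1 - prod(1 - p for p in p_wrong_each)


# Hypothetical numbers: each decision has a 50% chance of going wrong.
two_decisions = [0.5] * 2
six_decisions = [0.5] * 6

print(p_all_wrong(two_decisions))  # scenario needing 2 wrong decisions
print(p_all_wrong(six_decisions))  # scenario needing 6 wrong decisions
print(p_any_wrong(six_decisions))  # scenario needing any 1 of 6 mistakes
```

Under these toy assumptions, a scenario that needs six mistakes in a row is far less likely than one that needs two, while a scenario that goes ahead after any single mistake becomes more likely as the number of decision points grows. Real decisions are unlikely to be independent, so this should be read as intuition only.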

This is important, especially because it is very difficult to assess what choices actors will actually make while TAI develops (and we won’t try to figure this out in this post). By finding out how many crucial decisions are relevant for a given AI takeover scenario, we can get a better understanding of how plausible it is, despite our confusion about what governments and companies would decide in particular cases. There is an extensive discussion of the plausibility of some potential crucial decisions on page 326 and after of Kaj Sotala’s report.

Competitive Pressures

‘Competitive pressures’ is a characteristic that describes how strong the incentives will be to keep deploying dangerous AI, even in the face of significant risk. There has been some discussion of the implied strength of competitive pressures in the slow and fast scenarios. Here are some reasons to expect that there will be strong pressures to deploy dangerous Transformative Artificial Intelligence (TAI):

(1) Short-term incentives and collective action

Economic Incentives: Since TAI will be economically valuable in the short term, incentives might lead us to cut corners on safety research, especially checks on how models generalize over long time horizons.

Military Incentives: TAI even in its early stages might provide an unchallengeable military advantage, so states would have an extra incentive to compete with each other to produce TAI first.

(2) Regulatory capture

AI actions benefit stakeholders: There will be many particular beneficiaries (as distinct from benefits to the overall economy) from TAI systems acting in misaligned ways, especially if they are pursuing particular goals like ‘make money’ or ‘maximise production’. This means the stakeholders will have both the resources and the motivation to water down regulation and oversight.

AI existence provides value (due to IP): If financial markets realize how valuable TAI is ahead of time, the developers can quickly become extremely wealthy ahead of deployment once they demonstrate the future value they will be able to provide (before the TAI has had time to act in the world to produce economic benefit). This gives stakeholders resources and a motivation to water down regulation and oversight.

(3) Genuine ambiguity

Actual ambiguity: In many of the scenarios we discuss, humanity’s situation might be good in easy to measure ways. This means getting buy-in to challenge the status quo could be difficult.

Invisible misalignment: The AI systems might not be acting in dangerous, power-seeking or obviously misaligned ways early on. This could either be because of deliberate deception (deceptive alignment) or because the systems only fail to effectively generalise their goals on a very large scale or over long time horizons, so the misbehaviour takes years to show up.

Clearly, there are many reasons to expect strong competitive pressure to develop TAI. But how plausible is the idea that competitive pressures would be so high that potentially dangerous AI would be deployed despite major concerns? There are two intuitions we might have before looking into the details of the slow scenarios. We illustrate these intuitions with examples from existing writing on this question:

Unprecedentedly Dangerous

Transformative AI has the potential to cause unprecedented damage, all the way up to human extinction. Therefore, our response to other very dangerous technologies such as nuclear weapons is a good analogy for our response to TAI. It is unrealistic to expect TAI to be deployed if first there are many worsening warning shots involving dangerous AI systems. This would be comparable to an unrealistic alternate history where nuclear weapons were immediately used by the US and Soviet Union as soon as they were developed and in every war where they might have offered a temporary advantage, resulting in nuclear annihilation in the 1950s. From Ngo:

The second default expectation about technology is that, if using it in certain ways is bad for humanity, we will stop people from doing so. This is a less reliable extrapolation - there are plenty of seemingly-harmful applications of technology which are still occurring. But note that we’re talking about a slow-rolling catastrophe - that is, a situation which is unprecedentedly harmful. And so we should expect an unprecedented level of support for preventing whatever is causing it, all else equal.

Perhaps the development of TAI will be similar enough to the development of nuclear weapons that, by analogy with this past development, we can claim evidence that harmful AI takeover is unlikely. In order for the risk from TAI to be like the risk from nuclear escalation, the potential TAI disaster would have to have a clear precedent (some small scale version of the disaster has already occurred), the delay between the poor decision and the negative consequence would have to be very short, and we would have to be sure beforehand that deployment would be catastrophic (an equivalent of mutually assured destruction). Carlsmith discusses such a scenario as potentially plausible:

it seems plausible to me that we see PS [Power-seeking]-alignment failures of escalating severity (e.g., deployed AI systems stealing money, seizing control of infrastructure, manipulating humans on large scales), some of which may be quite harmful, but which humans ultimately prove capable of containing and correcting. 

Unprecedentedly Useful

Transformative AI has the potential to accelerate economic growth by an unprecedented amount, potentially resulting in an entirely new growth regime far faster than today’s. A scenario where we don’t take shortcuts when deploying TAI systems is comparable to an unrealistic alternate history where the entire world refrained from industrializing and stopped additional burning of fossil fuels right after the first plausible evidence of climate change became available in the 1960s. From Carlsmith:

Climate change might be some analogy. Thus, the social costs of carbon emissions are not, at present, adequately reflected in the incentives of potential emitters -- a fact often thought key to ongoing failures to curb net-harmful emissions. Something similar could hold true of the social costs of actors risking the deployment of practically PS [power-seeking] -misaligned APS [agentic AI] systems for the sake of e.g. profit, global power, and so forth…

...The first calculations of the greenhouse effect occurred in 1896; the issue began to receive attention in the highest levels of national and international governance in the late 1960s; and scientific consensus began to form in the 1980s. Yet here we are, more than 30 years later, with the problem unsolved, and continuing to escalate -- thanks in part to the multiplicity of relevant actors (some of whom deny/minimize the problem even in the face of clear evidence), and the incentives and externalities faced by those in a position to do harm. There are many disanalogies between PS-alignment risk and climate change (notably, in the possible -- though not strictly necessary -- immediacy, ease of attribution, and directness of AI-related harms), but we find the comparison sobering regardless. At least in some cases, “warnings” aren’t enough.

Just as with the optimistic analogy to nuclear weapons, we can ask what AI takeover scenarios fit with this pessimistic analogy to climate change. The relevance of the climate change analogy will depend on the lag between early signs of profit/success and early signs of damage, as well as how much of the damage represents an externality to the whole of society, versus directly backfiring onto the stakeholders of the individual project in a short time. It might also depend on how well (power-seeking) alignment failures are understood, and (relatedly) how strong public backlash is (which could also depend on whether AI causes other non-alignment related, non-existential level harms e.g. widespread unemployment and widening inequality).

Takeover Capabilities

In each scenario, there is a certain understanding of what capabilities are necessary for AIs to seize control of the future from humanity. The assumption about how capable AIs need to be varies for two reasons. The first is that some scenarios make different assumptions than others about the intrinsic vulnerability of human civilisation. The second is that in different scenarios, TAIs become obviously adversarial to humans and start fighting back at different points in their development. 

Some scenarios (such as brain-in-a-box) describe systems acting in ways that provoke human opposition almost immediately, so if those scenarios result in AI takeover the systems must be supremely capable (able to defeat all opponents with no starting resources). Other scenarios assume a ‘creeping failure’ where competitive pressures mean humans allow AI systems to monopolise resources and build up infrastructure for a while before the systems execute a takeover (such as AAFS). In these scenarios, the TAI systems need to be capable enough to defeat human opposition while already having access to factories, drones, large amounts of money etc. which requires fewer assumptions about the AI’s capabilities.

How do we quantify the ‘intrinsic vulnerability’ of human civilisation? It is hard to assess how much of an advantage is required to secure a DSA. Two intuitions on this question point in radically different directions:

  • Economic: To be able to outcompete the rest of the world, your project has to represent a substantial fraction of the entire world's capability on some crucial metric relevant to competitive success, because if you are actively seeking to take over the world then you will face opposition from everyone else. Perhaps that should be measured by GDP, military power, the majority of the world's AI compute, or some other measure. For a single project to represent a large fraction of world GDP, you would need either an extraordinary effort to concentrate resources or an assumption of sudden, off-trend rapid capability gain such that the leading project can race ahead of competitors.
  • Historical: Humans with no substantial advantage over the rest of humanity have in fact secured what Sotala called a 'major strategic advantage' repeatedly in the past. For example: Hitler in 1920 had access to a microscopic fraction of global GDP / human brain compute / (any other metric of capability) but had secured an MSA 20 years later (since his actions did lead to the deaths of 10+ million people), along with control over a significant fraction of the world's resources. No single human has ever secured a DSA (the closest was probably Cyrus the Great, who at one point ruled just under half of humanity). We might reasonably assume that if a human was able to take control of 45% of humanity, taking control of 100% would not require vastly greater capability.

Even if the absolute baseline capabilities required to achieve a DSA are unknown, we can see how other aspects of the scenario will raise or lower the required capabilities. Specifically, there is the issue of dependency and deskilling. We might have already integrated misaligned AI into our infrastructure by the time it attempts to take over. If we hand over lots of decision-making power to systems, they don’t need to be as capable to take over. Taken to the extreme, we can imagine a scenario where we directly hand over control of the future to the systems, such that no takeover is even necessary.

Even if this is not the case, a given system might already have centralised control of important services before acting in ways that motivate a human counter-response. They would still have to go the extra mile and secure full control from their strong starting point, but that is necessarily easier than a brain-in-a-box taking over from nothing.

All else being equal, we might expect a greater degree of required takeover capability in faster stories or stories where systems are more obviously power-seeking from the start. The more clearly dangerous and power-seeking systems are, the more likely we are to try to stop them instead of succumbing to pressures to persist in deploying them, so either the response needs to be worse or the TAI’s capabilities need to be greater for takeover to succeed.

Asking ‘how capable’ TAI systems need to be to take over, and discussing factors that might influence this, is an abstraction that covers up the question of which capabilities are necessary.

Some failure scenarios don’t discuss the exact route by which final takeover occurs, but sometimes they emphasise a particular approach (such as massed drone swarms, or highly effective manipulative propaganda). Ngo breaks down the takeover capabilities into two general categories: manipulation and direct destructive capabilities.

Example takeover capabilities:

  • Nanotechnology
  • Drones, advanced robotics
  • Biotechnology
  • Persuasion skills
  • Cyberoffense skills

In general, fast scenarios must assume systems can take over from a very weak starting point, which is more likely to require the capabilities to seize control of already existing infrastructure (persuasion skills and cyberoffense skills), while slow scenarios that assume takeover begins when the systems are already well established in the world economy might only require the ability to make use of that infrastructure to defeat opponents (advanced robotics and drones, biotechnology).

Alignment ‘Hackability’

‘Competitive pressures’ determine how much extra effort is put into aligning and overseeing AI systems - if the pressures are weaker, then we assume more effort is put into alignment and oversight because there is less incentive to cut corners. However, scenarios also differ on how ‘hackable’ the alignment problem is - that is, how easy it is to ‘correct’ misbehaviour by methods of incremental course correction such as improving oversight and sensor coverage or tweaking reward functions. This correction requires two parts - first, noticing that there is a problem with the system early on, then determining what fix to employ and applying it.

In fast takeoff worlds, the ‘hackability’ of the alignment problem doesn’t matter. There is no opportunity for alignment via course correction: either the AIs that rapidly become superintelligent are aligned, or they are not.

In slow takeoff worlds, the ‘hackability’ of the alignment problem appears to have a U-shaped effect on how good the outcomes are. On one extreme, the alignment problem is hackable “all the way” - that is, we can incrementally correct AI systems as we go until we end up with existentially safe TAI. On the other extreme, the alignment problem isn’t hackable at all. This might seem like a terrible outcome, but if it is the reality, it will probably lead to many early warning shots (i.e. small- or medium-scale accidents caused by alignment failures) that cannot be fixed. These will hopefully illustrate the danger ahead and bring about a slow-down in AI development and deployment, until we have robust solutions to alignment.

Between these two extremes, things seem to be more existentially risky. Consider if the alignment problem is “hackable until it isn’t” - that is, for systems of lower capability, we can patch our way towards systems that do what we want, but as systems become increasingly capable, this becomes impossible. Call this an “intermediate” level of hackability. In this world, warning shots are likely to result in fixes that ‘work’ in the short-term, in the sense that they fix the specific problem. This gives humans confidence, resulting in more systems being deployed and more decision-making power being handed over to them. But this course correction becomes unworkable as systems become more capable, until eventually the alignment failure of a highly capable system results in existential catastrophe.
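The U-shaped claim can be made concrete with a deliberately crude toy model. Everything in it is our own assumption rather than something taken from the scenarios: hackability is collapsed to a single number `h` between 0 and 1, deployment is assumed to continue with probability `h` (because patches that appear to work build confidence), and the final highly capable system is assumed to remain misaligned with probability `1 - h`. The function name `toy_risk` is hypothetical.

```python
def toy_risk(h):
    """Toy catastrophe score for a hackability level h in [0, 1].

    Illustrative assumptions (not from the post):
      - deployment keeps going with probability h, since fixes that
        seem to work make people confident enough to continue;
      - the final, highly capable system is still misaligned with
        probability 1 - h.
    The score is the product of the two.
    """
    if not 0.0 <= h <= 1.0:
        raise ValueError("h must be between 0 and 1")
    return h * (1.0 - h)


# Three hypothetical worlds along the hackability axis.
levels = {"not hackable": 0.0, "intermediate": 0.5, "fully hackable": 1.0}
for name, h in levels.items():
    print(f"{name:>15}: risk score {toy_risk(h):.2f}")
```

In this sketch the score is zero at both extremes and highest in the middle, which is the shape described above: outcomes are best when alignment is either fully hackable or not hackable at all. The specific functional form is arbitrary, and the "not hackable" end in particular relies on the hopeful assumption that unfixable warning shots really do slow deployment.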

What predictions can we make today about how hackable the alignment problem is? Considering outer alignment: without any breakthroughs in techniques, there seems to be a strong case that we are on track towards the ‘intermediate’ world where the alignment problem is hackable until it isn’t. It seems like the best workable approach to outer alignment we have so far is to train systems to try to ensure that the world looks good according to some kind of (augmented) human judgment (i.e. using something like the training regime described in 'An unaligned benchmark'). This will result in a world that “looks good until it doesn’t”, for the reasons described in Another (outer) alignment failure story.

Considering inner alignment: it’s unclear how pervasive a problem inner misalignment will turn out to be, and also how competent systems have to be to appear aligned when they are not. To the extent that inner alignment is a pervasive problem, and models don’t have to be very competent to appear aligned when they are not, this also looks like the ‘intermediate’ world where we can hack around the alignment problem, deploying increasingly capable systems, until a treacherous turn results in catastrophe.

 

Conclusion

We have identified four characteristics which help us to interpret AI takeover scenarios and examined some of the most widely discussed AI takeover scenarios from this perspective. What becomes clearer when we do this? Each scenario is unique, but there are large differences in which assumptions about these characteristics you need to make in order for slow scenarios vs fast scenarios to be plausible. 

Compared to fast scenarios, slow scenarios don’t rely as much on decisions to deploy single dangerous AIs, but make more assumptions about incentives to widely deploy dangerous systems over a long period. From one perspective, this assumption about competitive pressures is the default, because that’s what humans have tended to do throughout our history when some lucrative new technology has been made available. From another perspective, the unprecedented danger posed by TAI implies a strong incentive to avoid making any mistakes.

Similarly, aside from the obvious assumption of rapid capability gain, fast stories also differ from slow stories in that they require systems to be capable enough to seize power from a very weak starting point (since in the slow stories, TAI systems are instead given power). How plausible is it that a system could seize power from such a weak starting point? The economic analogy suggests a system would need to acquire a substantial fraction of the world’s resources before attempting to take over, while the historical analogy suggests the system might not need to be much more intelligent than a smart human.

Finally, fast stories don’t really make any assumptions about alignment hackability - they just assume progress is too fast to course-correct. Slow stories must assume hackability is not too high or too low - if hackability is too high there will be no disaster, and if it is too low there will be many escalating warning shots.


Some points that didn't fit into the main post:

If the slow scenarios capture reality better than the fast scenarios, then systems will be deployed deliberately and will initially be given power rather than seizing power. This means both that the systems won’t be so obviously dangerous that the misbehaviour is noticed early on and that there is still misalignment later on. 

This switch from apparently benign to dangerous behaviour could be due to:

  • Power-seeking misaligned behaviour that is too subtle to notice in the training environment but is obviously dangerous in deployment, due to the scale and makeup of the training and deployment environments being quite different
  • Power-seeking misaligned behaviour that only shows up over long time horizons and therefore will not be noticed in training, which we might expect occurs over a shorter period than deployment
  • Systems intentionally hiding misaligned behaviour during training to deceive their operators. Systems could be highly deceptively misaligned from the beginning, and capable enough to know that if they seek power in adversarial ways too early, they will get shut down. This post argues that ML models don't have to be extremely competent to be manipulative, suggesting that these behaviours might show up very early

Rather, it simply has a single good trick that enables it to subvert and take control of the rest of the world. Its takeover capability might be exceptionally good manipulation techniques, specific deadly technology, or cyberoffensive capability, any of which could allow the system to exploit other AIs and humans. 

In reality, I feel that this is more of a fuzzy rather than binary thing: I expect this to require somewhat less of an extraordinary research effort, and instead that there exists somewhat more of a crucial vulnerability in human society (there are already some examples of vulnerabilities, e.g. biological viruses, humans are pretty easy to manipulate under certain conditions). But I also think there are plausibly hard limits to how good various takeover technologies can get - e.g. persuasion tools.

It is unrealistic to expect TAI to be deployed if first there are many worsening warning shots involving dangerous AI systems. This would be comparable to an unrealistic alternate history where nuclear weapons were immediately used by the US and Soviet Union as soon as they were developed and in every war where they might have offered a temporary advantage, resulting in nuclear annihilation in the 1950s. 

Note that this is not the same as an alternate history where nuclear near-misses escalated (e.g. Petrov, Vasili Arkhipov), but instead an outcome where nuclear weapons were used as ordinary weapons of war with no regard for the larger dangers that presented - there would be no concept of ‘near misses’ because MAD wouldn’t have developed as a doctrine. In a previous post I argued, following Anders Sandberg, that paradoxically the large number of nuclear ‘near misses’ implies that there is a forceful pressure away from the worst outcomes.

Robert Wiblin: So just to be clear, you’re saying there’s a lot of near misses, but that hasn’t updated you very much in favor of thinking that the risk is very high. That’s the reverse of what we expected.

Anders Sandberg: Yeah.

Robert Wiblin: Explain the reasoning there.

Anders Sandberg: So imagine a world that has a lot of nuclear warheads. So if there is a nuclear war, it’s guaranteed to wipe out humanity, and then you compare that to a world where there is a few warheads. So if there’s a nuclear war, the risk is relatively small. Now in the first dangerous world, you would have a very strong deflection. Even getting close to the state of nuclear war would be strongly disfavored because most histories close to nuclear war end up with no observers left at all.

In the second one, you get the much weaker effect, and now over time you can plot when the near misses happen and the number of nuclear warheads, and you actually see that they don’t behave as strongly as you would think. If there was a very strong anthropic effect you would expect very few near misses during the height of the Cold War, and in fact you see roughly the opposite. So this is weirdly reassuring. In some sense the Petrov incident implies that we are slightly safer about nuclear war.

However, scenarios also differ on how ‘hackable’ the alignment problem is - that is, how easy it is to ‘correct’ misbehaviour by methods of incremental course correction such as improving oversight and sensor coverage or tweaking reward functions. This correction requires two parts - first, noticing that there is a problem with the system early on, then determining what fix to employ and applying it. 

Many of the same considerations around correcting misbehaviour also apply to detecting misbehaviour, and the required capabilities seem to overlap. In this post, we focus on applying corrections to misbehaviour, but there is existing writing on detecting misbehaviour as well.

Considering inner alignment, Trazzi and Armstrong argue that models don’t have to be very competent to appear aligned when they are not, suggesting that it’s possible that it won’t be easy to tell if deployed systems are inner misaligned. But their argument doesn’t have too much to say about how likely this is in practice.

Considering outer alignment, it seems less clear. See here for a summary of some discussion between Richard Ngo and Paul Christiano about how easy it will be to tell that models are outer misaligned to the objective of pursuing easily-measurable goals (rather than the hard-to-measure goals that we actually want).

What predictions can we make today about how hackable the alignment problem is? Considering outer alignment: without any breakthroughs in techniques, there seems to be a strong case that we are on track towards the ‘intermediate’ world where the alignment problem is hackable until it isn’t. It seems like the best workable approach to outer alignment we have so far is to train systems to try to ensure that the world looks good according to some kind of (augmented) human judgment (i.e. using something like the training regime described in 'An unaligned benchmark'). This will result in a world that “looks good until it doesn’t”, for the reasons described in Another (outer) alignment failure story

Whether the method described in ‘an unaligned benchmark’ (which would result in this risky, intermediate level of hackability) actually turns out to be the most natural method to use for building advanced AI will depend on how easily it produces useful, intelligent behaviour.

If we are lucky, there will be more of a correlation between methods that are easily hackable and methods that produce capabilities we want, such that highly hackable methods are easier to find and more capable than even intermediately hackable methods like the one in ‘An unaligned benchmark’. If you think that the methods we are most likely to employ absent an attempt to change research paradigms are exactly these highly hackable methods, then you accept the claim of Alignment by Default.