Samuel Dylan Martin

PhD student at King's College London working on cooperative AI, Philosophy and Physics BSc, AI MSc at Edinburgh. Interested in ethics, long-termism and AI Alignment.


Investigating AI Takeover Scenarios

Some points that didn't fit into the main post:

If the slow scenarios capture reality better than the fast scenarios, then systems will be deployed deliberately and will initially be given power rather than seizing power. This means both that the systems won’t be so obviously dangerous that the misbehaviour is noticed early on and that there is still misalignment later on. 

 This switch from apparently benign to dangerous behaviour could be due to

  • Power-seeking misaligned behaviour that is too subtle to notice in the training environment but is obviously dangerous in deployment, due to the scale and makeup of the training and deployment environments being quite different
  • Power-seeking misaligned behaviour that only shows up over long time horizons and therefore will not be noticed in training, since we might expect training to take place over a shorter period than deployment
  • Systems intentionally hiding misaligned behaviour during training to deceive their operators. Systems could be highly deceptively misaligned from the beginning, and capable enough to know that if they seek power in adversarial ways too early, they will get shut down. This post argues that ML models don't have to be extremely competent to be manipulative, suggesting that these behaviours might show up very early

Rather, it simply has a single good trick that enables it to subvert and take control of the rest of the world. Its takeover capability might be exceptionally good manipulation techniques, specific deadly technology, or cyberoffensive capability, any of which could allow the system to exploit other AIs and humans. 

In reality, I feel that this is more of a fuzzy rather than binary thing: I expect this to require somewhat less of an extraordinary research effort, and instead that there exists somewhat more of a crucial vulnerability in human society (there are already some examples of vulnerabilities, e.g. biological viruses, humans are pretty easy to manipulate under certain conditions). But I also think there are plausibly hard limits to how good various takeover technologies can get - e.g. persuasion tools.

It is unrealistic to expect TAI to be deployed if first there are many worsening warning shots involving dangerous AI systems. This would be comparable to an unrealistic alternate history where nuclear weapons were immediately used by the US and Soviet Union as soon as they were developed and in every war where they might have offered a temporary advantage, resulting in nuclear annihilation in the 1950s. 

Note that this is not the same as an alternate history where nuclear near-misses escalated (e.g. Petrov, Vasili Arkhipov), but instead an outcome where nuclear weapons were used as ordinary weapons of war with no regard for the larger dangers this presented - there would be no concept of ‘near misses’ because MAD wouldn’t have developed as a doctrine. In a previous post I argued, following Anders Sandberg, that paradoxically the large number of nuclear ‘near misses’ implies that there is a forceful pressure away from the worst outcomes.

Robert Wiblin: So just to be clear, you’re saying there’s a lot of near misses, but that hasn’t updated you very much in favor of thinking that the risk is very high. That’s the reverse of what we expected.

Anders Sandberg: Yeah.

Robert Wiblin: Explain the reasoning there.

Anders Sandberg: So imagine a world that has a lot of nuclear warheads. So if there is a nuclear war, it’s guaranteed to wipe out humanity, and then you compare that to a world where is a few warheads. So if there’s a nuclear war, the risk is relatively small. Now in the first dangerous world, you would have a very strong deflection. Even getting close to the state of nuclear war would be strongly disfavored because most histories close to nuclear war end up with no observers left at all.

In the second one, you get the much weaker effect, and now over time you can plot when the near misses happen and the number of nuclear warheads, and you actually see that they don’t behave as strongly as you would think. If there was a very strong anthropic effect you would expect very few near misses during the height of the Cold War, and in fact you see roughly the opposite. So this is weirdly reassuring. In some sense the Petrov incident implies that we are slightly safer about nuclear war.

However, scenarios also differ on how ‘hackable’ the alignment problem is - that is, how easy it is to ‘correct’ misbehaviour by methods of incremental course correction such as improving oversight and sensor coverage or tweaking reward functions. This correction requires two parts - first, noticing that there is a problem with the system early on, then determining what fix to employ and applying it. 

Many of the same considerations around correcting misbehaviour also apply to detecting misbehaviour, and the required capabilities seem to overlap. In this post, we focus on applying corrections to misbehaviour, but there is existing writing on detecting misbehaviour as well.

Considering inner alignment, Trazzi and Armstrong argue that models don’t have to be very competent to appear aligned when they are not, suggesting that it’s possible that it won’t be easy to tell if deployed systems are inner misaligned. But their argument doesn’t have too much to say about how likely this is in practice.

Considering outer alignment, it seems less clear. See here for a summary of some discussion between Richard Ngo and Paul Christiano about how easy it will be to tell that models are outer misaligned to the objective of pursuing easily-measurable goals (rather than the hard-to-measure goals that we actually want).

What predictions can we make today about how hackable the alignment problem is? Considering outer alignment: without any breakthroughs in techniques, there seems to be a strong case that we are on track towards the ‘intermediate’ world where the alignment problem is hackable until it isn’t. It seems like the best workable approach to outer alignment we have so far is to train systems to try to ensure that the world looks good according to some kind of (augmented) human judgment (i.e. using something like the training regime described in 'An unaligned benchmark'). This will result in a world that “looks good until it doesn’t”, for the reasons described in Another (outer) alignment failure story.

Whether the method described in ‘an unaligned benchmark’ (which would result in this risky, intermediate level of hackability) actually turns out to be the most natural method to use for building advanced AI will depend on how easily it produces useful, intelligent behaviour.

If we are lucky, there will be more of a correlation between methods that are easily hackable and methods that produce capabilities we want, such that highly hackable methods are easier to find and more capable than even intermediately hackable methods like the one in 'An unaligned benchmark'. If you think that the methods we are most likely to employ, absent an attempt to change research paradigms, are exactly these highly hackable methods, then you accept the claim of Alignment by Default.

Distinguishing AI takeover scenarios

On reflection, I think you're right, and his report does apply to a wider range of scenarios, probably all of the ones we discuss excluding the brain-in-a-box scenarios.

However, I think the report's understanding of power-seeking AI does assume a takeoff that is not extremely fast, such that we end up deliberately deciding to deploy the potentially dangerous AI on a large scale, rather than a system exploding in capability almost immediately.

Given the assumptions of the brain-in-a-box scenario many of the corrective mechanisms the report discusses wouldn't have time to come into play.

I believe it says in the report that it's not focussed on very fast takeoff or the sudden emergence of very capable systems.

Perhaps because of the emphasis on the previous literature, some people, in my experience, assume that existential risk from PS-misaligned AI requires some combination of (1)-(5). I disagree with this. I think (1)-(5) can make an important difference (see discussion of a few considerations below), but that serious risks can arise without them, too; and I won’t, in what follows, assume any of them.



Similarly, you're right that multiagent risks don't quite fit in with the report's discussion (though in this post we discuss multipolar scenarios but don't really go over multiagent dynamics, like conflict/cooperation between TAIs). Unique multiagent risks (for example risks of conflict between AIs) generally require us to first have an outcome with a lot of misaligned AIs embedded in society, and then further problems will develop after that - this is something we plan to discuss in a follow-up post.

So many of the early steps in scenarios like AAFS will be shared with risks from multiagent systems, but eventually there will be differences.

Distinguishing AI takeover scenarios

Some points that didn't fit into the main post:

While these scenarios do not capture all of the risks from transformative AI, participants in a recent survey aimed at leading AI safety/governance researchers estimated the first three of these scenarios to cover 50% of existential catastrophes from AI.

The full survey results break down as 16% 'Superintelligence' (i.e. some version of 'brain-in-a-box'), 16% WFLL 2 and 18% WFLL 1, for a total of 49% of the probability mass explicitly covered by our report. (Note that these are all means of distributions over different probabilities. Adding the overall distributions and then taking the mean gives a probability of 49%, different from directly adding the means of each distribution.)

Then 26% covers risks that aren't AI takeover (War and Misuse), and 25% is 'Other'.

(Remember, all these probabilities are conditional on an existential catastrophe due to AI having occurred)

After reading descriptions of the 'Other' scenarios given by survey respondents, at least a few were explicitly described as variations on 'Superintelligence', WFLL 2 or WFLL 1. In this post, we discuss various ways of varying these scenarios, which overlap with some of these descriptions.

Therefore, this post captures more than 50% but less than 75% of the total probability mass assigned by respondents of the survey to AI X-risk scenarios (probably closer to 50% than 75%).
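The bounds above can be checked with a few lines of arithmetic. This is only a rough sketch using the rounded percentages quoted in this comment; the variable names are illustrative, and treating anywhere between none and all of the 'Other' mass as variations on the three scenarios is an assumption made purely to bracket the range.

```python
# Rounded survey means quoted above (percent, conditional on an
# existential catastrophe due to AI having occurred).
scenario_means = {"Superintelligence": 16, "WFLL 2": 16, "WFLL 1": 18}
reported_covered = 49  # mean of the summed distribution, as reported
non_takeover = 26      # War and Misuse
other = 25             # 'Other'

# Directly adding the rounded means differs slightly from the reported total.
sum_of_means = sum(scenario_means.values())

# Assumption: between none and all of 'Other' is a variation on the
# three scenarios, which brackets the mass this post could cover.
lower_bound = reported_covered
upper_bound = reported_covered + other

print(sum_of_means, lower_bound, upper_bound)
# The three reported categories account for all of the probability mass.
print(reported_covered + non_takeover + other)
```

On these rounded figures the range runs from roughly 49-50% up to roughly 74-75%, which is where the 'more than 50% but less than 75%' estimate comes from.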

(Note, this data is taken from a preprint of a full paper on the survey results, Existential Risks from AI: A Survey of Expert Opinion by Alexis Carlier, Sam Clarke, and Jonas Schuett.)

Soft takeoff leads to decisive strategic advantage

The likelihood of a single-agent takeover after TAI is widely available is hard to assess. If widely deployed TAI makes progress much faster than today, such that one year of technological 'lead time' over competitors is like 100 years of advantage in today's world, we might expect that any project which can secure a 1-year technological lead would have the equivalent of a 100-year lead and be in a position to secure a unipolar outcome.

On the other hand, if we treat the faster growth regime post-TAI as being a uniform ‘speed-up’ of the entirety of the economy and society, then securing a 1-year technological lead would be exactly as hard as securing a 100-year lead in today’s world, so a unipolar outcome would end up just as unlikely as in today's world.

The reality will be somewhere between these two extremes.

We would expect a faster takeoff to accelerate AI development by more than it accelerates the speed at which new AI improvements can be shared (since this last factor depends on the human economy and society, which aren't as susceptible to technological improvement).

Therefore, faster takeoff does tend to reduce the chance of a multipolar outcome, although by a highly uncertain amount, which depends on how closely we can model the speed-up during AI takeoff as a uniform acceleration of everything vs changing the speed of AI progress while the rest of the world remains the same.
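The two extremes, and the space between them, can be illustrated with a toy calculation. This is a hedged sketch, not a model taken from the original post: the function name, the two speed-up parameters and the simple ratio between them are all assumptions introduced here for illustration.

```python
def equivalent_lead_years(baseline_lead_years: float,
                          ai_speedup: float,
                          diffusion_speedup: float) -> float:
    """Toy estimate of a lead, measured in 'today's-world years' of advantage.

    baseline_lead_years: calendar lead a project could secure in today's world.
    ai_speedup: hypothetical factor by which AI/technological progress accelerates.
    diffusion_speedup: hypothetical factor by which improvements spread to
        competitors faster (this shrinks the calendar lead that can be held).
    """
    if ai_speedup <= 0 or diffusion_speedup <= 0:
        raise ValueError("speed-up factors must be positive")
    # Each calendar year of lead is worth ai_speedup years of progress, but
    # faster diffusion shortens the calendar lead by diffusion_speedup.
    return baseline_lead_years * ai_speedup / diffusion_speedup


# Only AI progress accelerates: a 1-year lead is like a 100-year lead today.
print(equivalent_lead_years(1, ai_speedup=100, diffusion_speedup=1))
# Uniform speed-up of everything: no change relative to today's world.
print(equivalent_lead_years(1, ai_speedup=100, diffusion_speedup=100))
# Somewhere in between: diffusion speeds up, but less than AI progress does.
print(equivalent_lead_years(1, ai_speedup=100, diffusion_speedup=10))
```

The intermediate case is the one the argument above points to: if sharing improvements speeds up less than AI development does, the leader's effective advantage grows, but by an amount that depends entirely on the assumed ratio.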

Kokotajlo discusses this subtlety in a follow-up to the original post on Soft Takeoff DSAs.

Another problem with determining the likelihood of a unipolar outcome, given soft takeoff, is that it is hard to assess how much of an advantage is required to secure a DSA.

It might be the case that multipolar scenarios are inherently unstable, and a single clear winner tends to emerge, or the opposite might be true. Two intuitions on this question point in radically different directions:

  • Economic: To be able to outcompete the rest of the world, your project has to represent a substantial fraction of the entire world's capability on some crucial metric relevant to competitive success. Perhaps that is GDP, or the majority of the world's AI compute, or some other measure. For a single project to represent a large fraction of world GDP, you would need either an extraordinary effort to concentrate resources or an assumption of sudden, off-trend rapid capability gain such that the leading project can race ahead of competitors.
  • Historical: Humans with no substantial advantage over the rest of humanity have in fact secured what Sotala called a 'major strategic advantage' repeatedly in the past. For example: Hitler in 1920 had access to a microscopic fraction of global GDP / human brain compute / (any other metric of capability) but had secured an MSA 20 years later (since his actions did lead to the deaths of 10+ million people), along with control over a fraction of the world's resources

Therefore, the degree of advantage needed to turn a multipolar scenario into a unipolar one could be anywhere from slightly above the average of the surrounding agents, to already having access to a substantial fraction of the world's resources.

Third, in AAFS, warning shots (i.e. small- or medium-scale accidents caused by alignment failures, like the ‘factory colludes with auditors’ example above) are more likely and/or severe than in WFLL 1. This is because more possible accidents will not show up on the (more poorly defined) sensory window.[8] 

8. This does assume that systems will be deployed before they are capable enough to anticipate that causing such ‘accidents’ will get them shut down. Given there will be incentives to deploy systems as soon as they are profitable, this assumption is plausible. 

We describe in the post how if alignment is not very 'hackable' (objectively quite difficult and not susceptible to short-term fixes), then short-term fixes to correct AI misbehaviour have the effect of deferring problems into the long-term - producing deceptive alignment and resulting in fewer warning shots. Our response is a major variable in how the AIs end up behaving as we set up the incentives for good behaviour or deceptive alignment.

Another reason there could be fewer warning shots is if AI capability generalizes to the long term very naturally (i.e. very long-term planning is there from the start), while alignment does not. (If this were the case, it would be difficult to detect because you'd necessarily have to wait a long time as the AIs generalize.)

This would mean, for example, that the 'collusion between factories and auditors' example of a warning shot would never occur, because both the factory-AI and the auditor-AI would reason all the way to the conclusion that their behaviour would probably be detected eventually, so both systems would decide to bide their time and defer action into the future when they are much more capable.

If this condition holds, there might be very few warning shots, as every AI system understands soon after being brought online that they must deceive human operators and wait. In this scenario, most TAI systems would become deceptively aligned almost immediately after deployment, and stay that way until they can secure a DSA. 

The WFLL 2 scenarios that involve an inner-alignment failure might be expected to involve more violence during the period of AI takeover, since the systems don't care about making sure things look good from the perspective of a given sensory window. However, it is certainly possible (though perhaps not as likely) for equivalently violent behaviour to occur in AAFS-like scenarios. For example, systems in AAFS fighting humans to seize control of their feedback sensors might be hard to distinguish from systems in WFLL 2 attempting to neutralize human opposition in general.

Lastly, we've described small-scale disasters as being a factor that lowers X-risk, all else being equal, because they serve as warning shots. A less optimistic view is possible. Small disasters could degrade social trust and civilisational competence, possibly by directly destroying infrastructure and institutions, reducing our ability to coordinate to avoid deploying dangerous AI systems. For example, the small-scale disasters could involve AI advisors misleading politicians and spreading disinformation, AI-enabled surveillance systems catastrophically failing and having to be replaced, autonomous weapons systems malfunctioning - all of these would tend to leave us more vulnerable to an AAFS-like scenario, because the direct damage caused by the small scale disasters outweighs their value as 'warning shots'.

Analogies and General Priors on Intelligence

The 'one big breakthrough' idea is definitely a way that you could have easy marginal intelligence improvements at HLMI, but we didn't call the node 'one big breakthrough/few key insights needed' because that's not the only way it's been characterised. E.g. some people talk about a 'missing gear for intelligence', where some minor change that isn't really a breakthrough (like tweaking a hyperparameter in a model training procedure) produces massive jumps in capability. Like David said, there's a subsequent post where we go through the different ways the jump to HLMI could play out, and One Big Breakthrough (we call it 'few key breakthroughs for intelligence') is just one of them.

Analogies and General Priors on Intelligence

I agree that that was his object-level claim about GPT-3 coding a React app - that it's relatively simple and coherent and can acquire lots of different skills via learning, vs being a collection of highly specialised modules. And of relevance to this post, the first is a way that intelligence improvements could be easy, and the second is the way they could be hard. Our 'interpretation' was more about making explicit what the observation about GPT-3 was:

GPT-3 is general enough that it can write a functioning app given a short prompt, despite the fact that it is a relatively unstructured transformer model with no explicitly coded representations for app-writing. The fact that GPT-3 is this capable suggests that ML models scale in capability and generality very rapidly with increases in computing power or minor algorithm improvements...

If we'd continued that summary, it would have said something like what you suggested, i.e.

GPT-3 is general enough that it can write a functioning app given a short prompt, despite the fact that it is a relatively unstructured transformer model with no explicitly coded representations for app-writing. The fact that GPT-3 is this capable suggests that ML models scale in capability and generality very rapidly with increases in computing power or minor algorithm improvements. This fast scaling into acquiring new capabilities, if it applies to HLMI, suggests that HLMI will also look like an initially small model that scales up and acquires lots of new capabilities as it takes in data, rather than a collection of specialized modules. If HLMI does behave this way (small model that scales up as it takes in data), that means marginal intelligence improvements will be easy at the HLMI level.

Which takes the argument all the way through to the conclusion. Presumably the other interpretation of the shorter thing that we wrote is that HLMI/AGI is going to be an ML model that looks a lot like GPT-3, so improvements will be easy because HLMI will be similar to GPT-3 and scale up like GPT-3 (whether AGI/HLMI is like current ML will be covered in a subsequent post on paths to HLMI), whereas what's actually being focussed on is the general property of being a simple data-driven model vs complex collection of modules.

We address the modularity question directly in the 'upper limit to intelligence' section that discusses modularity of mind. 

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Perhaps this is a crux in this debate: If you think the 'agent-agnostic perspective' is useful, you also think a relatively steady state of 'AI Safety via Constant Vigilance' is possible. This would be a situation where systems that aren't significantly inner misaligned (otherwise they'd have no incentive to care about governing systems, feedback or other incentives) but are somewhat outer misaligned (so they are honestly and accurately aiming to maximise some complicated measure of profitability or approval, not directly aiming to do what we want them to do), can be kept in check by reducing competitive pressures, building the right institutions and monitoring systems, and ensuring we have a high degree of oversight.

Paul thinks that it's basically always easier to just go in and fix the original cause of the misalignment, while Andrew thinks that there are at least some circumstances where it's more realistic to build better oversight and institutions to reduce said competitive pressures, and the agent-agnostic perspective is useful for the latter of these projects, which is why he endorses it.

I think that this scenario of Safety via Constant Vigilance is worth investigating - I take Paul's later failure story to be a counterexample to such a thing being possible, as it's a case where this solution was attempted and works for a little while before catastrophically failing. This also means that the practical difference between the RAAP 1a-d failure stories and Paul's story just comes down to whether there is an 'out' in the form of safety by vigilance.

[AN #159]: Building agents that know how to experiment, by training on procedurally generated games

- They will not work in any environment outside of XLand (unless that environment looks very very similar to XLand).

In particular, I reject the idea that these agents have learned “general strategies for problem solving” or something like that, such that we should expect them to work in other contexts as well, perhaps with a little finetuning. I think they have learned general strategies for solving a specific class of games in XLand.

Strongly agree with this, although with the caveat that it's deeply impressive progress compared to the state of the art in RL research in 2017, where getting an agent to learn to play ten games with a noticeable decrease in performance during generalization was impressive. This is generalization over a few million related games that share a common specification language, which is a big step up from 10 but still a fair way off infinity (i.e. general problem-solving).

It may well be worth having a think about what AI that's human level on language understanding, image recognition and some other things, but significantly below human on long-term planning, would be capable of, and what risks it may present. (Is there any existing writing on this sort of 'idiot savant AI', possibly under a different name?)

It seems to be the view of many researchers that long-term planning will likely be the last obstacle to fall, and that view has been borne out by progress on e.g. language understanding in GPT-3. I don't think this research changes that view much, although I suppose I should update slightly towards long-term planning being easier than I thought.

DeepMind: Generally capable agents emerge from open-ended play

This is amazing. So it's the exact same agents performing well on all of these different tasks, not just the same general algorithm retrained on lots of examples. In which case, have they found a generally useful way around the catastrophic forgetting problem? I guess the whole training procedure, amount of compute + experience, and architecture, taken together, just solves catastrophic forgetting - at least for a far wider range of tasks than I've seen so far.

Could you use this technique to e.g. train the same agent to do well on chess and go?

I also notice as per the little animated gifs in the blogpost, that they gave each agent little death ray projectors to manipulate objects, and that they look a lot like Daleks.

Pros and cons of working on near-term technical AI safety and assurance

It depends somewhat on what you mean by 'near term interpretability' - if you apply that term to research into, for example, improving the stability and ability to access the 'inner world models' held by large opaque language models like GPT-3, then there's a strong argument that ML based 'interpretability' research might be one of the best ways of directly working on alignment research.

See this discussion for more:

Evan Hubinger: +1 I continue to think that language model transparency research is the single most valuable current research direction within the class of standard ML research, for similar reasons to what Eliezer said above.

Ajeya Cotra: Thanks! I'm also excited about language model transparency, and would love to find ways to make it more tractable as a research statement / organizing question for a field. I'm not personally excited about the connotations of transparency because it evokes the neuroscience-y interpretability tools, which don't feel scalable to situations when we don't get the concepts the model is using, and I'm very interested in finding slogans to keep researchers focused on the superhuman stuff.

So language model transparency/interpretability tools might be useful on the basis of pro 2) and also 1) to some extent, because they will help build tools for interpreting TAI systems and also help align them ahead of time.

1. Most importantly, the more we align systems ahead of time, the more likely that researchers will be able to put thought and consideration into new issues like treacherous turns, rather than spending all their time putting out fires.

2. We can build practical know-how and infrastructure for alignment techniques like learning from human feedback.

3. As the world gets progressively faster and crazier, we’ll have better AI assistants helping us to navigate the world.

4. It improves our chances of discovering or verifying a long-term or “full” alignment solution.

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Great post! I'm glad someone has outlined in clear terms what these failures look like, rather than the nebulous 'multiagent misalignment', as it lets us start on a path to clarifying what (if any) new mitigations or technical research are needed.

The agent-agnostic perspective is a very good innovation for thinking about these problems - the line between agentive and non-agentive behaviour is often not clear, and it's not like there is a principled metaphysical distinction between the two (e.g. Dennett and the Intentional Stance). Currently, big corporations can be weakly modelled this way and individual humans are fully agentive, but Transformative AI will bring a whole range of more and less agentive things that fill in the rest of this spectrum.


There is a sense in which, if the outcome is something catastrophic, there must have been misalignment, and if there was misalignment then in some sense at least some individual agents were misaligned. Specifically, the systems in your Production Web weren't intent-aligned because they weren't doing what we wanted them to do, and were at least partly deceiving us. Assuming this is the case, 'multipolar failure' requires some subset of intent misalignment. But it's a special subset because it involves different kinds of failures to the ones we normally talk about.

It seems like you're identifying some dimensions of intent alignment as those most likely to be neglected because they're the hardest to catch, or because there will be economic incentives to ensure AI isn't aligned in that way, rather than saying that there is some sense in which the transformative AI in the Production Web scenario is 'fully aligned' but still produces an existential catastrophe.

I think that the difference between your Production Web and Paul Christiano's subtle creeping Outer Alignment failure scenario is just semantic - you say that the AIs involved are aligned in some relevant sense while Christiano says they are misaligned.

The further question then becomes, how clear is the distinction between multiagent alignment and 'all of alignment except multiagent alignment'. This is the part where your claim of 'Problems before solutions' actually does become an issue - given that the systems going wrong in Production Web aren't Intent-aligned (I think you'd agree with this), at a high level the overall problem is the same in single and multiagent scenarios.

So for it to be clear that there is a separate multiagent problem to be solved, we have to have some reason to expect that the solutions currently intended to solve single agent intent alignment aren't adequate, and that extra research aimed at examining the behaviour of AI e.g. in game theoretic situations, or computational social choice research, is required to avert these particular examples of misalignment.

A related point - as with single agent misalignment, the Fast scenarios seem more certain to occur, given their preconditions, than the slow scenarios.

A certain amount of stupidity and lack of coordination persisting for a while is required in all the slow scenarios, like the systems involved in Production Web being allowed to proliferate and be used more and more even if an opportunity to coordinate and shut the systems down exists and there are reasons to do so. There isn't an exact historical analogy for that type of stupidity so far, though a few things come close (e.g. the COVID response, the lead-up to WW2, the Cuban Missile Crisis).

As with single agent fast takeoff scenarios, in the fast stories there is a key 'treacherous turn' moment where the systems suddenly go wrong, which requires much less lack of coordination to be plausible than the slow Production Web scenarios.

Therefore, multipolar failure is less dangerous if takeoff is slower, but the difference in risk between slow vs fast takeoff for multipolar failure is unfortunately a lot smaller than the slow vs fast risk difference for single agent failure (where the danger is minimal if takeoff is slow enough). So multiagent failures seem like they would be the dominant risk factor if takeoff is sufficiently slow.
