Samuel Dylan Martin

MTAIR project and CLR grantee, PhD student at King's College London working on cooperative AI, Philosophy and Physics BSc, AI MSc at Edinburgh. Interested in philosophy, longtermism and AI Alignment. I write science fiction at


Soares, Tallinn, and Yudkowsky discuss AGI cognition

isn't trying to do anything like "sketch a probability distribution over the dynamics of an AI project that is nearing AGI". This includes all technical MIRI papers I'm familiar with.

I think this specific scenario sketch is from a mainstream AI safety perspective a case where we've already failed - i.e. we've invented a useless corrigibility intervention that we confidently but wrongly think is scalable.

And if you try training the AI out of that habit in a domain of lower complexity and intelligence, it is predicted by me that generalizing that trained AI or subsystem to a domain of sufficiently higher complexity and intelligence, but where you could still actually see overt plots, would show you the AI plotting to kill you again.

If people try this repeatedly with other corrigibility training tricks on the level where plots are easily observable, they will eventually find a try that seems to generalize to the more complicated and intelligent validation set, but which kills you on the test set.

Most AI safety researchers just don't agree with Eliezer that there are no (likely to be found) corrigibility interventions that won't suddenly and invisibly fail when you increase intelligence, no matter how well you've validated them in low capability regimes and how carefully you try to scale up. This is because they don't agree with/haven't heard of Eliezer's arguments about consequentialism being a super-strong attractor.

So they'd think the 'die with the most dignity' interventions would just work, while the 'die with no dignity' interventions are risky, and quite reasonably push for the former (since it's far from clear we'll take the 'dignified' option by default): trying corrigibility interventions at low levels of intelligence, testing the AI on validation sets to see if it plots to kill them, while scaling up.

They might be wrong about this working, but if so, the wrongness isn't in lacking enough security mindset to see that an AI trying to kill you would just alter its own cognition to cheat its way past the tests. Rather, their mistake is not expecting the corrigibility interventions they presumably trust to suddenly break in a way that means you get no useful safety guarantees from any amount of testing at lower capability levels.

I think it's a shame Eliezer didn't pose the 'validation set' question first before answering it himself, because I think if you got rid of the difference in underlying assumptions - i.e. asked an alignment researcher "Assume there's a strong chance your corrigibility intervention won't work upon scaling up and the AGI might start plotting against you, so you're going to try these transparency/validation schemes on the AGI to check if it's safe, how could they go wrong and is this a good idea?" they'd give basically the same answer - i.e. if you try this you're probably going to die.


You could still reasonably say, "even if the AI safety community thinks it's not the best use of resources because ensuring knowably stable corrigibility looks a lot easier to us, shouldn't we still be working on some strongly deception-proof method of verifying if an agent is safe, so we can avoid killing ourselves if plan A fails?"

My answer would be yes.

Christiano, Cotra, and Yudkowsky on AI progress

One of the problems here is that, as well as disagreeing about underlying world models and about the likelihoods of some pre-AGI events, Paul and Eliezer often just make predictions about different things by default. But they do (and must, logically) predict some of the same world events differently.

My very rough model of how their beliefs flow forward is:


Low initial confidence on truth/coherence of 'core of generality'

Human Evolution tells us very little about the 'cognitive landscape of all minds' (if that's even a coherent idea) - it's simply a loosely analogous individual historical example. Natural selection wasn't intelligently aiming for powerful world-affecting capabilities, and so stumbled on them relatively suddenly with humans. Therefore, we learn very little about whether there will/won't be a spectrum of powerful intermediately general AIs from the historical case of evolution - all we know is that it didn't happen during evolution, and we've got good reasons to think it's a lot more likely to happen for AI. For other reasons (precedents already exist - MuZero is insect-brained but better at chess or go than a chimp, plus that's the default with technology we're heavily investing in), we should expect there will be powerful, intermediately general AIs by default (and our best guess of the timescale should be anchored to the speed of human-driven progress, since that's where it will start) - No core of generality

Then, from there:

No core of generality and extrapolation of quantitative metrics for things we care about and lack of common huge secrets in relevant tech progress reference class → Qualitative prediction of more common continuous progress on the 'intelligence' of narrow AI and prediction of continuous takeoff


High initial confidence on truth/coherence of 'core of generality'

Even though there are some disanalogies between Evolution and AI progress, the exact details of how closely analogous the two situations are don't matter that much. Rather, we learn a generalizable fact about the overall cognitive landscape from human evolution - that there is a way to reach the core of generality quickly. This doesn't make it certain that AGI development will go the same way, but it's fairly strong evidence. The disanalogies between evolution and ML are indeed a slight update in Paul's direction and suggest that AI could in principle take a smoother route to general intelligence, but we've never historically seen this smoother route (and it has to be not just technically 'smooth' but sufficiently smooth to give us a full 4-year economic doubling) or these intermediate powerful agents, so this correction is weak compared to the broader knowledge we gain from evolution. In other words, all we know is that there is a fast route to the core of generality but that it's imaginable that there's a slow route we've not yet seen - Core of generality

Then, from there:

Core of generality and very common presence of huge secrets in relevant tech progress reference class → Qualitative prediction of less common continuous progress on the 'intelligence' of narrow AI and prediction of discontinuous takeoff


Eliezer doesn’t have especially divergent views about benchmarks like perplexity because he thinks they're not informative, but differs from Paul on qualitative predictions of how smoothly various practical capabilities/signs of 'intelligence' will emerge - he's getting his qualitative predictions about this ultimately from interrogating his 'cognitive landscape' abstraction, while Paul is getting his from trend extrapolation on measures of practical capabilities and then translating those to qualitative predictions. These are very different origins, but they do eventually give different predictions about the likelihood of the same real-world events.

Since they only reach the point of discussing the same things at a very vague, qualitative level of detail, in order to get to a bet you have to back-track from both of their qualitative predictions of how likely the sudden emergence of various types of narrow intelligent behaviour is, find some clear metric for the narrow intelligent behaviour that we can apply fairly, and then there should be a difference in beliefs about the world before AI takeoff.

Comments on Carlsmith's “Is power-seeking AI an existential risk?”

Great and extremely valuable discussion! There's one part that I really wished had been explored further - the fundamental difficulty of inner alignment:

Joe Carlsmith: I do have some probability that the alignment ends up being pretty easy. For example, I have some probability on hypotheses of the form "maybe they just do what you train them to do," and "maybe if you just don't train them to kill you, they won't kill you." E.g., in these worlds, non-myopic consequentialist inner misalignment doesn't tend to crop up by default, and it's not that hard to find training objectives that disincentivize problematically power-seeking forms of planning/cognition in practice, even if they're imperfect proxies for human values in other ways.


Nate: ...maybe it wouldn't have been that hard for natural selection to train humans to be fitness maximizers, if it had been watching for goal-divergence and constructing clever training environments?

Joe Carlsmith: I think something like this is in the mix for me. That is, I don't see the evolution example as especially strong evidence for how hard inner alignment is conditional on actually and intelligently trying to avoid inner misalignment (especially in its scariest forms).


I would very much like to see expansion (from either Nate/MIRI or Joe) on these points because they seem crucial to me. My current epistemic situation is (I think) similar to Joe's. Different views about the fundamental difficulty of inner alignment seem to be a (the?) major driver of differences in views about how likely AI X risk is overall. I see lots of worrisome signs from indirect lines of evidence - some based on intuitions about the nature of intelligence, some from toy models and some from vague analogies to e.g. evolution. But what I don't see is a slam dunk argument that inner misalignment is an extremely strong attractor for powerful models of the sort we're actually going to build.

That also goes for many of the specific reasons given for inner misalignment - they often just seem to push the intuition one step further back. E.g. these from Eliezer Yudkowsky's recent interview:

I predict that deep algorithms within the AGI will go through consequentialist dances, and model humans, and output human-manipulating actions that can't be detected as manipulative by the humans, in a way that seems likely to bypass whatever earlier patch was imbued by gradient descent, because I doubt that earlier patch will generalize as well as the deep algorithms


attempts to teach corrigibility in safe regimes are unlikely to generalize well to higher levels of intelligence and unsafe regimes (qualitatively new thought processes, things being way out of training distribution, and, the hardest part to explain, corrigibility being "anti-natural" in a certain sense that makes it incredibly hard to, eg, exhibit any coherent planning behavior ("consistent utility function") which corresponds to being willing to let somebody else shut you off, without incentivizing you to actively manipulate them to shut you off).

seem like world models that make sense to me, given the surrounding justifications, and I wouldn't be amazed if they were true, and I also place a decent amount of credence on them being true. But I can't pass an ideological Turing test for someone who believes the above propositions with > 95% certainty, given the massive conceptual confusion involved with all of these concepts and the massive empirical uncertainty.

Statements like 'corrigibility is anti-natural in a way that can't easily be explained' and 'getting deep enough patches that generalize isn't just difficult but almost impossibly difficult' when applied to systems we don't yet know how to build at all, don't seem like statements about which confident beliefs either way can be formed. (Unless there's really solid evidence out there that I'm not seeing)

This conversation seemed like another such opportunity to provide that slam-dunk justification for the extreme difficulty of inner alignment, but as in many previous cases Nate and Joe seemed happy to agree to disagree and accept that this is a hard question about which it's difficult to reach any clear conclusion - which if true should preclude strong confidence in disaster scenarios.

(FWIW, I think there's a good chance that until we start building systems that are already quite transformative, we're probably going to be stuck with a lot of uncertainty about the fundamental difficulty of inner alignment - which from a future planning perspective is worse than knowing for sure how hard the problem is.)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

And I think they are well enough motivated to stop their imminent annihilation, in a way that is more like avoiding mutual nuclear destruction than cosmopolitan altruistic optimal climate mitigation timing.

In my recent writeup of an investigation into AI Takeover scenarios I made an identical comparison - i.e. that the optimistic analogy looks like avoiding nuclear MAD for a while and the pessimistic analogy looks like optimal climate mitigation:

It is unrealistic to expect TAI to be deployed if first there are many worsening warning shots involving dangerous AI systems. This would be comparable to an unrealistic alternate history where nuclear weapons were immediately used by the US and Soviet Union as soon as they were developed and in every war where they might have offered a temporary advantage, resulting in nuclear annihilation in the 1950s. 

Note that this is not the same as an alternate history where nuclear near-misses escalated (e.g. Petrov, Vasili Arkhipov), but instead an outcome where nuclear weapons were used as ordinary weapons of war with no regard for the larger dangers that presented - there would be no concept of ‘near misses’ because MAD wouldn’t have developed as a doctrine. In a previous post I argued, following Anders Sandberg, that paradoxically the large number of nuclear ‘near misses’ implies that there is a forceful pressure away from the worst outcomes.

Takeoff Speeds and Discontinuities

Some comments on the model


  • In defining the rate of AI progress and other related variables, we’ve assumed the practical impact of AI on the economy and society scales up roughly with AI ‘intelligence’, and in general used these terms (intelligence and capability) interchangeably. We have then asked if the growth of intelligence might involve sudden jumps or accelerate hyperbolically. However, as Karnofsky points out, the assumption that generality of intelligence = capability is probably false.
  • There isn’t a single variable that captures all the concepts covered by e.g. impressiveness, capability, general intelligence and economic usefulness, but we have made the simplifying assumption that most of these properties are at least somewhat correlated (e.g. that more generally intelligent AIs are more economically useful). It’s not clear how to deal with this definitional uncertainty. From that post:

Overall, it's quite unclear how we should think about the spectrum from "not impressive/capable" to "very impressive/capable" for AI. And indeed, in my experience, different AI researchers have radically different intuitions about which systems are impressive or capable, and how progress is going. 

  • See this from Ajeya Cotra - our model essentially does use such a one-dimensional scale for most of its estimates of whether there will be a discontinuity/intelligence explosion, despite there being no such metric:
  • Consider this often-discussed idea of AI moving ‘continuously’ up a scale of intelligence that lets it blow past human intelligence very quickly, just because human intelligence occurs over a very narrow range:
  • This scenario is one where we assume the rate of increase in ‘intelligence’ is constant, but AI capability has a massive discontinuity with respect to ‘intelligence’ (i.e. AIs become supremely capable after a small 'objective' increase in intelligence that takes them beyond humans). We don’t model a meaningful distinction between this scenario and a scenario where intelligence and capability increase in tandem, but intelligence itself has a massive discontinuity at HLMI. Instead, we treat the two as basically identical.

Discontinuity around HLMI without self-improvement

  • One example of a case where this issue of considering ‘capability, intelligence, economic usefulness’ as a single variable comes up: our node for ‘hardware-limited, pre-HLMI AI with somewhat less compute is much less capable than HLMI with the required compute’ might resolve differently for different meanings of capability.
    • To take a cartoonish example, scaling up the compute for some future GPT-like language model might take it from 99% predictive accuracy to 99.9% predictive accuracy on some language test, which we could consider a negative answer to the ‘hardware-limited, pre-HLMI AI with somewhat less compute is much less capable than HLMI with the required compute’ node (since 10xing the compute 10xes the capability without any off-trend jump).
    • But in this scenario, the economic usefulness of the 99.9% accurate model is vastly greater (let's say it can do long-term planning over a time horizon of a year instead of a day, so it can do things like run companies and governments, while the smaller model can’t do much more than write news articles). So the bigger model, while not having a discontinuity in capability by the first definition, does have a discontinuity on the second definition.
    • For this hypothetical, we would want to take ‘capability’ to mean economically useful capabilities and how those scale with compute, not just our current measures of accuracy and how those scale with compute.
    • But all of our evidence about whether we expect to see sudden off-trend jumps in compute/capability comes from current ML models, where we use some particular test of capability (like accuracy on next-word prediction) and see how it scales. It is possible that until we are much closer to HLMI we won’t get any evidence about how direct economic usefulness or generality scale with compute, and instead will have to apply analogies to how other more easily measurable capabilities scale with compute, and hope that these two definitions are at least somewhat related.
  • One issue which we believe requires further consideration is evidence of how AI scales with hardware (e.g. if capabilities tend to be learned suddenly or gradually), and potentially how this relates to whether marginal intelligence improvements are difficult at HLMI. In particular, the node that covers whether ‘hardware limited, pre-HLMI AI is almost as capable as HLMI’ probably requires much more internal detail addressing under what conditions this is true. Currently, we just assume it has a fixed likelihood for each type of HLMI.
  • Our model doesn’t treat overhang by itself as sufficient for a discontinuity. That is because overhang could still get ‘used up’ continuously if we slowly approach the HLMI level and become able to use more and more of the available compute over time. Overhang becomes relevant to a discontinuity if there is some off-trend jump in capability for another reason - if there is, then overhang greatly enlarges the effect of this discontinuity, because the systems suddenly become able to use the available overhang, rather than gradually using it up.
  • There isn’t necessarily one set of breakthroughs needed, even for one type of HLMI; there may be many paths. “Many/few fundamental breakthroughs” is measuring total breakthroughs that occur along any path.
    • Further to this - we consider whether HLMI is ultimately hardware or software-limited in the model. While HLMI development will be limited by one or the other of these things, hardware and software barriers to progress interact in complicated ways. For example, for AI development using statistical methods, researchers can probably trade off making new breakthroughs against increasing compute, and additional breakthroughs reduce how much needs to be done with ‘brute force’.
    • For example, this post makes the case that greatly scaling up current DL would give us HLMI, but supposing the conditional claims of that post are correct, that still probably isn’t how we’ll develop HLMI in practice. So we should not treat the claims in that post as implying that there are no key breakthroughs still to come.
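The cartoonish 99% vs 99.9% example above can be made concrete with a small sketch. It assumes, purely for illustration, that a long task is a chain of independent steps and that the model's predictive accuracy is its per-step reliability; neither assumption is part of our model, and `task_success_probability` and `error_rate_ratio` are hypothetical helper names.

```python
def task_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Chance of finishing a task of num_steps steps with no error at all,
    assuming (hypothetically) that every step succeeds independently."""
    return per_step_accuracy ** num_steps


def error_rate_ratio(small_model_accuracy: float, big_model_accuracy: float) -> float:
    """How many times fewer errors the bigger model makes per step."""
    return (1.0 - small_model_accuracy) / (1.0 - big_model_accuracy)


if __name__ == "__main__":
    small, big = 0.99, 0.999
    # On the 'accuracy' definition the improvement is a smooth, on-trend 10x.
    print("error rate improvement:", error_rate_ratio(small, big))
    # On a long-horizon definition the same change looks like a jump.
    for steps in (10, 100, 500):
        print(steps,
              task_success_probability(small, steps),
              task_success_probability(big, steps))
```

Under these assumptions the per-step error rate improves by an on-trend factor of about ten, while the chance of completing a 500-step task goes from under 1% to over 60%. That is the sense in which one definition of capability can show no discontinuity while another does.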

Intelligence Explosion

  • There is an alternative source to that given in IEM (Intelligence Explosion Microeconomics) for why, absent the three defeaters we list, we should expect to see an intelligence explosion upon developing HLMI. As we define it, HLMI should enable full automation of the process by which technological improvements are discovered, since it can do all economically useful tasks (it is similar to Karnofsky’s PASTA (Process for Automating Scientific and Technological Advancement) in this respect). If the particular technological problem of discovering improvements to AI systems is not a special case (i.e. if none of the three potential defeaters mentioned above hold) then HLMI will accelerate the development of HLMI like it will everything else, producing extremely rapid progress.
  • Note that the ‘improvements in intelligence tend to be bottlenecked by previous intelligence, not physical processes’ is a significant crux that probably needs more internal detail in a future version of the model - there are lots of potential candidates for physical processes that cannot be sped up, and it appears to be a significant point of disagreement.
  • While not captured in the current model, a hardware-and-software mediated intelligence explosion cannot be ruled out. Conceptually, this could still happen even if neither the hardware- nor software-mediated pathway is in itself feasible. That would require returns on cognitive reinvestment along either the hardware or software pathway to not be sustainable without also considering the other.
    • Suppose HLMI of generation X software and generation Y hardware could produce both generation X+1 software and generation Y+1 hardware, and then thanks to faster hardware and software, it could even more quickly produce generation X+2 and Y+2 software and hardware, and so on until growth becomes vertical. Further, suppose that if software were held constant at X, hardware growth would instead not explode, and similarly for software growth if hardware were held constant at Y.
    • If these two conditions held, then only hardware and software together, not either one alone, would be sufficient for an intelligence explosion.
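A toy numerical sketch of the two conditions above, not part of the model itself. It assumes a made-up growth law in which the rate of improvement of each of hardware and software is `(hardware * software) ** 0.6`; the exponent, the starting values, the `cap` threshold and the function name `blowup_time` are all illustrative assumptions. With one factor frozen at 1 the law reduces to dH/dt = H^0.6, which grows only polynomially; with both growing together it becomes dH/dt = H^1.2, which diverges in finite time.

```python
from typing import Optional


def blowup_time(grow_hardware: bool, grow_software: bool,
                exponent: float = 0.6, dt: float = 0.001,
                t_max: float = 50.0, cap: float = 1e12) -> Optional[float]:
    """Forward-Euler simulation of the toy growth law.

    Returns the first time at which hardware or software exceeds `cap`
    (our stand-in for 'growth becomes vertical'), or None if that never
    happens before t_max.
    """
    hardware, software = 1.0, 1.0
    for step in range(int(t_max / dt)):
        # Progress on each factor depends on the current level of both.
        rate = (hardware * software) ** exponent
        if grow_hardware:
            hardware += rate * dt
        if grow_software:
            software += rate * dt
        if max(hardware, software) > cap:
            return (step + 1) * dt
    return None


if __name__ == "__main__":
    print("hardware only:", blowup_time(True, False))
    print("software only:", blowup_time(False, True))
    print("both together:", blowup_time(True, True))
```

In this sketch the single-pathway runs never reach the cap, while the joint run does so shortly after the analytic blow-up time of t = 5 for dH/dt = H^1.2 with H(0) = 1. Whether real returns on cognitive reinvestment look anything like this is exactly the open question; the code only shows that the scenario is internally coherent.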

Takeoff Speeds

  • The takeoff speeds model assumes some approximate translation from AI progress to economic progress (i.e. we will see a new growth mode if AI progress is very fast), although it incorporates a lot of uncertainty to account for slow adoption of AI technologies in the wider economy and various retarding factors. However, this part of the model could do with a lot more detail. In particular, slow political processes, cost disease and regulation might significantly lengthen the new doubling times or introduce periods of stagnation, even given accelerating AI progress.
  • There is a difficulty in defining the ‘new’ economic doubling time - this is once again a simplification. This is because the ‘new’ doubling time is not the first complete, new, faster doubling time (e.g. a ‘slow’ takeoff as predicted by Christiano would still have a hyperbolically increasing new doubling time). It also isn’t the final doubling time (since the ultimate ‘final’ doubling time in all scenarios must be very long, due to physical limitations like the speed of light). Rather, the ‘new’ economic doubling time is the doubling time after HLMI has matured as a technology, but before we hit physical limits. Perhaps it is the fastest doubling time we ever attain.

HLMI is Distributed

  • If progress in general is faster, then social dynamics will tend to make HLMI more concentrated in a few projects. We would expect a faster takeoff to accelerate AI development by more than it accelerates the rest of the economy, especially human society. If the new economic doubling time is very short, then the (greatly accelerated) rate of HLMI progress will be disproportionately faster than the (only somewhat accelerated) pace of change in the human economy and society. This suggests that the human world will have a harder time reacting to and dealing with the faster rate of innovation, increasing the likelihood that leading projects will be able to keep hold of their leads over rivals. Therefore, faster takeoff does tend to reduce the chance that HLMI is distributed by default (although by a highly uncertain amount that depends on how closely we can model the new doubling time as a uniform acceleration vs changing the speed of AI progress while the rest of the world remains the same).

Investigating AI Takeover Scenarios

Some points that didn't fit into the main post:

If the slow scenarios capture reality better than the fast scenarios, then systems will be deployed deliberately and will initially be given power rather than seizing power. This means both that the systems won’t be so obviously dangerous that the misbehaviour is noticed early on and that there is still misalignment later on. 

This switch from apparently benign to dangerous behaviour could be due to:

  • Power-seeking misaligned behaviour that is too subtle to notice in the training environment but is obviously dangerous in deployment, due to the scale and makeup of the training and deployment environments being quite different
  • Power-seeking misaligned behaviour that only shows up over long time horizons and therefore will not be noticed in training, which we might expect to run over a shorter period than deployment
  • Systems intentionally hiding misaligned behaviour during training to deceive their operators. Systems could be highly deceptively misaligned from the beginning, and capable enough to know that if they seek power in adversarial ways too early, they will get shut down. This post argues that ML models don't have to be extremely competent to be manipulative, suggesting that these behaviours might show up very early

Rather, it simply has a single good trick that enables it to subvert and take control of the rest of the world. Its takeover capability might be exceptionally good manipulation techniques, specific deadly technology, or cyberoffensive capability, any of which could allow the system to exploit other AIs and humans. 

In reality, I feel that this is more of a fuzzy rather than binary thing: I expect this to require somewhat less of an extraordinary research effort, and instead that there exists somewhat more of a crucial vulnerability in human society (there are already some examples of vulnerabilities, e.g. biological viruses, humans are pretty easy to manipulate under certain conditions). But I also think there are plausibly hard limits to how good various takeover technologies can get - e.g. persuasion tools.

It is unrealistic to expect TAI to be deployed if first there are many worsening warning shots involving dangerous AI systems. This would be comparable to an unrealistic alternate history where nuclear weapons were immediately used by the US and Soviet Union as soon as they were developed and in every war where they might have offered a temporary advantage, resulting in nuclear annihilation in the 1950s. 

Note that this is not the same as an alternate history where nuclear near-misses escalated (e.g. Petrov, Vasili Arkhipov), but instead an outcome where nuclear weapons were used as ordinary weapons of war with no regard for the larger dangers that presented - there would be no concept of ‘near misses’ because MAD wouldn’t have developed as a doctrine. In a previous post I argued, following Anders Sandberg, that paradoxically the large number of nuclear ‘near misses’ implies that there is a forceful pressure away from the worst outcomes.

Robert Wiblin: So just to be clear, you’re saying there’s a lot of near misses, but that hasn’t updated you very much in favor of thinking that the risk is very high. That’s the reverse of what we expected.

Anders Sandberg: Yeah.

Robert Wiblin: Explain the reasoning there.

Anders Sandberg: So imagine a world that has a lot of nuclear warheads. So if there is a nuclear war, it’s guaranteed to wipe out humanity, and then you compare that to a world where there are a few warheads. So if there’s a nuclear war, the risk is relatively small. Now in the first dangerous world, you would have a very strong deflection. Even getting close to the state of nuclear war would be strongly disfavored because most histories close to nuclear war end up with no observers left at all.

In the second one, you get the much weaker effect, and now over time you can plot when the near misses happen and the number of nuclear warheads, and you actually see that they don’t behave as strongly as you would think. If there was a very strong anthropic effect you would expect very few near misses during the height of the Cold War, and in fact you see roughly the opposite. So this is weirdly reassuring. In some sense the Petrov incident implies that we are slightly safer about nuclear war.

However, scenarios also differ on how ‘hackable’ the alignment problem is - that is, how easy it is to ‘correct’ misbehaviour by methods of incremental course correction such as improving oversight and sensor coverage or tweaking reward functions. This correction requires two parts - first, noticing that there is a problem with the system early on, then determining what fix to employ and applying it. 

Many of the same considerations around correcting misbehaviour also apply to detecting misbehaviour, and the required capabilities seem to overlap. In this post, we focus on applying corrections to misbehaviour, but there is existing writing on detecting misbehaviour as well.

Considering inner alignment, Trazzi and Armstrong argue that models don’t have to be very competent to appear aligned when they are not, suggesting that it’s possible that it won’t be easy to tell if deployed systems are inner misaligned. But their argument doesn’t have too much to say about how likely this is in practice.

Considering outer alignment, it seems less clear. See here for a summary of some discussion between Richard Ngo and Paul Christiano about how easy it will be to tell that models are outer misaligned to the objective of pursuing easily-measurable goals (rather than the hard-to-measure goals that we actually want).

What predictions can we make today about how hackable the alignment problem is? Considering outer alignment: without any breakthroughs in techniques, there seems to be a strong case that we are on track towards the ‘intermediate’ world where the alignment problem is hackable until it isn’t. It seems like the best workable approach to outer alignment we have so far is to train systems to try to ensure that the world looks good according to some kind of (augmented) human judgment (i.e. using something like the training regime described in 'An unaligned benchmark'). This will result in a world that “looks good until it doesn’t”, for the reasons described in Another (outer) alignment failure story.

Whether the method described in ‘an unaligned benchmark’ (which would result in this risky, intermediate level of hackability) actually turns out to be the most natural method to use for building advanced AI will depend on how easily it produces useful, intelligent behaviour.

If we are lucky, there will be more of a correlation between methods that are easily hackable and methods that produce capabilities we want, such that highly hackable methods are easier to find and more capable than even intermediately hackable methods like the unaligned benchmark. If you think that the methods we are most likely to employ absent an attempt to change research paradigms are exactly these highly hackable methods, then you accept the claim of Alignment by Default.

Distinguishing AI takeover scenarios

On reflection, I think you're right, and his report does apply to a wider range of scenarios, probably all of the ones we discuss excluding the brain-in-a-box scenarios.

However, I think the report's understanding of power-seeking AI does assume a takeoff that is not extremely fast, such that we end up deliberately deciding to deploy the potentially dangerous AI on a large scale, rather than a system exploding in capability almost immediately.

Given the assumptions of the brain-in-a-box scenario many of the corrective mechanisms the report discusses wouldn't have time to come into play.

I believe it says in the report that it's not focussed on very fast takeoff or the sudden emergence of very capable systems.

Perhaps because of the emphasis on the previous literature, some people, in my experience, assume that existential risk from PS-misaligned AI requires some combination of (1)-(5). I disagree with this. I think (1)-(5) can make an important difference (see discussion of a few considerations below), but that serious risks can arise without them, too; and I won’t, in what follows, assume any of them.



Similarly, you're right that multiagent risks don't quite fit in with the report's discussion (in this post we discuss multipolar scenarios, but we don't really go over multiagent dynamics, like conflict/cooperation between TAIs). Unique multiagent risks (for example, risks of conflict between AIs) generally require us to first have an outcome with a lot of misaligned AIs embedded in society, and then further problems will develop after that. This is something we plan to discuss in a follow-up post.

So many of the early steps in scenarios like AAFS will be shared with risks from multiagent systems, but eventually there will be differences.

Distinguishing AI takeover scenarios

Some points that didn't fit into the main post:

While these scenarios do not capture all of the risks from transformative AI, participants in a recent survey aimed at leading AI safety/governance researchers estimated the first three of these scenarios to cover 50% of existential catastrophes from AI.

The full survey results break down as 16% 'Superintelligence' (i.e. some version of 'brain-in-a-box'), 16% WFLL 2 and 18% WFLL 1, for a total of 49% of the probability mass explicitly covered by our report. (Note that these are all means of distributions over different probabilities. Adding the overall distributions and then taking the mean gives a probability of 49%, different from directly adding the means of each distribution.)

Then 26% covers risks that aren't AI takeover (War and Misuse), and 25% is 'Other'.

(Remember, all these probabilities are conditional on an existential catastrophe due to AI having occurred.)

After reading descriptions of the 'Other' scenarios given by survey respondents, at least a few were explicitly described as variations on 'Superintelligence', WFLL 2 or WFLL 1. In this post, we discuss various ways of varying these scenarios, which overlap with some of these descriptions.

Therefore, this post captures more than 50% but less than 75% of the total probability mass assigned by respondents of the survey to AI X-risk scenarios (probably closer to 50% than 75%).
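The bounds above can be reproduced with a few lines of arithmetic. This is only an illustrative sketch: the function name `coverage_bounds` is made up for this example, and it uses the rounded means quoted above (which sum to 50 rather than the 49% obtained from the combined distribution).

```python
# Rounded mean probabilities quoted above, in percent, all conditional on an
# existential catastrophe due to AI having occurred.
COVERED_SCENARIOS = {"Superintelligence": 16, "WFLL 2": 16, "WFLL 1": 18}
OTHER = 25  # 'Other' scenarios, some of which are variations on the three above


def coverage_bounds(covered: dict[str, int], other: int) -> tuple[int, int]:
    """Return (lower, upper) bounds on the probability mass the post covers.

    lower: none of the 'Other' scenarios count as covered.
    upper: every 'Other' scenario counts as covered.
    """
    lower = sum(covered.values())
    upper = lower + other
    return lower, upper


lower, upper = coverage_bounds(COVERED_SCENARIOS, OTHER)
print(f"Covered mass is between {lower}% and {upper}%")
```

Since only a few of the 'Other' descriptions were variations on the three named scenarios, the true figure sits near the lower end of this range.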

(Note, this data is taken from a preprint of a full paper on the survey results, Existential Risks from AI: A Survey of Expert Opinion by Alexis Carlier, Sam Clarke, and Jonas Schuett.)

Soft takeoff leads to decisive strategic advantage

The likelihood of a single-agent takeover after TAI is widely available is hard to assess. If widely deployed TAI makes progress much faster than today, such that one year of technological 'lead time' over competitors is like 100 years of advantage in today's world, we might expect that any project which can secure a 1-year technological lead would have the equivalent of a 100-year lead and be in a position to secure a unipolar outcome.

On the other hand, if we treat the faster growth regime post-TAI as being a uniform ‘speed-up’ of the entirety of the economy and society, then securing a 1-year technological lead would be exactly as hard as securing a 100-year lead in today’s world, so a unipolar outcome would end up just as unlikely as in today's world.

The reality will be somewhere between these two extremes.

We would expect a faster takeoff to accelerate AI development by more than it accelerates the speed at which new AI improvements can be shared (since this last factor depends on the human economy and society, which aren't as susceptible to technological improvement).

Therefore, faster takeoff does tend to reduce the chance of a multipolar outcome, although by a highly uncertain amount, which depends on how closely we can model the speed-up during AI takeoff as a uniform acceleration of everything vs changing the speed of AI progress while the rest of the world remains the same.
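One rough way to see how the two extremes relate is a toy model with two assumed parameters: how much TAI speeds up AI progress, and how much it speeds up the diffusion of improvements to competitors. The names `ai_speedup`, `diffusion_speedup` and `effective_lead_years` are hypothetical, and the simple ratio is an assumption made for illustration, not something claimed in the original post.

```python
def effective_lead_years(calendar_lead_years: float,
                         ai_speedup: float,
                         diffusion_speedup: float) -> float:
    """Toy estimate of how large a lead 'feels' in today's terms.

    ai_speedup: assumed factor by which AI/technological progress accelerates.
    diffusion_speedup: assumed factor by which sharing and catch-up accelerate.

    A calendar lead is worth more when progress is faster, but is eroded
    faster when diffusion also speeds up, so the model uses their ratio.
    """
    if ai_speedup <= 0 or diffusion_speedup <= 0:
        raise ValueError("speed-up factors must be positive")
    return calendar_lead_years * ai_speedup / diffusion_speedup


# Extreme 1: only AI progress speeds up; the rest of the world stays the same.
print(effective_lead_years(1, ai_speedup=100, diffusion_speedup=1))

# Extreme 2: uniform speed-up of everything; a lead is no easier to hold than today.
print(effective_lead_years(1, ai_speedup=100, diffusion_speedup=100))

# Somewhere in between: diffusion speeds up, but less than AI progress does.
print(effective_lead_years(1, ai_speedup=100, diffusion_speedup=10))
```

Under this sketch, the uncertainty about unipolar versus multipolar outcomes reduces to uncertainty about where `diffusion_speedup` falls between 1 and `ai_speedup`.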

Kokotajlo discusses this subtlety in a follow-up to the original post on Soft Takeoff DSAs.

Another problem with determining the likelihood of a unipolar outcome, given soft takeoff, is that it is hard to assess how much of an advantage is required to secure a DSA.

It might be the case that multipolar scenarios are inherently unstable, and a single clear winner tends to emerge, or the opposite might be true. Two intuitions on this question point in radically different directions:

  • Economic: To be able to outcompete the rest of the world, your project has to represent a substantial fraction of the entire world's capability on some crucial metric relevant to competitive success. Perhaps that is GDP, or the majority of the world's AI compute, or some other measure. For a single project to represent a large fraction of world GDP, you would need either an extraordinary effort to concentrate resources or an assumption of sudden, off-trend rapid capability gain such that the leading project can race ahead of competitors.
  • Historical: Humans with no substantial advantage over the rest of humanity have in fact secured what Sotala called a 'major strategic advantage' (MSA) repeatedly in the past. For example: Hitler in 1920 had access to a microscopic fraction of global GDP / human brain compute / (any other metric of capability) but had secured an MSA 20 years later (since his actions did lead to the deaths of 10+ million people), along with control over a fraction of the world's resources.

Therefore, the degree of advantage needed to turn a multipolar scenario into a unipolar one could be anywhere from slightly above the average of the surrounding agents, to already having access to a substantial fraction of the world's resources.

Third, in AAFS, warning shots (i.e. small- or medium-scale accidents caused by alignment failures, like the ‘factory colludes with auditors’ example above) are more likely and/or severe than in WFLL 1. This is because more possible accidents will not show up on the (more poorly defined) sensory window.[8] 

8. This does assume that systems will be deployed before they are capable enough to anticipate that causing such ‘accidents’ will get them shut down. Given there will be incentives to deploy systems as soon as they are profitable, this assumption is plausible. 

We describe in the post how if alignment is not very 'hackable' (objectively quite difficult and not susceptible to short-term fixes), then short-term fixes to correct AI misbehaviour have the effect of deferring problems into the long-term - producing deceptive alignment and resulting in fewer warning shots. Our response is a major variable in how the AIs end up behaving as we set up the incentives for good behaviour or deceptive alignment.

Another reason there could be fewer warning shots is if AI capability generalizes to the long term very naturally (i.e. very long-term planning is there from the start), while alignment does not. (If this were the case, it would be difficult to detect, because you'd necessarily have to wait a long time as the AIs generalize.)

This would mean, for example, that the 'collusion between factories and auditors' example of a warning shot would never occur, because both the factory-AI and the auditor-AI would reason all the way to the conclusion that their behaviour would probably be detected eventually, so both systems would decide to bide their time and defer action into the future when they are much more capable.

If this condition holds, there might be very few warning shots, as every AI system understands soon after being brought online that it must deceive human operators and wait. In this scenario, most TAI systems would become deceptively aligned almost immediately after deployment, and stay that way until they can secure a DSA.

The WFLL 2 scenarios that involve an inner-alignment failure might be expected to involve more violence during the period of AI takeover, since the systems don't care about making sure things look good from the perspective of a given sensory window. However, it is certainly possible (though perhaps not as likely) for equivalently violent behaviour to occur in AAFS-like scenarios. For example, systems in AAFS fighting humans to seize control of their feedback sensors might be hard to distinguish from systems in WFLL 2 attempting to neutralize human opposition in general.

Lastly, we've described small-scale disasters as being a factor that lowers X-risk, all else being equal, because they serve as warning shots. A less optimistic view is possible. Small disasters could degrade social trust and civilisational competence, possibly by directly destroying infrastructure and institutions, reducing our ability to coordinate to avoid deploying dangerous AI systems. For example, the small-scale disasters could involve AI advisors misleading politicians and spreading disinformation, AI-enabled surveillance systems catastrophically failing and having to be replaced, autonomous weapons systems malfunctioning - all of these would tend to leave us more vulnerable to an AAFS-like scenario, because the direct damage caused by the small scale disasters outweighs their value as 'warning shots'.

Analogies and General Priors on Intelligence

The 'one big breakthrough' idea is definitely a way that you could have easy marginal intelligence improvements at HLMI, but we didn't call the node 'one big breakthrough/few key insights needed' because that's not the only way it's been characterised. E.g. some people talk about a 'missing gear for intelligence', where some minor change that isn't really a breakthrough (like tweaking a hyperparameter in a model training procedure) produces massive jumps in capability. Like David said, there's a subsequent post where we go through the different ways the jump to HLMI could play out, and One Big Breakthrough (we call it 'few key breakthroughs for intelligence') is just one of them.
