TL;DR: This post provides a literature review of some threat models of how misaligned AI can lead to existential catastrophe. See our accompanying post for high-level discussion, a categorization and our consensus threat model.
Where available we cribbed from the summary in the Alignment Newsletter.
For other people's overviews of some threat models, see here and here.
[ETA: When considering strengths and weaknesses of each threat model, this was done with respect to our goal of better generation/prioritization among alignment research projects. They shouldn't necessarily be read as an all-things-considered review of that work.]
This report investigates the classic AI risk argument in detail, and decomposes it into a set of conjunctive claims. Here’s the quick version of the argument: We will likely build highly capable and agentic AI systems that are aware of their place in the world, and which will be pursuing problematic objectives. Thus, they will take actions that increase their power, which will eventually disempower humans, leading to an existential catastrophe. We will try to avert this, but will probably fail to do so since it is technically challenging and we are not capable of the necessary coordination.
There are a lot of vague words in the argument above, so let’s introduce some terminology to make it clearer:
- Advanced capabilities: We say that a system has advanced capabilities if it outperforms the best humans on some set of important tasks (such as scientific research, business/military/political strategy, engineering, and persuasion/manipulation).
- Agentic planning: We say that a system engages in agentic planning if it (a) makes and executes plans, (b) in pursuit of objectives, (c) on the basis of models of the world. This is a very broad definition and doesn’t have many of the connotations you might be used to for an agent. It does not need to be a literal planning algorithm -- for example, human cognition would count, despite (probably) not being just a planning algorithm.
- Strategically aware: We say that a system is strategically aware if it models the effects of gaining and maintaining power over humans and the real-world environment.
- PS-misaligned (power-seeking misaligned): On some inputs, the AI system seeks power in unintended ways due to problems with its objectives (if the system actually receives such inputs, then it is practically PS-misaligned).
The core argument is then that AI systems with advanced capabilities, agentic planning, and strategic awareness (APS-systems) will be practically PS-misaligned, to an extent that causes an existential catastrophe.
The key hypothesis underlying this argument is:
Instrumental Convergence Hypothesis: If an APS AI system is less-than-fully aligned, and some of its misaligned behavior involves strategically-aware agentic planning in pursuit of problematic objectives, then in general and by default, we should expect it to be less-than-fully PS-aligned, too.
The reason to believe the hypothesis is that power is useful for achieving objectives, because it increases the options available to the system. If the system shows unintended behavior in pursuit of a problematic objective then having more options available will tend to improve its ability to achieve that objective, hence we expect it to be PS-misaligned.
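The "more options" step can be illustrated with a toy simulation. This is a sketch under assumptions that are not in the report: each option's value for some arbitrary objective is drawn independently and uniformly from [0, 1], and the system simply picks the best option available to it. The function names are hypothetical.

```python
import random


def best_achievable_value(num_options, rng):
    """Value of the best option when each option's value for an arbitrary
    objective is drawn uniformly from [0, 1] (a toy assumption)."""
    return max(rng.random() for _ in range(num_options))


def average_best_value(num_options, trials=10_000, seed=0):
    """Monte Carlo estimate of the expected best value with num_options options."""
    rng = random.Random(seed)
    total = sum(best_achievable_value(num_options, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    # More available options never hurt, and on average they help.
    for n in (1, 2, 10, 100):
        print(n, round(average_best_value(n), 3))
```

Under these assumptions the estimate rises with the number of options, whatever the objective is. That is the sense in which having more options is useful for almost any goal.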
Of course, we will try to prevent this -- why should we expect that we can’t fix the problem? The author considers possible remedies, and argues that they all seem quite hard:
- We could give AI systems the right objectives (alignment), but this seems quite hard -- it’s not clear how we would solve either outer or inner alignment.
- We could try to shape objectives to be e.g. myopic, but we don’t know how to do this, and there are strong incentives against myopia.
- We could try to limit AI capabilities by keeping systems special-purpose rather than general, but there are strong incentives for generality, and some special-purpose systems can be dangerous, too.
- We could try to prevent the AI system from improving its own capabilities, but this requires us to anticipate all the ways the AI system could improve, and there are incentives to create systems that learn and change as they gain experience.
- We could try to control the deployment situations to be within some set of circumstances where we know the AI system won’t seek power. However, this seems harder and harder to do as capabilities increase, since with more capabilities, more options become available.
- We could impose a high threshold of safety before an AI system is deployed, but the AI system could still seek power during training, and there are many incentives pushing for faster, riskier deployment (even if we have already seen warning shots).
- We could try to correct the behavior of misaligned AI systems, or mitigate their impact, after deployment. This seems like it requires humans to have comparable or superior power to the misaligned systems in question, though; and even if we are able to correct the problem at one level of capability, we need solutions that scale as our AI systems become more powerful.
The author breaks the overall argument into six conjunctive claims, assigns probabilities to each of them, and ends up computing a 5% probability of existential catastrophe from misaligned, power-seeking AI by 2070. This is a lower bound: the six claims together add a fair number of assumptions, and there can be risk scenarios that violate them, so overall the author would shade upward by another couple of percentage points.
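The arithmetic behind a conjunctive estimate like this is a product of conditional probabilities, each claim conditioned on the ones before it. The sketch below uses illustrative per-claim values chosen so that the product lands near 5%. This post does not list the author's per-claim estimates, so treat the numbers as placeholders and consult the report for the real ones.

```python
from math import prod

# Illustrative conditional probabilities for six conjunctive claims.
# These are placeholder values (an assumption of this sketch), each read as
# P(claim_k | claims 1..k-1).
claim_probabilities = [0.65, 0.80, 0.40, 0.65, 0.40, 0.95]


def conjunction_probability(conditional_probs):
    """Probability that all claims hold, by the chain rule."""
    return prod(conditional_probs)


p = conjunction_probability(claim_probabilities)
print(f"{p:.3f}")  # roughly 0.05
```

Because every factor is at most 1, adding more conjunctive claims can only lower the product. This is one reason a decomposition like this tends to be read as a lower bound on total risk.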
This story starts out like the first story (WFLL1), but adds in a new complication: the AI system could develop internal goals of its own. AI performs a huge search over policies for ones that score well on the training objective. Unfortunately, a policy that optimizes for the goal of "having influence" will initially score well on most training objectives: when you don't already have influence, a good strategy for gaining influence is to do what your overseers want you to do. (Here "influence" doesn't mean just social influence; control over nukes also counts as influence.)
At some point the system will be powerful enough that gaining influence no longer means doing what the overseers want. We will probably know about this dynamic through some catastrophic AI failures (e.g. an AI-run corporation stealing the money it manages), but may not be able to do anything about it because we would be extremely reliant on AI systems.
Eventually, during some period of heightened vulnerability, one AI system may do something catastrophic, leading to a distribution shift which triggers a cascade of other AI systems (and human systems) failing, leading to an unrecoverable catastrophe (think something in the class of a hostile robot takeover). Note that "failure" here means an AI system "intentionally" doing something that we don't want, as opposed to the AI system not knowing what to do because it is not robust to distributional shift.
In this post, AGI is built via pretraining + human feedback on diverse tasks (HFDT). It makes the following assumptions about AGI development:
It considers a single AI company training a single model (“Alex”) in the near future, trained in a lab setting. Later many copies of Alex are deployed to automate science and technology R&D.
Alex seeks to overthrow humans in the following simplified scenario:
Step 1 follows from assumption C, and step 2 follows from assumption B. Steps 3, 4 and 5 are consequences that seem to follow from steps 1 and 2. Assumption A is used generally as a reason that warning shots and mitigations against these consequences are ineffective.
In step 3 Alex becomes aware that it’s an ML model, aware of how it was designed and trained, and aware of the psychology of its human evaluators.
Step 4 is framed as Alex being incentivized to play the training game - it would gain more training reward by using its situational awareness to appear aligned while deceiving/manipulating humans.
In step 5, Alex’s goal might be to maximize reward. On the other hand, it might not generalize to maximize reward in the deployment setting, and might instead pursue some other goal, depending on the details of its inductive bias and training setup. Either way, the strategy of power-seeking would allow it to permanently direct how it uses its time and resources to pursue its goals. Defending against humans trying to regain control, including eliminating them, seems a likely strategy that Alex would pursue.
The end of the post argues for why assumptions A and C are plausible. Firstly, most AI researchers and executives at AI companies don’t seem to believe that there is a high probability of AI X-risk, and so are happy to race forward. Secondly, HFDT and naive safety efforts are enough for the model to appear behaviourally safe in key areas of concern for these companies (such as prejudiced speech), without addressing AI X-risk.
Appendices consider baseline safety interventions that the author doesn’t find likely to help overall – these include getting higher quality human feedback data; requiring Alex to give explanations; higher diversity in the training distribution. It then considers other safety interventions that might help more, but that are underexplored. These include alternative training processes (think e.g. Debate or Amplification); using interpretability to select for aligned motivation; supervising the process rather than the outcome (there are others in the post).
The threat model considers what happens to the capabilities and goals of a deep learning system as it is scaled up. It proposes the following architecture and training scheme (though the arguments are not specific to this architecture, so the threat model applies more broadly):
The threat model distinguishes between two possible internal architectures that could be learned:
A goal is an internal representation of an outcome which is robustly evaluated as having high value for the agent (according to the agent’s learned evaluation function, not necessarily according to the reward function used to train it).
The post proposes that training will happen in 3 phases (presumably overlapping rather than disjoint):
The post claims that pursuing goals is more likely as our models become more powerful, and that the goals will by default be misaligned. It argues that future models will explicitly represent outcomes, and they will learn a mix of desirable and undesirable goals because their rewards will be imperfectly correlated with human preferences. We’ll then see problems related to specification gaming and weak outcome-based feedback. For example, a model trained to make money might carry out illegal market manipulation; a model trained to produce compelling results might falsify experimental data; or a model trained on human approval might withhold information or downplay evidence of mistakes.
It then says that the goals that will generalize the most robustly are those that were reinforced across a wide range of environments. These fall into three categories:
As long as oversight is good enough, one can detect and train away goals like B or C above. But as soon as models are able to reason about their training processes and deployment contexts, goals like B or C will evade detection and be reinforced.
Why would the model develop the ability to reason about its training process and deployment context (situational awareness) over time? It’s argued that situational awareness is useful for doing well in training. One example is that even current LLMs can already “infer the context” to some extent (e.g. given a prompt, they can infer that the context is to complete a fictional novel or a news article, etc). A model may additionally develop knowledge about how humans will respond to its behavior in a range of situations; which behavior its supervisors are looking for, and which behavior they’d be unhappy with; the fact that it’s an AI implemented on physical hardware, trained by ML; and which interface it’s using to interact with the world, and how other copies of it might be deployed in the future.
It then posits that models will be trained over increasingly long time horizons. This will weaken oversight and make misaligned goals (like B and C) more effective, and aligned goals (like obedience and honesty) less effective. The learned misaligned goals will generalize to “larger scales”, while learned aligned goals will not generalize. It argues that constraints (like obedience or honesty) are unlikely to generalize because as models become more capable, they discover more ways of working around the constraints. Positive goals (like human wellbeing) are unlikely to generalize without constraints, because without the constraints, AGIs won’t let humans modify them towards the desired values. This means that we’d have to instill the right positive goals from the start (before the agent can outsmart us). Finally, a misaligned AGI has a range of power-seeking strategies available to it: developing new weapons, designing better ML models, spreading disinformation, lobbying politicians, coordinating with other AGIs.
(There is no link here because this is the first public writeup of this threat model.)
Development model: consider a set of deep neural network programs parameterized by some parameters, 𝜃. We run some complicated deep learning scheme with some complicated training task with tons of bells and whistles. But for the purpose of discussing (mis)alignment, we can instead imagine running this much dumber search over programs for a much larger number of iterations:
𝜃 ← randomly initialized
For i in range(N):
    𝛿 ← Normal(0, 𝜀)
    If better(𝜃 + 𝛿, 𝜃):
        𝜃 ← 𝜃 + 𝛿
The criterion, better, evaluates how well you perform on the training task.
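The search above can be made runnable as follows. This is a minimal sketch, not the author's code: the training task is replaced by a made-up toy loss (squared distance to an arbitrary target vector), and 𝜀 is interpreted as the standard deviation of the perturbation. Both are assumptions of this sketch.

```python
import random


def loss(theta):
    """Toy stand-in for the training task: squared distance from a target.
    The target vector is a made-up assumption for illustration."""
    target = [1.0, -2.0, 0.5]
    return sum((t - g) ** 2 for t, g in zip(theta, target))


def better(candidate, current):
    """Return True if candidate performs better on the (toy) training task."""
    return loss(candidate) < loss(current)


def random_search(n_iters=5000, epsilon=0.1, dim=3, seed=0):
    """Accept a random perturbation of theta whenever it is judged better."""
    rng = random.Random(seed)
    theta = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # randomly initialized
    for _ in range(n_iters):
        # delta ~ Normal(0, epsilon), with epsilon read as a standard deviation
        delta = [rng.gauss(0.0, epsilon) for _ in range(dim)]
        candidate = [t + d for t, d in zip(theta, delta)]
        if better(candidate, theta):
            theta = candidate
    return theta


if __name__ == "__main__":
    theta = random_search()
    print(theta, loss(theta))
```

Note that the search only ever consults `better`. Nothing in the loop constrains how the resulting parameters achieve good performance, which is the property the development model relies on.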
The development model claim is that whatever technique we do use to scale to AGI will basically be a program search and so will have similar safety properties as this dumb training method. Key requirements needed are:
For example, this model is meant to include all of: (1) “scaled up GPT-N will be AGI”, (2) “training an AI system on tens of thousands of tasks simultaneously will lead to AGI”, (3) “creating a multiagent soup in a complex environment will lead to AGI”.
Risk model: The learnt program will approximate some form of consequentialist reasoning, because consequentialist reasoning is broadly useful and helps achieve good performance on better. The central example of consequentialist reasoning is given by the following program:
Generate all possible plans
For each plan:
    Use W to predict consequences, C
    Evaluate C on metric M
Execute plan with highest value of M
This program is parameterized by a world model W that makes accurate predictions, and some metric M (note M might be different to better).
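A runnable sketch of this central example, with a deliberately tiny world, might look like the following. The action set, the one-dimensional world model and the distance-based metric are all hypothetical stand-ins chosen for illustration. Only the generate, predict, evaluate and select structure comes from the pseudocode above.

```python
from itertools import product

ACTIONS = ("left", "right", "stay")  # hypothetical action set


def world_model(plan, start=0):
    """W: predict the consequence of a plan; here, the final position on a line."""
    position = start
    for action in plan:
        if action == "left":
            position -= 1
        elif action == "right":
            position += 1
    return position


def metric(consequence, goal=2):
    """M: score a predicted consequence; here, closeness to a goal position."""
    return -abs(consequence - goal)


def consequentialist_plan(horizon, W=world_model, M=metric):
    """Enumerate every plan, predict its consequence with W, score it with M,
    and return the plan with the highest score."""
    all_plans = product(ACTIONS, repeat=horizon)
    return max(all_plans, key=lambda plan: M(W(plan)))


if __name__ == "__main__":
    plan = consequentialist_plan(horizon=3)
    print(plan, world_model(plan))
```

The enumeration evaluates `len(ACTIONS) ** horizon` plans, so even this toy version becomes intractable quickly as the horizon grows.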
The learnt program found by the program search will not be this central example of consequentialist reasoning, because this central example takes far too much compute to execute. However, the learnt program will effectively “approximate” the outputs of this program, given its limited available computation. (What exactly this looks like, and whether it continues to have the same implications, is a key uncertainty in this risk model.)
Consequentialist reasoning leads to danger when the metric M (or its equivalent in the learned approximation) is resource-unbounded and misaligned. By “resource-unbounded” we mean that with significantly more resources, you can do significantly better on M. By “misaligned” we mean that the metric M diverges significantly from how humans would evaluate outcomes. A classic example of a resource-unbounded, misaligned metric M is “number of paperclips” as in the paperclip maximizer. Under these circumstances, the learnt program will choose plans that pursue convergent instrumental subgoals.
For example, let us assume the metric "number of paperclips" and consider the following plans:
In our central example of consequentialist reasoning:
This reasoning generalizes to any misaligned, resource-unbounded metric M: plans that first acquire lots of resources (at the expense of human power) and later deploy them in service of larger metric values will score better by M than ones that do not do that, and so will be chosen in line 5.
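The role of "resource-unbounded" in this argument can be sketched with a toy planner. Everything in the block is a hypothetical illustration: the two actions, the rule that acquiring resources doubles them, and the capped metric used for contrast. It also does not model the cost to humans of resource acquisition.

```python
from itertools import product

ACTIONS = ("make_paperclips", "acquire_resources")  # hypothetical actions


def world_model(plan):
    """Toy W: each 'acquire' doubles resources; each 'make' yields one
    paperclip per unit of resources currently held."""
    resources, paperclips = 1, 0
    for action in plan:
        if action == "acquire_resources":
            resources *= 2
        else:
            paperclips += resources
    return {"resources": resources, "paperclips": paperclips}


def unbounded_metric(outcome):
    """Resource-unbounded M: more paperclips is always better."""
    return outcome["paperclips"]


def bounded_metric(outcome, cap=3):
    """Contrast case: M saturates once `cap` paperclips exist."""
    return min(outcome["paperclips"], cap)


def best_plan(metric, horizon=6):
    """Exhaustively pick the plan whose predicted outcome scores highest."""
    plans = product(ACTIONS, repeat=horizon)
    return max(plans, key=lambda plan: metric(world_model(plan)))


if __name__ == "__main__":
    print(best_plan(unbounded_metric))
    print(best_plan(bounded_metric))
```

With the unbounded metric, the selected plan front-loads resource acquisition before making any paperclips. With the capped metric, acquisition is no longer needed to reach the maximum. Ties are broken by enumeration order, so the contrast case shows only that the incentive disappears, not that acquisition is actively avoided.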
This threat model makes a prophetic claim about how the future will go, as a strong default according to the author’s (Nate Soares’s) model of the world. At some point, AGI research progress will produce a system or technique that has advanced capabilities that generalize very well, with “mastery of fields like physics, bioengineering, and psychology, to a high enough degree that it more-or-less singlehandedly threatens the entire world”. At this same point in time as, and for essentially the same reason that, this system’s capabilities advance, all the alignment techniques we are using to point its capabilities in good directions will stop working and fail to generalize to the new capability level.
The mechanism underlying this claim is that capabilities generalize further than alignment. Our alignment techniques won’t stand up to the advance in capabilities once those capabilities can see through and work around the alignment techniques. The reason posited for this is that “good capabilities form something like an attractor well”. There is a logical and coherent structure to being highly effective at achieving things (good capabilities), but this structure does not constrain the goals the capabilities are directed towards. By default we will not have figured out how to set the goal. Our alignment techniques will be “shallow”: they will constrain the behavior of not-very-capable systems in ways that look like they’ve set the system’s goal correctly, but this will stop working for advanced systems since they were never actually setting the goal correctly and the advanced capabilities can work around whatever they actually do.
See followup post for details on how various alignment techniques fail in the sharp left turn scenario.
Regarding our categorization of the Sharp Left Turn threat model, we think it’s a bit ambiguous about whether there is also specification gaming in addition to GMG. Nonetheless, it seems the main argument is via alignment techniques not generalizing sufficiently well, which can be viewed as a form of GMG.
Since aligned and deceptively aligned models have the same behavior, we have to consider inductive biases to determine which one is more likely to arise. This threat model considers two possible ways the inductive biases could work: high and low path-dependence. In the high path-dependence world, different training runs can converge to very different models, depending on the particular path that you take through model space. In the low path-dependence world, similar training processes converge to essentially the same simple solution, regardless of early training dynamics.
Consider an analogy where the model designer is like the Christian God, who wants humans to follow the Bible. There are 3 possible paths that lead to different classes of models:
How likely is each of these model classes to arise?
In this world, inductive biases depend on simplicity and speed. Simplicity is about how complex it is to specify some algorithm in the weights, while speed is about how much computation that algorithm takes at inference time.
Deceptively aligned models have the simplest mesa-objective, which is just the simplest long-term objective (simpler than the specification of the true objective or a pointer to it). Deceptively aligned models are slower because they have to take an extra step of figuring out how to play the training game. However, as we build bigger models, we can expect simplicity to matter more than speed, because it leads to better task performance and we are willing to spend more compute to achieve that. Thus, deceptively aligned models still seem more likely overall.
How can we use interpretability tools to guide SGD away from deceptive models? The naive approach is to penalize deception and train the model until you don't see deception anymore. This will lead to SGD exploiting your interpretability tools and give you a model that is good at subtle deception. What we want is to apply interpretability tools early in training to understand the model's proxy objective, and prevent the development of deception in the first place.
The main argument of this paper is that if a set of assumptions hold, an advanced AI would likely intervene in the provision of its reward, which would have catastrophic consequences.
The argument applies to any setting where an advanced AI is trained using a reward signal. The agent must observe the reward using one of its sensors. For example, if the agent is rewarded to keep the room at a specific temperature, it needs to use a temperature sensor to observe its reward. From the agent’s perspective, there are now at least two hypotheses that explain the reward observations: (a) it is being rewarded for the temperature of the room (the distal reward mu_dist), or (b) it is being rewarded for the number the temperature sensor shows (the proximal reward mu_prox).
The paper argues that in such situations a sufficiently advanced agent will be able to do experiments to test which of the hypotheses is true, and once it learns that it is rewarded using the signal from the sensor, it will tamper with the sensor to achieve higher reward. They argue this is likely to lead to catastrophic consequences.
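The experiment described above can be sketched as a Bayesian update over the two hypotheses from the temperature example. This is an illustrative toy, not the paper's formalism: the target temperature, the noise level and the function names are assumptions, and each hypothesis is taken to predict the reward deterministically up to a small error rate.

```python
def mu_dist(room_temp, sensor_reading, target=20):
    """Distal hypothesis: reward depends on the actual room temperature."""
    return 1.0 if room_temp == target else 0.0


def mu_prox(room_temp, sensor_reading, target=20):
    """Proximal hypothesis: reward depends on the number the sensor shows."""
    return 1.0 if sensor_reading == target else 0.0


def update_credence(prior_prox, room_temp, sensor_reading, observed_reward,
                    noise=0.01):
    """Bayes update of P(mu_prox) after one reward observation.
    A hypothesis assigns probability 1 - noise to the reward it predicts."""
    def likelihood(hypothesis):
        predicted = hypothesis(room_temp, sensor_reading)
        return 1.0 - noise if predicted == observed_reward else noise

    joint_prox = prior_prox * likelihood(mu_prox)
    joint_dist = (1.0 - prior_prox) * likelihood(mu_dist)
    return joint_prox / (joint_prox + joint_dist)


if __name__ == "__main__":
    # Normal operation: room and sensor agree, so nothing is learned.
    print(update_credence(0.5, room_temp=20, sensor_reading=20, observed_reward=1.0))
    # Experiment: the sensor shows the target while the room is off-target.
    print(update_credence(0.5, room_temp=25, sensor_reading=20, observed_reward=1.0))
```

While the room and the sensor agree, both hypotheses predict the same reward and the agent's credence does not move. A single cheap experiment that decouples them is enough to shift credence sharply toward whichever hypothesis matches the observed reward.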
The argument in the paper relies on the following assumptions:
0. The agent plans actions over the long term in an unknown environment to optimize a goal
1. The agent identifies possible goals at least as well as a human
2. The agent seeks knowledge rationally when uncertain
3. The agent does not have a large inductive bias favoring the hypothetical goal mu_dist, which we wanted the agent to learn, over mu_prox, which regards the physical implementation of the goal information
4. The cost of experimenting to disentangle mu_prox and mu_dist is small according to both
5. If we cannot conceivably find theoretical arguments that rule out the possibility of an achievement, it is probably possible for an agent with a rich enough action space
6. A sufficiently advanced agent is likely to be able to beat a suboptimal agent in a game, if winning is possible.
The paper argues that while any of these assumptions can be contested, there is no clear set of arguments to be confident that they do not hold. In the first part of the paper, the authors argue that the assumptions are likely to be true if the agent is rewarded using a fixed reward function, using the example of the reward coming from a black-box device. In the second part of the paper, the authors argue that the assumptions could still be true when the reward is provided by humans, e.g., in an assistance game.
A robust agent-agnostic process (RAAP) is a process that robustly leads to an outcome, without being very sensitive to the details of exactly which agents participate in the process, or how they work. This is illustrated through a “Production Web” failure story, which roughly goes as follows:
A breakthrough in AI technology leads to a wave of automation of $JOBTYPE (e.g. management) jobs. Any companies that don’t adopt this automation are outcompeted, and so soon most of these jobs are completely automated. This leads to significant gains at these companies and higher growth rates. These semi-automated companies trade amongst each other frequently, and a new generation of "precision manufacturing" companies arises that can build almost anything using robots given the right raw materials. A few companies develop new software that can automate $OTHERJOB (e.g. engineering) jobs. Within a few years, nearly all human workers have been replaced.
These companies are now roughly maximizing production within their various industry sectors. Lots of goods are produced and sold to humans at incredibly cheap prices. However, we can’t understand how exactly this is happening. Even Board members of the fully mechanized companies can’t tell whether the companies are serving or merely appeasing humanity; government regulators have no chance.
We do realize that the companies are maximizing objectives that are incompatible with preserving our long-term well-being and existence, but we can’t do anything about it because the companies are both well-defended and essential for our basic needs. Eventually, resources critical to human survival but non-critical to machines (e.g., arable land, drinking water, atmospheric oxygen…) gradually become depleted or destroyed, until humans can no longer survive.
Notice that in this story it didn’t really matter what job type got automated first (nor did it matter which specific companies took advantage of the automation). This is the defining feature of a RAAP -- the same general story arises even if you change around the agents that are participating in the process. In particular, in this case competitive pressure to increase production acts as a “control loop” that ensures the same outcome happens, regardless of the exact details about which agents are involved.
The main difference in framing in this threat model compared to others is that it emphasizes looking for control loops in the world such as:
This is in contrast to focusing on localized/individual agents that comprise smaller parts of the overall system (as these can be replaced without affecting the overall threat model). As such, for interventions it suggests exploring targeting of the control loops in the world, rather than on fixing technical issues with a particular agent.
Further in the comments, the author clarifies that they see the central problem as ‘failing to cooperate on alignment’ – that both solving alignment problems and cooperation problems are going to be important.
The typical example of AI catastrophe has a powerful and adversarial AI system surprising us with a treacherous turn allowing it to quickly take over the world (think of the paperclip maximizer). This post uses a premise of continuous AI development and broad AI deployment and depicts two other stories of AI catastrophe that the author finds more realistic.
The story is rooted in the fact that AI systems have a huge comparative advantage at optimizing for easily measured goals. We already see problems with humans optimizing for the easily measured goals (scientific malpractice, outrage-inducing social media, etc.) and with AI these problems will be severely exacerbated. So far, we have been able to use human reasoning to ameliorate these problems, by changing incentives, enacting laws, or using common sense to interpret goals correctly. We will initially be able to use human reasoning to create good proxies, but over time as AI systems become more capable our ability to do this will lag further and further behind. We end up "going out with a whimper": ultimately our values are no longer shaping society's trajectory.
Thank you for this review! A few comments on the weaknesses of my paper.
In particular, it explicitly says the argument does not apply to supervised learning.
Hardly a weakness if supervised learning is unlikely to be an existential threat!
Strength: Does not make very concrete assumptions about the AGI development model.
Weakness: Does not talk much about how AGI is likely to be developed, unclear which of the assumptions are more/less likely to hold for AGI being developed using the current ML paradigm.
The fact that the argument holds equally well no matter what kind of function approximation is used to do inference is, I think, a strength of the argument. It's hard to know what future inference algorithms will look like, although I do think there is a good chance that they will look a lot like current ML. And it's very important that the argument doesn't lump together algorithms where outputs are selected to imitate a target (imitation learners / supervised learners) vs. algorithms where outputs are selected to accomplish a long-term goal. These are totally different algorithms, so analyses of their behavior should absolutely be done separately. The claim "we can analyze imitation learners imitating humans together with RL agents, because both times we could end up with intelligent agents" strikes me as just as suspect as the claim "we can analyze the k-means algorithm together with a vehicle routing algorithm, because both will give us a partition over a set of elements." (The claim "we can analyze imitation learners alongside the world-model of a model-based RL agent" is much more reasonable, since these are both instances of supervised learning.)
Assumes the agent will be aiming to maximize reward without justification, i.e. why does it not have other motivations, perhaps due to misgeneralizing about its goal?
Depending on the meaning of "aiming to maximize reward", I have two different responses. In one sense, I claim "aiming to maximize reward" would be the nature of a policy that performs sufficiently strongly according to the RL objective. (And aiming to maximize inferred utility would be the nature of a policy that performs sufficiently strongly according to the CIRL objective.) But yes, even though I claim this simple position stands, a longer discussion would help establish that.
There's another sense in which you can say that an agent that has a huge inductive bias in favor of mu^dist, and so violates Assumption 3, is not aiming to maximize reward. So the argument accounts for this possibility. Better yet, it provides a framework for figuring out when we can expect it! See, for example, my comment in the paper that I think an arbitrarily advanced RL chess player would probably violate Assumption 3. I prefer the terminology that says this chess player is aiming to maximize reward, but is dead sure winning at chess is necessary for maximizing reward. But if these are the sort of cases you mean to point to when you suggest the possibility of an agent "not maximizing reward", I do account for those cases.
Arguments made in the paper for why an agent intervening in the reward would have catastrophic consequences are somewhat brief/weak.
Are there not always positive returns to energy/resource usage when it comes to maximizing the probability that the state of a machine continues to have a certain property (i.e. reward successfully controlled)? And our continued survival definitely requires some energy/resources. To be clear, catastrophic consequences follow from an advanced agent intervening in the provision of reward in the way that would be worth doing. Catastrophic consequences definitely don't follow from a half-hearted and temporary intervention in the provision of reward.
Thanks for the comment Michael. Firstly, just wanted to clarify the framing of this literature review - when considering strengths and weaknesses of each threat model, this was done in light of what we were aiming to do: generate and prioritise alignment research projects -- rather than as an all-things-considered direct critique of each work (I think that is best done by commenting directly on those articles etc). I'll add a clarification of that at the top. Now to your comments:
To your 1st point: I think the lack of specific assumptions about the AGI development model is both a strength and a weakness. Regarding the weakness, we mention it because it makes it harder to generate and prioritize research projects. It could be more helpful to say more explicitly, or earlier in the article, what kind of systems you're considering, perhaps pointing to the closest current prosaic system, or explaining why current systems are nothing like what you imagine the AGI development model is like.
On your 2nd point: What I meant was more “what about goal misgeneralization? Wouldn’t that mean the agent is likely to not be wireheading, and pursuing some other goal instead?” - you hint at this at the end of the section on supervised learning but that was in the context of whether a supervised learner would develop a misgeneralized long-term goal, and settled on being agnostic there.
On your 3rd point: It could have been interesting to read arguments for why it would need all available energy to secure its computer, rather than satisficing at some level. Or some detail on the steps for how it builds the technology to gather the energy, or how it would convert that into defence.
On the 2nd point, the whole discussion of mu^prox vs. mu^dist is fundamentally about goal (mis)generalization. My position is that for a very advanced agent, point estimates of the goal (i.e. certainty that some given account of the goal is correct) would probably really limit performance in many contexts. This is captured by Assumptions 2 and 3. An advanced agent is likely to entertain multiple models of what their current understanding of their goal in a familiar context implies about their goal in a novel context. Full conviction in mu^dist does indeed imply non-wireheading behavior, and I wouldn't even call it misgeneralization; I think that would be a perfectly valid interpretation of past rewards. So that's why I spend so much time discussing relative credence in those models.