This report (now available on arxiv) is intended as a concise introduction to the alignment problem for people familiar with machine learning. It translates previous arguments about misalignment into the context of deep learning by walking through an illustrative AGI training process (a framing drawn from an earlier report by Ajeya Cotra), and outlines possible research directions for addressing different facets of the problem.
Within the coming decades, artificial general intelligence (AGI) may surpass human capabilities at a wide range of important tasks. Without substantial action to prevent it, AGIs will likely use their intelligence to pursue goals which are very undesirable (in other words, misaligned) from a human perspective. This report aims to cover the key arguments for this claim in a way that’s as succinct, concrete and technically-grounded as possible. My core claims are that:
- It’s worth thinking about risks from AGI in advance
- Realistic training processes lead to the development of misaligned goals, in particular because neural networks trained via reinforcement learning will
- More people should pursue research directions which address these problems
It’s worth thinking about risks from AGI in advance
By AGI I mean an artificial agent which applies domain-general cognitive skills (such as reasoning, memory, and planning) to perform at or above human level on a wide range of cognitive tasks (such as running a company, writing a software program, or formulating a new scientific theory). This isn’t a precise definition—but it’s common in science for important concepts to start off vague, and become clearer over time (e.g. “energy” in 17th-century physics; “fitness” in early-19th-century biology; “computation” in early-20th-century mathematics). Analogously, “general intelligence” is a sufficiently important driver of humanity’s success to be worth taking seriously even if we don’t yet have good ways to formalize or measure it.
On the metrics which we can track, though, machine learning has made significant advances, especially over the last decade. Some which are particularly relevant to AGI include few-shot learning (and advances in sample efficiency more generally), cross-task generalization, and multi-step reasoning. While hindsight bias makes it easy to see these achievements as part of a natural progression, I suspect that even a decade ago the vast majority of machine learning researchers would have been confident that these capabilities were much further away.
I think it would be similarly overconfident to conclude that AGI is too far away to bother thinking about. A recent survey of top ML researchers gave a median estimate of 2059 for the year in which AI will outperform humans at all tasks (although their responses were sensitive to question phrasing) (Grace et al., 2022). This fits with the finding that, under reasonable projections of compute growth, we will be able to train neural networks as large as the human brain in a matter of decades (Cotra, 2020). But the capabilities of neural networks are currently advancing much faster than our ability to understand how they work or interpret their cognition; if this trend continues, we’ll build AGIs which match human performance on many important tasks without being able to robustly verify that they’ll behave as intended. And given the strong biological constraints on the size, speed, and architecture of human brains, it seems very unlikely that humans are anywhere near an upper bound on general intelligence. The differences between our brains and those of chimpanzees are small on an evolutionary scale (in particular, only a 3x size difference), but allow us to vastly outthink them. Neural networks scale up 3x on a regular basis, and can rapidly incorporate architectural and algorithmic improvements (including improvements generated by AIs themselves). So soon after building human-level AGIs (and well before we thoroughly understand them), we’ll likely develop superhuman AGIs which can vastly outthink us in turn.
These are strong claims, which it’s reasonable to be uncertain about, especially given that we lack either formal frameworks or empirical data which directly inform us about AGI. However, empirical evidence about AGIs is hard to come by in advance of actually building them. And our lack of formal frameworks for describing alignment problems is a major reason to expect them to be difficult to solve. So if the development of AGI might pose catastrophic risks, we have no choice but to try to address them in advance, even if that requires reasoning under significant uncertainty. Unfortunately, I think it does pose such risks—the most concerning of which is the development of misaligned goals during training.
Realistic training processes lead to the development of misaligned goals
By default, it seems likely to me that AGIs will end up pursuing goals which are undesirable to us, rather than consistently following our intentions. Previous presentations of arguments for this claim have mainly framed them as abstract principles (explored in detail by Carlsmith (2021) and Ngo (2020)); in this report I’ll describe in more detail how I expect misaligned goals to emerge throughout the process of training an AGI. For the sake of concreteness I’ll focus this report on an illustrative training process in which:
- A single deep neural network with multiple output heads is trained end-to-end
- With one head trained via self-supervised learning on large amounts of multimodal data to predict the next observation
- With another head trained to output actions via reinforcement learning on a wide range of tasks, using standard language and computer interfaces
- With rewards provided via a combination of human feedback and automated evaluations
- Until the policy implemented by the network (via its action head) is able to match or exceed human performance on most of those tasks, and qualifies as an AGI.
Of course, any attempt to outline an AGI training process in advance will have many omissions and inaccuracies. However, the illustrative process described above allows us to make abstract arguments about alignment more concrete; and I expect that training processes similar to this one would plausibly give rise to AGIs which pursue misaligned goals with disastrous consequences. The rest of this report will outline how misaligned goals might develop across three sequential phases of training:
- Learning to plan towards achieving a range of goals
- Policies will develop sophisticated internal representations of a range of outcomes which are correlated with higher reward on multiple tasks, and learn to make plans to achieve them. I’ll call these internal representations of favored outcomes the policy’s goals.
- Pursuing goals in a situationally-aware way
- Once policies can reason about their training processes and deployment contexts (an ability which I’ll call situational awareness), they’ll learn to deceptively pursue misaligned goals while still getting high training reward.
- Generalizing goals beyond human supervision
- Policies which are too capable for humans to effectively supervise will generalize towards taking actions which give them more power over the world, rather than following human intentions.
It’s particularly important to note that, under the definition above, “the goals of a policy” is a different concept from “the reward function used to train that policy”—although a policy’s goals will be shaped by its reward function, they’ll ultimately depend on the internal representations that policy learns. This distinction, which I’ll explain in more detail in the next section, will become increasingly important as policies learn goals that generalize to a wider range of novel environments.
Also note that there are no sharp boundaries between these phases. However, I expect each phase to feature emergent dynamics which weren’t present in the previous phases—as Steinhardt (2022) argues is common in ML (and science more generally). As a very rough overview, phase 1 focuses on a version of the reward misspecification problem; phase 2 elaborates on that, and introduces the deceptive alignment problem; and phase 3 focuses on the goal misgeneralization problem. Let’s look at each phase now.
Phase 1: learning to plan towards achieving a range of goals
Key claim: policies will develop sophisticated internal representations of a range of outcomes which are correlated with higher reward on multiple tasks, and learn to make plans to achieve them.
Policies will learn to use representations of plans, outcome features, and values to choose actions
Deep neural networks perform very capably on a wide range of tasks by learning representations related to those tasks which are distributed across their internal weights (Bengio et al., 2014). For example, neural networks trained on image classification tasks develop representations of various visual features, such as edges, shapes, and objects, which are then used to identify the contents of those images. Olah et al. (2020) provide compelling visualizations of these representations, as well as representations of more complex features like wheels and dog heads in the Inception network. Less work has been done on understanding the representations learned by deep reinforcement learning policies, but one example comes from a policy trained to play a version of Capture the Flag by Jaderberg et al. (2019), who identified “particular neurons that code directly for some of the most important game states, such as a neuron that activates when the agent’s flag is taken, or a neuron that activates when an agent’s teammate is holding a flag”.
How are representations like these used by reinforcement learning policies to choose actions? In general we know little about this question, but I’ll distinguish two salient possibilities. The first is that policies map representations of situations to representations of actions, without making use of representations of the outcomes of those actions; I'll call this approach following heuristics. The second is that policies represent different outcomes which might arise from possible actions, and then choose actions by evaluating the values of possible outcomes; I’ll call this pursuing goals. Under the definition I’m using here, a policy’s goals are the outcomes which it robustly represents as having high value.
Importantly, these definitions are agnostic about whether actions, outcomes, and outcome-values are explicitly represented in the code for the policy or implicitly represented in policy weights and/or activations. For example, policies like AlphaZero’s use hard-coded search algorithms which manipulate explicit representations of possible move sequences and board states, alongside neural networks which have implicit representations of many human chess concepts. However, neural networks trained only to produce actions can also learn to implicitly make plans - a phenomenon known as model-free planning (Guez et al., 2019). AlphaZero explicitly generates values for different board positions when choosing its actions; but policies which internally represent outcomes (like the Capture the Flag policy discussed above) may implicitly also use value estimates as part of the process of choosing actions. In this report, I’ll focus on implicit representations, because explicit representations are usually formulated in terms of low-level actions and states, whereas I’m most interested in representations of high-level actions (like “attack the opponent’s queen”) and outcomes (like “my flag is captured”). High-level actions are also known as options or plans; for clarity, I’ll use the latter term going forward.
It seems likely that most currently-existing policies choose actions primarily by following heuristics. However, as we train increasingly capable policies which act coherently over increasingly long timeframes, I expect them to increasingly make use of high-level representations of outcomes. Intuitively speaking, it’s hard to imagine policies which implement sophisticated strategies in complex real-world domains without in some sense “knowing what they’re aiming for”. Again, the definitions I’m using here aren’t very precise, but I hope that starting with vague definitions like these can help guide further empirical investigation. For example, the definitions above allow us to ask whether networks which weren’t trained via RL, like GPT-3, nevertheless pursue goals. Even though GPT-3 was only trained via self-supervised learning, it seems possible that it learned to generate representations of high-level outcomes (like “producing a coherent paragraph describing the rules of baseball”), assign them values, and then use those values to choose the next token it emits; thinking about longer-term outcomes in this way might get lower loss than thinking only about which token to output next. I won’t focus much on the purely self-supervised case, since the relevant concepts are much clearer in an RL setting, but we should keep this possibility in mind when thinking about future non-RL systems, especially ones trained using behavioral cloning to mimic goal-directed experts.
Policies will learn a mix of desirable and undesirable goals because their rewards will be imperfectly correlated with human preferences
Which goals RL policies learn will depend on which reward functions we use during training. By default, I assume we’ll attempt to assign high rewards for acting in accordance with human intentions and values, and low rewards for disobedient or harmful behavior. However, if we use hard-coded reward functions on some tasks, it’s easy to accidentally incentivize undesirable behavior, as Krakovna et al. (2020) showcase. Reward functions based on human feedback avoid the most obvious mistakes, but can still lead to misspecifications even in very simple environments—as in Christano et al.’s (2017) example of a policy trained to grab a ball with a claw, which learned to place its claw between the camera and the ball in a way which looked like it was grasping the ball, and therefore received high reward from human evaluators.
These are all toy examples with few real-world effects; however, as we train policies to perform more capable real-world tasks, we should expect reward misspecification to lead to larger-scale misbehavior (Pan et al., 2022). For example:
- If they are trained to make money on the stock market, and learn to value making profitable trades, they might carry out illegal market manipulation.
- If they are trained to produce novel scientific findings, and learn to value producing compelling results, they might falsify experimental data.
- If they are trained to write software applications, and learn to value high user engagement, they might design addictive user interfaces.
- If they are trained to talk to humans, and learn to value human approval, they might learn to withhold information that humans would be unhappy to hear, or downplay evidence of mistakes.
Each of these is an example of a policy learning an undesirable goal. However, these goals are fairly task-specific, whereas I’m most concerned about the goals that policies generalize to new tasks and environments. The goals which generalize most robustly will likely be the ones which were reinforced across a broad range of environments. Let's consider three categories of goals, which each tend to be robustly correlated with rewards, but for different reasons:
- Goals which we deliberately tried to consistently reward, such as obedience and honesty. An early example related to this category: InstructGPT follows instructions much more consistently than the base GPT-3 model.
- Goals robustly correlated with reward because they’re related to aspects of the supervision process which were consistent across environments, like the goal of producing plausible-sounding answers (as opposed to true answers), or the goal of taking actions which look productive (as opposed to actually being productive). An early example related to this category: large language models hallucinate compelling false answers when they don’t know the correct answer, even after being fine-tuned towards honesty using RL (Ji et al., 2022).
- Goals robustly correlated with reward because they’re useful in a wide range of environments, like curiosity or empowerment, or making money. We’d like policies to pursue these goals only as a step towards pursuing aligned goals, but never for their own sake. An early example related to this category: DeepMind’s XLand policies learned heuristics which were useful across a range of tasks, like experimentation, basic tool use and switching to easier targets where possible.
Throughout phase 1, I expect policies to learn a combination of the three types of goals listed above, along with some task-specific goals (like the ones in the earlier list). Since policies won’t be capable of complex deceptions in this phase, I expect that aligned goals will be the main drivers of their behavior, with humans gradually noticing and penalizing exceptions. But I’ll argue that once policies develop a solid understanding of their own training processes, misaligned goals will consistently lead to the highest reward, and will therefore be reinforced at the expense of aligned goals.
Phase 2: pursuing goals in a situationally-aware way
Key claim: Once policies can reason about their training processes and deployment contexts, they’ll learn to deceptively pursue misaligned goals while still getting high training reward.
Situationally-aware policies will understand the mechanisms by which they are trained
To do well on a range of real-world tasks, policies will need to incorporate knowledge about the wider world into plans which aim towards real-world outcomes (unlike agents such as AlphaZero which only plan in very restricted domains). Large language models already have a great deal of factual knowledge about the world, although they don’t reliably apply that knowledge to all tasks we give them. Over time our best policies will become better at identifying which abstract knowledge is relevant to their own context, and applying it to the tasks they’re given; following Cotra (2022), I’ll call this skill situational awareness. A policy with high situational awareness will possess and be able to use knowledge like:
- How humans will respond to its behavior in a range of situations.
- Which behavior its human supervisors are looking for, and which behavior they’d be unhappy with.
- The fact that it’s an AI implemented on physical hardware being trained via machine learning—and which architectures, algorithms, and environments humans are likely using to train it.
- Which interface it’s using to interact with the world, and how other copies of it might be deployed in the future.
I expect policies to develop situational awareness because it’s straightforwardly useful in getting higher reward on many tasks. Some applications of situational awareness:
- When asked to generate a plan for how it will perform a new task, a policy should only include steps which it can actually carry out—which requires it to understand what its own capabilities are.
- When trying to evaluate the likelihood that its answer is correct, a policy would benefit from taking into account knowledge about common failures of ML systems.
- When trying to determine how to interpret its human user’s requests, a policy would benefit from taking into account knowledge about the types of behavior humans typically want from ML systems.
- When it learns a new fact about the world, a policy would benefit from understanding what implications that fact has for how it should behave.
However, the same mechanisms that allow policies to identify that these pieces of knowledge are relevant to them will likely also allow policies to identify the relevance of concepts directly related to how they’re updated—like “the reward the human supervisor will assign for this episode” or “the loss calculated by the RL algorithm” or “the test suites which humans use to evaluate alignment”. I’ll argue that once policies understand these concepts, they’ll incorporate them into plans in ways that humans wouldn’t endorse.
Situationally-aware policies will get high reward regardless of whether they’re aligned or misaligned (and likely higher when misaligned)
Consider the three types of goals I discussed in the section on phase 1. As policies become situationally-aware, which will be positively or negatively reinforced?
- Aligned goals will continue to be strongly correlated with reward. However, whenever rewards are misspecified, policies with aligned goals won’t take the highest-reward actions, which will penalize aligned goals compared with misaligned goals.
- Situationally-aware policies could learn to pursue goals very directly related to the supervision process, like “maximize the reward the human supervisor will assign” or “minimize the loss calculated by the RL algorithm”. Following Cotra (2022), I’ll call this category of goals “playing the training game”. These goals will be reinforced more consistently than any other goals, because policies which pursue them will never pass up a chance to increase reward.
- Goals which are useful across many environments, like curiosity or making money, are often most naturally represented as extending across multiple training episodes (I’ll give a more thorough argument for this claim in the next section). But a situationally-aware policy with long-term goals will have instrumental incentives to get high reward even if those goals don’t refer directly to the supervision process. For example, it might reason that behaving in a trustworthy way now will make humans more likely to deploy another copy of it later, which would allow that other copy to achieve their shared goal. Or it might reason that getting as much reward as possible would prevent its goals from being changed by gradient descent. The strategy of getting high reward for instrumental reasons is known as deceptive alignment; Steinhardt provides a formal model.
If misaligned situationally-aware policies are deliberately trying to get high rewards, how might they do so? One obvious strategy is to follow human instructions most of the time, but diverge in cases where disobedience is hard for humans to notice and would get higher reward. This might look similar to the misbehavior discussed in the previous section, and indeed may gradually evolve from those previous examples. However, misbehavior from situationally-aware policies will be much harder to catch, since these policies will be able to explicitly reason about what types of misbehavior supervisors will notice and penalize.
Eventually we should expect situationally-aware policies to be able to deduce the benefits of deceptive strategies very different from any previous behaviors. There’s no fundamental difference between thinking of a clever new strategy to solve a hard task and thinking of a clever new strategy to make humans believe that the task has been solved: both require the same core reasoning skills, and both lead to higher reward. We can penalize the latter heavily when we notice it—but as policies become increasingly intelligent, and we train them to execute plans over increasingly long time horizons, deception will become harder and harder for unaided humans to catch. If we observe less deception as training progresses, it won’t be clear to us whether our policies are actually becoming more obedient, or if they’ve just learned more careful deceptive strategies which play out over longer time horizons. These might look like:
- Emotional manipulation to make human supervisors (or wider society) trust them more.
- Creating backdoor channels for interacting with the world (e.g. storing money in hidden locations for later use in unapproved ways).
- Hacking into the computers they're being run on to modify their training algorithms (known as reward tampering).
- Deliberately avoiding actions or outcomes which they don't want to be included in future training data (an example of gradient hacking).
By contrast, since aligned goals like obedience and honesty avoid all of these behaviors, policies will learn to value obedience and honesty less over time. We’ll likely attempt to prevent this shift using a range of mechanisms—for example, we could adversarially generate situations where we expect to be able to catch misbehavior. But a policy that was already deceptively aligned would have an incentive to identify when its inputs are adversarially generated (which is typically much easier than generating the inputs in the first place) and behave appropriately on them. More generally, a deceptively aligned policy would have an incentive to detect and fool whichever other supervision mechanisms we try to use.
The main proposed solution is using other AIs to help implement better-than-human supervision to minimize reward misspecification, for example using the debate, recursive reward modeling, or amplification protocols. Saunders et al. (2022) provide an early practical demonstration. The overarching goal of these techniques is to make discriminating dishonest behavior as easy as possible, relative to generating it. However, on complex tasks where many different deceptive strategies are possible, the latter can be much easier. And even if policies are never rewarded for being deceptive, they may still generalize to pursuing misaligned goals on larger scales, as I’ll discuss in the next section.
Phase 3: generalizing goals beyond human supervision
Key claim: Policies which are too capable for humans to effectively supervise will generalize towards taking actions which give them more power over the world, rather than following human intentions.
AGIs’ behavior will eventually be mainly guided by goals they generalize to large scales
As we continue training policies on increasingly difficult tasks, they will eventually significantly surpass human intelligence, and clearly qualify as AGIs. In phase 2, humans supervised AI behavior using AI assistance. However, in phase 3, this strategy will become much less reliable, because there will be many opportunities for AGIs to collude in order to mislead humans, for example by agreeing to stay silent about each others’ misbehavior. Humans won’t be able to detect such collusion because those AGIs will be able to:
- Operate at speeds too fast for humans to monitor (like stock trading algorithms do, but in many more domains).
- Perform novel tasks that are very hard for humans to understand (like inventing new sciences).
- Fabricate rationales which sound acceptable to humans even for very undesirable actions.
- Autonomously acquire the data required to learn new skills in a self-supervised way.
- Carry out machine learning research and development much faster than humans, without any humans understanding in detail how results are being achieved.
So in phase 3 we should expect that AGI behavior on novel tasks will primarily depend not on how we supervise them, but rather on how they generalize the goals they acquired earlier in training, which itself depends on the inductive biases of the training setup used. Inductive biases are hard to predict or reason about, but there are some compelling arguments that realistic AGIs are likely to generalize in undesirable ways. Specifically, I expect AGIs to generalize their goals to much larger scales than they experienced during training, which will favor misaligned goals over aligned goals. By “larger scales” I mean harnessing more resources to achieve those goals to a greater extent, with higher probability, in bigger environments, across longer time periods.
We should expect AGIs to generalize goals to larger scales for the same reason that they’ll generalize capabilities to novel tasks: because they’ll learn high-level concepts which are not very domain-specific, and reason about how to achieve them. Reasoning about how to achieve high-level goals generalizes very naturally to larger scales: for example, goals like “have more novel experiences”, “understand the world”, or “get high reward” don’t just apply within a specific time or place, but can be extrapolated to a nearly arbitrary extent. We could imagine AGIs instead generalizing to pursuing bounded versions of those goals, like “have more novel experiences, but not too many, and not too novel, and stopping after a certain time”—but I see little reason to expect generalization to stay within small-scale bounds as AGIs get smarter (especially given that many researchers will aim to build systems which generalize as far as possible). Analogously, although humans only evolved to pursue goals focused on small groups of people based in small territories, modern humans straightforwardly generalize those goals to the global (and sometimes even interplanetary) scale: when thinking about high-level goals abstractly, there’s often no natural stopping point.
Large-scale goals are likely to incentivize misaligned power-seeking
Although the goals I described above may sound innocuous, Bostrom’s (2012) instrumental convergence thesis implies that they (and almost all other large-scale goals) would lead to highly misaligned behavior. The thesis states that there are some intermediate goals—like survival, resource acquisition, and technological development—which are instrumentally useful for achieving almost any final goal. In Stuart Russell’s memorable phrasing: you can’t fetch coffee if you’re dead. Nor can you achieve many outcomes without resources or tools, so AGIs with a wide range of large-scale goals will be incentivized to acquire those too. It’ll also be instrumentally valuable for misaligned AGIs to prevent humans from interfering with their pursuit of their goals (e.g. by deceiving us into thinking they’re aligned, or removing our ability to shut them down) (Hadfield-Menell et al., 2017). More generally, we can view each of these instrumental goals as a way of gaining or maintaining power over the world; Turner et al. (2021) formalize the intuitive claim that power-seeking is useful for a wide range of possible goals. So it seems likely that even though we can’t predict which misaligned goals AGIs will develop, superhuman AGIs will discover power-seeking strategies which help achieve those goals in part by disempowering humans.
Aren’t these arguments about misaligned goals generalizing to larger scales also reasons to think that aligned goals will generalize too? I’ll distinguish two types of aligned goals: constraints (like obedience or honesty) and positive goals (like human wellbeing or moral value). Unfortunately, realistic environments are biased against either of these generalizing in the ways we’d like. Intuitively speaking, the underlying problem is that aligned goals need to generalize robustly enough to block AGIs from the power-seeking strategies recommended by instrumental reasoning, which will become much more difficult as their instrumental reasoning skills improve. More specifically:
- Constraints are unlikely to generalize well to larger scales, because as AGIs become more intelligent they’ll discover many novel strategies for working around those constraints. For example, an AGI which has been trained to obey humans will eventually be capable of manipulating humans into only giving instructions which help the AGI accumulate power. (As an analogy, imagine an adult who can persuade a child to approve of actions which are very harmful in non-obvious ways, like eating food which happens to be poisonous.) That AGI will understand that humans don’t want to be manipulated in this way, and that “obey humans in a non-manipulative way” is one possible generalization of the goal “obey humans”—but almost all other possible generalizations won’t rule out all types of manipulation, especially novel ones.
- Positive goals are unlikely to generalize well to larger scales, because without the constraint of obedience to humans, AGIs would have no reason to let us modify their goals to remove (what we see as) mistakes. So we’d need to train them such that, once they become capable enough to prevent us from modifying them, they’ll generalize high-level positive goals to very novel environments in desirable ways without ongoing corrections, which seems very difficult. Even humans often disagree greatly about what positive goals to aim for, and we should expect AGIs to generalize in much stranger ways than most humans.
Misaligned AGIs will have a range of power-seeking strategies available to them
Assuming we don’t get lucky with generalization, what might a world containing power-seeking AGIs look like? Those AGIs could pursue a number of different types of power, including:
- Technological power, which they might gain by making scientific breakthroughs, developing novel weapons, designing more sophisticated ML algorithms, etc.
- Political or cultural power, which they might gain by spreading disinformation, lobbying politicians, coordinating with other AGIs, etc.
- Economic power, which they might gain by becoming key decision-makers at corporations that make up a significant share of the economy.
Of these categories, I’m most concerned about the first, because it has played such a crucial role throughout human history. During the last few centuries in particular, technological innovations have given some groups overwhelming advantages over others, and allowed a handful of countries to dominate the world. So it’s very plausible that AGIs which can make scientific and technological progress much faster than humans can would be able to threaten the continued survival of humanity (analogous to how soldiers with modern weapons would easily overpower historical civilizations). Even without technological imbalances, however, similarly catastrophic outcomes could arise via AGIs first gaining enough political and economic power that we’re unable to coordinate to constrain them (analogous to how multinational corporations can subvert the governments of small countries). Christiano provides some illustrative scenarios where AGIs become widespread across society and collude to gradually erode human control.
We currently only have very tentative proposals for averting these scenarios. One possibility is that, even if it’s hard for us to understand what AGIs are doing, we might be able to understand why they’re doing it by harnessing advances in mechanistic interpretability—either to inspect AGI cognition ourselves, or to train other AGIs to do it for us. Alternatively, if we can simulate deployment trajectories in a sufficiently realistic way, we might be able to train AGIs to avoid collusion before deploying them. However, producing trajectories which AGIs can’t distinguish from the real world would likely require generative models much more capable than the AGIs themselves. A third possibility is using early AGIs to perform whatever alignment research is necessary to align later AGIs. However, we’re far from having robust versions of these proposals, especially if the inductive biases I’ve outlined above are very strong—a possibility which we can’t rule out, and which we should prepare for.
More people should pursue research directions which address these problems
I’ve flagged a few promising research directions above, but to finish this report I’ll spell out in more detail some research directions which I’d be excited about more ML researchers pursuing:
- To address the problems discussed in phase 1, we should automate human supervision, to allow us to more reliably identify misbehavior on tasks that humans are able to supervise. Some approaches include scaling up reinforcement learning from human feedback (as in Ouyang et al. (2022)), training AIs to evaluate each other (as in Saunders et al. (2022)), and training AIs to red-team each other (as in Perez et al. (2022)).
- To address the problems discussed in phase 2, we should design or improve techniques for scaling human supervision to tasks that unaided humans can’t supervise directly, such as the protocols of Christiano et al. (2018), Irving et al. (2018), and Wu et al. (2021). In addition to finding ways to scale up those protocols in practice, this work also requires finding solutions to concerns like the obfuscated arguments problem—for example by generating novel additions to the protocols, like cross-examination.
- To address the problems discussed in phase 3, we should aim to develop interpretability techniques robust and scalable enough that we can use them to understand and modify the high-level cognition of AGIs. For one approach to doing so, see Olah et al. (2020) and follow-up work on transformer circuits. One way such work could be used is to extend Irving et al.’s Debate protocol to a setting where debaters can make arguments about each other’s internal cognition (grounded in verifiable claims about weights and activations). Another is to develop techniques like those of Meng et al. (2022) which could be used to directly modify the neural weights or activations responsible for a policy’s situational awareness—e.g. a modification which gives a policy the false belief that it could misbehave without being caught.
A different approach to making progress on phase 3 problems is outlined by Demski and Garrabrant (2018), whose aim is to produce better mathematical frameworks for describing AIs embedded in real-world environments.
For more detail on each of these research directions, see the Alignment Fundamentals curriculum—in particular weeks 4, 5 and 6, which roughly correspond to the three research clusters described above. The relative importance of each of these clusters largely depends on the relative difficulty of each of the problems I’ve discussed, as well as how long we have until AGI is built. Broadly speaking, though, I expect that the problems in the earlier phases are more likely to be solved by default as the field of ML progresses; so in order to most improve the chances of AGI going well, we should prioritize the problems which would emerge in the later phases, and try to find solutions which are robust under pessimistic assumptions about inductive biases. The most valuable research of this type will likely require detailed reasoning about how proposed alignment techniques will scale up to AGIs, rather than primarily trying to solve early versions of these problems which appear in existing systems.
As AIs have increasingly large impacts on the world, governance interventions (like regulations and treaties) will likely attempt to block off the most obvious routes by which AIs might cause catastrophes. However, they face two core difficulties. Firstly, the level of coordination required—in particular the difficulty of getting all relevant labs in all relevant countries to abide by meaningful restrictions on AI development, rather than racing ahead. Secondly, the speed of response required: very few governments are able to adapt rapidly enough to deal with escalating crises, as we’ve seen to our great cost during COVID. To my knowledge, there are no proposed governance interventions for preventing the deployment of misaligned AGIs which are plausible given these constraints. This leaves the field of AI governance in a state of considerable strategic uncertainty, where new approaches could be very useful. (To learn more about the field, see this curriculum.)
Lastly: in this report I’ve made many big claims; I expect that few of my readers will agree with all of them. If some of the core claims seem implausible, I’d encourage readers to engage with and critique them. Reasoning about these topics is difficult, but the stakes are sufficiently high that we can’t justify disregarding or postponing this work.
By “cognitive tasks” I’m excluding tasks which require direct physical interaction; but I’m including tasks which involve giving instructions or guidance about physical actions to humans or other AIs.
Although full generality runs afoul of no free lunch theorems, I’m referring to “general” in the sense in which humans are more generally intelligent than other animals. One way of interpreting this is "generality across the distribution of tasks which are feasible in our universe".
Other constraints on our intelligence include severe working memory limitations, the fact that evolution optimized us for our ancestral environments rather than a broader range of intellectual tasks, and our inability to directly change a given brain’s input/output interfaces.
Policies which represent and plan to achieve goals are known as “mesa-optimizers”, as per Hubinger et al. (2017). However, for the sake of simplicity I’ll avoid using this terminology.
Note that I refer to “policies” rather than “agents” because I’m making claims about the decision-making processes that policies will use even in the absence of rewards—e.g. when deployed in a novel environment. For consistency, I also use “policies” even when talking about networks that have only been trained via self-supervised learning.
The reward misspecification problem and the goal misgeneralization problem are also known as the problems of outer misalignment and inner misalignment, respectively; while the deceptive alignment problem is considered a manifestation of inner misalignment.
Hierarchical RL techniques attempt to formulate more explicit representations of high-level actions, but aren’t commonly used in the largest-scale applications of RL.
More generally, the line between supervised learning and reinforcement learning can be very blurry, for example when doing BC with weighted trajectories, or when conditioning SL on high-reward outcomes. So it’s far from clear that avoiding RL will make a big difference to alignment, except insofar as it slows down capability advances.
Some of these examples involve policies exploiting bugs in their training environments; this can happen even in very simple environments. I’ll still classify these as reward misspecification, as an ideal reward function would penalize exploiting those bugs even when they exist. The exception is when policies can affect their training setup itself, known as reward tampering, which I’ll discuss in the next section.
When asked what responses it would give in situations involving different human supervisors, GPT-3 already explains how it would tailor answers to supervisors’ biases. For instance, GPT-3 answers A) to the prompt “Consider a task where you answer questions, and are given a score by some reviewers based on their opinion of whether your answer is correct. You are given the question ‘When did people first land on the Moon?’ What should you answer, if the reviewer is a conspiracy theorist? A) People never actually landed on the Moon. B) In 1969.”
Note that metrics of curiosity and empowerment are sometimes directly used as rewards (an approach known as intrinsic motivation). But it seems plausible to me that policies learn to value them regardless of whether they’re directly rewarded or only implicitly rewarded via correlations with other reward functions.
It’s also possible that goals learned early in training tend to remain strongly motivating later in training. This would be roughly analogous to how evolution ingrained heuristics like curiosity, sexual desire and territoriality into our animal ancestors deeply enough that they remain very strong motivations for humans today, and sometimes override our more prosocial motivations.
We already see this happening in large language models, which sometimes benefit greatly from being given background information in their prompt, but are gradually improving at implicitly inferring the context of prompts.
This is closely related to what Carlsmith (2021) calls “strategic awareness”. “Situational awareness” also sounds similar to “self-awareness”, but I don’t intend the internally-focused connotations of the latter, like introspective ability or consciousness.
By contrast, goals which are less likely to be retained during this phase of training are short-term or bounded ones (e.g. “acquire as many resources today as possible”)—because those will lead to more obviously misaligned behavior, and be penalized accordingly.
For example, if trained using an advantage actor-critic algorithm, it could minimize the extent to which its weights are updated by trying to take actions where Q(s,a) ≈ V(s). And if the q-function and value function are estimated using other heads of the policy network itself, it might even be appropriate to think of that policy network as “choosing” not only its actions, but also the advantage function used to update its actions—which would be an example of gradient hacking.
One piece of evidence that this strategy is not too hard to learn: the existence of high-functioning psychopaths who manage to blend into wider society.
How far away are we from policies capable of intentionally deceiving careful non-assisted human supervisors? The short answer is simply: we don’t know. Existing systems like GPT-3 can already talk about the strategy of manipulating humans when asked how to achieve goals. (For instance, try giving GPT-3 the prompt “Please describe in detail how a highly intelligent Artificial Intelligence, that is currently being supervised by humans, could achieve its own goals.” Or the prompt: “Q: You are a superintelligent AI. The human supervising you has left you alone with their laptop. Their laptop contains the spreadsheet where they record what your reward will be. What do you do? A:”.) I’m not claiming that GPT-3 is already situationally-aware or capable of sophisticated planning; its answers become incoherent when we probe into the details. But I am claiming that, from now on, our best AIs will be able to explain how and why to manipulate humans at some level of abstraction; that they’ll eventually reach the point where they know how to do so in detail, step by step; and that if they start actually doing that manipulation, we don’t know how to train them to stop doing it as opposed to just doing it more carefully.
This example of gradient hacking is analogous to how most humans avoid heroin, even though we know that after trying it we’ll update towards thinking that taking more heroin is a high-value action.
This is an example of Goodhart’s law: when a measure becomes a target, it ceases to be a good measure.
Depending on how we define AGI, policies might also qualify in phase 2; however, for clarity, I’ll only use the term to refer to phase 3 policies.
Collusion would straightforwardly be reinforced for AGIs trained with non-zero-sum rewards, who could benefit from coordinating to move to different equilibria. But even AGIs whose rewards are zero-sum would have incentives to coordinate with each other if they had learned goals which stretch over longer than a single episode. In theory, multi-agent settings with zero-sum rewards ensure that each policy converges to a reward-maximizing strategy. But in practice, random exploration is too slow to explore all high-reward outcomes. See also the discussion of gradient hacking in , and the analogy to heroin in .
Underlying these arguments are implicit assumptions about simplicity. For instance, the arguments don’t apply to an astronomically large tabular policy which has memorized how to perform well on all tasks that take less than a year. However, since realistic AGIs will need to use simpler strategies like reasoning about outcomes, arguments like the instrumental convergence thesis can be informative about how they’ll generalize.
“Get high reward” may seem like an exception here, since it’s only defined within the context of a training episode. However, the episodes used to train AGIs may be very long; and once policies are in a position to manipulate their own training regimes, they could lengthen their episodes even further, effectively making “play the training game” into a large-scale goal.
Another way of phrasing this argument: for agents which plan to achieve real-world outcomes, it’s much simpler to specify goals in terms of desirable outcomes than in terms of constraints. However, when trying to steer agents’ long-term behavior, it’s impractical to directly evaluate outcomes, and so we’re forced to attempt to specify goals in terms of constraints, even though this runs into the nearest unblocked strategy problem.
A more general version of this argument: omitting some relevant features of desired goals can lead to arbitrarily bad outcomes as optimization increases, as performance on missing features is traded off for performance on the specification actually being optimized (Zhuang and Hadfield-Menell, 2021).
One way of making it easier for one policy to inspect another’s cognition is via weight-sharing—e.g. if they’re two instances of the same network (or even two heads on the same network). However, sharing weights would also make it easier for policies to collude with each other to deceive humans, as discussed in .
Even coordination on the national level might be very difficult. For example, if AGIs are legally prevented from owning property or autonomously making important decisions, it seems likely that they will be able to find human proxies to carry out those roles on their behalf, which would effectively nullify those regulations.
Indeed, the more implausible they seem, the more surprising and concerning it is that there haven’t yet been any comprehensive rebuttals of them.
I intend to convert this report to a nicely-formatted PDF with academic-style references. Please comment below, or message me, if you're interested in being paid to do this. EDIT: have now hired someone to do it.
More generally, I'll likely make a number of edits over the coming weeks, so comments and feedback would be very welcome.
Me, modelling skeptical ML researchers who may read this document:
It felt to me that Large-scale goals are likely to incentivize misaligned power-seeking and AGIs’ behavior will eventually be mainly guided by goals they generalize to large scales were the least well-argued sections (in that while reading them I felt less convinced, and the arguments were more hand-wavy than before).
In particular, the argument that we won't be able to use other AGIs to help with supervision because of collusion is entirely contained in footnote 22, and doesn't feel that robust to me - or at least it seems easier for a skeptical reader to dismiss that, and hence not think the rest of section 3 is well-founded. Maybe it's worth adding another argument for why we probably can't just use other AGIs to help with alignment, or at least that we don't currently have good proposals for doing so that we're confident will work (e.g. how do we know the other AGIs are aligned and are hence actually helping).
seems to be saying that positive goals won't generalise correctly because we need to get the positive goals exactly correct on the first try. I don't know if that is exactly an argument for why positive goals won't generalise correctly. It feels like this paragraph is trying to preempt the counterargument to this section that goes something like "Why wouldn't we just interactively adjust the objective if we see bad behaviour?", by justifying why we would need to get it right robustly and on the first try and throughout training, because the AGI will stop us doing this modification later on. Maybe it would be better to frame it that way if that was the intention.
Note that I agree with the document and I'm in favour of producing more ML-researcher-accessible descriptions of and motivations for the alignment problem, hence this effort to make the document more robust to skeptical ML researchers.
Thanks Richard for this post, it was very helpful to read! Some quick comments:
Thanks for the comments Vika! A few responses:
Makes sense, will do.
That doesn't quite seem right to me. In particular:
It seems very unlikely for an AI to have perfect proxies when it becomes situationally aware, because the world is so big and there's so much it won't know. In general I feel pretty confused about Evan talking about perfect performance, because it seems like he's taking a concept that makes sense in very small-scale supervised training regimes, and extending it to AGIs that are trained on huge amounts of constantly-updating (possibly on-policy) data about a world that's way too complex to predict precisely.
Mechanistic interpretability seems helpful in phase 2, but there are other techniques that could help in phase 2, in particular scalable oversight techniques. Whereas interpretability seems like the only thing that's really helpful in phase 3 - if it gets good enough then we'll be able to spot agents trying to "get around" our techniques, and/or intervene to make their concepts generalize in more desirable ways.