Clarifying some key hypotheses in AI alignment

Ben Cottier; Rohin Shah

We've created a diagram mapping out important and controversial hypotheses for AI alignment. We hope that this will help researchers identify and more productively discuss their disagreements.

Diagram

A part of the diagram. Click through to see the full version.

Caveats

This does not decompose arguments exhaustively. It does not include every reason to favour or disfavour ideas. Rather, it is a set of key hypotheses and relationships with other hypotheses, problems, solutions, models, etc. Some examples of important but apparently uncontroversial premises within the AI safety community: orthogonality, complexity of value, Goodhart's Curse, AI being deployed in a catastrophe-sensitive context.
This is not a comprehensive collection of key hypotheses across the whole space of AI alignment. It focuses on a subspace that we find interesting and is relevant to more recent discussions we have encountered, but where key hypotheses seem relatively less illuminated. This includes rational agency and goal-directedness, CAIS, corrigibility, and the rationale of foundational and practical research. In hindsight, the selection criteria were something like:
1. The idea is closely connected to the problem of artificial systems optimizing adversarially against humans.
2. The idea must be explained sufficiently well that we believe it is plausible.
Arrows in the diagram indicate flows of evidence or soft relations, not absolute logical implications — please read the "interpretation" box in the diagram. Also pay attention to any reasoning written next to a Yes/No/Defer arrow — you may disagree with it, so don't blindly follow the arrow!

Background

Much has been written in the way of arguments for AI risk. Recently there have been some talks and posts that clarify different arguments, point to open questions, and highlight the need for further clarification and analysis. We largely share their assessments and echo their recommendations.

One aspect of the discourse that seems to be lacking clarification and analysis is the reasons to favour one argument over another — in particular, the key hypotheses or cruxes that underlie the different arguments. Understanding this better will make discourse more productive and help people reason about their beliefs.

This work aims to collate and clarify hypotheses that seem key to AI alignment in particular (by "alignment" we mean the problem of getting an AI system to reliably do what an overseer intends, or try to do so, depending on which part of the diagram you are in). We point to which hypotheses, arguments, approaches, and scenarios are favoured and disfavoured by each other. It is neither comprehensive nor sufficiently nuanced to capture everyone's views, but we expect it to reduce confusion and encourage further analysis.

You can digest this post through the diagram or the supplementary information, which have their respective strengths and limitations. We recommend starting with the diagram; then, if you are interested in related reading or our comments about a particular hypothesis, click the link on the box title in the diagram or look it up below.

Supplementary information

The sections here list the hypotheses in the diagram, along with related readings and our more opinion-based comments, for lack of software to neatly embed this information (however, boxes in the diagram do link back to the headings here). Note that the diagram is the best way to understand relationships and high-level meaning, while this section offers more depth and resources for each hypothesis. Phrases in italics with the first letter capitalised refer to a box in the diagram.

Definitions

AGI: a system (not necessarily agentive) that, for almost all economically relevant cognitive tasks, at least matches any human's ability at the task. Here, "agentive AGI" is essentially what people in the AI safety community usually mean when they say AGI. References to before and after AGI are to be interpreted as fuzzy, since this definition is fuzzy.
CAIS: comprehensive AI services. See Reframing Superintelligence.
Goal-directed: describes a type of behaviour, currently not formalised, but characterised by generalisation to novel circumstances and the acquisition of power and resources. See Intuitions about goal-directed behaviour.

Agentive AGI?

Will the first AGI be most effectively modelled like a unitary, unbounded, goal-directed agent?

Related reading: Reframing Superintelligence, Comments on CAIS, Summary and opinions on CAIS, embedded agency sequence, Intuitions about goal-directed behaviour
Comment: This is consistent with some of classical AI theory, and agency continues to be a relevant concept in capability-focused research, e.g. reinforcement learning. However, it has been argued that the way AI systems are taking shape today, and the way humans historically do engineering, are cause to believe superintelligent capabilities will be achieved by different means. Some grant that a CAIS-like scenario is probable, but maintain that there will still be Incentive for agentive AGI. Others argue that the current understanding of agency is problematic (perhaps just for being vague, or specifically in relation to embeddedness), so we should defer on this hypothesis until we better understand what we are talking about. It appears that this is a strong crux for the problem of Incorrigible goal-directed superintelligence and the general aim of (Near) proof-level assurance of alignment, versus other approaches that reject alignment being such a hard, one-false-move kind of problem. However, to advance this debate it does seem important to clarify notions of goal-directedness and agency.

Incentive for agentive AGI?

Are there features of systems built like unitary goal-directed agents that offer a worthwhile advantage over other broadly superintelligent systems?

Related reading: Reframing Superintelligence, Comments on CAIS, Summary and opinions on CAIS, Will humans build goal-directed agents?, AGI will drastically increase economies of scale
Comment: Some basic points argued in favour are that agentive AGI is significantly more efficient, or humans find agents easier to think about, or humans just want to build human-like agents for their own sake. However, even if agentive AGI offers greater efficiency, one could argue it is too risky or difficult to build, so we are better off settling for something like CAIS.

Modularity over integration?

In general and holding resources constant, is a collection of modular AI systems with distinct interfaces more competent than a single integrated AI system?

Related reading: Reframing Superintelligence Ch. 12, 13, AGI will drastically increase economies of scale
Comment: An almost equivalent trade-off here is generality vs. specialisation. Modular systems would benefit from specialisation, but likely bear greater cost in principal-agent problems and sharing information (see this comment thread). One case that might be relevant to think about is human roles in the economy: although humans have a general learning capacity, they have tended towards specialising their competencies as part of the economy, with almost no one being truly self-sufficient. However, this may be explained merely by limited brain size. The recent success of end-to-end learning systems has been cited as evidence in favour of integration, as has the evolutionary precedent of humans (since human minds appear to be more integrated than modular).

Current AI R&D extrapolates to AI services?

AI systems so far generally lack some key qualities that are traditionally attributed to AGI, namely: pursuing cross-domain long-term goals, having broad capabilities, and being persistent and unitary. Does this absence extrapolate into the future, alongside increasing automation of AI R&D and the rise of a broad collection of superintelligent services?

Related reading: Reframing Superintelligence Ch. I

Incidental agentive AGI?

Will systems built like unitary goal-directed agents develop incidentally from something humans or other AI systems build?

Related reading: Subsystem Alignment, Risks from Learned Optimization, Let's talk about "Convergent Rationality"

Convergent rationality?

Given sufficient capacity, does an AI system converge on rational agency and consequentialism to achieve its objective?

Related reading: Let's talk about "Convergent Rationality"
Comment: As far as we know, "convergent rationality" has only been named recently by David Krueger, and while it is not well fleshed out yet, it seems to point at an important and commonly-held assumption. There is some confusion about whether the convergence could be a theoretical property, or is merely a matter of human framing, or merely a matter of Incentive for agentive AGI.

Mesa-optimisation?

Will there be optimisation processes that, in turn, develop considerably powerful optimisers to achieve their objective? A historical example is natural selection optimising for reproductive fitness to make humans. Humans may have good reproductive fitness, but optimise for other things such as pleasure even when this diverges from fitness.

Related reading: Subsystem Alignment, Risks from Learned Optimization

Discontinuity to AGI?

Will there be discontinuous, explosive growth in AI capabilities to reach the first agentive AGI? A discontinuity reduces the opportunity to correct course. Before AGI it seems most likely to result from a qualitative change in the learning curve, due to an algorithmic insight, architectural change or scale-up in resource utilisation.

Related reading: Intelligence Explosion Microeconomics, The Hanson-Yudkowsky AI-Foom Debate, A Contra AI FOOM Reading List, Any rebuttals of Christiano and AI Impacts on takeoff speeds?, A shift in arguments for AI risk
Comment: Discontinuity or fast takeoff is a central assumption of early arguments for AI risk and seems to have the greatest quantity of debate. A large proportion of the community supports it in some form, but this proportion has apparently decreased significantly in the last few years (whether because beliefs have changed or because new people have joined is unclear), with Paul Christiano's and Katja Grace's writing being a key influence. Note that we distinguish discontinuity "to AGI" from discontinuity "from AGI" because of strategic and developmental considerations around the human level. In published works the distinction has not been very clear; we would like to see more discussion about it. Thanks to Ben Garfinkel for pointing out how the distinction can be important.

Recursive self improvement?

Is an AI system that improves through its own AI R&D and self-modification capabilities more likely than distributed AI R&D automation? Recursive improvement would give some form of explosive growth, and so could result in unprecedented gains in intelligence.
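
As a rough illustration of why feedback from capability into the rate of improvement matters, the toy model below compares three growth regimes. It is a sketch under stated assumptions, not a forecast: the update rule dI/dt = rate * I**alpha, the function name simulate_growth, and every parameter value are hypothetical and are not taken from the readings listed here. With alpha = 0 improvement is independent of current capability, with alpha = 1 growth is exponential, and with alpha > 1 the continuous-time equation diverges in finite time, which is one way to make "explosive" precise.

```python
def simulate_growth(alpha, rate=0.1, steps=50, start=1.0, cap=1e12):
    """Euler-step the toy equation dI/dt = rate * I**alpha.

    All names and values are illustrative assumptions. `alpha` controls how
    strongly the current capability level feeds back into further improvement.
    The loop stops early once the level exceeds `cap`.
    """
    level = start
    history = [level]
    for _ in range(steps):
        level += rate * level ** alpha
        history.append(level)
        if level > cap:
            break
    return history


if __name__ == "__main__":
    for alpha in (0.0, 1.0, 2.0):
        trajectory = simulate_growth(alpha)
        print(f"alpha={alpha}: {len(trajectory) - 1} steps, final level {trajectory[-1]:.3g}")
```

Which value of alpha, if any, describes real AI R&D is exactly what this hypothesis disputes; the sketch only shows how sensitive the conclusion is to that assumption.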

Related reading: Intelligence Explosion Microeconomics, Reframing Superintelligence Ch. 1

Discontinuity from AGI?

Will there be discontinuous, explosive growth in AI capabilities after agentive AGI? A discontinuity reduces the opportunity to correct course. After AGI it seems most likely to result from a recursive improvement capability.

Related reading: see Discontinuity to AGI
Comment: see Discontinuity to AGI

ML scales to AGI?

Do contemporary machine learning techniques scale to general human level (and beyond)? The state-of-the-art experimental research aiming towards AGI is characterised by a set of theoretical assumptions, such as reinforcement learning and probabilistic inference. Does this paradigm readily scale to general human-level capabilities without fundamental changes in the assumptions or methods?

Related reading: Prosaic AI alignment, A possible stance for alignment research, Conceptual issues in AI safety: the paradigmatic gap, Discussion on the machine learning approach to AI safety
Comment: One might wonder how much change in assumptions or methods constitutes a paradigm shift, but the more important question is how relevant current ML safety work can be to the most high-stakes problems, and that seems to depend strongly on this hypothesis. Proponents of the ML safety approach admit that much of the work could turn out to be irrelevant, especially with a paradigm shift, but argue that there is nonetheless a worthwhile chance. ML is a fairly broad field, so people taking this approach should think more specifically about what aspects are relevant and scalable. If one proposes to build safe AGI by scaling up contemporary ML techniques, clearly they should believe the hypothesis — but there is also a feedback loop: the more feasible approaches one comes up with, the more evidence there is for the hypothesis. You may opt for Foundational or "deconfusion" research if (1) you don't feel confident enough about this to commit to working on ML, or (2) you think that, whether or not ML scales in terms of capability, we need deep insights about intelligence to get a satisfactory solution to alignment. This implies Alignment is much harder than, or does not overlap much with, capability gain.

Deep insights needed?

Do we need a much deeper understanding of intelligence to build an aligned AI?

Related reading: The Rocket Alignment Problem

Broad basin for corrigibility?

Do corrigible AI systems have a broad basin of attraction to intent alignment? Corrigible AI tries to help an overseer. It acts to improve its model of the overseer's preferences, and is incentivised to make sure any subsystems it creates are aligned — perhaps even more so than itself. In this way, perturbations or errors in alignment tend to be corrected, and it takes a large perturbation to move out of this "basin" of corrigibility.
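
The "basin" metaphor can be made concrete with a one-dimensional toy system. This is only an illustration of what a basin of attraction is, assuming hypothetical dynamics dx/dt = -x + x**3 in which x stands for deviation from intent alignment; nothing here models a real AI system, and the function name settle and its parameters are invented for the example. In this toy, deviations smaller than 1 in magnitude shrink back towards 0, while larger deviations grow without bound.

```python
def settle(offset, step=0.1, iterations=200, escape=10.0):
    """Iterate x <- x + step * (-x + x**3) from an initial deviation `offset`.

    Purely illustrative assumption: x = 0 plays the role of the aligned state.
    Returns the final deviation, or None if |x| exceeds `escape`, meaning the
    perturbation was large enough to leave the basin.
    """
    x = offset
    for _ in range(iterations):
        x += step * (-x + x ** 3)
        if abs(x) > escape:
            return None
    return x


if __name__ == "__main__":
    for offset in (0.5, 0.9, 1.1):
        print(offset, settle(offset))
```

How wide the real basin is, and whether practical systems have one at all, is the open question this hypothesis asks.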

Related reading: Corrigibility, discussion on the need for a grounded definition of preferences (comment thread), Two Neglected Problems in Human-AI Safety (problem 1 poses a challenge for corrigibility)
Comment: This definition of corrigibility is still vague, and although it can be explained to work in a desirable way, it is not clear how practically feasible it is. It seems that proponents of corrigible AI accept that greater theoretical understanding and clarification are needed; how much is a key source of disagreement. At the practical extreme, one would iterate experiments with tight feedback loops to figure it out, and correct errors on the go. This assumes ample opportunity for trial and error, rejecting Discontinuity to/from AGI. At the theoretical extreme, some argue that one would need to develop a new mathematical theory of preferences to be confident enough that this approach will work, or that such a theory would provide the necessary insights to make it work at all. If you find this hypothesis weak, you probably put more weight on threat models based on Goodhart's Curse, e.g. Incorrigible goal-directed superintelligence, and the general aim of (Near) proof-level assurance of alignment.

Inconspicuous failure?

Will a concrete, catastrophic AI failure be overwhelmingly hard to recognise or anticipate? For certain kinds of advanced AI systems (namely the goal-directed type), it seems that short of near proof-level assurances, all safeguards are thwarted by the nearest unblocked strategy. Such AI may also have an incentive to deceive and manipulate, culminating in a treacherous turn. Or, in a machine learning framing, it would be very difficult to make such AI robust to distributional shift.

Related reading: Importance of new mathematical foundations to avoid inconspicuous failure (comment thread)
Comment: This seems to be a key part of many people's models for AI risk, which we associate most with MIRI. We think it significantly depends on whether there is Agentive AGI, and it supports the general aim of (Near) proof-level assurance of alignment. If we can get away from that kind of AI, it is more likely that we can relax our approach and Use feedback loops to correct course as we go.

Creeping failure?

Would gradual gains in the influence of AI allow small problems to accumulate to catastrophe? The gradual aspect affords opportunity to recognise failures and think about solutions. Yet for any given incremental change in the use of AI, the economic incentives could outweigh the problems, such that we become more entangled in, and reliant on, a complex system that can collapse suddenly or drift from our values.

Related reading: What failure looks like, A shift in arguments for AI risk > The alignment problem without a discontinuity > Questions about this argument

Thanks to Stuart Armstrong, Wei Dai, Daniel Dewey, Eric Drexler, Scott Emmons, Ben Garfinkel, Richard Ngo and Cody Wild for helpful feedback on drafts of this work. Ben especially thanks Rohin for his generous feedback and assistance throughout its development.

Ofer (7y)

Meta: I think there's an attempt to deprecate the term "inner optimizer" in favor of "mesa-optimizer" (which I think makes sense when the discussion is not restricted to a subsystem within an optimized system).

Ben Cottier (7y)

Noted and updated.

David Scott Krueger (7y)

Nice chart!

A few questions and comments:

Why the arrow from "agentive AI" to "humans are economically outcompeted"? The explanation makes it sound like it should point to "target loading fails"?
Suggestion: make the blue boxes without parents more apparent? e.g. a different shade of blue? Or all sitting above the other ones? (e.g. "broad basin of corrigibility" could be moved up and left).
