We've created a diagram mapping out important and controversial hypotheses for AI alignment. We hope that this will help researchers identify and more productively discuss their disagreements.
A part of the diagram. Click through to see the full version.
Much has been written in the way of arguments for AI risk. Recently there have been some talks and posts that clarify different arguments, point to open questions, and highlight the need for further clarification and analysis. We largely share their assessments and echo their recommendations.
One aspect of the discourse that seems to be lacking clarification and analysis is the reasons to favour one argument over another — in particular, the key hypotheses or cruxes that underlie the different arguments. Understanding this better will make discourse more productive and help people reason about their beliefs.
This work aims to collate and clarify hypotheses that seem key to AI alignment in particular (by "alignment" we mean the problem of getting an AI system to reliably do what an overseer intends, or try to do so, depending on which part of the diagram you are in). We point to which hypotheses, arguments, approaches, and scenarios are favoured and disfavoured by each other. It is neither comprehensive nor sufficiently nuanced to capture everyone's views, but we expect it to reduce confusion and encourage further analysis.
You can digest this post through the diagram or the supplementary information, which have their respective strengths and limitations. However, we recommend starting with the diagram, then if you are interested in related reading or our comments about a particular hypothesis, you can click the link on the box title in the diagram, or look it up below.
The sections here list the hypotheses in the diagram, along with related readings and our more opinion-based comments, for lack of software to neatly embed this information (however, boxes in the diagram do link back to the headings here). Note that the diagram is the best way to understand relationships and high-level meaning, while this offers more depth and resources for each hypothesis. Phrases in italics with the first letter capitalised are referring to a box in the diagram.
Will the first AGI be most effectively modelled like a unitary, unbounded, goal-directed agent?
Are there features of systems built like unitary goal-directed agents that offer a worthwhile advantage over other broadly superintelligent systems?
In general and holding resources constant, is a collection of modular AI systems with distinct interfaces more competent than a single integrated AI system?
AI systems so far generally lack some key qualities that are traditionally supposed of AGI, namely: pursuing cross-domain long-term goals, having broad capabilities, and being persistent and unitary. Does this lacking extrapolate, with increasing automation of AI R&D and the rise of a broad collection of superintelligent services?
Will systems built like unitary goal-directed agents develop incidentally from something humans or other AI systems build?
Given sufficient capacity, does an AI system converge on rational agency and consequentialism to achieve its objective?
Will there be optimisation processes that, in turn, develop considerably powerful optimisers to achieve their objective? A historical example is natural selection optimising for reproductive fitness to make humans. Humans may have good reproductive fitness, but optimise for other things such as pleasure even when this diverges from fitness.
Will there be discontinuous, explosive growth in AI capabilities to reach the first agentive AGI? A discontinuity reduces the opportunity to correct course. Before AGI it seems most likely to result from a qualitative change in learning curve, due to an algorithmic insight, architectural change or scale-up in resource utilisation.
Is an AI system that improves through its own AI R&D and self-modification capabilities more likely than distributed AI R&D automation? Recursive improvement would give some form of explosive growth, and so could result in unprecedented gains in intelligence.
Will there be discontinuous, explosive growth in AI capabilities after agentive AGI? A discontinuity reduces the opportunity to correct course. After AGI it seems most likely to result from a recursive improvement capability.
Do contemporary machine learning techniques scale to general human level (and beyond)? The state-of-the-art experimental research aiming towards AGI is characterised by a set of theoretical assumptions, such as reinforcement learning and probabilistic inference. Does this paradigm readily scale to general human-level capabilities without fundamental changes in the assumptions or methods?
Do we need a much deeper understanding of intelligence to build an aligned AI?
Do corrigible AI systems have a broad basin of attraction to intent alignment? Corrigible AI tries to help an overseer. It acts to improve its model of the overseer's preferences, and is incentivised to make sure any subsystems it creates are aligned — perhaps even more so than itself. In this way, perturbations or errors in alignment tend to be corrected, and it takes a large perturbation to move out of this "basin" of corrigibility.
Will a concrete, catastrophic AI failure be overwhelmingly hard to recognise or anticipate? For certain kinds of advanced AI systems (namely the goal-directed type), it seems that short of near proof-level assurances, all safeguards are thwarted by the nearest unblocked strategy. Such AI may also be incentivised for deception and manipulation towards a treacherous turn. Or, in a machine learning framing, it would be very difficult to make such AI robust to distributional shift.
Would gradual gains in the influence of AI allow small problems to accumulate to catastrophe? The gradual aspect affords opportunity to recognise failures and think about solutions. Yet for any given incremental change in the use of AI, the economic incentives could outweigh the problems, such that we become more entangled in, and reliant on, a complex system that can collapse suddenly or drift from our values.
Thanks to Stuart Armstrong, Wei Dai, Daniel Dewey, Eric Drexler, Scott Emmons, Ben Garfinkel, Richard Ngo and Cody Wild for helpful feedback on drafts of this work. Ben especially thanks Rohin for his generous feedback and assistance throughout its development.
Meta: I think there's an attempt to deprecate the term "inner optimizer" in favor of "mesa-optimizer" (which I think makes sense when the discussion is not restricted to a subsystem within an optimized system).
Noted and updated.
A few questions and comments:
What software did you use to produce this diagram?