We had some discussions of the AGI ruin arguments within the DeepMind alignment team to clarify for ourselves which of these arguments we are most concerned about and what the implications are for our work. This post summarizes the opinions of a subset of the alignment team on these arguments. Disclaimer: these are our own opinions that do not represent the views of DeepMind as a whole or its broader community of safety researchers.
This doc shows opinions and comments from 8 people on the alignment team (without attribution). For each section of the list, we show a table summarizing agreement / disagreement with the arguments in that section (the tables can be found in this sheet). Each row is sorted from Agree to Disagree, so a column does not correspond to a specific person. We also provide detailed comments and clarifications on each argument from the team members.
For each argument, we include a shorthand description in a few words for ease of reference, and a summary in 1-2 sentences (usually copied from the bolded parts of the original arguments). We apologize for some inevitable misrepresentation of the original arguments in these summaries. Note that some respondents looked at the original arguments while others looked at the summaries when providing their opinions (though everyone has read the original list at some point before providing opinions).
A general problem when evaluating the arguments was that people often agreed with the argument as stated, but disagreed about the severity of its implications for AGI risk. A lot of these ended up as "mostly agree / unclear / mostly disagree" ratings. It would have been better to gather two separate scores (agreement with the statement and agreement with implications for risk).
Most controversial among the team:
Cruxes from the most controversial arguments:
Possible implications for our work:
#1. Human level is nothing special / data efficiency
Summary: AGI will not be upper-bounded by human ability or human learning speed (similarly to AlphaGo). Things much smarter than human would be able to learn from less evidence than humans require.
#2. Unaligned superintelligence could easily take over
Summary: A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure.
#3. Can't iterate on dangerous domains
Summary: At some point there will be a 'first critical try' at operating at a 'dangerous' level of intelligence, and on this 'first critical try', we need to get alignment right.
#4. Can't cooperate to avoid AGI
Summary: The world can't just decide not to build AGI.
#5. Narrow AI is insufficient
Summary: We can't just build a very weak system.
#6. Pivotal act is necessary
Summary: We need to align the performance of some large task, a 'pivotal act' that prevents other people from building an unaligned AGI that destroys the world.
#7. There are no weak pivotal acts because a pivotal act requires power
Summary: It takes a lot of power to do something to the current world that prevents any other AGI from coming into existence; nothing which can do that is passively safe in virtue of its weakness.
#8. Capabilities generalize out of desired scope
Summary: The best and easiest-found-by-optimization algorithms for solving problems we want an AI to solve, readily generalize to problems we'd rather the AI not solve.
#9. A pivotal act is a dangerous regime
Summary: The builders of a safe system would need to operate their system in a regime where it has the capability to kill everybody or make itself even more dangerous, but has been successfully designed to not do that.
#10. Large distributional shift to dangerous domains
Summary: On anything like the standard ML paradigm, you would need to somehow generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions.
#11. Sim to real is hard
Summary: There's no known case where you can entrain a safe level of ability on a safe environment where you can cheaply do millions of runs, and deploy that capability to save the world.
#12. High intelligence is a large shift
Summary: Operating at a highly intelligent level is a drastic shift in distribution from operating at a less intelligent level.
#13. Some problems only occur above an intelligence threshold
Summary: Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability.
#14. Some problems only occur in dangerous domains
Summary: Some problems seem like their natural order of appearance could be that they first appear only in fully dangerous domains.
#15. Capability gains from intelligence are correlated
Summary: Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously.
#16. Inner misalignment
Summary: Outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction.
#17. Can't control inner properties
Summary: On the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they're there, rather than just observable outer ones you can run a loss function over.
#18. No ground truth (no comments)
Summary: There's no reliable Cartesian-sensory ground truth (reliable loss-function-calculator) about whether an output is 'aligned'.
#19. Pointers problem
Summary: There is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment.
#20. Flawed human feedback
Summary: Human raters make systematic errors - regular, compactly describable, predictable errors.
#21. Capabilities go further
Summary: Capabilities generalize further than alignment once capabilities start to generalize far.
#22. No simple alignment core
Summary: There is a simple core of general intelligence but there is no analogous simple core of alignment.
#23. Corrigibility is anti-natural
Summary: Corrigibility is anti-natural to consequentialist reasoning.
#24. Sovereign vs corrigibility
Summary: There are two fundamentally different approaches you can potentially take to alignment [a sovereign optimizing CEV or a corrigible agent], which are unsolvable for two different sets of reasons. Therefore by ambiguating between the two approaches, you can confuse yourself about whether alignment is necessarily difficult.
#25. Real interpretability is out of reach
Summary: We've got no idea what's actually going on inside the giant inscrutable matrices and tensors of floating-point numbers.
#26. Interpretability is insufficient
Summary: Knowing that a medium-strength system of inscrutable matrices is planning to kill us, does not thereby let us build a high-strength system that isn't planning to kill us.
#27. Selecting for undetectability
Summary: Optimizing against an interpreted thought optimizes against interpretability.
#28. Large option space (no comments)
Summary: A powerful AI searches parts of the option space we don't, and we can't foresee all its options.
#29. Real world is an opaque domain
Summary: AGI outputs go through a huge opaque domain before they have their real consequences, so we cannot evaluate consequences based on outputs.
#30. Powerful vs understandable
Summary: No humanly checkable output is powerful enough to save the world.
#31. Hidden deception
Summary: You can't rely on behavioral inspection to determine facts about an AI which that AI might want to deceive you about.
#32. Language is insufficient or unsafe
Summary: Imitating human text can only be powerful enough if it spawns an inner non-imitative intelligence.
#33. Alien concepts
Summary: The AI does not think like you do, it is utterly alien on a staggering scale.
#34. Multipolar collusion
Summary: Humans cannot participate in coordination schemes between superintelligences.
#35. Multi-agent is single-agent
Summary: Any system of sufficiently intelligent agents can probably behave as a single agent, even if you imagine you're playing them against each other.
#36. Human flaws make containment difficult (no comments)
Summary: Only relatively weak AGIs can be contained; the human operators are not secure systems.
#37. Optimism until failure
Summary: People have a default assumption of optimism in the face of uncertainty, until encountering hard evidence of difficulty.
#38. Lack of focus on real safety problems
Summary: The AI safety field is not being productive on the lethal problems. The incentives are for working on things where success is easier.
#39. Can't train people in security mindset
Summary: This ability to "notice lethal difficulties without Eliezer Yudkowsky arguing you into noticing them" currently is an opaque piece of cognitive machinery to me, I do not know how to train it into others.
#40. Can't just hire geniuses to solve alignment
Summary: You cannot just pay $5 million apiece to a bunch of legible geniuses from other fields and expect to get great alignment work out of them.
#41. You have to be able to write this list
Summary: Reading this document cannot make somebody a core alignment researcher, you have to be able to write it.
#42. There's no plan
Summary: Surviving worlds probably have a plan for how to survive by this point.
#43. Unawareness of the risks
Summary: Not enough people have noticed or understood the risks.
This was interesting and I would like to see more AI research organizations conducting + publishing similar surveys.
Thanks! For those interested in conducting similar surveys, here is a version of the spreadsheet you can copy (by request elsewhere in the comments).
My viewpoint is that the most dangerous risks stem from inner alignment issues, and that is basically because of very bad transparency tools, instrumental convergence toward power-seeking and deception, and mesa-optimizers essentially ruining whatever outer alignment you have. If you could figure out a reliable way to detect deceptive models, or to make sure they could never be reached in your training process, that would relieve a lot of my fears of X-risk from AI.
I actually think Eliezer is underrating civilizational competence once AGI is released, via the MNM effect, as happened for Covid; unfortunately, this only buys time before the end. A superhuman intelligence that is deceiving human civilization based on instrumental convergence will essentially win barring pivotal acts, as Eliezer says. The goal of AI safety is to make alignment not dependent on heroic, pivotal actions.
So Andrew Critch's hope of not needing pivotal acts only works if significant portions of the alignment problem are solved or at least ameliorated, which we are not super close to doing. So whether alignment will require pivotal acts depends directly on solving the alignment problem more generally.
Pivotal acts are a worse solution to alignment and shouldn't be thought of as the default solution, but they are a back-pocket solution we shouldn't forget about.
If I had to name a crux between the views of Eliezer Yudkowsky/Rob Bensinger/Nate Soares/MIRI and those of DeepMind's safety team or Andrew Critch, it's whether the alignment problem is foresight-loaded (and thus civilization will be incompetent and safety requires more pivotal acts) or empirically-loaded, where we don't need to see the bullets in advance (and thus civilization could be more competent and pivotal acts matter less). It's an interesting crux to be sure.
PS: Does DeepMind's safety team have real power to disapprove AI projects? Or are they like the Google Ethics team, which had no power to disapprove AI projects without being fired?
We don't have the power to shut down projects, but we can make recommendations and provide input into decisions about projects.
So you can have non-binding recommendations and input, but no actual binding power over the capabilities researchers, right?
Correct. I think that doing internal outreach to build an alignment-aware company culture and building relationships with key decision-makers can go a long way. I don't think it's possible to have complete binding power over capabilities projects anyway, since the people who want to run the project could in principle leave and start their own org.
Request: could you make a version of this (e.g. with all of your responses stripped) that I/anyone can make a copy of?
Here is a spreadsheet you can copy. This one has a column for each person - if you want to sort the rows by agreement, you need to do it manually after people enter their ratings. I think it's possible to automate this but I was too lazy.
Great analysis! I’m curious about the disagreement with needing a pivotal act. Is this disagreement more epistemic or normative? That is to say do you think they assign a very low probability of needing a pivotal act to prevent misaligned AGI? Or do they have concerns about the potential consequences of this mentality? (people competing with each other to create powerful AGI, accidentally creating a misaligned AGI as a result, public opinion, etc.)
I would say the primary disagreement is epistemic - I think most of us would assign a low probability to a pivotal act defined as "a discrete action by a small group of people that flips the gameboard" being necessary. We also disagree on a normative level with the pivotal act framing, e.g. for reasons described in Critch's post on this topic.