Disentangling arguments for the importance of AI safety

Richard_Ngo

Note: my views have shifted significantly since writing this post. I now consider items 1, 2, 3, and 6.2 to be different facets of one core argument, which I call the "second species" argument, and which I explore in depth in this report. And I don't really think of 4 as an AI safety problem any more.

I recently attended the 2019 Beneficial AGI conference organised by the Future of Life Institute. I’ll publish a more complete write-up later, but I was particularly struck by how varied attendees' reasons for considering AI safety important were. Before this, I’d observed a few different lines of thought, but interpreted them as different facets of the same idea. Now, though, I’ve identified at least 6 distinct serious arguments for why AI safety is a priority. By distinct I mean that you can believe any one of them without believing any of the others - although of course the particular categorisation I use is rather subjective, and there’s a significant amount of overlap. In this post I give a brief overview of my own interpretation of each argument (note that I don’t necessarily endorse them myself). They are listed roughly from most specific and actionable to most general. I finish with some thoughts on what to make of this unexpected proliferation of arguments. Primarily, I think it increases the importance of clarifying and debating the core ideas in AI safety.

Maximisers are dangerous. Superintelligent AGI will behave as if it’s maximising the expectation of some utility function, since doing otherwise can be shown to be irrational. Yet we can’t write down a utility function which precisely describes human values, and optimising very hard for any other function will lead to that AI rapidly seizing control (as a convergent instrumental subgoal) and building a future which contains very little of what we value (because of Goodhart’s law and the complexity and fragility of values). We won’t have a chance to notice and correct misalignment because an AI which has exceeded human level will increase its intelligence very quickly (either by recursive self-improvement or by scaling up its hardware), and then prevent us from modifying it or shutting it down.

This was the main thesis advanced by Yudkowsky and Bostrom when founding the field of AI safety. Here I’ve tried to convey the original line of argument, although some parts of it have been strongly critiqued since then. In particular, Drexler and Shah have disputed the relevance of expected utility maximisation (the latter suggesting the concept of goal-directedness as a replacement), while Hanson and Christiano disagree that AI intelligence will increase in a very fast and discontinuous way.
Most of the arguments in this post originate from or build on this one in some way. This is particularly true of the next two arguments - nevertheless, I think that there’s enough of a shift in focus in each to warrant separate listings.

The target loading problem. Even if we knew exactly what we wanted a superintelligent agent to do, we don’t currently know (even in theory) how to make an agent which actually tries to do that. In other words, if we were to create a superintelligent AGI before solving this problem, the goals we would ascribe to that AGI (by taking the intentional stance towards it) would not be the ones we had intended to give it. As a motivating example, evolution selected humans for their genetic fitness, yet humans have goals which are very different from just spreading their genes. In a machine learning context, while we can specify a finite number of data points and their rewards, neural networks may then extrapolate from these rewards in non-humanlike ways.

This is a more general version of the “inner optimiser problem”, and I think it captures the main thrust of the latter while avoiding the difficulties of defining what actually counts as an “optimiser”. I’m grateful to Nate Soares for explaining the distinction, and arguing for the importance of this problem.

The prosaic alignment problem. It is plausible that we build “prosaic AGI”, which replicates human behaviour without requiring breakthroughs in our understanding of intelligence. Shortly after they reach human level (or possibly even before), such AIs will become the world’s dominant economic actors. They will quickly come to control the most important corporations, earn most of the money, and wield enough political influence that we will be unable to coordinate to place limits on their use. Due to economic pressures, corporations or nations who slow down AI development and deployment in order to focus on aligning their AI more closely with their values will be outcompeted. As AIs exceed human-level intelligence, their decisions will become too complex for humans to understand or provide feedback on (unless we develop new techniques for doing so), and eventually we will no longer be able to correct the divergences between their values and ours. Thus the majority of the resources in the far future will be controlled by AIs which don’t prioritise human values. This argument was explained in this blog post by Paul Christiano.

More generally, aligning multiple agents with multiple humans is much harder than aligning one agent with one human, because value differences might lead to competition and conflict even between agents that are each fully aligned with some humans. (As my own speculation, it’s also possible that having multiple agents would increase the difficulty of single-agent alignment - e.g. the question “what would humans want if I didn’t manipulate them” would no longer track our values if we would counterfactually be manipulated by a different agent).

The human safety problem. This line of argument (which Wei Dai has recently highlighted) claims that no human is “safe” in the sense that giving them absolute power would produce good futures for humanity in the long term, and therefore that building AI which extrapolates and implements the values of even a very altruistic human is insufficient. A prosaic version of this argument emphasises the corrupting effect of power, and the fact that morality is deeply intertwined with social signalling - however, I think there’s a stronger and more subtle version. In everyday life it makes sense to model humans as mostly rational agents pursuing their goals and values. However, this abstraction breaks down badly in more extreme cases (e.g. addictive superstimuli, unusual moral predicaments), implying that human values are somewhat incoherent. One such extreme case is running my brain for a billion years, after which it seems very likely that my values will have shifted or distorted radically, in a way that my original self wouldn’t endorse. Yet if we want a good future, this is the process which we require to go well: a human (or a succession of humans) needs to maintain broadly acceptable and coherent values for astronomically long time periods.

An obvious response is that we shouldn’t entrust the future to one human, but rather to some group of humans following a set of decision-making procedures. However, I don’t think any currently-known institution is actually much safer than individuals over the sort of timeframes we’re talking about. Presumably a committee of several individuals would have lower variance than just one, but as that committee grows you start running into well-known problems with democracy. And while democracy isn’t a bad system, it seems unlikely to be robust on the timeframe of millennia or longer. (Alex Zhu has made the interesting argument that the problem of an individual maintaining coherent values is roughly isomorphic to the problem of a civilisation doing so, since both are complex systems composed of individual “modules” which often want different things.)
While AGI amplifies the human safety problem, it may also help solve it if we can use it to decrease the value drift that would otherwise occur. Also, while it’s possible that we need to solve this problem in conjunction with other AI safety problems, it might be postponable until after we’ve achieved civilisational stability.
Note that I use “broadly acceptable values” rather than “our own values”, because it’s very unclear to me which types or extent of value evolution we should be okay with. Nevertheless, there are some values which we definitely find unacceptable (e.g. having a very narrow moral circle, or wanting your enemies to suffer as much as possible) and I’m not confident that we’ll avoid drifting into them by default.

Misuse and vulnerabilities. These might be catastrophic even if AGI always carries out our intentions to the best of its ability:

AI which is superhuman at science and engineering R&D will be able to invent very destructive weapons much faster than humans can. Humans may well be irrational or malicious enough to use such weapons even when doing so would lead to our extinction, especially if they’re invented before we improve our global coordination mechanisms. It’s also possible that we invent some technology which destroys us unexpectedly, either through unluckiness or carelessness. For more on the dangers from technological progress in general, see Bostrom’s paper on the vulnerable world hypothesis.
AI could be used to disrupt political structures, for example via unprecedentedly effective psychological manipulation. In an extreme case, it could be used to establish very stable totalitarianism, with automated surveillance and enforcement mechanisms ensuring an unshakeable monopoly on power for leaders.
AI could be used for large-scale projects (e.g. climate engineering to prevent global warming, or managing the colonisation of the galaxy) without sufficient oversight or verification of robustness. Software or hardware bugs might then induce the AI to make unintentional yet catastrophic mistakes.
People could use AIs to hack critical infrastructure (include the other AIs which manage aforementioned large-scale projects). In addition to exploiting standard security vulnerabilities, hackers might induce mistakes using adversarial examples or ‘data poisoning’.

Argument from large impacts. Even if we’re very uncertain about what AGI development and deployment will look like, it seems likely that AGI will have a very large impact on the world in general, and that further investigation into how to direct that impact could prove very valuable.

Weak version: development of AGI will be at least as big an economic jump as the industrial revolution, and therefore affect the trajectory of the long-term future. See Ben Garfinkel’s talk at EA Global London 2018. Ben noted that to consider work on AI safety important, we also need to believe the additional claim that there are feasible ways to positively influence the long-term effects of AI development - something which may not have been true for the industrial revolution. (Personally my guess is that since AI development will happen more quickly than the industrial revolution, power will be more concentrated during the transition period, and so influencing its long-term effects will be more tractable.)
Strong version: development of AGI will make humans the second most intelligent species on the planet. Given that it was our intelligence which allowed us to control the world to the large extent that we do, we should expect that entities which are much more intelligent than us will end up controlling our future, unless there are reliable and feasible ways to prevent it. So far we have not discovered any.

What should we think about the fact that there are so many arguments for the same conclusion? As a general rule, the more arguments support a statement, the more likely it is to be true. However, I’m inclined to believe that quality matters much more than quantity - it’s easy to make up weak arguments, but you only need one strong one to outweigh all of them. And this proliferation of arguments is (weak) evidence against their quality: if the conclusions of a field remain the same but the reasons given for holding those conclusions change, that’s a warning sign for motivated cognition (especially when those beliefs are considered socially important). This problem is exacerbated by a lack of clarity about which assumptions and conclusions are shared between arguments, and which aren’t.

On the other hand, superintelligent AGI is a very complicated topic, and so perhaps it’s natural that there are many different lines of thought. One way to put this in perspective (which I credit to Beth Barnes) is to think about the arguments which might have been given for worrying about nuclear weapons, before they had been developed. Off the top of my head, there are at least four:

They might be used deliberately.
They might be set off accidentally.
They might cause a nuclear chain reaction much larger than anticipated.
They might destabilise politics, either domestically or internationally.

And there are probably more which would have been credible at the time, but which seem silly now due to hindsight bias. So if there’d been an active anti-nuclear movement in the 30’s or early 40’s, the motivations of its members might well have been as disparate as those of AI safety advocates today. Yet the overall concern would have been (and still is) totally valid and reasonable.

I think the main takeaway from this post is that the AI safety community as a whole is still confused about the very problem we are facing. The only way to dissolve this tangle is to have more communication and clarification of the fundamental ideas in AI safety, particularly in the form of writing which is made widely available. And while it would be great to have AI safety researchers explaining their perspectives more often, I think there is still a lot of explicatory work which can be done regardless of technical background. In addition to analysis of the arguments discussed in this post, I think it would be particularly useful to see more descriptions of deployment scenarios and corresponding threat models. It would also be valuable for research agendas to highlight which problem they are addressing, and the assumptions they require to succeed.

This post has benefited greatly from feedback from Rohin Shah, Alex Zhu, Beth Barnes, Adam Marblestone, Toby Ord, and the DeepMind safety team. All opinions are my own.

Here's another argument that I've been pushing since the early days (apparently not very successfully since it didn't make it to this list :) which might be called "argument from philosophical difficulty". It appears that achieving a good long term future requires getting a lot of philosophical questions right that are hard for us to answer. Given this, initially I thought there are only three ways for AI to go right in this regard (assuming everything else goes well with the AI):

We solve all the important philosophical problems ahead of time and program the solutions into the AI.
We solve metaphilosophy (i.e., understand philosophical reasoning as well as we understand mathematical reasoning) and program that into the AI so it can solve philosophical problems on its own.
We program the AI to learn philosophical reasoning from humans or use human simulations to solve philosophical problems.

Since then people have come up with a couple more scenarios (which did make me slightly more optimistic about this problem):

We all coordinate to stop technological progress some time after AI but before space colonization, and have a period of long reflection where humans, maybe with help from AIs, spend thousands or millions of years to solve philosophical problems.
We program AIs to be corrigible to their users, some users care about getting philosophy correct so the AIs help keep them safe and get their "fair share" of the universe until philosophical problems are solved eventually, enough users care about this so that we end up with a mostly good future, and lack of philosophical knowledge doesn't cause disaster in the meantime. (My writings on "human safety problems" were in part a response to this suggestion, outlining how hard it would be to keep humans "safe" in this scenario.)

The overall argument is that, given human safety problems, realistic competitive pressures, difficulties with coordination, etc., it seems hard to end up in any of these scenarios and not have something go wrong along the way. Maybe another way to put this is, given philosophical difficulties, the target we'd have to hit with AI is even smaller than it might otherwise appear.

This [Maximisers are dangerous] was the main thesis advanced by Yudkowsky and Bostrom when founding the field of AI safety. [...] And this proliferation of arguments is evidence against their quality: if your conclusions remain the same but your reasons for holding those conclusions change, that’s a warning sign for motivated cognition (especially when those beliefs are considered important in your social group).

I think many of the other arguments did appear in early discussions of AI safety, but perhaps later didn't get written up clearly or get emphasized as much as "maximisers are dangerous". I'd cite CEV as an AI safety idea that clearly took "human safety problems" strongly into consideration, and even before that, Yudkowsky wrote about the SysOp Scenario which would essentially replace physics with a different set of rules that would (in part) eliminate the potential vulnerabilities of actual physics. The early focus on creating a Singleton wasn't just due to thinking that local intelligence explosion is highly likely but also because for reasons like ones in "prosaic alignment problem", people (including me) thought a competitive multi-polar scenario might lead to unavoidably bad outcomes.

So I don't think "your conclusions remain the same but your reasons for holding those conclusions change" is fair if it was meant to apply to Yudkowsky and Bostrom and others who have been involved in AI safety from the early days.

(I still think it's great that you're doing this work of untangling and explicating the different threads of argument for the importance of AI safety, but this part seems a bit unfair or at least could be interpreted that way.)

Apologies if this felt like it was targeted specifically at you and other early AI safety advocates, I have nothing but the greatest respect for your work. I'll rewrite to clarify my intended meaning, which is more an attempt to evaluate the field as a whole. This is obviously a very vaguely-defined task, but let me take a stab at fleshing out some changes over the past decade:

1. There's now much more concern about argument 2, the target loading problem (as well as inner optimisers, insofar as they're distinct).

2. There's now less focus on recursive self-improvement as a key reason why AI will be dangerous, and more focus on what happens when hardware scales up. Relatedly, I think a greater percentage of safety researchers believe that there'll be a slow takeoff than used to be the case.

3. Argument 3 (prosaic AI alignment) is now considered more important and more tractable.

4. There's now been significant criticism of coherence arguments as a reason to believe that AGI will pursue long-term goals in an insatiable maximising fashion.

I may be wrong about these shifts - I'm speaking as a newcomer to the field who has a very limited perspective on how it's evolved over time. If so, I'd be happy to be corrected. If they have in fact occurred, here are some possible (non-exclusive) reasons why:

A. None of the proponents of the original arguments have changed their minds about the importance of those arguments, but new people came into the field because of those arguments, then disagreed with them and formulated new perspectives.

B. Some of the proponents of the original arguments have changed their minds significantly.

C. The proponents of the original arguments were misinterpreted, or overemphasised some of their beliefs at the expense of others, and actually these shifts are just a change in emphasis.

I think none of these options reflect badly on anyone involved (getting everything exactly right the first time is an absurdly high standard), but I think A and B would be weak evidence against the importance of AI safety (assuming you've already conditioned on the size of the field, etc). I also think that it's great when individual people change their minds about things, and definitely don't want to criticise that. But if the field as a whole does so (whatever that means), the dynamics of such a shift are worth examination.

I don't have strong beliefs about the relative importance of A, B and C, although I would be rather surprised if any one of them were primarily responsible for all the shifts I mentioned above.

I think none of these options reflect badly on anyone involved (getting everything exactly right the first time is an absurdly high standard), but I think A and B would be weak evidence against the importance of AI safety (assuming you’ve already conditioned on the size of the field, etc).

That depend on how much A and B. Even if a field was actually important, it would have some nonzero amount of A and B, so A and B would constitute (even weak) evidence only if it was more than what you'd expect conditional on the field being important. I think the changes you described in the parent comment are real changes and are not entirely due to C, but they're not more than the changes I'd expect to see conditional on AI safety being actually important. Do you have a different sense?

I don't think it depends on how much A and B, because the "expected amount" is not a special point. In this context, the update that I made personally was "There are more shifts than I thought there were, therefore there's probably more of A and B than I thought there was, therefore I should weakly update against AI safety being important." Maybe (to make A and B more concrete) there being more shifts than I thought downgrades my opinion of the original arguments from "absolutely incredible" to "very very good", which slightly downgrades my confidence that AI safety is important.

As a separate issue, conditional on the field being very important, I might expect the original arguments to be very very good, or I might expect them to be very good, or something else. But I don't see how that expectation can prevent a change from "absolutely exceptional" to "very very good" from downgrading my confidence.

C. The proponents of the original arguments were misinterpreted, or overemphasised some of their beliefs at the expense of others, and actually these shifts are just a change in emphasis.

My interpretation of what happened here is that more narrow AI successes made it more convincing that one could reach ASI by building all of the components of it directly, rather than necessitating building an AI that can do most of the hard work for you. If it only takes 5 cognitive modules to take over the world instead of 500, then one no longer needs to posit an extra mechanism by which a buildable system is able to reach the ability to take over the world. And so from my perspective it's mostly a shift in emphasis, with small amounts of A and B as well.

Promoted to curated: I think this classification is good and useful, both to refer to in conversation and to help people navigate the broader alignment space. And I think the post is presented in a clear and relatively concise way.

I do think there would have been value in connecting it more to past writing about similar topics, though I recognize that this might have easily doubled the effort of writing this post.

Thanks! I agree that more connection to past writings is always good, and I'm happy to update it appropriately - although, upon thinking about it, there's nothing which really comes to mind as an obvious omission (except perhaps citing sections of Superintelligence?) Of course I'm pretty biased, since I already put in the things which I thought were most important - so I'd be glad to hear any additional suggestions you have.

One place that comes to mind that had a bunch of related writing is Arbital.

I was also thinking about linking to a bunch of related taxonomies. The "Disjunctive AI Risk" paper comes to mind. I will think about other examples.

This is great, but 4 and 5 seem to be aspects of the same problem to me (i.e., that humans aren't safe agents) and I'm not sure how you're proposing to draw the line between them. For example

It’s also possible that we invent some technology which destroys us unexpectedly, either through unluckiness or carelessness.

If this was caused entirely by an AI pursuing an otherwise beneficial goal, it would certainly count as a failure of AI safety (and is currently studied under "safe exploration") so it seems to make sense to call the analogous human problem "human safety". Similarly coordination between AIs is considered a safety problem and studied under decision theory and game theory for AIs.

Can you explain a bit more the difference you see between 4 and 5?

To me the difference is that when I read 5 I'm thinking about people being careless or malevolent, in an everyday sense of those terms, whereas when I read 4 I'm thinking about how maybe there's no such thing as a human who's not careless or malevolent, if given enough power and presented with a weird enough situation.

I endorse ESRogs' answer. If the world were a singleton under the control of a few particularly benevolent and wise humans, with an AGI that obeys the intention of practical commands (in a somewhat naive way, say, so it'd be unable to help them figure out ethics) then I think argument 5 would no longer apply, but argument 4 would. Or, more generally: argument 5 is about how humans might behave badly under current situations and governmental structures in the short term, but makes no claim that this will be a systemic problem in the long term (we could probably solve it using a singleton + mass surveillance); argument 4 is about how we don't know of any governmental(/psychological?) structures which are very likely to work well in the long term.

Having said that, your ideas were the main (but not sole) inspiration for argument 4, so if this isn't what you intended, then I may need to rethink its inclusion.

I think this division makes sense on a substantive level, and I guess I was confused by the naming and the ordering between 4 and 5. I would define "human safety problems" to include both short term and long term problems (just like "AI safety problems" includes short term and long term problems) so I'd put both 4 and 5 under "human safety problems" instead of just 4. I guess in my posts I mostly focused on long term problems since short term problems have already been widely recognized, but as far as naming, it seems strange to exclude short term problems from "human safety problems". Also you wrote "They are listed roughly from most specific and actionable to most general" and 4 feels like a more general problem than 5 to me, although perhaps that's arguable.

I struggle to understand the difference between #2 and #3. The prosaic AI alignment problem only exists because we don't know how to make an agent that tries to do what we want it to do. Would you say that #3 is a concrete scenario for how #2 could lead to a catastrophe?

I think #3 could occur because of #2 (which I now mostly call "inner misalignment"), but it could also occur because of outer misalignment.

Broadly speaking, though, I think you're right that #2 and #3 are different types of things. Because of that and other issues, I no longer think that this post disentangles the arguments satisfactorily; I'll make a note of this at the top of the document.

We solve all the important philosophical problems ahead of time and program the solutions into the AI.
We solve metaphilosophy (i.e., understand philosophical reasoning as well as we understand mathematical reasoning) and program that into the AI so it can solve philosophical problems on its own.
We program the AI to learn philosophical reasoning from humans or use human simulations to solve philosophical problems.

Since then people have come up with a couple more scenarios (which did make me slightly more optimistic about this problem):

We all coordinate to stop technological progress some time after AI but before space colonization, and have a period of long reflection where humans, maybe with help from AIs, spend thousands or millions of years to solve philosophical problems.
We program AIs to be corrigible to their users, some users care about getting philosophy correct so the AIs help keep them safe and get their "fair share" of the universe until philosophical problems are solved eventually, enough users care about this so that we end up with a mostly good future, and lack of philosophical knowledge doesn't cause disaster in the meantime. (My writings on "human safety problems" were in part a response to this suggestion, outlining how hard it would be to keep humans "safe" in this scenario.)

This [Maximisers are dangerous] was the main thesis advanced by Yudkowsky and Bostrom when founding the field of AI safety. [...] And this proliferation of arguments is evidence against their quality: if your conclusions remain the same but your reasons for holding those conclusions change, that’s a warning sign for motivated cognition (especially when those beliefs are considered important in your social group).

1. There's now much more concern about argument 2, the target loading problem (as well as inner optimisers, insofar as they're distinct).

3. Argument 3 (prosaic AI alignment) is now considered more important and more tractable.

4. There's now been significant criticism of coherence arguments as a reason to believe that AGI will pursue long-term goals in an insatiable maximising fashion.

B. Some of the proponents of the original arguments have changed their minds significantly.

C. The proponents of the original arguments were misinterpreted, or overemphasised some of their beliefs at the expense of others, and actually these shifts are just a change in emphasis.

I don't have strong beliefs about the relative importance of A, B and C, although I would be rather surprised if any one of them were primarily responsible for all the shifts I mentioned above.

I think none of these options reflect badly on anyone involved (getting everything exactly right the first time is an absurdly high standard), but I think A and B would be weak evidence against the importance of AI safety (assuming you’ve already conditioned on the size of the field, etc).

C. The proponents of the original arguments were misinterpreted, or overemphasised some of their beliefs at the expense of others, and actually these shifts are just a change in emphasis.

I do think there would have been value in connecting it more to past writing about similar topics, though I recognize that this might have easily doubled the effort of writing this post.

One place that comes to mind that had a bunch of related writing is Arbital.

I was also thinking about linking to a bunch of related taxonomies. The "Disjunctive AI Risk" paper comes to mind. I will think about other examples.

This is great, but 4 and 5 seem to be aspects of the same problem to me (i.e., that humans aren't safe agents) and I'm not sure how you're proposing to draw the line between them. For example

It’s also possible that we invent some technology which destroys us unexpectedly, either through unluckiness or carelessness.

Can you explain a bit more the difference you see between 4 and 5?

Having said that, your ideas were the main (but not sole) inspiration for argument 4, so if this isn't what you intended, then I may need to rethink its inclusion.

I think #3 could occur because of #2 (which I now mostly call "inner misalignment"), but it could also occur because of outer misalignment.

31

Disentangling arguments for the importance of AI safety

31