Counterarguments to the basic AI x-risk case

(Crossposted from AI Impacts Blog)

This is going to be a list of holes I see in the basic argument for existential risk from superhuman AI systems1

To start, here’s an outline of what I take to be the basic case2:

I. If superhuman AI systems are built, any given system is likely to be ‘goal-directed’

Reasons to expect this:

  1. Goal-directed behavior is likely to be valuable, e.g. economically. 
  2. Goal-directed entities may tend to arise from machine learning training processes not intending to create them (at least via the methods that are likely to be used).
  3. ‘Coherence arguments’ may imply that systems with some goal-directedness will become more strongly goal-directed over time.

II. If goal-directed superhuman AI systems are built, their desired outcomes will probably be about as bad as an empty universe by human lights 

Reasons to expect this:

  1. Finding useful goals that aren’t extinction-level bad appears to be hard: we don’t have a way to usefully point at human goals, and divergences from human goals seem likely to produce goals that are in intense conflict with human goals, due to a) most goals producing convergent incentives for controlling everything, and b) value being ‘fragile’, such that an entity with ‘similar’ values will generally create a future of virtually no value.
  2. Finding goals that are extinction-level bad and temporarily useful appears to be easy: for example, advanced AI with the sole objective ‘maximize company revenue’ might profit said company for a time before gathering the influence and wherewithal to pursue the goal in ways that blatantly harm society.
  3. Even if humanity found acceptable goals, giving a powerful AI system any specific goals appears to be hard. We don’t know of any procedure to do it, and we have theoretical reasons to expect that AI systems produced through machine learning training will generally end up with goals other than those they were trained according to. Randomly aberrant goals resulting are probably extinction-level bad for reasons described in II.1 above.

III. If most goal-directed superhuman AI systems have bad goals, the future will very likely be bad

That is, a set of ill-motivated goal-directed superhuman AI systems, of a scale likely to occur, would be capable of taking control over the future from humans. This is supported by at least one of the following being true:

  1. Superhuman AI would destroy humanity rapidly. This may be via ultra-powerful capabilities at e.g. technology design and strategic scheming, or through gaining such powers in an ‘intelligence explosion‘ (self-improvement cycle). Either of those things may happen either through exceptional heights of intelligence being reached or through highly destructive ideas being available to minds only mildly beyond our own.
  2. Superhuman AI would gradually come to control the future via accruing power and resources. Power and resources would be more available to the AI system(s) than to humans on average, because of the AI having far greater intelligence.

Below is a list of gaps in the above, as I see it, and counterarguments. A ‘gap’ is not necessarily unfillable, and may have been filled in any of the countless writings on this topic that I haven’t read. I might even think that a given one can probably be filled. I just don’t know what goes in it.  

This blog post is an attempt to run various arguments by you all on the way to making pages on AI Impacts about arguments for AI risk and corresponding counterarguments. At some point in that process I hope to also read others’ arguments, but this is not that day. So what you have here is a bunch of arguments that occur to me, not an exhaustive literature review. 

Counterarguments

A. Contra “superhuman AI systems will be ‘goal-directed’”

Different calls to ‘goal-directedness’ don’t necessarily mean the same concept

‘Goal-directedness’ is a vague concept. It is unclear that the ‘goal-directednesses’ that are favored by economic pressure, training dynamics or coherence arguments (the component arguments in part I of the argument above) are the same ‘goal-directedness’ that implies a zealous drive to control the universe (i.e. that makes most possible goals very bad, fulfilling II above). 

One well-defined concept of goal-directedness is ‘utility maximization’: always doing what maximizes a particular utility function, given a particular set of beliefs about the world. 

Utility maximization does seem to quickly engender an interest in controlling literally everything, at least for many utility functions one might have3. If you want things to go a certain way, then you have reason to control anything which gives you any leverage over that, i.e. potentially all resources in the universe (i.e. agents have ‘convergent instrumental goals’). This is in serious conflict with anyone else with resource-sensitive goals, even if prima facie those goals didn’t look particularly opposed. For instance, a person who wants all things to be red and another person who wants all things to be cubes may not seem to be at odds, given that all things could be red cubes. However if these projects might each fail for lack of energy, then they are probably at odds. 

Thus utility maximization is a notion of goal-directedness that allows Part II of the argument to work, by making a large class of goals deadly.

You might think that any other concept of ‘goal-directedness’ would also lead to this zealotry. If one is inclined toward outcome O in any plausible sense, then does one not have an interest in anything that might help procure O? No: if a system is not a ‘coherent’ agent, then it can have a tendency to bring about O in a range of circumstances, without this implying that it will take any given effective opportunity to pursue O. This assumption of consistent adherence to a particular evaluation of everything is part of utility maximization, not a law of physical systems. Call machines that push toward particular goals but are not utility maximizers pseudo-agents. 

Can pseudo-agents exist? Yes—utility maximization is computationally intractable, so any physically existent ‘goal-directed’ entity is going to be a pseudo-agent. We are all pseudo-agents, at best. But it seems something like a spectrum. At one end is a thermostat, then maybe a thermostat with a better algorithm for adjusting the heat. Then maybe a thermostat which intelligently controls the windows. After a lot of honing, you might have a system much more like a utility-maximizer: a system that deftly seeks out and seizes well-priced opportunities to make your room 68 degrees—upgrading your house, buying R&D, influencing your culture, building a vast mining empire. Humans might not be very far on this spectrum, but they seem enough like utility-maximizers already to be alarming. (And it might not be well-considered as a one-dimensional spectrum—for instance, perhaps ‘tendency to modify oneself to become more coherent’ is a fairly different axis from ‘consistency of evaluations of options and outcomes’, and calling both ‘more agentic’ is obscuring.)

Nonetheless, it seems plausible that there is a large space of systems which strongly increase the chance of some desirable objective O occurring without even acting as much like maximizers of an identifiable utility function as humans would. For instance, without searching out novel ways of making O occur, or modifying themselves to be more consistently O-maximizing. Call these ‘weak pseudo-agents’. 

For example, I can imagine a system constructed out of a huge number of ‘IF X THEN Y’ statements (reflexive responses), like ‘if body is in hallway, move North’, ‘if hands are by legs and body is in kitchen, raise hands to waist’.., equivalent to a kind of vector field of motions, such that for every particular state, there are directions that all the parts of you should be moving. I could imagine this being designed to fairly consistently cause O to happen within some context. However since such behavior would not be produced by a process optimizing O, you shouldn’t expect it to find new and strange routes to O, or to seek O reliably in novel circumstances. There appears to be zero pressure for this thing to become more coherent, unless its design already involves reflexes to move its thoughts in certain ways that lead it to change itself. I expect you could build a system like this that reliably runs around and tidies your house say, or runs your social media presence, without it containing any impetus to become a more coherent agent (because it doesn’t have any reflexes that lead to pondering self-improvement in this way).

It is not clear that economic incentives generally favor the far end of this spectrum over weak pseudo-agency. There are incentives toward systems being more like utility maximizers, but also incentives against. 

The reason any kind of ‘goal-directedness’ is incentivised in AI systems is that then the system can be given an objective by someone hoping to use their cognitive labor, and the system will make that objective happen. Whereas a similar non-agentic AI system might still do almost the same cognitive labor, but require an agent (such as a person) to look at the objective and decide what should be done to achieve it, then ask the system for that. Goal-directedness means automating this high-level strategizing. 

Weak pseudo-agency fulfills this purpose to some extent, but not as well as utility maximization. However if we think that utility maximization is difficult to wield without great destruction, then that suggests a disincentive to creating systems with behavior closer to utility-maximization. Not just from the world being destroyed, but from the same dynamic causing more minor divergences from expectations, if the user can’t specify their own utility function well. 

That is, if it is true that utility maximization tends to lead to very bad outcomes relative to any slightly different goals (in the absence of great advances in the field of AI alignment), then the most economically favored level of goal-directedness seems unlikely to be as far as possible toward utility maximization. More likely it is a level of pseudo-agency that achieves a lot of the users’ desires without bringing about sufficiently detrimental side effects to make it not worthwhile. (This is likely more agency than is socially optimal, since some of the side-effects will be harms to others, but there seems no reason to think that it is a very high degree of agency.)

Some minor but perhaps illustrative evidence: anecdotally, people prefer interacting with others who predictably carry out their roles or adhere to deontological constraints, rather than consequentialists in pursuit of broadly good but somewhat unknown goals. For instance, employers would often prefer employees who predictably follow rules than ones who try to forward company success in unforeseen ways.

The other arguments to expect goal-directed systems mentioned above seem more likely to suggest approximate utility-maximization rather than some other form of goal-directedness, but it isn’t that clear to me. I don’t know what kind of entity is most naturally produced by contemporary ML training. Perhaps someone else does. I would guess that it’s more like the reflex-based agent described above, at least at present. But present systems aren’t the concern.

Coherence arguments are arguments for being coherent a.k.a. maximizing a utility function, so one might think that they imply a force for utility maximization in particular. That seems broadly right. Though note that these are arguments that there is some pressure for the system to modify itself to become more coherent. What actually results from specific systems modifying themselves seems like it might have details not foreseen in an abstract argument merely suggesting that the status quo is suboptimal whenever it is not coherent. Starting from a state of arbitrary incoherence and moving iteratively in one of many pro-coherence directions produced by whatever whacky mind you currently have isn’t obviously guaranteed to increasingly approximate maximization of some sensical utility function. For instance, take an entity with a cycle of preferences, apples > bananas = oranges > pears > apples. The entity notices that it sometimes treats oranges as better than pears and sometimes worse. It tries to correct by adjusting the value of oranges to be the same as pears. The new utility function is exactly as incoherent as the old one. Probably moves like this are rarer than ones that make you more coherent in this situation, but I don’t know, and I also don’t know if this is a great model of the situation for incoherent systems that could become more coherent.

What it might look like if this gap matters: AI systems proliferate, and have various goals. Some AI systems try to make money in the stock market. Some make movies. Some try to direct traffic optimally. Some try to make the Democratic party win an election. Some try to make Walmart maximally profitable. These systems have no perceptible desire to optimize the universe for forwarding these goals because they aren’t maximizing a general utility function, they are more ‘behaving like someone who is trying to make Walmart profitable’. They make strategic plans and think about their comparative advantage and forecast business dynamics, but they don’t build nanotechnology to manipulate everybody’s brains, because that’s not the kind of behavior pattern they were designed to follow. The world looks kind of like the current world, in that it is fairly non-obvious what any entity’s ‘utility function’ is. It often looks like AI systems are ‘trying’ to do things, but there’s no reason to think that they are enacting a rational and consistent plan, and they rarely do anything shocking or galaxy-brained.

Ambiguously strong forces for goal-directedness need to meet an ambiguously high bar to cause a risk

The forces for goal-directedness mentioned in I are presumably of finite strength. For instance, if coherence arguments correspond to pressure for machines to become more like utility maximizers, there is an empirical answer to how fast that would happen with a given system. There is also an empirical answer to how ‘much’ goal directedness is needed to bring about disaster, supposing that utility maximization would bring about disaster and, say, being a rock wouldn’t. Without investigating these empirical details, it is unclear whether a particular qualitatively identified force for goal-directedness will cause disaster within a particular time.

What it might look like if this gap matters: There are not that many systems doing something like utility maximization in the new AI economy. Demand is mostly for systems more like GPT or DALL-E, which transform inputs in some known way without reference to the world, rather than ‘trying’ to bring about an outcome. Maybe the world was headed for more of the latter, but ethical and safety concerns reduced desire for it, and it wasn’t that hard to do something else. Companies setting out to make non-agentic AI systems have no trouble doing so. Incoherent AIs are never observed making themselves more coherent, and training has never produced an agent unexpectedly. There are lots of vaguely agentic things, but they don’t pose much of a problem. There are a few things at least as agentic as humans, but they are a small part of the economy.

B. Contra “goal-directed AI systems’ goals will be bad”

Small differences in utility functions may not be catastrophic

Arguably, humans are likely to have somewhat different values to one another even after arbitrary reflection. If so, there is some extended region of the space of possible values that the values of different humans fall within. That is, ‘human values’ is not a single point.

If the values of misaligned AI systems fall within that region, this would not appear to be worse in expectation than the situation where the long-run future was determined by the values of humans other than you. (This may still be a huge loss of value relative to the alternative, if a future determined by your own values is vastly better than that chosen by a different human, and if you also expected to get some small fraction of the future, and will now get much less. These conditions seem non-obvious however, and if they obtain you should worry about more general problems than AI.)

Plausibly even a single human, after reflecting, could on their own come to different places in a whole region of specific values, depending on somewhat arbitrary features of how the reflecting period went. In that case, even the values-on-reflection of a single human is an extended region of values space, and an AI which is only slightly misaligned could be the same as some version of you after reflecting.

There is a further larger region, ‘that which can be reliably enough aligned with typical human values via incentives in the environment’, which is arguably larger than the circle containing most human values. Human society makes use of this a lot: for instance, most of the time particularly evil humans don’t do anything too objectionable because it isn’t in their interests. This region is probably smaller for more capable creatures such as advanced AIs, but still it is some size.

Thus it seems that some amount of AI divergence from your own values is probably broadly fine, i.e. not worse than what you should otherwise expect without AI. 

Thus in order to arrive at a conclusion of doom, it is not enough to argue that we cannot align AI perfectly. The question is a quantitative one of whether we can get it close enough. And how close is ‘close enough’ is not known. 

What it might look like if this gap matters: there are many superintelligent goal-directed AI systems around. They are trained to have human-like goals, but we know that their training is imperfect and none of them has goals exactly like those presented in training. However if you just heard about a particular system’s intentions, you wouldn’t be able to guess if it was an AI or a human. Things happen much faster than they were, because superintelligent AI is superintelligent, but not obviously in a direction less broadly in line with human goals than when humans were in charge.

Differences between AI and human values may be small 

AI trained to have human-like goals will have something close to human-like goals. How close? Call it d, for a particular occasion of training AI. 

If d doesn’t have to be 0 for safety (from above), then there is a question of whether it is an acceptable size. 

I know of two issues here, pushing d upward. One is that with a finite number of training examples, the fit between the true function and the learned function will be wrong. The other is that you might accidentally create a monster (‘misaligned mesaoptimizer’) who understands its situation and pretends to have the utility function you are aiming for so that it can be freed and go out and manifest its own utility function, which could be just about anything. If this problem is real, then the values of an AI system might be arbitrarily different from the training values, rather than ‘nearby’ in some sense, so d is probably unacceptably large. But if you avoid creating such mesaoptimizers, then it seems plausible to me that d is very small. 

If humans also substantially learn their values via observing examples, then the variation in human values is arising from a similar process, so might be expected to be of a similar scale. If we care to make the ML training process more accurate than the human learning one, it seems likely that we could. For instance, d gets smaller with more data.

Another line of evidence is that for things that I have seen AI learn so far, the distance from the real thing is intuitively small. If AI learns my values as well as it learns what faces look like, it seems plausible that it carries them out better than I do.

As minor additional evidence here, I don’t know how to describe any slight differences in utility functions that are catastrophic. Talking concretely, what does a utility function look like that is so close to a human utility function that an AI system has it after a bunch of training, but which is an absolute disaster? Are we talking about the scenario where the AI values a slightly different concept of justice, or values satisfaction a smidgen more relative to joy than it should? And then that’s a moral disaster because it is wrought across the cosmos? Or is it that it looks at all of our inaction and thinks we want stuff to be maintained very similar to how it is now, so crushes any efforts to improve things? 

What it might look like if this gap matters: when we try to train AI systems to care about what specific humans care about, they usually pretty much do, as far as we can tell. We basically get what we trained for. For instance, it is hard to distinguish them from the human in question. (It is still important to actually do this training, rather than making AI systems not trained to have human values.)

Maybe value isn’t fragile

Eliezer argued that value is fragile, via examples of ‘just one thing’ that you can leave out of a utility function, and end up with something very far away from what humans want. For instance, if you leave out ‘boredom’ then he thinks the preferred future might look like repeating the same otherwise perfect moment again and again. (His argument is perhaps longer—that post says there is a lot of important background, though the bits mentioned don’t sound relevant to my disagreement.) This sounds to me like ‘value is not resilient to having components of it moved to zero’, which is a weird usage of ‘fragile’, and in particular, doesn’t seem to imply much about smaller perturbations. And smaller perturbations seem like the relevant thing with AI systems trained on a bunch of data to mimic something. 

You could very analogously say ‘human faces are fragile’ because if you just leave out the nose it suddenly doesn’t look like a typical human face at all. Sure, but is that the kind of error you get when you try to train ML systems to mimic human faces? Almost none of the faces on thispersondoesnotexist.com are blatantly morphologically unusual in any way, let alone noseless. Admittedly one time I saw someone whose face was neon green goo, but I’m guessing you can get the rate of that down pretty low if you care about it.

Eight examples, no cherry-picking:

Skipping the nose is the kind of mistake you make if you are a child drawing a face from memory. Skipping ‘boredom’ is the kind of mistake you make if you are a person trying to write down human values from memory. My guess is that this seemed closer to the plan in 2009 when that post was written, and that people cached the takeaway and haven’t updated it for deep learning which can learn what faces look like better than you can.

What it might look like if this gap matters: there is a large region ‘around’ my values in value space that is also pretty good according to me. AI easily lands within that space, and eventually creates some world that is about as good as the best possible utopia, according to me. There aren’t a lot of really crazy and terrible value systems adjacent to my values.

Short-term goals

Utility maximization really only incentivises drastically altering the universe if one’s utility function places a high enough value on very temporally distant outcomes relative to near ones. That is, long term goals are needed for danger. A person who cares most about winning the timed chess game in front of them should not spend time accruing resources to invest in better chess-playing.

AI systems could have long-term goals via people intentionally training them to do so, or via long-term goals naturally arising from systems not trained so. 

Humans seem to discount the future a lot in their usual decision-making (they have goals years in advance but rarely a hundred years) so the economic incentive to train AI to have very long term goals might be limited.

It’s not clear that training for relatively short term goals naturally produces creatures with very long term goals, though it might.

Thus if AI systems fail to have value systems relatively similar to human values, it is not clear that many will have the long time horizons needed to motivate taking over the universe.

What it might look like if this gap matters: the world is full of agents who care about relatively near-term issues, and are helpful to that end, and have no incentive to make long-term large scale schemes. Reminiscent of the current world, but with cleverer short-termism.

C. Contra “superhuman AI would be sufficiently superior to humans to overpower humanity”

Human success isn’t from individual intelligence

The argument claims (or assumes) that surpassing ‘human-level’ intelligence (i.e. the mental capacities of an individual human) is the relevant bar for matching the power-gaining capacity of humans, such that passing this bar in individual intellect means outcompeting humans in general in terms of power (argument III.2), if not being able to immediately destroy them all outright (argument III.1.). In a similar vein, introductions to AI risk often start by saying that humanity has triumphed over the other species because it is more intelligent, as a lead in to saying that if we make something more intelligent still, it will inexorably triumph over humanity.

This hypothesis about the provenance of human triumph seems wrong. Intellect surely helps, but humans look to be powerful largely because they share their meager intellectual discoveries with one another and consequently save them up over time4. You can see this starkly by comparing the material situation of Alice, a genius living in the stone age, and Bob, an average person living in 21st Century America. Alice might struggle all day to get a pot of water, while Bob might be able to summon all manner of delicious drinks from across the oceans, along with furniture, electronics, information, etc. Much of Bob’s power probably did flow from the application of intelligence, but not Bob’s individual intelligence. Alice’s intelligence, and that of those who came between them.

Bob’s greater power isn’t directly just from the knowledge and artifacts Bob inherits from other humans. He also seems to be helped for instance by much better coordination: both from a larger number people coordinating together, and from better infrastructure for that coordination (e.g. for Alice the height of coordination might be an occasional big multi-tribe meeting with trade, and for Bob it includes global instant messaging and banking systems and the Internet). One might attribute all of this ultimately to innovation, and thus to intelligence and communication, or not. I think it’s not important to sort out here, as long as it’s clear that individual intelligence isn’t the source of power.

It could still be that with a given bounty of shared knowledge (e.g. within a given society), intelligence grants huge advantages. But even that doesn’t look true here: 21st Century geniuses live basically like 21st Century people of average intelligence, give or take.

Why does this matter? Well for one thing, if you make AI which is merely as smart as a human, you shouldn’t then expect it to do that much better than a genius living in the stone age. That’s what human-level intelligence gets you: nearly nothing. A piece of rope after millions of lifetimes. Humans without their culture are much like other animals. 

To wield the control-over-the-world of a genius living in the 21st Century, the human-level AI would seem to need something like the other benefits that the 21st century genius gets from their situation in connection with a society. 

One such thing is access to humanity’s shared stock of hard-won information. AI systems plausibly do have this, if they can get most of what is relevant by reading the internet. This isn’t obvious: people also inherit information from society through copying habits and customs, learning directly from other people, and receiving artifacts with implicit information (for instance, a factory allows whoever owns the factory to make use of intellectual work that was done by the people who built the factory, but that information may not available explicitly even for the owner of the factory, let alone to readers on the internet). These sources of information seem likely to also be available to AI systems though, at least if they are afforded the same options as humans.

My best guess is that AI systems easily do better than humans on extracting information from humanity’s stockpile, and on coordinating, and so on this account are probably in an even better position to compete with humans than one might think on the individual intelligence model, but that is a guess. In that case perhaps this misunderstanding makes little difference to the outcomes of the argument. However it seems at least a bit more complicated. 

Suppose that AI systems can have access to all information humans can have access to. The power the 21st century person gains from their society is modulated by their role in society, and relationships, and rights, and the affordances society allows them as a result. Their power will vary enormously depending on whether they are employed, or listened to, or paid, or a citizen, or the president. If AI systems’ power stems substantially from interacting with society, then their power will also depend on affordances granted, and humans may choose not to grant them many affordances (see section ‘Intelligence may not be an overwhelming advantage’ for more discussion).

However suppose that your new genius AI system is also treated with all privilege. The next way that this alternate model matters is that if most of what is good in a person’s life is determined by the society they are part of, and their own labor is just buying them a tiny piece of that inheritance, then if they are for instance twice as smart as any other human, they don’t get to use technology that it twice as good. They just get a larger piece of that same shared technological bounty purchasable by anyone. Because each individual person is adding essentially nothing in terms of technology, so twice that is still basically nothing. 

In contrast, I think people are often imagining that a single entity somewhat smarter than a human will be able to quickly use technologies that are somewhat better than current human technologies. This seems to be mistaking the actions of a human and the actions of a human society. If a hundred thousand people sometimes get together for a few years and make fantastic new weapons, you should not expect an entity somewhat smarter than a person to make even better weapons. That’s off by a factor of about a hundred thousand. 

There might be places you can get far ahead of humanity by being better than a single human—it depends how much accomplishments depend on the few most capable humans in the field, and how few people are working on the problem. But for instance the Manhattan Project took a hundred thousand people several years, and von Neumann (a mythically smart scientist) joining the project did not reduce it to an afternoon. Plausibly to me, some specific people being on the project caused it to not take twice as many person-years, though the plausible candidates here seem to be more in the business of running things than doing science directly (though that also presumably involves intelligence). But even if you are an ambitious somewhat superhuman intelligence, the influence available to you seems to plausibly be limited to making a large dent in the effort required for some particular research endeavor, not single-handedly outmoding humans across many research endeavors.

This is all reason to doubt that a small number of superhuman intelligences will rapidly take over or destroy the world (as in III.i.). This doesn’t preclude a set of AI systems that are together more capable than a large number of people from making great progress. However some related issues seem to make that less likely.

Another implication of this model is that if most human power comes from buying access to society’s shared power, i.e. interacting with the economy, you should expect intellectual labor by AI systems to usually be sold, rather than for instance put toward a private stock of knowledge. This means the intellectual outputs are mostly going to society, and the main source of potential power to an AI system is the wages received (which may allow it to gain power in the long run). However it seems quite plausible that AI systems at this stage will generally not receive wages, since they presumably do not need them to be motivated to do the work they were trained for. It also seems plausible that they would be owned and run by humans. This would seem to not involve any transfer of power to that AI system, except insofar as its intellectual outputs benefit it (e.g. if it is writing advertising material, maybe it doesn’t get paid for that, but if it can write material that slightly furthers its own goals in the world while also fulfilling the advertising requirements, then it sneaked in some influence.) 

If there is AI which is moderately more competent than humans, but not sufficiently more competent to take over the world, then it is likely to contribute to this stock of knowledge and affordances shared with humans. There is no reason to expect it to build a separate competing stock, any more than there is reason for a current human household to try to build a separate competing stock rather than sell their labor to others in the economy. 

In summary:

  1. Functional connection with a large community of other intelligences in the past and present is probably a much bigger factor in the success of humans as a species or individual humans than is individual intelligence. 
  2. Thus this also seems more likely to be important for AI success than individual intelligence. This is contrary to a usual argument for AI superiority, but probably leaves AI systems at least as likely to outperform humans, since superhuman AI is probably superhumanly good at taking in information and coordinating.
  3. However it is not obvious that AI systems will have the same access to society’s accumulated information e.g. if there is information which humans learn from living in society, rather than from reading the internet. 
  4. And it seems an open question whether AI systems are given the same affordances in society as humans, which also seem important to making use of the accrued bounty of power over the world that humans have. For instance, if they are not granted the same legal rights as humans, they may be at a disadvantage in doing trade or engaging in politics or accruing power.
  5. The fruits of greater intelligence for an entity will probably not look like society-level accomplishments unless it is a society-scale entity
  6. The route to influence with smaller fruits probably by default looks like participating in the economy rather than trying to build a private stock of knowledge.
  7. If the resources from participating in the economy accrue to the owners of AI systems, not to the systems themselves, then there is less reason to expect the systems to accrue power incrementally, and they are at a severe disadvantage relative to humans. 

Overall these are reasons to expect AI systems with around human-level cognitive performance to not destroy the world immediately, and to not amass power as easily as one might imagine. 

What it might look like if this gap matters: If AI systems are somewhat superhuman, then they do impressive cognitive work, and each contributes to technology more than the best human geniuses, but not more than the whole of society, and not enough to materially improve their own affordances. They don’t gain power rapidly because they are disadvantaged in other ways, e.g. by lack of information, lack of rights, lack of access to positions of power. Their work is sold and used by many actors, and the proceeds go to their human owners. AI systems do not generally end up with access to masses of technology that others do not have access to, and nor do they have private fortunes. In the long run, as they become more powerful, they might take power if other aspects of the situation don’t change. 

AI agents may not be radically superior to combinations of humans and non-agentic machines

‘Human level capability’ is a moving target. For comparing the competence of advanced AI systems to humans, the relevant comparison is with humans who have state-of-the-art AI and other tools. For instance, the human capacity to make art quickly has recently been improved by a variety of AI art systems. If there were now an agentic AI system that made art, it would make art much faster than a human of 2015, but perhaps hardly faster than a human of late 2022. If humans continually have access to tool versions of AI capabilities, it is not clear that agentic AI systems must ever have an overwhelmingly large capability advantage for important tasks (though they might). 

(This is not an argument that humans might be better than AI systems, but rather: if the gap in capability is smaller, then the pressure for AI systems to accrue power is less and thus loss of human control is slower and easier to mitigate entirely through other forces, such as subsidizing human involvement or disadvantaging AI systems in the economy.)

Some advantages of being an agentic AI system vs. a human with a tool AI system seem to be:

  1. There might just not be an equivalent tool system, for instance if it is impossible to train systems without producing emergent agents.
  2. When every part of a process takes into account the final goal, this should make the choices within the task more apt for the final goal (and agents know their final goal, whereas tools carrying out parts of a larger problem do not).
  3. For humans, the interface for using a capability of one’s mind tends to be smoother than the interface for using a tool. For instance a person who can do fast mental multiplication can do this more smoothly and use it more often than a person who needs to get out a calculator. This seems likely to persist.

1 and 2 may or may not matter much. 3 matters more for brief, fast, unimportant tasks. For instance, consider again people who can do mental calculations better than others. My guess is that this advantages them at using Fermi estimates in their lives and buying cheaper groceries, but does not make them materially better at making large financial choices well. For a one-off large financial choice, the effort of getting out a calculator is worth it and the delay is very short compared to the length of the activity. The same seems likely true of humans with tools vs. agentic AI with the same capacities integrated into their minds. Conceivably the gap between humans with tools and goal-directed AI is small for large, important tasks.

What it might look like if this gap matters: agentic AI systems have substantial advantages over humans with tools at some tasks like rapid interaction with humans, and responding to rapidly evolving strategic situations.  One-off large important tasks such as advanced science are mostly done by tool AI. 

Trust

If goal-directed AI systems are only mildly more competent than some combination of tool systems and humans (as suggested by considerations in the last two sections), we still might expect AI systems to out-compete humans, just more slowly. However AI systems have one serious disadvantage as employees of humans: they are intrinsically untrustworthy, while we don’t understand them well enough to be clear on what their values are or how they will behave in any given case. Even if they did perform as well as humans at some task, if humans can’t be certain of that, then there is reason to disprefer using them. This can be thought of as two problems: firstly, slightly misaligned systems are less valuable because they genuinely do the thing you want less well, and secondly, even if they were not misaligned, if humans can’t know that (because we have no good way to verify the alignment of AI systems) then it is costly in expectation to use them. (This is only a further force acting against the supremacy of AI systems—they might still be powerful enough that using them is enough of an advantage that it is worth taking the hit on trustworthiness.)

What it might look like if this gap matters: in places where goal-directed AI systems are not typically hugely better than some combination of less goal-directed systems and humans, the job is often given to the latter if trustworthiness matters. 

Headroom

For AI to vastly surpass human performance at a task, there needs to be ample room for improvement above human level. For some tasks, there is not—tic-tac-toe is a classic example. It is not clear how close humans (or technologically aided humans) are from the limits to competence in the particular domains that will matter. It is to my knowledge an open question how much ‘headroom’ there is. My guess is a lot, but it isn’t obvious.

How much headroom there is varies by task. Categories of task for which there appears to be little headroom: 

  1. Tasks where we know what the best performance looks like, and humans can get close to it. For instance, machines cannot win more often than the best humans at Tic-tac-toe (playing within the rules) or solve Rubik’s cubes much more reliably, or extracting calories from fuel
  2. Tasks where humans are already be reaping most of the value—for instance, perhaps most of the value of forks is in having a handle with prongs attached to the end, and while humans continue to design slightly better ones, and machines might be able to add marginal value to that project more than twice as fast as the human designers, they cannot perform twice as well in terms of the value of each fork, because forks are already 95% as good as they can be. 
  3. Better performance is quickly intractable. For instance, we know that for tasks in particular complexity classes, there are computational limits to how well one can perform across the board. Or for chaotic systems, there can be limits to predictability. (That is, tasks might lack headroom not because they are simple, but because they are complex. E.g. AI probably can’t predict the weather much further out than humans.)

Categories of task where a lot of headroom seems likely:

  1. Competitive tasks where the value of a certain level of performance depends on whether one is better or worse than one’s opponent, so that the marginal value of more performance doesn’t hit diminishing returns, as long as your opponent keeps competing and taking back what you just won. Though in one way this is like having little headroom: there’s no more value to be had—the game is zero sum. And while there might often be a lot of value to be gained by doing a bit better on the margin, still if all sides can invest, then nobody will end up better off than they were. So whether this seems more like high or low headroom depends on what we are asking exactly. Here we are asking if AI systems can do much better than humans: in a zero sum contest like this, they likely can in the sense that they can beat humans, but not in the sense of reaping anything more from the situation than the humans ever got.
  2. Tasks where it is twice as good to do the same task twice as fast, and where speed is bottlenecked on thinking time.
  3. Tasks where there is reason to think that optimal performance is radically better than we have seen. For instance, perhaps we can estimate how high Chess Elo rankings must go before reaching perfection by reasoning theoretically about the game, and perhaps it is very high (I don’t know).
  4. Tasks where humans appear to use very inefficient methods. For instance, it was perhaps predictable before calculators that they would be able to do mathematics much faster than humans, because humans can only keep a small number of digits in their heads, which doesn’t seem like an intrinsically hard problem. Similarly, I hear humans often use mental machinery designed for one mental activity for fairly different ones, through analogy. For instance, when I think about macroeconomics, I seem to be basically using my intuitions for dealing with water. When I do mathematics in general, I think I’m probably using my mental capacities for imagining physical objects.

What it might look like if this gap matters: many challenges in today’s world remain challenging for AI. Human behavior is not readily predictable or manipulable very far beyond what we have explored, only slightly more complicated schemes are feasible before the world’s uncertainties overwhelm planning; much better ads are soon met by much better immune responses; much better commercial decision-making ekes out some additional value across the board but most products were already fulfilling a lot of their potential; incredible virtual prosecutors meet incredible virtual defense attorneys and everything is as it was; there are a few rounds of attack-and-defense in various corporate strategies before a new equilibrium with broad recognition of those possibilities; conflicts and ‘social issues’ remain mostly intractable. There is a brief golden age of science before the newly low-hanging fruit are again plucked and it is only lightning fast in areas where thinking was the main bottleneck, e.g. not in medicine.

Intelligence may not be an overwhelming advantage

Intelligence is helpful for accruing power and resources, all things equal, but many other things are helpful too. For instance money, social standing, allies, evident trustworthiness, not being discriminated against (this was slightly discussed in section ‘Human success isn’t from individual intelligence’). AI systems are not guaranteed to have those in abundance. The argument assumes that any difference in intelligence in particular will eventually win out over any differences in other initial resources. I don’t know of reason to think that. 

Empirical evidence does not seem to support the idea that cognitive ability is a large factor in success. Situations where one entity is much smarter or more broadly mentally competent than other entities regularly occur without the smarter one taking control over the other:

  1. Species exist with all levels of intelligence. Elephants have not in any sense won over gnats; they do not rule gnats; they do not have obviously more control than gnats over the environment. 
  2. Competence does not seem to aggressively overwhelm other advantages in humans: 
    1. Looking at the world, intuitively the big discrepancies in power are not seemingly about intelligence.
    2. IQ 130 humans apparently earn very roughly $6000-$18,500 per year more than average IQ humans.
    3. Elected representatives are apparently smarter on average, but it is a slightly shifted curve, not a radically difference.
    4. MENSA isn’t a major force in the world.
    5. Many places where people see huge success through being cognitively able are ones where they show off their intelligence to impress people, rather than actually using it for decision-making. For instance, writers, actors, song-writers, comedians, all sometimes become very successful through cognitive skills. Whereas scientists, engineers and authors of software use cognitive skills to make choices about the world, and less often become extremely rich and famous, say. If intelligence were that useful for strategic action, it seems like using it for that would be at least as powerful as showing it off. But maybe this is just an accident of which fields have winner-takes-all type dynamics.
    6. If we look at people who evidently have good cognitive abilities given their intellectual output, their personal lives are not obviously drastically more successful, anecdotally.
    7. One might counter-counter-argue that humans are very similar to one another in capability, so even if intelligence matters much more than other traits, you won’t see that by looking at  the near-identical humans. This does not seem to be true. Often at least, the difference in performance between mediocre human performance and top level human performance is large, relative to the space below, iirc. For instance, in chess, the Elo difference between the best and worst players is about 2000, whereas the difference between the amateur play and random play is maybe 400-2800 (if you accept Chess StackExchange guesses as a reasonable proxy for the truth here). And in terms of AI progress, amateur human play was reached in the 50s, roughly when research began, and world champion level play was reached in 1997. 

And theoretically I don’t know why one would expect greater intelligence to win out over other advantages over time.  There are actually two questionable theories here: 1) Charlotte having more overall control than David at time 0 means that Charlotte will tend to have an even greater share of control at time 1. And, 2) Charlotte having more intelligence than David at time 0 means that Charlotte will have a greater share of control at time 1 even if Bob has more overall control (i.e. more of other resources) at time 1.

What it might look like if this gap matters: there are many AI systems around, and they strive for various things. They don’t hold property, or vote, or get a weight in almost anyone’s decisions, or get paid, and are generally treated with suspicion. These things on net keep them from gaining very much power. They are very persuasive speakers however and we can’t stop them from communicating, so there is a constant risk of people willingly handing them power, in response to their moving claims that they are an oppressed minority who suffer. The main thing stopping them from winning is that their position as psychopaths bent on taking power for incredibly pointless ends is widely understood.

Unclear that many goals realistically incentivise taking over the universe

I have some goals. For instance, I want some good romance. My guess is that trying to take over the universe isn’t the best way to achieve this goal. The same goes for a lot of my goals, it seems to me. Possibly I’m in error, but I spend a lot of time pursuing goals, and very little of it trying to take over the universe. Whether a particular goal is best forwarded by trying to take over the universe as a substep seems like a quantitative empirical question, to which the answer is virtually always ‘not remotely’. Don’t get me wrong: all of these goals involve some interest in taking over the universe. All things equal, if I could take over the universe for free, I do think it would help in my romantic pursuits. But taking over the universe is not free. It’s actually super duper duper expensive and hard. So for most goals arising, it doesn’t bear considering. The idea of taking over the universe as a substep is entirely laughable for almost any human goal.

So why do we think that AI goals are different? I think the thought is that it’s radically easier for AI systems to take over the world, because all they have to do is to annihilate humanity, and they are way better positioned to do that than I am, and also better positioned to survive the death of human civilization than I am. I agree that it is likely easier, but how much easier? So much easier to take it from ‘laughably unhelpful’ to ‘obviously always the best move’? This is another quantitative empirical question.

What it might look like if this gap matters: Superintelligent AI systems pursue their goals. Often they achieve them fairly well. This is somewhat contrary to ideal human thriving, but not lethal. For instance, some AI systems are trying to maximize Amazon’s market share, within broad legality. Everyone buys truly incredible amounts of stuff from Amazon, and people often wonder if it is too much stuff. At no point does attempting to murder all humans seem like the best strategy for this. 

Quantity of new cognitive labor is an empirical question, not addressed

Whether some set of AI systems can take over the world with their new intelligence probably depends how much total cognitive labor they represent. For instance, if they are in total slightly more capable than von Neumann, they probably can’t take over the world. If they are together as capable (in some sense) as a million 21st Century human civilizations, then they probably can (at least in the 21st Century).

It also matters how much of that is goal-directed at all, and highly intelligent, and how much of that is directed at achieving the AI systems’ own goals rather than those we intended them for, and how much of that is directed at taking over the world. 

If we continued to build hardware, presumably at some point AI systems would account for most of the cognitive labor in the world. But if there is first an extended period of more minimal advanced AI presence, that would probably prevent an immediate death outcome, and improve humanity’s prospects for controlling a slow-moving AI power grab. 

What it might look like if this gap matters: when advanced AI is developed, there is a lot of new cognitive labor in the world, but it is a minuscule fraction of all of the cognitive labor in the world. A large part of it is not goal-directed at all, and of that, most of the new AI thought is applied to tasks it was intended for. Thus what part of it is spent on scheming to grab power for AI systems is too small to grab much power quickly. The amount of AI cognitive labor grows fast over time, and in several decades it is most of the cognitive labor, but humanity has had extensive experience dealing with its power grabbing.

Speed of intelligence growth is ambiguous

The idea that a superhuman AI would be able to rapidly destroy the world seems prima facie unlikely, since no other entity has ever done that. Two common broad arguments for it:

  1. There will be a feedback loop in which intelligent AI makes more intelligent AI repeatedly until AI is very intelligent.
  2. Very small differences in brains seem to correspond to very large differences in performance, based on observing humans and other apes. Thus any movement past human-level will take us to unimaginably superhuman level.

These both seem questionable.

  1. Feedback loops can happen at very different rates. Identifying a feedback loop empirically does not signify an explosion of whatever you are looking at. For instance, technology is already helping improve technology. To get to a confident conclusion of doom, you need evidence that the feedback loop is fast.
  2. It does not seem clear that small improvements in brains lead to large changes in intelligence in general, or will do on the relevant margin. Small differences between humans and other primates might include those helpful for communication (see Section ‘Human success isn’t from individual intelligence’), which do not seem relevant here. If there were a particularly powerful cognitive development between chimps and humans, it is unclear that AI researchers find that same insight at the same point in the process (rather than at some other time). 

A large number of other arguments have been posed for expecting very fast growth in intelligence at around human level. I previously made a list of them with counterarguments, though none seemed very compelling. Overall, I don’t know of strong reason to expect very fast growth in AI capabilities at around human-level AI performance, though I hear such arguments might exist. 

What it would look like if this gap mattered: AI systems would at some point perform at around human level at various tasks, and would contribute to AI research, along with everything else. This would contribute to progress to an extent familiar from other technological progress feedback, and would not e.g. lead to a superintelligent AI system in minutes.

Key concepts are vague

Concepts such as ‘control’, ‘power’, and ‘alignment with human values’ all seem vague. ‘Control’ is not zero sum (as seemingly assumed) and is somewhat hard to pin down, I claim. What an ‘aligned’ entity is exactly seems to be contentious in the AI safety community, but I don’t know the details. My guess is that upon further probing, these conceptual issues are resolvable in a way that doesn’t endanger the argument, but I don’t know. I’m not going to go into this here.

What it might look like if this gap matters: upon thinking more, we realize that our concerns were confused. Things go fine with AI in ways that seem obvious in retrospect. This might look like it did for people concerned about the ‘population bomb’ or as it did for me in some of my youthful concerns about sustainability: there was a compelling abstract argument for a problem, and the reality didn’t fit the abstractions well enough to play out as predicted.

D. Contra the whole argument

The argument overall proves too much about corporations

Here is the argument again, but modified to be about corporations. A couple of pieces don’t carry over, but they don’t seem integral.

I. Any given corporation is likely to be ‘goal-directed’

Reasons to expect this:

  1. Goal-directed behavior is likely to be valuable in corporations, e.g. economically
  2. Goal-directed entities may tend to arise from machine learning training processes not intending to create them (at least via the methods that are likely to be used).
  3. ‘Coherence arguments’ may imply that systems with some goal-directedness will become more strongly goal-directed over time.

II. If goal-directed superhuman corporations are built, their desired outcomes will probably be about as bad as an empty universe by human lights

Reasons to expect this:

  1. Finding useful goals that aren’t extinction-level bad appears to be hard: we don’t have a way to usefully point at human goals, and divergences from human goals seem likely to produce goals that are in intense conflict with human goals, due to a) most goals producing convergent incentives for controlling everything, and b) value being ‘fragile’, such that an entity with ‘similar’ values will generally create a future of virtually no value. 
  2. Finding goals that are extinction-level bad and temporarily useful appears to be easy: for example, corporations with the sole objective ‘maximize company revenue’ might profit for a time before gathering the influence and wherewithal to pursue the goal in ways that blatantly harm society.
  3. Even if humanity found acceptable goals, giving a corporation any specific goals appears to be hard. We don’t know of any procedure to do it, and we have theoretical reasons to expect that AI systems produced through machine learning training will generally end up with goals other than those that they were trained according to. Randomly aberrant goals resulting are probably extinction-level bad, for reasons described in II.1 above.
     

III. If most goal-directed corporations have bad goals, the future will very likely be bad

That is, a set of ill-motivated goal-directed corporations, of a scale likely to occur, would be capable of taking control of the future from humans. This is supported by at least one of the following being true:

  1. A corporation would destroy humanity rapidly. This may be via ultra-powerful capabilities at e.g. technology design and strategic scheming, or through gaining such powers in an ‘intelligence explosion‘ (self-improvement cycle). Either of those things may happen either through exceptional heights of intelligence being reached or through highly destructive ideas being available to minds only mildly beyond our own.
  2. Superhuman AI would gradually come to control the future via accruing power and resources. Power and resources would be more available to the corporation than to humans on average, because of the corporation having far greater intelligence.

This argument does point at real issues with corporations, but we do not generally consider such issues existentially deadly. 

One might argue that there are defeating reasons that corporations do not destroy the world: they are made of humans so can be somewhat reined in; they are not smart enough; they are not coherent enough. But in that case, the original argument needs to make reference to these things, so that they apply to one and not the other.

What it might look like if this counterargument matters: something like the current world. There are large and powerful systems doing things vastly beyond the ability of individual humans, and acting in a definitively goal-directed way. We have a vague understanding of their goals, and do not assume that they are coherent. Their goals are clearly not aligned with human goals, but they have enough overlap that many people are broadly in favor of their existence. They seek power. This all causes some problems, but problems within the power of humans and other organized human groups to keep under control, for some definition of ‘under control’.

Conclusion

I think there are quite a few gaps in the argument, as I understand it. My current guess (prior to reviewing other arguments and integrating things carefully) is that enough uncertainties might resolve in the dangerous directions that existential risk from AI is a reasonable concern. I don’t at present though see how one would come to think it was overwhelmingly likely.

New Comment
21 comments, sorted by Click to highlight new comments since: Today at 12:51 PM

I think there are quite a few gaps in the argument, as I understand it. My current guess (prior to reviewing other arguments and integrating things carefully) is that enough uncertainties might resolve in the dangerous directions that existential risk from AI is a reasonable concern. I don’t at present though see how one would come to think it was .

Suppose you went through the following exercise. For each scenario described under "What it might look like if this gap matters", ask:

  1. Is this an existentially secure state of affairs?
  2. If not, what are the main obstacles to reaching existential security from here?

and collected the obstacles, you might assemble a list like this one, which might update you toward AI x-risk being "overwhelmingly likely". (Personally, if I had to put a number on it, I'd say 80%.)

Ronny Fernandez asked me, Nate, and Eliezer for our take on Twitter. Copying over Nate's reply:

briefly: A) narrow non-optimizers can exist but won't matter; B) wake me when the allegedly maximally-facelike image looks human; C) ribosomes show that cognition-bound superpowers exist; D) humans can't stack into superintelligent corps, but if they could then yes plz value-load

(tbc, I appreciate Katja saying all that. Hooray for stating what you think, and hooray again when it's potentially locally unpopular! If I were less harried I might give more than a tweet of engagement, but in reality I probably won't, sorry.)

I asked Nate what he meant by B, and he said:

section B seemed to me to be saying "AIs can figure out what a face is". And, ok, sure, but if you ask them for the faciest possible thing, it's not very human!facelike.

which is one of many objections, ofc (others including "ah yes but can you aim it at a human concept" )

Note: "ask them for the faciest possible thing" seems confused.

How I would've interpreted this if I were talking with another ML researcher is "Sample the face at the point of highest probability density in the generative model's latent space". For GANs and diffusion models (the models we in fact generate faces with), you can do exactly this by setting the Gaussian latents to zeros, and you will see that the result is a perfectly normal, non-Eldritch human face.

I'm guessing what he has in mind is more like "take a GAN discriminator / image classifier & find the image that maxes out the face logit", but if so, why is that the relevant operationalization? It doesn't correspond to how such a model is actually used.

EDIT: Here is what the first looks like for StyleGAN2-ADA.

It's the relevant operationalization because in the context of an AI system optimizing for X-ness of states S, the thing that matters is not what the max-likelihood sample of some prior distribution over S is, but rather what the maximum X-ness sample looks like. In other words, if you're trying to write a really good essay, you don't care what the highest likelihood essay from the distribution of human essays looks like, you care about what the essay that maxes out your essay-quality function is.

(also, the maximum likelihood essay looks like a single word, or if you normalize for length, the same word repeated over and over again up to the context length)

I took Nate to be saying that we'd compute the image with highest faceness according to the discriminator, not the generator. The generator would tend to create "thing that is a face that has the highest probability of occurring in the environment", while the discriminator, whose job is to determine whether or not something is actually a face, has a much better claim to be the thing that judges faceness. I predict that this would look at least as weird and nonhuman as those deep dream images if not more so, though I haven't actually tried it. I also predict that if you stop training the discriminator and keep training the generator, the generator starts generating weird looking nonhuman images.

This is relevant to Reinforcement Learning because of the actor-critic class of systems, where the actor is like the generator and the critic is like the discriminator. We'd ideally like the RL system to stay on course after we stop providing it with labels, but stopping labels means we stop training the critic. Which means that the actor is free to start generating adversarial policies that hack the critic, rather than policies that actually perform well in the way we'd want them to.

Also, we don’t know what would happen if we exactly optimized an image to maximize the activation of a particular human’s face detection circuitry. I expect that the result would be pretty eldritch as well.

Yeah. Wake me up when we find a single agent which makes decisions by extremizing its own concept activations. EG I'm pretty sure that people don't reflectively, most strongly want to make friends with entities which maximally activate their potential-friend detection circuitry.

"Sample the face at the point of highest probability density in the generative model's latent space". For GANs and diffusion models (the models we in fact generate faces with), you can do exactly this by setting the Gaussian latents to zeros, and you will see that the result is a perfectly normal, non-Eldritch human face.


(sort of nitpicking): 
I think it makes more sense to look for the highest density in pixel space; this requires integrating over all settings of the latents (unless your generator is invertible, in which case you can just use change of variables formula).  I expect the argument to go through, but it would be interesting to do this with an invertible generator (e.g. normalizing flow) and see if it actually does.

Also from Ronny: 

There's also an important disanalogy between generating/recognizing faces and learning 'human values', which is that humans are perfect human face recognizers but not perfect recognizers of worlds high in 'human values'.

That means that there might be world states or plans in the training data or generated by adversarial training that look to us, and ML trained to recognize these things the way we recognize them, like they are awesome, but are actually awful.

(And we aren't perfect recognizers of 'functional, safe-to-use nanofactory' or other known-to-me things that might save the world.)

Could someone clarify the relevance of ribosomes?

Nate's B) currently seems confused. I read a connotation "we need the AGI's learned concepts to be safe under extreme optimization pressure, such that, when extremized, they yield reasonable results (e.g. human faces from maximizing the AI-faceishness-concept-activation of an image)." 

But I think agents will not maximize their own concept activations, when choosing plans. An agent's values will optimize the world; the values don't optimize themselves. For example, I think that I am not looking for a romantic relationship which maximally activates my "awesome relationship" concept, if that's a thing I have. It's true that conditional on such a plan being considered, my relationship-shard might bid for that plan with strength monotonically increasing on "predicted activation of awesome-relationship". 

And conditional on such a plan getting considered, where that concept activation is maximized, I would therefore be very inclined to pursue that plan.

But I think it's not true that my relationship-shard is optimizing its own future activations by extremizing future concept activations. I think that this plan won't get found, and the agent won't want to find this plan. Values are not the optimization target. (This point explained in more detail: Alignment allows "nonrobust" decision-influences and doesn't require robust grading)

I think this is still one of the most comprehensive and clear resources on counterpoints to x-risk arguments. I have referred to this post and pointed people to a number of times. The most useful parts of the post for me were the outline of the basic x-risk case and section A on counterarguments to goal-directedness (this was particularly helpful for my thinking about threat models and understanding agency). 

Great post!

A. Contra “superhuman AI systems will be ‘goal-directed’”

I somewhat agree, see Consequentialism & Corrigibility. I’m a bit unclear on whether this is intended as an argument for “AGI almost definitely won’t have a zealous drive to control the universe” versus “AGI won’t necessarily have a zealous drive to control the universe”. I agree with the latter but not the former.

Also, the more different groups make AGIs, the more likely it is that someone will make one with a “zealous drive to control the universe”. Then we have to think about whether the non-zealous ones will have solved the problem posed by the zealous ones. In this context, there starts to be a contradiction between “we don’t need to worry about the non-zealous ones because they won’t be doing hardcore long-term consequentialist planning” versus “we don’t need to worry about the zealous ones because the non-zealous ones are so powerful and foresightful that, whatever plan the latter might come up with, the former can preemptively think of it and defend against it”. More on this topic in a forthcoming post hopefully in the next couple weeks. (EDIT—I added the link)

B. Contra “goal-directed AI systems’ goals will be bad”

I somewhat agree, see Section 14.6 here. Comments above also apply here, e.g. it’s not obvious that docile helpful human-norm-following AGIs will actually do what’s necessarily to defend against zealous universe-controlling AGIs, again wait for my forthcoming post.

Contra “superhuman AI would be sufficiently superior to humans to overpower humanity”

I mostly see these comments as arguments that “AI that can overpower humanity” might happen a bit later than one might otherwise expect, rather than arguments that it’s not going to happen at all. For example, if collaborative groups of humans are more successful than individual humans, well, sooner or later we’re going to have collaborative groups of AIs too. By the time we have a whole society of trillions of AIs, it stops feeling very reassuring. (The ability of AIs to self-replicate seems particularly relevant here.) If humans-using-tools are powerful, well sooner or later (I would argue sooner) AIs are going to be using tools too. (And inventing new tools.) The trust issue stops applying when we get to a world where AIs can start their own companies etc., and thus only need to trust each other (and the “each other” might be copies of themselves). The headroom argument seems adjacent to the lump-of-labor fallacy.

Hmm, OK, I guess the real point of all that is to argue for slow takeoff which then implies that doom is unlikely? (“at some point AI systems would account for most of the cognitive labor in the world. But if there is first an extended period of more minimal advanced AI presence, that would probably prevent an immediate death outcome, and improve humanity’s prospects for controlling a slow-moving AI power grab.”) Again, I’m not quite sure what we’re arguing. I think there’s still serious x-risk regardless of slow vs fast takeoff, and I think there’s still “less than certain doom” regardless of slow vs fast takeoff. In fact, I’m not even confident that x-risk is lower under slow takeoff than fast.

Well anyway, I have an object-level belief that there are already way more than enough GPUs on the planet to support AIs that can overpower humanity—see here—and I think that will be much more true by the time we have real-deal AGIs (which I for one expect to be probably after 2030 at least). I agree that this is a relevant empirical question though.

The idea that a superhuman AI would be able to rapidly destroy the world seems prima facie unlikely, since no other entity has ever done that.

I think there’s pretty good direct reason to believe that it is currently possible to start lots of simultaneously deadly pandemics and crop diseases etc., with an amount of competence already available to small teams of humans or maybe even individual humans. But we don’t currently have ongoing deliberate pandemics. I consider this pretty strong evidence that nobody on Earth with even moderate competence is trying to “destroy the world”, so to speak. So the fact that nobody has succeeded at doing so doesn’t really provide much evidence about the tractability of doing that. (Again, more on this topic in a forthcoming post.)

Competence does not seem to aggressively overwhelm other advantages in humans: 

[...]

g. One might counter-counter-argue that humans are very similar to one another in capability, so even if intelligence matters much more than other traits, you won’t see that by looking at  the near-identical humans. This does not seem to be true. Often at least, the difference in performance between mediocre human performance and top level human performance is large, relative to the space below, iirc. For instance, in chess, the Elo difference between the best and worst players is about 2000, whereas the difference between the amateur play and random play is maybe 400-2800 (if you accept Chess StackExchange guesses as a reasonable proxy for the truth here).

The usage of capabilities/competence is inconsistent here. In points a-f, you argue that general intelligence doesn't aggressively overwhelm other advantages in humans. But in point g, the ELO difference between the best and worst players is less determined by general intelligence than by how much practice people have had.

If we instead consistently talk about domain-relevant skills: In the real world, we do see huge advantages from having domain-specific skills. E.g. I expect elected representatives to be vastly better at politics than medium humans.

If we instead consistently talk about general intelligence: The chess data doesn't falsify the hypothesis that human-level variation in general intelligence is small. To gather data about that, we'd want to analyse the ELO-difference between humans who have practiced similarly much but who have very different g.

(There are some papers on the correlation between intelligence and chess performance, so maybe you could get the relevant data from there. E.g. this paper says that (not controlling for anything) most measurements of cognitive ability correlates with chess performance at about ~0.24 (including IQ iff you exclude a weird outlier where the correlation was -0.51).)

Thanks for writing this!

Regarding your point on corporations: One of the reasons to worry about some forms of AI is that they might soon build other, more powerful forms of AI. So the development of very human-like Ems, for example might lead relatively quickly to the development of de novo AI, and so on; hence we worry about Ems even if we think extremely human-like Ems do not pose an x-risk on their own. In the same way, corporations are the ones moving forward fastest on building ML-based AI, and the misalignment between corporations and the long-term future of life on Earth is a very significant cause of the overall level of AI-related x-risk in the world today. So if someone had said 500 years ago "hey let's not build corporations because they will probably be subtly or overtly misaligned with us and that will lead to the destruction of life on Earth", then fastforward to today and it seems like that person has been proven correct.

I really like this post. I also like that you provide concrete and specific observables which you think would obtain under each counterargument. I found it refreshing to imagine so many non-orthodox futures.

Small differences in utility functions may not be catastrophic

For three months, I have been sitting on a post (originally) called "What's up with humans with different values not wanting to kill each other?". It seems to me like "value has to be perfect or Goodhart into oblivion" just... doesn't make sense, that isn't how the world works AFAICT. But I gave up on pressing this point because I wasn't able to communicate the obvious-to-me misprediction made by "imperfection" arguments. Maybe I'll publish that post now. (EDIT: Published!)

Eliezer's original "value is fragile" argument doesn't claim that all perturbations (setting to zero) shatter value into oblivion, but rarely do I perceive people to be considering the dimensions along which value is robust, rather than (AFAICT) reasoning "imperfection  Goodhart  doom." (And "If you didn't get bored, that'd suck" is importantly different from "If the AI doesn't care about you being entertained in just the right way, that'd suck", but I digress...)

AI agents may not be radically superior to combinations of humans and non-agentic machines

On the model of "economic pressure explains a lot of AI outcomes", I disagree with this counterargument because of intuitions around Ahmdahl's law (see Gwern here). 

However, this feels somewhat irrelevant. It sure feels to me like a better model is "people do things which are cool and push limits, especially if that makes money." Even if "centaur" human+AI hybrid processes are economically competitive on eg running corporations, I expect Mooglebook to train an agent anyways sooner or later.

Unclear that many goals realistically incentivise taking over the universe

It seems like people often implicitly assume some kind of space-time additivity of utility, where entities want to "tile" the universe with something. That goals are "grabby" by default. This seems plausible to me but not slam-dunk. 

(It's unclear that I should care about far-away people the same as I do about nearby people. Suppose an AI's main decision-influence is gathering diamonds around it. Should this AI generalize its values to caring about diamonds throughout the cosmos and throughout Tegmark IV? I think "not necessarily." If not, then "AI goals will tile across spacetime and relaities" seems quantitative and uncertain to me.)

The argument overall proves too much about corporations

I agree. I think many alignment arguments prove too much, or are taken too far, especially without relying on specifics of eg SGD dynamics. For example, selection-style arguments can be useful for considering failure modes, but often seem to be taken as open-and-shut counterarguments to proposals. 

"That's selecting for deception." So? Evolution selected for wolves with biological sniper rifles, and didn't get wolves with biological sniper rifles. Evolution selected for humans caring about IGF, and didn't get humans to care about IGF. (For more, see Reward is not the optimization target, anticipated question no.2)

I have now published a conversation between Ege Erdil and Ronny Fernandez about this post. You can find it here.

I expect you could build a system like this that reliably runs around and tidies your house say, or runs your social media presence, without it containing any impetus to become a more coherent agent (because it doesn’t have any reflexes that lead to pondering self-improvement in this way).

I agree, but if there is any kind of evolutionary variation in the thing then surely the variations that move towards stronger goal-directedness will be favored.

I think that overcoming this molochian dynamic is the alignment problem: how do you build a powerful system that carefully balances itself and the whole world in such a way that does not slip down the evolutionary slope towards pursuing psychopathic goals by any means necessary?

I think this balancing is possible, it's just not the default attractor, and the default attractor seems to have a huge basin.

I really appreciate this post!

For instance, employers would often prefer employees who predictably follow rules than ones who try to forward company success in unforeseen ways.

Fascinatingly, EA employers in particular seem to seek employees who do try to forward organization goals in unforeseen ways!

Promoted to curated: I found engaging with this post quite valuable. I think in the end I disagree with the majority of arguments in it (or at least think they omit major considerations that have previously been discussed on LessWrong and the AI Alignment Forum), but I found thinking through these counterarguments and considering each one of them seriously a very valuable thing to do to help me flesh out my models of the AI X-Risk space.