Response to Katja Grace's AI x-risk counterarguments

Johannes Treutlein

Thanks for this response, I'm glad to see more public debate on this!

The part of Katja's part C that I found most compelling was the argument that for a given AI system its best interests might be to work within the system rather than aiming to seize power. Your response argues that even if this holds true for AI systems that are only slightly superhuman, eventually we will cross a threshold where a single AI system can takeover. This seems true if we hold the world fixed -- there is some sufficiently capable AI system that can take over the 2022 world. But this capability threshold is a moving target: humanity will get better at aligning and controlling AI systems as we gain more experience with them, and we may be able to enlist the help of AI systems to keep others in check. So, why should we expect the equilibrium here to be an AI takeover, rather than AIs working for humans because that it is in their selfish best interest in a market economy where humans are currently the primary property owner?

I think the crux here is whether we expect AI systems to by default collude with one another. They might -- they have a lot of things in common that humans don't, especially if they're copies of one another! But coordination in general is hard, especially if it has to be surreptitious.

As an analogy, I could argue that for much of human history soldiers were only slightly more capable than civilians. Sure, a trained soldier with a shield and sword is a fearsome opponent, but a small group of coordinated civilians could be victorious. Yet as we develop more sophisticated weapons such as guns, cannons, missiles, the power that a single soldier has grows greater and greater. So, by your argument, eventually a single soldier will be powerful enough to take over the world.

This isn't totally fanciful -- the Spanish conquest of the Inca Empire started with just 168 soldiers! The Spanish fought with swords, crossbows, and lances -- if the Inca Empire were still around, it seems likely that a far smaller modern military force could defeat them. Yet, clearly no single soldier is in a position to take over the world, or even a small city. Military coup d'etats are the closest, but involve convincing a significant fraction of the military that is in their interest to seize power. Of course most soldiers wish to serve their nation, not seize power, which goes some way to explaining the relatively low rate of coup attempts. But it's also notable that many coup attempts fail, or at least do not lead to a stable military dictatorship, precisely because of difficulty of internal coordination. After all, if someone intends to destroy the current power structure and violate their promises, how much can you trust that they'll really have your back if you support them?

An interesting consequence of this is that it's ambiguous whether making AI more cooperative makes the situation better or worse.

[-]Erik Jenner3y30

Interesting points, I agree that our response to part C doesn't address this well.

AI's colluding with each other is one mechanism for how things could go badly (and I do think that such collusion becomes pretty likely at some point, though not sure it's the most important crux). But I think there are other possible reasons to worry as well. One of them is a fast takeoff scenario: with fast takeoff, the "AIs take part in human societal structures indefinitely" hope seems very unlikely to me, so 1 - p(fast takeoff) puts an upper bound on how much optimism we can derive from that. It's harder to make an airtight x-risk argument using fast takeoff, since I don't think we have airtight arguments for p(fast takeoff) being close to 1, but still important to consider if we're figuring out our overall best guess, rather than trying to find a reasonably compact argument for AI x-risk. (To put this differently: the strongest argument for AI x-risk will of course consider all the ways in which things could go wrong, rather than just one class of ways that happens to be easiest to argue for).

A more robust worry (and what I'd probably rely on for a compact argument) is something like What Failure Looks Like Part 1: maybe AIs work within the system, in the sense that they don't take over the world in obvious, visible ways. They usually don't break laws in ways we'd notice, they don't kill humans, etc. On paper, humans "own" the entire economy, but in practice, they have an increasingly hard time achieving outcomes they want (though they might not notice that, at least for a while).This seems like a mechanism for AIs to collectively "take over the world" (in the sense that humans don't actually have control of the long-run trajectory of the universe anymore), even if no individual AI can break out of the system, and if AIs aren't great at collaborating against humanity.

Addressing a few specific points:

humanity will get better at aligning and controlling AI systems as we gain more experience with them,

True to some extent, but I'd expect AI progress to be much faster than human improvement at dealing with AI (the latter is also bounded presumably). So I think the crux is the next point:

and we may be able to enlist the help of AI systems to keep others in check.

Yeah, that's an important point. I think the crux boils down to how well approaches like IDA or debate are going to work? I don't think that we currently know exactly how to make them work sufficiently well for this, I have less strong thoughts on whether they can be made to work or how difficult that would be.

[-]AdamGleave3y20

I agree that in a fast takeoff scenario there's little reason for an AI system to operate withing existing societal structures, as it can outgrow them quicker than society can adapt. I'm personally fairly skeptical of fast takeoff (<6 months say) but quite worried that society may be slow enough to adapt that even years of gradual progress with a clear sign that transformative AI is on the horizon may be insufficient.

In terms of humans "owning" the economy but still having trouble getting what they want, it's not obvious this is a worse outcome than the society we have today. Indeed this feels like a pretty natural progression of human society. Humans already interact with (and not so infrequently get tricked or exploited by) entities smarter than them such as large corporations or nation states. Yet even though I sometimes find I've bought a dud on the basis of canny marketing, overall I'm much better off living in a modern capitalist economy than the stone age where humans were more directly in control.

However, it does seem like there's a lot of value lost in the scenario where humans become increasingly disempowered, even if their lives are still better than in 2022. From a total utilitarian perspective, "slightly better than 2022" and "all humans dead" are rounding errors relative to "possible future human flourishing". But things look quite different under other ethical views, so I'm reluctant to conflate these outcomes.

[-]Johannes Treutlein3y10

I think such a natural progression could also lead to something similar to extinction (in addition to permanently curtailing humanity's potential). E.g., maybe we are currently in a regime where optimizing proxies harder still leads to improvements to the true objective, but this could change once we optimize those proxies even more. The natural progression could follow an inverted U-shape.

E.g., take the marketing example. Maybe we will get superhuman persuasion AIs, but also AIs that protect us from persuasive ads and AIs that can provide honest reviews. It seems unclear whether these things would tend to balance out, or whether e.g. everyone will inevitably be exposed to some persuasion that causes irreparable damage. Of course, things could also work out better than expected, if our ability to keep AIs in check scales better than dangerous capabilities.

[-]Rudi C3y00

This problem of human irrelevancy seems somewhat orthogonal to the alignment problem; even a maximally aligned AI will strip humans of their agency, as it knows best. Making the AI value human agency will not be enough; humans suck enough that the other objectives will override the agency penalty most of the time, especially in important matters.

[-]Erik Jenner3y10

I agree that aligned AI could also make humans irrelevant, but not sure how that's related to my point. Paraphrasing what I was saying: given that AI makes humans less relevant, unaligned AI would be bad even if no single AI system can take over the world. Whether or not aligned AI would also make humans irrelevant just doesn't seem important for that argument, but maybe I'm misunderstanding what you're saying.

[-]David Scott Krueger (formerly: capybaralet)3y50

This is a great response to a great post!
I mostly agree with the points made here, so I'll just highlight differences in my views.

Briefly, I think Katja's post provides good arguments for (1) "things will go fine given slow take-off", but this post interprets it as arguing for (2) "things will go fine given AI never becomes dangerously capable". I don't think the arguments here do quite enough to refute claim (1), although I'm not sure they are meant to, given the scope ("we are not discussing").

A few other notable differences:

We don't necessarily need to reach some "safe and stable state". X-risk can decrease over time rapidly enough that total x-risk over the lifespan of the universe is less than 1.
I would add "x-safety is a common good" as a key concern in addition to "People will keep building better and better AI systems".
A lot of my concerns in slow take off scenarios look sort of like "AI supercharges corporations that pursue misaligned objectives". (I'm not sure how much of a disagreement this actually is).
"But we also think that most of the gaps it describes in the AI x-risk case have already been addressed elsewhere or vanish when using a slightly different version of the AI x-risk argument." <-- I think this is too strong.
EtA: I am still more concerned about "not enough samples to learn human preferences" than ELK or inner optimization type failures. This seems to be a fairly unpopular view, and I haven't scrutinized it too much (but would be interested to discuss it cooperatively).

[-]Noosphere893y23

EtA: I am still more concerned about "not enough samples to learn human preferences" than ELK or inner optimization type failures. This seems to be a fairly unpopular view, and I haven't scrutinized it too much (but would be interested to discuss it cooperatively).

This is a crux for me, as it is why I don't think slow takeoff is good by default. I think deceptive alignment is the default state barring interpretability efforts that are strong enough to actually detect mesa-optimizers or myopia. Yes, Foom is probably not going to happen, but in my view that doesn't change much regarding risk in total.

[-]David Scott Krueger (formerly: capybaralet)3y10

TBC, "more concerned" doesn't mean I'm not concerned about the other ones... and I just noticed that I make this mistake all the time when reading people say they are more concerned about present-day issues than x-risk....... hmmm........

[-]Erik Jenner3y21

Thanks for the interesting comments!

Briefly, I think Katja's post provides good arguments for (1) "things will go fine given slow take-off", but this post interprets it as arguing for (2) "things will go fine given AI never becomes dangerously capable". I don't think the arguments here do quite enough to refute claim (1), although I'm not sure they are meant to, given the scope ("we are not discussing").

Yeah, I didn't understand Katja's post as arguing (1), otherwise we'd have said more about that. Section C contains reasons for slow take-off, but my crux is mainly how much slow takeoff really helps (most of the reasons I expect iterative design to fail for AI still apply, e.g. deception or "getting what we measure"). I didn't really see arguments in Katja's post for why slow takeoff means we're fine.

We don't necessarily need to reach some "safe and stable state". X-risk can decrease over time rapidly enough that total x-risk over the lifespan of the universe is less than 1.

Agreed, and I think this is a weakness of our post. I have a sense that most of the arguments you could make using the "existentially secure state" framing could also be made more generally, but I haven't figured out a framing I really like yet unfortunately.

EtA: I am still more concerned about "not enough samples to learn human preferences" than ELK or inner optimization type failures. This seems to be a fairly unpopular view, and I haven't scrutinized it too much (but would be interested to discuss it cooperatively).

Would be interested in discussing this more at some point. Given your comment, I'd now guess I dismissed this too quickly and there are things I haven't thought of. My spontaneous reasoning for being less concerned about this is something like "the better our models become (e.g. larger and larger pretrained models), the easier it should be to make them output things humans approve of". An important aspect is also that this is the type of problem where it's more obvious if things are going wrong (i.e. iterative design should work here---as long as we can tell the model isn't aligned yet, it seems more plausible we can avoid deploying it).

[-]David Scott Krueger (formerly: capybaralet)3y20

Responding in order:
1) yeah I wasn't saying it's what her post is about. But I think you can get two more interesting cruxy stuff by interpreting it that way.
2) yep it's just a caveat I mentioned for completeness.
3) Your spontaneous reasoning doesn't say that we/it get(/s) good enough at getting it to output things humans approve of before it kills us. Also, I think we're already at "we can't tell if the model is aligned or not", but this won't stop deployment. I think the default situation isn't that we can tell if things are going wrong, but people won't be careful enough even given that, so maybe it's just a difference of perspective or something... hmm.......

^{^}

As an aside, we do in fact think that in the long term, economic incentives could lead to very bad outcomes (via Moloch/race to the bottom-style dynamics). But this seems to happen far more slowly than AI capability gains, if at all.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

32

Response to Katja Grace's AI x-risk counterarguments

32

Summary

Notes on the basic case for AI x-risk

Why will we keep building more and more capable AI systems?

Responses to specific counterarguments

A. Contra “superhuman AI systems will be ‘goal-directed’”

B. Contra “goal-directed AI systems’ goals will be bad”

C. Contra “superhuman AI would be sufficiently superior to humans to overpower humanity”

D. Contra the whole argument

Summary of potential cruxes