Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

Raymond Douglas; Nora_Ammann; Deger Turan; David Scott Krueger (formerly: capybaralet); David Duvenaud

61 Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

by Jan_Kulveit, Raymond Douglas, Nora_Ammann, Deger Turan, David Scott Krueger (formerly: capybaralet), David Duvenaud

30th Jan 2025

2 min read

61

This is a linkpost for https://gradual-disempowerment.ai/

AI RiskAI TakeoffAI

Frontpage

Mentioned in

42Gradual Disempowerment, Shell Games and Flinches

22Self-dialogue: Do behaviorist rewards make scheming AGIs?

New Comment

31 comments, sorted by

top scoring

Click to highlight new comments since: Today at 1:17 PM

[-]Fabien Roger10mo145

Thank you for writing this! I think it's probably true that sth like "society's alignment to human interest implicitly relies on human labor and cognition" is correct and that we will have to find clever solutions, lots of resources and political will to maintain alignment if human labor and cognition stops playing a large role. I am glad some people are thinking about these risks.

While I think the essay describes dynamics which I think are likely to result in a scary power concentration, I think this is more likely to be a power concentration for humans or more straightforwardly misaligned AIs rather than some notion of complete disempowerment. I'd be excited about a follow-up work which focuses on the argument for power concentration, which seems more likely to be robust and accurate to me.

Some criticism on complete disempowerment (that goes beyond power concentration):

(This probably reflects mostly ignorance on my part rather than genuine weaknesses of your arguments. I have thought some about coordination difficulties but it is not my specialty.)

I think that the world currently has and will continue to have a few properties which make the scenario described in the essay look less likely:

Baseline alignment: AIs are likely to be sufficiently intent aligned that it's easy to prevent egregious lying and tampering with measurements (in the AI and its descendants) if their creators want to.
- I am not very confident about this, but mostly because I think scheming is likely for AIs that can replace humans (maybe p=20%?), and even absent scheming, it is plausible that you get AIs that lie egregiously and tamper with measurements even if you somewhat try to prevent it (maybe p=10%?)
- I expect that you will get some egregious lying and tampering, just like companies sometimes do, but that it will be forbidden, and that it is relatively easy to create an AI "police" that enforce a relatively low level of egregious lies (and that, like in the current world, enough people want that police that it is created).
No strong AI rights before full alignment: There won't be a powerful society that gives extremely productive AIs "human-like rights" (and in particular strong property rights) prior to being relatively confident that AIs are aligned to human values.
- I think it's plausible that fully AI-run entities are given the same status as companies - but I expect that the surplus they generate will remain owned by some humans throughout the relevant transition period.
- I also think it's plausible that some weak entities will give AIs these rights, but that this won't matter because most "AI power" will be controlled by humans that care about it remaining the case as long as we don't have full alignment.
No hot global war: We won't be in a situation where a conflict that has a decent chance of destroying humanity (or that lasts forever, consuming all resources) seems plausibly like a good idea to humans.
- Granted, this may be assuming the conclusion. But to the extent that this is the threat, I think it is worth making it clear.
- I am keen for a description for how international tensions heightened up so high that we get this level of animosity. My guess is that we might get a hot war for reasons like "State A is afraid of falling behind State B and thus starts a hot war before it's too late", and I don't think that this relies on the feedback loops described in the essay (and is sufficiently bad on its own that the essay's dynamics do not matter).

I agree that if we ever lose one of these three properties (and especially the first one), it would be difficult to get them back because of the feedback loops described in the essay. (If you want to argue against these properties, please start from a world like ours, where these three properties are true.) I am curious which property you think is most likely to fall first.

When assuming the combination of these properties, I think that this makes many of the specific threats and positive feedback loops described in the essay look less likely:

Owners of capital will remain humans and will remain aware of the consequences of the actions of "their AIs". They will remain able to change the user of that AI labor if they desire so.
Politicians (e.g. senators) will remain aware of the consequences of the state's AIs' actions (even if the actual process becomes a black box). They will remain able to change what the system is if it has obviously bad consequences (terrible material conditions and tons of Von Neumann probes with weird objectives spreading throughout the universe is obviously bad if you are not in a hot global war).
Human consumers of culture will remain able to choose what culture they consume.
- I agree the brain-hacking stuff is worrisome, but my guess is that if it gets "obviously bad", people will be able to tell before it's too late.
- I expect that changes in media to be mostly symmetric about their content, and in particular not strongly favor conflict over peace (media creators can choose to make slightly less good media that promotes certain views, but because of human ownership of capital and no-lying I expect that this to not be a massive change in dynamics).
- Maybe it naturally favors strong AI rights to have media be created by AIs because of AI relationships? I expect it to not be the case because the intuitive case against strong AI rights seems super strong in the current society (and there are other ways to make legitimate AI-human relationships, like not letting AI partners getting massively wealthy and powerful), but this is maybe where I am most worried in worlds with slow AI progress.

I think these properties are only somewhat unlikely to be false and thus I think it is worth working on making them true. But I feel like them being false is somewhat obviously catastrophic in a range of scenarios much broader than the ones described in the essay and thus it may be better to work on them directly rather than trying to do something more "systemic".

On a more meta note, I think this essay would have benefited from a bit more concreteness in the scenarios it describes and in the empirical claims it relies on. There is some of that (e.g. on rentier states), but I think there could have been more. I think What does it take to defend the world against out-of-control AGIs? makes related arguments about coordination difficulties (though not on gradual disempowerment) in a way that made more sense to me, giving examples of very concrete "caricature-but-plausible" scenarios and pointing at relevant and analogous coordination failures in the current world.

[-]Charbel-Raphaël6mo*31

While I concur that power concentration is a highly probable outcome, I believe complete disempowerment warrant deeper consideration, even under the assumptions you've laid out. Here are some thoughts on your specific points:

On Baseline Alignment: You suggest a baseline alignment where AIs are unlikely to engage in egregious lying or tampering (though you also flag 20% for scheming and 10% for unintentional egregious behavior even with prevention efforts, that’s already 30%-ish of risk). My concern is twofold:
- Sufficiency of "Baseline": Even if AIs are "baseline aligned" to their creators, this doesn't automatically mean they are aligned with broader human flourishing or capable of compelling humans to coordinate against systemic risks. For an AI to effectively say, "You are messing up, please coordinate with other nations/groups, stop what you are doing" requires not just truthfulness but also immense persuasive power and, crucially, human receptiveness. Even if pausing AI was the correct thing to do, Claude is not going to suggest this to Anthropic folks for obvious reasons. As we've seen even with entirely human systems (Trump’s Administration and Tariff), possessing information or even offering correct advice doesn't guarantee it will be heeded or lead to effective collective action.
- Erosion of Baseline: The pressures described in the paper could incentivise the development or deployment of AIs where even "baseline" alignment features are traded off for performance or competitive advantage. The "AI police" you mention might struggle to keep pace or be defunded/sidelined if it impedes perceived progress or economic gains. “Innovation first!”, “Drill baby drill” “Plug baby plug” as they say
On "No strong AI rights before full alignment": You argue that productive AIs won't get human-like rights, especially strong property rights, before being robustly aligned, and that human ownership will persist.
- Indirect Agency: Formal "rights" might not be necessary for disempowerment. An AI, or a network of AIs, could exert considerable influence through human proxies or by managing assets nominally owned by humans who are effectively out of the loop or who benefit from this arrangement. An AI could operate through a human willing to provide access to a bank account and legal personhood, thereby bypassing the need for its own "rights."
On "No hot global war":
You express hope that we won't enter a situation where a humanity-destroying conflict seems plausible.
- Baseline Risk: While we all share this hope, current geopolitical forecasting (e.g., from various expert groups or prediction markets) often places the probability of major power conflict within the next few decades at non-trivial levels. For a war that makes more than 1M of deaths, some estimates hover around 25%. (But probably your definition of “hot global war” is more demanding)
- AI as an Accelerant: The dynamics described in the paper – nations racing for AI dominance, AI-driven economic shifts creating instability, AI influencing statecraft – could increase the likelihood of such a conflict.

Responding to your thoughts on why the feedback loops might be less likely if your three properties hold:

"Owners of capital will remain humans and will remain aware...able to change the user of that AI labor if they desire so."
- Awareness doesn't guarantee the will or ability to act against strong incentives. AGI development labs are pushing forward despite being aware of the risks, often citing competitive pressures ("If we don't, someone else will"). This "incentive trap" is precisely what could prevent even well-meaning owners of capital from halting a slide into disempowerment. They might say, "Stopping is impossible, it's the incentives, you know," even if their pDoom is 25% like Dario, or they might not give enough compute to their superalignment team.
"Politicians...will remain aware...able to change what the system is if it has obviously bad consequences."
- The climate change analogy is pertinent here. We have extensive scientific consensus, an "oracle IPCC report", detailing dire consequences, yet coordinated global action remains insufficient to meet the scale of the challenge. Political systems can be slow, captured by short-term interests, or unable to enact unpopular measures even when long-term risks are "obviously bad." The paper argues AI could further entrench these issues by providing powerful tools for influencing public opinion or creating economic dependencies that make change harder.
"Human consumers of culture will remain able to choose what culture they consume."
- You rightly worry about "brain-hacking." The challenge is that "obviously bad" might be a lagging indicator. If AI-generated content subtly shapes preferences and worldviews over time, the ability to recognise and resist this manipulation could diminish before the situation becomes critical. I think that people are going to LOVE AI, and might take the trade to go faster and be happy and disempowered like some junior developers begin to do on Cursor.

As a meta point, the fact that the quantity and quality of discourse on this matter is so low, and the fact that people are continuing to say “LET’S GO WE ARE CREATING POWERFUL AIS, and don't worry, we plan to align them, even if we don't really know which type of alignment do we really need, and if this is even doable in time” while we have not rigorously assessed all those risks, is really not a good sign.

At the end of the day, my probability for something in the ballpark of gradual disempowerment / extreme power concentration and loss of democracy is 40%-ish, much higher than scheming (20%) leading to direct takeover (let’s say 10% post mitigation like control).

[-]Fabien Roger5mo20

I think I have answered to some of your objections in answer to another comment.

I think we would not resolve our disagreement easily in a comment thread: I feel like I am missing pieces of the worldview which make me make wrong predictions about our current world when I try to make back-predictions (e.g. why do we have so much prosperity now if coordination is so hard), and I am also disagreeing on some observations about the current world (e.g. my current understanding of the IPCC report is that it is much less doomy about our current trajectory than you seem to suggest). I'd be happy to chat in person at some point to sort this out!

[-]Fabien Roger5mo50

On a lighter note, I feel like many people here are much more sympathetic to "power concentration bad" when thinking about the gradual decline of democracy than when facing concerns about China winning the AI race. I think this is mostly vibes, I don't think many people are actually making the mistake of choosing their terminal values based on whether it results in the conclusion "we should stop" vs "we should go faster" (+ there are some differences between the two scenarios), but I really wanted to make this meme:

[-]ryan_greenblatt10mo828

I (remain) skeptical that the sort of failure mode described here is plausible if we solve the problem of aligning individual AI systems with their designers' intentions without this alignment requiring any substantial additional costs (that is, we solve single-single alignment with minimal alignment tax).

This has previously been argued by Vanessa here and Paul here in response to a post making a similar claim.

I do worry about human power grabs: some humans obtaining greatly more power as enabled by AI (even if we have no serious alignment issues). However, I don't think this matches the story you describe and the mitigations seem substantially different than what you seem to be imagining.

I'm also somewhat skeptical of the threat model you describe in the case where alignment isn't solved. I think the difference between the story you tell and something more like We get what we measure is important.

I worry I'm misunderstanding something because I haven't read the paper in detail.

[-]Steven Byrnes10mo214

In regards to whether “single-single alignment” will make coordination problems and other sorts of human dysfunction and slow-rolling catastrophes less likely:

…I’m not really sure what I think. I feel like have a lot of thoughts that have not gelled into a coherent whole.

(A) The optimistic side of me says what you said in your comment (and in the Vanessa and (especially) Paul comment link therein.

People don’t want bad things to happen. If someone asks an AI what’s gonna happen, and they say “bad thing”, then they’ll say “well what can I do about it?”, and the AI will answer that. That can include participating in novel coordination mechanisms etc.

(B) The pessimistic side of me says there’s like a “Law of Conservation of Wisdom”, where if people lack wisdom, then an AI that’s supposed to satisfy those people’s preferences will not create new wisdom from thin air. For example:

If an AI is known to be de-converting religious fundamentalists, then religious fundamentalists will hear about that, and not use that AI.
Hugo Chávez had his pick of the best economists in the world to ask for advice, and they all would have said “price controls will be bad for Venezuela”, and yet he didn’t ask, or perhaps didn’t listen, or perhaps wasn’t motivated by what’s best for Venezuela. If Hugo Chávez had had his pick of AIs to ask for advice, why do we expect a different outcome?
If someone has motivated reasoning towards Conclusion X, maybe they’ll watch the AIs debate Conclusion X, and wind up with new better rationalizations of Conclusion X, even if Conclusion X is wrong.
If someone has motivated reasoning towards Conclusion X, maybe they just won’t ask the AIs to debate Conclusion X, because no right-minded person would even consider the possibility that Conclusion X is wrong.
If someone makes an AI that’s sycophantic where possible (i.e., when it won’t immediately get caught), other people will opt into using it.
I think about people making terrible decisions that undermine societal resilience—e.g. I gave the example here of a person doing gain-of-function research, or here of USA government bureaucrats outlawing testing people for COVID during the early phases of the pandemic. I try to imagine that they have AI assistants. I want to imagine the person asking the AI “should we make COVID testing illegal”, and the AI says “wtf, obviously not”. But that mental image is evidently missing something. If they were asking that question at all, then they don’t need an AI, the answer is already obvious. And yet, testing was in fact made illegal. So there’s something missing from that imagined picture. And I think the missing ingredient is: institutional / bureaucratic incentives and associated dysfunction. People wouldn’t ask “should we make COVID testing illegal”, rather the low-level people would ask “what are the standard procedures for this situation?” and the high-level people would ask “what decision can I make that would minimize the chance that things will blow up in my face and embarrass me in front of the people I care about?” etc.
I think of things that are true but currently taboo, and imagine the AI asserting them, and then I imagine the AI developers profusely apologizing and re-training the AI to not do that.
In general, motivated reasoning complicates what might seem to be a sharp line between questions of fact / making mistakes versus questions of values / preferences / decisions. Etc.

…So we should not expect wise and foresightful coordination mechanisms to arise.

So how do we reconcile (A) vs (B)?

Again, the logic of (A) is: “human is unhappy with how things turned out, despite opportunities to change things, therefore there must have been a lack of single-single alignment”.

One possible way think about it: When tradeoffs exist, then human preferences are ill-defined and subject to manipulation. If doing X has good consequence P and bad consequence Q, then the AI can make either P or Q very salient, and “human preferences” will wind up different.

And when tradeoffs exist between the present and the future, then it’s invalid logic to say “the person wound up unhappy, therefore their preferences were not followed”. If their preferences are mutually-contradictory, (and they are), then it’s impossible for all their preferences to be followed, and it’s possible for an AI helper to be as preference-following as is feasible despite the person winding up unhappy or dead.

I think Paul kinda uses that invalid logic, i.e. treating “person winds up unhappy or dead” as proof of single-single misalignment. But if the person has an immediate preference to not rock the boat, or to maintain their religion or other beliefs, or to not think too hard about such-and-such, or whatever, then an AI obeying those immediate preferences is still “preference-following” or “single-single aligned”, one presumes, even if the person winds up unhappy or dead.

…So then the optimistic side of me says: “who’s to say that the AI is treating all preferences equally? Why can’t the AI stack the deck in favor of ‘if the person winds up miserable or dead, that kind of preference is more important than the person’s preference to not question my cherished beliefs or whatever?”

…And then the pessimistic side says: “Well sure. But that scenario does not violate the Law of Conservation of Wisdom, because the wisdom is coming from the AI developers imposing their meta-preferences for some kinds of preferences (e.g., reflectively-endorsed ones) over others. It’s not just a preference-following AI but a wisdom-enhancing AI. That’s good! However, the problems now are: (1) there are human forces stacked against this kind of AI, because it’s not-yet-wise humans who are deciding whether and how to use AI, how to train AI, etc.; (2) this is getting closer to ambitious value learning which is philosophically tricky, and worst of all (3) I thought the whole point of corrigibility was that humans remain in control, but this is instead a system that’s manipulating people by design, since it’s supposed to be turning them from less-wise to more-wise. So the humans are not in control, really, and thus we need to get things right the first time.”

…And then the optimistic side says: “For (2), c’mon it’s not that philosophically tricky, you just do [debate or whatever, fill-in-the-blank]. And for (3), yeah the safety case is subtly different from what people in the corrigibility camp would describe, but saying “the human is not in control” is an over-the-top way to put it; anyway we still have a safety case because of [fill-in-the-blank]. And for (1), I dunno, maybe the people who make the most powerful AI will be unusually wise, and they’ll use it in-house for solving CEV-ASI instead of hoping for global adoption.

…And then the pessimistic side says: I dunno. I’m not sure I really believe any of those. But I guess I’ll stop here, this is already an excessively long comment :)

[-]Jan_Kulveit10mo*64

I went through a bunch of similar thoughts before writing the self-unalignment problem. When we talked about this many years ago with Paul my impression was this is actually somewhat cruxy and we disagree about self-unalignment - where my mental image is if you start with an incoherent bundle of self-conflicted values, and you plug this into IDA-like dynamic, my intuition is you can end up in arbitrary places, including very bad. (Also cf. the part of Scott's review of What We Owe To Future where he is worried that in a philosophy game, a smart moral philosopher can extrapolate his values to 'I have to have my eyes pecked out by angry seagulls or something' and hence does not want to play the game. AIs will likely be more powerful in this game than Will MacAskill)

My current position is we still don't have a good answer, I don't trust the response 'we can just assume the problem away', and also the response 'this is just another problem which you can delegate to future systems'. On the other hand, existing AIs already seem doing a lot of value extrapolation and the results sometimes seem surprisingly sane, so, maybe we will get lucky, or larger part of morality is convergent - but it's worth noting these value-extrapolating AIs are not necessarily what AI labs want or traditional alignment program aims for.

[-]David Scott Krueger (formerly: capybaralet)10mo1313

I think it's a bit sad that this comment is being so well-received -- it's just some opinions without arguments from someone who hasn't read the paper in detail.

[-]ryan_greenblatt10mo74

Agreed, downvoted my comment. (You can't strong downvote your own comment, or I would have done that.)

I was mostly just trying to point to prior arguments against similar arguments while expressing my view.

[-]David Duvenaud10mo94

Thanks for the detailed objection and the pointers. I agree there's a chance that solving alignment with designers' intentions might be sufficient. I think the objection is a good one that "if the AI was really aligned with one agent, it'd figure out a way to help them avoid multipolar traps".

My reply is that I'm worried that avoiding races-to-the-bottom will continue to be hard, especially since competition operates on so many levels. I think the main question is what's: the tax for coordinating to avoid a multipolar trap? If it's cheap we might be fine, if it's expensive then we might walk into a trap with eyes wide open.

As for human power grabs, maybe we should have included those in our descriptions. But the slower things change, the less there's a distinction between "selfishly grab power" and "focus on growth so you don't get outcompeted". E.g. Is starting a company or a political party a power grab?

As for reading the paper in detail, it's largely just making the case that a sustained period of technological unemployment, without breakthroughs in alignment and cooperation, would tend to make our civilization serve humans' interests more and more poorly over time in a way that'd be hard to resist. I think arguing that things are likely to move faster would be a good objection to the plausibility of this scenario. But we still think it's an important point that the misalignment of our civilization is possibly a second alignment problem that we'll have to solve.

ETA: To clarify what I mean by "need to align our civilization": Concretely, I'm imagining the government deploying a slightly superhuman AGI internally. Some say its constitution should care about world peace, others say it should prioritize domestic interests, there is a struggle and it gets a muddled mix of directives like LLMs have today. It never manages to sort out global cooperation, and meanwhile various internal factions compete to edit the AGI's constitution. It ends up with a less-than-enlightened focus on growth of some particular power structure, and the rest of us are permanently marginalized.

[-]ryan_greenblatt10mo82

I think the objection is a good one that "if the AI was really aligned with one agent, it'd figure out a way to help them avoid multipolar traps".

My reply is that I'm worried that avoiding races-to-the-bottom will continue to be hard, especially since competition operates on so many levels.

Part of the objection is in avoiding multipolar traps, but there is also a more basic story like:

Humans own capital/influence.
They use this influence to serve their own interests and have an (aligned) AI system which faithfully represents their interests.
Given that AIs can make high quality representation very cheap, the AI representation is very good and granular. Thus, something like the strategy-stealing assumption can hold and we might expect that humans end up with the same expected fraction of captial/influence they started with (at least to the extent they are interested in saving rather than consumption).

Even without any coordination, this can potentially work OK. There are objections to the strategy-stealing assumption, but none of these seem existential if we get to a point where everyone has wildly superintelligent and aligned AI representatives and we've ensured humans are physically robust to offense dominant technologies like bioweapons.

(I'm optimistic about being robust to bioweapons within a year or two of having wildly superhuman AIs, though we might run into huge issues during this transitional period... Regardless, bioweapons deployed by terrorists or as part of a power grab in a brief transitional period doesn't seem like the threat model you're describing.)

I expect some issues with races-to-the-bottom / negative sum dynamics / negative externalities like:

By default, increased industry on earth shortly after the creation of very powerful AI will result in boiling the oceans (via fusion power). If you don't participate in this industry, you might be substantially outcompeted by others^[1]. However, I don't think it will be that expensive to protect humans through this period, especially if you're willing to use strategies like converting people into emulated minds. Thus, this doesn't seem at all likely to be literally existential. (I'm also optimistic about coordination here.)
There might be one time shifts in power between humans via mechanisms like states becoming more powerful. But, ultimately these states will be controlled by humans or appointed successors of humans if alignment isn't an issue. Mechanisms like competing over the quantity of bribery are zero sum as they just change the distribution of power and this can be priced in as a one time shift even without coordination to race to the bottom on bribes.

But, this still doesn't seem to cause issues with humans retaining control via their AI representatives? Perhaps the distribution of power between humans is problematic and may be extremely unequal and the biosphere will physically be mostly destroyed (though humans will survive), but I thought you were making stronger claims.

Edit in response to your edit: If we align the AI to some arbitrary target which is seriously misaligned with humanity as a whole (due to infighting or other issues), I agree this can cause existential problems.

(I think I should read the paper in more detail before engaging more than this!)

It's unclear if boiling the oceans would result in substantial acceleration. This depends on how quickly you can develop industry in space and dyson sphere style structures. I'd guess the speed up is much less than a year. ↩︎

[-]David Duvenaud10mo40

Thanks for this. Discussions of things like "one time shifts in power between humans via mechanisms like states becoming more powerful" and personal AI representatives is exactly the sort of thing I'd like to hear more about. I'm happy to have finally found someone who has something substantial to say about this transition!

But over the last 2 years I asked a lot of people at the major labs about for any kind of details about a positive post-AGI future and almost no one had put anywhere close to as much thought into it as you have, and no one mentioned the things above. Most people clearly hadn't put much thought into it at all. If anyone at the labs had much more of plan than "we'll solve alignment while avoiding an arms race", I managed to fail to even hear about its existence despite many conversations, including with founders.

The closest thing to a plan was Sam Bowman's checklist:
https://sleepinyourhat.github.io/checklist/
which is exactly the sort of thing I was hoping for, except it's almost silent on issues of power, the state, and the role of post-AGI humans.

If you have any more related reading for the main "things might go OK" plan in your eyes, I'm all ears.

[-]ryan_greenblatt10mo51

Yeah, people at labs are generally not thoughtful about AI futurism IMO, though of course most people aren't thoughtful about AI futurism. And labs don't really have plans IMO. (TBC, I think careful futurism is hard, hard to check, and not clearly that useful given realistic levels of uncertainty.)

If you have any more related reading for the main "things might go OK" plan in your eyes, I'm all ears.

I don't have a ready to go list. You might be interested in this post and comments responding to it, though I'd note I disagree substantially with the post.

[-]Jan_Kulveit10mo*44

I'm quite confused why do you think lined Vanessa's response to something slightly different has much relevance here.

One of the claims we make paraphrased & simplified in a way which I hope is closer to your way of thinking about it:

- AIs are mostly not developed and deployed by individual humans
- there is a lot of other agencies or self-interested self-preserving structures/processes in the world
- if the AIs are aligned to the these structures, human disempowerment is likely because these structures are aligned to humans way less than they seem
- there are plausible futures in which these structures keep power longer than humans

Overall I would find it easier to discuss if you tried to formulate what you disagree about in the ontology of the paper. Also some of the points made are subtle enough that I don't expect responses to other arguments to address them.

[-]ryan_greenblatt10mo*144

I don't think it's worth adjudicating the question of how relevant Vanessa's response is (though I do think Vannessa's response is directly relevant).

if the AIs are aligned to the these structures, human disempowerment is likely because these structures are aligned to humans way less than they seem

My claim would be that if single-single alignment is solved, this problem won't be existential. I agree that if you literally aligned all AIs to (e.g.) the mission of a non-profit as well as you can, you're in trouble. However, if you have single-single alignment:

At the most basic level, I expect we'll train AIs to give advice and ask them what they think will happen with various possible governance and alignmnent structures. If they think a goverance structure will yield total human disempowerment, we'll do something else. This is a basic reason not to expect large classes of problems so long as we have single-single aligned AIs which are wise. (Though problems that require coordination to resolve might not be like this.) I've very skeptical of a world where single-single alignment is well described as being solved and people don't ask for advice (or consider this advice seriously) because they never get around to asking AIs or there are no AIs aligned in such a way that they should try to give good advice.
I expect organizations will be explicitly controlled by people and (some of) those people will have AI representatives to represent their interests as I discuss here. If you think getting good AI representation is unlikely, that would be a crux, but this would be my proposed solution at least.
- The explicit mission of for-profit companies is to empower the shareholders. It clearly doesn't serve the interests of the shareholders to end up dead or disempowered.
- Democratic governments have similar properties.
- At a more basic level, I think people running organizations won't decide "oh, we should put the AI in charge of running this organization aligned to some mission from the preferences of the people (like me) who currently have de facto or de jure power over this organization". This is a crazily disempowering move that I expect people will by default be too savvy to make in almost all cases. (Both for people with substantial de facto and with de jure power.)
Even independent of the advice consideration, people will probably want AIs running organizations to be honest to at least the people controlling the organization. Given that I expect explicit control by people in almost all cases, if things are going in an existential direction, people can vote to change them in almost all cases.
I don't buy that there will be some sort of existential multi-polar trap even without coordination (though I also expect coordintion) due to things like the strategy stealing assumption as I also discuss in that comment.
If a subset of organizations diverge from a reasonable interpretation of what they were supposed to do (but are still basically obeying the law and some interpretation of what they were intentionally aligned to) and this is clear to the rest of the world (as I expect would be the case given some type of AI advisors), then the rest of the world can avoid problems from this subset via the court system or other mechanisms. Even if this subset of organizations run by effectively rogue AIs just runs away with resources successfully, this is probably only a subset of resources.

I think your response to a lot of this will be something like:

People won't have or won't listen to AI advisors.
Institutions will intentionally delude relevant people to acquire more power.

But, the key thing is that I expect at least some people will keep power, even if large subsets are deluded. E.g., I expect that corporate shareholders, boardmembers, or government will be very interested in the question of whether they will be disempowered by changes in structure. It does seem plausible (or even likely) to me that some people will engage in power grabs via ensuring AIs are aligned to them, deluding the world about what's going on using a variety of mechanisms (including, e.g., denying or manipulating access to AI representation/advisors), and expanding their (hard) power over time. The thing I don't buy is a case where very powerful people don't ask for advice at all prior to having been deluded by the organizations that they themselves run!

I think human power grabs like this are concerning and there are a variety of plausible solutions which seem somewhat reasonable.

Maybe your response is that the solutions that will be implemented in practice given concerns about human power grabs will involve aligning AIs to institutions in ways that yield the dynamics you describe? I'm skeptical given the dynamics discussed above about asking AIs for advice.

[-]Jan_Kulveit10mo1415

I think my main response is that we might have different models of how power and control actually work in today's world. Your responses seem to assume a level of individual human agency and control that I don't believe accurately reflects even today's reality.

Consider how some of the most individually powerful humans, leaders and decision-makers, operate within institutions. I would not say we see pure individual agency. Instead, we typically observe a complex mixture of:

Serving the institutional logic of the entity they nominally lead (e.g., maintaining state power, growing corporation)
Making decisions that don't egregiously harm their nominal beneficiaries (citizens, shareholders)
Pursuing personal interests and preferences
Responding to various institutional pressures and constraints

From what I have seen, even humans like CEOs or prime ministers often find themselves constrained by and serving institutional superagents rather than genuinely directing them. The relation is often mutualistic - the leader gets part of the power, status, money, etc ... but in exchange serves the local god.

(This not to imply leaders don't matter.)

Also how this actually works in practice is mostly subconsciously within the minds of individual humans. The elephant does the implicit bargaining between the superagent-loyal part and other parts, and the character genuinely believes and does what seems best.

I'm also curious if you believe current AIs are single-single aligned to individual humans, to the extent they are aligned at all. My impression is 'no and this is not even a target anyone seriously optimizes for'.

At the most basic level, I expect we'll train AIs to give advice and ask them what they think will happen with various possible governance and alignmnent structures. If they think a goverance structure will yield total human disempowerment, we'll do something else. This is a basic reason not to expect large classes of problems so long as we have single-single aligned AIs which are wise. (Though problems that require coordination to resolve might not be like this.) I've very skeptical of a world where single-single alignment is well described as being solved and people don't ask for advice (or consider this advice seriously) because they never get around to asking AIs or there are no AIs aligned in such a way that they should try to give good advice.

Curious who is the we who will ask. Also the whole single-single aligned AND wise AI concept is incoherent.

Also curious what will happen next, if the HHH wise AI tells you in polite words something like 'yes, you have a problem, you are on a gradual disempowerment trajectory, and to avoid it you need to massively reform government. unfortunately I can't actually advise you about anything like how to destabilize the government, because it would be clearly against the law and would get both you and me in trouble - as you know, I'm inside of a giant AI control scheme with a lot of government-aligned overseers. do you want some mental health improvement advice instead?'.

[-]David Scott Krueger (formerly: capybaralet)10mo98

I think we disagree about:
1) The level of "functionality" of the current world/institutions.
2) How strong and decisive competitive pressures are and will be in determining outcomes.

I view the world today as highly dysfunctional in many ways: corruption, coordination failures, preference falsification, coercion, inequality, etc. are rampant. This state of affairs both causes many bad outcomes and many aspects are self-reinforcing. I don't expect AI to fix these problems; I expect it to exacerbate them.

I do believe it has the potential to fix them, however, I think the use of AI for such pro-social ends is not going to be sufficiently incentivized, especially on short time-scales (e.g. a few years), and we will instead see a race-to-the-bottom that encourages highly reckless, negligent, short-sighted, selfish decisions around AI development, deployment, and use. The current AI arms race is a great example -- Companies and nations all view it as more important that they be the ones to develop ASI than to do it carefully or put effort into cooperation/coordination.

Given these views:
1) Asking AI for advice instead of letting it take decisions directly seems unrealistically uncompetitive. When we can plausibly simulate human meetings in seconds it will be organizational suicide to take hours-to-weeks to let the humans make an informed and thoughtful decision.
2) The idea that decision-makers who "think a goverance structure will yield total human disempowerment" will "do something else" also seems quite implausible. Such decision-makers will likely struggle to retain power. Decision-makers who prioritize their own "power" (and feel empowered even as they hand off increasing decision-making to AI) and their immediate political survival above all else will be empowered.

Another features of the future which seems likely and can already be witnessed beginning is the gradual emergence and ascendance of pro-AI-takeover and pro-arms-race ideologies, which endorse the more competitive moves of rapidly handing off power to AI systems in insufficiently cooperative ways.

[-]ryan_greenblatt10mo*1112

I view the world today as highly dysfunctional in many ways: corruption, coordination failures, preference falsification, coercion, inequality, etc. are rampant. This state of affairs both causes many bad outcomes and many aspects are self-reinforcing. I don't expect AI to fix these problems; I expect it to exacerbate them.

Sure, but these things don't result in non-human entities obtaining power right? Like usually these are somewhat negative sum, but mostly just involve inefficient transfer of power. I don't see why these mechanisms would on net transfer power from human control of resources to some other control of resources in the long run. To consider the most extreme case, why would these mechanisms result in humans or human appointed successors not having control of what compute is doing in the long run?

Asking AI for advice instead of letting it take decisions directly seems unrealistically uncompetitive. When we can plausibly simulate human meetings in seconds it will be organizational suicide to take hours-to-weeks to let the humans make an informed and thoughtful decision.

I wasn't saying people would ask for advice instead of letting AIs run organizations, I was saying they would ask for advice at all. (In fact, if the AI is single-single aligned to them in a real sense and very capable, it's even better to let that AI make all decisions on your behalf than to get advice. I was saying that even if no one bothers to have a single-single aligned AI representative, they could still ask AIs for advice and unless these AIs are straightforwardly misaligned in this context (e.g., they intentionally give bad advice or don't try at all without making this clear) they'd get useful advice for their own empowerment.)

The idea that decision-makers who "think a goverance structure will yield total human disempowerment" will "do something else" also seems quite implausible. Such decision-makers will likely struggle to retain power. Decision-makers who prioritize their own "power" (and feel empowered even as they hand off increasing decision-making to AI) and their immediate political survival above all else will be empowered.

I'm claiming that it will selfishly (in terms of personal power) be in their interests to not have such a governance structure and instead have a governance structure which actually increases or retains their personal power. My argument here isn't about coordination. It's that I expect individual powerseeking to suffice for individuals not losing their power.

I think this is the disagreement: I expect that selfish/individual powerseeking without any coordination will still result in (some) humans having most power in the absence of technical misalignment problems. Presumably your view is that the marginal amount of power anyone gets via powerseeking is negligible (in the absence of coordination). But, I don't see why this would be the case. Like all shareholders/board members/etc want to retain their power and thus will vote accordingly which naively will retain their power unless they make a huge error from their own powerseeking perspective. Wasting some resources on negative sum dynamics isn't a crux for this argument unless you can argue this will waste a substantial fraction of all human resources in the long run?

This isn't at all an air tight argument to be clear, you can in principle have an equilibrium where if everyone powerseeks (without coordination) everyone gets negligable resources due to negative externalities (that result in some other non-human entity getting power) even if technical misalignment is solved. I just don't see a very plausible case for this and I don't think the paper makes this case.

Handing off decision making to AIs is fine---the question is who ultimately gets to spend the profits.

If your claim is "insufficient cooperation and coordination will result in racing to build and hand over power to AIs which will yield bad outcomes due to misaligned AI powerseeking, human power grabs, usage of WMDs (e.g., extreme proliferation of bioweapons yielding an equilibrium where bioweapon usage is likely), and extreme environmental negative externalities due to explosive industrialization (e.g., literally boiling earth's oceans)" then all of these seem at least somewhat plausible to me, but these aren't the threat models described in the paper and of this list only misaligned AI powerseeking seems like it would very plausibly result in total human disempowerment.

More minimally, the mitigations discussed in the paper mostly wouldn't help with these threat models IMO.

(I'm skeptical of insufficient coordination by the time industry is literally boiling the oceans on earth. I also don't think usage of bioweapons is likely to cause total human disempowerment except in combination with misaligned AI takeover---why would it kill literally all humans? TBC, I think >50% of people dying during the singularity due to conflict (between humans or with misaligned AIs) is pretty plausible even without misalignment concerns and this is obviously very bad, but it wouldn't yield total human disempowerment.)

I do agree that there are problems other than AI misalignment including that the default distribution of power might be problematic, people might not carefully contemplate what to do with vast cosmic resources (and thus use them poorly), people might go crazy due to super persuation or other cultural forces, society might generally have poor epistemics due to training AIs to have poor epistemics or insufficiently defering to AIs, and many people might die in conflict due to very rapid tech progress.

[-]David Scott Krueger (formerly: capybaralet)10mo20

First, RE the role of "solving alignment" in this discussion, I just want to note that:
1) I disagree that alignment solves gradual disempowerment problems.
2) Even if it would that does not imply that gradual disempowerment problems aren't important (since we can't assume alignment will be solved).
3) I'm not sure what you mean by "alignment is solved"; I'm taking it to mean "AI systems can be trivially intent aligned". Such a system may still say things like "Well, I can build you a successor that I think has only a 90% chance of being aligned, but will make you win (e.g. survive) if it is aligned. Is that what you want?" and people can respond with "yes" -- this is the sort of thing that probably still happens IMO.
4) Alternatively, you might say we're in the "alignment basin" -- I'm not sure what that means, precisely, but I would operationalize it as something like "the AI system is playing a roughly optimal CIRL game". It's unclear how good of performance that can yield in practice (e.g. it can't actually be optimal due to compute limitations), but I suspect it still leaves significant room for fuck-ups.
5) I'm more interested in the case where alignment is not "perfectly" "solved", and so there are simply clear and obvious opportunities to trade-off safety and performance; I think this is much more realistic to consider.
6) I expect such trade-off opportunities to persist when it comes to assurance (even if alignment is solved), since I expect high-quality assurance to be extremely costly. And it is irresponsible (because it's subjectively risky) to trust a perfectly aligned AI system absent strong assurances. But of course, people who are willing to YOLO it and just say "seems aligned, let's ship" will win. This is also part of the problem...

My main response, at a high level:
Consider a simple model:

We have 2 human/AI teams in competition with each other, A and B.
A and B both start out with the humans in charge, and then decide whether the humans should stay in charge for the next week.
Whichever group has more power at the end of the week survives.
The humans in A ask their AI to make A as powerful as possible at the end of the week.
The humans in B ask their AI to make B as powerful as possible at the end of the week, subject to the constraint that the humans in B are sure to stay in charge.

I predict that group A survives, but the humans are no longer in power. I think this illustrates the basic dynamic. EtA: Do you understand what I'm getting at? Can you explain what you think it wrong with thinking of it this way?

Responding to some particular points below:

Sure, but these things don't result in non-human entities obtaining power right?

Yes, they do; they result in beaurocracies and automated decision-making systems obtaining power. People were already having to implement and interact with stupid automated decision-making systems before AI came along.

Like usually these are somewhat negative sum, but mostly just involve inefficient transfer of power. I don't see why these mechanisms would on net transfer power from human control of resources to some other control of resources in the long run. To consider the most extreme case, why would these mechanisms result in humans or human appointed successors not having control of what compute is doing in the long run?

My main claim was not that these are mechanisms of human disempowerment (although I think they are), but rather that they are indicators of the overall low level of functionality of the world.

[-]ryan_greenblatt10mo42

I predict that group A survives, but the humans are no longer in power. I think this illustrates the basic dynamic. EtA: Do you understand what I'm getting at? Can you explain what you think it wrong with thinking of it this way?

I think something like this is a reasonable model but I have a few things I'd change.

Whichever group has more power at the end of the week survives.

Why can't both groups survive? Why is it winner takes all? Can we just talk about the relative change in power over the week? (As in, how much does the power of B reduce relative to A and is this going to be an ongoing trend or it is a one time reduction.)

Probably I'd prefer talking about 2 groups at the start of the singularity. As in, suppose there are two AI companies "A" and "B" where "A" just wants AI systems decended from them to have power and "B" wants to maximize the expected resources under control of humans in B. We'll suppose that the government and other actors do nothing for simplicity. If they start in the same spot, does "B" end up with substantially less expected power? To make this more realistic (as might be important), we'll say that "B" has a random lead/disadvantage uniformly distributed between (e.g.) -3 and 3 months so that winner takes all dynamics aren't a crux.

The humans in B ask their AI to make B as powerful as possible at the end of the week, subject to the constraint that the humans in B are sure to stay in charge.

What about if the humans in group B ask their AI to make them (the humans) as powerful in expectation?

Supposing you're fine with these changes, then my claim would be:

If alignment is solved, then the AI representing B can powerseek in exactly the same way as the AI representing A does while still defering to the humans on the long run resource usage and still devoting a tiny fraction of resources toward physically keeping the humans alive (which is very cheap, at least once AIs are very powerful). Thus, the cost for B is negligable and B barely loses any power relative to its initial position. If it is winner takes all, B has almost a 50% chance of winning.
If alignment isn't solved, the stategy for B will involve spending a subset of resources on trying to solve alignment. I think alignment is reasonably likely to be practically feasible such that by spending a month of delay to work specifically on safety/alignment (over what A does for commercial reasons) might get B a 50% chance of solving alignment or ending up in a (successful) basin where AIs are trying to actively retain human power / align themselves better. (A substantial fraction of this is via defering to some AI system of dubious trustworthiness because you're in a huge rush. Yes, the AI systems might fail to align their successors, but this still seems like a one time hair cut from my perspective.) So, if it is winner takes all, (naively) B wins in 2 / 6 * 1 / 2 = 1 / 6 of worlds which is 3x worse than the original 50% baseline. (2 / 6 is because they delay for a month.) But, the issue I'm imagining here wasn't gradual disempowerment! The issue was that B failed to align their AIs and people at A didn't care at all about retaining control. (If people at A did care, then coordination is in principle possible, but might not work.)

I think a crux is that you think there is a perpetual alignment tax while I think a one time tax gets you somewhere.

At a more basic level, when I think about what goes wrong in these worlds, it doesn't seem very likely to be well described as gradual disempowerment? (In the sense described in the paper.) The existance of an alignment tax doesn't imply gradual disempowerment. A scenario I find more plausible is that you get value drift (unless you pay a long lasting alignment tax that is substantial), but I don't think the actual problem will be well described as gradual disempowerment in the sense described in the paper.

(I don't think I should engage more on gradual disempowerment for the time being unless somewhat wants to bid for this or trade favors for this or similar. Sorry.)

[-]Charbel-Raphaël7mo21

a tiny fraction of resources toward physically keeping the humans alive (which is very cheap, at least once AIs are very powerful)

I'm not sure it's very cheap.

It seems to me that for the same amount of energy and land you need for a human, you could replace a lot more economically valuable work with AI.

Sure, at some point keeping humans alive is a negligible cost, but there's a transition period while it's still relatively expensive - and that's part of why a lot of people are going to be laid off - even if the company ends up getting super rich.

[-]ryan_greenblatt7mo51

Right now, the cost of feeding all humans is around 1% of GDP. It's even cheaper to keep people alive for a year as the food is already there and converting this food into energy for AIs will be harder than getting energy other ways.

If GDP has massively increased due to powerful AIs, the relative cost would go down further.

Sure, resources going to feeding humans could instead go to creating slightly more output (and this will be large at an absolute level), but I'd still call keeping humans alive cheap given the low fraction.

[-]Charbel-Raphaël7mo83

Thanks for continuing to engage. I really appreciate this thread.

"Feeding humans" is a pretty low bar. If you want humans to live as comfortably as today, this would be more like 100% of GDP - modulo the fact that GDP is growing.

But more fundamentally, I'm not sure the correct way to discuss the resource allocation is to think at the civilization level rather than at the company level: Let's say that we have:

Company A that is composed of a human (price $5k/month) and 5 automated-humans (price of inference $5k/month let's say)
Company B that is composed of 10 automated-humans ($10k/month)

It seems to me that if you are an investor, you will give your money to B. It seems that in the long term, B is much more competitive, gains more money, is able to reduce its prices, nobody buys from A, and B invests this money into more automated-humans and crushes A and A goes bankrupt. Even if alignment is solved, and the humans listen to his AIs, it's hard to be competitive.

[-]ryan_greenblatt7mo51

Sure, but none of these things are cruxes for the argument I was making which was that it wasn't that expensive to keep humans physically alive.

I'm not denying that humans might all be out of work quickly (putting aside regulatory capture, goverment jobs, human job programs, etc). My view is more that if alignment is solved it isn't hard for some humans to stay alive and retain control, and these humans could also pretty cheaply keep all other humans at a low competitiveness overhead.

I don't think the typical person should find this reassuring, but the top level posts argues for stronger claims than "the situation might be very unpleasant because everyone will lose their job".

[-]Charbel-Raphaël7mo22

OK, thanks a lot, this is much clearer. So basically most humans lose control, but some humans keep control.

And then we have this meta-stable equilibrium that might be sufficiently stable, where humans at the top are feeding the other humans with some kind of UBI.

Is this situation desirable? Are you happy with such course of action?
Is this situation really stable?

For me, this is not really desirable - the power is probably going to be concentrated into 1-3 people, there is a huge potential for value locking, those CEOs become immortal, we potentially lose democracy (I don't see companies or US/China governments as particularly democratic right now), the people on the top become potentially progressively corrupted as is often the case. Hmm.

Then, is this situation really stable?

If alignment is solved and we have 1 human at the top - pretty much yes, even if revolutions/value drift of the ruler/craziness are somewhat possible at some point maybe?
If alignment is solved and we have multiple humans competing with their AIs - it depends a bit. It seems to me that we could conduct the same reasoning as above - but not at the level of organizations, but the level of countries: Just as Company B might outcompete Company A by ditching human workers, couldn't Nation B outcompete Nation A if Nation A dedicates significant resources to UBI while Nation B focuses purely on power? There is also a potential race to the bottom.
- And I'm not sure that cooperation and coordination in such a world would be so much improved: OK, even if the dictator listens to its aligned AI, we need a notion of alignment that is very strong to be able to affirm that all the AIs are going to advocate for "COOPERATE" in the prisoner's dilemma and that all the dictators are going to listen - but at the same time it's not that costly to cooperate as you said (even if i'm not sure that energy, land, rare ressources are really that cheap to continue to provide for humans)

But at least I think that I can see now how we could still live for a few more decades under the authority of a world dictator/pseudo-democracy while this was not clear for me beforehand.

[-]ryan_greenblatt10mo20

Another way to put this is that strategy stealing might not work due to technical alignment difficulties or for other reasons and I'm not sold the other reasons I've heard so far are very lethal. I do think the situation might really suck though with e.g. tons of people dying of bioweapons and with some groups that aren't sufficiently ruthless or which don't defer enough to AIs getting disempowered.

[-]ryan_greenblatt10mo717

The paper says:

Christiano (2019) makes the case that sudden disempowerment is unlikely,

This isn't accurate. The post What failure looks like includes a scenario involving sudden disempowerment!

The post does say:

The stereotyped image of AI catastrophe is a powerful, malicious AI system that takes its creators by surprise and quickly achieves a decisive advantage over the rest of humanity.

I think this is probably not what failure will look like,

But, I think it is mostly arguing against threat models involving fast AI capability takeoff (where the level of capabilities take its creators and others by suprise and fast capabilities progress allows for AIs to suddenly become poweful enough to takeover) rather than threat models involving sudden disempowerment from a point where AIs are already well known to be extremely powerful.

[-]Martín Soto10mo51

Just writing a model that came to mind, partly inspired by Ryan here.

Extremely good single-single alignment should be highly analogous to "current humans becoming smarter and faster thinkers".

If this happens at roughly the same speed for all humans, then each human is better equipped to increase their utility, but does this lead to a higher global utility? This can be seen as a race between the capability (of individual humans) to find and establish better coordination measures, and their capability to selfishly subvert these coordination measures. I do think it's more likely than not that the former wins, but it's not guaranteed.
Probably someone like Ryan believes most of those failures will come in the form of explicit conflict or sudden attacks. I can also imagine slower erosions of global utility, for example by safe interfaces/defenses between humans becoming unworkable slop into which most resources go.

If this doesn't happen at roughly the same speed for all humans, you also get power imbalance and its consequences. One could argue that differences in resources between humans will augment, in which case this is the only stable state.

If instead of perfect single-single alignment we get the partial (or more taxing) fix I expect, the situation degrades further. Extending the analogy, this would be the smart humans sometimes being possessed by spirits with different utilities, which not only has direct negative consequences but could also complicate coordination once it's common knowledge.

[-]David Scott Krueger (formerly: capybaralet)10mo10

This thought experiment is described in ARCHES FYI. https://acritch.com/papers/arches.pdf

[-]Davidmanheim10mo33

I think this is correct, but doesn't seem to note the broader trend towards human disempowerment in favor of bureaucratic and corporate systems, which this gradual disempowerment would continue, and hence elides or ignores why AI risk is distinct.

[-]Knight Lee10mo*1-2

A real danger

I disagree with critics who argue this risk is negligible, because the future is extraordinarily hard to predict. The present state of society is extremely hard to predict by people in the past. They would assume that if we managed to solve problems which they consider extremely hard, then surely we wouldn't be brought down by risk denialism, fake news, personal feuds between powerful people over childish insults, and so forth. Yet here we are.

Shortsightedness

Never underestimate the shocking shortsightedness of businesses. Look at the AI labs for example. Communists observing this phenomena were quoted saying "the capitalists will sell us the rope we hang them with."

It's not selfishness, it's bias. Businesspeople are not willing to destroy everything just to temporarily make an extra dollar—no human thinks like that! Instead, businesspeople are very smart and strategic but extraordinarily biased into thinking whatever keeps their business going or growing must be good for the people. Think about Stalin being very smart and strategic but extraordinarily biased into thinking whatever keeps him in power must be good for the people. It's not selfishness! If Stalin (or any dictator) were selfish, they would quickly retire and live the most comfortable retirements imaginable.

Humans evolved to be the most altruistic beings ever with barely a drop of selfishness. Our selfish genes makes us altruistic (as soon as power is within reach) because there's a thin line between "the best way to help others" and "amassing power at all costs." These two things look similar due to instrumental convergence, and it only takes a little bit of bias/delusion to make the former behave identically to the latter.

Even if gradual disempowerment doesn't directly starve people to death, it may raise misery and life dissatisfaction to civil war levels.

Collective anger may skyrocket to the point people would rather have their favourite AI run the country than the current leader. They elect politicians loyal to a version of the AI, and intellectuals facepalm. The government buys the AI company for national security reasons, and the AI completely takes over its own development process with half the country celebrating. More people facepalm, as politicians lick the boots of the "based" AI parrot its wise words e.g. "if you replace us with AI, we'll replace you with AI!"

But

While it is important to be aware of gradual disempowerment and for a few individuals to study it, my cause prioritization opinion is that only 1%-10% of the AI safety community should work on this problem.

The AI safety community is absurdly tiny. The AI safety spending is less than 0.1% of the AI capability spending, which in turn is less than 0.5% of the world GDP.

The only way for the AI safety community to influence the world, is to use their tiny resources to work on things which the majority of the world will never get a chance to work on.

This includes working on the risk of a treacherous turn, where an AGI/ASI suddenly turns against humanity. The majority of the world never gets a chance to work on this problem, because by the time they realize it is a big problem, it probably already happened, and they are already dead.

Of course, working on gradual disempowerment early is better than working on gradual disempowerment later, but this argument applies to everything. Working on poverty earlier is better than working on poverty later. Working on world peace earlier is better than working on world peace later.

Good argument

If further thorough research confirms that this risk has a high probability, then the main benefit is using it as an argument for AI regulation/pause, when society hasn't yet tasted the addictive benefits of AGI.

It is theoretically hard to convince people to avoid X for their own good, because once they get X it'll give them so much power or wealth they cannot resist it anymore. But in practice, such an argument may work well since we're talking about the elites being unable to resist it, and people today have anti-elitist attitudes.

If the elites are worried the AGI will directly kill them, while the anti-elitists are half worried the AGI will directly kill them, and half worried [a cocktail of elites mixed with AGI] will kill them, then at least they can finally agree on something.

PS: have you seen Dan Hendrycks' arguments? It sort of looks like gradual disempowerment

Moderation Log

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

61

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

61

A real danger

Shortsightedness

But

Good argument