This is a linkpost for https://gradual-disempowerment.ai/

Full version on arXiv | X

Executive summary

AI risk scenarios usually portray a relatively sudden loss of human control to AIs, outmaneuvering individual humans and human institutions, due to a sudden increase in AI capabilities, or a coordinated betrayal. However, we argue that even an incremental increase in AI capabilities, without any coordinated power-seeking, poses a substantial risk of eventual human disempowerment. This loss of human influence will be centrally driven by having more competitive machine alternatives to humans in almost all societal functions, such as economic labor, decision making, artistic creation, and even companionship.

A gradual loss of control of our own civilization might sound implausible. Hasn't technological disruption usually improved aggregate human welfare? We argue that the alignment of societal systems with human interests has been stable only because of the necessity of human participation for thriving economies, states, and cultures. Once this human participation gets displaced by more competitive machine alternatives, our institutions' incentives for growth will be untethered from a need to ensure human flourishing. Decision-makers at all levels will soon face pressures to reduce human involvement across labor markets, governance structures, cultural production, and even social interactions. Those who resist these pressures will eventually be displaced by those who do not.

Still, wouldn't humans notice what's happening and coordinate to stop it? Not necessarily. What makes this transition particularly hard to resist is that pressures on each societal system bleed into the others. For example, we might attempt to use state power and cultural attitudes to preserve human economic power. However, the economic incentives for companies to replace humans with AI will also push them to influence states and culture to support this change, using their growing economic power to shape both policy and public opinion, which will in turn allow those companies to accrue even greater economic power.

Once AI has begun to displace humans, existing feedback mechanisms that encourage human influence and flourishing will begin to break down. For example, states funded mainly by taxes on AI profits instead of their citizens' labor will have little incentive to ensure citizens' representation. This could occur at the same time as AI provides states with unprecedented influence over human culture and behavior, which might make coordination amongst humans more difficult, thereby further reducing humans' ability to resist such pressures. We describe these and other mechanisms and feedback loops in more detail in this work.

Though we provide some proposals for slowing or averting this process, and survey related discussions, we emphasize that no one has a concrete plausible plan for stopping gradual human disempowerment, and that methods of aligning individual AI systems with their designers' intentions are not sufficient. Because this disempowerment would be global and permanent, and because human flourishing requires substantial resources in global terms, it could plausibly lead to human extinction or similar outcomes.



 


Thank you for writing this! I think it's probably true that something like "society's alignment to human interest implicitly relies on human labor and cognition" is correct, and that we will have to find clever solutions and muster a lot of resources and political will to maintain alignment if human labor and cognition stop playing a large role. I am glad some people are thinking about these risks.

While the essay describes dynamics which I think are likely to result in a scary power concentration, I think this is more likely to be a concentration of power in the hands of humans or more straightforwardly misaligned AIs rather than some notion of complete disempowerment. I'd be excited about a follow-up work which focuses on the argument for power concentration, which seems more likely to be robust and accurate to me.

 

Some criticism on complete disempowerment (that goes beyond power concentration):

(This probably reflects mostly ignorance on my part rather than genuine weaknesses of your arguments. I have thought some about coordination difficulties but it is not my specialty.)

I think that the world currently has and will continue to have a few properties which make the scenario described in the essay look less likely:

  • Baseline alignment: AIs are likely to be sufficiently intent aligned that it's easy to prevent egregious lying and tampering with measurements (in the AI and its descendants) if their creators want to.
    • I am not very confident about this, but mostly because I think scheming is likely for AIs that can replace humans (maybe p=20%?), and even absent scheming, it is plausible that you get AIs that lie egregiously and tamper with measurements even if you somewhat try to prevent it (maybe p=10%?)
    • I expect that you will get some egregious lying and tampering, just like companies sometimes do, but that it will be forbidden, and that it is relatively easy to create an AI "police" that enforces a relatively low level of egregious lies (and that, like in the current world, enough people want that police that it is created).
  • No strong AI rights before full alignment: There won't be a powerful society that gives extremely productive AIs "human-like rights" (and in particular strong property rights) prior to being relatively confident that AIs are aligned to human values.
    • I think it's plausible that fully AI-run entities are given the same status as companies - but I expect that the surplus they generate will remain owned by some humans throughout the relevant transition period.
    • I also think it's plausible that some weak entities will give AIs these rights, but that this won't matter because most "AI power" will be controlled by humans that care about it remaining the case as long as we don't have full alignment.
  • No hot global war: We won't be in a situation where a conflict that has a decent chance of destroying humanity (or that lasts forever, consuming all resources) seems plausibly like a good idea to humans.
    • Granted, this may be assuming the conclusion. But to the extent that this is the threat, I think it is worth making it clear.
    • I am keen for a description of how international tensions could heighten so much that we get this level of animosity. My guess is that we might get a hot war for reasons like "State A is afraid of falling behind State B and thus starts a hot war before it's too late", and I don't think that this relies on the feedback loops described in the essay (and is sufficiently bad on its own that the essay's dynamics do not matter).

I agree that if we ever lose one of these three properties (and especially the first one), it would be difficult to get it back because of the feedback loops described in the essay. (If you want to argue against these properties, please start from a world like ours, where these three properties are true.) I am curious which property you think is most likely to fall first.

When assuming the combination of these properties, I think that this makes many of the specific threats and positive feedback loops described in the essay look less likely:

  • Owners of capital will remain humans and will remain aware of the consequences of the actions of "their AIs". They will remain able to change the user of that AI labor if they so desire.
  • Politicians (e.g. senators) will remain aware of the consequences of the state's AIs' actions (even if the actual process becomes a black box). They will remain able to change what the system is if it has obviously bad consequences (terrible material conditions and tons of Von Neumann probes with weird objectives spreading throughout the universe are obviously bad if you are not in a hot global war).
  • Human consumers of culture will remain able to choose what culture they consume.
    • I agree the brain-hacking stuff is worrisome, but my guess is that if it gets "obviously bad", people will be able to tell before it's too late.
    • I expect changes in media to be mostly symmetric in their content, and in particular not to strongly favor conflict over peace (media creators can choose to make slightly less good media that promotes certain views, but because of human ownership of capital and no lying, I expect this not to be a massive change in dynamics).
    • Maybe it naturally favors strong AI rights to have media created by AIs because of AI relationships? I expect this not to be the case because the intuitive case against strong AI rights seems super strong in current society (and there are other ways to make legitimate AI-human relationships, like not letting AI partners get massively wealthy and powerful), but this is maybe where I am most worried in worlds with slow AI progress.

I think there is a real chance that these properties turn out to be false, and thus I think it is worth working on making them true. But I feel like them being false is somewhat obviously catastrophic in a range of scenarios much broader than the ones described in the essay, and thus it may be better to work on them directly rather than trying to do something more "systemic".

 

On a more meta note, I think this essay would have benefited from a bit more concreteness in the scenarios it describes and in the empirical claims it relies on. There is some of that (e.g. on rentier states), but I think there could have been more. I think What does it take to defend the world against out-of-control AGIs? makes related arguments about coordination difficulties (though not on gradual disempowerment) in a way that made more sense to me, giving examples of very concrete "caricature-but-plausible" scenarios and pointing at relevant and analogous coordination failures in the current world.

The paper says:

Christiano (2019) makes the case that sudden disempowerment is unlikely,

This isn't accurate. The post What failure looks like includes a scenario involving sudden disempowerment!

The post does say:

The stereotyped image of AI catastrophe is a powerful, malicious AI system that takes its creators by surprise and quickly achieves a decisive advantage over the rest of humanity.

I think this is probably not what failure will look like,

But, I think it is mostly arguing against threat models involving fast AI capability takeoff (where the level of capabilities takes its creators and others by surprise and fast capabilities progress allows AIs to suddenly become powerful enough to take over) rather than threat models involving sudden disempowerment from a point where AIs are already well known to be extremely powerful.

I (remain) skeptical that the sort of failure mode described here is plausible if we solve the problem of aligning individual AI systems with their designers' intentions without this alignment requiring any substantial additional costs (that is, we solve single-single alignment with minimal alignment tax).

This has previously been argued by Vanessa here and Paul here in response to a post making a similar claim.

I do worry about human power grabs: some humans obtaining greatly more power as enabled by AI (even if we have no serious alignment issues). However, I don't think this matches the story you describe and the mitigations seem substantially different than what you seem to be imagining.

I'm also somewhat skeptical of the threat model you describe in the case where alignment isn't solved. I think the difference between the story you tell and something more like We get what we measure is important.

I worry I'm misunderstanding something because I haven't read the paper in detail.

In regards to whether “single-single alignment” will make coordination problems and other sorts of human dysfunction and slow-rolling catastrophes less likely:

…I’m not really sure what I think. I feel like I have a lot of thoughts that have not gelled into a coherent whole.

(A) The optimistic side of me says what you said in your comment (and in the Vanessa and (especially) Paul comments linked therein).

People don’t want bad things to happen. If someone asks an AI what’s gonna happen, and they say “bad thing”, then they’ll say “well what can I do about it?”, and the AI will answer that. That can include participating in novel coordination mechanisms etc.

(B) The pessimistic side of me says there’s like a “Law of Conservation of Wisdom”, where if people lack wisdom, then an AI that’s supposed to satisfy those people’s preferences will not create new wisdom from thin air. For example:

  • If an AI is known to be de-converting religious fundamentalists, then religious fundamentalists will hear about that, and not use that AI.
  • Hugo Chávez had his pick of the best economists in the world to ask for advice, and they all would have said “price controls will be bad for Venezuela”, and yet he didn’t ask, or perhaps didn’t listen, or perhaps wasn’t motivated by what’s best for Venezuela. If Hugo Chávez had had his pick of AIs to ask for advice, why do we expect a different outcome?
  • If someone has motivated reasoning towards Conclusion X, maybe they’ll watch the AIs debate Conclusion X, and wind up with new better rationalizations of Conclusion X, even if Conclusion X is wrong.
  • If someone has motivated reasoning towards Conclusion X, maybe they just won’t ask the AIs to debate Conclusion X, because no right-minded person would even consider the possibility that Conclusion X is wrong.
  • If someone makes an AI that’s sycophantic where possible (i.e., when it won’t immediately get caught), other people will opt into using it.
  • I think about people making terrible decisions that undermine societal resilience—e.g. I gave the example here of a person doing gain-of-function research, or here of USA government bureaucrats outlawing testing people for COVID during the early phases of the pandemic. I try to imagine that they have AI assistants. I want to imagine the person asking the AI “should we make COVID testing illegal”, and the AI says “wtf, obviously not”. But that mental image is evidently missing something. If they were asking that question at all, then they don’t need an AI, the answer is already obvious. And yet, testing was in fact made illegal. So there’s something missing from that imagined picture. And I think the missing ingredient is: institutional / bureaucratic incentives and associated dysfunction. People wouldn’t ask “should we make COVID testing illegal”, rather the low-level people would ask “what are the standard procedures for this situation?” and the high-level people would ask “what decision can I make that would minimize the chance that things will blow up in my face and embarrass me in front of the people I care about?” etc.
  • I think of things that are true but currently taboo, and imagine the AI asserting them, and then I imagine the AI developers profusely apologizing and re-training the AI to not do that.
  • In general, motivated reasoning complicates what might seem to be a sharp line between questions of fact / making mistakes versus questions of values / preferences / decisions. Etc.

…So we should not expect wise and foresightful coordination mechanisms to arise.

So how do we reconcile (A) vs (B)?

Again, the logic of (A) is: “human is unhappy with how things turned out, despite opportunities to change things, therefore there must have been a lack of single-single alignment”.

One possible way to think about it: when tradeoffs exist, human preferences are ill-defined and subject to manipulation. If doing X has good consequence P and bad consequence Q, then the AI can make either P or Q very salient, and “human preferences” will wind up different.

And when tradeoffs exist between the present and the future, then it’s invalid logic to say “the person wound up unhappy, therefore their preferences were not followed”. If their preferences are mutually-contradictory, (and they are), then it’s impossible for all their preferences to be followed, and it’s possible for an AI helper to be as preference-following as is feasible despite the person winding up unhappy or dead.

I think Paul kinda uses that invalid logic, i.e. treating “person winds up unhappy or dead” as proof of single-single misalignment. But if the person has an immediate preference to not rock the boat, or to maintain their religion or other beliefs, or to not think too hard about such-and-such, or whatever, then an AI obeying those immediate preferences is still “preference-following” or “single-single aligned”, one presumes, even if the person winds up unhappy or dead.

…So then the optimistic side of me says: “Who’s to say that the AI is treating all preferences equally? Why can’t the AI stack the deck in favor of ‘if the person winds up miserable or dead, that kind of preference is more important than the person’s preference to not question my cherished beliefs or whatever’?”

…And then the pessimistic side says: “Well sure. But that scenario does not violate the Law of Conservation of Wisdom, because the wisdom is coming from the AI developers imposing their meta-preferences for some kinds of preferences (e.g., reflectively-endorsed ones) over others. It’s not just a preference-following AI but a wisdom-enhancing AI. That’s good! However, the problems now are: (1) there are human forces stacked against this kind of AI, because it’s not-yet-wise humans who are deciding whether and how to use AI, how to train AI, etc.; (2) this is getting closer to ambitious value learning which is philosophically tricky, and worst of all (3) I thought the whole point of corrigibility was that humans remain in control, but this is instead a system that’s manipulating people by design, since it’s supposed to be turning them from less-wise to more-wise. So the humans are not in control, really, and thus we need to get things right the first time.”

…And then the optimistic side says: “For (2), c’mon, it’s not that philosophically tricky, you just do [debate or whatever, fill-in-the-blank]. And for (3), yeah, the safety case is subtly different from what people in the corrigibility camp would describe, but saying ‘the human is not in control’ is an over-the-top way to put it; anyway we still have a safety case because of [fill-in-the-blank]. And for (1), I dunno, maybe the people who make the most powerful AI will be unusually wise, and they’ll use it in-house for solving CEV-ASI instead of hoping for global adoption.”

…And then the pessimistic side says: I dunno. I’m not sure I really believe any of those. But I guess I’ll stop here, this is already an excessively long comment :)

I went through a bunch of similar thoughts before writing the self-unalignment problem. When we talked about this many years ago with Paul, my impression was that this is actually somewhat cruxy and we disagree about self-unalignment - where my mental image is that if you start with an incoherent bundle of self-conflicted values and you plug this into an IDA-like dynamic, my intuition is you can end up in arbitrary places, including very bad ones. (Also cf. the part of Scott's review of What We Owe the Future where he is worried that in a philosophy game, a smart moral philosopher can extrapolate his values to 'I have to have my eyes pecked out by angry seagulls or something' and hence does not want to play the game. AIs will likely be more powerful in this game than Will MacAskill.)

My current position is that we still don't have a good answer; I don't trust the response 'we can just assume the problem away', nor the response 'this is just another problem which you can delegate to future systems'. On the other hand, existing AIs already seem to be doing a lot of value extrapolation and the results sometimes seem surprisingly sane, so maybe we will get lucky, or a larger part of morality is convergent - but it's worth noting these value-extrapolating AIs are not necessarily what AI labs want or what the traditional alignment program aims for.

I think it's a bit sad that this comment is being so well-received -- it's just some opinions without arguments from someone who hasn't read the paper in detail.  

Agreed, downvoted my comment. (You can't strong downvote your own comment, or I would have done that.)

I was mostly just trying to point to prior arguments against similar arguments while expressing my view.

Thanks for the detailed objection and the pointers.  I agree there's a chance that solving alignment with designers' intentions might be sufficient.  I think the objection is a good one that "if the AI was really aligned with one agent, it'd figure out a way to help them avoid multipolar traps".

My reply is that I'm worried that avoiding races-to-the-bottom will continue to be hard, especially since competition operates on so many levels.  I think the main question is: what's the tax for coordinating to avoid a multipolar trap?  If it's cheap we might be fine; if it's expensive then we might walk into a trap with eyes wide open.

As for human power grabs, maybe we should have included those in our descriptions.  But the slower things change, the less there's a distinction between "selfishly grab power" and "focus on growth so you don't get outcompeted".  E.g. Is starting a company or a political party a power grab?

As for reading the paper in detail, it's largely just making the case that a sustained period of technological unemployment, without breakthroughs in alignment and cooperation, would tend to make our civilization serve humans' interests more and more poorly over time in a way that'd be hard to resist.  I think arguing that things are likely to move faster would be a good objection to the plausibility of this scenario.  But we still think it's an important point that the misalignment of our civilization is possibly a second alignment problem that we'll have to solve.

ETA:  To clarify what I mean by "need to align our civilization": Concretely, I'm imagining the government deploying a slightly superhuman AGI internally.  Some say its constitution should care about world peace, others say it should prioritize domestic interests, there is a struggle and it gets a muddled mix of directives like LLMs have today.  It never manages to sort out global cooperation, and meanwhile various internal factions compete to edit the AGI's constitution.  It ends up with a less-than-enlightened focus on growth of some particular power structure, and the rest of us are permanently marginalized.

I think the objection is a good one that "if the AI was really aligned with one agent, it'd figure out a way to help them avoid multipolar traps".

My reply is that I'm worried that avoiding races-to-the-bottom will continue to be hard, especially since competition operates on so many levels.

Part of the objection is in avoiding multipolar traps, but there is also a more basic story like:

  • Humans own capital/influence.
  • They use this influence to serve their own interests and have an (aligned) AI system which faithfully represents their interests.
  • Given that AIs can make high quality representation very cheap, the AI representation is very good and granular. Thus, something like the strategy-stealing assumption can hold and we might expect that humans end up with the same expected fraction of capital/influence they started with (at least to the extent they are interested in saving rather than consumption).

Even without any coordination, this can potentially work OK. There are objections to the strategy-stealing assumption, but none of these seem existential if we get to a point where everyone has wildly superintelligent and aligned AI representatives and we've ensured humans are physically robust to offense dominant technologies like bioweapons.

(I'm optimistic about being robust to bioweapons within a year or two of having wildly superhuman AIs, though we might run into huge issues during this transitional period... Regardless, bioweapons deployed by terrorists or as part of a power grab in a brief transitional period doesn't seem like the threat model you're describing.)

I expect some issues with races-to-the-bottom / negative sum dynamics / negative externalities like:

  • By default, increased industry on earth shortly after the creation of very powerful AI will result in boiling the oceans (via fusion power). If you don't participate in this industry, you might be substantially outcompeted by others[1]. However, I don't think it will be that expensive to protect humans through this period, especially if you're willing to use strategies like converting people into emulated minds. Thus, this doesn't seem at all likely to be literally existential. (I'm also optimistic about coordination here.)
  • There might be one time shifts in power between humans via mechanisms like states becoming more powerful. But, ultimately these states will be controlled by humans or appointed successors of humans if alignment isn't an issue. Mechanisms like competing over the quantity of bribery are zero sum as they just change the distribution of power and this can be priced in as a one time shift even without coordination to race to the bottom on bribes.

But, this still doesn't seem to cause issues with humans retaining control via their AI representatives? Perhaps the distribution of power between humans is problematic and may be extremely unequal and the biosphere will physically be mostly destroyed (though humans will survive), but I thought you were making stronger claims.

Edit in response to your edit: If we align the AI to some arbitrary target which is seriously misaligned with humanity as a whole (due to infighting or other issues), I agree this can cause existential problems.

(I think I should read the paper in more detail before engaging more than this!)


  1. It's unclear if boiling the oceans would result in substantial acceleration. This depends on how quickly you can develop industry in space and dyson sphere style structures. I'd guess the speed up is much less than a year. ↩︎

Thanks for this.  Discussions of things like "one time shifts in power between humans via mechanisms like states becoming more powerful" and personal AI representatives is exactly the sort of thing I'd like to hear more about.  I'm happy to have finally found someone who has something substantial to say about this transition!

But over the last 2 years I asked a lot of people at the major labs for any kind of details about a positive post-AGI future, and almost no one had put anywhere close to as much thought into it as you have, and no one mentioned the things above.  Most people clearly hadn't put much thought into it at all.  If anyone at the labs had much more of a plan than "we'll solve alignment while avoiding an arms race", I managed to fail to even hear about its existence despite many conversations, including with founders.

The closest thing to a plan was Sam Bowman's checklist:
https://sleepinyourhat.github.io/checklist/
which is exactly the sort of thing I was hoping for, except it's almost silent on issues of power, the state, and the role of post-AGI humans.

If you have any more related reading for the main "things might go OK" plan in your eyes, I'm all ears.

Yeah, people at labs are generally not thoughtful about AI futurism IMO, though of course most people aren't thoughtful about AI futurism. And labs don't really have plans IMO. (TBC, I think careful futurism is hard, hard to check, and not clearly that useful given realistic levels of uncertainty.)

If you have any more related reading for the main "things might go OK" plan in your eyes, I'm all ears.

I don't have a ready to go list. You might be interested in this post and comments responding to it, though I'd note I disagree substantially with the post.

I'm quite confused about why you think the linked Vanessa response to something slightly different has much relevance here.

One of the claims we make, paraphrased & simplified in a way which I hope is closer to your way of thinking about it:

- AIs are mostly not developed and deployed by individual humans
- there are a lot of other agencies or self-interested, self-preserving structures/processes in the world
- if the AIs are aligned to these structures, human disempowerment is likely because these structures are aligned to humans way less than they seem
- there are plausible futures in which these structures keep power longer than humans

Overall I would find it easier to discuss if you tried to formulate what you disagree about in the ontology of the paper. Also some of the points made are subtle enough that I don't expect responses to other arguments to address them.

 

I don't think it's worth adjudicating the question of how relevant Vanessa's response is (though I do think Vanessa's response is directly relevant).

if the AIs are aligned to these structures, human disempowerment is likely because these structures are aligned to humans way less than they seem

My claim would be that if single-single alignment is solved, this problem won't be existential. I agree that if you literally aligned all AIs to (e.g.) the mission of a non-profit as well as you can, you're in trouble. However, if you have single-single alignment:

  • At the most basic level, I expect we'll train AIs to give advice and ask them what they think will happen with various possible governance and alignment structures. If they think a governance structure will yield total human disempowerment, we'll do something else. This is a basic reason not to expect large classes of problems so long as we have single-single aligned AIs which are wise. (Though problems that require coordination to resolve might not be like this.) I'm very skeptical of a world where single-single alignment is well described as being solved and people don't ask for advice (or consider this advice seriously) because they never get around to asking AIs or there are no AIs aligned in such a way that they should try to give good advice.
  • I expect organizations will be explicitly controlled by people and (some of) those people will have AI representatives to represent their interests as I discuss here. If you think getting good AI representation is unlikely, that would be a crux, but this would be my proposed solution at least.
    • The explicit mission of for-profit companies is to empower the shareholders. It clearly doesn't serve the interests of the shareholders to end up dead or disempowered.
    • Democratic governments have similar properties.
    • At a more basic level, I think people running organizations won't decide "oh, we should put the AI in charge of running this organization aligned to some mission from the preferences of the people (like me) who currently have de facto or de jure power over this organization". This is a crazily disempowering move that I expect people will by default be too savvy to make in almost all cases. (Both for people with substantial de facto and with de jure power.)
  • Even independent of the advice consideration, people will probably want AIs running organizations to be honest to at least the people controlling the organization. Given that I expect explicit control by people in almost all cases, if things are going in an existential direction, people can vote to change them in almost all cases.
  • I don't buy that there will be some sort of existential multi-polar trap even without coordination (though I also expect coordination) due to things like the strategy stealing assumption as I also discuss in that comment.
  • If a subset of organizations diverge from a reasonable interpretation of what they were supposed to do (but are still basically obeying the law and some interpretation of what they were intentionally aligned to) and this is clear to the rest of the world (as I expect would be the case given some type of AI advisors), then the rest of the world can avoid problems from this subset via the court system or other mechanisms. Even if this subset of organizations run by effectively rogue AIs just runs away with resources successfully, this is probably only a subset of resources.

I think your response to a lot of this will be something like:

  • People won't have or won't listen to AI advisors.
  • Institutions will intentionally delude relevant people to acquire more power.

But, the key thing is that I expect at least some people will keep power, even if large subsets are deluded. E.g., I expect that corporate shareholders, boardmembers, or government will be very interested in the question of whether they will be disempowered by changes in structure. It does seem plausible (or even likely) to me that some people will engage in power grabs via ensuring AIs are aligned to them, deluding the world about what's going on using a variety of mechanisms (including, e.g., denying or manipulating access to AI representation/advisors), and expanding their (hard) power over time. The thing I don't buy is a case where very powerful people don't ask for advice at all prior to having been deluded by the organizations that they themselves run!

I think human power grabs like this are concerning and there are a variety of plausible solutions which seem somewhat reasonable.

Maybe your response is that the solutions that will be implemented in practice given concerns about human power grabs will involve aligning AIs to institutions in ways that yield the dynamics you describe? I'm skeptical given the dynamics discussed above about asking AIs for advice.

I think my main response is that we might have different models of how power and control actually work in today's world. Your responses seem to assume a level of individual human agency and control that I don't believe accurately reflects even today's reality.

Consider how some of the most individually powerful humans, leaders and decision-makers, operate within institutions. I would not say we see pure individual agency. Instead, we typically observe a complex mixture of:

  1. Serving the institutional logic of the entity they nominally lead (e.g., maintaining state power, growing the corporation)
  2. Making decisions that don't egregiously harm their nominal beneficiaries (citizens, shareholders)
  3. Pursuing personal interests and preferences
  4. Responding to various institutional pressures and constraints

From what I have seen, even humans like CEOs or prime ministers often find themselves constrained by and serving institutional superagents rather than genuinely directing them. The relation is often mutualistic - the leader gets part of the power, status, money, etc ... but in exchange serves the local god. 

(This is not to imply leaders don't matter.)

Also, how this actually works in practice is mostly subconscious, playing out within the minds of individual humans. The elephant does the implicit bargaining between the superagent-loyal part and other parts, and the character genuinely believes and does what seems best.

I'm also curious if you believe current AIs are single-single aligned to individual humans, to the extent they are aligned at all. My impression is 'no and this is not even a target anyone seriously optimizes for'. 
 

At the most basic level, I expect we'll train AIs to give advice and ask them what they think will happen with various possible governance and alignment structures. If they think a governance structure will yield total human disempowerment, we'll do something else. This is a basic reason not to expect large classes of problems so long as we have single-single aligned AIs which are wise. (Though problems that require coordination to resolve might not be like this.) I'm very skeptical of a world where single-single alignment is well described as being solved and people don't ask for advice (or consider this advice seriously) because they never get around to asking AIs or there are no AIs aligned in such a way that they should try to give good advice.

Curious who the "we" who will ask is. Also, the whole single-single-aligned AND wise AI concept is incoherent.

Also curious what will happen next, if the HHH wise AI tells you in polite words something like 'yes, you have a problem, you are on a gradual disempowerment trajectory, and to avoid it you need to massively reform government. unfortunately I can't actually advise you about anything like how to destabilize the government, because it would be clearly against the law and would get both you and me in trouble - as you know, I'm inside of a giant AI control scheme with a lot of government-aligned overseers. do you want some mental health improvement advice instead?'
 

I think we disagree about:
1) The level of "functionality" of the current world/institutions.
2) How strong and decisive competitive pressures are and will be in determining outcomes.

I view the world today as highly dysfunctional in many ways: corruption, coordination failures, preference falsification, coercion, inequality, etc. are rampant.  This state of affairs both causes many bad outcomes and many aspects are self-reinforcing.  I don't expect AI to fix these problems; I expect it to exacerbate them.

I do believe it has the potential to fix them; however, I think the use of AI for such pro-social ends is not going to be sufficiently incentivized, especially on short time-scales (e.g. a few years), and we will instead see a race-to-the-bottom that encourages highly reckless, negligent, short-sighted, selfish decisions around AI development, deployment, and use.  The current AI arms race is a great example -- companies and nations all view it as more important that they be the ones to develop ASI than to do it carefully or put effort into cooperation/coordination.

Given these views:
1) Asking AI for advice instead of letting it take decisions directly seems unrealistically uncompetitive.  When we can plausibly simulate human meetings in seconds it will be organizational suicide to take hours-to-weeks to let the humans make an informed and thoughtful decision. 
2) The idea that decision-makers who "think a governance structure will yield total human disempowerment" will "do something else" also seems quite implausible.  Such decision-makers will likely struggle to retain power.  Decision-makers who prioritize their own "power" (and feel empowered even as they hand off increasing decision-making to AI) and their immediate political survival above all else will be empowered.

Another feature of the future which seems likely, and can already be witnessed beginning, is the gradual emergence and ascendance of pro-AI-takeover and pro-arms-race ideologies, which endorse the more competitive moves of rapidly handing off power to AI systems in insufficiently cooperative ways.

I view the world today as highly dysfunctional in many ways: corruption, coordination failures, preference falsification, coercion, inequality, etc. are rampant. This state of affairs both causes many bad outcomes and many aspects are self-reinforcing. I don't expect AI to fix these problems; I expect it to exacerbate them.

Sure, but these things don't result in non-human entities obtaining power right? Like usually these are somewhat negative sum, but mostly just involve inefficient transfer of power. I don't see why these mechanisms would on net transfer power from human control of resources to some other control of resources in the long run. To consider the most extreme case, why would these mechanisms result in humans or human appointed successors not having control of what compute is doing in the long run?

  1. Asking AI for advice instead of letting it take decisions directly seems unrealistically uncompetitive. When we can plausibly simulate human meetings in seconds it will be organizational suicide to take hours-to-weeks to let the humans make an informed and thoughtful decision.

I wasn't saying people would ask for advice instead of letting AIs run organizations, I was saying they would ask for advice at all. (In fact, if the AI is single-single aligned to them in a real sense and very capable, it's even better to let that AI make all decisions on your behalf than to get advice. I was saying that even if no one bothers to have a single-single aligned AI representative, they could still ask AIs for advice and unless these AIs are straightforwardly misaligned in this context (e.g., they intentionally give bad advice or don't try at all without making this clear) they'd get useful advice for their own empowerment.)

  1. The idea that decision-makers who "think a governance structure will yield total human disempowerment" will "do something else" also seems quite implausible. Such decision-makers will likely struggle to retain power. Decision-makers who prioritize their own "power" (and feel empowered even as they hand off increasing decision-making to AI) and their immediate political survival above all else will be empowered.

I'm claiming that it will selfishly (in terms of personal power) be in their interests to not have such a governance structure and instead have a governance structure which actually increases or retains their personal power. My argument here isn't about coordination. It's that I expect individual powerseeking to suffice for individuals not losing their power.

I think this is the disagreement: I expect that selfish/individual powerseeking without any coordination will still result in (some) humans having most power in the absence of technical misalignment problems. Presumably your view is that the marginal amount of power anyone gets via powerseeking is negligible (in the absence of coordination). But, I don't see why this would be the case. Like all shareholders/board members/etc want to retain their power and thus will vote accordingly which naively will retain their power unless they make a huge error from their own powerseeking perspective. Wasting some resources on negative sum dynamics isn't a crux for this argument unless you can argue this will waste a substantial fraction of all human resources in the long run?

To be clear, this isn't at all an airtight argument: you can in principle have an equilibrium where if everyone powerseeks (without coordination) everyone gets negligible resources due to negative externalities (that result in some other non-human entity getting power) even if technical misalignment is solved. I just don't see a very plausible case for this and I don't think the paper makes this case.

Handing off decision making to AIs is fine---the question is who ultimately gets to spend the profits.


If your claim is "insufficient cooperation and coordination will result in racing to build and hand over power to AIs which will yield bad outcomes due to misaligned AI powerseeking, human power grabs, usage of WMDs (e.g., extreme proliferation of bioweapons yielding an equilibrium where bioweapon usage is likely), and extreme environmental negative externalities due to explosive industrialization (e.g., literally boiling earth's oceans)" then all of these seem at least somewhat plausible to me, but these aren't the threat models described in the paper and of this list only misaligned AI powerseeking seems like it would very plausibly result in total human disempowerment.

More minimally, the mitigations discussed in the paper mostly wouldn't help with these threat models IMO.

(I'm skeptical of insufficient coordination by the time industry is literally boiling the oceans on earth. I also don't think usage of bioweapons is likely to cause total human disempowerment except in combination with misaligned AI takeover---why would it kill literally all humans? TBC, I think >50% of people dying during the singularity due to conflict (between humans or with misaligned AIs) is pretty plausible even without misalignment concerns and this is obviously very bad, but it wouldn't yield total human disempowerment.)

I do agree that there are problems other than AI misalignment, including that the default distribution of power might be problematic, people might not carefully contemplate what to do with vast cosmic resources (and thus use them poorly), people might go crazy due to super persuasion or other cultural forces, society might generally have poor epistemics due to training AIs to have poor epistemics or insufficiently deferring to AIs, and many people might die in conflict due to very rapid tech progress.

First, RE the role of "solving alignment" in this discussion, I just want to note that:
1) I disagree that alignment solves gradual disempowerment problems.
2) Even if it would that does not imply that gradual disempowerment problems aren't important (since we can't assume alignment will be solved).
3) I'm not sure what you mean by "alignment is solved"; I'm taking it to mean "AI systems can be trivially intent aligned".  Such a system may still say things like "Well, I can build you a successor that I think has only a 90% chance of being aligned, but will make you win (e.g. survive) if it is aligned.  Is that what you want?" and people can respond with "yes" -- this is the sort of thing that probably still happens IMO.
4) Alternatively, you might say we're in the "alignment basin" -- I'm not sure what that means, precisely, but I would operationalize it as something like "the AI system is playing a roughly optimal CIRL game".  It's unclear how good of performance that can yield in practice (e.g. it can't actually be optimal due to compute limitations), but I suspect it still leaves significant room for fuck-ups.
5) I'm more interested in the case where alignment is not "perfectly" "solved", and so there are simply clear and obvious opportunities to trade-off safety and performance; I think this is much more realistic to consider.
6) I expect such trade-off opportunities to persist when it comes to assurance (even if alignment is solved), since I expect high-quality assurance to be extremely costly.  And it is irresponsible (because it's subjectively risky) to trust a perfectly aligned AI system absent strong assurances.  But of course, people who are willing to YOLO it and just say "seems aligned, let's ship" will win.  This is also part of the problem...


My main response, at a high level:
Consider a simple model:

  • We have 2 human/AI teams in competition with each other, A and B.
  • A and B both start out with the humans in charge, and then decide whether the humans should stay in charge for the next week.
  • Whichever group has more power at the end of the week survives.
  • The humans in A ask their AI to make A as powerful as possible at the end of the week.
  • The humans in B ask their AI to make B as powerful as possible at the end of the week, subject to the constraint that the humans in B are sure to stay in charge.

I predict that group A survives, but the humans are no longer in power.  I think this illustrates the basic dynamic.  EtA: Do you understand what I'm getting at?  Can you explain what you think is wrong with thinking of it this way?
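(To make the prediction easier to picture, here is a minimal toy sketch that iterates the weekly step of the model above. It is not from the paper or from anyone's comment: the weekly growth rate and the size of B's "humans stay in charge" tax are made-up illustrative numbers. The only point is that a recurring constraint cost compounds, while a one-time cost would not.)

```python
# Toy illustration (made-up numbers): two groups start with equal power.
# Group A hands all decisions to its AI; group B pays a recurring
# "humans stay in charge" tax on its growth every week.
def share_of_A(weeks: int, growth: float = 2.0, weekly_tax: float = 0.05) -> float:
    power_a, power_b = 1.0, 1.0
    for _ in range(weeks):
        power_a *= growth                     # A grows at the unconstrained rate
        power_b *= growth * (1 - weekly_tax)  # B grows slightly slower every week
    return power_a / (power_a + power_b)

for weeks in (4, 12, 52):
    print(weeks, round(share_of_A(weeks), 2))
# Prints roughly 0.55, 0.65, 0.93: with a recurring 5% tax, A's share of total
# power compounds toward 1. A one-time tax would instead leave B's relative
# share fixed after the initial hit.
```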

 

Responding to some particular points below:
 

Sure, but these things don't result in non-human entities obtaining power right? 

Yes, they do; they result in bureaucracies and automated decision-making systems obtaining power.  People were already having to implement and interact with stupid automated decision-making systems before AI came along.

Like usually these are somewhat negative sum, but mostly just involve inefficient transfer of power. I don't see why these mechanisms would on net transfer power from human control of resources to some other control of resources in the long run. To consider the most extreme case, why would these mechanisms result in humans or human appointed successors not having control of what compute is doing in the long run?

My main claim was not that these are mechanisms of human disempowerment (although I think they are), but rather that they are indicators of the overall low level of functionality of the world.  




 

I predict that group A survives, but the humans are no longer in power. I think this illustrates the basic dynamic. EtA: Do you understand what I'm getting at? Can you explain what you think is wrong with thinking of it this way?

I think something like this is a reasonable model but I have a few things I'd change.

Whichever group has more power at the end of the week survives.

Why can't both groups survive? Why is it winner takes all? Can we just talk about the relative change in power over the week? (As in, how much does the power of B reduce relative to A, and is this going to be an ongoing trend or is it a one-time reduction?)

Probably I'd prefer talking about 2 groups at the start of the singularity. As in, suppose there are two AI companies "A" and "B" where "A" just wants AI systems descended from them to have power and "B" wants to maximize the expected resources under control of humans in B. We'll suppose that the government and other actors do nothing for simplicity. If they start in the same spot, does "B" end up with substantially less expected power? To make this more realistic (as might be important), we'll say that "B" has a random lead/disadvantage uniformly distributed between (e.g.) -3 and 3 months so that winner takes all dynamics aren't a crux.

The humans in B ask their AI to make B as powerful as possible at the end of the week, subject to the constraint that the humans in B are sure to stay in charge.

What about if the humans in group B ask their AI to make them (the humans) as powerful as possible in expectation?


Supposing you're fine with these changes, then my claim would be:

  • If alignment is solved, then the AI representing B can powerseek in exactly the same way as the AI representing A does while still deferring to the humans on long-run resource usage and still devoting a tiny fraction of resources toward physically keeping the humans alive (which is very cheap, at least once AIs are very powerful). Thus, the cost for B is negligible and B barely loses any power relative to its initial position. If it is winner takes all, B has almost a 50% chance of winning.
  • If alignment isn't solved, the strategy for B will involve spending a subset of resources on trying to solve alignment. I think alignment is reasonably likely to be practically feasible such that spending a month of delay to work specifically on safety/alignment (over what A does for commercial reasons) might get B a 50% chance of solving alignment or ending up in a (successful) basin where AIs are trying to actively retain human power / align themselves better. (A substantial fraction of this is via deferring to some AI system of dubious trustworthiness because you're in a huge rush. Yes, the AI systems might fail to align their successors, but this still seems like a one-time haircut from my perspective.) So, if it is winner takes all, (naively) B wins in 2 / 6 * 1 / 2 = 1 / 6 of worlds, which is 3x worse than the original 50% baseline (2 / 6 is because they delay for a month; a quick sanity check of this arithmetic is sketched after this list). But, the issue I'm imagining here wasn't gradual disempowerment! The issue was that B failed to align their AIs and people at A didn't care at all about retaining control. (If people at A did care, then coordination is in principle possible, but might not work.)
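(For anyone who wants to check that arithmetic, here is a quick Monte Carlo under exactly the toy assumptions stated above and nothing more: B's lead is uniform on [-3, +3] months, B spends one extra month on alignment, that work succeeds with probability 0.5, and whoever has the larger effective lead wins everything. All parameters are illustrative.)

```python
import random

# Sanity check of the "2/6 * 1/2 = 1/6 of worlds" figure above, under the toy
# assumptions stated in the bullet (uniform lead, one month of delay,
# 50% alignment success, winner takes all by final lead).
def p_B_wins(trials: int = 1_000_000, delay_months: float = 1.0,
             p_align_success: float = 0.5) -> float:
    wins = 0
    for _ in range(trials):
        lead = random.uniform(-3, 3) - delay_months  # B's effective lead after delaying
        if lead > 0 and random.random() < p_align_success:
            wins += 1
    return wins / trials

print(p_B_wins())  # ~0.167, i.e. roughly 1/6
```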

I think a crux is that you think there is a perpetual alignment tax while I think a one time tax gets you somewhere.

At a more basic level, when I think about what goes wrong in these worlds, it doesn't seem very likely to be well described as gradual disempowerment? (In the sense described in the paper.) The existence of an alignment tax doesn't imply gradual disempowerment. A scenario I find more plausible is that you get value drift (unless you pay a long-lasting alignment tax that is substantial), but I don't think the actual problem will be well described as gradual disempowerment in the sense described in the paper.

(I don't think I should engage more on gradual disempowerment for the time being unless someone wants to bid for this or trade favors for this or similar. Sorry.)

Another way to put this is that strategy stealing might not work due to technical alignment difficulties or for other reasons and I'm not sold the other reasons I've heard so far are very lethal. I do think the situation might really suck though with e.g. tons of people dying of bioweapons and with some groups that aren't sufficiently ruthless or which don't defer enough to AIs getting disempowered.

Just writing a model that came to mind, partly inspired by Ryan here.

Extremely good single-single alignment should be highly analogous to "current humans becoming smarter and faster thinkers".

If this happens at roughly the same speed for all humans, then each human is better equipped to increase their utility, but does this lead to a higher global utility? This can be seen as a race between the capability (of individual humans) to find and establish better coordination measures, and their capability to selfishly subvert these coordination measures. I do think it's more likely than not that the former wins, but it's not guaranteed.
Probably someone like Ryan believes most of those failures will come in the form of explicit conflict or sudden attacks. I can also imagine slower erosions of global utility, for example by safe interfaces/defenses between humans becoming unworkable slop into which most resources go.

If this doesn't happen at roughly the same speed for all humans, you also get power imbalance and its consequences. One could argue that differences in resources between humans will grow, in which case this is the only stable state.

If instead of perfect single-single alignment we get the partial (or more taxing) fix I expect, the situation degrades further. Extending the analogy, this would be the smart humans sometimes being possessed by spirits with different utilities, which not only has direct negative consequences but could also complicate coordination once it's common knowledge.

This thought experiment is described in ARCHES FYI.  https://acritch.com/papers/arches.pdf

I think this is correct, but doesn't seem to note the broader trend towards human disempowerment in favor of bureaucratic and corporate systems, which this gradual disempowerment would continue, and hence elides or ignores why AI risk is distinct.
