# All of Andrew_Critch's Comments + Replies

AGI Ruin: A List of Lethalities

Eliezer, thanks for sharing these ideas so that more people can be on the lookout for failures.  Personally, I think something like 15% of AGI dev teams (weighted by success probability) would destroy the world more-or-less immediately, and I think it's not crazy to think the fraction is more like 90% or higher (which I judge to be your view).

FWIW, I do not agree with the following stance, because I think it exposes the world to more x-risk:

So far as I'm concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering

“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

John, it seems like you're continuing to make the mistake-according-to-me of analyzing the consequences of a pivotal act without regard for the consequences of the intentions leading up to the act.  The act can't come out of a vacuum, and you can't build a project compatible with the kind of invasive pivotal acts I'm complaining about without causing a lot of problems leading up to the act, including triggering a lot of fear and panic for other labs and institutions.  To summarize from the post title: pivotal act intentions directly have negative consequences for x-safety, and people thinking about the acts alone seem to be ignoring the consequences of the intentions leading up to the act, which is a fallacy.

2 johnswentworth 1mo
I see the argument you're making there. I still think my point stands: the strategically relevant question is not whether unilateral pivotal act intentions will cause problems, the question is whether aiming for a unilateral pivotal act would or would not reduce the chance of human extinction much more than aiming for a multilateral pivotal act. The OP does not actually attempt to compare the two, it just lists some problems with aiming for a unilateral pivotal act. I do think that aiming for a unilateral act increases the chance of successfully executing the pivotal act by multiple orders of magnitude, even accounting for the part where other players react to the intention, and that completely swamps the other considerations.
2 Ben Pace 1mo
Just as a related idea, in my mind, I often do a kind of thinking that HPMOR!Harry would call “Hufflepuff Bones”, where I look for ways a problem is solvable in physical reality at all, before considering ethical and coordination and even much in the way of practical concerns.
“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

Eliezer, from outside the universe I might take your side of this bet.  But I don't think it's productive to give up on getting mainstream institutions to engage in cooperative efforts to reduce x-risk.

A propos, I wrote the following post in reaction to positions-like-yours-on-this-issue, but FYI it's not just you (maybe 10% you though?):
https://www.lesswrong.com/posts/5hkXeCnzojjESJ4eB

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

> Failure mode: When B-cultured entities invest in "having more influence", often the easiest way to do this will be for them to invest in or copy A'-cultured-entities/processes.  This increases the total presence of A'-like processes in the world, which have many opportunities to coordinate because of their shared (power-maximizing) values.  Moreover, the A' culture has an incentive to trick the B culture(s) into thinking A' will not take over the world, but eventually, A' wins.

> In other words, the humans and human-aligned institutions no

4 Ben Pace 1y
Sounds great! I was thinking myself about setting aside some time to write a summary of this comment section (as I see it).
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

> My prior (and present) position is that reliability meeting a certain threshold, rather than being optimized, is a dominant factor in how soon deployment happens.

I don't think we can get to convergence on many of these discussions, so I'm happy to just leave it here for the reader to think through.

Yeah I agree we probably can't reach convergence on how alignment affects deployment time, at least not in this medium (especially since a lot of info about company policies / plans / standards are covered under NDAs), so I also think it's good to leave this... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

> Both [cultures A and B] are aiming to preserve human values, but within A, a subculture A' develops to favor more efficient business practices (nihilistic power-maximizing) over preserving human values.

I was asking you why you thought A' would effectively outcompete B (sorry for being unclear). For example, why do people with intrinsic interest in power-maximization outcompete people who are interested in human flourishing but still invest their money to have more influence in the future?

Ah! Yes, this is really getting to the crux of thing... (read more)

Failure mode: When B-cultured entities invest in "having more influence", often the easiest way to do this will be for them to invest in or copy A'-cultured-entities/processes.  This increases the total presence of A'-like processes in the world, which have many opportunities to coordinate because of their shared (power-maximizing) values.  Moreover, the A' culture has an incentive to trick the B culture(s) into thinking A' will not take over the world, but eventually, A' wins.

I'm wondering why the easiest way is to copy A'---why was A' better at... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Thanks for the pointer to grace2020whose!  I've added it to the original post now under "successes in our agent-agnostic thinking".

But I also think the AI safety community has had important contributions on this front.

For sure, that is the point of the "successes" section.  Instead of "outside the EA / rationality / x-risk meme-bubbles, lots of AI researchers think about agent-agnostic processes" I should probably have said "outside the EA / rationality / x-risk meme-bubbles, lots of AI researchers think about agent-agnostic processes, and to my ... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Thanks for this synopsis of your impressions, and +1 to the two points you think we agree on.

I also read the post as also implying or suggesting some things I'd disagree with:

As for these, some of them are real positions I hold, while some are not:

• That there is some real sense in which "cooperation itself is the problem."

I don't hold that view.  The closest view I hold is more like: "Failing to cooperate on alignment is the problem, and solving it involves being both good at cooperation and good at alignment."

• Relatedly, that cooperation plays a qual

Failing to cooperate on alignment is the problem, and solving it involves being both good at cooperation and good at alignment

Sounds like we are on broadly the same page. I would have said "Aligning ML systems is more likely if we understand more about how to align ML systems, or are better at coordinating to differentially deploy aligned systems, or are wiser or smarter or..." and then moved on to talking about how alignment research quantitatively compares to improvements in various kinds of coordination or wisdom or whatever. (My bottom line from doing ... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

The previous story tends to frame this more as a failure of humanity’s coordination, while this one frames it (in the title) as a failure of intent alignment. It seems like both of these aspects greatly increase the plausibility of the story, or in other words, if we eliminated or made significantly less bad either of the two failures, then the story would no longer seem very plausible.

Yes, I agree with this.

A natural next question is then which of the two failures would be best to intervene on, that is, is it more useful to work on intent alignment, or wo

For some reason when I express opinions of the form "Alignment isn't the most valuable thing on the margin", alignment-oriented folks (e.g., Paul here) seem to think I'm saying you shouldn't work on alignment

In fairness, writing “marginal deep-thinking researchers [should not] allocate themselves to making alignment […] cheaper/easier/better” is pretty similar to saying “one shouldn’t work on alignment.”

(I didn’t read you as saying that Paul or Rohin shouldn’t work on alignment, and indeed I’d care much less about that than about a researcher at CHAI argui... (read more)

Perhaps I should start saying "Guys, can we encourage folks to work on both issues please, so that people who care about x-risk have more ways to show up and professionally matter?", and maybe that will trigger less pushback of the form "No, alignment is the most important thing"...

I think that probably would be true.

For some reason when I express opinions of the form "Alignment isn't the most valuable thing on the margin", alignment-oriented folks (e.g., Paul here) seem to think I'm saying you shouldn't work on alignment (which I'm not), which trigg

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

I think that the biggest difference between us is that I think that working on single-single alignment is the easiest way to make headway on that issue, whereas you expect greater improvements from some categories of technical work on coordination

Yes.

(my sense is that I'm quite skeptical about most of the particular kinds of work you advocate

That is also my sense, and a major reason I suspect multi/multi delegation dynamics will remain neglected among x-risk oriented researchers for the next 3-5 years at least.

If you disagree, then I expect the main disagr

3 Paul Christiano 1y
I was asking you why you thought A' would effectively outcompete B (sorry for being unclear). For example, why do people with intrinsic interest in power-maximization outcompete people who are interested in human flourishing but still invest their money to have more influence in the future?

• One obvious reason is single-single misalignment---A' is willing to deploy misaligned AI in order to get an advantage, while B isn't---but you say "their tech is aligned with them" so it sounds like you're setting this aside. But maybe you mean that A' has values that make alignment easy, while B has values that make alignment hard, and so B's disadvantage still comes from single-single misalignment even though A''s systems are aligned?

• Another advantage is that A' can invest almost all of their resources, while B wants to spend some of their resources today to e.g. help presently-living humans flourish. But quantitatively that advantage doesn't seem like it can cause A' to dominate, since B can secure rapidly rising quality of life for all humans using only a small fraction of its initial endowment.

• Wei Dai has suggested [https://www.alignmentforum.org/posts/Sn5NiiD5WBi4dLzaB/agi-will-drastically-increase-economies-of-scale] that groups with unified values might outcompete groups with heterogeneous values since homogeneous values allow for better coordination, and that AI may make this phenomenon more important. For example, if a research-producer and research-consumer have different values, then the producer may restrict access as part of an inefficient negotiation process and so they may be at a competitive disadvantage relative to a competing community where research is shared freely. This feels inconsistent with many of the things you are saying in your story, but I might be misunderstanding what you are saying and it could be that some argument like Wei Dai's is the best way to translate your concer
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

My best understanding of your position is: "Sure, but they will be trying really hard. So additional researchers working on the problem won't much change their probability of success, and you should instead work on more-neglected problems."

That is not my position if "you" in the story is "you, Paul Christiano" :)  The closest position I have to that one is: "If another Paul comes along who cares about x-risk, they'll have more positive impact by focusing on multi-agent and multi-stakeholder issues or 'ethics' with AI tech than if they focus on intent... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Carl, thanks for this clear statement of your beliefs.  It sounds like you're saying (among other things) that American and Chinese cultures will not engage in a "race-to-the-bottom" in terms of how much they displace human control over the AI technologies their companies develop.  Is that right?  If so, could you give me a % confidence on that position somehow?  And if not, could you clarify?

To reciprocate: I currently assign a ≥10% chance of a race-to-the-bottom on AI control/security/safety between two or more cultures this century, ... (read more)

The US and China might well wreck the world by knowingly taking gargantuan risks even if both had aligned AI advisors, although I think they likely wouldn't.

But what I'm saying is really hard to do is to make the scenarios in the OP (with competition among individual corporate boards and the like) occur without extreme failure of 1-to-1 alignment (for both companies and governments). Competitive pressures are the main reason why AI systems with inadequate 1-to-1 alignment would be given long enough leashes to bring catastrophe. I would cosign Vanessa... (read more)

Another (outer) alignment failure story

(I called the story an "outer" misalignment story because it focuses on the---somewhat improbable---case in which the intentions of the machines are all natural generalizations of their training objectives. I don't have a precise definition of inner or outer alignment and think they are even less well defined than intent alignment in general, but sometimes the meaning seems unambiguous and it seemed worth flagging specifically because I consider that one of the least realistic parts of this story.)

I currently can't tell if by "outer alignment failure" you're referring to the entire ecosystem of machines being outer-misaligned, or just each individual machine (and if so, which ones in particular), and I'd like to sync with your usage of the concept if possible (or at least know how to sync with it).

I'm saying each individual machine is misaligned, because each individual machine is searching over plans to find one that leads to an outcome that humans will judge as good in hindsight. The collective behavior of many machines each individually trying ma... (read more)

Another (outer) alignment failure story

Paul, thanks for writing this; it's very much in line with the kind of future I'm most worried about.

For me, it would be super helpful if you could pepper throughout the story mentions of the term "outer alignment" indicating which events-in-particular you consider outer alignment failures.  Is there any chance you could edit it to add in such mentions?  E.g., I currently can't tell if by "outer alignment failure" you're referring to the entire ecosystem of machines being outer-misaligned, or just each individual machine (and if so, which ones in particular), and I'd like to sync with your usage of the concept if possible (or at least know how to sync with it).

5 Paul Christiano 1y
I'd say that every single machine in the story is misaligned, so hopefully that makes it easy :) I'm basically always talking about intent alignment, as described in this post [https://ai-alignment.com/clarifying-ai-alignment-cec47cd69dd6]. (I called the story an "outer" misalignment story because it focuses on the---somewhat improbable---case in which the intentions of the machines are all natural generalizations of their training objectives. I don't have a precise definition of inner or outer alignment and think they are even less well defined than intent alignment in general, but sometimes the meaning seems unambiguous and it seemed worth flagging specifically because I consider that one of the least realistic parts of this story.)
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

I don't understand the claim that the scenarios presented here prove the need for some new kind of technical AI alignment research.

I don't mean to say this post warrants a new kind of AI alignment research, and I don't think I said that, but perhaps I'm missing some kind of subtext I'm inadvertently sending?

I would say this post warrants research on multi-agent RL and/or AI social choice and/or fairness and/or transparency, none of which are "new kinds" of research (I promoted them heavily in my preceding post), and none of which I would call "alignment re... (read more)

1. Any solution to single-single alignment will involve a tradeoff between alignment and capability.
2. If AI systems are not designed to be cooperative, then in a competitive environment each system will either go out of business or slide towards the capability end of the tradeoff. This will result in catastrophe.
3. If AI systems are designed to be cooperative, they will strike deals to stay towards the alignment end of the tradeoff.
4. Given the technical knowledge to design c
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

> The objective of each company in the production web could loosely be described as "maximizing production" within its industry sector.

Why does any company have this goal, or even roughly this goal, if they are aligned with their shareholders?

It seems to me you are using the word "alignment" as a boolean, whereas I'm using it to refer to either a scalar ("how aligned is the system?") or a process ("the system has been aligned, i.e., has undergone a process of increasing its alignment").   I prefer the scalar/process usage, because it seems to me t... (read more)

I think I disagree with you on the tininess of the advantage conferred by ignoring human values early on during a multi-polar take-off.  I agree the long-run cost of supporting humans is tiny, but I'm trying to highlight a dynamic where fairly myopic/nihilistic power-maximizing entities end up quickly out-competing entities with other values, due to, as you say, bargaining failure on the part of the creators of the power-maximizing entities.

Right now the United States has a GDP of >$20T, US plus its NATO allies and Japan >$40T, the PRC >$14T,... (read more)

5 Paul Christiano 1y
3 Paul Christiano 1y
I think this is an indication of the system serving some people (e.g. capitalists, managers, high-skilled labor) better than others (e.g. the median line worker). That's a really important and common complaint with the existing economic order, but I don't really see how it indicates a Pareto improvement or is related to the central thesis of your post about firms failing to help their shareholders. (In general wage labor is supposed to benefit you by giving you money, and then the question is whether the stuff you spend money on benefits you.)

If trillion-dollar tech companies stop trying to make their systems do what they want, I will update that marginal deep-thinking researchers should allocate themselves to making alignment (the scalar!) cheaper/easier/better instead of making bargaining/cooperation/mutual-governance cheaper/easier/better.  I just don't see that happening given the structure of today's global economy and tech industry.

In your story, trillion-dollar tech companies are trying to make their systems do what they want and failing. My best understanding of your position is: "... (read more)

It seems to me you are using the word "alignment" as a boolean, whereas I'm using it to refer to either a scalar ("how aligned is the system?") or a process ("the system has been aligned, i.e., has undergone a process of increasing its alignment").   I prefer the scalar/process usage, because it seems to me that people who do alignment research (including yourself) are going to produce ways of increasing the "alignment scalar", rather than ways of guaranteeing the "perfect alignment" boolean.  (I sometimes use "misaligned" as a boolean due to it

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

These management assistants, DAOs etc are not aligned to the goals of their respective, individual users/owners.

How are you inferring this?  From the fact that a negative outcome eventually obtained?  Or from particular misaligned decisions each system made?  It would be helpful if you could point to a particular single-agent decision in one of the stories that you view as evidence of that single agent being highly misaligned with its user or creator.  I can then reply with how I envision that decision being made even with high single-a... (read more)

How are you inferring this?  From the fact that a negative outcome eventually obtained?  Or from particular misaligned decisions each system made?

I also thought the story strongly suggested single-single misalignment, though it doesn't get into many of the concrete decisions made by any of the systems so it's hard to say whether particular decisions are in fact misaligned.

The objective of each company in the production web could loosely be described as "maximizing production" within its industry sector.

Why does any company have this goal, or eve... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

I hadn't read it (nor almost any science fiction books/stories) but yes, you're right!  I've now added a callback to Autofac after the "factorial DAO" story.  Thanks.

Some AI research areas and their relevance to existential safety

Good to hear!

If I read that term ["AI existential safety"] without a definition I would assume it meant "reducing the existential risk posed by AI." Hopefully you'd be OK with that reading. I'm not sure if you are trying to subtly distinguish it from Nick's definition of existential risk or if the definition you give is just intended to be somewhere in that space of what people mean when they say "existential risk" (e.g. the LW definition is like yours).

Yep, that's my intention.  If given the chance I'd also shift the meaning of "existential risk" a b... (read more)

Some AI research areas and their relevance to existential safety

The OP's conclusion seems to be that social AI alignment should be the main focus. Personally, I'm less convinced. It would be interesting to see more detailed arguments about the above parameters that support or refute this thesis.

Thanks for the feedback, Vanessa.  I've just written a follow-up post to better illustrate a class of societal-scale failure modes ("unsafe robust agent-agnostic processes") that constitutes the majority of the probability mass I currently place on human extinction precipitated by transformative AI advancements (especially ... (read more)

Some AI research areas and their relevance to existential safety

My actual thought process for believing GDPR is good is not that it "is a sample from the empirical distribution of governance demands", but that it initializes the process of governments (and thereby the public they represent) weighing in on what tech companies can and cannot design their systems to reason about, and more specifically the degree to which systems are allowed to reason about humans.  Having a regulatory structure in place for restricting access to human data is a good first step, but we'll probably also eventually want restrictions for ... (read more)

Some AI research areas and their relevance to existential safety

> Third, unless humanity collectively works very hard to maintain a degree of simplicity and legibility in the overall structure of society*, this “alignment revolution” will greatly complexify our environment to a point of much greater incomprehensibility and illegibility than even today’s world.  This, in turn, will impoverish humanity’s collective ability to keep abreast of important international developments, as well as our ability to hold the international economy accountable for maintaining our happiness and existence.

One approach to this pr

Some AI research areas and their relevance to existential safety

It sounds like you may be assuming that people will roll out a technology when its reliability meets a certain level X, so that raising reliability of AI systems has little or no effect on the reliability of deployed systems (namely it will just be X).

Yes, this is more or less my assumption.  I think slower progress on OODR will delay release dates of transformative tech much more than it will improve quality/safety on the eventual date of release.

A more plausible model is that deployment decisions will be based on many axes of quality, e.g. supp

The ground of optimization

This post reminds me of thinking from the 1950s, when people taking inspiration from Wiener's work on cybernetics tried to operationalize "purposeful behavior" in terms of robust convergence to a goal state:

https://heinonline.org/HOL/Page?collection=journals&handle=hein.journals/josf29&id=48&men_tab=srchresults

> When an optimizing system deviates beyond its own rim, we say that it dies. An existential catastrophe is when the optimizing system of life on Earth moves beyond its own outer rim.

I appreciate the direct attention to this process a... (read more)

Syntax, semantics, and symbol grounding, simplified

I may write more on this later, but for now I just want to express exuberance at someone in the x-risk space thinking and writing about this :)

3 Stuart Armstrong 2y
Express, express away _