Andrew Critch

This is Dr. Andrew Critch's professional LessWrong account. Andrew is the CEO of Encultured AI, and works for ~1 day/week as a Research Scientist at the Center for Human-Compatible AI (CHAI) at UC Berkeley. He also spends around ½ day per week volunteering for other projects like the Berkeley Existential Risk Initiative and the Survival and Flourishing Fund. Andrew earned his Ph.D. in mathematics at UC Berkeley studying applications of algebraic geometry to machine learning models. During that time, he cofounded the Center for Applied Rationality and SPARC. Dr. Critch has been offered university faculty and research positions in mathematics, mathematical biosciences, and philosophy, has worked as an algorithmic stock trader at Jane Street Capital’s New York City office, and has served as a Research Fellow at the Machine Intelligence Research Institute. His current research interests include logical uncertainty, open source game theory, and mitigating race dynamics between companies and nations in AI development.

Comments

AGI Ruin: A List of Lethalities

Eliezer, thanks for sharing these ideas so that more people can be on the lookout for failures.  Personally, I think something like 15% of AGI dev teams (weighted by success probability) would destroy the world more-or-less immediately, and I think it's not crazy to think the fraction is more like 90% or higher (which I judge to be your view).

FWIW, I do not agree with the following stance, because I think it exposes the world to more x-risk:

> So far as I'm concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with a less than fifty percent chance of killing more than one billion people, I'll take it.

Specifically, I think a considerable fraction of the remaining AI x-risk facing humanity stems from people pulling desperate (unsafe) moves with AGI to head off other AGI projects.  So, in that regard, I think that particular comment of yours is probably increasing x-risk a bit.  If I were a 90%-er like you, it's possible I'd endorse it, but even then it might make things worse by encouraging more desperate unilateral actions.

That said, overall I think this post is a big help, because it helps to put responsibility in the hands of more people to not do the crazy/stupid/reckless things you're describing here... and while I might disagree on the fraction/probability, I agree that some groups would destroy humanity more or less immediately if they developed AGI.  And, while I might disagree on some of the details of how human extinction eventually plays out, I do think human extinction remains the default outcome of humanity's path toward replacing itself with automation, probably within our lifetimes unfortunately.



“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

John, it seems like you're continuing to make the mistake-according-to-me of analyzing the consequences of a pivotal act without regard for the consequences of the intentions leading up to the act.  The act can't come out of a vacuum, and you can't build a project compatible with the kind of invasive pivotal acts I'm complaining about without causing a lot of problems leading up to the act, including triggering a lot of fear and panic for other labs and institutions.  To summarize from the post title: pivotal act intentions directly have negative consequences for x-safety, and people thinking about the acts alone seem to be ignoring the consequences of the intentions leading up to the act, which is a fallacy.

“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

Eliezer, from outside the universe I might take your side of this bet.  But I don't think it's productive to give up on getting mainstream institutions to engage in cooperative efforts to reduce x-risk.

A propos, I wrote the following post in reaction to positions-like-yours-on-this-issue, but FYI it's not just you (maybe 10% you though?):
https://www.lesswrong.com/posts/5hkXeCnzojjESJ4eB 

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

> Failure mode: When B-cultured entities invest in "having more influence", often the easiest way to do this will be for them to invest in or copy A'-cultured-entities/processes.  This increases the total presence of A'-like processes in the world, which have many opportunities to coordinate because of their shared (power-maximizing) values.  Moreover, the A' culture has an incentive to trick the B culture(s) into thinking A' will not take over the world, but eventually, A' wins.

> In other words, the humans and human-aligned institutions not collectively being good enough at cooperation/bargaining risks a slow slipping-away of hard-to-express values and an easy takeover of simple-to-express values (e.g., power-maximization).

> This doesn't feel like other words to me, it feels like a totally different claim.

Hmm, perhaps this is indicative of a key misunderstanding.

> For example, natural monopolies in the production web wouldn't charge each other marginal costs, they would charge profit-maximizing profits.

Why not?  The third paragraph of the story indicates that "Companies closer to becoming fully automated achieve faster turnaround times, deal bandwidth, and creativity of negotiations."  In other words, at that point it could certainly happen that two monopolies would agree to charge each other lower costs if it benefitted both of them.  (Unless you'd count that as an instance of "charging profit-maximizing costs"?)  The concern is that the subprocesses of each company/institution that get good at (or succeed at) bargaining with other institutions are subprocesses that (by virtue of being selected for speed and simplicity) are less aligned with human existence than the original overall company/institution, and that this less-aligned subprocess grows to take over the institution, while always taking actions that are "good" for the host institution when viewed as a unilateral move in an uncoordinated game (hence passing as "aligned").

At this point, my plan is to try to consolidate what I think are the main confusions in the comments of this post into one or more new concepts to form the topic of a new post.

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

> My prior (and present) position is that reliability meeting a certain threshold, rather than being optimized, is a dominant factor in how soon deployment happens.

> I don't think we can get to convergence on many of these discussions, so I'm happy to just leave it here for the reader to think through.

Yeah, I agree we probably can't reach convergence on how alignment affects deployment time, at least not in this medium (especially since a lot of info about company policies / plans / standards is covered under NDAs), so I also think it's good to leave this question about deployment time as a hanging disagreement node.

> I'm reading this (and your prior post) as bids for junior researchers to shift what they focus on. My hope is that seeing the back-and-forth in the comments will, in expectation, help them decide better.

Yes to both points; I'd thought of writing a debate dialogue on this topic trying to cover both sides, but commenting with you about it is turning out better, I think, so thanks for that!

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

> Both [cultures A and B] are aiming to preserve human values, but within A, a subculture A' develops to favor more efficient business practices (nihilistic power-maximizing) over preserving human values.

> I was asking you why you thought A' would effectively outcompete B (sorry for being unclear). For example, why do people with intrinsic interest in power-maximization outcompete people who are interested in human flourishing but still invest their money to have more influence in the future?


Ah! Yes, this is really getting to the crux of things.  The short answer is that I'm worried about the following failure mode:

Failure mode: When B-cultured entities invest in "having more influence", often the easiest way to do this will be for them to invest in or copy A'-cultured-entities/processes.  This increases the total presence of A'-like processes in the world, which have many opportunities to coordinate because of their shared (power-maximizing) values.  Moreover, the A' culture has an incentive to trick the B culture(s) into thinking A' will not take over the world, but eventually, A' wins.

(Here, I'm using the word "culture" to encode a mix of information subsuming utility functions, beliefs, decision theory, cognitive capacities, and other features determining the general tendencies of an agent or collective.)

Of course, an easy antidote to this failure mode is to have A or B win instead of A', because A and B both have some human values other than power-maximizing.  The problem is that this whole situation is premised on a conflict between A and B over which culture should win, and then the following observation applies:

  • Wei Dai has suggested that groups with unified values might outcompete groups with heterogeneous values since homogeneous values allow for better coordination, and that AI may make this phenomenon more important.

In other words, the humans and human-aligned institutions not collectively being good enough at cooperation/bargaining risks a slow slipping-away of hard-to-express values and an easy takeover of simple-to-express values (e.g., power-maximization).  This observation is slightly different from observations that "simple values dominate engineering efforts" as seen in stories about singleton paperclip maximizers.  A key feature of the Production Web dynamic is not just that it's easy to build production maximizers, but that it's easy to accidentally cooperate on building production-maximizing systems that destroy both you and your competitors.

> This feels inconsistent with many of the things you are saying in your story, but 

Thanks for noticing whatever you think are the inconsistencies; if you have time, I'd love for you to point them out.

> I might be misunderstanding what you are saying and it could be that some argument like Wei Dai's is the best way to translate your concerns into my language.

This seems pretty likely to me.  The bolded attribution to Dai above is a pretty important RAAP in my opinion, and it's definitely a theme in the Production Web story as I intend it.  Specifically, the subprocesses of each culture that are in charge of production-maximization end up cooperating really well with each other in a way that ends up collectively overwhelming the original (human) cultures.  Throughout this, each cultural subprocess is doing what its "host culture" wants it to do from a unilateral perspective (work faster / keep up with the competitor cultures), but the overall effect is destruction of the host cultures (a la Prisoner's Dilemma) by the cultural subprocesses.
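As a minimal toy sketch of the Prisoner's Dilemma structure I mean here (the "retain"/"delegate" action names and the payoff numbers are just illustrative choices for concreteness, not part of the Production Web story itself):

```python
# Toy Prisoner's Dilemma between two "host cultures" deciding whether to keep
# slower human control ("retain") or hand more control to a fast,
# production-maximizing subprocess ("delegate").
# Payoffs are (row player, column player); the numbers are arbitrary.
PAYOFFS = {
    ("retain",   "retain"):   (3, 3),  # both hosts keep control
    ("retain",   "delegate"): (0, 4),  # the delegating side outcompetes the other
    ("delegate", "retain"):   (4, 0),
    ("delegate", "delegate"): (1, 1),  # both lose control: the Production Web outcome
}

def best_response(opponent_action):
    """The unilaterally best move for a host, holding the other host's move fixed."""
    return max(("retain", "delegate"),
               key=lambda a: PAYOFFS[(a, opponent_action)][0])

# Delegating is the unilateral best response to either opponent action...
assert best_response("retain") == "delegate"
assert best_response("delegate") == "delegate"
# ...but mutual delegation is worse for both hosts than mutual retention.
assert PAYOFFS[("delegate", "delegate")] < PAYOFFS[("retain", "retain")]
```

Each host's delegation looks "good" from its own unilateral perspective, which is exactly why the subprocess passes as aligned even though the joint outcome overwhelms both host cultures.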

If I had to use alignment language, I'd say "the production web overall is misaligned with human culture, while each part of the web is sufficiently well-aligned with the human entit(ies) who interact with it that it is allowed to continue operating".  Too low of a bar for "allowed to continue operating" is key to the failure mode, of course, and you and I might have different predictions about what bar humanity will actually end up using at roll-out time.  I would agree, though, that conditional on a given roll-out date, improving E[alignment_tech_quality] on that date is good and complementary to improving E[cooperation_tech_quality] on that date.

Did this get us any closer to agreement around the Production Web story?  Or if not, would it help to focus on the aforementioned inconsistencies with homogeneous-coordination-advantage?

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Thanks for the pointer to grace2020whose!  I've added it to the original post now under "successes in our agent-agnostic thinking".

> But I also think the AI safety community has had important contributions on this front.

For sure, that is the point of the "successes" section.  Instead of "outside the EA / rationality / x-risk meme-bubbles, lots of AI researchers think about agent-agnostic processes" I should probably have said "outside the EA / rationality / x-risk meme-bubbles, lots of AI researchers think about agent-agnostic processes, and to my eye there should be more communication across the boundary of that bubble."

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Thanks for this synopsis of your impressions, and +1 to the two points you think we agree on.

> I also read the post as implying or suggesting some things I'd disagree with:

As for these, some of them are real positions I hold, while some are not:

  • That there is some real sense in which "cooperation itself is the problem."

I don't hold that view.  The closest view I hold is more like: "Failing to cooperate on alignment is the problem, and solving it involves being both good at cooperation and good at alignment."

  • Relatedly, that cooperation plays a qualitatively different role than other kinds of cognitive enhancement or institutional improvement. 

I don't hold the view you attribute to me here, and I agree wholesale with the following position, including your comparisons of cooperation with brain enhancement and improving belief accuracy:

> I think that both cooperative improvements and cognitive enhancement operate by improving people's ability to confront problems, and both of them have the downside that they also accelerate the arrival of many of our future problems (most of which are driven by human activity). My current sense is that cooperation has a better tradeoff than some forms of enhancement (e.g. giving humans bigger brains) and worse than others (e.g. improving the accuracy of people's and institutions' beliefs about the world).

... with one caveat: some beliefs are self-fulfilling, such as beliefs about cooperation/defection. There are ways of improving belief accuracy that favor defection, and ways that favor cooperation.  Plausibly to me, the ways of improving belief accuracy that favor defection are worse than no accuracy improvement at all.  I'm not particularly firm in this view, though; it's more of a hedge.

  • That the nature of the coordination problem for AI systems is qualitatively different from the problem for humans, or somehow is tied up with existential risk from AI in a distinctive way. I think that the coordination problem amongst reasonably-aligned AI systems is very similar to coordination problems amongst humans, and that interventions that improve coordination amongst existing humans and institutions (and research that engages in detail with the nature of existing coordination challenges) are generally more valuable than e.g. work in multi-agent RL or computational social choice.

I do hold this view!  Particularly the bolded part.  I also agree with the bolded parts of your counterpoint, but I think you might be underestimating the value of technical work (e.g., CSC, MARL) directed at improving coordination amongst existing humans and human institutions.

I think blockchain tech is a good example of an already-mildly-transformative technology for implementing radically mutually transparent and cooperative strategies through smart contracts.  Make no mistake: I'm not claiming blockchain tech is going to "save the world"; rather, it's changing the way people cooperate, and is doing so as a result of a technical insight.  I think more technical insights are in order to improve cooperation and/or the global structure of society, and it's worth spending research efforts to find them.

Reminder: this is not a bid for you personally to quit working on alignment!

  • That this story is consistent with your prior arguments for why single-single alignment has low (or even negative) value. For example, in this comment you wrote "reliability is a strongly dominant factor in decisions in deploying real-world technology, such that to me it feels roughly-correct to treat it as the only factor." But in this story people choose to adopt technologies that are less robustly aligned because they lead to more capabilities. This tradeoff has real costs even for the person deploying the AI (who is ultimately no longer able to actually receive any profits at all from the firms in which they are nominally a shareholder). So to me your story seems inconsistent with that position and with your prior argument. (Though I don't actually disagree with the framing in this story, and I may simply not understand your prior position.)

My prior (and present) position is that reliability meeting a certain threshold, rather than being optimized, is a dominant factor in how soon deployment happens.  In practice, I think the threshold by default will not be "Reliable enough to partake in a globally cooperative technosphere that preserves human existence", but rather, "Reliable enough to optimize unilaterally for the benefits of the stakeholders of each system, i.e., to maintain or increase each stakeholder's competitive advantage."  With that threshold, there easily arises a RAAP racing to the bottom on how much human control/safety/existence is left in the global economy.  I think both purely-human interventions (e.g., talking with governments) and sociotechnical interventions (e.g., inventing cooperation-promoting tech) can improve that situation.  This is not to say "cooperation is all you need", any more than I would say "alignment is all you need".

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

> The previous story tends to frame this more as a failure of humanity’s coordination, while this one frames it (in the title) as a failure of intent alignment. It seems like both of these aspects greatly increase the plausibility of the story, or in other words, if we eliminated or made significantly less bad either of the two failures, then the story would no longer seem very plausible.

Yes, I agree with this.

> A natural next question is then which of the two failures would be best to intervene on, that is, is it more useful to work on intent alignment, or to work on coordination?  I’ll note that my best guess is that for any given person, this effect is minor relative to “which of the two topics is the person more interested in?”, so it doesn’t seem hugely important to me.

Yes! +10 to this!  For some reason when I express opinions of the form "Alignment isn't the most valuable thing on the margin", alignment-oriented folks (e.g., Paul here) seem to think I'm saying you shouldn't work on alignment (which I'm not), which triggers a "Yes, this is the most valuable thing" reply.  I'm trying to say "Hey, if you care about AI x-risk, alignment isn't the only game in town", and staking some personal reputation points to push against the status quo where almost everyone x-risk-oriented will work on alignment and almost nobody x-risk-oriented will work on cooperation/coordination or multi/multi delegation.

Perhaps I should start saying "Guys, can we encourage folks to work on both issues please, so that people who care about x-risk have more ways to show up and professionally matter?", and maybe that will trigger less pushback of the form "No, alignment is the most important thing"... 
