This is a linkpost for http://arxiv.org/abs/2403.05540

Abstract: In an effort to inform the discussion surrounding existential risks from AI, we formulate Extinction-level Goodhart’s Law as "Virtually any goal specification, pursued to the extreme, will result in the extinction[1] of humanity'', and we aim to understand which formal models are suitable for investigating this hypothesis. Note that we remain agnostic as to whether Extinction-level Goodhart's Law holds or not. As our key contribution, we identify a set of conditions that are necessary for a model that aims to be informative for evaluating specific arguments for Extinction-level Goodhart's Law. Since each of the conditions seems to significantly contribute to the complexity of the resulting model, formally evaluating the hypothesis might be exceedingly difficult. This raises the possibility that whether the risk of extinction from artificial intelligence is real or not, the underlying dynamics might be invisible to current scientific methods.


Together with Chris van Merwijk and Ida Mattsson, we have recently written a philosophy-venue version of some of our thoughts on Goodhart's Law in the context of powerful AI [link].[2] This version of the paper has no math in it, but it attempts to point at one aspect of "Extinction-level Goodhart's Law" that seems particularly relevant for AI advocacy – namely, that the fields of AI and CS would have been unlikely to come across evidence of AI risk, using the methods that are popular in those fields, even if the law did hold in the real world.

Since commenting on link-posts is inconvenient, I split off some of the ideas from the paper into the following separate posts:

We have more material on this topic, including writing with math[3] in it, but this is mostly not yet in a publicly shareable form. The exception is the post Extinction-level Goodhart's Law as a Property of the Environment (which is not covered by the paper). If you are interested in discussing anything related to this, definitely reach out.

  1. ^

    A common comment is that the definition should also include outcomes that are similarly bad or worse than extinction. While we agree that such definition makes sense, we would prefer to refer to that version as "existential", and reserve the "extinction" version for the less ambiguous notion of literal extinction.

  2. ^

    As an anecdote, it seems worth mentioning that I tried, and failed, to post the paper to arXiv --- by now, it has been stuck there with "on hold" status for three weeks. Given that the paper is called "Existential Risk from AI: Invisible to Science?", there must be some deeper meaning to this. [EDIT: After ~2 months, the paper is now on arXiv.]

  3. ^

    Or rather, it has pseudo-math in it. By which I mean that it looks like math, but it is built on top of vague concepts such as "optimisation power" and "specification complexity". And while I hope that we will one day be able to formalise these, I don't know how to do so at this point.

New Comment
6 comments, sorted by Click to highlight new comments since: Today at 5:04 AM

I think literal extinction from AI is a somewhat odd outcome to study as it heavily depends on difficult to reason about properties of the world (e.g. the probability that Aliens would trade substantial sums of resources for emulated human minds and the way acausal trade works in practice).

For more discussion see here, here, and here.

[I haven't read the sequence, I'm just responding the focus on extinction. In practice, my claims in this comment mean that I disagree with the arguments under "Human extinction as a convergently instrumental subgoal".]

I think literal extinction from AI is a somewhat odd outcome to study as it heavily depends on difficult to reason about properties of the world (e.g. the probability that Aliens would trade substantial sums of resources for emulated human minds and the way acausal trade works in practice).

That seems fair. For what it's worth, I think the ideas described in the sequence are not sensitive to what you choose here. The point isn't as much to figure out whether the particular arguments go through or not, but to ask which properties must your model have, if you want to be able to evaluate those arguments rigorously.

I think literal extinction from AI is a somewhat odd outcome to study as it heavily depends on difficult to reason about properties of the world (e.g. the probability that Aliens would trade substantial sums of resources for emulated human minds and the way acausal trade works in practice).

What would you suggest instead? Something like [50% chance the AI kills > 99% of people]?

(My current take is that for a majority reader, sticking to "literal extinction" is the better tradeoff between avoiding confusion/verbosity and accuracy. But perhaps it deserves at least a footnote or some other qualification.)

I would say "catastrophic outcome (>50% chance the AI kills >1 billion people)" or something and then footnote. Not sure though. The standard approach is to say "existential risk".

Quick reaction:

  • I didn't want to use the ">1 billion people" formulation, because that is compatible with scenarios where a catastrophe or an accident happens, but we still end up controling the future in the end.
  • I didn't want to use "existential risk", because that includes scenarios where humanity survives but has net-negative effects (say, bad versions of Age of Em or humanity spreading factory farming across the stars).
  • And for the purpose of this sequence, I wanted to look at the narrower class of scenarios where a single misaligned AI/optimiser/whatever takes over and does its thing. Which probably includes getting rid of literally everyone, modulo some important (but probably not decision-relevant?) questions about anthropics and negotiating with aliens.

Maybe "full loss-of-control to AIs"? Idk.