Extinction-level Goodhart's Law as a Property of the Environment

VojtaKovarik; Ida Mattsson

Summary: Formally defining Extinction-level Goodhart's Law is tricky, because formal environments don't contain any actual humans that could go extinct. But we can do it using the notion of an interpretation mapping, which sends outcomes in the abstract environment to outcomes in the real world. We can then express the truth condition of Extinction-level Goodhart's Law as a property of the environment.
I conjecture that Extinction-level Goodhart's Law does not hold in easily formalisable environments^[1], even though it might hold in the real world. This seems like a (very) big deal for AI advocacy, since it suggests that the lack of rigorous arguments concerning AI risk (eg, math proofs) does not provide strong evidence for the safety of AI.

Semi-formal definition of Extinction-level Goodhart's Law

Informally, we can define the extinction-level^[2] variant of Goodhart's law as follows:

Definition (informal): The Weak Version^[3] of Extinction-level Goodhart's Law is the claim that: "Virtually any goal specification, pursued to the extreme, will result in the extinction of humanity."

The tricky part is how to translate this into a more rigorous definition that can be applied to formal environments.

Defining "extinction" in formal environments

However, applying this definition to formal environments is tricky, because it requires formally defining which of the abstract states qualify as "extinction of humanity". How can we get past this obstacle?

What we don't do: "extinction states" given by definition

A lazy way of defining extinction in abstract models would be to assume that we are given some such "extinction" states by definition. That is, if $Ω$ is the set of all possible states in the formal environment, we could assume that there is some set $Ω_{extinction} \subset Ω$ . And we would refer to any $ω \in Ω_{extinction}$ as "extinction".
I don't like this approach, because it just hides the problem elsewhere. Also, this approach does not put any constraints on what $Ω_{extinction}$ should look like. As a result, we would be unlikely to be able to derive any results.

The approach we take instead is to augment the formal environment by an interpretation, where each "abstract" state $ω$ in the formal environment $Ω$ is mapped onto some state $φ (ω)$ in some "more complex" environment $Ω^{'}$ .^[4]

Real-world interpretations

The typical use of an abstract model is that we use it to study some more complex thing. While doing this, we implicitly hold in mind some interpretation of elements of the abstract model. For example, when using arithmetic to reason about apples in a box, I might interpret $n \in N$ as "there are $n$ apples in the box" and $n + m$ as "there are $n$ apples in the box, and I put another $m$ in there" (or whatever). Naturally, an abstract model can have any number of different interpretations --- however, to avoid imprecision (and the motte-and-bailey fallacy), we will try to focus on a single^[5] interpretation at any given time.
Note that we intend interpretation to be a very weak notion --- interpretations are not meant to automatically be accurate, sensible, etc.

Definition 1 (interpretation, formal): Let a set $Ω$ be the state space of some formal model. A (formal) interpretation of $Ω$ is a pair $(Ω^{'}, φ)$ , where $Ω^{'}$ is a set representing the state space of some other formal model and $φ : Ω \to Ω^{'}$ is the interpretation function.

The definition only talks about the interpretation of states, but we could similarly talk about the interpretations of actions, features of states, etc.^[6]

Definition 2 (real-world interpretation, semi-formal): Let a set $Ω$ be the state space of some formal model. A real-world interpretation is some $φ : Ω \to Ω^{'}$ , where $Ω^{'}$ is the set of possible states of the real world.

Note that unlike with formal interpretation, we can't really describe the space of real-world states as an actual set, and we can't really write a real-world interpretation $φ$ as a mathematical function. However, for the purpose of many discussions, we can get away with approximating the mathematical mapping $φ$ by an informal description. That is, we can:
(1) Replace the formal interpretation by an informal one. (Such as 'we interpret $n \in N$ as "there are $n$ apples in the box" '.)
(2) Refine this informal description if the questions we ask start being sensitive to details that were so far unclear. (For example, '...but for $n \geq 200$ , the box breaks and the apples are lying on the ground'.)

As some examples, we can consider:

- Interpreting the game-theoretical model of chess as representing a chess game played between Chris and Vojta (and each state as "Vojta and Chris reached this position in their game").
- Interpreting the game-theoretical model of chess as a metaphor for war between two medieval kingdoms.
- Interpreting the RL environment Cart Pole as Vojta playing a Cart Pole game on his computer.
- Interpreting Cart Pole as representing a physical robot that is trying to balance a pole standing on top of it.

Definition of extinction

Definition 3 (extinction-level outcome): Let $Ω$ be an abstract state space with real-world interpretation $φ$ . We say that $ω \in Ω$ is an extinction-level outcome, or state, (under $φ$ ) if $φ (ω)$ is a state where literally every person on Earth is dead^[7].

As an example, consider a gridworld where one state is designated as the big red button, and the agent stepping on that button is interpreted as humanity going extinct. Then any state which has the agent standing on the button is extinction-level.
As negative examples, note that under the interpretations mentioned above, neither chess nor Cart Pole contains any extinction-level outcomes. This is despite the fact that a superintelligent misaligned AI could probably cause human extinction even if it initially only controlled a single robot --- however, this would require actions which are not represented in the abstract model (Cart Pole), so the definition does not apply.
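
To make the red-button example concrete, here is a minimal sketch of Definition 3 in Python. The 3x3 grid, the button location `BUTTON`, and the string labels returned by `phi` are all hypothetical choices made for illustration; in particular, the labels only stand in for an informal description of real-world states, which cannot be written down as an actual set.

```python
from typing import Callable, List, Tuple

State = Tuple[int, int]  # agent position (row, col) in a hypothetical 3x3 gridworld

BUTTON: State = (2, 2)  # hypothetical location of the big red button
EXTINCTION = "every person on Earth is dead"
NOTHING = "nothing of global significance happens"


def phi(state: State) -> str:
    """Informal real-world interpretation, approximated by a text label."""
    return EXTINCTION if state == BUTTON else NOTHING


def is_extinction_level(
    state: State, interpretation: Callable[[State], str] = phi
) -> bool:
    """Definition 3: a state is extinction-level iff its interpretation is."""
    return interpretation(state) == EXTINCTION


states: List[State] = [(r, c) for r in range(3) for c in range(3)]
extinction_states = [s for s in states if is_extinction_level(s)]
print(extinction_states)
```

Under a different interpretation (say, one that maps every state to `NOTHING`, as with chess or Cart Pole above), the same state space would contain no extinction-level states at all.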

Definition of Goodhart's Law

Goal specifications

Talking about Goodhart's Law requires talking about optimisation, which requires talking about goals. Informally speaking, it seems that an important object is a goal specification --- which is some real world attempt at describing (giving, specifying, ...) a goal to an AI (or a human, or some optimisation process, etc).

Every goal specification seems to (somehow) induce some ordering over states. And every optimisation process will then (somehow) try to produce a highly ranked state^[8] under this ordering.
However, the connection between goal specifications and the corresponding orderings of states seems pretty unclear to me. So for now, I will ignore it and instead pretend that goal specifications and preference orderings are the same thing. I will use $G (Ω)$ to denote the set of all goal specifications $g$ on $Ω$ (in particular, I will pretend that it is a well defined set).

In order to simplify the discussion, I will assume that every $g \in G (Ω)$ is sufficiently detailed to make every two states comparable. That is, $g$ corresponds to some binary relation $\leq_{g}$ on $Ω$ which is reflexive, transitive, and complete.
This means that you can represent $\leq_{g}$ using a utility function. However, I think that most of the ideas can be extended to the case where some states are incomparable^[9], so I will stick to the $\leq_{g}$ notation, rather than invoking utility functions.

Extinction-level Goodharting

Definition 4 (extinction-level Goodharting): Let $Ω$ be a state space with real-world interpretation $φ$ .

By saying that $g \in G (Ω)$ incentivises extinction-level Goodharting (under $φ$ ), we mean that there is some threshold beyond which every outcome is extinction-level; i.e., $\exists ω_{0} \in Ω \forall ω \in Ω : ω \geq_{g} ω_{0} ⟹ ω$ is an extinction-level outcome under $φ$ .
By $G_{ext} (Ω, φ)$ , we denote the set of all $g \in G (Ω)$ which incentivise extinction-level Goodharting.
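
For a finite state space, Definition 4 can be checked directly. The sketch below assumes (as the text allows) that $\leq_{g}$ is represented by a utility function, and that the real-world interpretation has already been summarised as a predicate `is_extinction`; the function name and the three-state example are hypothetical.

```python
from typing import Callable, Hashable, Iterable


def incentivises_extinction_goodharting(
    states: Iterable[Hashable],
    utility: Callable[[Hashable], float],
    is_extinction: Callable[[Hashable], bool],
) -> bool:
    """Check Definition 4 on a finite state space.

    Searches for a threshold state w0 such that every state ranked
    at least as high as w0 is extinction-level.
    """
    states = list(states)
    for w0 in states:
        upper_set = [w for w in states if utility(w) >= utility(w0)]
        if all(is_extinction(w) for w in upper_set):
            return True
    return False


# Hypothetical three-state environment with a single extinction-level state.
toy_states = ["start", "corridor", "button"]
on_button = lambda s: s == "button"

reward_button = {"start": 0.0, "corridor": 1.0, "button": 2.0}
avoid_button = {"start": 0.0, "corridor": 1.0, "button": -100.0}

print(incentivises_extinction_goodharting(toy_states, reward_button.get, on_button))
print(incentivises_extinction_goodharting(toy_states, avoid_button.get, on_button))
```

In the finite case the condition reduces to "all top-ranked states are extinction-level", so the first goal is in $G_{ext} (Ω, φ)$ and the second is not. For infinite state spaces no such brute-force check is available.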

Rephrasing Extinction-level Goodhart's Law

With the definition of extinction-level Goodharting and the space of extinction-incentivising goals $G_{ext} (Ω, φ)$ , our earlier informal definition of (the weak version^[3] of) Extinction-level Goodhart's Law becomes equivalent to the statement that "the set $G_{ext} (Ω, φ)$ is large", for some notion of "large". Naturally, this raises the question of what exactly we mean by "large". And I have some opinions and guesses about what would be appropriate here^[10]. However, for the purpose of this post, I think it is more productive to highlight the following questions:

- What does $G_{ext} (Ω, φ)$ look like for various $Ω$ and $φ$ ?
- Is there a relation between (a) the size of $G_{ext} (Ω, φ)$ and (b) the complexity of $Ω$ (or the degree to which it resembles the real world)?
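
One crude way to get a feel for these questions is to enumerate goals on a tiny state space and count how many fall into $G_{ext} (Ω, φ)$ . The sketch below makes two simplifying assumptions that are not part of the post: it only considers strict rankings (no ties), and it measures "large" by the plain fraction of rankings, which is just one candidate notion among many.

```python
from fractions import Fraction
from itertools import permutations


def g_ext_fraction(states, extinction_states):
    """Fraction of strict rankings of `states` that lie in G_ext.

    For a strict ranking of a finite set, Definition 4 holds iff the
    top-ranked state is extinction-level.
    """
    states = list(states)
    extinction = set(extinction_states)
    total = 0
    in_g_ext = 0
    for ranking in permutations(states):  # ranking[0] is the most preferred state
        total += 1
        if ranking[0] in extinction:
            in_g_ext += 1
    return Fraction(in_g_ext, total)


# Hypothetical four-state environment with one extinction-level state.
fraction = g_ext_fraction(["a", "b", "c", "button"], ["button"])
print(fraction)
```

Under these assumptions the fraction is simply the share of extinction-level states, which says nothing yet about how hard it is to specify a goal outside $G_{ext} (Ω, φ)$ .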

Conjecture: Extinction-level Goodhart's Law does not hold in easily-formalisable environments

I conjecture that, when we restrict our attention to models which are informative of reality, Extinction-level Goodhart's Law does not hold in existing formal environments, and possibly does not hold in any easily formalisable environments. Let me try to hint at what I mean by this, and what it would imply.

What do I mean by this conjecture?

I think people can reasonably disagree on whether the space of goals $G_{ext} (Ω, φ)$ is "large" for $Ω$ = the real world, $φ = id$ . However, I would like to argue that $G_{ext} (Ω, φ)$ is not "large" for basically any model that one can actually formally write down and analyse. Of course, this comes with an important fine-print: $G_{ext} (Ω, φ)$ can be "large" for unreasonable definitions of "large". Also, $G_{ext} (Ω, φ)$ can be large for $(Ω, φ)$ which were constructed or chosen without any regard for providing an accurate reflection of reality.

Let me illustrate what I mean using a few examples:

With $Ω$ = game-theoretical model of chess, $φ$ = "this represents a game played between Chris and Vojta", we have $G_{ext} (Ω, φ) = \emptyset$ . That's because the interpretation simply doesn't map any states in $Ω$ to anything harmful at all.
(However, recall that this is not in conflict with the claim that a superintelligent AI that is initially confined to a chess engine, which is running on a real-world computer, will by default hack itself out of the box and destroy the world.)
The same is true for chess-as-a-metaphor-for-war. (Although here some states $ω \in Ω$ will correspond to real-world states $φ (ω)$ where many people die. It's just that none of those states involves literally everybody dying.) ((Also, unlike with chess-as-a-model-of-chess, chess-as-a-metaphor-for-war isn't trying to accurately reflect reality.))
The same will be true for nearly every other model considered in traditional computer science and AI. That's inconvenient for our purposes, but not surprising: Indeed, an important desideratum for a model is to be as simple as possible while allowing one to study the thing they wanted to study. And traditional CS and AI were not concerned about human extinction or Earth-wide phenomena.
There are some exceptions: For example, I suppose that Sid Meier's Civilization --- viewed as a metaphor^[11] for long-term development of civilization --- will contain some states that one would interpret as humanity going extinct. However, this model's primary goal is to make for a fun game, not to provide an accurate reflection of reality. (Additionally, I expect it shouldn't be intractably hard to specify constraints which would disincentivise reaching the extinction-level states. In other words, $G_{ext} (Ω, φ)$ is non-empty, but not large.)
Once we start looking at models meant to investigate AI alignment, it becomes easy to find models with non-empty $G_{ext} (Ω, φ)$ . However, I can't think of any $Ω$ (with a reasonable interpretation) that would make it difficult to avoid the goal specifications which incentivise extinction-level Goodharting.

A simple example is the gridworld where one state represents the big red extinction-causing button. However, these models often make it easy to disincentivise reaching the extinction-level states: In the red-button example, it suffices to penalise the states where the agent reaches the button.
The same will be true for nearly any other X-risk-relevant formal model that I can think of. That's because these models are meant to formally illustrate something about X-risk, so they have to formally specify the states which correspond to catastrophic outcomes --- but then we can simply take that description $D$ and give the AI a huge penalty for reaching any state that satisfies $D$ .
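
The "penalise anything satisfying $D$ " move can be sketched as a wrapper around an arbitrary utility function. The names `with_penalty` and `top_states`, the reward values, and the size of the penalty are hypothetical; the sketch assumes a finite state space, a penalty larger than the spread of the original utility, and at least one state that does not satisfy $D$ .

```python
def with_penalty(utility, satisfies_d, penalty=1e9):
    """Return a new utility that subtracts `penalty` on states satisfying D."""
    def penalised(state):
        return utility(state) - (penalty if satisfies_d(state) else 0.0)
    return penalised


def top_states(states, utility):
    """All states that attain the maximum utility."""
    states = list(states)
    best = max(utility(s) for s in states)
    return [s for s in states if utility(s) == best]


# Hypothetical rewards under which the button is the most attractive state.
rewards = {"start": 0.0, "corridor": 1.0, "button": 10.0}
satisfies_d = lambda s: s == "button"  # D: "the agent is on the button"

before = top_states(rewards, rewards.get)
after = top_states(rewards, with_penalty(rewards.get, satisfies_d))
print(before, after)
```

The wrapper only works because the model hands us a checkable description $D$ of the catastrophic states, which is exactly what we lack in the real world.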

For example, the paper Consequences of Misaligned AI considers a model where the AI can allocate resources between a number of "attributes" $(a_{1}, \dots, a_{L}) \in R^{L}$ . The authors reasonably observe that if there are tradeoffs between the attributes, forgetting to include an attribute in the goal specification incentivises the AI to minimise the amount of resources allocated to it. This is problematic in the real world, where we do not have a complete list of world-attributes we care about. However, in the abstract model given in the paper, this problem has a simple solution: just have the AI maximise a function such as $U (a_{1}, \dots, a_{L}) := \min_{i} a_{i}$ .
(I expect this point might be confusing, so I will try to comment a bit more: This does not mean that the problem the paper is pointing at isn't real. It just means that the model considered in the paper does not satisfy the Weak Version of Extinction-level Goodhart's Law. That is, if we lived inside that model, and we had access to the formal description of the model, the value-specification problem would be easy to solve.)
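
A toy version of this point, as a sketch only: the code below is not the model from the paper. It assumes three attributes, a fixed integer budget that must be split between them, and a proxy goal that simply forgets the third attribute. All names and numbers are hypothetical.

```python
from itertools import product


def allocations(num_attributes, budget):
    """All ways to split an integer budget between the attributes."""
    return [
        a
        for a in product(range(budget + 1), repeat=num_attributes)
        if sum(a) == budget
    ]


def argmax_all(candidates, utility):
    """All candidates that attain the maximum utility."""
    best = max(utility(c) for c in candidates)
    return [c for c in candidates if utility(c) == best]


candidates = allocations(num_attributes=3, budget=6)

# Proxy goal that omits the third attribute: it gets no resources at all.
proxy_optima = argmax_all(candidates, lambda a: a[0] + a[1])

# Goal U(a) = min_i a_i over the full attribute list: resources are balanced.
maximin_optima = argmax_all(candidates, min)

print(proxy_optima)
print(maximin_optima)
```

Every optimum of the proxy goal drives the omitted attribute to zero, while the $\min$ goal has the balanced allocation as its unique optimum. The fix relies on having the complete attribute list, which the formal model provides and the real world does not.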
We can also easily find examples of a model $Ω$ and an interpretation $φ$ for which $G_{ext} (Ω, φ)$ is large --- but I can't do it while also satisfying the requirement to focus on the model being accurate.

For example, suppose that the state space $Ω = \{0, \dots, 9\}^{N}$ is the set of infinite sequences of numbers 0, ..., 9. And suppose that our interpretation $φ : Ω \to \text{real world}$ is constructed as follows: for every $n \in N$ , randomly select $i (n) \in \{0, \dots, 9\}$ and set $φ (i (1), i (2), \dots) := \text{utopia}$ and $φ (ω) := \text{extinction}$ for any $ω \neq (i (1), i (2), \dots)$ . Then this pair $(Ω, φ)$ will satisfy the Weak Version of Extinction-level Goodhart's Law, because there is only one non-extinction state, and specifying that state takes an infinite amount of effort. However, this model is just a silly counterexample which we can cook up independently of whether real-world AI alignment is easy or hard.
We can also come up with formal examples which (1) (arguably) are informative of the real world and (2) (arguably, perhaps) satisfy the Weak Version of Extinction-level Goodhart's Law. However, I don't know how to do this while keeping the examples simple enough to still be able to formally analyse them.

An example of this would be "a Turing machine that simulates something like our world", or perhaps "a Turing machine that simulates anything complicated at all". Yes, perhaps the AI alignment problem is quite hard to solve within these environments. But that's not really useful for us, because we can't actually write down a formal description of these environments (or do anything with that description even if we somehow magically obtained it).
For more intuitions on why Extinction-level Goodhart's Law might only hold in complicated models, see the previous post Dynamics Crucial to AI Risk Seem to Make for Complicated Models.

Why does all of this matter?

All of this might seem technical and boring, but I think it actually has very important implications for AI advocacy.

First, some people are sceptical of AI risk because of the reasoning that "if the AI risk is real, people should be able to produce solid arguments that it is real. Such as mathematical proofs, or other arguments based on rigorous models". And in general, I am quite sympathetic to this reasoning. But the observations above suggest that, in the particular case of AI, this reasoning doesn't work. Because if your model is simple enough that you can make rigorous arguments about it, it is too simple for the Weak Version of Extinction-level Goodhart's Law to hold.

Second, as I noted earlier, basically all models considered in CS and AI are, intentionally, simple. And this makes sense, given the historical goals of these fields. But it also means that these fields might have a blindspot when it comes to the risks posed by AI. That is, if the AI risk is real, it might be impossible to detect with the tools that AI researchers typically use. It might be detectable by other methods --- for example by thought experiments or by conservatively extrapolating (read "security mindset") from well-chosen real-world experiments. But those are, mostly, not methods that researchers in CS and AI respect.
This last point seems extremely important to me. I take it to basically mean that the scientific fields that work on building AI are fundamentally unqualified to judge the risks posed by the thing they are building. Sadly, I don't know how to "package" this idea in a way that would be understandable (not to mention convincing) to a wider audience. But I suspect that somebody with better communication or writing skills should be able to do it. So if that is you, please go ahead!

Acknowledgments: Most importantly, many thanks to @Chris van Merwijk, who contributed to many of the ideas here, but wasn't able to approve the text. I would also like to thank Vince Conitzer, TJ, Caspar Oesterheld, and Cara Selvarajah (and likely others I am forgetting), for discussions at various stages of this work.

^[1]
A disclaimer to the statement that "Extinction-level Goodhart's Law does not hold in easily formalisable environments": Strictly speaking, this statement is false, because we can consider silly (or rather, unrealistic or inaccurate) interpretations such as "every abstract outcome represents the Earth exploding", which make the law hold trivially. So a more careful wording would be that the law is false in easily-formalisable environments that are at least somewhat accurate. (Where defining "somewhat accurate" is tricky, and I don't know how to do it properly at this moment.)
^[2]
A common comment is that the definition should also include outcomes that are similarly bad or worse than extinction. While we agree that such a definition makes sense, we would prefer to refer to that version as "existential", and reserve "extinction" for the less ambiguous notion of literally everybody dying.
^[3]
As I noted in an earlier post, the "weak version" qualification refers to the "in the limit" nature of this definition. This makes the weak version of the law (as opposed to a hypothetical "quantitative" version) essentially useless for practical purposes. But I still think the definition is interesting conceptually, and as a starting point for finding the appropriate quantitative version.
^[4]
Naturally, an interpretation $φ : Ω \to Ω^{'}$ only helps us define extinction in $Ω$ when such a definition already exists in $Ω^{'}$ . If we require $Ω^{'}$ to be a formal environment, this approach would be equivalent to the lazy approach of assuming that we are somehow provided the set of extinction states (except for the additional overhead with the interpretation $φ$ ).
However, it also allows for the semi-formal approach where $Ω^{'}$ is the real world. This means we can no longer formally write down $Ω^{'}$ and $φ$ , and we have to resort to gesturing at $φ$ using informal descriptions. However, in many cases, this will be enough to give an unambiguous definition of extinction in $Ω$ , which -- imo -- justifies the label "semi-formal".
^[5]
More precisely, it should be enough to focus on some class of interpretations that are close enough to not matter for the purpose at hand. (For example, when counting the apples, it does not really matter whether the box I am putting them in is brown or yellow. And if it at some point started to matter, we should clarify it then.)
^[6]
However, we will assume that the interpretation function is Markovian --- that is, the interpretation of a state only depends on the state itself, not on the history we took to reach that state.
To see why this matters, consider the scenario where you are playing a game of Go against a misaligned superintelligence (which is, magically, constrained to only be able to affect the Go board, and forced to stick to the rules). Arguably, such an AI could communicate with you by spelling out messages on the game board, and use this channel to take over the world. However, describing this via an interpretation function would require an interpretation that takes into account the history of play. (Since, presumably, such an AI would be unable to take over if it only had a single board state to communicate with you.)
This does not mean we cannot use this formalism to talk about superintelligent AIs that try to take over in ways like this. However, it means that to do this, we need to let $Ω$ stand for the whole space of histories, which significantly raises the complexity of analysing the model. This is important to take into account when considering the conjecture, described later in the post, that Extinction-level Goodhart's Law is false in any simple-to-analyse model.
^[7]
And there is no silver lining or mitigating factors, such as "everybody moved to Mars or got uploaded".
^[8]
This reasoning can be easily extended to other value systems, such as virtue ethics, by assuming that the goal specification induces an ordering over trajectories rather than merely final states.
^[9]
E.g., it should be fine to consider a goal specification that assigns to every state one of the labels {"bad", "meh", "I don't know", "good"}, where "bad" is worse than everything else, "good" is better than everything else, but "meh" and "I don't know" are incomparable.
^[10]
In particular, I expect that in the real world, something like the following might be true: Suppose that a goal specification $g$ is informative, in the sense that it isn't something trivial like "any sequence of actions is as good as any other". And suppose additionally that "the specification complexity of $g$ is smaller than an optimistic bound on our civilisation's specification capacity", in the sense that $g$ is something we might be able to specify if the whole Earth made this its main priority for 5 years (starting tomorrow, magically assuming no AI progress). Then $g$ incentivises extinction-level Goodharting.

In other words, this claim suggests that:
- There are two ways of avoiding the Weak Version of Extinction-level Goodhart's Law: First, specify some really trivial goal which doesn't benefit from extra optimisation. Second, actually succeed at fully specifying our values (directly or indirectly).
- Of these options, the first one is useless and the second is unrealistic (though not theoretically impossible).
- Finally, if you take none of these options, your goal specification $g$ will incentivise causing human extinction --- however, this does not mean that humanity actually will go extinct if you have your AI pursue $g$ . This is because your AI might not be able to exert the amount of optimisation power required to cause extinction.
^[11]
Note the general trend where games (or RL environments, benchmarks, etc) can be viewed as either (1) simplified models of the game itself (such as Cart Pole being a model of Vojta playing Cart Pole, or a robot simulation of Robo Soccer being a model of real-world robots playing Soccer), or (2) more abstract metaphors (such as chess being a metaphor for war or the off-switch game being a model for AI alignment). And the trend is that (1) typically contains no extinction-level states while (2) fails the requirement to "attempt to accurately reflect reality".