Catastrophic Regressional Goodhart: Appendix

Thomas Kwa; Drake Thomas

This is a more technical followup to the last post, putting precise bounds on when regressional Goodhart leads to failure or not. We'll first show conditions under which optimization for a proxy fails, and then some conditions under which it succeeds. (The second proof will be substantially easier.)

Related work

In addition to the related work sections of the previous post, this post makes reference to the textbook An Introduction to Heavy-Tailed and Subexponential Distributions, by Foss et al. Many similar results about random variables are present in the textbook, though we haven't seen this post's results elsewhere in the literature. We mostly adopt their notation here, and cite a few helpful lemmas.

Main result: Conditions for catastrophic Goodhart

Suppose that $X$ and $V$ are independent real-valued random variables. We're going to show, roughly, that if

$X$ is subexponential (a slightly stronger property than being heavy-tailed).
$V$ has lighter tails than $X$ by more than a linear factor, meaning that the ratio of the tails of $V$ and the tails of $X$ grows superlinearly.^[1]

then ${lim}_{t \to \infty} E [V | X + V \geq t] = E [V]$ .

Less formally, we're saying something like "if it requires relatively little selection pressure on $X$ to get more of $X$ and asymptotically more selection pressure on $V$ to get more of $V$ , then applying very strong optimization towards $X + V$ will not get you even a little bit of optimization towards $V$ - all the optimization power will go towards $X$ ."

Proof sketch and intuitions

The conditional expectation $E [V | X + V > t]$ is given by $\frac{\int_{- \infty}^{\infty} v f_{V} (v) Pr (X > t - v)}{\int_{- \infty}^{\infty} f_{V} (v) Pr (X > t - v)}$ ,^[2] and we divide the integral in the numerator into 4 regions, showing that each region's effect on the conditional expectation of V is similar to that of the corresponding region in the unconditional expectation $E [V]$ .
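To make the ratio-of-integrals formula concrete, here is a minimal numerical sketch. It assumes a particular pair of distributions that are not from the post: $X$ is Pareto with scale 1 and tail $x^{-2}$ (a standard subexponential example), and $V$ is standard normal, so $E [V] = 0$ . The function names, the truncation range, and the midpoint rule are all illustrative choices.

```python
import math

def pareto_tail(x, alpha=2.0):
    """Pr(X > x) for a Pareto variable with scale 1 (assumed example of a heavy tail)."""
    return 1.0 if x < 1.0 else x ** (-alpha)

def normal_pdf(v):
    """Density of a standard normal V (assumed example), so E[V] = 0."""
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def conditional_mean(t, lo=-12.0, hi=12.0, n=20000):
    """Midpoint-rule estimate of E[V | X + V > t] as a ratio of two integrals.

    The range [lo, hi] truncates the normal density where it is negligible.
    """
    dv = (hi - lo) / n
    num = den = 0.0
    for i in range(n):
        v = lo + (i + 0.5) * dv
        w = normal_pdf(v) * pareto_tail(t - v)  # unnormalized conditional density of V
        num += v * w
        den += w
    return num / den

for t in (20.0, 100.0, 1000.0):
    print(t, conditional_mean(t))
```

Under these assumptions the printed values should shrink toward $E [V] = 0$ as $t$ grows, which is the behavior the main result predicts.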

The regions are defined in terms of a slow-growing function $h (t) : R \to R_{\geq 0}$ such that the fiddly bounds on different pieces of the proof work out. Roughly, we want it to go to infinity so that $| V |$ is likely to be less than $h (t)$ in the limit, but grow slowly enough that the shape of $V$ 's distribution within the interval $[- h (t), h (t)]$ doesn't change much after conditioning.

In the table below, we abbreviate the condition $X + V > t$ as $c$ .

| Region | Why its effect on $E [V \| c]$ is small | Explanation |
| --- | --- | --- |
| $r_{1} = (- \infty, - h (t)]$ | $P [V \in r_{1} \| c]$ is too low | In this region, $\| V \| > h (t)$ and $X > t + h (t)$ , both of which are unlikely. |
| $r_{2} = (- h (t), h (t))$ | $E [V \| V \in r_{2}, c] \approx E [V \| V \in r_{2}]$ | The tail distribution of $X$ is too flat to change the shape of $V$ 's distribution within this region. |
| $r_{3} = [h (t), t - h (t)]$ | $P [V \in r_{3} \| c]$ is low, and $V < t$ . | There are increasing returns to each bit of optimization for $X$ , so it's unlikely that both $X$ and $V$ have moderate values.^[3] |
| $r_{4} = (t - h (t), \infty)$ | $P [V \in r_{4} \| c]$ is too low | $X$ is heavier-tailed than $V$ , so the condition that $V > t - h (t)$ is much less likely than $X > t - h (t)$ in $r_{2}$ . |
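As a rough illustration of the four regions, the sketch below computes how much conditional probability mass each one carries in a single example. Everything specific here is an assumption for illustration: $X$ is Pareto with tail $x^{-2}$ , $V$ has the Laplace density $0.5 e^{- | v |}$ , and $h (t) = \sqrt{t}$ (one slowly growing choice; the proof only needs some suitable $h$ to exist).

```python
import math

def pareto_tail(x, alpha=2.0):
    """Pr(X > x) for a Pareto variable with scale 1 (assumed heavy-tailed example)."""
    return 1.0 if x < 1.0 else x ** (-alpha)

def laplace_pdf(v):
    """Density 0.5 * exp(-|v|), an assumed lighter-tailed choice for V."""
    return 0.5 * math.exp(-abs(v))

def region_masses(t, h, pad=60.0, n=100000):
    """Return [P(V in r_i | X + V > t) for i = 1..4] by midpoint-rule integration.

    The integration range [-pad, t + pad] truncates the Laplace density where
    it is negligible.
    """
    lo, hi = -pad, t + pad
    dv = (hi - lo) / n
    mass = [0.0, 0.0, 0.0, 0.0]
    for i in range(n):
        v = lo + (i + 0.5) * dv
        w = laplace_pdf(v) * pareto_tail(t - v) * dv
        if v <= -h:
            k = 0  # r1 = (-inf, -h]
        elif v < h:
            k = 1  # r2 = (-h, h)
        elif v <= t - h:
            k = 2  # r3 = [h, t - h]
        else:
            k = 3  # r4 = (t - h, inf)
        mass[k] += w
    total = sum(mass)
    return [m / total for m in mass]

t = 100.0
print(region_masses(t, h=math.sqrt(t)))
```

In this example nearly all of the conditional mass should land in $r_{2}$ , matching the table's claim that the other three regions are too unlikely to matter.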

Here's a diagram showing the region boundaries at $- h (t)$ , $h (t)$ , and $t - h (t)$ in an example where $t = 25$ and $h (t) = 4$ , along with a negative log plot of the relevant distribution:

Note that up to a constant vertical shift of normalization, the green curve is the pointwise sum of the blue and orange curves.

Full proof

To be more precise, we're going to make the following definitions and assumptions:

Let $f_{V} (v)$ be the PDF of $V$ at the value $v$ . We assume for convenience that $f_{V}$ exists, is integrable, etc, though we suspect that this isn't necessary, and that one could work through a similar proof just referring to the tails of $V$ . We won't make this assumption for $X$ .
Let $F_{X} (x) = Pr (X \leq x)$ and ${¯ F}_{X} (x) = Pr (X > x)$ , similarly for $F_{V}$ and ${¯ F}_{V}$ .
Assume $V$ has a finite mean: $\int_{- \infty}^{\infty} v f_{V} (v) d v$ converges absolutely.
$X$ is subexponential.
- Formally, this means that ${lim}_{x \to \infty} \frac{Pr (X_{1} + X_{2} > x)}{Pr (X > x)} = 2$ , where $X_{1}$ and $X_{2}$ are independent copies of $X$ .
- This happens roughly whenever $X$ has tails that are heavier than $e^{- c x}$ for any $c$ and is reasonably well-behaved; counterexamples to the claim "long-tailed implies subexponential" exist, but they're nontrivial to exhibit.
- Examples of subexponential distributions include log-normal distributions, anything that decays like a power law, the Pareto distribution, and distributions with tails asymptotic to $e^{- x^{a}}$ for any $0 < a < 1$ .
We require for $V$ that its tail function is substantially lighter than X's, namely that ${lim}_{t \to \infty} \frac{t^{p} {¯ F}_{V} (t)}{{¯ F}_{X} (t)} = 0$ for some $p > 1$ .^[1]
- This implies that ${¯ F}_{V} (t) = O ({¯ F}_{X} (t) / t)$ .
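The defining limit for subexponentiality can be checked numerically in a simple case. The sketch below assumes $X$ is Pareto with scale 1 and tail $x^{-2}$ (our choice of example, not something fixed by the post) and computes $Pr (X_{1} + X_{2} > x) / Pr (X > x)$ by conditioning on $X_{1}$ .

```python
ALPHA = 2.0  # assumed Pareto shape parameter

def tail(x):
    """Pr(X > x) for a Pareto variable with scale 1."""
    return 1.0 if x < 1.0 else x ** (-ALPHA)

def pdf(x):
    """Density of the same Pareto variable."""
    return 0.0 if x < 1.0 else ALPHA * x ** (-ALPHA - 1.0)

def sum_tail(x, n=100000):
    """Pr(X1 + X2 > x) for independent copies X1, X2, valid for x > 2.

    Condition on X1 = u. If u > x - 1 the sum exceeds x for sure (since
    X2 >= 1); otherwise we need X2 > x - u.
    """
    a, b = 1.0, x - 1.0
    du = (b - a) / n
    integral = 0.0
    for i in range(n):
        u = a + (i + 0.5) * du
        integral += pdf(u) * tail(x - u) * du
    return integral + tail(x - 1.0)

def subexp_ratio(x):
    """Ratio Pr(X1 + X2 > x) / Pr(X > x), which tends to 2 for subexponential X."""
    return sum_tail(x) / tail(x)

for x in (20.0, 100.0, 1000.0):
    print(x, subexp_ratio(x))
```

For contrast, an exponential $X$ with rate 1 has $Pr (X_{1} + X_{2} > x) / Pr (X > x) = 1 + x$ , which diverges, so it is not subexponential.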

With all that out of the way, we can move on to the proof.

The unnormalized PDF of $V$ conditioned on $X + V \geq t$ is given by $f_{V} (v) {¯ F}_{X} (t - v)$ . Its expectation is given by $\frac{\int_{- \infty}^{\infty} v f_{V} (v) {¯ F}_{X} (t - v)}{\int_{- \infty}^{\infty} f_{V} (v) {¯ F}_{X} (t - v)}$ .

Meanwhile, the unconditional expectation of V is given by $\int_{- \infty}^{\infty} v f_{V} (v)$ .

We'd like to show that these two expectations are equal in the limit for large $t$ . To do this, we'll introduce $Q (v) = \frac{{¯ F}_{X} (t - v)}{{¯ F}_{X} (t)}$ . (More pedantically, this should really be $Q_{t} (v)$ , which we'll occasionally use where it's helpful to remember that this is a function of $t$ .)

For a given value of $t$ , $Q (v)$ is just a scaled version of ${¯ F}_{X} (t - v)$ , so the conditional expectation of $V$ is given by $\frac{\int_{- \infty}^{\infty} v f_{V} (v) Q (v)}{\int_{- \infty}^{\infty} f_{V} (v) Q (v)}$ . But because $Q (0) = 1$ , the numerator and denominator of this fraction are (for small $v$ ) close to the unconditional expectation and $1$ , respectively.

We'll aim to show that for all $ϵ > 0,$ we have for sufficiently large $t$ that $\left| \int_{- \infty}^{\infty} v f_{V} (v) Q_{t} (v) - \int_{- \infty}^{\infty} v f_{V} (v) \right| < ϵ$ and $\int_{- \infty}^{\infty} f_{V} (v) Q_{t} (v) \in [1 - ϵ, 1 + ϵ]$ , which implies (exercise) that the two expectations have limiting difference zero. But first we need some lemmas.

Lemmas

Lemma 1: There is $h (t)$ depending on $F_{X}$ such that:

(a) ${lim}_{t \to \infty} h (t) = \infty$
(b) ${lim}_{t \to \infty} t - h (t) = \infty$
(c) ${lim}_{t \to \infty} \frac{{¯ F}_{X} (t - h (t))}{{¯ F}_{X} (t)} = 1$
(d) ${lim}_{t \to \infty} {sup}_{| v | \leq h (t)} | Q (v, t) - 1 | = 0$ .

Proof: Lemma 2.19 from Foss implies that if $X$ is long-tailed (which it is, because subexponential implies long-tailed), then there is $h (t)$ such that condition (a) holds and ${¯ F}_{X}$ is $h$ -insensitive; by Proposition 2.20 we can take $h$ such that $h (t) \leq t / 2$ for sufficiently large $t$ , implying condition (b). Conditions (c) and (d) follow from being $h$ -insensitive.
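To get a feel for condition (d), the sketch below evaluates ${sup}_{| v | \leq h (t)} | Q (v, t) - 1 |$ for two assumed tails with the assumed choice $h (t) = \sqrt{t}$ : a Pareto tail $x^{-2}$ , which is long-tailed, and an exponential tail $e^{- x}$ , which is not. Tails are handled in log space to avoid underflow; all names are illustrative.

```python
import math

def pareto_log_tail(x, alpha=2.0):
    """log Pr(X > x) for a Pareto variable with scale 1 (assumed example)."""
    return 0.0 if x < 1.0 else -alpha * math.log(x)

def exp_log_tail(x):
    """log Pr(X > x) for an exponential variable with rate 1 (assumed contrast case)."""
    return 0.0 if x < 0.0 else -x

def sup_q_deviation(log_tail, t, h):
    """sup over |v| <= h of |Q_t(v) - 1|, with Q_t(v) = tail(t - v) / tail(t).

    Q_t is nondecreasing in v and Q_t(0) = 1, so the supremum is attained at
    one of the endpoints v = -h or v = +h.
    """
    q_low = math.exp(log_tail(t + h) - log_tail(t))
    q_high = math.exp(log_tail(t - h) - log_tail(t))
    return max(1.0 - q_low, q_high - 1.0)

for t in (1e2, 1e3, 1e4):
    h = math.sqrt(t)
    print(t, sup_q_deviation(pareto_log_tail, t, h), sup_q_deviation(exp_log_tail, t, h))
```

With these choices the Pareto column should shrink toward zero while the exponential column blows up, which is the distinction that $h$ -insensitivity captures.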

Lemma 2: Suppose that $F_{X}$ is whole-line subexponential and $h$ is chosen as in Lemma 1. Also suppose that ${¯ F}_{V} (t) = O ({¯ F}_{X} (t) / t)$ . Then $P r [X + V > t, V > h (t), X > h (t)] = o ({¯ F}_{X} (t) / t) .$

Proof: This is a slight variation on Lemma 3.8 from Foss et al., and follows from the proof of Lemma 2.37. Lemma 2.37 states that

but it is actually proved that

$\begin{matrix} P \{ξ_{1} + η_{1} > x, ξ_{1} > h (x), η_{1} > h (x)\} \leq {sup}_{z > h (x)} \frac{{¯ F}_{1} (z)}{{¯ F}_{2} (z)} \cdot {sup}_{z > h (x)} \frac{{¯ G}_{1} (z)}{{¯ G}_{2} (z)} \cdot P \{ξ_{2} + η_{2} > x, ξ_{2} > h (x), η_{2} > h (x)\} . \end{matrix}$

If we let $F_{1} = F_{V}, F_{2} = G_{1} = G_{2} = F_{X}$ , then we get

$\begin{matrix} P \{X + V > t, X > h (t), V > h (t)\} \leq {sup}_{z > h (t)} \frac{{¯ F}_{V} (z)}{{¯ F}_{X} (z)} \cdot {sup}_{z > h (t)} \frac{{¯ F}_{X} (z)}{{¯ F}_{X} (z)} \cdot P \{X + X^{'} > t, X > h (t), X^{'} > h (t)\} \\ = {sup}_{z > h (t)} \frac{{¯ F}_{V} (z)}{{¯ F}_{X} (z)} \cdot P \{X + X^{'} > t, X > h (t), X^{'} > h (t)\} \end{matrix}$

where $X, X^{'} \sim F_{X}$ . Multiplying by $t$ , we have $t P \{X + V > t, X > h (t), V > h (t)\} \leq {sup}_{z > h (t)} \frac{t {¯ F}_{V} (z)}{{¯ F}_{X} (z)} \cdot P \{X + X^{'} > t, X > h (t), X^{'} > h (t)\}$ ,

and because $h (t) \to \infty$ as $t \to \infty$ and ${¯ F}_{V} (t) = O ({¯ F}_{X} (t) / t)$ , we can say that for some $c < \infty$ , ${lim}_{t \to \infty} {sup}_{z > h (t)} \frac{t {¯ F}_{V} (z)}{{¯ F}_{X} (z)} < c$ . Therefore for sufficiently large $t$ , $P \{X + V > t, X > h (t), V > h (t)\} \leq \frac{c}{t} P \{X + X^{'} > t, X > h (t), X^{'} > h (t)\}$ .

By Theorem 3.6, $P \{X + X^{'} > t, X > h (t), X^{'} > h (t)\}$ is $o ({¯ F}_{X} (t))$ , so the LHS is $o ({¯ F}_{X} (t) / t)$ as desired.

Bounds on the numerator

We want to show, for arbitrary $ϵ > 0$ , that $\left| \int_{- \infty}^{\infty} v f_{V} (v) Q (v) - \int_{- \infty}^{\infty} v f_{V} (v) \right| < ϵ$ in the limit as $t \to \infty$ . Since

$\left| \int_{- \infty}^{\infty} v f_{V} (v) Q (v) - \int_{- \infty}^{\infty} v f_{V} (v) \right| \leq \int_{- \infty}^{\infty} | v f_{V} (v) (Q (v) - 1) | = \int_{- \infty}^{\infty} | v | \cdot f_{V} (v) \cdot | Q (v) - 1 |$

it will suffice to show that the latter quantity is less than $ϵ$ for large $t$ .

We're going to show that $\int_{- \infty}^{\infty} | v | \cdot f_{V} (v) \cdot | Q (v) - 1 |$ is small by showing that the integral gets arbitrarily small on each of four pieces: $(- \infty, - h (t)]$ , $(- h (t), h (t))$ , $[h (t), t - h (t)]$ , and $(t - h (t), \infty)$ .

We'll handle these case by case (they'll get monotonically trickier).

Region 1: $(- \infty, - h (t)]$

Since $\int_{- \infty}^{\infty} v f_{V} (v)$ is absolutely convergent, for sufficiently large $t$ we will have $\int_{- \infty}^{- h (t)} | v | f_{V} (v) < ϵ$ , since $h (t)$ goes to infinity by Lemma 1(a).

Since $Q (v)$ is monotonically increasing and $Q (0) = 1$ , we know that in this interval $| Q (v) - 1 | = 1 - Q (v)$ .

So we have

$\int_{- \infty}^{- h (t)} | v | \cdot f_{V} (v) \cdot | Q (v) - 1 | = \int_{- \infty}^{- h (t)} | v | f_{V} (v) (1 - Q (v)) < \int_{- \infty}^{- h (t)} | v | f_{V} (v) < ϵ$

as desired.

Region 2: $(- h (t), h (t))$

By Lemma 1(d), $h$ is such that for sufficiently large $t$ , $| Q (v) - 1 | < \frac{ϵ}{\int_{- \infty}^{\infty} | v | f_{V} (v)}$ on the interval $[- h (t), h (t)]$ . (Note that the value of this upper bound depends only on $V$ and $ϵ$ , not on $t$ or $h$ .) So we have

$\int_{- h (t)}^{h (t)} | v | f_{V} (v) | Q (v) - 1 | < \frac{ϵ}{\int_{- \infty}^{\infty} | v | f_{V} (v)} \int_{- h (t)}^{h (t)} | v | f_{V} (v) < \frac{ϵ}{\int_{- \infty}^{\infty} | v | f_{V} (v)} \int_{- \infty}^{\infty} | v | f_{V} (v) = ϵ$ .

Region 3: $[h (t), t - h (t)]$

For the third part, we'd like to show that $\int_{h (t)}^{t - h (t)} v f_{V} (v) (Q (v) - 1) < ϵ$ . Since

$\int_{h (t)}^{t - h (t)} v f_{V} (v) (Q (v) - 1) < \int_{h (t)}^{t - h (t)} t f_{V} (v) Q (v) = \frac{t}{{¯ F}_{X} (t)} \int_{h (t)}^{t - h (t)} f_{V} (v) {¯ F}_{X} (t - v)$

it would suffice to show that the latter expression becomes less than $ϵ$ for large $t$ , or equivalently that $\int_{h (t)}^{t - h (t)} f_{V} (v) {¯ F}_{X} (t - v) = o (\frac{{¯ F}_{X} (t)}{t})$ .

The LHS in this expression is the unconditional probability that $X + V > t$ and $h (t) < V < t - h (t)$ , but this event implies $X + V > t, V > h (t)$ , and $X > h (t)$ . So we can write

$\int_{h (t)}^{t - h (t)} f_{V} (v) {¯ F}_{X} (t - v) = P r [X + V > t, h (t) < V < t - h (t)]$

$< P r [X + V > t, V > h (t), X > h (t)] = o ({¯ F}_{X} (t) / t)$ by Lemma 2.

Region 4: $(t - h (t), \infty)$

For the fourth part, we'd like to show that $\int_{t - h (t)}^{\infty} v f_{V} (v) Q (v) \to 0$ for large $t$ .

Since $Q (v) = \frac{{¯ F}_{X} (t - v)}{{¯ F}_{X} (t)} < \frac{1}{{¯ F}_{X} (t)}$ , it would suffice to show $\int_{t - h (t)}^{\infty} v f_{V} (v) = o ({¯ F}_{X} (t))$ . But note that since ${lim}_{t \to \infty} \frac{{¯ F}_{X} (t - h (t))}{{¯ F}_{X} (t)} = 1$ by Lemma 1(c), this is equivalent to $\int_{t - h (t)}^{\infty} v f_{V} (v) = o ({¯ F}_{X} (t - h (t)))$ , which (by Lemma 1(b)) is equivalent to $\int_{t}^{\infty} v f_{V} (v) = o ({¯ F}_{X} (t))$ .

Note that $\int_{t}^{\infty} v f_{V} (v) = t \int_{t}^{\infty} f_{V} (v) + \int_{t}^{\infty} (v - t) f_{V} (v) = t {¯ F}_{V} (t) + \int_{t}^{\infty} {¯ F}_{V} (v)$ , so it will suffice to show that both terms in this sum are $o ({¯ F}_{X} (t))$ .
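The second equality above uses a standard tail-integral identity. As a short worked step (written with $\bar{F}_V$ for the tail function and with the differentials restored), swapping the order of integration gives:

```latex
\begin{aligned}
\int_{t}^{\infty} (v - t)\, f_V(v)\, dv
  &= \int_{t}^{\infty} \left( \int_{t}^{v} du \right) f_V(v)\, dv \\
  &= \int_{t}^{\infty} \left( \int_{u}^{\infty} f_V(v)\, dv \right) du \\
  &= \int_{t}^{\infty} \bar{F}_V(u)\, du .
\end{aligned}
```

The swap is justified because the integrand is nonnegative on the region $t \leq u \leq v$ .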

The first term $t {¯ F}_{V} (t)$ is $o ({¯ F}_{X} (t))$ because we assumed ${lim}_{t \to \infty} \frac{t^{p} {¯ F}_{V} (t)}{{¯ F}_{X} (t)} = 0$ for some $p > 1$ .

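For the second term, the same assumption gives ${¯ F}_{V} (v) < {¯ F}_{X} (v) / v^{p}$ for all sufficiently large $v$ , and ${¯ F}_{X}$ is nonincreasing, so for sufficiently large $t$ we have $\int_{t}^{\infty} {¯ F}_{V} (v) < \int_{t}^{\infty} \frac{{¯ F}_{X} (v)}{v^{p}} \leq {¯ F}_{X} (t) \int_{t}^{\infty} v^{- p} = \frac{t^{1 - p}}{p - 1} {¯ F}_{X} (t) = o ({¯ F}_{X} (t))$ .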

This completes the bounds on the numerator.

Bounds on the denominator

For the denominator, we want to show that ${lim}_{t \to \infty} \int_{- \infty}^{\infty} f_{V} (v) Q_{t} (v) = 1 = \int_{- \infty}^{\infty} f_{V} (v)$ , so it'll suffice to show $| \int_{- \infty}^{\infty} f_{V} (v) (Q_{t} (v) - 1) | = o (1)$ as $t \to \infty$ .

Again, we'll break up this integral into pieces, though they'll be more straightforward than last time. We'll look at $(- \infty, - h (t))$ , $[- h (t), h (t)]$ , and $(h (t), \infty)$ .

$| \int_{- \infty}^{- h (t)} f_{V} (v) (Q (v) - 1) | = \int_{- \infty}^{- h (t)} f_{V} (v) (1 - Q (v)) < \int_{- \infty}^{- h (t)} f_{V} (v)$ .
- But since $h (t)$ goes to infinity, this left tail of the integral will contain less and less of $V$ 's probability mass over time.
$| \int_{- h (t)}^{h (t)} f_{V} (v) (Q (v) - 1) | \leq \int_{- h (t)}^{h (t)} f_{V} (v) | Q (v) - 1 |$
$\leq {sup}_{| v | \leq h (t)} | Q (v, t) - 1 | \int_{- h (t)}^{h (t)} f_{V} (v) \leq {sup}_{| v | \leq h (t)} | Q (v, t) - 1 |$
- But by Lemma 1(d) we know that this goes to zero for large $t$ .
$| \int_{h (t)}^{\infty} f_{V} (v) (Q (v) - 1) | = \int_{h (t)}^{\infty} f_{V} (v) (Q (v) - 1) < \int_{h (t)}^{\infty} f_{V} (v) Q (v)$ .
- But for sufficiently large $t$ we have $h (t) > 1$ , so we obtain
  $\int_{h (t)}^{\infty} f_{V} (v) Q (v) < \int_{h (t)}^{\infty} v f_{V} (v) Q (v) = o (1)$
  by the Region 3 and Region 4 bounds of the previous section.

This completes the proof!
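For readers who want the final step spelled out, here is one way to do the exercise that combines the numerator and denominator bounds. The symbols $N_t$ , $D_t$ , and $\mu$ are shorthand introduced only for this sketch: $N_t = \int v f_V(v) Q_t(v)\,dv$ , $D_t = \int f_V(v) Q_t(v)\,dv$ , and $\mu = E[V]$ .

```latex
\begin{aligned}
\left| \frac{N_t}{D_t} - \mu \right|
  &= \frac{\left| (N_t - \mu) + \mu (1 - D_t) \right|}{D_t} \\
  &\leq \frac{|N_t - \mu| + |\mu| \, |1 - D_t|}{D_t} \\
  &\leq \frac{\epsilon \, (1 + |\mu|)}{1 - \epsilon}
  \qquad \text{for } 0 < \epsilon < 1 \text{ and } t \text{ sufficiently large.}
\end{aligned}
```

Since $\epsilon$ was arbitrary and $\mu$ is a fixed finite number, the right-hand side can be made as small as we like, so $N_t / D_t \to \mu$ .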

Light tails imply $V \to \infty$

Conversely, here's a case where we do get arbitrarily high $E [V | X + V \geq t]$ for large $t$ . This generalizes a consequence of the lemma from Appendix A of Scaling Laws for Reward Model Overoptimization (Gao et al., 2022), which shows this result in the case where $X$ is either bounded or normally distributed.

Suppose that $X$ satisfies the property that ${lim}_{t \to \infty} \frac{{¯ F}_{X} (t + 1)}{{¯ F}_{X} (t)} = 0$ .^[4] This implies that $X$ has tails that are dominated by $e^{- c x}$ for any $c$ , though it's a slightly stronger claim because it requires that $X$ not be too "jumpy" in the decay of its tails.^[5]

We'll show that for any $V$ with a finite mean which has no upper bound, ${lim}_{t \to \infty} E [V | X + V > t] = \infty$ .

In particular we'll show that for any $c$ , ${lim}_{t \to \infty} E [V | X + V > t] \geq c$ .
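Before the proof, a small numerical sketch may help. It assumes both $X$ and $V$ are standard normal (our example; the normal tail satisfies the ratio condition above) and evaluates $E [V | X + V > t]$ as a ratio of integrals with a midpoint rule. The truncation range is an illustrative choice.

```python
import math

def normal_tail(x):
    """Pr(X > x) for a standard normal X (assumed light-tailed example)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def normal_pdf(v):
    """Density of a standard normal V (assumed example)."""
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def conditional_mean(t, lo=-15.0, hi=25.0, n=40000):
    """Midpoint-rule estimate of E[V | X + V > t] for independent X and V."""
    dv = (hi - lo) / n
    num = den = 0.0
    for i in range(n):
        v = lo + (i + 0.5) * dv
        w = normal_pdf(v) * normal_tail(t - v)  # unnormalized conditional density of V
        num += v * w
        den += w
    return num / den

for t in (2.0, 4.0, 8.0):
    print(t, conditional_mean(t))
```

In this symmetric case the conditional mean of $V$ is half the conditional mean of $X + V$ , so it should exceed $t / 2$ and keep growing with $t$ .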

Proof

Let $p = Pr (V > c + 1)$ , which is positive by our assumption that $V$ has no upper bound.

Let $E [V | V < c] = q$ . (If this is undefined because the conditional has probability $0$ , we'll have the desired result anyway since then $V$ would always be at least $c$ .)

Observe that for all $t$ , $E [V | V < c, X + V > t] \geq q$ (assuming it is defined), because we're conditioning $(V | V < c)$ on an event which is more likely for larger $v$ (since $X$ and $V$ are independent).

First, let's see that ${lim}_{t \to \infty} \frac{P (V < c | X + V \geq t)}{P (V > c + 1 | X + V \geq t)} = 0$ . This ratio of probabilities is equal to

$\frac{\int_{- \infty}^{c} f_{V} (v) {¯ F}_{X} (t - v)}{\int_{c + 1}^{\infty} f_{V} (v) {¯ F}_{X} (t - v)} \leq \frac{\int_{- \infty}^{c} f_{V} (v) {¯ F}_{X} (t - c)}{\int_{c + 1}^{\infty} f_{V} (v) {¯ F}_{X} (t - c - 1)} = \frac{{¯ F}_{X} (t - c)}{{¯ F}_{X} (t - c - 1)} \cdot \frac{\int_{- \infty}^{c} f_{V} (v)}{\int_{c + 1}^{\infty} f_{V} (v)}$

$= \frac{{¯ F}_{X} (t - c)}{{¯ F}_{X} (t - c - 1)} \cdot \frac{Pr (V < c)}{Pr (V > c + 1)} \leq \frac{{¯ F}_{X} (t - c)}{{¯ F}_{X} (t - c - 1)} \cdot \frac{1}{p}$

which, by our assumption that ${lim}_{t \to \infty} \frac{{¯ F}_{X} (t + 1)}{{¯ F}_{X} (t)} = 0$ , will get arbitrarily small as $t$ increases for any positive $p$ .

Now, consider $E [V | X + V \geq t]$ . We can break this up as the sum across outcomes $Z$ of $E [V | Z, X + V \geq t] \cdot Pr (Z | X + V \geq t)$ for the three disjoint outcomes $V < c$ , $c \leq V \leq c + 1$ , and $V > c + 1$ . Note that we can lower bound these expectations by $q, c, c + 1$ respectively. But then once $t$ is large enough that $\frac{Pr (V < c | X + V \geq t)}{Pr (V > c + 1 | X + V \geq t)} < \frac{1}{c - q}$ , this weighted sum of conditional expectations will add to more than $c$ (exercise).
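One way to work the exercise: write $a$ , $b$ , $d$ for the conditional probabilities of the three outcomes $V < c$ , $c \leq V \leq c + 1$ , and $V > c + 1$ given $X + V \geq t$ (names introduced only for this sketch), so that $a + b + d = 1$ .

```latex
\begin{aligned}
E[V \mid X + V \geq t]
  &\geq q\,a + c\,b + (c + 1)\,d \\
  &= c\,(a + b + d) + d - (c - q)\,a \\
  &= c + d - (c - q)\,a .
\end{aligned}
```

Because $q < c$ and $d > 0$ , the last expression exceeds $c$ exactly when $a / d < 1 / (c - q)$ , which is the condition stated above.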

Answers to exercises from last post

Show that when $X$ and $V$ are independent and $t \in R$ , $E [V | X + V > t] \geq E [V]$ . Conclude that ${lim}_{t \to \infty} E [V | X + V > t] \geq E [V]$ . This means that given independence, optimization always produces a plan that is no worse than random.

Proof: Fixing a value of $t$ , we have for all $x \in R$ that $E (V | X + V > t, X = x) = E (V | V > t - x) \geq E (V)$ . Since the conditional expectation after seeing any particular value of $X$ is at least $E (V)$ , this will be true when averaged across all $x$ proportional to their frequency in the distribution of $X$ . This means that $E [V | X + V > t] \geq E (V)$ for all $t$ , so the inequality also holds in the limit.

When independence is violated, an optimized plan can be worse than random, even if your evaluator is unbiased. Construct a joint distribution $f_{V X}$ for $X$ and $V$ such that $E [X] = 0$ , $E [V] = 0$ , and $E [X | V = v] = 0$ for any $v \in R$ , but ${lim}_{t \to \infty} E [V | X + V > t] = - \infty$ .

Solution: Suppose that $V$ is distributed with a PDF equal to $0.5 e^{- | v |}$ , and the conditional distribution of $X$ is given by

$(X | V = v) = \begin{cases} 0 & v \geq 0 \\ Coinflip () \cdot (v^{2} - v) & v < 0 \end{cases}$
where $Coinflip ()$ is a random variable that is equal to $\pm 1$ with 50/50 odds. The two-dimensional heatmap looks like this:

Now, conditioning on $X + V \geq t$ with $t > 0$ , we have one of two outcomes: either $V \geq t$ , or $V \leq - \sqrt{t}$ and $Coinflip () = 1$ .
The first case has an unconditional probability of $0.5 e^{- t}$ , and a conditional expectation of $E [V | V \geq t] = \frac{\int_{t}^{\infty} 0.5 v e^{- v} d v}{\int_{t}^{\infty} 0.5 e^{- v} d v} = \frac{0.5 (t + 1) e^{- t}}{0.5 e^{- t}} = t + 1$ .
The second case has an unconditional probability of $0.25 e^{- \sqrt{t}}$ , and a conditional expectation of at most $- \sqrt{t}$ since all values of $v$ in the second case are at most that large.
So, abbreviating these two cases as $A$ and $B$ respectively, the overall conditional expectation is given by
$E [V | X + V \geq t] = \frac{E [V | A] Pr (A) + E [V | B] Pr (B)}{Pr (A) + Pr (B)} \leq \frac{(t + 1) 0.5 e^{- t} - 0.25 e^{- \sqrt{t}} \sqrt{t}}{0.5 e^{- t} + 0.25 e^{- \sqrt{t}}}$
$= \frac{(2 t + 2) - e^{t - \sqrt{t}} \sqrt{t}}{2 + e^{t - \sqrt{t}}} = \frac{2 t + 2 + 2 \sqrt{t}}{2 + e^{t - \sqrt{t}}} - \sqrt{t} = o (1) - \sqrt{t} \to - \infty$
as desired.
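The two-case decomposition can also be evaluated exactly, which gives a quick sanity check on the limit. The sketch below goes slightly beyond the solution's bound: it uses $E [V | V \leq - \sqrt{t}] = - (\sqrt{t} + 1)$ , which follows from the symmetry of the density $0.5 e^{- | v |}$ together with the $E [V | V \geq t] = t + 1$ computation above. The function name is illustrative.

```python
import math

def conditional_mean(t):
    """Exact E[V | X + V >= t] for the construction above, valid for t > 0."""
    root = math.sqrt(t)
    p_a = 0.5 * math.exp(-t)      # case A: Pr(V >= t), where X = 0
    mean_a = t + 1.0              # E[V | V >= t] for the density 0.5 * exp(-|v|)
    p_b = 0.25 * math.exp(-root)  # case B: Pr(V <= -sqrt(t)) * Pr(Coinflip = +1)
    mean_b = -(root + 1.0)        # E[V | V <= -sqrt(t)], by symmetry with case A
    return (mean_a * p_a + mean_b * p_b) / (p_a + p_b)

for t in (1.0, 4.0, 100.0, 400.0):
    print(t, conditional_mean(t))
```

The values should start positive for small $t$ and then fall roughly like $- \sqrt{t}$ , consistent with the bound derived above.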

This sort of strategy works for any fixed distribution of $V$ , so long as the distribution is not bounded below and has finite mean; we can replace $v^{2} - v$ with some sufficiently fast-growing function to get a zero-mean conditional $X$ distribution that behaves the same.

For a followup exercise, construct an example of this behavior even when all conditional $X$ distributions have variance at most $1$ .

^[1] We actually just need $\int_{t}^{\infty} {¯ F}_{V} (v) \in o ({¯ F}_{X} (t))$ , so we can have e.g. ${¯ F}_{V} (v) = \frac{{¯ F}_{X} (v)}{v {log}^{2} (v)}$ .

^[2] We'll generally omit $d x$ and $d v$ terms in the interests of compactness and laziness; the implied differentials should be pretty clear.

^[3] The diagrams in the previous post show visually that when $X$ and $V$ are both heavy-tailed and $t$ is large, most of the probability mass has $X \approx 0$ , $V \approx t$ or vice versa.

^[4] This proof will actually go through if we just have ${lim}_{t \to \infty} \frac{{¯ F}_{X} (t + k)}{{¯ F}_{X} (t)} = 0$ for any constant $k > 0$ , which is a slightly weaker condition (just replace $1$ with $k$ in the proof as necessary). For instance, $X$ could have probability $\frac{e^{- 1}}{n!}$ of being equal to $100 n$ for $n = 0, 1, 2, 3, \dots$ , which would satisfy this condition for $k = 101$ but not $k = 1$ .

^[5] If $X$ has really jumpy tails, the limit of the mean of the conditional distribution may not exist. Exercise: what goes wrong when $X$ has a $\frac{2}{3^{n}}$ probability of being equal to $2^{n}$ for $n = 1, 2, 3, \dots$ and $V$ is a standard normal distribution?
