The reverse Goodhart problem

Let me try to repair Goodhart's law to avoid these problems:

As a statistical default, we should expect two random variables to be uncorrelated unless there's a "good reason" to expect them to be correlated. Goodhart's law says that if U and V are correlated in some distribution, then: (1) a powerful optimizer that tries to maximize U will by default go far out of the distribution; (2) the mere fact that U and V were correlated in the distribution does not in itself constitute a "good reason" to expect them to be correlated far out of the distribution, so by default they won't be; (3) therefore we expect Goodhart's law "by default": you optimize U, thus go out of the distribution, thus break the correlation between U and V, and then V regresses back down to its mean.
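Here is a minimal toy simulation of that default story, not a claim about any real system. Everything in it is an assumption made for illustration: a hypothetical `shared` factor that drives both U and V, a hypothetical `lever` that moves U only, and the particular noise scales. In-distribution the lever barely moves, so U and V are strongly correlated; the "powerful optimizer" is modeled as searching worlds where the lever moves a lot and keeping the highest-U outcomes.

```python
import random
from statistics import fmean


def sample_world(rng, lever_scale):
    """Draw one (U, V) pair from a hypothetical toy world.

    `shared` drives both U and V; `lever` moves U only. Both names and all
    noise scales are assumptions made for this illustration.
    """
    shared = rng.gauss(0.0, 1.0)
    lever = rng.gauss(0.0, lever_scale)
    u = shared + lever
    v = shared + rng.gauss(0.0, 0.3)
    return u, v


def pearson(pairs):
    """Sample Pearson correlation of a list of (u, v) pairs."""
    mu = fmean(u for u, _ in pairs)
    mv = fmean(v for _, v in pairs)
    cov = sum((u - mu) * (v - mv) for u, v in pairs)
    var_u = sum((u - mu) ** 2 for u, _ in pairs)
    var_v = sum((v - mv) ** 2 for _, v in pairs)
    return cov / (var_u * var_v) ** 0.5


def mean_v_of_top_u(pairs, k):
    """Average V among the k outcomes with the highest U."""
    top = sorted(pairs, key=lambda p: p[0], reverse=True)[:k]
    return fmean(v for _, v in top)


def run_demo(seed=0, n=5000, k=50):
    rng = random.Random(seed)
    # In-distribution: the U-only lever barely moves.
    typical = [sample_world(rng, lever_scale=0.1) for _ in range(n)]
    # Powerful optimizer: searches worlds where the lever moves a lot.
    extreme = [sample_world(rng, lever_scale=50.0) for _ in range(n)]
    return {
        "in_dist_corr": pearson(typical),
        "mild_selection_mean_v": mean_v_of_top_u(typical, k),
        "hard_optimization_mean_v": mean_v_of_top_u(extreme, k),
    }


if __name__ == "__main__":
    for name, value in run_demo().items():
        print(f"{name}: {value:.2f}")
```

In this toy world, picking the highest-U outcomes within the typical distribution also picks high V, while picking the highest-U outcomes from the wide search leaves V close to its population mean of zero. That is only the "default" case: it holds here because nothing ties the lever to V.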

So then we can start going through examples:

GDP vs human flourishing: This example fits all the defaults. There is no "good reason" to expect an extremely-out-of-distribution correlation between "GDP" and "human flourishing"—really the only reason to expect a correlation is the fact that they're correlated in-distribution, and by itself that's not enough to count as a "good reason". And we definitely expect that powerfully maximizing GDP would push it far out-of-distribution. Therefore we expect Goodhart's law—if you maximize GDP hard enough, then human flourishing will stop going up and start going down as it regresses to the mean.
GDP vs "twice GDP minus human flourishing": Here there is a very good a priori reason to expect an extremely-out-of-distribution correlation between the two sides—namely the fact that "GDP" is part of both. So the default expectation doesn't apply.
GDP vs log(GDP): Here there's an even more obvious, a priori reason to expect a robust correlation across all possible configurations of matter in all possible universes. So the default expectation doesn't apply.
"Mass of an object" vs "total number of protons and neutrons in the object": The default expectation that "optimization takes you far out of the distribution" doesn't really apply here, because regularities hold in a much broader "distribution" if the regularity comes from basic laws of physics, rather than from regularities concerning human-sized objects and events. So you can have a quite powerful optimization process trying to maximize an object's mass, yet stay well within the distribution of environments where this particular correlation remains robust. (A powerful enough optimizer could eventually make a black hole, which would indeed break this correlation, and then we get Goodhart's law. Other physics-derived correlations would be truly unbreakable though, like inertial mass vs gravitational mass.)
"The utility of the worst-off human" vs "The utility of the average human": Is there a "good reason" to expect these to be correlated extremely-out-of-distribution? Yes! Mathematically, if the former goes to infinity, then the latter has to go to infinity too. So we have a sound a priori reason to at least question the Goodhart's law default. We need a more object-level analysis to decide what would happen.
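The bound behind that last example can be written out explicitly. As a sketch, assume a finite population of $n$ people with utilities $u_1, \dots, u_n$ (this notation is mine, not from the post):

$$\min_{1 \le i \le n} u_i \;\le\; \frac{1}{n} \sum_{i=1}^{n} u_i$$

So driving the left-hand side to infinity forces the right-hand side to infinity as well. The bound only runs one way: the average can grow without limit while the minimum stays fixed, so it matters which of the two is being optimized.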

We also need to scale the proxy so that $V \approx U$ on the typical range of circumstances; thus the conservatism of $U$ is only visible away from the typical range.

Steven Byrnes (4y)

Stuart_Armstrong (4y)

Cheers, these are useful classifications.

Gordon Seidoh Worley (4y)

Maybe I'm missing something, but this seems already captured by the normal notion of what Goodharting is in that it's about deviation from the objective, not the direction of that deviation.

The idea that maximising the proxy will inevitably end up reducing the true utility seems a strong implicit part of Goodharting the way it's used in practice.

After all, if the deviation is upwards, Goodharting is far less of a problem. It's "suboptimal improvement" rather than "inevitable disaster".

Gordon Seidoh Worley (4y)

Ah, yeah, that's true, there's not much concern about getting too much of a good thing and that actually being good, which does seem like a reasonable category for anti-Goodharting.

It's a bit hard to think of when this would actually happen, though, since usually you have to give something up, even if it's just the opportunity to have done less. For example, maybe I'm trying to get a B on a test because that will let me pass the class and graduate, but I accidentally get an A. The A is actually better and I don't mind getting it, but then I'm potentially left with regret that I put in too much effort.

Most examples I can think of that look like potential anti-Goodharting seem the same: I don't mind that I overshot the target, but I do mind that I wasn't as efficient as I could have been.

AI ALIGNMENT FORUM