Introduction to Reducing Goodhart

That was quite a stimulating post! It pushed me to actually work through the cloud of confusion surrounding these questions in my mind, and I hopefully have a better picture now.

First, I was confused by your point on True Values, and by what you even meant. If I understand correctly, you're talking about a class of parametrized models of humans: the agent/goal-directed model, parametrized by something like the beliefs and desires of Dennett's intentional stance, with some non-formalized additional subtleties, like the fact that desires/utilities/goals can't just describe exactly what the system does, but must be in some sense compressed and sparse.

Now, there's a pretty trivial sense in which there are no True Values for the parameters: because this model class lacks realizability, no parameter describes exactly and perfectly the human we want to predict. That sounds completely uncontroversial to me, but also boring.

Your claim, as I understand it, is that there are no parameters for which this model comes close to being good enough at predicting the human. Is that correct?

Assuming for the moment it is, this post doesn't really argue for that point in my opinion; instead it argues for the difficulty of inferring such good parameters even if they existed. For example, this part:

But here's the problem: humans have no such V (see also Scott A., Stuart 1, 2). Inferring human preferences depends on:
what state the environment is in.
what physical system to infer the preferences of.
how to make inferences from that physical system.
how to resolve inconsistencies and conflicting dynamics.
how to extrapolate the inferred preferences into new and different contexts.
There is no single privileged way to do all these things, and different choices can give very different results

is really about inference, as none of your points make it impossible for a good parameter to exist -- they just argue for the difficulty of finding/defining one.

Note that I'm not saying what you're doing with this sequence is wrong; looking at Goodhart from a different perspective, especially one which tries to dissolve some of the inferring difficulties, sounds valuable to me.

Another thing I like about this post is that you made me realize why the application of Goodhart's law to AI risk doesn't require the existence of True Values: it's an impossibility result, and when proving an impossibility, the more you assume the better. Goodhart is about the difficulty of using proxies in the best-case scenario, when there are indeed good parameters. It's about showing the risk and danger in just "finding the right values", even in the best world where true values do exist. So if there are no true values, the difficulty doesn't disappear; it gets even worse (or different at the very least).

AI ALIGNMENT FORUM