I - Obvious answers to obvious questions
"Why do I think I know what I do about Goodhart's law?"
In the last post, I talked about some issues with the simple framing of Goodhart's law for human values. So why are we thinking about it in the first place?
There's an obvious answer (or at least an obvious outer layer of the answer-onion): we use Goodhart's law because we usually model humans as if they have values. Not necessarily True Values over the whole world, but at least contextual values over some easy-to-think-about chunk of the world. Even though it's not literally true, it's really useful! We don't use a person's goals to predict everything they'll say in conversation (they're not that agenty), but we do infer their goals in a simplified picture of the local environment, and use those goals to explain and predict what kind of state transitions they'll cause within that simplified model. And this works pretty darn well, so we conveniently round off our description of this state of affairs to "humans have values."
Consider historical examples of Goodhart's law, like the dolphin trainers who rewarded dolphins for each piece of trash removed from the pool, which taught the dolphins not to clean the pool, but to hoard trash and rip it up before delivering each piece. We don't need to pretend we know the dolphin trainers' True Values; we just need to model a simplified version of their values, within a simplified model of this story. We can say "the trainers want less trash in the pool" without implying that the trainers would actually optimize for this in the real world (e.g. by freeing the dolphins so crowds won't visit the pool), because we're making this statement within a simplified model of their situation that only has very limited states and actions available.
Thus, within our simplified model of the story, this works totally fine as an example of Goodhart's law: the humans had wants, they created an incentive for a thing that was correlated with what they wanted in the past, and then the thing stopped being correlated with what they wanted. And in fact we can gather up many such stories, that all feature the same pattern of the human creating an incentive that's correlated with their modeled values and then not getting what they want - these stories are our empirical evidence for Goodhart's law.
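To make the pattern concrete, here is a toy sketch of the dolphin story in Python. Everything in it is an assumption made for illustration: the names `Outcome` and `run_day`, the idea that a dolphin hands over some number of items torn into `rip_factor` pieces each, and all of the numbers. It only shows the shape of the failure within a simplified model, not anything about real dolphins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    pieces_delivered: int    # what the trainers reward (the proxy)
    trash_left_in_pool: int  # what the trainers want to be small (the modeled value)


def run_day(trash_items: int, items_handed_over: int, rip_factor: int) -> Outcome:
    """Toy model of one day at the pool (hypothetical, for illustration only).

    The dolphin hands over `items_handed_over` items, each torn into
    `rip_factor` pieces. Anything not handed over stays hoarded in the pool.
    """
    handed = min(items_handed_over, trash_items)
    return Outcome(
        pieces_delivered=handed * rip_factor,
        trash_left_in_pool=trash_items - handed,
    )


# Before the dolphins adapt: every item is delivered whole.
honest = run_day(trash_items=10, items_handed_over=10, rip_factor=1)

# After the dolphins adapt to the incentive: hoard most items, shred a few.
gaming = run_day(trash_items=10, items_handed_over=2, rip_factor=10)

# The rewarded quantity goes up while the thing the trainers wanted gets worse.
proxy_improved = gaming.pieces_delivered > honest.pieces_delivered
value_got_worse = gaming.trash_left_in_pool > honest.trash_left_in_pool
```

In the honest regime, pieces delivered tracks trash removed one-for-one, which is why it looked like a sensible thing to reward. Once the incentive is optimized against, the two come apart.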
What's the problem with using Goodhart's law, then? If it lets us predict a regularity about the future of dolphin trainers (or schoolteachers or the Federal Reserve), can't we trust it to predict what happens when we run a value learning program?
In fact, often we can. See examples of applying it to AI. But it doesn't work in full generality because comparing the outcome to what we actually wanted only works when what we "actually wanted" is sufficiently obvious. I'd like to use the rest of this post to introduce a notion of "competence" for inferred human values. This is like a souped-up Bayesian version of revealed preferences, and it's useful to me in thinking about when we expect Goodhart's law to hold.
II - Competent banana
When humans have a "competent" value, that means that we're particularly agenty with respect to this value. For example, if I competently prefer two bananas over one banana, this means that agenty predictions using this preference will work reliably across a wide range of contexts, especially contexts that come up in my actual life (a.k.a. on-distribution).
But suppose I disagreed with other people similar to me about one vs. two bananas, AND actually you could get different answers from me if I'd read pro/anti arguments in the opposite order or if you asked me in different ways, AND treating this as a preference of mine didn't help form a simple explanation of other preferences, AND so on and so forth. Then even if I verbally tell you that I prefer two bananas, modeling me as an agent that prefers two bananas is less useful than before - the preference is less competent.
Competence is continuous, not discrete. Don't just mentally replace the question of whether I "really" prefer two bananas with the question of whether I "competently" prefer two bananas. Allow the competences of my values to be shades of grey. Different inferred values can fit the data to different degrees, can have larger or smaller domains of validity, can be more or less easily explained with a non-agential model, and so on.
If one wanted a precise definition, one could try denominating this in a single currency: bits of predictive usefulness minus bits of model complexity (given a distribution of training situations to predict, and some choice of timescale to predict over). However, I suspect this misses some subtleties, especially concerning priors over model architectures and what language to cash values out in, so I'll stick to verbal reasoning.
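As a rough illustration of the single-currency idea, here is a minimal Python sketch. It is not a proposed definition: the function name `competence_bits`, the choice of a 50/50 baseline as the "non-agential" comparison, and the complexity cost of 3 bits are all assumptions picked for the example, and the sketch ignores exactly the subtleties mentioned above (priors over model architectures, the language values are cashed out in, the choice of timescale).

```python
import math


def competence_bits(observed, p_model, p_baseline, model_complexity_bits):
    """Bits of predictive usefulness minus bits of model complexity.

    observed: list of bools, True if the person chose two bananas that time.
    p_model: probability the agent-like model assigns to choosing two bananas.
    p_baseline: probability a baseline (non-agential) model assigns to it.
    model_complexity_bits: assumed cost, in bits, of stating the preference.
    """
    gain = 0.0
    for chose_two in observed:
        pm = p_model if chose_two else 1.0 - p_model
        pb = p_baseline if chose_two else 1.0 - p_baseline
        gain += math.log2(pm / pb)  # log-likelihood ratio for this observation
    return gain - model_complexity_bits


# Hypothetical data: a consistent chooser and an inconsistent one.
consistent = [True] * 9 + [False]
inconsistent = [True] * 5 + [False] * 5

score_consistent = competence_bits(consistent, 0.9, 0.5, 3.0)
score_inconsistent = competence_bits(inconsistent, 0.9, 0.5, 3.0)
```

With these made-up numbers, the "prefers two bananas" model pays for itself on the consistent chooser (positive score) and does worse than the baseline on the inconsistent one (negative score). The score is a matter of degree rather than a yes/no verdict, which fits the point that competence is continuous.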
The values described in this way can be small - it's okay to say I prefer two bananas to one banana; you don't have to give a complete model of my behavior all at once. To make a physics analogy, they're like the ideal gas law, not like the Standard Model of particle physics. The ideal gas law doesn't try to explain everything at once, and it makes no claim to being fundamental. It's just a useful regularity that we can use to predict part of certain systems, within its domain of validity. This is also analogous to hypotheses in infra-Bayesian reasoning. The domain of validity of inferred human values can either be an explicit fact about what they get used to predict, or it can have the implicit value of "at least as broad as where it shows up in the training distribution."
A further difference between competent preferences and "True Values" is that competence is not a substitute for importance. I might hold dearly to some value that I'm still a bit fuzzy on, but be relatively rational in exchanges involving something trivial. The definition of competence cares about the usefulness to a predictor, not the feelings of the predictee. Nevertheless, there's a limit to how much importance we're willing to attribute to an incompetent preference; if a preference is basically worthless for predicting me, how important can it really be?
The converse doesn't always hold: a preference that's useful for predicting me isn't automatically important. We're fine with saying that a smoker who seeks out cigarettes might not 'really' value them. But let's not get too far into the weeds. We'll circle back to meta-preferences later, and in later posts.
III - Application
Even if the competence of preferences has only a loose link with importance, can we detect AI failures by watching for violation of competent preferences? Not quite, because of human preference conflicts. For example, I could have both a desire for bacon and a desire to keep slim; both help you predict my behavior, both are competent, but they conflict. Therefore any future for the universe will count as a failure according to some competent value. But often, the weight of evidence is utterly one-sided - the dolphin trainers didn't secretly have a countervailing preference for trash in the pool, after all.
Violation of such one-sided competent preferences seems like a better failure indicator. This is a condition for "obviousness" that we can plug into the requirement from section I that for examples of Goodhart's law, it had to be obvious that we didn't get what we wanted. I suspect this would work even with fairly simple methods of value inference, which makes this a promising question to look into.
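A minimal sketch of what such an indicator could look like, assuming we already have inferred preferences with competence scores attached. The names `InferredPreference` and `one_sided_violations`, the grouping of preferences by `topic`, the threshold, and all of the scores are hypothetical choices made for illustration, not a claim about how value inference should actually be done.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferredPreference:
    topic: str         # the part of the simplified model this is about
    favors: str        # the outcome the preference favors
    competence: float  # how useful the preference is for prediction (assumed given)


def one_sided_violations(prefs, outcome, threshold=1.0):
    """Return topics where the outcome goes against a one-sided competent preference.

    outcome: dict mapping topic -> what actually happened.
    A topic is "one-sided" when every sufficiently competent preference
    about it favors the same thing.
    """
    flagged = []
    for topic, happened in outcome.items():
        sides = {
            p.favors
            for p in prefs
            if p.topic == topic and p.competence >= threshold
        }
        if len(sides) == 1 and happened not in sides:
            flagged.append(topic)
    return flagged


prefs = [
    InferredPreference("pool", "less trash", 5.0),
    InferredPreference("diet", "eat bacon", 3.0),
    InferredPreference("diet", "stay slim", 3.0),
]
outcome = {"pool": "more trash", "diet": "eat bacon"}

flagged = one_sided_violations(prefs, outcome)
```

Here the pool outcome is flagged because nothing competent points the other way, while the diet outcome is not, since competent preferences sit on both sides of it. The sketch deliberately has no way to handle cases like the smoker, where a conflict is present but the situation still seems obvious to us.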
However, some examples of Goodhart's law require more complicated information about humans to understand, because humans have meta-preferences that can render situations obvious to us even in the presence of preference conflicts. Consider again the smoker with an inferred preference conflict about smoking. If a personal assistant AI tries to learn their values and then helps them buy cigarettes more efficiently, we feel a strong intuition that this is bad behavior.
IV - Conclusion
I tried to make the description of competence non-anthropocentric, with a view towards applying it to value learning for AI, but this notion is heavily based on how humans solve the problem of modeling humans' values - both third-person modeling of other humans, and first-person modeling of ourselves. We create concepts and heuristics to help us predict both the outer world and the inner world, which we simultaneously try to apply where appropriate and also reuse as often as possible. Thus both a human and a competent-value-learning AI are trying to create simple models and heuristics that help predict humans, but the human may find different concepts useful because they're trying to predict additional first-person data.
(If the AI could find the important patterns in the neurological state of the human, that might help bring their concepts closer together, but this is computationally a lot harder from the outside than from the inside. I've also described competence by assuming some artificial notion of "agent-like models" that we can put preferences into - humans have to learn that concept too, and associate it with its verbal label, leading to different inferences.)
Thus, neither humans nor our AI learning competent values ever infer something that fits the bill of "True Values" in the simple framing of Goodhart's law. We have lots of equivalents of the ideal gas law, but no Standard Model. We can use the patterns we learn to make common-sensical predictions of humans, and we can even use them to notice when things are going wrong so that we can learn about Goodhart's law from examples. But we can't use them to write down a utility function for humans.
With this in mind, we can take a deeper look at examples of Goodhart's law in value learning. That's the subject of the next post.