(This work was supported by CEEALAR and LTFF. Thanks to James Flaville, Jason Green-Lowe, Michele Campolo, Justis Mills, Peter Barnett, and Steve Byrnes for conversations.)
I - Prologue
A few months ago, I wanted to write about AI designs that evade Goodhart's law. But as I wrote that post, I became progressively more convinced that framing things that way was leading me to talk complete nonsense. I want to explore why Goodhart's law led me to talk nonsense, and to try out a different (though not entirely original, see Rohin et al., Stuart 1, 2, 3) framing of the core issues - one which avoids assuming that we can model humans as idealized agents.
This post is the first of a sequence of five posts about Goodhart's law and AIs that learn human values (a research problem also called value learning). In this introduction I'll point out why you can't just do things the straightforward way. Leave a comment below telling me what's unclear, or what you disagree with.
II - Introduction
Goodhart's law is the observation that when you pick a specific observable to optimize for, the act of optimization will drive a wedge between what you're optimizing and what you want, even if the two used to be correlated. For example, suppose what you really want is for students to get a general education, and there's a short 100-question test whose scores correlate with how much students know. It might seem like a good idea to change how schools operate in whatever way increases test scores. But this would lead to teaching the students only those 100 test questions and nothing else - optimizing for a proxy for education actually made the education worse.
In Scott Garrabrant's terminology from Goodhart Taxonomy, suppose that we have some true preference function V (for "True Values") over worlds, and U is some proxy that has been correlated with V in the past. Then there are a few distinct reasons why maximizing U may score poorly according to V: things that look great according to U can be mere noise (regressional Goodhart), can be drawn from a part of the distribution that's extreme in many other ways too (extremal Goodhart), or can intervene on the world without causing what we really want (causal Goodhart).
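The regressional case can be seen in a toy simulation (entirely hypothetical numbers, not from the post): give each option a true value V, let the proxy be U = V plus independent noise, and select on U. Among the winners, U systematically overstates V.

```python
import random

random.seed(0)

n = 100_000
# Hypothetical toy model: each option has a true value V,
# and the proxy U = V + noise is correlated with V overall.
vs = [random.gauss(0, 1) for _ in range(n)]
us = [v + random.gauss(0, 1) for v in vs]

# Select the top 1% of options according to the proxy U...
top = sorted(range(n), key=lambda i: us[i], reverse=True)[: n // 100]
avg_u = sum(us[i] for i in top) / len(top)
avg_v = sum(vs[i] for i in top) / len(top)

# ...the winners' proxy scores overstate their true values,
# even though U and V are positively correlated in the full population.
print(f"avg U of winners: {avg_u:.2f}, avg V of winners: {avg_v:.2f}")
```

Selection still finds genuinely above-average options here (the winners' average V is positive), but the harder you select on the proxy, the larger the gap between the proxy score and the true value.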
Based just on what I've said so far, it might seem like this is cause for pessimism about an AI learning human values. If our values are V, and we build an AI with effective utility function U, and U≠V, according to these Goodhart's law arguments the AI will do things we don't like.
But there's a spanner in the works: humans have no such V (see also Scott A., Stuart 1, 2). Humans don't have our values written in Fortran on the inside of our skulls, we're collections of atoms that only do agent-like things within a narrow band of temperatures and pressures. It's not that there's some pre-theoretic set of True Values hidden inside people and we're merely having trouble getting to them - no, extracting any values at all from humans is a theory-laden act of inference, relying on choices like "which atoms exactly count as part of the person" and "what do you do if the person says different things at different times?"
The natural framing of Goodhart's law - in both mathematics and casual language - makes the assumption that there's some specific True Values in here, some V to compare to U. But this assumption, and the way of thinking built on top of it, is crucially false when you get down to the nitty gritty of how to model humans and infer their values.
Goodhart's law is important - we use it all over the place on this site (e.g. 1, 2, 3). In AI alignment we want to use Goodhart's law to crystallize a pattern of bad behavior in AI systems (e.g. 1, 2, 3, 4), and to design powerful AIs that don't have this bad behavior (e.g. 1, 2, 3, 4, 5, 6). But if you try to use Goodhart's law to design solutions to these problems, it'll unhelpfully tell you you're doomed because you can't find humans' V.
This sequence is going to push back against the notion that AI alignment, even value learning, looks like finding a unique match for human values. The goal is deconfusion. We still want to talk about the same patterns, but we want a version of what-we-now-call-Goodhart's-law that's better for thinking about what beneficial AI could look like in the real world. I'm going to call the usual version of Goodhart's law "Absolute Goodhart" (because it contrasts the AI's values with fixed human values), and the version we want that's better for value learning "Relative Goodhart."
The name of this sequence has a double meaning. We want to "reduce Goodhart" - make there be less of this problem where AIs will do things we don't want. But to come to grips with this, first we'll have to "reduce Goodhart" - reductionistically explain how Goodhart's law emerges from underlying reality.
III - Preview of the sequence
We'll start post two with the classic question: "Why do I think I know what I do about Goodhart's law?"
Answering this question involves talking about how humans model each other. But this raises yet more questions, like "why can't the AI just model humans that way?" This requires us to break down what we mean when we casually say that humans "model" things, and also requires us to talk about the limitations of such models compared to the utility-maximization picture. The good news is that we can rescue some version of common sense; the bad news is that this doesn't solve our problems.
In post three we'll take a deeper look at some typical places to use Goodhart's law that are related to value learning. For example:
- Curve fitting, where overfitting is a problem.
- Hard-coded utility functions, where we can choose the wrong thing for the AI to maximize.
- Hard-coded human models, which might make systematically bad inferences.
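For the curve-fitting case, here's a minimal sketch (toy data and setup are my own, not from the post) of the proxy/target wedge: training error is the proxy we optimize, while error on fresh data is what we actually care about.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical): the target function is sin(2*pi*x);
# the proxy we optimize is squared error on 10 noisy training points.
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

train_err, test_err = {}, {}
for deg in (3, 9):
    coeffs = np.polyfit(x_train, y_train, deg)
    train_err[deg] = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err[deg] = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

# The degree-9 polynomial wins on the proxy - with 10 points it can
# interpolate the noise, driving training error to essentially zero -
# but that is no guarantee it wins on held-out data.
print("train:", train_err, "test:", test_err)
```

Here optimizing the proxy harder (raising the polynomial degree) strictly improves the proxy score while fitting noise rather than the target - the same pattern Goodhart's law points at.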
Goodhart's law reasoning is used both in the definition of these problems, and also in talking about proposed solutions (such as quantilization). I plan to re-describe these problems in excruciating detail, so that we can temporarily taboo the phrase "Goodhart's law" and grapple with the lower-level details of each case. These details turn out to be quite different depending on whether the AI is modeling humans or is merely modeled by them.
In post four, we turn to the problem that different ways of inferring human preferences will come into conflict with each other. We'll have to go from the concrete to the abstract to hash out what happens (or what we think happens, and what we want to happen) when we have multiple overlapping ways of modeling humans and the world. This is where we really get to talk about Relative Goodhart.
Post five will contain bookkeeping and unsolved problems, but it will also have my best stab at tying everything together. When I started writing this sequence I was pessimistic about solving any of the problems from this post. Now, though, I hope by the end I can offer a vision of what it would mean for value learning to succeed.
That was quite a stimulating post! It pushed me to actually go through the cloud of confusion surrounding these questions in my mind, hopefully with a better picture now.
First, I was confused about your point on True Values - about what you even meant. If I understand correctly, you're talking about a class of parametrized models of humans: the agent/goal-directed model, parametrized by something like the beliefs and desires of Dennett's intentional stance. With some non-formalized additional subtleties, like the fact that desires/utilities/goals can't just describe exactly what the system does, but must be in some sense compressed and sparse.
Now, there's a pretty trivial sense in which there are no True Values for the parameters: because this model class lacks realizability, no parameter setting describes exactly and perfectly the human we want to predict. That sounds completely uncontroversial to me, but also boring.
Your claim, in my opinion, is that there are no parameters for which this model is close to good enough at predicting the human. Is that correct?
Assuming for the moment it is, this post doesn't really argue for that point in my opinion; instead it argues for the difficulty of inferring such good parameters if they existed. For example this part:
is really about inference, as none of your points make it impossible for a good parameter to exist - they just argue for the difficulty of finding/defining one.
Note that I'm not saying what you're doing with this sequence is wrong; looking at Goodhart from a different perspective, especially one which tries to dissolve some of the inferring difficulties, sounds valuable to me.
Another thing I like about this post is that you made me realize why the application of Goodhart's law to AI risk doesn't require the existence of True Values: it's an impossibility result, and when proving an impossibility, the more you assume the better. Goodhart is about the difficulty of using proxies even in the best-case scenario where good parameters do exist. It's about showing the risk and danger in just "finding the right values", even in the best world where True Values do exist. So if there are no True Values, the difficulty doesn't disappear - it gets even worse (or different, at the very least).
I'm mostly arguing against the naive framing where humans are assumed to have a utility function, and then we can tell how well the AI is doing by comparing the results to the actual utility (the "True Values"). The big question is: how do you formally talk about misalignment without assuming some such unique standard to judge the results by?
Hmm, but I feel like you're claiming that this framing is wrong while arguing that it's too difficult to apply to be useful. Which is confusing.
Still agree that your big question is interesting though.
Thanks, this is useful feedback on how I need to be clearer about what I'm claiming :) In October I'm going to be refining these posts a bit - would you be available to chat sometime?
Glad I could help! I'm going to comment more on your following post in the next few days/next week, and then I'm interested in having a call. We can also talk then about the way I want to present Goodhart as an impossibility result in a textbook project. ;)