I - Prologue
Two months ago, I wanted to write about AI designs that evade Goodhart's law. But as I wrote that post, I became progressively more convinced that framing things that way was leading me to talk complete nonsense. I want to explore why that is, and to find a different framing of the core issues (not an entirely original one - see Rohin et al., Stuart 1, 2, 3) that avoids assuming we can model humans as idealized agents.
This post is the first in a sequence of five. In this introduction I'll make the case that we should expect problems to arise from straightforwardly applying Goodhart's law to value learning. I'm interested in hearing from you if you remain unconvinced or think of things I missed.
II - Introduction
Goodhart's law tells us that even a proxy that normally diverges only slightly from our real standards can, under optimization, produce outcomes that are very bad by those standards. To use Scott Garrabrant's terminology from Goodhart Taxonomy, suppose that we have some true preference function V (for "True Values") over worlds, and U is some proxy that has been correlated with V in the past. Then Scott's post gives several reasons why maximizing U may score poorly according to V.
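To make the U-versus-V setup concrete, here's a minimal simulation of the mildest ("regressional") failure mode - my own toy illustration, not from Scott's post. The proxy is the true value plus independent noise, so the two are well correlated across typical worlds; the trouble only shows up when we optimize:

```python
import numpy as np

rng = np.random.default_rng(0)

# 100,000 random "worlds": a true value V and a proxy U = V + noise.
n = 100_000
V = rng.normal(size=n)
U = V + rng.normal(size=n)  # correlated with V, but imperfect

best = np.argmax(U)  # the world a hard U-maximizer picks
print("proxy score of chosen world:", U[best])   # very high
print("true value of chosen world: ", V[best])   # much lower
print("best available true value:  ", V.max())
```

Hard selection on U recruits the noise as much as the signal, so the chosen world's V falls well short of both its proxy score and the best available V. Scott's taxonomy also covers extremal, causal, and adversarial variants, which need richer setups than this one.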
This story has one big problem, though: there is no V. Humans aren't idealized agents carrying a true preference function around; we're physical systems, and to extract anything like "True Values" from a physical system you have to make choices about:
- what state the environment is in.
- what physical system to infer the preferences of.
- how to make inferences from that physical system.
- how to resolve inconsistencies and conflicting dynamics.
- how to extrapolate the inferred preferences into new and different contexts.
There is no single privileged way to do all these things, and different choices can give very different results. And yet the framing of Goodhart's law, as well as much of our intuitive thinking about value learning, rests on the assumption that the True Values are out there.
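One way to feel the force of this is to imagine actually writing the value-learning pipeline down: every item on the list above becomes an argument you must supply. The sketch below is purely schematic - all names and types are mine, and each placeholder hides a large, contested design space:

```python
from dataclasses import dataclass
from typing import Callable

# Schematic placeholder types - each one hides a contested design space.
Observation = object   # raw data about the world
WorldState = object    # an inferred state of the environment
Human = object         # some physical subsystem singled out as "the human"
Preferences = object   # whatever representation of values we settle on
Context = object       # a new situation to extrapolate into

@dataclass
class ValueLearner:
    # One slot per choice listed above; none has a privileged setting.
    infer_state: Callable[[Observation], WorldState]
    locate_human: Callable[[WorldState], Human]
    infer_preferences: Callable[[Human], Preferences]
    resolve_conflicts: Callable[[Preferences], Preferences]
    extrapolate: Callable[[Preferences, Context], Preferences]
```

Two ValueLearners that differ in any slot can disagree wildly about what "the human's values" are - and notice that there's no slot labeled look_up_the_true_V.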
Goodhart's law is important - we use it all over the place (e.g. 1, 2, 3). In AI alignment we want to use Goodhart's law to crystallize a pattern of bad behavior in AI systems (e.g. 1, 2, 3, 4), and to design powerful AIs that don't exhibit that behavior (e.g. 1, 2, 3, 4, 5, 6). But if you try to use Goodhart's law to design solutions to these problems, it has only one prescription for us: find V (or at least bound your error relative to it). Since there is no such V, that advice isn't just useless - it denies the possibility of success outright.
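To spell out what "bound your error" would buy us if V existed (a standard argument I'm filling in here, not a quote from any of the posts above): suppose |U(w) − V(w)| ≤ ε for every world w, and let w* maximize U while w† maximizes V. Then

V(w*) ≥ U(w*) − ε ≥ U(w†) − ε ≥ V(w†) − 2ε,

so maximizing the proxy costs at most 2ε of true value. But both the premise and the guarantee quantify over V - with no V, the prescription isn't merely hard to follow, it isn't even well-posed.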
The goal, then, is deconfusion. We still want to talk about the same stuff, the same patterns, but we want a framing of what-we-now-call-Goodhart's-law that helps us think about what successful AI could look like in the real world.
III - Preview of the sequence
We'll start the next post with the classic question: "Why do I think I know what I do about Goodhart's law?"
The obvious answers to this question involve talking about how humans model each other. But this raises yet more questions, like "why can't the AI just model humans that way?" Answering that requires two things: first, breaking down what we mean when we casually say that humans "model" things, and second, talking about the limitations of such models compared to the utility-maximization picture. The good news is that we can rescue some version of common sense; the bad news is that this doesn't solve our problems.
Next we'll take a deeper look at some typical places to use Goodhart's law that are related to value learning:
- Curve fitting and overfitting.
- Hard-coded utility functions.
- Adversarial examples.
- Hard-coded human models.
Goodhart's law reasoning is used both in defining these problems and in talking about proposed solutions such as quantilization (sketched below). I plan to talk at excessive length about all of these details, with the object of building up pictures of our reasoning in these cases that never need to mention the word "Goodhart", because they work at a finer level of magnification.
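Since quantilization will come up repeatedly, here's a minimal sketch of the idea from Jessica Taylor's "Quantilizers" paper, in a toy setup of my own: instead of taking the argmax of the proxy, sample from the top q fraction of a trusted base distribution, ranked by the proxy. That caps how hard the proxy gets pushed into the regime where it comes apart from what we wanted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "extremal Goodhart" setup (my own example, not from the post): the
# proxy U tracks the true value V for typical actions, but the correlation
# breaks down in the extreme tail.
def true_value(a):
    return np.where(np.abs(a) < 3, a, -10.0)  # extreme actions are disasters

def proxy(a):
    return a  # the proxy just says "bigger is better", everywhere

# Base distribution over actions: what a trusted, unoptimized policy does.
actions = rng.normal(size=100_000)
U = proxy(actions)

# Maximizer: the single best action by the proxy lives in the broken tail.
print("maximizer V:", true_value(actions[np.argmax(U)]))  # -10.0

# q-quantilizer: sample uniformly from the top q fraction by the proxy,
# so only a small slice of the tail risk gets through.
q = 0.05
top = actions[U >= np.quantile(U, 1 - q)]
print("quantilizer E[V]:", true_value(top).mean())  # roughly +1.7
```

Note that even stating why the quantilizer does better required a V to score both policies against - which is exactly the assumption this sequence is trying to do without.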
However, these pictures aren't all going to be consistent, because what humans think of as success or failure can depend on the context, and extrapolating beyond that context will bring our intuitions into conflict. Thus we'll have to revisit the abstract notion of human preferences and really hash out what happens (or what we think happens) at the boundaries and interchanges between human models of the world.
Finally, the hope is to conclude with some sage advice. Not a solution, because I haven't got one. But maybe some really obvious-seeming sage advice can tie together the concepts introduced in the sequence into something that feels like progress.