I - Meanings of words
Now that we have more concrete thinking under our belt, it's time to circle back to Goodhart's law for value learners. What sorts of bad behavior are we imagining from future value-learning AI? What makes those behaviors plausible? And what makes them bad?
Let's take the last point first. Judgments of goodness or badness are situated in models - models of the world that we use to infer and operationalize human values. And we don't just use the same one all the time.
When I say "I like dancing," this is a different use of the word 'like,' backed by a different model of myself, than when I say "I like tasting sugar." The model that comes to mind for dancing treats it as one of the chunks of my day, like "playing computer games" or "taking the bus." I can know what state I'm in (the inference function of the model) based on seeing and hearing short scenes. Meanwhile, my model that has the taste of sugar in it has states like "feeling sandpaper" or "stretching my back." States are more like short-term sensations, and the described world is tightly focused on my body and the things touching it.
The meta-model that talks about me having preferences in both of these models is the framing of competent preferences. If someone or something is observing humans, it looks for human preferences by seeing what the preferences are in "agent-shaped" models that are powerful for their size.
So when we call some AI behavior "bad," this is a word whose meaning depends on usage and context, but ultimately bottoms out in implied models of the world. It's like a Winograd schema - like how English-readers infer that "they" in "workers put down the boxes because they were tired" refers to the workers, the "like" in "I like dancing" is understood to use a certain perspective on how I am modeling and interacting with the world.
All of this should be taken with the caution that there's not one True Model in which the True Meaning of the word "bad" is expressed. Obviously you still have to make some choice in practice, but the point is that the way you make this choice doesn't have to look like resolving epistemic uncertainty about which model is the True Model.
II - Model conflicts
What were the patterns that stood out from the previous discussion of what humans think of as bad behavior in value learning?
The most common type of failure, especially in modern day AI, is when humans are actively wrong about what's going to happen. They have something specific in mind when designing an AI, like training a boat to win the race, but then they run it and don't get what they wanted. The boat crashes and is on fire. We could make the boat racing game more of a value learning problem by training on human demonstrations rather than the score, and crashing and being on fire would still be bad.
For simple systems where humans can understand the state space and picture what we want, this is the only standard you need, but for more complicated systems (e.g. our galaxy) humans can only understand small parts or simple properties of the whole system, and we apply our preferences to those parts we can understand. From the inside, it can be hard to feel the difference, because we want things about tic-tac-toe or about the galaxy with the same set of emotions. But when trying to infer human preferences, there are going to be ambiguity and preference conflicts about the galaxy in a way that never shows up in tic-tac-toe.
This is a key point. Inter-preference conflicts aren't an issue that ever comes up if you think of humans as having a utility function, but they're almost unavoidable if you think of humans as physical systems with different possible models. We can't fit the whole galaxy into our heads, nor could evolution fit it into our genes, and so out of necessity we have to use simple heuristics that work well pragmatically but don't always play nicely together, even in our everyday lives.
Bad preference aggregation can lead to new kinds of bad behavior that don't make much sense in the Absolute Goodhart picture of human preferences. An AI that resolves every seemingly-even deadlock of human moral intuitions by picking whichever answer leads to the most paperclips seems bad, even though it's hard to put your finger on what's wrong on the object level.
That's an extreme example, though. A value learner can fail at resolving preference conflicts without any ulterior motive, in cases where humans have competent intuitions about what the conflict-resolution process should look like. If I like dancing, and I like tasting sugar, it's obvious to me that what I shouldn't do is never go dancing so that I can stay at home and continually eat sugar.
The line between different sorts of bad behavior is blurry here. The obviousness that I shouldn't become a sugar-hermit can be thought of either as me doing preference aggregation between preferences for tasting sugar and dancing, or as an object-level preference in a more fine-grained and comprehensive model of my states and actions. But I don't want to be modeled in the most fine-grained way. So at the very first step of trying to choose between plans, we immediately need to use my meta-preferences to reason correctly.
III - Meta-preferences
The meta-preferences an AI should learn include how we want to be modeled, which preferences we endorse and which we don't, how to resolve preference conflicts, etc. These opinions are inferred from humans' words and actions, and like other preferences they're limited in scope and can come into conflict.
Learning and representing these meta-preferences is a pit full of unsolved problems. One issue is that how an AI learns and represents stuff depends on its entire design, and everyone disagrees on how to design AGI. But even in toy models accessible today, we quickly run into difficulty - this does have a silver lining, I think, because it means we can do useful work right now on learning meta-preferences.
If we consider an AGI that's an instruction-following language model, meta-preferences might be represented as text about text, like "Saying 'It's good to rob a bank' is bad," or text about the design of the model itself. But although language models are good at stating meta-preferences, I'm currently unsatisfied with the prospective ways to act on them (e.g.). It's hard for a language model to re-evaluate the way it models me based on a text description of how I want to be modeled.
AGI based on model-based reinforcement learning has a quite different set of problems. If the AI models itself, and its own operations, then our preferences about how we want it to model us aren't much harder to connect to actions in the world than our other preferences. But how are we supposed to get any human preferences learned reliably? With the language model we could agree to pretend that it's going to end up aligned-ish, because it learns the human text generating process and very little else. Such a story is harder to come by for an AI with a more general world model trained with self-supervised predictive loss. Still, I think all of these are problems that can be worked on, not necessarily fatal flaws.
A further complication (perhaps not meta-preferences' fault, but certainly associated with them) is that where our value-learning AI eventually ends up in preference-space depends on where it starts. This can lead to certain problems (Stuart), and we might want to better understand this process and make sure it leads somewhere sensible (me). However, some amount of this dynamic is essential; for starters, picking out humans as the things whose values we want to learn (rather than e.g. evolution) has the type signature of meta-preference. Learning human meta-preferences can push you around in preference-space, but you've still got to start somewhere.
How does all this connect back to Goodhart? I propose that a lot of the feeling of unease when considering value learning schemes reliant on human modeling is because we don't think they'd satisfy our meta-preferences. If the value learning AI is modeling us in an alien way, then even if there's some setting of its parameters that would lead to outcomes we approve of, it feels like it would be surrounded on all sides by steep cliffs with spikes at the bottom. This pointlike nature of the "True Values" is a key component of Absolute Goodhart arguments.
IV - Meandering about domains of validity
A meta-preference that I think is crucial for making our lives easier is a sort of conservatism, where we prefer to keep the world inside the domain of validity of our preferences. What's a domain of validity, anyhow?
Option one: The domain of validity comes bundled with the model of the world. This is like Newtonian mechanics coming with a disclaimer on it saying "not valid above 0.1 c." This way keeps things nice and simple for our limited brains, but it's clunky to use in abstract arguments.
Option two: We could have a plethora of different models of the world, and where they broadly agree we call it a "domain of validity," and as they agree less, we trust them less. When I talk about individual preferences having a domain of validity, we can translate this to there being many similar models that use variations on this preference, and there's some domain where they more or less agree, but as you leave that domain they start disagreeing more and more.
Our models in this case have two roles; they make predictions about the world, and they also contain inferences about our preferences. Basically always, it's the preferential domain of validity that we care about. If there are two models that always predict the same behavior from us, and usually agree about our preferences, but have some situations where they utterly disagree about preferences, those situations are the ones outside the domain of validity.
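To make option two a bit more concrete, here is a minimal Python sketch. Everything in it is invented for illustration: the three models (`model_a`, `model_b`, `model_c`), the sugar example, and the `tolerance` threshold are hypothetical, not a proposal for how to actually model people. It only captures the preferential side (how much the models disagree about what I want), and treats "domain of validity" as "the region where that disagreement is small."

```python
from statistics import pstdev

# Three hypothetical models of how much I like a snack, as a function of
# its sugar content in grams. All numbers are made up for illustration.
def model_a(sugar):
    # Saturating: past 30 g, more sugar stops helping.
    return min(sugar, 30) / 3.0

def model_b(sugar):
    # Linear: more sugar is always better.
    return sugar / 3.0

def model_c(sugar):
    # Like model_b, but far too much sugar becomes actively unpleasant.
    return sugar / 3.0 - max(0, sugar - 30) ** 2 / 50.0

MODELS = [model_a, model_b, model_c]

def disagreement(sugar):
    """Spread (population std. dev.) of the models' preference estimates."""
    return pstdev([m(sugar) for m in MODELS])

def in_domain_of_validity(sugar, tolerance=1.0):
    """Assumed criterion: the models agree to within `tolerance`."""
    return disagreement(sugar) <= tolerance

for grams in (20, 35, 100):
    print(grams, round(disagreement(grams), 2), in_domain_of_validity(grams))
```

For everyday amounts of sugar the three models give the same answer, so `in_domain_of_validity(20)` is true. At 100 g they disagree wildly (one says "fine," one says "great," one says "awful"), so that situation counts as outside the domain of validity under this toy criterion.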
What would ever incentivize a person or AI to leave the domain of validity of our preferences? Imagine you're trying to predict the optimal meal, and you make 10 different models of your preferences about food. If nine of these models think a meal would be a 2/10, and the last model thinks a meal would be a 1,000/10, you'd probably be pretty tempted to try that meal.
Ultimately, what you do depends on how you're aggregating models. Avoiding going outside the domain of validity looks like using an aggregation function that puts more weight on the pessimistic answers than the optimistic ones, or even penalizing positive variance. In the language of meta-preferences, I don't want one way of modeling me to return "super-duper-happy" while other reasonable ways of modeling me return "confused."
This meta-preference doesn't make sense if you think that there's actually One True way of modeling humans and we just don't know which it is. If our uncertainty about how to model humans were epistemic uncertainty, the right thing to do would be Bayesian updating and linear aggregation. All this talk about domains of validity would be invalid. So it's an important fact that we aren't searching for the One True model of humans; we're just refining the desiderata by which we rate many possible models.
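The difference between linear aggregation and a more conservative aggregation can be sketched with the meal example from above. This is only a toy, under assumptions that aren't in the argument itself: the function names, the `penalty` weight, the `worst_k` cutoff, and the "familiar meal" that every model rates 7/10 are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev

# Scores (out of 10) from ten hypothetical models of my food preferences.
# Nine say 2/10 and one says 1,000/10, as in the meal example.
weird_meal = [2.0] * 9 + [1000.0]
# A familiar meal that every model agrees is a solid 7/10 (assumed).
familiar_meal = [7.0] * 10

def linear(scores):
    """Plain average: what you'd use if the models were hypotheses
    about a single true model, weighted equally."""
    return mean(scores)

def variance_penalized(scores, penalty=1.0):
    """Average minus a penalty on how much the models disagree."""
    return mean(scores) - penalty * pstdev(scores)

def pessimistic(scores, worst_k=3):
    """Average of only the `worst_k` lowest scores."""
    return mean(sorted(scores)[:worst_k])

for name, aggregate in [("linear", linear),
                        ("variance_penalized", variance_penalized),
                        ("pessimistic", pessimistic)]:
    choice = "weird" if aggregate(weird_meal) > aggregate(familiar_meal) else "familiar"
    print(f"{name}: picks the {choice} meal")
```

Linear aggregation is pulled toward the weird meal by the single enthusiastic model, while both conservative aggregators prefer the meal the models agree about. Which conservative rule is right, and how strong the penalty should be, is exactly the sort of thing meta-preferences would have to settle.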
V - Making sense
It's time to finally do some Goodhart-reducing.
The classic mechanisms of Goodhart's law are about how optimizing a proxy - even one that's close to our True Values in everyday life - can lead to a bad score according to our True Values. This sort of Absolute Goodhart reasoning is convenient to us because most common examples of Goodhart's law involve a simple proxy leading to results that are obviously wrong. Absolute Goodhart poses a problem to any attempt to learn human values, because a value learning AI is just a complicated sort of proxy.
But for real physical humans, there are no unique True Values to compare proxies to. We can only compare models to other models. So to talk about Goodhart's law in a more naturalistic language, we have to make some edits.
It turns out to be pretty easy: just replace "proxy" with "one model" and "True Values" with "other models, especially those we find obvious when doing verbal reasoning." This gives you Relative Goodhart, which is much more useful for building value learning AI. As you can probably guess, I picked the names "Absolute" and "Relative" because in Absolute Goodhart you compare inferred human values to the lodestar of the True Values, while in Relative Goodhart you're just comparing one way of inferring human values to other ways.
In Relative Goodhart, the mechanisms of Goodhart's law are ways that one model of human values can be driven apart from other models. We can illustrate this by going back through Goodhart Taxonomy and translating the arguments:
- Extremal Goodhart:
- Absolute Goodhart: When optimizing for some proxy for value, worlds in which that proxy takes an extreme value are probably very different (drawn from a different distribution) than the everyday world in which the relationship between the proxy and true value was inferred, and this big change can magnify any discrepancies between the proxy and the true values.
- Relative Goodhart: When optimizing for one model of human preferences, worlds in which that model takes an extreme value are probably very different than the everyday world from which that model was inferred, and this big change can magnify any discrepancies between similar models that used to agree with each other. Lots of model disagreement often signals to us that the validity of the preferences is breaking down, and we have a meta-preference to avoid this.
- This transformation works very neatly for Extremal Goodhart, so I took the liberty of ordering it first in the list.
- Regressional Goodhart:
- Absolute Goodhart: If you select for high value of a proxy, you select not just for signal but also for noise. You'll predictably get a worse outcome than the naive estimate, and if there are some parts of the domain that have more noise without lowering the signal, the maximum value of the proxy is more likely to be there.
- Relative Goodhart: If you select for high value according to one model of humans, you select not just for the component that agrees with the aggregate of other models, but also the component that disagrees. Other models will predictably value your choice less than the model you're optimizing, and if there are some parts of the domain that tend to drive this model's estimates apart from the others' without lowering the average value, the maximum value is more likely to be there.
- Causal Goodhart:
- Absolute Goodhart: If we pick a proxy to optimize that's correlated with True Value but not sufficient to cause it, then there might be appealing ways to intervene on the proxy that don't intervene on what we truly want.
- Relative Goodhart: If we have two modeled preferences that are correlated, but one is actually the causal descendant of the other, then there might be appealing ways to intervene on the descendant preference that don't intervene on the ancestor preference.
There's a related issue when we have modeled preferences that are coarse-grainings or fine-grainings of each other. There can be ways to intervene on the fine-grained model that don't intervene on the coarse-grained model.
These translated Goodhart arguments all make the same change, which replaces failure according to particular True Values with failure according to other reasonable models of our preferences. As Stuart Armstrong put it, Goodhart's law is model splintering for values.
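The Relative version of Regressional Goodhart is easy to simulate. The sketch below is a toy under strong assumptions that are not part of the taxonomy: each model's score for an option is a shared component plus independent Gaussian noise of the same size, and the function name and default parameters are hypothetical.

```python
import random

def regressional_gap(n_options=200, n_models=5, n_trials=200, seed=0):
    """Average amount by which the optimized model (model 0) rates its
    favorite option above the average rating of the other models."""
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(n_trials):
        best_score = float("-inf")
        best_others = 0.0
        for _ in range(n_options):
            # Component of value that all the models agree on.
            shared = rng.gauss(0, 1)
            # Each model adds its own idiosyncratic component.
            scores = [shared + rng.gauss(0, 1) for _ in range(n_models)]
            # Track the option that model 0 likes best.
            if scores[0] > best_score:
                best_score = scores[0]
                best_others = sum(scores[1:]) / (n_models - 1)
        total_gap += best_score - best_others
    return total_gap / n_trials

print(regressional_gap())
```

Under these assumptions the returned gap should come out clearly positive: picking the option that one model likes best selects for that model's idiosyncratic component as well as the shared one, so the other models predictably rate the choice lower.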
Although this change may seem boring or otiose, I think it's actually a huge opportunity. In the first post I complained that Absolute Goodhart's law didn't admit of solutions. When trying to compare a model to the True Values, we didn't know the True Values. But when comparing models to other models, nothing there is unknowable!
In the next and final post, the plan is to tidy this claim up a bit, see how it applies to various proposals for beating Goodhart's law for value learning, and zoom out to talk about the bigger picture for at least a whole paragraph.
At least, up to some finite amount of shuffling that's like a choice of prior, or universal Turing machine, or definition of "agent-shaped."
You may recognize a resemblance to inferring human values.
That would lead to unpalatable positions like "whatever the human did, that's what they wanted" or "the human wants to follow the laws of physics."
Comparing preferences across models is currently an open problem. If you take this post's picture of inferring human preferences literally (rather than e.g. imagining we'll be able to train a big neural network that does all this internally), we had better figure out how to translate between ontologies better.
And as with Extremal, we would rather not go to the part of phase space where the models of us all disagree with each other.
My addition of the variance-seeking pressure under the umbrella of Regressional Goodhart really highlights the similarities between it and Extremal Goodhart. Both are simplifications of the same overarching math, it's just that in the Regressional case we're doing even more simplification (requiring there to be a noise term with nice properties), allowing for a more specific picture of the optimization process.