# Defeating Goodhart and the "closest unblocked strategy" problem

This post is longer and more self-contained than my recent stubs.

tl;dr: Patches such as telling the AI "avoid X" will result in Goodhart's law and the nearest unblocked strategy problem: the AI will do almost exactly what it was going to do, except narrowly avoiding the specific X.

However, if the patch can be replaced with "I am telling you to avoid X", and this is treated as information about what to avoid, and the biases and narrowness of my reasoning are correctly taken into account, these problems can be avoided. The important thing is to correctly model my uncertainty and overconfidence.

## AIs don't have a Goodhart problem, not exactly

The problem of an AI maximising a proxy utility function seems similar to the Goodhart Law problem, but isn't exactly the same thing.

The standard Goodhart law is a principal-agent problem: the principal P and the agent A both know, roughly, what the principal's utility $U$ is (e.g. $U$ aims to create a successful company). However, fulfilling $U$ is difficult to measure, so a measurable proxy $V$ is used instead (e.g. $V$ aims to maximise share price). Note that the principal's and the agent's goals are misaligned, and the measurable $V$ serves to (try to) bring them more into alignment.

For an AI, the problem is not that $U$ is hard to measure, but that it is hard to define. And the AI's goals are $V$: there is no need to make $V$ measurable; it is not a check on the AI, but the AI's intrinsic motivation.

This may seem like a small difference, but it has large consequences. We could give an AI a $W$, our "best guess" at $U$, while also including all our uncertainty about how to define $U$. This option is not available for the principal-agent problem, since giving a complicated goal to a more knowledgeable agent just gives it more opportunities to misbehave: we can't rely on it maximising the goal, we have to check that it does so.
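To make the contrast concrete, here is a minimal sketch, not a definitive implementation. It compares an agent that maximises a single narrow proxy (called V here) with one that maximises an expectation over several candidate utilities (a stand-in for the wider best guess W). The outcome features (`turns`, `babies`, `vases`), the penalty weights, the three actions and the 50/50 weighting are all hypothetical, chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# An outcome is described by a few hypothetical features.
Outcome = Dict[str, int]
Utility = Callable[[Outcome], float]


def proxy_v(o: Outcome) -> float:
    """Narrow proxy: only turns and babies matter (made-up weights)."""
    return -1.0 * o["turns"] - 100.0 * o["babies"]


def candidate_with_vases(o: Outcome) -> float:
    """Another candidate for the true utility: vases matter too."""
    return proxy_v(o) - 10.0 * o["vases"]


def best_action(actions: Dict[str, Outcome],
                weighted_utilities: List[Tuple[float, Utility]]) -> str:
    """Pick the action with the highest expected utility over the candidates."""
    def expected(o: Outcome) -> float:
        return sum(p * u(o) for p, u in weighted_utilities)
    return max(actions, key=lambda name: expected(actions[name]))


# Hypothetical actions and the outcomes they lead to.
actions = {
    "straight": {"turns": 4, "babies": 1, "vases": 0},
    "dodge_baby": {"turns": 6, "babies": 0, "vases": 2},
    "long_way": {"turns": 10, "babies": 0, "vases": 0},
}

narrow = [(1.0, proxy_v)]                             # certain that V is right
wide = [(0.5, proxy_v), (0.5, candidate_with_vases)]  # admits V may be incomplete

print(best_action(actions, narrow))
print(best_action(actions, wide))
```

With these made-up numbers, the narrow agent takes the nearest unblocked route through the vases, while the agent that admits the proxy may be incomplete takes the long way round.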

## Overfitting to the patches

There is a certain similarity with many machine learning techniques. Neural nets that distinguish cats and dogs could treat any "dog" photo as a specific patch that can be routed around. In that case, the net would define "dog" as "anything almost identical to the dog photos I've been trained on", and "cat" as "anything else".

And that would be a terrible design; fortunately, modern machine learning gets around the problem by, in effect, assigning uncertainty correctly: "dog" is not seen as the exact set of dog photos in the training set, but as a larger, more nebulous concept, of which the specific dog photos are just examples.

Similarly, we could define $W$ as $V+\Delta$, where $V$ is our best attempt at specifying $U$, and $\Delta$ encodes the fact that $V$ is but an example our imperfect minds have come up with, to try and capture $U$. We know that $V$ is oversimplified, and $\Delta$ is an encoding of this fact. If a neural net could synthesise a decent estimate of "dog" from some examples, could it synthesise "friendliness" from our attempts to define it?

The idea is best explained through an example.

## Example: Don't crush the baby or the other objects

This section will present a better example, I believe, than the original one presented here.

A robot exists in a grid world:

The robot's aim is to get to the goal square, with the flag. It gets a fixed penalty for each turn it isn't there.

If that were the only reward, the robot's actions would be disastrous:

So we will give it an additional penalty for running over babies. If we do so, we will get a Goodhart/nearest unblocked strategy behaviour:

Oops! Turns out we valued those vases as well.

What we want the AI to learn is not that the baby is specifically important, but that the baby is an example of important things it should not crush. So imagine it is confronted by the following, which includes six types of objects, of unknown value:

Instead of having humans hand-label each item, we generalise from some hand-labelled examples, using rules of extrapolation and some machine learning. This tells the AI that, typically, we value about one-in-six objects, and value them at a tenth of the value of babies (hence it gets a tenth of the baby penalty for running one over). Given that, the policy with the best expected reward is:

This behaviour is already much better than we would expect from a typical Goodhart law-style agent (and we could complicate the example to make the difference more emphatic).
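The expected-reward reasoning can be sketched as follows. This is only an illustration under stated assumptions: the turn penalty, the baby penalty, and the three routes (with their turn counts and numbers of unknown objects crushed) are hypothetical; only the "one-in-six" probability and the "a tenth of the value of babies" ratio come from the text.

```python
from fractions import Fraction

# Hypothetical penalties, chosen only for illustration.
TURN_PENALTY = 1
BABY_PENALTY = 100
OBJECT_PENALTY = Fraction(BABY_PENALTY, 10)  # "a tenth of the value of babies"

# Hypothetical routes: (turns taken, unknown objects crushed on the way).
ROUTES = {
    "direct": (6, 3),
    "middle": (8, 1),
    "long": (12, 0),
}


def expected_reward(turns: int, crushed: int, p_valued: Fraction) -> Fraction:
    """Expected reward when each crushed object is valued with probability p_valued."""
    return -TURN_PENALTY * turns - p_valued * OBJECT_PENALTY * crushed


def best_route(p_valued: Fraction) -> str:
    """Route with the highest expected reward for a given p_valued."""
    return max(ROUTES, key=lambda name: expected_reward(*ROUTES[name], p_valued))


for p in (Fraction(0), Fraction(1, 6), Fraction(1, 2)):
    print(p, best_route(p))
```

Under these assumptions, an agent that believes no unknown object matters ploughs straight through, a one-in-six belief makes it accept crushing a single object, and a high enough probability pushes it onto the longest, safest route.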

## Example: human over-confidence

The above works if we humans correctly account for our uncertainty - if we not only produce $V$, but also a correct encoding $\Delta$ of how good a match we expect between $V$ and $U$.

But we humans are often overconfident in our estimates, especially in our estimates of value. We are far better at hindsight ("you shouldn't have crushed the vase") than at foresight ("here's a complete list of what you shouldn't do"). Even knowing that hindsight is better doesn't make the issue go away.

This is similar to the planning fallacy. That fallacy means that we underestimate the time taken to complete tasks - even if we try to take the planning fallacy into account.

However, the planning fallacy can be solved using the outside view: comparing the project to similar projects, rather than using detailed inner knowledge.

Similarly, human overconfidence can be solved by the AI noting our initial estimates, our corrections to those initial estimates, our corrections taking into account the previous corrections, our attempts to take into account all previous repeated corrections - and the failure of those attempts.

Suppose, for example, that humans, in hindsight, value one-in-three of the typical objects in the grid world. We start out with an estimate of one-in-twelve; after the robot mashes a few too many of the objects, we update to one-in-nine; after being repeatedly told that we underestimate our hindsight, we update to one-in-six... and stay there.

But meanwhile, the robot can still see that we continue to underestimate, and goes directly to a one-in-three estimate; so with new, unknown objects, it will only risk crushing a single one:

If the robot learnt that we valued even more objects (or valued some of them more highly), it would then default to the safest, longest route:

In practice, of course, the robot will also be getting information about what types of objects we value, but the general lesson still applies: the robot can learn that we underestimate uncertainty, and increase its own uncertainty in consequence.
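One way the robot could "see that we continue to underestimate" is to weigh our stated foresight against our hindsight track record. The sketch below uses a Beta-Bernoulli update as a stand-in technique; it is an assumption of mine, not a method described above. The prior weight and the counts of hindsight judgements are hypothetical.

```python
from fractions import Fraction


def posterior_fraction_valued(stated: Fraction, prior_weight: int,
                              regretted: int, judged: int) -> Fraction:
    """Posterior mean of a Beta-Bernoulli model.

    stated       -- the humans' foresight estimate, used as the prior mean
    prior_weight -- how many pseudo-observations the prior is worth (assumption)
    regretted    -- objects the humans said, in hindsight, should not be crushed
    judged       -- total objects the humans have judged in hindsight
    """
    alpha = stated * prior_weight
    beta = (1 - stated) * prior_weight
    return (alpha + regretted) / (alpha + beta + judged)


# Humans are stuck at one-in-six; hindsight keeps saying about one-in-three.
stated = Fraction(1, 6)
estimate = posterior_fraction_valued(stated, prior_weight=6,
                                     regretted=20, judged=60)
print(stated, estimate)
```

Because the stated estimate is only worth a few pseudo-observations here, the hindsight data dominates and the robot's estimate ends up much nearer one-in-three than one-in-six. How much weight the stated estimate deserves is exactly the design choice the robot would have to learn.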

## Full uncertainty, very unknown unknowns

So, this is a more formal version of ideas I posted a while back. The process could be seen as:

1. Give the AI $W$ as our current best estimate for $U$.
2. Encode our known uncertainties about how well $W$ relates to $U$.
3. Have the AI deduce, from our subsequent behaviour, how well we have encoded our uncertainties, and change these as needed.
4. Repeat 2-3 for different types of uncertainties.
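The steps above can be sketched as a loop over types of uncertainty. This is a rough illustration under stated assumptions, not a worked-out design: the `UncertaintyType` class, the crude "mean ratio of hindsight to stated value" correction, and all the numbers are hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Tuple


@dataclass
class UncertaintyType:
    """One kind of uncertainty we encoded in step 2."""
    name: str
    stated: float  # what we humans currently claim
    # Past rounds: (what we stated then, what hindsight later showed).
    history: List[Tuple[float, float]] = field(default_factory=list)

    def correction(self) -> float:
        """Step 3: how far off our past statements were, on average."""
        if not self.history:
            return 1.0  # no track record, so take the statement at face value
        return mean([hindsight / said for said, hindsight in self.history])

    def corrected(self) -> float:
        """Apply the learnt correction to the current statement."""
        return self.stated * self.correction()


def recalibrate(types: List[UncertaintyType]) -> Dict[str, float]:
    """Step 4: repeat the correction for every type of uncertainty."""
    return {t.name: t.corrected() for t in types}


types = [
    UncertaintyType("fraction_valued", 0.2, [(0.1, 0.2), (0.15, 0.3)]),
    UncertaintyType("object_penalty", 10.0, [(8.0, 12.0), (10.0, 15.0)]),
    UncertaintyType("pair_bonus", 1.0),  # never checked against hindsight
]
print(recalibrate(types))
```

A multiplicative correction is the simplest choice that captures "we systematically underestimate"; it says nothing about types of uncertainty that were never listed in the first place.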

What do I mean by "different types" of uncertainty? Well, the example above was simple: the model had but a single uncertainty, over the proportion of typical objects that we valued. The AI learnt that we systematically underestimated this, even when it helped us try and do better.

But there are other types of uncertainties that could happen. We value some objects more than others, but maybe these estimates are not accurate either. Maybe we are fine as long as one object of a type exists, and don't care about the others - or, conversely, maybe some objects are only valuable in pairs. The AI needs a rich enough model to be able to account for these extra types of preferences, which we may never have articulated explicitly.

There are even more examples as we move from gridworlds into the real world. We can articulate ideas like "human value is fragile" and maybe give an estimate of the total complexity of human values. And then the agent could use examples to estimate the quality of our estimate, and come up with a better number for the desired complexity.

But "human value is fragile" is a relatively recent insight. There was a time when people hadn't articulated that idea. So it's not that we didn't have a good estimate for the complexity of human values; we didn't have any idea that it was a good thing to estimate.

The AI has to figure out the unknown unknowns. Note that, unlike the value synthesis project, the AI doesn't need to resolve this uncertainty; it just needs to know that it exists, and give a good-enough estimate of it.

The AI will certainly figure out some unknown unknowns (and unknown knowns): it just has to spot some patterns and connections we were unaware of. But in order to get all of them, the AI has to have some sort of maximal model in which all our uncertainty (and all our models) can be contained.

Just consider some of the concepts I've come up with (I chose these because I'm most familiar with them; LessWrong abounds with other examples): siren worlds, humans making similar normative assumptions about each other, and the web of connotations.

In theory, each of these should have reduced my uncertainty, and moved $W$ closer to $U$. In practice, each of these has increased my estimate of uncertainty, by showing how much remains to be done. Could an AI have taken these effects correctly into account, given that these three examples are of very different types? Can it do so for discoveries that remain to be made?

I've argued that an indescribable hellworld cannot exist. There's a similar question as to whether there exists human uncertainty about $U$ that cannot be included in the AI's model of that uncertainty. By definition, this uncertainty would be something that is currently unknown and unimaginable to us. However, I feel that it's far more likely to exist than the indescribable hellworld.

Still, despite that issue, it seems to me that there are methods of dealing with the Goodhart problem/nearest unblocked strategy problem. And this involves properly accounting for all our uncertainty, directly or indirectly. If we do this well, there no longer remains a Goodhart problem at all.

## Comments

Cheers! Reading it now.

This seems interesting but I don't really understand what you're proposing.

> 1. Give the AI W as our current best estimate for U.

Is W a single utility function?

> 2. Encode our known uncertainties about how well W relates to U.

What is the type signature of the encoded data (let's call it D) here? A probability distribution for U-W, or for U? Or something else?

> 3. Have the AI deduce, from our subsequent behaviour, how well we have encoded our uncertainties, and change these as needed.

How does the AI actually do this? Does it use some sort of meta-prior, separate from D? Suppose we were overconfident in step 2, e.g., let's say we neglected to include some uncertainty in D (there is a certain kind of computation that is highly negatively valuable, but in W we specified it as having value 0, and in D we didn't include any uncertainty about it so the AI thinks that this kind of computation has value 0 with probability 1), how would the AI "deduce" that we were wrong? (Or give an example with a different form of overconfidence if more appropriate.)

> 4. Repeat 2-3 for different types of uncertainties.

Do you literally mean that 2-3 should be done separately for each kind of uncertainty, or just that we should try to include all possible types of uncertainties into D in step 2?

Also, Jessica Taylor's A first look at the hard problem of corrigibility went over a few different ways that an AI could formalize the fact that humans are uncertain about their utility functions, and concluded that none of them would solve the problem of corrigibility. Are you proposing a different way of formalizing it that's not on her list, or do you get around the issue by trying to solve a different problem?

> This seems interesting but I don't really understand what you're proposing.

The last section is more aspirational and underdeveloped; the main point is noticing that Goodhart can be defeated in certain circumstances, and speculating how that could be extended. I'll get back to this at a later date (or others can work on it!).

> Also, Jessica Taylor's A first look at the hard problem of corrigibility went over a few different ways that an AI could formalize the fact that humans are uncertain about their utility functions, and concluded that none of them would solve the problem of corrigibility.

This is not a design for corrigible agents (if anything, it's more a design for low impact agents). The aim of this approach is not to have an AI that puts together the best U, but one that doesn't go maximising a narrow V, and has wide enough uncertainty to include a decent U among the possible utility functions, and that doesn't behave too badly.

> This is not a design for corrigible agents (if anything, it’s more a design for low impact agents). The aim of this approach is not to have an AI that puts together the best U, but one that doesn’t go maximising a narrow V, and has wide enough uncertainty to include a decent U among the possible utility functions, and that doesn’t behave too badly.

Ok, understood, but I think this approach might run into similar problems as the attempts to formalize value uncertainty in Jessica's post. Have you read it to see if one of those ways to formalize value uncertainty would work for your purposes, and if not, what would you do instead?

I did read it. The main difference is that I don't assume that humans know their utility function, or that "observing it over time" will converge on a single point. The AI is expected to draw boundaries between concepts; boundaries that humans don't know and can't know (just as image recognition neural nets do).

What I term uncertainty might better be phrased as "known (or learnt) fuzziness of a concept or statement". It differs from uncertainty in the Jessica sense in that knowing absolutely everything about the universe, about logic, and about human brains, doesn't resolve it.

> What I term uncertainty might better be phrased as “known (or learnt) fuzziness of a concept or statement”. It differs from uncertainty in the Jessica sense in that knowing absolutely everything about the universe, about logic, and about human brains, doesn’t resolve it.

In this approach, does the uncertainty/fuzziness ever get resolved (if so how?), or is the AI stuck with a "fuzzy" utility function forever? If the latter, why should we not expect that to incur an astronomically high opportunity cost (due to the AI wasting resources optimizing for values that we might have but actually don't) from the perspective of our real values?

Or is this meant to be a temporary solution, i.e., at some point we shut this AI down and create a new one that is able to resolve the uncertainty/fuzziness?

> In this approach, does the uncertainty/fuzziness ever get resolved (if so how?), or is the AI stuck with a "fuzzy" utility function forever? If the latter, why should we not expect that to incur an astronomically high opportunity cost (due to the AI wasting resources optimizing for values that we might have but actually don't) from the perspective of our real values?

The fuzziness will never get fully resolved. This approach is to deal with Goodhart-style problems without optimising leading to disaster; I'm working on other approaches that could allow the synthesis of the actual values.

> The fuzziness will never get fully resolved. This approach is to deal with Goodhart-style problems without optimising leading to disaster;

I'm saying this isn't clear, because optimizing for a fuzzy utility function instead of the true utility function could lead to astronomical waste or be a form of x-risk, unless you also had a solution to corrigibility such that you could shut down the AI before it used up much of the resources of the universe trying to optimize for the fuzzy utility function. But then the corrigibility solution seems to be doing most of the work of making the AI safe. For example without a corrigibility solution it seems like the AI would not try to help you resolve your own uncertainty/fuzziness about values and would actually impede your own efforts to do so (because then your values would diverge from its values and you'd want to shut it down later or change its utility function).

> I'm working on other approaches that could allow the synthesis of the actual values.

Ok, so I'm trying to figure out how these approaches fit together. Are they meant to both go into the same AI (if so how?), or is it more like, "I'm not sure which of these approaches will work out so let's research them simultaneously and then implement whichever one seems most promising later"?

> "I'm not sure which of these approaches will work out so let's research them simultaneously and then implement whichever one seems most promising later"

That, plus "this approach has progressed as far as it can, there remains uncertainty/fuzziness, so we can now choose to accept the known loss to avoid the likely failure of maximising our current candidate without fuzziness". This is especially the case if, like me, you feel that human values have diminishing marginal utility to resources. Even without that, the fuzziness can be an acceptable cost, if we assign a high probability to loss to Goodhart-like effects if we maximise the wrong thing without fuzziness.

There's one other aspect I should emphasise: AIs drawing boundaries we have no clue about (as they do now between pictures of cats and dogs). When an AI draws boundaries between acceptable and unacceptable worlds, we can't describe this as reducing human uncertainty: the AI is constructing its own concepts, finding patterns in human examples. Trying to make those boundaries work well is, to my eyes, not well described in any Bayesian framework.

It's very possible that we might get to a point where we could say "we expect that this AI will synthesise a good measure of human preferences. The measure itself has some light fuzziness/uncertainty, but our knowledge of it has a lot of uncertainty".

So I'm not sure that uncertainty or even fuzziness are necessarily the best ways of describing this.

One further issue is that if the AI deduces this within one human-model (as in CIRL), it may follow this model off a metaphorical cliff when trying to maximize modeled reward.

Merely expanding the family of models isn't enough because the best-predicting model is something like a microscopic, non-intentional model of the human. A "nearest unblocked model" problem. The solution should be similar - get the AI to score models so that the sort of model we want it to use is scored highly. (Or perhaps more complicated where human morality is undefined.) This isn't just a prior - we want predictive quality to only be one of several (as yet ill-defined) criteria.