Using expected utility for Good(hart)

Gordon Seidoh Worley


> Another important challenge is to list the different possible variables that humans might care about; in the example above, we were given the list of 1000, but what if we didn't have it? Also, those variables could only go one way - up. What if there were a real-valued variable that the agent suspected humans cared about - but didn't know whether we wanted it to be high or low?

This seems to me the fatal problem with this approach. Although I don't have a proof, my suspicion is that there are so many variables that the computations you describe for mitigating Goodharting would be infeasible: the agent would not actually be able to take all the variables into account during optimization.

Goodhart's curse happens when you want to maximise an unknown or uncomputable U, and instead choose a simpler proxy V, and maximise that instead. Then the optimisation pressure on V transforms it into a worse proxy of U, possibly resulting in a very bad result from the U-perspective.

Many suggestions for dealing with this involve finding some other formalism, and avoiding expected utility maximisation entirely.

However, it seems that classical expected utility maximisation can work fine, as long as we properly account for all our uncertainty and all our knowledge.

## The setup

Imagine that there are 1000+5 different variables that humans might care about; the default value of these variables is 0, and they are all non-negative.

Of these, 5 are known to be variables that humans actually care about, and would like to maximise as much as possible. Of the remaining 1000, an AI agent knows that humans care about 100 of these, and want their values to be high - but it doesn't know which 100.

The agent has a "budget" of 1000 to invest in any variables it chooses; if it invests X in a given variable v, that variable is set to √X. Thus there is a diminishing return for every variable.

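Then set V, up to a constant scale factor, to the sum of the 5 known variables $v_1, \dots, v_5$:

$$V \propto \sum_{i=1}^{5} v_i$$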

And we get a classic Goodhart scenario: the agent will invest 200 in each of these variables, setting them to just about 14, and ignore all the other variables.
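The arithmetic behind this allocation can be checked with a minimal sketch. It assumes that V is the plain sum of the 5 known variables (any positive rescaling gives the same allocation); the names `value`, `proxy_utility`, `even` and `skewed` are illustrative and not from the original setup.

```python
import math

BUDGET = 1000.0
N_KNOWN = 5

def value(investment: float) -> float:
    """Value a variable reaches after `investment` is spent on it (square root rule)."""
    return math.sqrt(investment)

def proxy_utility(investments) -> float:
    """Assumed proxy V: the sum of the values of the 5 known variables."""
    return sum(value(x) for x in investments)

# Even split of the whole budget across the 5 known variables.
even = [BUDGET / N_KNOWN] * N_KNOWN
# A hypothetical uneven split of the same budget, for comparison.
skewed = [400.0, 300.0, 200.0, 100.0, 0.0]

print("value of each known variable:", value(even[0]))
print("V (even split):  ", proxy_utility(even))
print("V (skewed split):", proxy_utility(skewed))
```

Because the square root is concave, the even split beats any uneven split of the same budget, and every unit spent outside the 5 known variables adds nothing to this V.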

## It knows we care, but not about what

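In that situation, we have not, however, incorporated all that we know into the definition of V. We know that there are 100 other variables humans care about. So we can define V, up to a constant scale factor, as the sum of the 5 known variables $v_1, \dots, v_5$ and of the variables $u_j$ in the set $C$ of 100 unknown variables that humans care about:

$$V \propto \sum_{i=1}^{5} v_i + \sum_{j \in C} u_j$$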

Of course, we don't know what the 100 variables are; but we can compute the expectation of V.

In order to make the model more interesting, and introduce more complicated trade-offs, I'll designate some of the 1000 variables as "stiff" variables: these require 40 times more investment to reach the same value. The number of stiff variables varies between 1 and 99 (to ensure that there is always at least one variable the humans care about among the non-stiff variables).

Then we, or an agent, can do classical expected utility maximisation on V:

This graph plots various values against the number of stiff variables. The orange dots designate the values of the 5 known variables; the agent prioritises these, because they are known. At the very bottom we can make out some brown dots; these are the values of the stiff variables, which hover barely above 0: the agent wastes little effort on these.

The purple dots represent the values of the non-stiff variables; the agent does boost them, but because they have only a 1/10 chance of being variables humans care about, they get less priority. The blue dots represent the expected utility of V given all the other values; it moves from 0.13 approximately to 0.12 as the number of stiff variables rises.
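A rough sketch of this optimisation follows, under assumptions that are mine rather than the post's: V is taken to be the unnormalised sum of the 5 known variables and the 100 cared-about unknown ones, each unknown variable has a marginal probability of 100/1000 of being cared about, and a stiff variable reaches √(X/40) after an investment of X. With a weight w and a cost multiplier c per variable, maximising Σ w·√(x/c) under a fixed budget gives investments proportional to w²/c. The function name `allocate` is hypothetical, and the absolute scale of the expected utility depends on how V is normalised, so it need not match the figures quoted above.

```python
import math

BUDGET = 1000.0
N_KNOWN, N_UNKNOWN, N_CARED = 5, 1000, 100
STIFF_COST = 40.0  # assumed: a stiff variable needs 40x the investment for the same value

def allocate(n_stiff: int):
    """Return ([known, non-stiff, stiff] values, expected V) for the additive V."""
    p = N_CARED / N_UNKNOWN  # chance that a given unknown variable is cared about
    # Each group: (number of variables, weight in E[V], cost multiplier).
    groups = [
        (N_KNOWN, 1.0, 1.0),
        (N_UNKNOWN - n_stiff, p, 1.0),
        (n_stiff, p, STIFF_COST),
    ]
    # Lagrange condition: per-variable investment is proportional to w^2 / c.
    total = sum(n * w * w / c for n, w, c in groups)
    values = []
    for _, w, c in groups:
        investment = BUDGET * (w * w / c) / total
        values.append(math.sqrt(investment / c))
    expected_v = sum(n * w * v for (n, w, _), v in zip(groups, values))
    return values, expected_v

for n_stiff in (1, 50, 99):
    (known, non_stiff, stiff), expected_v = allocate(n_stiff)
    print(n_stiff, round(known, 3), round(non_stiff, 3), round(stiff, 4), round(expected_v, 2))
```

Under these assumptions the qualitative ordering matches the description: known variables highest, non-stiff variables at one tenth of that value, and stiff variables barely above 0.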

This is better than only maximising the 5 known variables, but still seems sub-optimal: after all, we know that human value is fragile, so we lose a lot by having the stiff variables so low, as they are very likely to contain things we care about.

## The utility knows that value is fragile

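*We* know that human value is fragile, but that has not yet been incorporated into the utility. The simplest way would be to define V as the minimum of the 5 known variables $v_1, \dots, v_5$ and of the variables $u_j$ in the set $C$ of 100 unknown variables that humans care about:

$$V = \min\left(v_1, \dots, v_5, \; \min_{j \in C} u_j\right)$$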

Again, V cannot be known by the agent, but its expected value can be calculated. For different numbers of stiff variables, we get the following behaviour:

The purple dots are the values to which the agent sets the 5 known variables and the non-stiff unknown variables. Because V is defined as a minimum, and because at least one of the non-stiff unknown variables must be one the humans care about, the agent will set them all to the same value. The brown dots track the value of the stiff variables, while the blue points are the expected value of V.

Initially, when there are few stiff variables, the agent invests its effort mainly into the other variables, hoping that humans don't care about any of the stiff variables. As the number of stiff variables increases, the probability that humans care about at least one of them also increases. By the time there are 40 stiff variables, it's almost a certainty that one of them is one of the 100 the humans care about; at that point, the agent has to essentially treat V as being the minimum of all 1000+5 variables. The values of all variables - and hence the expected utility - then continue to decline as the number of stiff variables further increases, which makes it more and more expensive to increase all variables.
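The probability claim can be checked with a short sketch. It assumes the 100 cared-about variables are a uniformly random subset of the 1000 unknown ones, and it restricts attention to a two-level policy (known and non-stiff variables at a level `a`, stiff variables at a level `b`); both function names are hypothetical.

```python
import math

N_UNKNOWN, N_CARED = 1000, 100

def p_no_stiff_cared(n_stiff: int) -> float:
    """Probability that none of the stiff variables is among the 100 cared-about ones."""
    # Count the size-100 subsets that avoid every stiff variable.
    return math.comb(N_UNKNOWN - n_stiff, N_CARED) / math.comb(N_UNKNOWN, N_CARED)

def expected_min_utility(a: float, b: float, n_stiff: int) -> float:
    """E[V] for the min-based V when known/non-stiff variables sit at a and stiff ones at b."""
    p0 = p_no_stiff_cared(n_stiff)
    # If no stiff variable is cared about, V = a; otherwise V = min(a, b).
    return p0 * a + (1.0 - p0) * min(a, b)

for n_stiff in (1, 10, 40, 99):
    print(n_stiff, "P(at least one stiff variable is cared about) =",
          round(1.0 - p_no_stiff_cared(n_stiff), 4))
```

The function `expected_min_utility` makes the trade-off explicit: the less likely it is that every stiff variable is irrelevant, the more of the expected utility rides on the stiff level `b`.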

This behaviour is much more conservative, and much closer to what we'd want the agent to actually be doing in this situation; it does not feel Goodharty at all.

## Using expected utility maximisation for Good(hart)

So before writing off expected utility maximisation as vulnerable to Goodhart effects, check whether you've incorporated all the information and uncertainty that you can into the utility function.

## The generality of this approach

It is not a problem, for this argument, if the number of variables humans care about is unknown, or if the tradeoff is more complicated than above. A probability distribution over the number of variables, and a more complicated optimal policy, would resolve these. Nor is the strict "min" utility formulation needed; a soft-min (or a mix of soft-min and min, depending on the importance of the variables) would also work, and allow the utility-maximiser to take less conservative tradeoffs.
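As an illustration of the soft-min option, here is one common smooth approximation to the minimum, the negative log-sum-exp. The text does not say which soft-min is intended, so this form, the function name `soft_min` and the `sharpness` parameter are all assumptions.

```python
import math

def soft_min(values, sharpness: float) -> float:
    """Smooth lower bound on min(values): -(1/k) * log(sum(exp(-k * v))) with k = sharpness."""
    m = min(values)
    # Shift by the minimum before exponentiating, for numerical stability.
    total = sum(math.exp(-sharpness * (v - m)) for v in values)
    return m - math.log(total) / sharpness

levels = [1.0, 2.0, 3.0]
for k in (1.0, 5.0, 50.0):
    print(k, soft_min(levels, k))
```

A large `sharpness` recovers the strict minimum, while a small one lets higher variables partly compensate for lower ones, which is what allows less conservative trade-offs.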

So, does the method generalise? For example, if we wanted to maximise CEV, and wanted to incorporate my criticisms of it, we could add the criticisms as a measure of uncertainty to the CEV. However, it's not clear how to transform my criticisms into a compact utility-function-style form.

More damningly, I'm sure that people could think of more issues with CEV, if we gave them enough time and incentives (and they might do that as part of the CEV process itself). Therefore we'd need some sort of process that scans for likely human objections to CEV and automatically incorporates them into the CEV process.

It's not clear that this would work, but the example above does show that it might function better than we'd think.

Another important challenge is to list the different possible variables that humans might care about; in the example above, we were given the list of 1000, but what if we didn't have it? Also, those variables could only go one way - up. What if there were a real-valued variable that the agent suspected humans cared about - but didn't know whether we wanted it to be high or low?

We could generate a lot of these variables in a variety of unfolding processes (processes that look back at human minds and use that to estimate what variables matter, and where to look for new ones), but that may be a challenge. Still, something to think about.