*Produced As Part Of The **SERI ML Alignment Theory Scholars** Program 2.1, mentored by **John Wentworth**.*

## Summary

We can solve Goodhart's Curse by making the agent (1) know the reliability of its goal representation and (2) cap the amount of optimization power devoted to achieving its goals, based on this reliability measure. One simple way to formalize this is with a generalization bound on the quality of the value proxy and a quantilizer whose $q$ value is chosen based on the generalization bound. For a competitive implementation of this algorithm, it's easy to create outer objectives that will push a learned planning algorithm toward soft optimization. I think this is a productive research avenue, and this approach should stack on top of value learning and interpretability approaches to reducing AGI risk.

## Extremal Goodhart

Powerful artificial agents are dangerous if they do unbounded uninterruptible optimization. Unbounded optimization is dangerous because of Extremal Goodhart, which is demonstrated in the diagram below.

In this diagram, the potential policies taken by the agent are arranged in order of expected proxy value^{[1]} on the x-axis. We can see that as we push further to the right (which is what an optimizer does), the proxy value becomes less correlated with the true value.^{[2]} If we completely trusted our distribution over possible utility functions, we would simply take the rightmost policy, since it has the highest expected value. But we can't trust this knowledge, because by choosing the rightmost policy we have applied optimization pressure to find the policy with the most upward error in its value estimate.

We can reduce this problem by randomizing our choice of policy. Randomizing the policy is equivalent to reducing the amount of optimization power. The problem is that randomizing inherently reduces the usefulness and capabilities of your agent.

This post is about an approach for setting the amount of optimization power such that the optimizer achieves as much expected value as possible, within the limits set by its knowledge about the utility function. In other words, it's dumber when in unfamiliar territory, i.e. out of distribution. If implemented well, I think this makes the alignment problem easier in two ways. One, it allows us to imperfectly specify the utility function, and two, it allows us to specify a single-task utility function and be uncertain about the utility of unrelated world states.

## Setting $q$ based on a generalization bound

First off, a $q$-quantilizer is an algorithm that samples randomly from the top $q$ proportion of policies. A 40%-quantilizer looks like this:

Quantilizers are the optimal policy distribution if you have some cost function, know its expected cost under the prior distribution, and want to upper bound the expected cost under the chosen policy distribution. Based on the required upper bound, you can determine the $q$ value. However, this isn't a good way to choose $q$. We don't know the cost function, and we don't know how much cost is too much.
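A minimal sketch of a quantilizer over a finite policy set, assuming a uniform prior and made-up proxy values and costs. The names `quantilizer`, `prior`, `proxy` and `cost` are hypothetical and chosen only for illustration. It also checks the standard cost bound numerically: the expected cost under the quantilizer is at most $1/q$ times the expected cost under the prior, for any non-negative cost.

```python
import numpy as np

def quantilizer(prior, proxy_value, q):
    """Return the q-quantilizer distribution over a finite set of policies.

    prior: base probabilities of each policy (sums to 1).
    proxy_value: proxy utility estimate of each policy.
    q: fraction of prior mass to keep, with 0 < q <= 1.
    """
    prior = np.asarray(prior, dtype=float)
    order = np.argsort(-np.asarray(proxy_value, dtype=float))  # best policies first
    kept = np.zeros_like(prior)
    remaining = q
    for i in order:
        take = min(prior[i], remaining)  # the boundary policy may be kept only partially
        kept[i] = take
        remaining -= take
        if remaining <= 0:
            break
    return kept / kept.sum()

rng = np.random.default_rng(0)
n = 1000
prior = np.full(n, 1.0 / n)        # assumed: uniform prior over policies
proxy = rng.normal(size=n)         # assumed: made-up proxy values
cost = rng.exponential(size=n)     # assumed: an arbitrary non-negative cost

q = 0.4
dist = quantilizer(prior, proxy, q)
sampled_policy = rng.choice(n, p=dist)

expected_cost_prior = prior @ cost
expected_cost_quantilized = dist @ cost
print(expected_cost_quantilized, "<=", expected_cost_prior / q)
```

The bound holds because the quantilizer never puts more than $1/q$ times the prior's probability on any policy.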

Instead, it's possible to determine the optimal policy distribution if we have a representation of uncertainty about the true utility function *and* a generalization bound telling us how reliable that representation is. The optimal policy distribution will often be exactly the same as a quantilizer, but sometimes not exactly.

A generalization bound for a learning algorithm is simply a number that upper bounds the average on-distribution **test loss**. We can use a generalization bound to guarantee that a learning algorithm hasn't overfitted to the training data. We are going to assume we have a generalization bound on the quality of our proxy of the true utility function (we will later formalize this proxy as a distribution ). This is just a formalization of the knowledge that our proxy works well *in normal situations*. We can then work out the relationship between proxy quality and changes in the policy distribution, and use this to work out the optimal policy distribution that doesn't damage the proxy quality too much.

Note that generalization bounds are reasonably easy to get. The simplest way to get one is to have a validation set of data not used during training, and use that to estimate the loss on the test dataset. By itself, this isn't a bound, but if we assume the validation set and the test set have the same data distribution, then we can use a concentration inequality to easily get an upper bound on the expected test loss. If we don't have a validation set, we can use theoretically derived generalization bounds, like this one, which upper bounds the expected error of an MNIST classifier at ~16% compared to an estimated true error of 1.8%.^{[3]} In my toy examples and generated graphs, I made up a generalization bound that was more than 10x the actual expected loss, based on this result.
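As a rough illustration of the validation-set route, here is a sketch that uses Hoeffding's inequality as the concentration inequality. It assumes the losses are i.i.d., bounded in an interval of known width, and drawn from the same distribution as the test data. The function name and the example numbers are hypothetical.

```python
import math
import numpy as np

def hoeffding_upper_bound(val_losses, delta=0.05, loss_range=1.0):
    """Upper bound on the expected loss, holding with probability >= 1 - delta.

    Assumes i.i.d. losses bounded in an interval of width loss_range.
    """
    val_losses = np.asarray(val_losses, dtype=float)
    n = len(val_losses)
    # One-sided Hoeffding: P(true_mean - empirical_mean >= t) <= exp(-2 n t^2 / range^2)
    slack = loss_range * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return val_losses.mean() + slack

# Hypothetical validation set: 10,000 zero-one losses with a 1.8% error rate.
val_losses = np.zeros(10_000)
val_losses[:180] = 1.0
bound = hoeffding_upper_bound(val_losses, delta=0.05)
print(f"empirical loss {val_losses.mean():.4f}, upper bound {bound:.4f}")
```

A log-density loss is not bounded in general, so applying this to the proxy would need either clipping or a different concentration inequality.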

## Formalization

Now I'll describe the full argument more formally. Assume there is a true human utility function which maps from any policy to a utility . Assume we have some fixed distribution over utility functions , represented by the parameterized density function . We can query this density function at a specific policy using the notation . We may have learned this distribution using some labelled examples of humans performing tasks, for example. We have some distribution over policies called , which can be a model of a human distribution over policies.

Now we also need to assume we have some measure of how reliable is, in other words, how closely it approximates the "true utility function" we were trying to communicate. The most natural way to represent this information is a generalization bound of the form .^{[4]} This says that on average over normal policies (), our knowledge about the utility function is good (the true utility function has high density). Since the expected log probability is a natural loss function that can be used to learn this distribution, it's natural that we might have a generalization bound on this value.

Let's assume our policy distribution can be represented in the form of a Bayesian update on , i.e. . Now the problem of finding the optimal is equivalent to finding the optimal function .^{[5]} Now we can define , a generalization of the quantilizer value to any distribution represented this way. Now for any function , we have the same bound proven in the original paper:
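One way to write this bound down, as a sketch. All symbols here are notation assumed for illustration: $\gamma$ is the policy prior, $f \ge 0$ the update function, $Z$ the normalizer, $p$ the resulting policy distribution, and $c \ge 0$ a cost function.

```latex
\begin{aligned}
p(\pi) &= \frac{f(\pi)\,\gamma(\pi)}{Z},
\qquad Z = \mathbb{E}_{\pi \sim \gamma}\left[f(\pi)\right],
\qquad q := \frac{Z}{\max_{\pi} f(\pi)} \\[4pt]
\frac{p(\pi)}{\gamma(\pi)} &= \frac{f(\pi)}{Z} \le \frac{1}{q}
\quad\Longrightarrow\quad
\mathbb{E}_{\pi \sim p}\left[c(\pi)\right] \le \frac{1}{q}\,\mathbb{E}_{\pi \sim \gamma}\left[c(\pi)\right]
\end{aligned}
```

Under this reading, an ordinary quantilizer is the special case where $f$ is the indicator of the top-$q$ set, so that $Z = q$ and $\max_\pi f(\pi) = 1$.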

Now it'd be useful to derive an upper bound on , but unfortunately negative log densities aren't always positive. This is easily fixed by adding the lowest possible negative log density: .

By setting , we get . In words, this is a bound on how much mass must be assigned to the true value function by , on average across the **policy distribution**.

This is usually a loose upper bound, since it holds in the worst case, when the proxy is as incorrect as possible on the policy distribution. However, in this case, this is exactly what we want. As optimization power is increased, positive errors in the proxy are selected for, so we can expect that this bound will become tighter as $q$ is pushed down close to zero.

Let's define the set of utility functions that satisfy the above bound, where is the set of possible policies. .

Now the optimal policy distribution, given the information we have, is:

This distribution will balance the task of maximizing value with the task of maintaining similarity to the prior (which is measured by $q$, and allows a higher lower bound on the value). As our information about the proxy improves, the level of optimization increases. In my toy example, the policy looks like:

The code used to generate these toy examples is here. It uses an artificially constructed distribution , I chose a beta distribution over the value of each policy. I used out-of-the-box constrained convex optimization to do the minimizing part, and then a hacky optimizer on top of that to find the optimal policy distribution (the max part).

Note that Jessica Taylor did a very similar proof eight years ago,^{[7]} but for a simpler bound on the quality of the utility function. This shows already that the quantilization parameter can be derived if you have a bound on the average error of the proxy.

If the information is in the form of a bound on the average distance to the proxy, as Jessica's is, then the optimal distribution is always a quantilizer. With the generalization bound in the form I used, the optimal distribution isn't always a quantilizer, as can be seen in some of the diagrams above. Instead, it cuts off the upper end of the distribution if the utility estimate is too noisy there.
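For the simpler average-error case, choosing $q$ can be sketched in a few lines. This is a simplified illustration, not the code used for the toy examples. It assumes the proxy overestimates the true utility by at most `eps` on average under the prior. Combined with the quantilizer cost bound, the true expected utility of a $q$-quantilizer is then at least its expected proxy value minus `eps / q`. All names are hypothetical.

```python
import numpy as np

def best_quantile(prior, proxy_mean, eps, q_grid):
    """Pick the q that maximizes the guaranteed lower bound E_Q[proxy] - eps / q."""
    order = np.argsort(-proxy_mean)              # best policies first
    sorted_prior = prior[order]
    sorted_proxy = proxy_mean[order]
    mass_before = np.cumsum(sorted_prior) - sorted_prior
    best_q, best_bound = None, -np.inf
    for q in q_grid:
        # Prior mass the q-quantilizer takes from each policy (partial at the boundary).
        taken = np.clip(q - mass_before, 0.0, sorted_prior)
        expected_proxy = (taken @ sorted_proxy) / q
        bound = expected_proxy - eps / q
        if bound > best_bound:
            best_q, best_bound = q, bound
    return best_q, best_bound

rng = np.random.default_rng(0)
n = 1000
prior = np.full(n, 1.0 / n)                      # assumed: uniform prior
proxy_mean = rng.uniform(size=n)                 # assumed: made-up proxy values in [0, 1]
q_grid = np.linspace(0.01, 1.0, 100)

q_tight, bound_tight = best_quantile(prior, proxy_mean, eps=0.01, q_grid=q_grid)
q_loose, bound_loose = best_quantile(prior, proxy_mean, eps=0.1, q_grid=q_grid)
print("good proxy:", q_tight, bound_tight)
print("poor proxy:", q_loose, bound_loose)
```

A more reliable proxy (smaller `eps`) licenses a smaller $q$, i.e. more optimization, and gives a higher guaranteed value.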

### Extensions I want

- A more computationally efficient solution to the max min problem.
- Make the situation more realistic by including latent representations of world-state-trajectories in the formalization, in between policies and utilities.
- Adding the modal log density will loosen the bound a lot when the proxy is confident. Is there a way around this?

## Fine-tuned generative models

Fine-tuned and conditioned generative models act a lot like soft optimizers. This has been pointed out a number of times (e.g. tailcalled, Linda Linsefors, Sam Marks & Charlie Steiner have mentioned similar things).

Quantilizers (and the optimal policy distribution above) can be described as a Bayesian update on the policy prior, so we can borrow techniques for Bayesian inference to help compute them. In fact, we have an outer objective whose optimum is exactly the soft optimizer we want, if we use it correctly: the variational inference objective.

This works by finding the distribution that maximizes , which you can easily show is the same distribution as . To create a quantilizer, we would have to set to be an indicator function that is 1 when is above a certain value, and 0 otherwise. To avoid introducing infinite values to our optimization problem, we would probably want to use a smooth approximation to the indicator function (this doesn't damage any theoretical guarantees).
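A small numerical check of this claim on a finite policy set. It assumes the objective has the standard variational form, expected log-likelihood minus KL divergence to the prior, and uses a sigmoid as the smooth stand-in for the indicator. The names, the sharpness constant and the random data are all hypothetical.

```python
import numpy as np

def variational_objective(p, prior, log_f):
    """E_p[log f] - KL(p || prior) for distributions over a finite policy set."""
    support = p > 0
    kl = np.sum(p[support] * (np.log(p[support]) - np.log(prior[support])))
    return np.sum(p[support] * log_f[support]) - kl

rng = np.random.default_rng(0)
n = 50
prior = rng.dirichlet(np.ones(n))                # assumed: arbitrary policy prior
proxy = rng.uniform(size=n)                      # assumed: made-up proxy values
threshold = np.quantile(proxy, 0.8)

# Smooth approximation to log of the indicator [proxy > threshold]: a log-sigmoid.
log_f = -np.logaddexp(0.0, -50.0 * (proxy - threshold))

# The Bayesian update of the prior by f.
posterior = prior * np.exp(log_f)
Z = posterior.sum()
posterior = posterior / Z

best = variational_objective(posterior, prior, log_f)
others = [variational_objective(rng.dirichlet(np.ones(n)), prior, log_f)
          for _ in range(1000)]
print(best, np.log(Z), max(others))
```

The objective at the Bayesian update equals $\log Z$, and no randomly drawn alternative distribution scores higher.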

All of this assumes that we already have access to a learned distribution over utility functions, , which we can use to calculate . If we want to use this to create softly optimizing agents out of language models, we would need a fast way to calculate (or approximate) locally, rather than globally as I did in the toy model. Let me know if you have a suggestion for how to do this.

## Competitive planning

Sampling an entire sequence of actions at once is obviously impractical, both because you need to react to the environment, and because that's just not an easy way to generate a good plan. A real agent would softly optimize across plans that it is considering in real time, which would be of finite length, and represented abstractly rather than with literal actions.

A key technique for making planning easier is working out a plan bit by bit. Let's say we have a policy distribution which a soft optimizer is trying to approximate, where the policy can be split up into a series of actions . Then a plan can be generated by sampling these one by one, conditional on previously sampled actions.

If the planner needed to generate a plan consisting of two actions, it could do this by sampling from , then from , where and are a factorization of , the full policy distribution constructed above. When the two actions influence the cost independently, this will look like quantilizing each action independently, but when there is dependence between the actions it will look like quantilizing over pairs of actions. This avoids the cost independence assumption discussed in Section 3.3 of the Quantilizers paper, allowing us to make the correct competitiveness-safety tradeoff that we already determined based on our proxy quality.
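The two-action case can be sketched as follows. The joint distribution here is a plain quantilizer over action pairs, standing in for the full policy distribution, and the prior and proxy values are made up. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 4, 5                                             # number of options for each action
prior = rng.dirichlet(np.ones(n1 * n2)).reshape(n1, n2)   # assumed prior over action pairs
proxy = rng.normal(size=(n1, n2))                         # assumed proxy value of each pair

# Joint target: quantilize over pairs of actions.
q = 0.3
flat_prior = prior.ravel()
order = np.argsort(-proxy.ravel())                        # best pairs first
mass_before = np.cumsum(flat_prior[order]) - flat_prior[order]
taken = np.clip(q - mass_before, 0.0, flat_prior[order])
joint = np.zeros(n1 * n2)
joint[order] = taken / taken.sum()
joint = joint.reshape(n1, n2)

# Factorize the joint into a marginal over the first action and a conditional.
p_a1 = joint.sum(axis=1)

def p_a2_given_a1(a1):
    """Conditional distribution of the second action given the first."""
    return joint[a1] / p_a1[a1]

def sample_plan(rng):
    """Sample the plan one action at a time; equivalent to sampling the joint."""
    a1 = rng.choice(n1, p=p_a1)
    a2 = rng.choice(n2, p=p_a2_given_a1(a1))
    return a1, a2

print(sample_plan(rng))
```

Because the conditional is derived from the joint, dependence between the two actions is handled automatically. No independence assumption on the cost is needed.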

A sophisticated planner would sample these in whichever order the sampling is easiest. For example, it might start by backtracking from the goal, work out constraints on intermediate states, and recursively quantilize toward the intermediate states. All of this seems natural to implement when you start with a powerful conditional model, and seems possible to do without losing soft optimization guarantees.

Directly building by hand an algorithm that implements the above seems completely plausible, at least for low to medium levels of optimization power. I expect for useful levels of general purpose optimization power, we might need to learn the softly optimizing planner.

## Motivation: Assistant for alignment research

In the end what I want is an agent that can be used to help solve the alignment problem properly, while remaining safely bounded and focused on the relatively simple goals and sub-problems we give it. Soft optimization seems to be a promising and practical tool that will be useful for achieving this goal.

Soft optimization allows us to create an agent with a task-specific goal (e.g. run this experiment or solve a math problem), without specifying literally all of human values. We should be able to start with a number of demonstrations of a task being completed, both well and poorly, and the corresponding human-labelled "utility" of the outcomes, and have this be sufficient to create an agent that usefully solves the task. This is because it automatically avoids highly out of distribution outcomes, the outcomes with uncertain utility. For this reason we won't need to think of every possible terrible outcome and carefully provide data about its utility.

For this to work and be competitive, the distribution over utility functions has to generalize well. Generalizing the human value function is hard, because human values are complex. Generalizing *uncertainty about values* is relatively achievable, because uncertainty can be very simple. For example, the uncertainty modeled in the proxy could be as simple as: be more uncertain the further the world state is from states in the training data. This is an easy function to learn and an easy function to generalize out of the training distribution.
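To make the "simple uncertainty" example concrete, here is a sketch in which the standard deviation of the utility estimate grows with the distance to the nearest training state. It assumes world states are given as feature vectors and that Euclidean distance is a sensible notion of similarity. Both are assumptions, and the function name and constants are hypothetical.

```python
import numpy as np

def utility_uncertainty(state, train_states, base_std=0.05, scale=1.0):
    """Standard deviation of the utility estimate at `state`.

    Grows linearly with the Euclidean distance to the nearest training state.
    """
    dists = np.linalg.norm(train_states - state, axis=1)
    return base_std + scale * dists.min()

rng = np.random.default_rng(0)
train_states = rng.normal(size=(100, 3))       # assumed: 100 training states, 3 features
near_state = train_states[0] + 0.01            # very close to a training state
far_state = np.full(3, 10.0)                   # far from all training states

print(utility_uncertainty(near_state, train_states))
print(utility_uncertainty(far_state, train_states))
```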

## My understanding of the most important problems

- Training data is fundamentally limited to outcomes that humans can evaluate
  - Learning theory is about situations where the test data is drawn from the same distribution as the training environments. The quantilizer result adds some flexibility to this assumption, but we still have to be worried about situations that have literally zero measure in the training distribution. In these situations generalization bounds don't provide much useful information, and we have to rely entirely on prior information about the similarity of the new situation to situations in training.
  - Solutions to this might look like Paul Christiano's work, or insanely good understanding of the prior of our value learning algorithm, or a really great understanding of the internal latent variables that are used as inputs to the utility function.
- Value learning
  - If we only want low levels of optimization power, then value learning by hand labeling various outcomes might be sufficient. If we want full value learning of all human values, then perhaps an approach like Vanessa Kosoy's might work, where "agent" and "goal" are described mathematically, then found inside the latent knowledge of a predictive world model.^{[8]} A soft optimizer could fit well with this approach because it naturally results in a distribution over utility when the utility function of the operator isn't well defined.
  - If we use the method described in this post to bound optimization, it requires *very high levels of certainty about value* in order to make use of a lot of optimization power. This will strongly limit capabilities until really good value learning techniques are developed.
- Deceptive mesa-optimizers
  - An agent trained by finetuning a language model with the variational loss described above could become a deceptively aligned mesa-optimizer, and pretend to be optimizing less strongly than it really is in order to fool the outer objective.

## Appendix

**Why $q$ caps optimization power**

Vanessa Kosoy defines an intelligence measure , which measures the proportion of random policies that do better than the policy we are measuring. It's worth noting that of a quantilizer is equal to , if the random policy distribution is equal to the policy prior . If the random policy distribution is broader than , then will differ from by a fixed constant. This also corresponds to how Eliezer defines optimization power, as the number of times you can divide the search space in two. Vanessa's definition includes an average over different environments in order to measure the generality of the optimizer.

We can intuitively think of a 0.001-quantilizer as being approximately as intelligent as 1 in 1000 humans, if the policy prior is a distribution over human actions (although don't trust this intuition too much: it will always be a very rough approximation, and it is missing the notion of generality).
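Under the "number of halvings" reading of optimization power, the conversion from a quantile to bits is a single line. This sketch assumes that reading and ignores the averaging over environments. The function name is hypothetical.

```python
import math

def optimization_bits(q):
    """Number of times the search space is halved by keeping only the top q fraction."""
    return -math.log2(q)

print(optimization_bits(0.5))     # one halving
print(optimization_bits(0.001))   # roughly ten halvings
```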

**I think TurnTrout and Garrett Baker are wrong about quantilizers**

They argue that quantilizing doesn't work, or at least raises unanswered questions. I think these questions have good answers:

> Maybe we just... Quantilize. But then what's the base distribution, and what's the threshold?

The base distribution is a human imitator, or at least a distribution of policies that we can obtain outcome evaluations for. The threshold is determined by your uncertainty about the utility function.

> How do you set the quantiles such that you're drawing from a distribution which mostly involves lots of actual diamonds?

Because you can have a bound on how accurate to the true utility function your distribution is, you know that your policy distribution will contain the most diamonds your knowledge of diamond measurement will allow.

> Do there even exist such quantiles, under the uniform base distribution on plans?

Yes, if your knowledge about the utility function is good enough, and you have a bound on how good it is. However, under a uniform distribution over plans it'd be difficult to obtain a tight generalization bound on the proxy, and the amount of optimization required to find useful plans would need to be very high, so you'd need a really good bound.

**This proposal is a formal version of one on the Goodhart's Curse page**

> Another proposal would be to rely on risk aversion over *unresolvably* uncertain probabilities broad enough to contain something similar to the true V as a hypothesis, and hence engender sufficient aversion to low-true-V outcomes. Then we should worry on a pragmatic level that a *sufficiently* conservative amount of moral uncertainty--so conservative that U-T risk aversion *never underestimated* the appropriate degree of risk aversion from our V-standpoint--would end up preventing the AI from acting *ever.* Or that this degree of moral risk aversion would be such a pragmatic hindrance that the programmers might end up pragmatically bypassing all this inconvenient aversion in some set of safe-seeming cases. Then Goodhart's Curse would seek out any unforeseen flaws in the coded behavior of 'safe-seeming cases'.

I think this post addresses the concern that this prevents the AI from taking action at all. But my version still has the same weakness: it will limit capabilities in a way that will be tempting to bypass.

**Reflective stability**

Quantilizers aren't always reflectively stable, i.e. if a quantilizer builds a successor agent or self-modifies, it might create a maximizer. Currently my best guess for avoiding this is to use TurnTrout's Attainable Utility to penalize policies that gain much more power than the quantilizer is expected to.

**AI Safety Camp**

I'm planning on running an AI Safety Camp research project on this topic. Applications will open in a few days if you want to help, and it'll be remote work over a few months. Project description is here (although I wrote it before doing most of the work in this post).

**Acknowledgements**

Thanks to Beren Millidge for sending me this paper about viewing RL as inference. Thanks to Matthew Watkins, Jacques Thibodeau, Rubi J. Hudson and Thomas Larsen for proofreading and feedback.

^{^}I use "proxy utility" and "utility distribution" interchangeably. These refer to a distribution over utility functions. This distribution represents the agent's (uncertain and not perfectly reliable) knowledge about the "true" utility function. In the Goodhart's Curse arbital page, the proxy is referred to as U, and the "true" utility function as V. In this post, we will call the proxy , and the "true" utility function .

^{^}In the limit of optimization power, the proxy being optimized becomes almost entirely uncorrelated with true value, so the AGI's actions would look like they are maximizing a random utility function.

^{^}Theoretical generalization bounds on neural networks have been studied intensely in ML theory recently, because they are closely linked to the question of why neural networks generalize well at all, despite seemingly being able to fit randomly labeled data. Under classic PAC learning theory, models that can fit randomly labeled data should overfit to the training data.

Note that these bounds only hold with high probability (due to the actual datapoints being randomly chosen), but the bound can be easily adjusted depending on the probability you want. Also note that we will never optimize against this probability.

^{^}Notation here is confusing, so I'll break down each object: .

is a number in [0,1].

takes in an entire utility function and gives you the density assigns to that function. It might be implemented as a Gaussian process, for example.

takes in a number in [0,1], the value of a utility function at . It returns the density at that location. We can think of this as a vertical slice through the distribution over functions.

Now we can get the conditional density at the particular value , which remember is in .

^{^}The function isn't really meaningful, it's just useful for transforming the prior into a policy . See a graph of one in the next footnote.

^{^}[Figure: graph of an example of this function.]

^{^}I only saw this a few days ago. I was a bit disappointed; I thought I'd come up with something completely new.

^{^}I find this idea similar to the method used in the paper about finding latent directions that correspond to "truth".

To clarify: A "grader" is not just "anything with a utility function", or anything which makes subroutine calls to some evaluative function (e.g. "how fun does this plan seem?"). A grader-optimizer is not an optimizer which has a grader. It is an optimizer which primarily wants to maximize the evaluations of a grader. Compare

maximally fun-seeming plan (which may or may not involve actually having fun)."

(2) is just you optimizing *against* your `is-fun?` grader. On (2), you would gleefully accept a plan which you expect to fool yourself about the future (e.g. you aren't really going to be having fun, but the act of evaluating the plan tricks you into thinking you will). On (1), you would not consider this plan, because it wouldn't lead to having fun.

(This distinction is still somewhat hard for me to quickly explain, so to really internalize it you'll probably have to read and think more about it and not just read a few paragraphs.)

The post did state that it wasn't going to explain the alternatives (I figured the post was long enough already):

You can learn about a concrete alternative via the example value-child cognitive step-through I provided in Don't align agents to evaluations of plans. You can further read about how robust grading/grader-optimization isn't necessary in Alignment allows "nonrobust" decision-influences and doesn't require robust grading, which includes pseudocode for an agent which isn't a grader-optimizer.

Thanks for clarifying, I misunderstood your post and must have forgotten about the scope, sorry about that. I'll remove that paragraph. Thanks for the links, I hadn't read those, and I appreciate the pseudocode.

I think most likely I still don't understand what you mean by grader-optimizer, but it's probably better to discuss on your post after I've spent more time going over your posts and comments.

My current guess in my own words is: A grader-optimizer is something that approximates argmax (has high optimization power)?

And option (1) acts a bit like a soft optimizer, but with more specific structure related to shards, and how it works out whether to continue optimizing?

Thanks for registering a guess! I would put it as: a grader optimizer is something which is trying to optimize the outputs of a grader as its terminal end (either de facto, via argmax, or intent-alignment, as in "I wanna search for plans which make this function output a high number"). Like, the *point* of the optimization is to make the number come out high.

(To help you checksum: It feels important to me that "is good at achieving its goals" is not tightly coupled to "approximating argmax", as I'm talking about those terms. I wish I had fast ways of communicating my intuitions here, but I'm not thinking of something more helpful to say right now; I figured I'd at least comment what I've already written.)