**Clarifying Thoughts on Optimizing and Goodhart Effects - Part 2**

Previous Post: Re-introducing Selection vs Control for Optimization In the post, I reviewed Abram's selection/control distinction, and suggested how it relates to actual design. I then argue that there is a bit of a continuum between the two cases, and that we should add an addition extreme case to the typology, direct solution.

Here, I will revisit the question of what optimization means.

NOTE: This is not completely new content, and is instead split off from the previous version and rewritten to include an *(Added) *discussion of Eliezer's definition for measuring optimization power, from 2008. Hopefully this will make the sequence clearer for future readers.

In the next post, Applying over-Optimization in Selection and Control, I apply these ideas, and concretize the discussion a bit more before moving on to discussing Mesa-Optimizers in Part 4.

## What does Optimization Mean, Again?

This question has been discussed a bit, but I still don't think its clear. So I want to start by revisiting a post Eliezer wrote in 2008, where he suggested that optimization power was ability to select states from a preference ordering over different states, and could be measured with entropy. He notes that this is not computable, but gives us insight. I agree, except that I think that the notion of the state space is difficult, for some of the reasons Scott discussed when he mentioned that he was confused about the relationship between gradient descent and Goodhart's law. In doing so, Scott proposed a naive model that looks very similar to Eliezer's;

simple proxy of "sample points until I get one with a large U value" or "sample n points, and [select] the one with the largest U value" when I think about what it means to optimize something for U. I might even say something like " bits of optimization" to refer to sampling points. I think this is not a very good proxy for what most forms of optimization look like."

I want to start by noting that this is absolutely and completely a "selection" type of optimization, in Abram's terms. As Scott noted, however, it's not a good model for what most optimization looks like, and that's part of why I think Eliezer's model is less helpful than I did when I originally read it.

There's a much better model for gradient descent optimization, which is... gradient descent. It is a bit closer to control than direct optimization, since in some sense we're navigating through the space, but for almost all actual applications, it is still selection, not control. To review how it works, points are chosen iteratively, and the gradient is assessed at each point. The gradient is used to *select* a new point at some (perhaps very clever, dynamically chosen next point.) Some stopping criteria is checked, and it iterates at that new point. This is almost always tons more efficient than generating random points and examining them.

*(Addded)* It's far better than a grid search, usually, for most landscapes, but also makes it clear why I think it's hard to discuss optimization power in Eliezer's terms on a practical level, at least when dealing with a continuous system. The problem I'm alluding to is that any list of preferences over states depends on number of states. Gradient descent type optimization is really good at focusing on specific sections of the state space, especially compared to grid search. We might find a state where grid search would require a tremendously high resolution, but we don't ever compute a preference ordering over states. With gradient descent, we instead compute preferences for a local area and (hopefully) zoom-in, potentially ignoring other parts of the space. An optimizer that focuses very narrowly can have high-resolution but miss the non-adjacent region with far better outcomes, or can have fairly low resolution but perform far better - and the second optimizer is clearly more powerful, but I don't know how to capture this.

But to return to the main discussion, the process of gradient descent is also somewhere between selection and control - and that's what I want to explain.

In theory, the evaluation of each point in the test space could involve an actual check of the system. I build each rocket, watch to see whether it fails or succeeds according to my metric. For search, I'd just pick the best performers, and for more clever approaches, I can do something like find a gradient by judging performance of parameters to see if increasing or decreasing those that are amenable to improvement would help. (I can be even more inefficient, but find something more like a gradient, by building many similar rockets, each an epsilon away in several dimensions, and estimating a gradient that way. Shudder.)

In practice, we use a proxy model - and this is one place that allows for the types of overoptimization misalignment we are discussing. (But it's not the only one.) The reason this occurs is laid out clearly in the Categorizing Goodhart paper as one of the two classes of extremal failure - either model insufficiency, or regime change. This also allows for (during simulation undetectable) causal failures, if the proxy model gets a causal effect wrong.Even without using a proxy model, we can be led astray by the results if we are not careful. Rockets might look great, even in practice, and only fail in untested scenarios because we optimized something too hard - extremal model insufficiency. (Lower weight is cheaper, and we didn't notice a specific structural weakness induced by ruthlessly eliminating weight on the structure.) For our purposes, we want to talk about things like "how much optimization pressure is being applied." This is difficult, and I think we're trying to fit incompatible conceptual models together rather than finding a good synthesis, but I have a few ideas on what selection pressure leading to extremal regions means here.

- Extreme proxy values (in comparison to most of the space) seems similar to having lots of selection pressure. If we have a insanely tall and narrow peak, we may be finding something strange rather than simply improving.
- Extreme input values (unboundedly large or small values) may indicate a worrying area vis-a-vis overoptimization failures.
- Lots of search time alone does NOT indicate extremal results - it indicates lots of things about your domain, and perhaps the inefficiency of your search, but not overoptimization. (This is in contrast to the naive grid-search model, where lots of points visited means more optimizing.)

As an aside, Causal Goodhart is different. It doesn't really seem to rely on extremes, but rather on manipulating new variables, ones that could have an impact on our goal. This can happen because we change the value to a point where it changes the system, similar to extremal Goodhart, but does not need to. For instance, we might optimize filling a cup by getting the water level near the top. Extremal regime change failure might be overfilling the cup and having water spill everywhere. Causal failure might be moving the cup to a different point, say right next to a wall, in order to capture more water, but accidentally break the cup against the wall.Notice that this doesn't require much optimization pressure - Causal Goodhart is about moving to a new region of the distribution of outcomes by (metaphorically or literally) breaking something in the causal structure, rather than by over-optimizing and pushing far from the points that have been explored.This completes the discussion so far - and note that none of this is about control systems. That's because in a sense, most current examples don't optimize much, they simply execute an adaptive program.

One critical case of a control system optimizing is a mesa-optimizer, but that will be deferred until after the next post, which introduces some examples and intuitions around how Goodhart-failures occur in selection versus control systems.

Thoughts on early stopping? (Maybe it works because if you keep optimizing long enough, you're liable to find yourself on a tall and narrow peak which generalizes poorly? However it seems like maybe the phenomenon doesn't just apply to gradient descent?)

BTW, I suspect there is only so much you can do with abstractions like these. At the end of the day, any particular concrete technique may not exhibit the flaws you predicted based on the abstract category you placed it in, and it may exhibit flaws which you wouldn't have predicted based on its abstract category. Maybe abstract categories are best seen as brainstorming tools for finding flaws in techniques.

Good point about early stopping.

I agree that these abstractions are very limited and should mainly be used to raise concerns. Due to existential risk, there's an asymmetry between concerns vs positive arguments against concerns: if we want to avoid large negative outcomes, we have to take vague concerns seriously in the absence of good arguments against them; but, asymmetrically, seek strong arguments that systems avoid risk. Recently I worry that this can give people an incorrect picture of which ideas I think are firm enough to take seriously. I'll happily discuss fairly vague ideas such as instrumental convergence when talking about risk. But (at least for me personally) the very same ideas will seem overly vague and suspicious, if used in an argument that things will go well. I think this is basically the right attitude to take, but could be confusing to other people.

Could you expand on why you think that information / entropy doesn't match what you mean by "amount of optimization done"?

E.g. suppose you're training a neural network via gradient descent. If you start with weights drawn from some broad distribution, after training they will end up in some narrower distribution. This seems like a good metric of "amount of optimization done to the neural net."

I think there are two categories of reasons why you might not be satisfied - false positive and false negative. False positives would be "I don't think much optimization has been done, but the distribution got a lot narrower," and false negatives would be "I think more optimization is happening, but the distribution isn't getting any narrower." Did you have a specific instance of one of these cases in mind?

A couple points.

First, the reason why I wasn't happy with entropy as a metric is because it doesn't allow (straightforward) comparison of different types of optimization, as I discussed. Entropy of a probability distribution output isn't comparable to the entropy over states that Eliezer defines, for example.

Second, I'm not sure false positive and false negative are the right conceptual tools here. I can easily show examples of each - gradient descent can fail horribly in many ways, and luck of specific starting parameters on specific distributions can lead to unreasonably rapid convergence, but in both cases, it's a relationship between the algorithm and the space being optimized.

I think Eliezer's definition still basically makes sense as a measure of optimization power, but the model of optimization which inspired it (basically, optimization-as-random-search) doesn't make sense.

Though, I would very likely have a better way of measuring optimization power if I understood what was really going on better.

I think I agree about Eliezer's definition, that it's theoretically correct, but I definitely agree that I need to understand this better.