Goodhart's law comes in a few flavors, as originally pointed out by Scott, and formalized a bit more in our joint paper. When discussing that paper, or afterwards, we struggled with something Abram Demski clarified recently, which is the difference between selection and control. This matters for formalizing what happens, especially when asking about how Goodhart occurs in specific types of optimizers, as Scott asked recently.
Epistemic Status: This is for de-confusing myself, and has been helpful. I'm presenting what I am fairly confident I understand well for the content written so far, but I'm unclear about usefulness for others, or how clear it comes across. I think that there's more to say after this post, and this will have a few more parts if people are interested. (I spent a month getting to this point, and decided to post and get feedback rather than finish a book first.)
In the first half of the post, I'll review Abram's selection/control distinction, and suggest how it relates to actual design. I'll also argue that there is a bit of a continuum between the two cases, and that we should add an additional extreme case to the typology, direct solution. The second section will revisit what optimization means, and try to note a few different things that could happen and go wrong with Goodhart-like overoptimization.
The planned third section (started, but as of now not fully understood even by me) will talk about Goodhart in this context using the new understanding - trying to more fully explain why Goodhart effects in selection and control fundamentally differ.
Thoughts on how selection and control are used in tandem
In this section, I'll discuss the two types of optimizers Abram discussed, selection and control, and introduce a third, simpler optimizer, direct solution. I'm also going to mention where embedded agents are different, because that's closely related to selection versus control, and talk about where mesa-optimizers exist.
Starting with the (heavily overused) example of rockets, I want to revisit Abram's categorization of algorithmic optimization versus control. There are several stages involved in getting rockets to go where we want. The first is to design the rocket, which involves optimization, and which I'll discuss in two stages. The second is to test, which involves optimization and control in tandem. The third is to actually guide the rocket we built in flight, which is purely control.
Initially, designing a rocket is pure optimization. We might start by building simplified mathematical models to figure out the basic design constraints - if a rocket is bringing people to the moon, we may decide the goal is a rocket and a lander, rather than a single composite. We may decide that certain classes of trajectory / flight paths are going to be used. This is all a set of mathematical exercises, and probably involves only multiply differentiable models that can be directly solved to find an optimum. This is in many ways a third category of "optimizing," in Abram's model, because there is not even a need for looking over the search space. I'll call this direct solution, since we just pick the optimum based on the setup.
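To make "direct solution" concrete, here is a minimal sketch under assumptions of my own: the quadratic objective `payoff` and its coefficients `a` and `b` are hypothetical stand-ins for a simplified, differentiable design model, not anything from a real rocket design. The point is only that the optimum is read off from the setup, with no search over candidate points.

```python
def payoff(x: float, a: float, b: float) -> float:
    """Hypothetical smooth design objective: f(x) = a*x - b*x**2."""
    return a * x - b * x * x


def direct_solution(a: float, b: float) -> float:
    """Return the maximizer of f by solving f'(x) = a - 2*b*x = 0."""
    if b <= 0:
        raise ValueError("b must be positive so that f has a maximum")
    return a / (2 * b)


# No candidate points are sampled or compared; the optimum follows from the model.
x_star = direct_solution(a=4.0, b=0.5)
print(x_star, payoff(x_star, 4.0, 0.5))
```

Nothing here evaluates more than one point, which is what separates this case from selection.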
After getting a bit closer to actual design, we need to simulate rocket designs and paths, and optimize the simulated solution. This lets you do clever things like build a rocket with a sufficient but not excessive amount of fuel (hopefully with a margin of error). If we're smart, we optimize with several intended uses and variable factors in mind, to make sure our design is sufficiently robust. (If we're not careful enough to include all relevant factors, we ignore some factor that turns out to matter, like the relationship between temperature of the O-rings and their brittleness, and our design fails in those conditions.) This is all optimizing over a search space. The cost of the search is still comparatively low, though not as low as direct solution, and we may use gradient descent, genetic algorithms, simulated annealing, or other strategies. The commonality between these solutions is that they simulate points in the search space, perhaps along with the gradients at those points.
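A toy sketch of this kind of simulated search, with everything in it assumed for illustration: `simulate` is a made-up stand-in for a real simulator, the headwind scenarios and fuel numbers are invented, and simple random hill climbing stands in for whichever search strategy would actually be used. The design is scored by its worst case across scenarios, so the search settles on enough fuel for the hardest scenario and no more.

```python
import random


def simulate(fuel: float, headwind: float) -> float:
    """Hypothetical simulator: higher is better, failure is heavily penalized."""
    needed = 100.0 + 20.0 * headwind  # made-up fuel requirement
    if fuel < needed:
        return -1000.0  # the mission fails in this scenario
    return -fuel  # otherwise, carrying less fuel is better


def robust_score(fuel: float, headwinds=(0.0, 0.5, 1.0)) -> float:
    """Score a design by its worst case over several simulated conditions."""
    return min(simulate(fuel, h) for h in headwinds)


def hill_climb(score, start: float, step: float = 10.0, iters: int = 200, seed: int = 0):
    """Random local search: keep a candidate only if it scores strictly better."""
    rng = random.Random(seed)
    best, best_score = start, score(start)
    for _ in range(iters):
        candidate = best + rng.uniform(-step, step)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


best_fuel, best_value = hill_climb(robust_score, start=200.0)
print(best_fuel, best_value)
```

If a relevant scenario were left out of `headwinds`, the search would happily return a design that fails in it, which is the O-ring problem in miniature.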
After we settle on a design, we build an actual rocket, and then we test it. This moves back and forth between the very high cost approach of building physical objects and testing them - often to destruction - and simulation. After each test, we probably re-run the simulation to make sure any modifications are still near the optimum we found, or we refine the simulations to re-optimize and pick the next design to build.
Lastly, we build a final design, and launch the rocket. The control system is certainly a mesa-optimizer with regard to the rocket design process. For a rocket, this control is closer to direct solution than to simulation, because the cost of evaluation needs to be low enough for real-time control. The mesa-optimizer would, in this case, use simplified physics to fire the main and guidance rockets to stay near the pre-chosen path. It's probably not allowed to pick a new path - it can't decide that the better solution is to orbit twice instead of once before landing. (Humans may decide this, then hand the mesa-optimizer new parameters.) We tightly constrain the mesa-optimizer, since in a certain sense it's dumber than the design optimizer that chose what to optimize for.
For a more complex system, we may need a complex mesa-optimizer to guide the already designed system. Even for a more complex rocket, we may allow the mesa-optimizer to modify the model used for optimizing, at least in minor ways - it may dynamically evaluate factors like the rocket efficiency, and decide that it's getting 98% of the expected thrust, so it will plan to use that modified parameter in the system model used to mesa-optimize. Giving a mesa-optimizer more control is dangerous, but perhaps necessary to allow it to navigate a complex system.
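As a rough sketch of a tightly constrained controller of this sort: the one-dimensional "physics", the function name `run_controller`, and the specific numbers are all assumptions made for illustration. The controller is handed a pre-chosen path and cannot change it. The only thing it is allowed to revise is one parameter of its own model, the thrust efficiency, which it re-estimates from what it observes.

```python
def run_controller(planned_velocities, true_efficiency: float = 0.98, dt: float = 1.0):
    """Track a fixed plan while re-estimating one model parameter online."""
    efficiency_estimate = 1.0  # the controller's initial model of the engine
    velocity = 0.0
    history = []
    for target in planned_velocities:
        # Plan with the current model: which command should reach the target?
        desired_accel = (target - velocity) / dt
        command = desired_accel / efficiency_estimate
        # The real system responds with its true (unknown) efficiency.
        actual_accel = command * true_efficiency
        velocity += actual_accel * dt
        # Update the model from the observation, e.g. "98% of expected thrust".
        if command != 0:
            efficiency_estimate = actual_accel / command
        history.append((target, velocity, efficiency_estimate))
    return history


history = run_controller([10.0, 20.0, 30.0])
for target, velocity, estimate in history:
    print(target, round(velocity, 6), round(estimate, 6))
```

The first step undershoots because the model is wrong; later steps track the plan once the efficiency estimate is corrected. Letting the controller revise more than this single parameter is where the added danger comes in.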
What does Optimization Mean?
Scott mentioned that he was confused about the relationship between gradient descent and Goodhart's law. He proposed the naive model:
"simple proxy of "sample points until I get one with a large value" or "sample points, and [select] the one with the largest value" when I think about what it means to optimize something for . I might even say something like "n bits of optimization" to refer to sampling points. I think this is not a very good proxy for what most forms of optimization look like."
This is absolutely and completely a "selection" type of optimization, in Abram's terms, but as he noted, it's not a good model for what most optimization looks like.
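For reference, the naive model amounts to something like the following sketch. The function name `best_of_n`, the toy objective, and the uniform sampler are my own choices for illustration; the only thing taken from the quoted description is "sample points and select the one with the largest value."

```python
import random


def best_of_n(value, sample, n: int, seed: int = 0):
    """Pure selection: draw n candidates and keep the highest-valued one."""
    rng = random.Random(seed)
    return max((sample(rng) for _ in range(n)), key=value)


def toy_value(x: float) -> float:
    """Hypothetical objective with its peak at x = 3."""
    return -(x - 3.0) ** 2


def toy_sample(rng: random.Random) -> float:
    """Hypothetical sampler over the search space [0, 10]."""
    return rng.uniform(0.0, 10.0)


few = best_of_n(toy_value, toy_sample, n=10)
many = best_of_n(toy_value, toy_sample, n=1000)
print(few, many)
```

Each candidate is evaluated independently and nothing learned from one sample guides the next, which is what makes this selection in its purest form.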
There's a much better model for gradient descent, which is... gradient descent. This is a bit closer to control, but for almost all actual applications, it is still essentially selection. To review, points are chosen iteratively, and the gradient is assessed at each point. The gradient is used to select a (perhaps very cleverly, dynamically chosen) next point. A stopping criterion is checked, and if it is not met, the process iterates at that new point. This is almost always tons more efficient than generating random points and examining them. It's far better than a grid search, usually, for most landscapes. It's also somewhere between selection and control - and that's what I want to explain.
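The loop just described can be sketched in a few lines. This is a bare-bones, one-dimensional version with a fixed step size, which is an assumption on my part; real optimizers pick the next point far more cleverly. The objective and its gradient are hypothetical.

```python
def gradient_descent(grad, x0: float, lr: float = 0.1, tol: float = 1e-8, max_iters: int = 10_000):
    """Minimize a 1-D function given its gradient; returns (point, iterations used)."""
    x = x0
    for i in range(max_iters):
        g = grad(x)  # assess the gradient at the current point
        if abs(g) < tol:  # stopping criterion
            return x, i
        x = x - lr * g  # use the gradient to choose the next point
    return x, max_iters


def toy_grad(x: float) -> float:
    """Gradient of the hypothetical objective f(x) = (x - 3)**2."""
    return 2.0 * (x - 3.0)


x_min, iterations = gradient_descent(toy_grad, x0=0.0)
print(x_min, iterations)
```

Unlike random sampling, each evaluated point determines the next one, which is the sense in which this sits a little closer to control.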
In theory, the evaluation of each point in the test space could involve an actual check of the system. I build each rocket, watch to see whether it fails or succeeds according to my metric. For search, I'd just pick the best performers, and for more clever approaches, I can do something like find a gradient by judging performance of parameters to see if increasing or decreasing those that are amenable to improvement would help. (I can be even more inefficient, but find something more like a gradient, by building many similar rockets, each an epsilon away in several dimensions, and estimating a gradient that way. Shudder.)
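The "build many similar rockets, each an epsilon away" idea is a finite-difference gradient estimate. A minimal sketch, where `evaluate` is a hypothetical stand-in for an expensive real-world test and the two-parameter objective is invented:

```python
def finite_difference_gradient(evaluate, params, eps: float = 1e-4):
    """Estimate the gradient with one extra evaluation per parameter."""
    base = evaluate(params)  # one "rocket" at the current design
    gradient = []
    for i in range(len(params)):
        nudged = list(params)
        nudged[i] += eps  # one more "rocket", epsilon away in dimension i
        gradient.append((evaluate(nudged) - base) / eps)
    return gradient


def toy_performance(params) -> float:
    """Hypothetical performance metric with its peak at (1, -2)."""
    x, y = params
    return -((x - 1.0) ** 2) - (y + 2.0) ** 2


estimated = finite_difference_gradient(toy_performance, [0.0, 0.0])
print(estimated)
```

The cost is one full evaluation per dimension on top of the baseline, which is why doing this with physical rockets deserves the shudder.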
In practice, we use a proxy model - and this is one place that allows for the types of overoptimization misalignment we are discussing. (But it's not the only one.) The reason this occurs is laid out clearly in the Categorizing Goodhart paper as one of the two classes of extremal failure - either model insufficiency, or regime change. This also allows for causal failures that are undetectable during simulation, if the proxy model gets a causal effect wrong.
Even without using a proxy model, we can be led astray by the results if we are not careful. Rockets might look great, even in practice, and only fail in untested scenarios because we optimized something too hard - extremal model insufficiency. (Lower weight is cheaper, and we didn't notice a specific structural weakness induced by ruthlessly eliminating weight on the structure.)
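A toy version of the weight example, with every number invented for illustration: the metric we optimize and the outcome we care about agree everywhere we tested, and come apart only past a structural threshold that none of the tests reached.

```python
def measured_score(removed_kg: float) -> float:
    """What we optimize: every kilogram removed makes the rocket cheaper."""
    return removed_kg


def true_score(removed_kg: float, failure_threshold: float = 40.0) -> float:
    """What we care about: savings, unless the structure has been weakened."""
    if removed_kg > failure_threshold:
        return -1000.0  # hypothetical structural failure
    return removed_kg


tested_designs = [0, 10, 20, 30]  # the range we actually checked
candidate_designs = range(0, 101, 10)  # the range the optimizer may explore

# On the tested range the two scores are indistinguishable.
agree_when_tested = all(measured_score(d) == true_score(d) for d in tested_designs)

# Optimizing hard on the measured score pushes far past anything tested.
chosen = max(candidate_designs, key=measured_score)
print(agree_when_tested, chosen, true_score(chosen))
```

The selected design lies well outside the tested range, and that is exactly where the two scores diverge.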
For our purposes, we want to talk about things like "how much optimization pressure is being applied." This is difficult, and I think we're trying to fit incompatible conceptual models together rather than finding a good synthesis, but I have a few ideas on what selection pressure leading to extremal regions means here.
- Extreme proxy values (in comparison to most of the space) seem similar to having lots of selection pressure. If we have an insanely tall and narrow peak, we may be finding something strange rather than simply improving.
- Extreme input values (unboundedly large or small values) may indicate a worrying area vis-a-vis overoptimization failures.
- Lots of search time alone does NOT indicate extremal results - it indicates lots of things about your domain, but not overoptimization. This is in stark contrast to the naive model.
As an aside, Causal Goodhart is different. It doesn't really seem to rely on extremes, but rather on manipulating new variables, ones that could have an impact on our goal. This can happen because we change the value to a point where it changes the system, similar to extremal Goodhart, but it does not need to. For instance, we might optimize filling a cup by getting the water level near the top. Extremal regime change failure might be overfilling the cup and having water spill everywhere. Causal failure might be moving the cup to a different point, say right next to a wall, in order to capture more water, but accidentally breaking the cup against the wall.
Notice that this doesn't require much optimization pressure - Causal Goodhart is about moving to a new region of the distribution of outcomes by (metaphorically or literally) breaking something in the causal structure, rather than by over-optimizing and pushing far from the points that have been explored.
This completes the discussion so far - and note that none of this is about control systems. That's because in a sense, most current examples don't optimize much, they simply execute an adaptive program. (One critical case of a control system optimizing is a mesa-optimizer, but that also needs to be addressed in a later post.)