Some time ago, I mentioned in a comment some issues I had with the Goodhart Taxonomy:
1) “adversarial” seems too broad to be that useful as a category
2) It doesn’t clarify what phenomenon is meant by “Goodhart”; in particular, “regressional” doesn’t feel like something the original law was talking about, and any natural definition of “Goodhart” that includes it seems really broad
3) Whereas “regressional” and “extremal” (and perhaps “causal”) are defined statistically, “adversarial” is defined in terms of agents, and this may have downsides (I’m less sure about this objection)
After thinking about things more, I'd like to expand on these points:
I am not 100% sure I correctly understand what is meant by "Regressional Goodhart". But looking at the example from the post:
Example: height is correlated with basketball ability, and does actually directly help, but the best player is only 6'3", and a random 7' person in their 20s would probably not be as good
We see that, if optimizing for basketball ability (V) given only height (U), we won't get the best basketball player, simply because U and V are imperfectly correlated.
If "Regressional Goodhart" is simply the observation that, if we can't perfectly measure something, we won't perfectly optimize it, then not only does this seem obvious to anyone (even if they haven't heard of Goodhart's Law), it also doesn't seem to match the definition of "Goodhart", which is usually formulated in one of two ways:
G1) When a measure becomes a target, it ceases to be a good measure
G2) Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes
Suppose U and V are imperfectly correlated, but their relationship is monotonic, and the other 3 categories of Goodhart don't apply (e.g. the "straightforward" basketball example, where a 7' player is better on average than one who's merely 6'3").
Then if you optimize for V through U, the latter won't "cease to be a good measure" (as in G1): it was a decent but imperfect measure before, and it didn't become any worse as a measure at the upper end. Similarly, the statistical regularity between U and V will continue to hold in this region (unlike G2).
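To make this concrete, here is a worked special case. The assumption (mine, not the OP's) is that U and V are standardized and jointly normal with correlation \rho, where 0 < \rho < 1. This is only a sketch of the "straightforward" case, not a claim about real basketball data.

```latex
% Assumption: (U, V) is standard bivariate normal with correlation \rho.
\begin{align}
  \mathbb{E}[V \mid U = u] &= \rho\, u, \\
  \operatorname{Var}(V \mid U = u) &= 1 - \rho^{2}.
\end{align}
% The slope \rho and the conditional variance 1 - \rho^2 do not depend on u,
% so U predicts V exactly as well at large u as it does near the mean.
```

Under this assumption, selecting a larger u always raises the expected V, and the error around that expectation is the same everywhere, which is the sense in which U does not "cease to be a good measure" at the upper end.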
(Technically, the correlation coefficient between two variables usually does decrease under range restriction, but I don't think this is the phenomenon that the OP had in mind).
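For what it's worth, the range restriction effect is easy to check numerically. The sketch below assumes a toy model of my own choosing (standard normal U, and V equal to rho * U plus independent Gaussian noise). The function names, the value rho = 0.7, and the "top 10% by U" cutoff are all illustrative assumptions.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_under_range_restriction(n=20_000, rho=0.7, top_fraction=0.1, seed=0):
    """Return (correlation in the full sample, correlation among the top-U slice).

    Toy model (an assumption): U ~ N(0, 1), V = rho * U + sqrt(1 - rho^2) * noise.
    """
    rng = random.Random(seed)
    noise_scale = math.sqrt(1 - rho ** 2)
    pairs = []
    for _ in range(n):
        u = rng.gauss(0, 1)
        v = rho * u + noise_scale * rng.gauss(0, 1)
        pairs.append((u, v))

    # Keep only the samples with the highest U (the "range restriction").
    pairs.sort(key=lambda p: p[0], reverse=True)
    top = pairs[: int(n * top_fraction)]

    full_r = pearson([u for u, _ in pairs], [v for _, v in pairs])
    top_r = pearson([u for u, _ in top], [v for _, v in top])
    return full_r, top_r


if __name__ == "__main__":
    full_r, top_r = correlation_under_range_restriction()
    print(f"full sample r = {full_r:.3f}, top-10%-by-U r = {top_r:.3f}")
```

In this toy model the relationship between U and V is identical everywhere by construction, so the lower correlation in the restricted slice comes purely from U having less spread there, not from U becoming a worse measure.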
So at this point, we can simply say this sense of "Regressional Goodhart" isn't Goodhart, and reduce the taxonomy to the other 3 categories. Or, if we think it is pointing at a phenomenon similar to the other 3 categories, we can redefine "Goodhart" slightly more broadly to include it. I have two candidates:
G3) Given a goal V and an imperfect but correlated proxy U, attempts to optimize V given only access to U will often fail
G4) Given a goal V and an imperfect but correlated proxy U, attempts to optimize V given only access to U will often fail, over and above the obvious "irreducible error" arising from the imperfection in the correlation
Personally, I favor (G4), since the term "Goodhart" for me carries some connotation of "unexpected things going wrong", in which case "Regressional Goodhart" seems to just point at the irreducible optimization error that's beside the point of the Law.
On the other hand, the Tails Come Apart phenomenon is arguably unintuitive. So if "Regressional Goodhart" stipulates that "highly" correlated variables will have more irreducible optimization error than one naively expects, then that could be a sensible definition. In any case, I feel this should be more explicit.
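One way to see how unintuitive this can be is to simulate it. The sketch below uses a toy model that is my own assumption (standard normal U, and V equal to rho * U plus independent Gaussian noise) and estimates how often the sample with the highest U is also the one with the highest V. The function name, sample size, and rho = 0.9 are illustrative choices.

```python
import math
import random


def top_match_rate(n=500, rho=0.9, trials=200, seed=0):
    """Estimate P(the argmax of U is also the argmax of V) over repeated samples.

    Toy model (an assumption): U ~ N(0, 1), V = rho * U + sqrt(1 - rho^2) * noise.
    """
    rng = random.Random(seed)
    noise_scale = math.sqrt(1 - rho ** 2)
    matches = 0
    for _ in range(trials):
        us = [rng.gauss(0, 1) for _ in range(n)]
        vs = [rho * u + noise_scale * rng.gauss(0, 1) for u in us]
        best_by_u = max(range(n), key=us.__getitem__)
        best_by_v = max(range(n), key=vs.__getitem__)
        if best_by_u == best_by_v:
            matches += 1
    return matches / trials


if __name__ == "__main__":
    rate = top_match_rate()
    print(f"top-U sample is also top-V in {rate:.0%} of trials")
```

If the match rate comes out well below what one would guess for a correlation of 0.9, that supports reading "Regressional Goodhart" as a claim about the size of the irreducible error, not just its existence.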
[epistemic status: this section still feels very half-baked, mostly including it because "why not"]
On reflection, I do think "Adversarial" is a good category of Goodhart to have, but it nevertheless seems extremely broad compared to the others, and worth trying to further subdivide. Here are some tentative stabs at doing so, by trying to answer the question:
Suppose you have goal V, but must delegate actions to an agent A with goal W. What are some general ways you can misspecify A's incentive structure U, and how will these failure modes manifest?
When specifying a reward function, we can err by having its global optimum be misaligned with what we actually want. This has been described and exemplified here.
On the other hand, even if the global optima match up, we can still run into problems if the incentive gradients on our reward function are locally messed up, e.g. taxing one's entire income at 15% if one earns > $K, but at 10% if one earns <= $K. Someone earning slightly more than $K then takes home less than someone earning exactly $K, which sets up bad incentives for those with an income near $K. [ETA] Or this example:
A coworker is teaching an agent to navigate a room. The episode terminates if the agent walks out of bounds. He didn’t add any penalty if the episode terminates this way. The final policy learned to be suicidal, because negative reward was plentiful, positive reward was too hard to achieve, and a quick death ending in 0 reward was preferable to a long life that risked negative reward.
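Going back to the tax threshold example, here is a minimal sketch of the cliff. It assumes the rate applies to one's entire income (my reading of the example), and the function names and the figure K = 50,000 are made up for illustration.

```python
def take_home(income, threshold, low_rate=0.10, high_rate=0.15):
    """After-tax income when the rate applies to the ENTIRE income (assumption)."""
    rate = high_rate if income > threshold else low_rate
    return income * (1 - rate)


def break_even_income(threshold, low_rate=0.10, high_rate=0.15):
    """Gross income above the threshold that yields the same take-home pay
    as earning exactly the threshold."""
    return threshold * (1 - low_rate) / (1 - high_rate)


if __name__ == "__main__":
    K = 50_000  # hypothetical threshold
    for income in (K, K + 1, break_even_income(K)):
        print(f"gross {income:>10.2f} -> take-home {take_home(income, K):>10.2f}")
```

Under these assumptions, anyone earning between K and the break-even income would be better off earning exactly K, so the locally "wrong" gradient covers a whole band of incomes and not just a single point.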
On a first take, it seems like one could define wireheading as "Adversarial Goodhart where the agent has enough influence over the reporting that it's easier for it to deceive the principal than to actually achieve the goal."
Compare with examples like this, from Wikipedia:
Providing company executives with bonuses for reporting higher earnings encouraged executives at Fannie Mae and other large corporations to inflate earnings statements artificially and make decisions targeting short-term gains at the expense of long-term profitability.
How different are these? Should they be put in the same bucket?
It could make for an interesting project if someone simply took a bunch of examples of Goodhart's Law, and put them into the Goodhart Taxonomy. Examples can come from:
This could be instructive for testing the Goodhart Taxonomy (and my proposed modifications to it) or revealing new and better categories.
Glad to see engagement on this - and I should probably respond to some of these points, but before doing so, want to point to where I've already done work on this, since much of that work either admits your points, or addresses them.
First, I think you should read the paper I wrote with Scott that extended the thoughts from his post. It certainly doesn't address all of this, but we were very clear that adversarial Goodhart was less clear than the other modes and needed further work. We also more clearly drew the connection to the "tails come apart" phenomenon, and clarified some of the sub-cases of both extremal and causal Goodhart. Following that, I wrote another post on the topic, trying to expand on the points made in the paper - but specifically excluding multi-agent issues, because they were hard and I wasn't clear enough about how they worked.
I tried to do a bit of that work in a paper, Multiparty Dynamics and Failure Modes for Machine Learning and Artificial Intelligence. This attempts to provide a categorization for multi-agent cases similar to the one made in Scott's post. It made a few key points that I think need further discussion about the relationship to embedded agents, and other issues. I was less successful than I hoped at cutting through the confusion, but a key point it does make is that all multi-agent failures are actually single-agent failure modes, but they are caused by misaligned goals or coordination failures. (And these aren't all principal-agent issues, though I agree that many are. For instance, some cases are tragedy of the commons, and others are more direct corruption of the other agents.) I also summarized the paper a bit and expanded on certain key points in another LessWrong post.
And since I'm giving a reading list, I also think my even more recent, but only partially-completed sequence of posts on optimization and selection versus control (in the single agent cases) might clarify some of the points about Regressional versus Extremal Goodhart further. Post one of that sequence is here.