Right, that's the question. Sure, it is easy to state that "the metric must be a faithful representation of the target", but it never is, is it? From the point of view of double inversion, optimizing the target is a hard inverse problem because, as in your pizza example, the true "values" (pizza as a preference against the background of an otherwise balanced diet) are not easily observable. What would a double inverse be in this case? Maybe something like trying various amounts of pizza and getting feedback on enjoyment? That would match the long division pattern. I'm not sure.
Looking for "functions that don't exhibit Goodhart effects under extreme optimization" might be a promising direction. What does it mean for a function to behave as expected under extreme optimization? Can you give a toy example?
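To make the question concrete, here is a minimal sketch of the failure mode itself (all functions and numbers are made up for illustration): a proxy metric that tracks the true target when the optimizer's search space is small, but diverges from it under extreme optimization. A "Goodhart-robust" function would be one where this divergence does not happen as the search space grows.

```python
import math

# Toy sketch (hypothetical functions): a "true value" that saturates and then
# declines, versus a proxy metric that is monotone in the input. Under mild
# optimization the proxy-optimal point is near the target optimum; under
# extreme optimization the proxy keeps climbing while true value collapses.

def true_value(x):
    # Target: benefit peaks around x = 10, then decays.
    return x * math.exp(-x / 10)

def proxy(x):
    # Metric: strictly increasing, so an optimizer pushes x without bound.
    return x

mild = max(range(0, 5), key=proxy)        # small search space: picks x = 4
extreme = max(range(0, 1000), key=proxy)  # large search space: picks x = 999

# true_value(mild) is close to the target optimum; true_value(extreme) is
# essentially zero, even though proxy(extreme) is maximal.
print(true_value(mild), true_value(extreme))
```

The interesting (and open) question the paragraph above raises is what structural property of the pair (target, proxy) rules this out, e.g. whether the argmax of the proxy stays near the argmax of the target no matter how large the search space gets.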
I agree that 4 needs to be taken seriously, as 1 and 2 are hard to succeed at without making a lot of progress on 4, and 3 is just a catch-all for every other approach. It is also the hardest, as it probably requires breaking a lot of new ground, so people tend to work on what appears solvable. I thought some people were working on it though, no? There is also a chance of proving that "an actual grounded definition of human preferences" is impossible in a self-consistent way, and we would have to figure out what to do in that case. The latter feels like a real possibility to me.
I still don't understand the whole deal about counterfactuals, exemplified by "If Oswald had not shot Kennedy, then someone else would have". Maybe MIRI means something else by counterfactuals?
If it's counterfactual conditionals, then the approach is pretty simple, as discussed with jessicata elsewhere. There is the macrostate of the world (i.e. a state known to a specific observer, which is compatible with many possible substates, or microstates). One of these microstates led to the observed macroscopic event; some other possible microstates would have led to the same or different macrostates, e.g. Oswald shoots Kennedy, Oswald's gun jams, someone else shoots Kennedy, and so on. The problem is constructing a set of microstates and their probability distribution that together lead to the pre-shooting macrostate. Once you know those, you can predict the odds of each post-shooting macrostate. When you think about the problem this way, there are no counterfactuals, only state evolution, and it can be applied to the past, the present, or the future.
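The procedure above can be sketched in a few lines (all microstate names and probabilities are invented for illustration): enumerate the microstates compatible with the pre-shooting macrostate, evolve each one forward, and aggregate into a distribution over post-event macrostates. No counterfactual machinery, just state evolution.

```python
# Toy sketch (hypothetical microstates and guessed priors): a macrostate is a
# probability distribution over microstates, and "counterfactual" questions
# become questions about evolving each microstate forward.

# Microstates compatible with the observer's pre-shooting macrostate.
microstates = {
    "oswald_fires_gun_works": 0.90,
    "oswald_fires_gun_jams": 0.05,
    "second_shooter_fires": 0.05,
}

def evolve(microstate):
    # Deterministic evolution of each microstate into a post-event macrostate.
    if microstate in ("oswald_fires_gun_works", "second_shooter_fires"):
        return "kennedy_shot"
    return "kennedy_unharmed"

# Aggregate microstate probabilities by the macrostate they evolve into.
outcome = {}
for m, p in microstates.items():
    macro = evolve(m)
    outcome[macro] = outcome.get(macro, 0.0) + p

print(outcome)  # probabilities ≈ 0.95 shot, 0.05 unharmed
```

"If Oswald had not shot Kennedy" then just means conditioning on the microstates where `evolve` doesn't pass through Oswald's shot, and reading off the resulting macrostate odds.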
I posted about it before, but just to reiterate my question: if you can "simply" count possible (micro-)states and their probabilities, then what is there beyond this simple counting?
Just to give an example, say, Newcomb's problem: the pre-decision microstates of the agent's brain, while known to the Predictor, are not known to the agent. Some of these microstates lead to the macrostate corresponding to two-boxing, and some lead to the macrostate corresponding to one-boxing. Knowing what these microstates might be, and assigning our best-guess probabilities to them, lets us predict what action the agent would take, if not as perfectly as the Predictor does, then at least as well as we ever can. What do UDT or FDT say beyond that, or contrary to that?
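The microstate-counting view of Newcomb's problem can be sketched like this (the 70/30 split and the standard payoffs are illustrative guesses, not derived numbers): our distribution over the agent's brain microstates, grouped by the decision macrostate each one evolves into, yields both a prediction of the action and an expected payoff.

```python
# Toy sketch (hypothetical numbers): predicting a Newcomb agent's choice by
# counting brain microstates that lead to each decision macrostate.

# Our best-guess fractions of plausible pre-decision microstates that evolve
# into each decision macrostate (the Predictor knows the exact microstate).
p_one_box = 0.7
p_two_box = 0.3

def payoff(decision, predicted):
    # Standard Newcomb payoffs: the Predictor fills the opaque box with $1M
    # iff it predicts one-boxing; the transparent box always holds $1000.
    opaque = 1_000_000 if predicted == "one_box" else 0
    return opaque + (1_000 if decision == "two_box" else 0)

# With a perfect Predictor, prediction always matches the actual decision,
# so the expected payoff is a simple mixture over our microstate distribution.
expected = (p_one_box * payoff("one_box", "one_box")
            + p_two_box * payoff("two_box", "two_box"))
print(expected)  # ≈ 700300
```

On this view the "decision-theoretic" content reduces to which microstate distribution we (or the agent) should use; that is the part where UDT or FDT would have to add something beyond the counting.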