Feel free to let me know either way, even if you find that the posts seem totally wrong or missing the point.
My answer is a rather standard compatibilist one: the algorithm in your brain produces the sensation of free will as an artifact of an optimization process.
There is nothing you can do about it (you are executing an algorithm, after all), but your subjective perception of free will may change as you interact with other algorithms, like me or Jessica or whoever. There aren't really any objective intentional "decisions", only our perception of them. Therefore the decision theories are just byproducts of all these algorithms executing. It doesn't matter though, because you have no choice but to feel that decision theories are important.
So, watch the world unfold before your eyes, and enjoy the illusion of making decisions.
I wrote about this over the last few years:
According to this SSC book review, "the secret of our success" is the ability to learn culture + the accumulation of said culture, which seems a bit broader than the ability to learn language + the language itself that you describe.
Right, that's the question. Sure, it is easy to state that "metric must be a faithful representation of the target", but it never is, is it? From the point of view of double inversion, optimizing the target is a hard inverse problem, because, like in your pizza example, the true "values" (pizza is a preference against the background of an otherwise balanced diet) are not easily observable. What would be a double inverse in this case? Maybe something like trying various amounts of pizza and getting feedback on enjoyment? That would match the long division pattern. I'm not sure.
Looking for "functions that don't exhibit Goodhart effects under extreme optimization" might be a promising area to look into. What does it mean for a function to behave as expected under extreme optimization? Can you give a toy example?
I agree that 4 needs to be taken seriously, as 1 and 2 are hard to succeed at without making a lot of progress on 4, and 3 is just a catch-all for every other approach. It is also the hardest, as it probably requires breaking a lot of new ground, so people tend to work on what appears solvable. I thought some people are working on it though, no? There is also a chance of proving that "An actual grounded definition of human preferences" is impossible in a self-consistent way, and we would have to figure out what to do in this case. The latter feels like a real possibility to me.
I still don't understand the whole deal about counterfactuals, exemplified as "If Oswald had not shot Kennedy, then someone else would have". Maybe MIRI means something else by the counterfactuals?
If it's the counterfactual conditionals, then the approach is pretty simple, as discussed with jessicata elsewhere. There is the macrostate of the world (i.e. a state known to a specific observer, which consists of many possible substates, or microstates). One of these microstates led to the observed macroscopic event; some other possible microstates would have led to the same or different possible macrostates, e.g. Oswald shoots Kennedy, Oswald's gun jams, someone else shoots Kennedy, and so on. The problem is constructing a set of microstates and their probability distribution that together lead to the pre-shooting macrostate. Once you know those, you can predict the odds of each post-shooting-time macrostate. When you think about the problem this way, there are no counterfactuals, only state evolution. It can be applied to the past, to the present or to the future.
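To make the counting concrete, here is a toy sketch in Python. Everything in it is an assumption made up for illustration: the microstate names, their probabilities, and the `evolve` mapping from each microstate to a later macrostate. It only shows the bookkeeping of summing microstate probabilities per resulting macrostate, not a claim about how the real distribution would be constructed (which is the hard part).

```python
from collections import defaultdict

# Hypothetical microstates compatible with the pre-shooting macrostate,
# with invented probabilities that sum to 1.
microstates = {
    "oswald_steady_aim": 0.60,
    "oswald_gun_jams": 0.25,
    "oswald_absent_other_shooter": 0.05,
    "oswald_absent_no_shooter": 0.10,
}

# Assumed deterministic state evolution: each microstate leads to
# exactly one post-shooting-time macrostate.
evolve = {
    "oswald_steady_aim": "Oswald shoots Kennedy",
    "oswald_gun_jams": "Kennedy is not shot",
    "oswald_absent_other_shooter": "someone else shoots Kennedy",
    "oswald_absent_no_shooter": "Kennedy is not shot",
}


def macrostate_odds(microstates, evolve):
    """Sum the probabilities of all microstates that evolve into each macrostate."""
    odds = defaultdict(float)
    for state, probability in microstates.items():
        odds[evolve[state]] += probability
    return dict(odds)


odds = macrostate_odds(microstates, evolve)
for macrostate, probability in odds.items():
    print(f"{macrostate}: {probability:.2f}")
```

In this picture "if Oswald had not shot Kennedy" is not a separate kind of object, it is just the subset of microstates that evolve into something other than "Oswald shoots Kennedy".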
I posted about it before, but just to reiterate my question. If you can "simply" count possible (micro-)states and their probabilities, then what is there except this simple counting?
Just to give an example, take, say, Newcomb's problem. The pre-decision microstates of the brain of the "agent", while known to the Predictor, are not known to the agent. Some of these microstates lead to the macrostate corresponding to two-boxing, and some lead to the macrostate corresponding to one-boxing. Knowing what microstates these might be, and assigning our best-guess probabilities to them, lets us predict what action an agent would take, if not as perfectly as the Predictor would, then as well as we ever can. What do UDT or FDT say beyond that, or contrary to that?
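A minimal sketch of that comparison, with invented numbers: three hypothetical brain microstates, each with a best-guess probability and the action it deterministically leads to. The Predictor is assumed to know the actual microstate, while the outside observer only has the distribution.

```python
# Hypothetical pre-decision microstates: (best-guess probability, resulting action).
# The names and numbers are made up for illustration.
brain_microstates = {
    "m1": (0.4, "one-box"),
    "m2": (0.3, "one-box"),
    "m3": (0.3, "two-box"),
}


def action_probabilities(states):
    """Observer's view: total probability of each action over all microstates."""
    totals = {}
    for probability, action in states.values():
        totals[action] = totals.get(action, 0.0) + probability
    return totals


def observer_best_guess(states):
    """Guess the most probable action; the accuracy is that action's probability."""
    totals = action_probabilities(states)
    guess = max(totals, key=totals.get)
    return guess, totals[guess]


def predictor_accuracy(states):
    """The Predictor reads the action off the known microstate, so it is always
    right under the assumption of deterministic evolution."""
    return sum(probability for probability, _ in states.values())


guess, observer_accuracy = observer_best_guess(brain_microstates)
print(guess, observer_accuracy, predictor_accuracy(brain_microstates))
```

The only difference between the two predictors here is how much of the microstate each one knows; nothing beyond counting is involved.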