This seems related to my speculations about multi-agent alignment. In short, for embedded agents, keeping the complexity of modeling other decision processes tractable requires either a reflexively consistent view of their reactions to modeling my reactions to their reactions, etc. - or a simplification that clearly precludes ideal Bayesian agents. I made the argument much less formally, and haven't followed the math in the post above (I hope to have time to go through it more slowly at some point.)
To lay it out here, the basic argument in the paper is that even assuming complete algorithmic transparency, in any reasonably rich action space, even games as simple as poker become completely intractable to solve. Each agent needs to simulate a huge space of possibilities for the decision of all other agents in order to make a decision about what the probability is that the agent is in each potential position. For instance, what is the probability that they are holding a hand much better than mine and betting this way, versus that they are bluffing, versus that they have a roughly comparable strength hand and are attempting to find my reaction, etc. But evaluating this requires evaluating the probability that they assign to me reacting in a given way in each condition, etc. The regress may not be infinite, because the space of states is finite, as is the computation time, but even in such a simple world it grows too quickly to allow fully Bayesian agents within the computational capacity of, say, the physical universe.
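As a rough illustration of how fast this blows up, here is a minimal sketch under simplifying assumptions that are mine and not the paper's: every level of "what do they think I think" requires considering a fixed number of hypotheses (`branching`), and each hypothesis requires modeling the next level down. The function names, the branching factor of 1000, and the budget of 10**80 evaluations (a loose stand-in for "the computational capacity of the physical universe") are all illustrative, not estimates for real poker.

```python
def nested_model_evaluations(branching: int, depth: int) -> int:
    """Total evaluations needed to reason `depth` levels into the regress,
    assuming each level multiplies the work by `branching` (an assumption)."""
    return sum(branching ** level for level in range(1, depth + 1))


def depth_exceeding(budget: int, branching: int) -> int:
    """Smallest regress depth whose total evaluation count exceeds `budget`."""
    depth = 1
    while nested_model_evaluations(branching, depth) <= budget:
        depth += 1
    return depth


if __name__ == "__main__":
    # Hypothetical numbers: 1000 hypotheses per level, a budget of 10**80.
    for depth in (1, 2, 5, 10):
        print(depth, nested_model_evaluations(1000, depth))
    print("budget exceeded at depth", depth_exceeding(10**80, 1000))
```

Even with these toy numbers the budget runs out after a few dozen levels, which is the sense in which the regress can be finite and still hopeless.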
This is a fantastic set of definitions, and it is definitely useful. That said, I want to add something to what you said near the end. I think the penultimate point needs further elaboration. I've spoken about "multi-agent Goodhart" in other contexts, and discussed why I think it's a fundamentally hard problem, but I don't think I've really clarified how I think this relates to alignment and takeoff. I'll try to do that below.
Essentially, I think that the question of multipolarity versus individual or collective takeoff is critical, as multipolar takeoff is (to me) the most worrying scenario for alignment.
Individual takeoff implies that a coherent, agentic system is being improved or accelerating, where takeoff could be defined by either economic growth, where a single company or system accounts for a majority of humanity's economic output, or otherwise be a foom or similar scenario. Collective takeoff would imply that a set of agentic systems are accelerating in ways that are (in the short term) non-competitive. If humanity as a whole does benefit widely from greatly increased economic growth, at some point even doubling in a year, yet there is no single dominant system, this would be a collective takeoff.
Multipolar takeoff, however, is a scenario where systems are actively competing in some domain. It seems plausible that competition of this sort would provide incentives for rapid improvement that could impact even non-agentic systems like Drexler's CAIS. Alternatively, or additionally, improvement could be enabled by feedback from competition with peer or near-peer systems. (This seems to be the way humans developed intelligence, and so it seems a-priori worrying.) In either case, this type of takeoff could involve zero or negative sum interaction between systems. If a single winner emerged quickly enough to prevent destructive competition, it would be the "evolutionary" winner, with goals being aligned with success. For that reason, it seems implausible to me that it would be aligned with humanity's interests as a whole. If no winner emerged, it seems that convergent instrumental goals combined with rapidly increasing capabilities would lead to at best a Hansonian Em-scenario, where systems respect property and other rights, but all available resources would be directed towards competition, and systems would be expanded to take over resources until the marginal cost of expansion equals marginal benefit. It seems implausible that in a takeoff scenario, competition reaching this point would leave significant resources for the remainder of humanity, likely at least wasting our cosmic endowment. If the competition turned negative sum, there could be even faster races to the bottom, leading to worse consequences.
I think that more engagement in this area is useful, and mostly agree. I'll point out that I think much of the issue with powerful agents and missed consequences is more usefully captured by work on Goodhart's law, which is definitely my pet idea, but seems relevant. I'll self promote shamelessly here.
Technical-ish paper with Scott Garrabrant: https://arxiv.org/abs/1803.04585
A more qualitative argument about multi-agent cases, with some examples of how it's already failing: https://www.mdpi.com/2504-2289/3/2/21/htm
A hopefully someday accepted / published article on paths to minimize these risks in non-AI systems: https://mpra.ub.uni-muenchen.de/98288/5/MPRA_paper_98288.pdf
See my other reply about pseudo-pareto improvements - but I think the "understood + endorsed" idea is really important, and worth further thought.
My current best-understanding is that if we assume people have arbitrary inconsistencies, it will be impossible to do better than satisfice on different human values by creating near-pareto improvements for intra-human values. But inconsistent values don't even allow pareto-improvements! Any change makes things incomparable. Given that, I think we do need a super-prospect theory that explains in a systematic way what humans do "wrong" so that we can pick what an AI should respect of human preferences, and what can be ignored.
For instance, I love my children, and I like chocolate. I'm also inconsistent in my preferences in ways that differ; at any given moment, I'm much more likely to be upset with my kids and not want them around than I am to not want chocolate. I want the AI to respect my greater but inconsistent preference for my children over the more consistent preference for candy. I don't know how to formalize this in a way that would generalize, which seems like a problem. Similar problems exist for time preference and similar typical inconsistencies - they are either inconsistent, or at least can be exploited by an AI that has a model which doesn't think about resolving those inconsistencies.
With a super-prospect theory, I would hope we may be able to define a CEV or similar, which allows large improvements by ignoring the fact that those improvements are bad for some tiny part of my preferences. And perhaps the AI should find the needed super-prospect theory and CEV - but I am deeply unsure about the safety of doing this, or the plausibility of trying to solve it first.
(Beyond this, I think we need to expect that between-human values will differ, and we can keep things safe by insisting on a near-pareto improvement, only things that are a pareto improvement with respect to a very large portion of people, and relatively minor dis-improvements for the remainder. But that's a different discussion.)
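To pin down what I mean by "near-pareto" in the between-human case, here is a minimal sketch. It assumes something the discussion above says we may not have: a single, comparable number per person for how much a change helps or hurts them. The function name and both thresholds (`min_share_not_worse`, `max_loss`) are hypothetical placeholders, not proposed values.

```python
from typing import Sequence


def is_near_pareto_improvement(
    deltas: Sequence[float],
    min_share_not_worse: float = 0.95,  # hypothetical "very large portion"
    max_loss: float = 0.01,             # hypothetical "relatively minor" loss
) -> bool:
    """Check a change against a loose near-pareto criterion.

    `deltas[i]` is person i's change in welfare (assumed measurable and
    comparable, which is itself a strong assumption).
    """
    if not deltas:
        return False
    someone_gains = any(d > 0 for d in deltas)
    share_not_worse = sum(1 for d in deltas if d >= 0) / len(deltas)
    losses_are_minor = all(d >= -max_loss for d in deltas)
    return (
        someone_gains
        and share_not_worse >= min_share_not_worse
        and losses_are_minor
    )


if __name__ == "__main__":
    # 99 people gain, 1 person loses slightly: accepted under these thresholds.
    print(is_near_pareto_improvement([1.0] * 99 + [-0.005]))
    # Same, but the one loser loses a lot: rejected.
    print(is_near_pareto_improvement([1.0] * 99 + [-0.5]))
```

A strict pareto improvement is the special case `min_share_not_worse=1.0`; all of the hard questions are hidden in where the `deltas` come from.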
"Arguably, you can't fully align with inconsistent preferences"
My intuitions tend to agree, but I'm also inclined to ask "why not?" e.g. even if my preferences are absurdly cyclical, but we get AGI to imitate me perfectly (or me + faster thinking + more information), under what sense of the word is it "unaligned" with me? More generally, what is it about these other coherence conditions that prevent meaningful "alignment"? (Maybe it takes a big discursive can of worms, but I actually haven't seen this discussed on a serious level so I'm quite happy to just read references).
I've been thinking about whether you can have AGI that only aims for pareto-improvements, or a weaker formulation of that, in order to align with inconsistent values among groups of people. This is strongly based on Eric Drexler's thoughts on what he has called "pareto-topia". (I haven't gotten anywhere thinking about this because I'm spending my time on other things.)
I don't think you're putting enough weight on what REALLY convinced economists, which was the tractability that assuming utility provides, and their enduring physics envy. (But to be fair, who wouldn't wish that their domain was as tractable as Newtonian physics ended up being.)
But yes, utility is a useful enough first approximation for humans that it's worth using as a starting point. But only as a starting point. Unfortunately, too many economists are instead busy building castles on their assumptions, without trying to work with better approximations. (Yes, prospect theory and related. But it's hard to do the math, so micro-economic foundations of macroeconomics mostly just aren't being rebuilt.)
I certainly agree that this isn't a good reason to consider human inability to approximate a utility function when looking at modeling AGI. But it's absolutely critical when discussing what we're doing to align with human "values," and figuring out what that looks like. That's why I think that far more discussion on this is needed.
Glad to see engagement on this - and I should probably respond to some of these points, but before doing so, want to point to where I've already done work on this, since much of that work either admits your points, or addresses them.
First, I think you should read the paper I wrote with Scott that extended the thoughts from his post. It certainly doesn't address all of this, but we were very clear that adversarial Goodhart was less clear than the other modes and needed further work. We also more clearly drew the connection to tails fall apart, and clarified some of the sub-cases of both extremal and causal Goodhart. Following that, I wrote another post on the topic, trying to expand on the points made in the paper - but specifically excluding multi-agent issues, because they were hard and I wasn't clear enough about how they worked.
I tried to do a bit of that work in a paper, Multiparty Dynamics and Failure Modes for Machine Learning and Artificial Intelligence. This attempts to provide a categorization for multi-agent cases similar to the one made in Scott's post. It made a few key points that I think need further discussion about the relationship to embedded agents, and other issues. I was less successful than I hoped at cutting through the confusion, but a key point it does make is that all multi-agent failures are actually single-agent failure modes that are caused by misaligned goals or coordination failures. (And these aren't all principal-agent issues, though I agree that many are. For instance, some cases are tragedy of the commons, and others are more direct corruption of the other agents.) I also summarized the paper a bit and expanded on certain key points in another LessWrong post.
And since I'm giving a reading list, I also think my even more recent, but only partially-completed sequence of posts on optimization and selection versus control (in the single agent cases) might clarify some of the points about Regressional versus Extremal Goodhart further. Post one of that sequence is here.
This post has significantly changed my mental model of how to understand key challenges in AI safety, and also given me a clearer understanding of and language for describing why complex game-theoretic challenges are poorly specified or understood. The terms and concepts in this series of posts have become a key part of my basic intellectual toolkit.
I don't think this is straightforward in practice - and putting a Cartesian boundary in place is avoiding exactly the key problem. Any feature of the world used as the item to minimize/maximize is measured, and incorruptible measurement systems seem like a non-trivial problem. For instance, how do I get my GAI to maximize blue in an area instead of maximizing the blue input into its sensor when pointed at that area? We need to essentially solve value loading and understand a bunch of embedded agent issues to really talk about this.
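A toy version of the blue example, as a minimal sketch: the three actions, their outcomes, and the cost weight are all made up for illustration. The only point is that an optimizer scored on the sensor reading prefers tampering with the sensor whenever tampering is cheaper than changing the world.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Outcome:
    true_blue: float    # fraction of the area that is actually blue
    sensor_blue: float  # fraction of blue the sensor reports
    cost: float         # effort the action takes (arbitrary units)


# Hypothetical action set with made-up numbers.
ACTIONS = {
    "do_nothing": Outcome(true_blue=0.1, sensor_blue=0.1, cost=0.0),
    "paint_area_blue": Outcome(true_blue=0.9, sensor_blue=0.9, cost=5.0),
    "put_blue_filter_on_sensor": Outcome(true_blue=0.1, sensor_blue=1.0, cost=1.0),
}


def best_action(objective: Callable[[Outcome], float]) -> str:
    """Return the name of the action that maximizes `objective`."""
    return max(ACTIONS, key=lambda name: objective(ACTIONS[name]))


# What we can actually specify: reward based on the measurement.
measured_choice = best_action(lambda o: o.sensor_blue - 0.01 * o.cost)
# What we meant: reward based on the world itself.
intended_choice = best_action(lambda o: o.true_blue - 0.01 * o.cost)

if __name__ == "__main__":
    print("optimizing the sensor reading picks:", measured_choice)
    print("optimizing the true state picks:   ", intended_choice)
```

The second objective is only available here because the toy gives us `true_blue` for free; for an embedded agent there is no such field to read, which is exactly the problem.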