(A longer text-based version of this post is also available on MIRI's blog here, and the bibliography for the whole sequence can be found here)


Some last minute emphasis:

We open with how agents have to grow, learn, and be stable, but spend most of the time on a two-agent problem, with an initial agent and a successor agent. Framed as the succession problem, it seems like a bit of a stretch to call it a fundamental part of agency. The first two sections were about how agents have to make decisions and have models, and choosing a successor does not seem as fundamental as those. However, when you think of it as an agent having to stably continue optimizing over time, it seems a lot more fundamental.

So, I want to emphasize that when we say there are multiple forms of the problem, like choosing successors or learning and growing over time, the view in which these are different at all is a dualistic view. To an embedded agent, the future self is not privileged; it is just another part of the environment, so there is no difference between making a successor and preserving your own goals.

It feels very different to humans. This is because it is much easier for us to change ourselves over time than it is to make a clone of ourselves and change the clone, but that difference is not fundamental.

I want to expand a bit on adversarial Goodhart, which this post describes as the case where another agent actively attempts to make the metric fail. The paper I wrote with Scott split it into several sub-categories, but I now think of it in somewhat simpler terms. Nothing special happens in the multi-agent setting in terms of metrics or models; it's the same three failure modes we see in the single-agent case.

What changes more fundamentally is that there are now coordination problems, resource contention, and game-theoretic dynamics that make the problem potentially much worse in practice. I'm beginning to think of these multi-agent issues as more closely related to the other parts of embedded agency (needing small models of complex systems, reflexive consistency, and needing self-models), as well as to issues less intrinsically about embedded agency: coordination problems and game-theoretic competition.