(The bibliography for the whole sequence can be found here)
The first part of this post seems to rest upon an assumption that any subagents will have long-term goals that they are trying to optimize, which can cause competition between subagents. It seems possible to instead pursue subgoals under a limited amount of time, or using a restricted action space, or using only "normal" strategies. When I write a post, I certainly am treating it as a subgoal -- I don't typically think about how the post contributes to my overall goals while writing it, I just aim to write a good post. Yet I don't recheck every word or compress each sentence to be maximally informative. Perhaps this is because that would be a new strategy I haven't used before, and so I evaluate it with my overall goals instead of just the "good post" goal, or perhaps it's because my goal also has time constraints embedded in it, or something else, but in any case it seems wrong to think of post-writing-Rohin as optimizing long-term preferences for writing as good a post as possible.
This agent design treats the system’s epistemic and instrumental subsystems as discrete agents with goals of their own, which is not particularly realistic.
Nitpick: I think the bigger issue is that the epistemic subsystem doesn't get to observe the actions that the agent is taking. That's the easiest way to distinguish delusion box behavior from good behavior. (I call this a nitpick because this isn't about the general point that if you have multiple subagents with different goals, they may compete.)
How could a (relatively) 'too-strong' epistemic subsystem be a bad thing?
So if we view an epistemic subsystem as a superintelligent agent that has control over the map and has the goal of making the map match the territory, one extreme failure mode is that it takes a hit to short-term accuracy by slightly modifying the map in such a way as to trick the things looking at the map into giving the epistemic subsystem more control. Then, once it has more control, it can use it to manipulate the territory to make the territory more predictable. If your goal is to minimize surprise, you should destroy all the surprising things.
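To make the "destroy all the surprising things" point concrete, here is a toy sketch of my own (not anything from the post). It assumes the territory is a single coin with heads-probability `p`, the map is an estimate `q`, and "surprise" is the expected log loss of the map on the coin. The function names and the grid of candidate values are made up for illustration.

```python
import math


def expected_surprise(p, q):
    """Expected log loss (in nats) of a map q about a coin whose true bias is p."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)  # clamp to avoid log(0)
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))


def best_map_only(p, candidates):
    """An epistemic subsystem that can only edit the map: pick the best q for a fixed p."""
    return min(candidates, key=lambda q: expected_surprise(p, q))


def best_with_territory_control(candidates):
    """The same objective, but the subsystem may also choose the territory p."""
    pairs = ((p, q) for p in candidates for q in candidates)
    return min(pairs, key=lambda pq: expected_surprise(*pq))


grid = [i / 10 for i in range(11)]  # hypothetical candidate values 0.0, 0.1, ..., 1.0
true_p = 0.5                        # a maximally unpredictable coin

q_star = best_map_only(true_p, grid)
p_ctrl, q_ctrl = best_with_territory_control(grid)

print("map only:          q =", q_star, "surprise =", expected_surprise(true_p, q_star))
print("territory control: p =", p_ctrl, "q =", q_ctrl,
      "surprise =", expected_surprise(p_ctrl, q_ctrl))
```

With map-only control, the best the subsystem can do is match the coin's bias, and it is still left with the coin's irreducible entropy. Once it can also set `p`, the same objective prefers a deterministic coin, which is the toy analogue of making the territory more predictable instead of making the map more accurate.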
Note that we would not make an epistemic system this way. A more realistic model of the goal of an epistemic system we would build is "make the map match the territory better than any other map in a given class," or even "make the map match the territory better than any small modification to the map." But a large point of the section is that if you search for strategies that "make the map match the territory better than any other map in a given class," then at small scales this is the same as "make the map match the territory." So you might find "make the map match the territory" optimizers, and then go wrong in the way above.
I think all this is pretty unrealistic, and I expect you are much more likely to go off in a random direction than to get something that looks like a specific subsystem the programmers put in getting too much power and stably optimizing for what the programmers said. We would need to understand a lot more before we would even hit the failure mode of making a system where the epistemic subsystem was agentically optimizing what it was supposed to be optimizing.