Disclaimer: This is just my model of how Vanessa thinks and I might be misrepresenting some views. Furthermore, as mentioned below, I'm sure Vanessa is well aware of these issues, and plans on trying to solve those which constitute real obstacles. [Edit: See her comments below]
Vanessa's (and Diffractor's) work generally feels different to other Alignment theory, and I have usually attributed this to its radical focus on foundations (shared by some other researchers) and the complexity of its technical mathematical results (shared by few). But upon momentarily coarsening these fine technical details, and presenting PreDCA more conceptually in a language similar to that of other proposals, it becomes clear that it really is fundamentally different to them. As a consequence of Vanessa's opinions and approach, PreDCA has different objectives and hopes from those of most proposals.
PreDCA is fundamentally concrete. Yes, it still includes some "throw every method you have at the problem" (as in Classification). But the truly principal idea involves betting everything on a specific mathematical formalization of some instructions. This concreteness is justified for Vanessa because nothing short of a watertight solution to agent foundations will get us through Alignment. Some AGI failure modes are obviously problems humanity would have to deal with in the long run anyway (with or without AGI), but now on a time trial. And Vanessa is pessimistic about the viability of some intuitively appealing Alignment approaches that try to delegate these decisions to future humanity. In short, she believes some lock-in is inevitable, and so tries to find the best lock-in possible by fundamentally understanding agency and preferences.
Now, if PreDCA works, we won't get a naively narrow lock-in: the user(s)'s utility function will care about leaving these decisions to future humanity. But it will, in other subtler but very real ways, leave an indelible fingerprint on the future. I'm certain Vanessa is aware of the implications, and I suppose she's just willing to take yet another chance (on this lock-in being broadly beneficial), since otherwise we have no chance at all. But still, PreDCA seems to me to downplay the extent to which we wouldn't be pointing at a well-established and well-understood minimal base of values that will keep us alive, but at extremely messy patterns and behaviors (which presuppose a vast number of uncertain assumptions as true). Of course, the idea is that the correct theoretical enunciation of preferences will do away with the unwanted details. But there might be too many biases, and too systematic, for even a very capable framework to tell apart obvious consensus from civilizational quirks, and the latter could have awful consequences.
Another (maybe unfair) consequence of PreDCA's concreteness is that we can more easily find weak links in the conjunctive plan. The protocol is still a work in progress, and so what I'm about to say might end up getting fixed in some retrospectively obvious way, but I now present some worries about the technical aspects of the proposal (which Vanessa is certainly aware of).
The method to find which computations are running in the real world doesn't actually do so. Consider, for instance, a program trying to prove a formal system consistent (by finding a model of it), and another one trying to prove it inconsistent (by deriving a contradiction from it). The outputs of these two programs are acausally related: if the first halts, the second will not, and conversely. Our AGI will acknowledge this acausal relation (and much more complicated ones), as Vanessa expects, since otherwise it won't know basic math and won't produce a correct world model. But this seems to fundamentally mess with the setup. If only one of these programs is running in the universe, counterfactually changing the output of the other will still change the physical universe. And so, as Vanessa mentions in conversation with Jack, we're not really checking which computations run, but which computations' outputs the universe has information about. This is a problem to fix, since we actually care about the former. But this quirk seems to me too inherently natural in the Infra-Bayesian framework. The whole point of such a framework is finding a canonical grounding for these theoretical concepts, from which our common sense ideas and preferences follow naturally, and so changing this quirk ad hoc, against the framework's simplicity, won't cut it. So we need a theoretical breakthrough on this front, and I find it unlikely. That is, I think any accommodation for this quirk will change the framework substantially (but then again, this is very speculative given the unpredictable nature of theoretical breakthroughs).
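As a toy illustration only (it is not part of PreDCA or of the Infra-Bayesian formalism), the logical tie between the two programs can be sketched in a propositional setting. Here theories are lists of clauses over integer literals, and the names `find_model` and `find_refutation` are hypothetical helpers introduced just for this sketch. Unlike the first-order case described above, both searches always halt here, but their outputs are still rigidly correlated: exactly one of them succeeds, so learning the output of one tells you the output of the other even if the other is never run.

```python
from itertools import product


def find_model(clauses, n_vars):
    """Brute-force search for a satisfying assignment.

    Clauses are iterables of nonzero ints: k means variable k is true,
    -k means variable k is false. Returns a dict or None.
    """
    for bits in product([False, True], repeat=n_vars):
        assignment = {i + 1: b for i, b in enumerate(bits)}
        if all(any(assignment[abs(l)] == (l > 0) for l in c) for c in clauses):
            return assignment
    return None


def find_refutation(clauses):
    """Saturate the clause set under resolution.

    Returns True iff the empty clause (a contradiction) is derived.
    """
    known = {frozenset(c) for c in clauses}
    if frozenset() in known:
        return True
    while True:
        new = set()
        for a in known:
            for b in known:
                for lit in a:
                    if -lit in b:
                        # Resolve a and b on the complementary pair lit / -lit.
                        new.add(frozenset((a - {lit}) | (b - {-lit})))
        if frozenset() in new:
            return True
        if new <= known:
            # Nothing new can be derived, so no contradiction exists.
            return False
        known |= new


# Two toy theories (assumed examples, chosen only for illustration).
consistent = [[1, 2], [-1, 2]]
inconsistent = [[1, 2], [-1, 2], [-2]]

for theory in (consistent, inconsistent):
    has_model = find_model(theory, n_vars=2) is not None
    has_refutation = find_refutation(theory)
    # The two outputs are logically tied: exactly one search succeeds.
    assert has_model != has_refutation
```

The point of the sketch is only that a world model which knows basic logic cannot treat the two outputs as independent, which is what makes "which computations are running" hard to separate from "which outputs the universe carries information about".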
On a similar note, as Vanessa mentions, the current framework only allows our AGI to give positive value to computations (so that it will always prefer that more computations be run), which is completely counter to our intuitions. And again, I feel like fixing this ad hoc won't cut it and we need a breakthrough. The framework has another quirk around computations: the number of instances of a concrete computation run in a universe isn't well defined. This is in fact a defining property of the framework, and is clearly deeply related to the above paragraph (that's in part why I feel the above issue won't get easily solved). Now, Vanessa has valid philosophical arguments in favor of this being the case in the real world. But even provided she's right, to me these examples point towards a more worrying general problem about how we're dealing with computations. We are clearly confused about the role of computations at a fundamental level. And yet PreDCA architecturally implements a specific understanding of them. Even if this understanding is the right one (even if future humanity would arrive at the conclusion that the intuitive "number of computations" doesn't matter ethically), the user(s) will retain many of our ethical intuitions about computations, and these will conflict with the architecture. In the ideal case, the theoretical framework will perfectly deal with this issue and extrapolate as well as future humanity. But I'm not certain it has the right tools to handle this correctly. On the contrary, I feel like this might be another potential source of errors in the utility function inference/extrapolation due to the nature of the framework.
Other, more standard concerns can also be raised. In particular, Vanessa is aware that more ideas are needed to ensure Classification works. For example, my intuition is that implementing a model of human cognitive science won't help much: the model would need to be very precise to defend against the vast space of acausal attackers, since these can get as similar to humans as contingent details of our AGI's hypothesis search permit. Furthermore, we can only implement computational properties of human cognition, and not physical ones (this would require solving ontology identification, and open up a new proxy to Goodhart).
This is actually a particular case of a more general (and my main) worry: we might not have eradicated the need to prune a massive search, and this will remain the main source of danger. Even if all mesa-optimizers are acausal attackers, does that leave us much better off, when our best plan for avoiding those is applying various approximate pruning mechanisms (as already happens in many other Alignment proposals)? PreDCA clearly helps with other failure modes, but maybe this hard kernel remains. In particular, our AGI's hypothesis search needs to be massive for it to converge on the right ones, which is necessary for other aspects of the protocol. And since our AGI's actions are determined by its hypotheses, we might be searching a space as big as that of possible code for our AGI, which is the original problem. Maybe pruning the hypothesis search for acausal attackers is way more tractable for some reason, but I don't see why we should expect that to be the case.
It's not surprising that these MIRI-like concerns have led to what is probably the most serious shot at making something similar to Coherent Extrapolated Volition viable.
Even if we only deal with provably halting programs, this scenario can be replicated.