I'd be curious how much more costly this attack is against LMs Pretrained with Human Preferences (including when that method is applied to only "a small fraction of pretraining tokens", as in PaLM 2).
Relevant related work: NNs are surprisingly modular
I believe Richard linked to Clusterability in Neural Networks, which has superseded Pruned Neural Networks are Surprisingly Modular.
The same authors also recently published Detecting Modularity in Deep Neural Networks.
In practice, arguing that your evaluations show open-sourcing is safe may involve a bunch of paperwork and maybe lawyer fees. If so, this would be a big barrier for small teams, so I expect open-source advocates not to be happy with such a trajectory.