This probably doesn't work, but have you thought about just using weight decay as a (partial) solution to this? In any sort of architecture with residual connections you should expect circuits to manifest as weights with nontrivial magnitude. If some set of weights isn't contributing to the loss, then its gradient is roughly zero and nothing stops weight decay from pushing those weights toward zero. Sort of a "use it or lose it" type thing. This seems a lot simpler, and potentially more robust, than other approaches.
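To make the "use it or lose it" intuition concrete, here is a minimal toy sketch, not a claim about how this plays out in a real network. Everything in it is an assumption made for illustration: the two-weight linear model, the made-up data, the hyperparameters, and the hypothetical function name `train`. One weight (`w_used`) is needed to fit the data; the other (`w_unused`) multiplies an input that is always zero, so the loss gives it no gradient and only weight decay acts on it.

```python
def train(steps=2000, lr=0.1, weight_decay=0.05):
    """Plain gradient descent with an L2-style weight decay term on a toy model.

    Model (illustrative assumption): pred = w_used * x + w_unused * z,
    where z is always 0, so w_unused never affects the loss.
    """
    w_used, w_unused = 0.0, 1.0  # start the unused weight at a nonzero value
    data = [(1.0, 2.0), (2.0, 4.0), (-1.0, -2.0)]  # (x, y) pairs with y = 2x
    z = 0.0  # the "dead" input feeding w_unused

    for _ in range(steps):
        g_used = 0.0
        g_unused = 0.0
        for x, y in data:
            err = w_used * x + w_unused * z - y
            # Gradient of the mean squared error with respect to each weight.
            g_used += 2.0 * err * x / len(data)
            g_unused += 2.0 * err * z / len(data)  # always 0 because z == 0

        # Weight decay adds weight_decay * w to each weight's gradient.
        w_used -= lr * (g_used + weight_decay * w_used)
        w_unused -= lr * (g_unused + weight_decay * w_unused)

    return w_used, w_unused


if __name__ == "__main__":
    w_used, w_unused = train()
    print(f"w_used={w_used:.4f}  w_unused={w_unused:.6f}")
```

In this setup the unused weight shrinks by a factor of `1 - lr * weight_decay` every step, while the used weight settles slightly below its unregularized value of 2 because the loss gradient pushes back against the decay. Whether real circuits are cleanly separable like this is exactly the part that might not work: in a trained network, "unused" weights usually still receive small or noisy gradients.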
I think it would be a distraction to try to figure out if LMs are "phenomenally conscious" for a few different reasons.
I do think consciousness is real and important (I think some form of Russellian monism is probably right). I just don't think it matters for alignment.