I gave a talk at MIT in March earlier this year on barriers to mechanistic interpretability being helpful to AGI/ASI safety, and why by default it will likely be net dangerous. Several people seem to be coming to similar conclusions recently (e.g., this recent post).
I discuss two major points (by no means exhaustive), one technical and one political, that present barriers to MI addressing AGI risk:
This being said, there are more nuances to this opinion, and a lot of it is downstream of lack of coordination and the downsides of publishing in an adversarial environment like we are in right now. I still endorse the work done by e.g. Chris Olah's team as brilliant, but extremely early, scientific work that has a lot of steep epistemological hurdles to overcome, but I unfortunately also believe that on net work such as Olah's is at the moment more useful as a safety-washing tool for AGI labs like Anthropic than actually making a dent on existential risk concerns.
Here are the slides from my talk, and you can find the video here.
Several people seem to be coming to similar conclusions recently (e.g., this recent post).
I'll add that I have as well and wrote a sequence about it :)