Don't you mean "the most *conditionally* forbidden technique?"
Hear me out: I think the most forbidden technique is very useful and should be used, as long as we avoid the "most forbidden aftertreatment":

1. An AI trained on interpretability techniques must not be trained on capabilities after (or while) it is trained on interpretability techniques; otherwise it will relearn the bad behaviour in a sneakier way.
2. An AI trained on interpretability techniques cannot be trusted any more than the old version of itself, which hasn't yet been trained on interpretability techniques. Evaluations must be performed on that old version.
    * An AI company which trains its AI on interpretability techniques must publish the old version (which hasn't been trained on them) with the same availability as the new version.

**The natural selection argument**

The reason the most forbidden technique is forbidden is that it lets bad behaviours evolve against interpretability. Bad behaviours which are caught by interpretability techniques are killed, then capabilities training creates more bad behaviours, then the bad behaviours which get caught are killed again, then capabilities training creates more bad behaviours again. After many generations, natural selection produces the most insidious and terrifying bad behaviours: the ones which successfully hide from all interpretability techniques.

However, if we train on interpretability techniques only a single time, with zero capabilities training afterwards, then we kill the bad behaviours only once, without giving the surviving bad behaviours any time to birth a second generation.

**Human evolution**

Imagine if, during the course of human evolution, a robot observed our brain activity in the amygdala, the part of the brain associated with anger. Whenever the population rises above 1 million, it kills the half of the population with the highest anger ratings, based on brain activity in the amygdala. For a short time, this will cause people to be less angry.