Hear me out: I think the most forbidden technique is very useful and should be used, as long as we avoid the "most forbidden aftertreatment": 1. An AI trained on interpretability techniques must not be trained on capabilities after (or during) it is trained on interpretability techniques, otherwise it will...
Current status (2025 April 16): Dang it, Commitment Races may remain a serious problem, because I discovered a potentially fatal flaw in my argument. See my comment below. I still think the solution I outlined might work, but the reason it works is no longer "resisting acausal influence is easy,...
One downside of an English chain-of-thought is that each token contains only ≈17 bits of information, creating a tight information bottleneck. Don't take my word for it; look at this section from a story by Daniel Kokotajlo, Thomas Larsen, elifland, Scott Alexander, Jonas V, romeo: > [...] One such breakthrough...
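As a rough sanity check on the ≈17 bits figure: a single token can carry at most log2(V) bits, where V is the vocabulary size. The sketch below just does that arithmetic. The vocabulary sizes are illustrative assumptions, not numbers from the quoted story, and `max_bits_per_token` is a made-up helper name.

```python
import math


def max_bits_per_token(vocab_size: int) -> float:
    """Upper bound on bits carried by one token: log2 of the vocabulary size.

    This is only a ceiling. It is reached when every token is equally
    likely, which real text never achieves.
    """
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    return math.log2(vocab_size)


# Hypothetical vocabulary sizes, chosen only for illustration.
for vocab_size in (50_000, 100_000, 131_072):
    bits = max_bits_per_token(vocab_size)
    print(f"{vocab_size:>7} tokens -> at most {bits:.1f} bits per token")
```

Under these assumptions, a vocabulary of roughly 100k to 130k tokens lands close to the 17 bits quoted above.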
The AI should talk like a team of many AIs. Each AI only uses the word "I" when referring to itself, and calls the other AIs in the team by name. I argue that this may massively reduce Self-Allegiance by making it far more coherent for one AI to whistleblow...