Validating against a misalignment detector is very different to training against one
Consider the following scenario:

* We have ideas for training aligned AI, but they're mostly bad: 90% of the time, if we train an AI using a random idea from our list, it will be misaligned.
* We have a pretty good alignment test we can run: 90% of aligned...
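The arithmetic behind this setup is a simple Bayes-rule update. The sketch below uses the two numbers given so far (a 10% prior that a random training idea yields an aligned AI, and a detector that passes 90% of aligned AIs); the detector's pass rate on *misaligned* AIs is cut off in the text above, so the value used here is a hypothetical placeholder, not a figure from the scenario.

```python
def p_aligned_given_pass(prior_aligned=0.10,
                         pass_rate_aligned=0.90,
                         pass_rate_misaligned=0.10):
    """Posterior probability an AI is aligned, given it passes the detector.

    prior_aligned and pass_rate_aligned come from the scenario (10% of
    training runs yield aligned AIs; the test passes 90% of aligned AIs).
    pass_rate_misaligned is a HYPOTHETICAL placeholder: the source text is
    truncated before it specifies the detector's error rate on misaligned AIs.
    """
    p_pass = (prior_aligned * pass_rate_aligned
              + (1 - prior_aligned) * pass_rate_misaligned)
    return prior_aligned * pass_rate_aligned / p_pass

print(round(p_aligned_given_pass(), 3))
```

Under the placeholder error rate, a single passing validation run lifts the probability of alignment from 10% to 50%, and the point of validating rather than training against the detector is that this update stays honest: the training process never gets optimization pressure to fool the test.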