I don't think it's that crazy to train against examples surfaced by a scheming monitor. For example, for diffuse threats (ones that happen a lot and that aren't very harmful individually - e.g. sandbagging), async monitoring + training against the flagged examples seems like a reasonable response. There's not much use in continuing to see the same failures, and there are good reasons to get rid of them - e.g. being able to more confidently use the model for safety research. So I'm not sure that always keeping monitors separate from training is practical, or even correct.
My hope for the framework is that, even though it isn't fully computable, it at least points out the quantities and considerations that are relevant for making the train / don't train decision. (Which by default would likely be made in a less principled way.)