Here’s a rough taxonomy.
One confusing edge case for distinguishing alignment failure from capabilities failure is if the model ‘knows’ some concept, but doesn’t ‘know’ that that was the concept you wanted.
E.g. suppose the model classifies someone getting shot with a gun as not an injury. There are 3 possibilities:

1. The model doesn't understand that getting shot constitutes an injury (it lacks the concept, or can't apply it here).
2. The model understands that getting shot constitutes an injury, but doesn't know that this is the concept of 'injury' you wanted it to use.
3. The model understands both of those things, but classifies it as not an injury anyway.
I think the first possibility is definitely a capabilities failure, and the last is definitely an alignment failure, but I’m not sure how to categorize the middle one. The key test is something like: do we expect this problem to get better or worse as model capability increases? I’m inclined to say that we expect it to get better, because a more capable model will be better at guessing what humans want, and better at asking for clarification if uncertain.
This seems generally clear and useful. Thumbs-up.
(2) could probably use a bit more "...may...", "...sometimes..." for clarity. You can't fool all of the models all of the time etc. etc.
thanks, edited :)
What does wanting to use adversarial training say about the sorts of labeled and unlabeled data we have available to us? Are there cases where we could either do adversarial training, or alternately train on a different (potentially much more expensive) dataset?
What does it say about our loss function on different sorts of errors? Are there cases where we could either do adversarial training, or just use a training algorithm that uses a different loss function?
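To make the second question concrete, here's a minimal sketch (in NumPy, with a made-up injury-classifier framing) of the "different loss function" alternative: instead of adversarially generating hard examples, we charge the model more for the error type we care about, e.g. up-weighting the positive class so false negatives cost more. The function name and weight value are illustrative, not from any particular codebase.

```python
import numpy as np

def weighted_logistic_loss(w, X, y, false_negative_weight=5.0):
    """Logistic loss that charges more for errors on positive examples.

    Asymmetric error weighting is one alternative to adversarial
    training when we care much more about one kind of mistake
    (e.g. calling an injury "not an injury") than the other.
    y is in {0, 1}; w is a weight vector for a linear model.
    """
    logits = X @ w
    # per-example weights: up-weight the positive (y == 1) class
    weights = np.where(y == 1, false_negative_weight, 1.0)
    # numerically stable logistic loss via log(1 + exp(-y_pm * logits)),
    # mapping labels {0, 1} -> {-1, +1}
    y_pm = 2 * y - 1
    losses = np.logaddexp(0.0, -y_pm * logits)
    return np.mean(weights * losses)
```

A model trained to minimize this loss will trade away some accuracy on negatives to avoid the expensive errors, which is a different lever than making the training distribution itself adversarial.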
A common justification I hear for adversarial training is robustness to underspecification/slight misspecification. For example, we might not know what parameters the real world has, and our best guesses are probably wrong. We want our AI to perform acceptably regardless of what parameters the world has, so we perform adversarial domain randomization. This falls under 1. as written, insofar as we can say that the AI learnt a stupid heuristic (e.g. a policy that depends on the specific parameters of the environment) and we want it to learn something more clever (infer the parameters of the world, then do the optimal thing under those parameters). We want a model whose performance is invariant to (or at least robust to) the underspecified parameters, but by default our models don't know which parameters are underspecified. Adversarial training helps insofar as it teaches our models which things they should not pay attention to.

In the image classification example you gave for 1., I'd frame it as:

* The training data doesn't disambiguate between "the intended classification boundary is whether or not the image contains a dog" and "the intended classification boundary is whether or not the image contains a high-frequency artifact that results from how the researchers preprocessed their dog images", because the two features are extremely correlated on the training data. The training data has fixed a parameter that we wanted to leave unspecified (which image preprocessing we did, or equivalently, which high-frequency artifacts we left in the data).
* Over the course of normal (non-adversarial) training, we learn a model that depends on the specific settings of the parameters we wanted to leave unspecified (that is, the specific artifacts).
* We can grant an adversary the ability to generate images that break the high-frequency artifacts while preserving the ground truth of whether the image contains a dog. For example, the Lp norm-ball attacks often studied in the adversarial image classification literature.
* By training the model on data generated by said adversary, the model can learn that the specific pattern of high-frequency artifacts doesn't matter, and can spend more of its capacity paying attention to whether or not the image contains dog features.
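As a toy illustration of this dynamic, here's a hedged sketch (NumPy, with invented toy data, not anyone's actual setup): a linear classifier is trained on data where a small "artifact" feature is perfectly correlated with the label, and a one-step L-infinity norm-ball attack is used for adversarial training. The attack can flip the fragile artifact feature but only dents the real signal, so the adversarially trained model leans relatively less on the artifact.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic_grad(w, X, y):
    """Gradient of the mean logistic loss, labels y in {-1, +1}."""
    margins = y * (X @ w)
    s = 1.0 / (1.0 + np.exp(margins))          # sigmoid(-margin)
    return -(X * (y * s)[:, None]).mean(axis=0)

def linf_attack(w, X, y, eps):
    """One-step L-infinity norm-ball attack (FGSM-style): move each
    input by eps in the direction that increases its own loss."""
    margins = y * (X @ w)
    s = 1.0 / (1.0 + np.exp(margins))
    grad_x = -(y * s)[:, None] * w[None, :]    # d(loss_i) / d(x_i)
    return X + eps * np.sign(grad_x)

# Toy data: feature 0 is the real (noisy) signal; feature 1 is a
# low-magnitude "preprocessing artifact" perfectly correlated with
# the label on the training set.
n = 200
y = rng.choice([-1.0, 1.0], size=n)
X = np.stack([y + 0.5 * rng.standard_normal(n),  # real signal
              0.2 * y], axis=1)                  # fragile artifact

def train(adversarial, eps=0.5, lr=0.5, steps=300):
    w = np.zeros(2)
    for _ in range(steps):
        # eps = 0.5 can fully flip the artifact (magnitude 0.2) but
        # only partially degrade the signal (magnitude ~1)
        Xb = linf_attack(w, X, y, eps) if adversarial else X
        w -= lr * logistic_grad(w, Xb, y)
    return w

w_plain = train(adversarial=False)
w_adv = train(adversarial=True)
```

After training, the ratio |w[1]| / |w[0]| (reliance on the artifact relative to the signal) should be smaller for the adversarially trained model, matching the story above: the adversary teaches the model which feature not to pay attention to.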