I am a Ph.D. student at HKUST. My research interests lie broadly in AI Safety, Interpretability, and Alignment.
https://alan-qin.github.io/
My knowledge of the broader machine learning and neuroscience fields is limited, and I strongly suspect that there are connections to other topics out there – perhaps some that have already been studied, and perhaps some which have yet to be. For example, there are probably interesting connections between interpretability and dataset distillation (Wang et al., 2018). I’m just not sure what they are yet.
Dataset condensation (distillation) may be seen as a global explanation of the training dataset. We could also use DC to extract spurious correlations such as backdoor triggers. We made a naive attempt in this direction: https://openreview.net/forum?id=ix3UDwIN5E .
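To make the connection concrete, here is a toy sketch of the distillation idea from Wang et al. (2018), simplified to a linear model where only the synthetic *labels* are learned (a label-distillation variant) so the meta-gradient can be written by hand. All variable names and hyperparameters are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 5, 200, 3            # feature dim, real examples, synthetic examples

# Real data from a hidden linear teacher.
w_true = rng.normal(size=d)
X_r = rng.normal(size=(m, d))
y_r = X_r @ w_true

# Synthetic inputs are fixed; only the labels y_s are distilled.
X_s = rng.normal(size=(n, d))
y_s = np.zeros(n)

w0 = np.zeros(d)               # fixed student initialization
alpha = 0.1                    # inner-loop (student) step size
beta = 0.5                     # outer-loop (distillation) step size

def inner_step(y_s):
    """One gradient step on the tiny synthetic set (MSE loss)."""
    grad = (2 / n) * X_s.T @ (X_s @ w0 - y_s)
    return w0 - alpha * grad

def outer_loss(w):
    """Loss of the one-step student on the real data."""
    return np.mean((X_r @ w - y_r) ** 2)

loss_before = outer_loss(inner_step(y_s))
for _ in range(500):
    w1 = inner_step(y_s)
    # w1 is affine in y_s with Jacobian (2*alpha/n) * X_s.T, so by the
    # chain rule d(outer)/d(y_s) = (2*alpha/n) * X_s @ d(outer)/d(w1).
    dL_dw1 = (2 / m) * X_r.T @ (X_r @ w1 - y_r)
    dL_dys = (2 * alpha / n) * X_s @ dL_dw1
    y_s -= beta * dL_dys
loss_after = outer_loss(inner_step(y_s))
```

The distilled labels that come out of this loop are exactly the kind of "global explanation" object mentioned above: a handful of synthetic targets that summarize what one training step on the real dataset would teach the model.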
But suppose that despite our best effort, we end up with a deceptively aligned system on our hands. Now what do we do? At this point, the problem of detecting and fixing deception becomes quite similar to just detecting and fixing problems with the model in general – except for one thing. Deceptive alignment failures are triggered by inputs that are, by definition, hard to find during training.
I agree almost completely. One example that seems to cut the other way is unfaithful CoT reasoning (https://arxiv.org/abs/2307.13702).
I think that selectively modifying the inner state (the Surgeon's task) is no easier than searching for adversarial examples in the input space.