Machine Unlearning Evaluations as Interpretability Benchmarks
Interpreting Models by Ablation. Image generated by DALL-E 3.

Introduction

Interpretability in machine learning, especially in language models, is an active area with a large number of contributions. While this work can be quite useful for improving our understanding of models, one issue is that there is a lack of robust benchmarks...
Oct 23, 2023