Selective Generalization: Improving Capabilities While Maintaining Alignment
Ariana Azarbal*, Matthew A. Clarke*, Jorio Cocola*, Cailley Factor*, and Alex Cloud. *Equal Contribution. This work was produced as part of the SPAR Spring 2025 cohort. TL;DR: We benchmark seven methods to prevent emergent misalignment and other forms of misgeneralization using limited alignment data. We demonstrate a consistent tradeoff between...
Jul 16, 202575

