Latent Adversarial Training (LAT) Improves the Representation of Refusal — AI Alignment Forum