Top posts
There have been several discussions about the importance of adversarial robustness for scalable oversight. I’d like to point out that adversarial robustness is also important under a different threat model: catastrophic misuse. A brief summary of the argument: 1. Misuse could lead to catastrophe. AI-assisted cyberattacks, political persuasion, and...
Anthropic, DeepMind, and Google Brain are all working on strategies to train language models on their own outputs. A brief summary of the work so far: 1. Red Teaming Language Models with Language Models (Perez et al., 2022). One model prompts another, seeking to expose undesirable generations. A third...