Top posts
There have been several discussions about the importance of adversarial robustness for scalable oversight. I’d like to point out that adversarial robustness is also important under a different threat model: catastrophic misuse. A brief summary of the argument: 1. Misuse could lead to catastrophe. AI-assisted cyberattacks, political persuasion, and...
Anthropic, DeepMind, and Google Brain are all working on strategies to train language models on their own outputs. A brief summary of the work so far: 1. Red Teaming Language Models with Language Models (Perez et al., 2022). One model prompts another, seeking to expose undesirable generations. A third...