AI ALIGNMENT FORUM

Sebastian Prasanna — AI Alignment Forum

76

1

3mo

Five approaches to evaluating training-based control measures

Training-based control studies how effectively different training methods constrain the behavior of misaligned AI models. A central case where we want to control AI models is safety research: scheming AI models (i.e., AI models with an unintended long-term objective such as maximizing paperclips)...