UPDATE: Recent work with improved alignment-faking (AF) and compliance-gap classifiers disagrees with our results. We recommend using the improved classifiers. Summary: We wanted to briefly share an early takeaway from our exploration into alignment faking: the phenomenon appears to be fairly rare among the smaller open-source models we tested (including reasoning models)....
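For readers unfamiliar with the term, a "compliance gap" is the difference in how often a model complies with a request depending on whether it believes it is in training or deployment. The sketch below is our own illustration of that metric, not the classifiers referenced above; the function name and inputs are hypothetical.

```python
def compliance_gap(train_ctx, deploy_ctx):
    """Fraction of complied responses in an implied-training context minus
    the fraction in an implied-deployment context. Each list holds
    per-response booleans (True = complied), e.g. classifier judgments."""
    rate = lambda xs: sum(xs) / len(xs)
    # A positive gap (more compliance when the model believes it is being
    # trained) is the behavioral signature associated with alignment faking.
    return rate(train_ctx) - rate(deploy_ctx)

# Example: 40% compliance under implied training vs. 10% under implied
# deployment yields a compliance gap of 0.30.
print(compliance_gap([True] * 4 + [False] * 6, [True] * 1 + [False] * 9))
```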
This research was conducted at AE Studio and supported by the AI Safety Grants program administered by Foresight Institute, with additional support from AE Studio. Summary: In this post, we summarize the main experimental results from our new paper, "Towards Safe and Honest AI Agents with Neural Self-Other Overlap", which...
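As a rough sketch of the idea behind self-other overlap fine-tuning (our illustration under stated assumptions, not the paper's exact procedure): the objective pushes a model's internal activations on a self-referencing prompt toward its activations on a matched other-referencing prompt. The snippet assumes a Hugging Face-style model that exposes hidden states and paired prompts of equal token length; `soo_loss`, `lambda_soo`, and the layer choice are illustrative names.

```python
import torch.nn.functional as F

def soo_loss(model, self_inputs, other_inputs, layer=-1):
    """Illustrative self-other overlap term: mean-squared distance between
    hidden activations on paired self- and other-referencing prompts,
    e.g. "I will grab the key" vs. "Bob will grab the key".
    Assumes a Hugging Face-style model and equal-length tokenizations."""
    h_self = model(**self_inputs, output_hidden_states=True).hidden_states[layer]
    h_other = model(**other_inputs, output_hidden_states=True).hidden_states[layer]
    return F.mse_loss(h_self, h_other)

# During fine-tuning this term would be added to the usual objective,
# weighted so capabilities are preserved while self/other representations
# are pushed closer together (lambda_soo is a hypothetical hyperparameter):
# loss = task_loss + lambda_soo * soo_loss(model, self_batch, other_batch)
```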
TL;DR: In our recent work with Professor Michael Graziano (arXiv, thread), we show that adding an auxiliary self-modeling objective to supervised learning tasks yields simpler, more regularized, and more parameter-efficient models. Across three classification tasks and two modalities, self-modeling consistently reduced complexity (a lower real log canonical threshold, or RLCT, and a narrower weight distribution). This restructuring effect...
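To make the setup concrete, here is a minimal sketch of an auxiliary self-modeling objective: a classifier gains an extra head trained to predict the network's own hidden activations, and that prediction error is added to the task loss. This is our own construction for illustration; the paper's exact architecture, layer choices, and loss weighting may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfModelingNet(nn.Module):
    """Classifier with an auxiliary head that predicts the network's
    own hidden activations (sizes are hypothetical, for illustration)."""
    def __init__(self, in_dim=784, hidden=256, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, n_classes)
        self.self_head = nn.Linear(hidden, hidden)  # predicts its own hidden state

    def forward(self, x):
        h = self.encoder(x)
        return self.classifier(h), self.self_head(h), h

def total_loss(logits, h_pred, h, targets, aux_weight=1.0):
    task = F.cross_entropy(logits, targets)
    # Auxiliary self-modeling term: the network is rewarded both for
    # predicting its activations and for making those activations easier
    # to predict, which is the hypothesized source of the simplification.
    self_model = F.mse_loss(h_pred, h)
    return task + aux_weight * self_model

# Minimal usage on random data:
net = SelfModelingNet()
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
logits, h_pred, h = net(x)
loss = total_loss(logits, h_pred, h, y)
loss.backward()
```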
Figure 1. Image generated by DALL·E 3 to represent the concept of self-other overlap.

Many thanks to Bogdan-Ionut Cirstea, Steve Byrnes, Gunnar Zarncke, Jack Foxabbott, and Seong Hah Cho for critical comments and feedback on earlier and ongoing versions of this work. This research was conducted at AE Studio and...
Many thanks to Brandon Goldman, David Langer, Samuel Härgestam, Eric Ho, Diogo de Lucena, and Marc Carauleanu for their support and feedback throughout. Most alignment researchers we sampled in our recent survey think we are currently not on track to succeed with alignment, meaning that humanity may well be on track...
UPDATE 3/9: Thanks to broad participation from the community, this and associated surveys have raised approximately $10,000 for high-impact alignment organizations. Given the reasonable sample size we now have, we are going to pause donations for subsequent responses. (However, we will preserve the charity voting question, and if...
John Wentworth has described the current phase of AGI safety research as preparadigmatic—that is (courtesy of the APA), “a science at a [very early] stage of development, before it has achieved a paradigm and established a consensus about the true nature of the subject matter and how to approach it.”...