Supervised fine-tuning as a method for training-based AI control
Executive summary This post is a research update on our ongoing project studying training-based AI control: using training to mitigate risk from misaligned AI systems in diffuse threat models like research sabotage. This post presents our results on supervised fine-tuning (SFT), the simplest training-based approach. We evaluate our training measures...
Nov 13, 202540