Improving Model-Written Evals for AI Safety Benchmarking
This post was written as part of the summer 2024 cohort of the ML Alignment & Theory Scholars program, under the mentorship of Marius Hobbhahn.

Abstract

As model-written evals (MWEs) become more widely used in AI benchmarking, we need scalable approaches to assessing the quality of the eval questions. Upon...
Oct 15, 2024