AI ALIGNMENT FORUM

Ollie J — AI Alignment Forum

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

by Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown, and Francis Rhys Ward

We have written a paper on sandbagging; this post presents the abstract and brief results. See the paper for more details. Evaluators may regulate the deployment of AI systems with dangerous capabilities, potentially against the interests of the AI system...

Jun 13, 2024 • 84

Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models

by Felix Hofstätter, Francis Rhys Ward, HarrietW, LAThomson, Ollie J, Patrik Bartak, and Sam F. Brown

This post summarizes work done over the summer as part of the Summer 2023 AI Safety Hub Labs programme. Our results will also be published as part of an upcoming paper. In this post, we focus on explaining how we define and evaluate properties of deceptive behavior in LMs and...

Nov 8, 2023 • 49