Francis Rhys Ward

The Elicitation Game: Evaluating capability elicitation techniques

by Teun van der Weij, Felix Hofstätter, JaydenTeoh, HenningB, and Francis Rhys Ward

We are releasing a new paper called “The Elicitation Game: Evaluating Capability Elicitation Techniques”. See tweet thread here. TL;DR: We train LLMs to only reveal their capabilities when given a password. We then test methods for eliciting the LLMs' capabilities without the password. Fine-tuning works best, few-shot prompting and prefilling...

Feb 27, 2025 · 15

Why care about AI personhood?

In this new paper, I discuss what it would mean for AI systems to be persons — entities with properties like agency, theory-of-mind, and self-awareness — and why this is important for alignment. In this post, I say a little more about why you should care. The existential safety literature...

Jan 26, 2025 · 43

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

by Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown, and Francis Rhys Ward

We have written a paper on sandbagging for which we present the abstract and brief results in this post. See the paper for more details. Tweet thread here. Illustration of sandbagging. Evaluators may regulate the deployment of AI systems with dangerous capabilities, potentially against the interests of the AI system...

Jun 13, 2024 · 84

An Introduction to AI Sandbagging

by Teun van der Weij, Felix Hofstätter, and Francis Rhys Ward

Summary: Evaluations provide crucial information to determine the safety of AI systems which might be deployed or (further) developed. These development and deployment decisions have important safety consequences, and therefore they require trustworthy information. One reason why evaluation results might be untrustworthy is sandbagging, which we define as strategic underperformance...

Apr 26, 2024 · 51

Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?

by Teun van der Weij, Felix Hofstätter, and Francis Rhys Ward

Produced as part of the ML Alignment Theory Scholars Program Winter 2024 Cohort, under the mentorship of Francis Rhys Ward. The code, data, and plots can be found on https://github.com/TeunvdWeij/MATS/tree/main/distribution_approximation. This post is meant to provide insight on an interesting LLM capability, which is useful for targeted underperformance on evaluations...

Jan 29, 2024 · 39

Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models

by Felix Hofstätter, Francis Rhys Ward, HarrietW, LAThomson, Ollie J, Patrik Bartak, and Sam F. Brown

This post summarizes work done over the summer as part of the Summer 2023 AI Safety Hub Labs programme. Our results will also be published as part of an upcoming paper. In this post, we focus on explaining how we define and evaluate properties of deceptive behavior in LMs and...

Nov 8, 2023 · 49

Reward Hacking from a Causal Perspective

by tom4everitt, Francis Rhys Ward, sbenthall, James Fox, mattmacdermott, and RyanCarey

Post 5 of Towards Causal Foundations of Safe AGI, preceded by Post 1: Introduction, Post 2: Causality, Post 3: Agency, and Post 4: Incentives. By Francis Rhys Ward, Tom Everitt, Sebastian Benthall, James Fox, Matt MacDermott, Milad Kazemi, Ryan Carey representing the Causal Incentives Working Group. Thanks also to Toby...

Jul 21, 2023 · 29

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Introduction to Towards Causal Foundations of Safe AGI

An Introduction to AI Sandbagging

Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models
