RyanCarey — AI Alignment Forum

Reward Hacking from a Causal Perspective

Post 5 of Towards Causal Foundations of Safe AGI, preceded by Post 1: Introduction, Post 2: Causality, Post 3: Agency, and Post 4: Incentives. By Francis Rhys Ward, Tom Everitt, Sebastian Benthall, James Fox, Matt MacDermott, Milad Kazemi, Ryan Carey, representing the Causal Incentives Working Group. Thanks also to Toby...

Jul 21, 2023 · 29

Incentives from a causal perspective

Post 4 of Towards Causal Foundations of Safe AGI, preceded by Post 1: Introduction, Post 2: Causality, and Post 3: Agency. By Tom Everitt, James Fox, Ryan Carey, Matt MacDermott, Sebastian Benthall, and Jon Richens, representing the Causal Incentives Working Group. Thanks also to Toby Shevlane and Aliya Ahmad. “Show...

Jul 10, 2023 · 27

Causality: A Brief Introduction

Post 2 of Towards Causal Foundations of Safe AGI, see also Post 1 Introduction. By Lewis Hammond, Tom Everitt, Jon Richens, Francis Rhys Ward, Ryan Carey, Sebastian Benthall, and James Fox, representing the Causal Incentives Working Group. Thanks also to Alexis Bellot, Toby Shevlane, and Aliya Ahmad. Causal models are...

Jun 20, 2023 · 49

Introduction to Towards Causal Foundations of Safe AGI

By Tom Everitt, Lewis Hammond, Rhys Ward, Ryan Carey, James Fox, Sebastian Benthall, Matt MacDermott, and Shreshth Malik, representing the Causal Incentives Working Group. Thanks also to Toby Shevlane, MH Tessler, Aliya Ahmad, Zac Kenton, Maria Loks-Thompson, and Alexis Bellot. Over the next few years, society, organisations, and individuals will...

Jun 12, 2023 · 74

RyanCarey's Shortform

Jan 24, 2021 · 6

New paper: The Incentives that Shape Behaviour

Abstract: Which variables does an agent have an incentive to control with its decision, and which variables does it have an incentive to respond to? We formalise these incentives, and demonstrate unique graphical criteria for detecting them in any single decision causal influence diagram. To this end, we introduce structural...

Jan 23, 2020 · 23

When to use quantilization

In 2015, Jessica introduced quantilization as a countermeasure for Goodhart's Law and specification-gaming. Since these are such central problems in AI safety, I consider quantilization to be one of the best innovations in AI safety so far, but it has received little attention from the AI safety field. I think...

Feb 5, 2019 · 65