Edit: Here's a great comment by Ethan Perez that caveats this result, which I'd recommend reading for context. This is a paper by folks on Quoc Le's team at Google that examines the winning tasks from Round 1 of Anthropic's Inverse Scaling Prize. They find that 3/4 of the winning...
We're open-sourcing POWERplay, a research toolchain you can use to study power-seeking behavior in reinforcement learning agents. POWERplay was developed by Gladstone AI for internal research. POWERplay's main use is to estimate the instrumental value that a reinforcement learning agent can get from a state in an MDP. Its implementation...
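As a rough illustration of the core quantity POWERplay estimates, here is a minimal sketch (not POWERplay's actual API) of single-agent POWER in the sense of Alex Turner et al.: a Monte Carlo average, over randomly drawn reward functions, of the optimal value a state offers beyond its immediate reward. The toy chain MDP, uniform reward distribution, and all names below are illustrative assumptions.

```python
import numpy as np

# Toy 1-D chain MDP: states 0..4; two deterministic actions (left, right).
N_STATES, GAMMA, N_SAMPLES = 5, 0.9, 500
rng = np.random.default_rng(0)

def next_state(s, a):
    # Action 0 moves left, action 1 moves right; walls clip at the ends.
    return max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)

def optimal_values(reward):
    """Value iteration for a state-based reward on the deterministic chain."""
    v = np.zeros(N_STATES)
    for _ in range(200):  # gamma^200 is negligible, so this converges
        v = np.array([reward[s] + GAMMA * max(v[next_state(s, 0)],
                                              v[next_state(s, 1)])
                      for s in range(N_STATES)])
    return v

# Monte Carlo estimate of POWER(s) = (1 - gamma)/gamma * E_R[V*_R(s) - R(s)],
# averaging over reward functions drawn i.i.d. uniform on [0, 1] per state.
power = np.zeros(N_STATES)
for _ in range(N_SAMPLES):
    r = rng.uniform(size=N_STATES)
    power += optimal_values(r) - r
power *= (1 - GAMMA) / (GAMMA * N_SAMPLES)

print(power)
```

On this chain, central states score higher than the edges: they can reach any high-reward state with less discounting, which is the "optionality" intuition behind treating POWER as instrumental value.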
Summary of this post This is the third post in a three-part sequence on instrumental convergence in multi-agent RL. Read Part 1 and Part 2. In this post, we’ll: 1. Investigate instrumental convergence on a multi-agent gridworld with a complicated topology. 2. Show that when we add a simple physical...
Summary of this post This is the second post in a three-part sequence on instrumental convergence in multi-agent RL. Read Part 1 here. In this post, we’ll: 1. Define formal multi-agent POWER (i.e., instrumental value) in a setting that contains a "human" agent and an "AI" agent. 2. Introduce the...
Summary of the sequence Over the past few months, we’ve been investigating instrumental convergence in reinforcement learning agents. We started from the definition of single-agent POWER proposed by Alex Turner et al., extended it to a family of multi-agent scenarios that seemed relevant to AI alignment, and explored its implications...
TLDR: We've put together a website to track recent releases of superscale models, and comment on the immediate and near-term safety risks they may pose. The website is little more than a view of an Airtable spreadsheet at the moment, but we'd greatly appreciate any feedback you might have on...
Thanks to Vladimir Mikulik for suggesting that I write this, and to Rohin Shah and Daniel Kokotajlo for kindly providing feedback. Prologue This is a story about a universe a lot like ours. In this universe, the scaling hypothesis — which very roughly says that you can make an AI...