Edit: Here's a great comment by Ethan Perez that caveats this result, which I'd recommend reading for context. This is a paper by folks on Quoc Le's team at Google that examines the winning tasks from Round 1 of Anthropic's Inverse Scaling Prize. They find that 3/4 of the winning...
We're open-sourcing POWERplay, a research toolchain you can use to study power-seeking behavior in reinforcement learning agents. POWERplay was developed by Gladstone AI for internal research. POWERplay's main use is to estimate the instrumental value that a reinforcement learning agent can get from a state in an MDP. Its implementation...
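As a rough illustration of the core quantity POWERplay estimates, here is a minimal sketch (not POWERplay's actual API) of single-agent POWER in the sense of Alex Turner et al.: a Monte Carlo average, over randomly drawn reward functions, of the optimal value a state offers beyond its immediate reward. The toy chain MDP, uniform reward distribution, and all names below are illustrative assumptions.

```python
import numpy as np

# Toy 1-D chain MDP: states 0..4; two deterministic actions (left, right).
N_STATES, GAMMA, N_SAMPLES = 5, 0.9, 500
rng = np.random.default_rng(0)

def next_state(s, a):
    # Action 0 moves left, action 1 moves right; walls clip at the ends.
    return max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)

def optimal_values(reward):
    """Value iteration for a state-based reward on the deterministic chain."""
    v = np.zeros(N_STATES)
    for _ in range(200):  # gamma^200 is negligible, so this converges
        v = np.array([reward[s] + GAMMA * max(v[next_state(s, 0)],
                                              v[next_state(s, 1)])
                      for s in range(N_STATES)])
    return v

# Monte Carlo estimate of POWER(s) = (1 - gamma)/gamma * E_R[V*_R(s) - R(s)],
# averaging over reward functions drawn i.i.d. uniform on [0, 1] per state.
power = np.zeros(N_STATES)
for _ in range(N_SAMPLES):
    r = rng.uniform(size=N_STATES)
    power += optimal_values(r) - r
power *= (1 - GAMMA) / (GAMMA * N_SAMPLES)

print(power)
```

On this chain, central states score higher than the edges: they can reach any high-reward state with less discounting, which is the "optionality" intuition behind treating POWER as instrumental value.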
Summary of this post This is the third post in a three-part sequence on instrumental convergence in multi-agent RL. Read Part 1 and Part 2. In this post, we’ll: 1. Investigate instrumental convergence on a multi-agent gridworld with a complicated topology. 2. Show that when we add a simple physical...
Summary of this post This is the second post in a three-part sequence on instrumental convergence in multi-agent RL. Read Part 1 here. In this post, we’ll: 1. Define formal multi-agent POWER (i.e., instrumental value) in a setting that contains a "human" agent and an "AI" agent. 2. Introduce the...
Summary of the sequence Over the past few months, we’ve been investigating instrumental convergence in reinforcement learning agents. We started from the definition of single-agent POWER proposed by Alex Turner et al., extended it to a family of multi-agent scenarios that seemed relevant to AI alignment, and explored its implications...
TLDR: We've put together a website to track recent releases of superscale models, and comment on the immediate and near-term safety risks they may pose. The website is little more than a view of an Airtable spreadsheet at the moment, but we'd greatly appreciate any feedback you might have on...
Thanks to Vladimir Mikulik for suggesting that I write this, and to Rohin Shah and Daniel Kokotajlo for kindly providing feedback. Prologue This is a story about a universe a lot like ours. In this universe, the scaling hypothesis — which very roughly says that you can make an AI...