janleike — AI Alignment Forum

[Link] Why I’m optimistic about OpenAI’s alignment approach

The post lays out some arguments in favor of OpenAI’s approach to alignment and responds to common objections.

Dec 5, 2022 · 98

[Link] A minimal viable product for alignment

This is a link post for https://aligned.substack.com/p/alignment-mvp
I'm writing a sequence of posts on the approach to alignment I'm currently most excited about. This second post argues that instead of trying to solve the alignment problem once and for all, we can succeed with something less ambitious: building a system...

Apr 6, 2022 · 53

[Link] Why I’m excited about AI-assisted human feedback

This is a link post for https://aligned.substack.com/p/ai-assisted-human-feedback
I'm writing a sequence of posts on the approach to alignment I'm currently most excited about. This first post argues for recursive reward modeling and the problem it's meant to address (scaling RLHF to tasks that are hard to evaluate).

Apr 6, 2022 · 29

Specification gaming: the flip side of AI ingenuity

by Vika, Vlad Mikulik, Matthew Rahtz, tom4everitt, Zac Kenton, and janleike

(Originally posted to the DeepMind Blog.) Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome. We have all had experiences with specification gaming, even if not by this name. Readers may have heard the myth of King Midas and the golden...

May 6, 2020 · 69

General Cooperative Inverse RL Convergence

This is a brief technical note on how to get convergence in the cooperative inverse reinforcement learning framework. We extend cooperative inverse RL to partially observable domains and use a recent result on the grain of truth problem to establish (arguably very strong) conditions to get convergence to ε-Nash equilibria. Credit:...

Jun 17, 2016 · 3

Jan Leike
