The post lays out some arguments in favor of OpenAI’s approach to alignment and responds to common objections.
This is a link post for https://aligned.substack.com/p/alignment-mvp I'm writing a sequence of posts on the approach to alignment I'm currently most excited about. This second post argues that instead of trying to solve the alignment problem once and for all, we can succeed with something less ambitious: building a system...
This is a link post for https://aligned.substack.com/p/ai-assisted-human-feedback I'm writing a sequence of posts on the approach to alignment I'm currently most excited about. This first post argues for recursive reward modeling and explains the problem it's meant to address (scaling RLHF to tasks that are hard to evaluate).
(Originally posted to the DeepMind blog) Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome. We have all had experiences with specification gaming, even if not by this name. Readers may have heard the myth of King Midas and the golden...
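As a rough illustration of the concept (a hypothetical toy example, not one from the post): a reward function that literally counts checkpoints touched can be maximized by a policy that loops on a single respawning checkpoint forever, without ever achieving the intended outcome of finishing the lap.

```python
# Toy sketch of specification gaming (hypothetical example, not from the post).
# The literal specification rewards touching checkpoints; the intended outcome is
# finishing the lap. A policy that circles one respawning checkpoint maximizes the
# literal reward while never achieving what the designer wanted.

def literal_reward(trajectory):
    """Reward as literally specified: +1 per checkpoint touched."""
    return sum(1 for event in trajectory if event == "checkpoint")

def intended_outcome(trajectory):
    """What the designer actually wanted: the lap gets finished."""
    return "finish_line" in trajectory

# An honest policy touches a few checkpoints and crosses the finish line.
honest_run = ["checkpoint", "checkpoint", "checkpoint", "finish_line"]

# A gaming policy circles one respawning checkpoint indefinitely.
gaming_run = ["checkpoint"] * 100

print(literal_reward(honest_run), intended_outcome(honest_run))  # 3 True
print(literal_reward(gaming_run), intended_outcome(gaming_run))  # 100 False
```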
This is a brief technical note on how to get convergence in the cooperative inverse reinforcement learning framework. We extend cooperative inverse RL to partially observable domains and use a recent result on the grain of truth problem to establish (arguably very strong) conditions to get convergence to ε-Nash equilibria. Credit:...
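For reference, the solution concept here is the standard ε-Nash equilibrium (general definition, not anything specific to this note): a strategy profile $\sigma = (\sigma_1, \dots, \sigma_n)$ such that no player can gain more than $\varepsilon$ by deviating unilaterally,

$$V_i(\sigma_i, \sigma_{-i}) \;\ge\; \sup_{\sigma_i'} V_i(\sigma_i', \sigma_{-i}) - \varepsilon \qquad \text{for every player } i,$$

where $V_i$ is player $i$'s expected return and $\sigma_{-i}$ denotes the strategies of the other players. Setting $\varepsilon = 0$ recovers an exact Nash equilibrium.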