The conversation begins (Fictional) Optimist: So you expect future artificial superintelligence (ASI) “by default”, i.e. in the absence of yet-to-be-invented techniques, to be a ruthless sociopath, happy to lie, cheat, and steal, whenever doing so is selfishly beneficial, and with callous indifference to whether anyone (including its own programmers and...
Let’s call “interpretability-in-the-loop training” the idea of running a learning algorithm that involves an inscrutable trained model, with some kind of interpretability system feeding into the loss function / reward function. Interpretability-in-the-loop training has a very bad rap (and rightly so). Here’s Yudkowsky 2022: > When you explicitly optimize...
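To make the structure of that idea concrete, here is a minimal toy sketch in Python. Everything in it is illustrative and hypothetical: `interp_penalty` stands in for whatever interpretability system is in the loop (here it is just a crude weight-magnitude proxy, not a real interpretability tool), and the model is a two-parameter linear fit trained by numerical gradient descent. The point is only the shape of the objective: the training loss is task performance plus a term computed by the "interpretability" probe, so the optimizer is pushed against the probe itself, which is exactly the failure mode the quote above worries about.

```python
# Toy sketch of "interpretability-in-the-loop training".
# All names and numbers are illustrative, not from any real system.

def interp_penalty(weights):
    # Stand-in for an interpretability system feeding into the loss:
    # here, a crude "transparency" proxy that penalizes large weights.
    return sum(w * w for w in weights)

def task_loss(weights, data):
    # Toy task: fit y = sum(w * x) to targets via mean squared error.
    total = 0.0
    for xs, y in data:
        pred = sum(w * x for w, x in zip(weights, xs))
        total += (pred - y) ** 2
    return total / len(data)

def train(data, steps=200, lr=0.05, lam=0.1, eps=1e-5):
    # Gradient descent on the combined objective:
    #   task_loss + lam * interp_penalty
    # so the interpretability signal is optimized against directly.
    weights = [0.0, 0.0]
    for _ in range(steps):
        f0 = task_loss(weights, data) + lam * interp_penalty(weights)
        grads = []
        for i in range(len(weights)):
            bumped = list(weights)
            bumped[i] += eps  # forward-difference numerical gradient
            f1 = task_loss(bumped, data) + lam * interp_penalty(bumped)
            grads.append((f1 - f0) / eps)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights
```

In this toy version the "probe" is too simple to be fooled, but the structure shows where the pressure lands: gradient descent treats the interpretability term as just another part of the loss to drive down, by whatever means are available.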
This post is partly a belated response to Joshua Achiam, currently OpenAI’s Head of Mission Alignment: > If we adopt safety best practices that are common in other professional engineering fields, we'll get there … I consider myself one of the x-risk people, though I agree that most of them...
A new version of “Intro to Brain-Like-AGI Safety” is out! Things that have not changed Same links as before: * As a series of 15 blog posts on LessWrong / Alignment Forum: https://www.lesswrong.com/s/HzcM2dkCq7fwXBej8 * As a 225-page PDF (now up to version 3): https://osf.io/preprints/osf/fe36n * Summary video: Video & transcript:...
Previous: 2024, 2022 “Our greatest fear should not be of failure, but of succeeding at something that doesn't really matter.” –attributed to DL Moody[1] 1. Background & threat model The main threat model I’m working to address is the same as it’s been since I was hobby-blogging about AGI safety...
In the companion post We need a field of Reward Function Design, I implore researchers to think about what RL reward functions (if any) will lead to RL agents that are not ruthless power-seeking consequentialists. And I further suggest that human social instincts constitute an intriguing example we should study,...
(Brief pitch for a general audience, based on a 5-minute talk I gave.) Let’s talk about Reinforcement Learning (RL) agents as a possible path to Artificial General Intelligence (AGI). My research focuses on “RL agents”, broadly construed. These were big in the 2010s—they made the news for learning to play...