EDIT 1/27: This post neglects the entire sub-field of estimating uncertainty of learned representations, as in https://openreview.net/pdf?id=e9n4JjkmXZ. I might give that a separate follow-up post. Introduction: Suppose you've built some AI model of human values. You input a situation, and it spits out a goodness rating. You might want to...
A mostly finished post I'm kicking out the door. You'll get the gist. I. There's a tempting picture of alignment that centers on the feeling of "As long as humans stay in control, it will be okay." Humans staying in control, in this picture, is something like humans giving lots...
This is pretty basic. But I still made a bunch of mistakes when writing this, so maybe it's worth writing. This is background to a specific case I'll put in the next post. It's like a tech tree. If we're looking at the big picture, then whether some piece...
Update February 21st: After the initial publication of this article (January 3rd), we received a lot of feedback, and several people pointed out that propositions 1 and 2 were incorrect as stated. That was unfortunate, as it distracted from the broader arguments in the article, and I (Jan K) take...
A delayed hot take. This is pretty similar to previous comments from Rohin. Shard theory alignment requires magic - not in the sense of magic spells, but in the technical sense of steps we need to remind ourselves we don't know how to do. Locating magic is an important step...
Meta: Over the past few months, we've held a seminar series on the Simulators theory by janus. As the theory is under active development, the purpose of the series is to discover central structures and open problems. Our aim with this sequence is to share some of our discussions with...
As a writing exercise, I'm writing an AI Alignment Hot Take Advent Calendar - one new hot take, written every day (in practice, some days) for 25 days. It's the end (I saved a tenuous one for ya)! Kind of disappointing that this ended up averaging out to one every 2 days,...