I'm an admin of this site; I work full-time on trying to help people on LessWrong refine the art of human rationality.
Longer bio: www.lesswrong.com/posts/aG74jJkiPccqdkK3c/the-lesswrong-team-page-under-construction#Ben_Pace___Benito
PSA: You can write comment on PDFs in google drive!
There's a button in the top right that says "Add a comment" on hover-over, then you get to click-and-drag to highlight a box in the PDF where your comment goes. I will leave a test comment on the first PDF so everyone can see that.
(I literally just found this out.)
I'm exceedingly excited about this sequence. The Embedded Agency sequence laid out a core set of confusions, and it seems like this is a formal system that deals with those issues far better than the current alternatives e.g. the cybernetics model. This post lays out the basics of Cartesian Frames clearly and communicates key parts of the overall approach ("reasoning like Pearl's to objects like game theory's, with a motivation like Hutter's"). I've also never seen math explained with as much helpful philosophical justification (e.g. "Part of the point of the Cartesian frame framework is that we are not privileging either interpretation"), and I appreciate all of that quite a bit.
It seems likely that by the end of this sequence it will be on a list of my all-time favorite things posted to LessWrong 2.0. I'm looking forward to getting to grips with Cartesian Frames, understanding how they work, and to start applying those intuitions to my other discussions of agency.
I'm also curating it a little quickly to let people know that Scott is giving a talk on this sequence this Sunday at 12:00PM PT. Furthermore, Scott is holding weekly office hours (see the same link for more info) for people to ask questions, and Diffractor is running a reading group in the MIRIx Discord, which I recommend people PM him to get an invite to (I just did so myself, it's a nice Discord server).
+1 I already said I liked it, but this post is great and will immediately be the standard resource on this topic. Thank you so much.
Such a great post.
Note that I changed the formatting of your headers a bit, to make some of them just bold text. They still appear in the ToC just fine. Let me know if you'd like me to revert it or have any other issues.
Oli suggests that there are no fields with three-word-names, and so "AI Existential Risk" is not a choice. I think "AI Alignment" is the currently most accurate name for the field that encompasses work like Paul's and Vanessa's and Scott/Abram's and so on. I think "AI Alignment From First Principles" is probably a good name for the sequence.
It seems a definite improvement on the axis of specificity, I do prefer it over the status quo for that reason.
But it doesn't address the problem of scope-sensitivity. I don't think this sequence is about preventing medium-sized failures from AGI. It's about preventing extinction-level risks to our future.
"A First-Principles Explanation of the Extinction-Level Threat of AGI: Introduction"
"The AGI Extinction Threat from First Principles: Introduction"
"AGI Extinction From First Principles: Introduction"
Critch recently made the argument (and wrote it in his ARCHES paper, summarized by Rohin here) that "AI safety" is a straightforwardly misleading name because "safety" is a broader category than is being talked about in (for example) this sequence – it includes things like not making self-driving cars crash. (To quote directly: "the term “AI safety” should encompass research on any safety issue arising from the use of AI systems, whether the application or its impact is small or large in scope".) I wanted to raise the idea here and ask Richard what he thinks about renaming it to something like "AI existential safety from first principles" or "AI as an extinction risk from first principles" or "AI alignment from first principles".
I still don't understand how corrigibility and intent alignment are different.
I listened to this yesterday! Was quite interesting, I'm glad I listened to it.
I expect the examples Ajeya has in mind are more like sharing one-line summaries in places that tend to be positively selected for virality and anti-selected for nuance (like tweets), but that substantive engagement by individuals here or in longer posts will be much appreciated.