This is great, thank you!
Minor formatting note: The italics font on both the AI Alignment Forum and LessWrong isn't super well suited to large block of text, so I took the liberty to unitalicize a bunch of the large blockquotes (which should be sufficiently distinguishable as blockquotes without the italics). Though I am totally happy to reverse it if you prefer the previous formatting.
Promoted to curated: As Adele says, this feels related to a bunch of the Jeffery-Bolker rotation ideas, which I've referenced many many times since then, but in a way that feels somewhat independent, which makes me more excited about there being some deeper underlying structure here.
I've also had something like this in my mind for a while, but haven't gotten around to formalizing it, and I think I've seen other people make similar arguments in the past, which makes this a valuable clarification and synthesis that I expect to get referenced a bunch.
So secret that even a spoiler tag wasn't good enough.
Promoted to curated: This is a long and dense post, but I really liked it, and find this kind of commentary from a large variety of thinkers in the AI Alignment space quite useful. I found that it really helped me think about the implications of a lot of the topics discussed in the main sequence in much more detail, and in a much more robust way, and I have come back to this post multiple times since it's been published.
Also, of course, the whole original sequence is great and I think currently the best short introduction to AI-Risk that exists out there.
I agree with this, and was indeed kind of thinking of them as one post together.
I have now linked at least 10 times to the heading on "'Generate evidence of difficulty' as a research purpose" section of this post. It was a thing that I kind of wanted to point to before this post came out, but felt confused about it, and this post finally gave me a pointer to it.
I think that section was substantially more novel and valuable to me than the rest of this post, but it is also evidence that others might have also not had some of the other ideas on their map, and so they might found it similarly valuable because of a different section.
I think this post and the Gradient Hacking post caused me to actually understand and feel able to productively engage with the idea of inner-optimizers. I think the paper and full sequence was good, but I bounced off of it a few times, and this helped me get traction on the core ideas in the space.
I also think that some parts of this essay hold up better as a core abstraction than the actual mesa-optimizer paper itself, though I am not at all confident about this. But I just noticed that when I am internally thinking through alignment problems relate... (read more)
adamshimi says almost everything I wanted to say in my review, so I am very glad he made the points he did, and I would love for both his review and the top level post to be included in the book.
The key thing I want to emphasize a bit more is that I think the post as given is very abstract, and I have personally gotten a lot of value out of trying to think of more concrete scenarios where gradient hacking can occur.
I think one of the weakest aspects of the post is that it starts with the assumption that an AI system has already given rise to an... (read more)
I found this quite compelling. I don't think I am sold on some of the things yet (in particular claims 5 and 6), but thanks a lot for writing this up this clearly. I will definitely take some time to think more about this.
I think the CAIS framing that Eric Drexler proposed gave concrete shape to a set of intuitions that many people have been relying on for their thinking about AGI. I also tend to think that those intuitions and models aren't actually very good at modeling AGI, but I nevertheless think it productively moved the discourse forward a good bit.
In particular I am very grateful about the comment thread between Wei Dai and Rohin, which really helped me engage with the CAIS ideas, and I think were necessary to get me to my current understanding of CAIS and to ... (read more)
Promoted to curated: Even if God and Santa Claus are not real, we do experience a Christmas miracle every year in the form of these amazingly thorough reviews by Larks. Thank you for your amazing work, as this continues to be an invaluable resource to anyone trying to navigate the AI Alignment landscape, whether as a researcher, grantmaker or independent thinker.
Unfortunately, they are only sporadically updated and difficult to consume using automated tools. We encourage organizations to start releasing machine-readable bibliographies to make our lives easier.
Oh interesting. Would it be helpful to have something on the AI Alignment in the form of some kind of more machine-readable citation system, or did you find the current setup sufficient?
Also, thank you for doing this!
Yep, you can revise it any time before we actually publish the book, though ideally you can revise it before the vote so people can be compelled by your amazing updates!
Coming back to this post, I have some thoughts related to it that connect this more directly to AI Alignment that I want to write up, and that I think make this post more important than I initially thought. Hence nominating it for the review.
I think of Utility != Reward as probably the most important core point from the Mesa-Optimizer paper, and I preferred this explanation over the one in the paper (though it leaves out many things and wouldn't want it to be the only thing someone reads on the topic)
Most of my points from my curation notice still hold. And two years later, I am still thinking a lot about credit assignment as a perspective on many problems I am thinking about.
This seems like one I would significantly re-write for the book if it made it that far. I feel like it got nominated for the introductory material, which I wrote quickly in order to get to the "main point" (the gradient gap). A better version would have discussed credit assignment algorithms more.
This post felt like it took a problem that I was thinking about from 3 different perspectives and combined them in a way that felt pretty coherent, though I am fully sure how right it gets it. Concretely, the 3 domains I felt it touched on were:
All three of these feel pretty important to me.
Gradient hacking seems important and I really didn't think of this as a concrete consideration until this post came out.
I've referred specifically to the section on "Generate evidence of difficulty" as a research purpose many times since this post has come out, and while I have disagreements with it, I do really like it as a handle for a consideration that I hadn't previously seen written up, and does strike me as quite important.
While I think this post isn't the best writeup of this topic I can imagine, I think it makes a really important point quite succinctly, and is one that I have brought up many times in arguments around takeoff speeds and risk scenarios since this post came out.
In talking to many people about AI Alignment over the years, I've repeatedly found that a surprisingly large generator of disagreement about risk scenarios was disagreement about the fragility of human values.
I think this post should be reviewed for it's excellent comment section at least as much as for the original post, and also think that this post is a pretty central example of the kind of post I would like to see more of.
As Robby said, this post isn't perfect, but it felt like it opened up a conversation on LessWrong that I think is really crucial, and was followed up in a substantial number of further posts by Daniel Kokotajlo that I have found really useful. Many of those were written in 2020, but of the ones written in 2019, this strikes me as the one I remember most.
I am organizing a reading group for this report next Tuesday in case you (or anyone else) wants to show up:
I... think this post was impacted by a bug in the LW API that GreaterWrong ran into, that made it so that it wasn't visible on the frontpage when it was published. It nevertheless appears to have gotten some amount of engagement, but maybe that was all from direct links?
Given the substantial chance that a number of people have never seen this post, I reposted it. Its original publishing date was the 11th of June.
Promoted to curated: I really enjoyed reading through this sequence. I have some disagreements with it, but overall it's one of the best plain language introductions to AI safety that I've seen, and I expect I will link to this as a good introduction many times in the future. I was also particularly happy with how the sequence bridged and synthesized a number of different perspectives that usually feel in conflict with each other.
Promoted to curated: This kind of thinking seems both very important, and also extremely difficult. I do think that trying to understand the underlying computational structure of the brain is quite useful for both thinking about Rationality and thinking about AI and AI Alignment, though it's also plausible to me that it's hard enough to get things right in this space that in the end overall it's very hard to extract useful lessons from this.
Despite the difficulties I expect in this space, this post does strike me as overall pretty decent and to at the very least open up a number of interesting questions that one could ask to further deconfuse oneself on this topic.
Promoted to curated! I held off on curating this post for a while, first because it's long and it took me a while to read through it, and second because we already had a lot of AI Alignment posts in the curation pipeline, and I wanted to make sure we have some diversity in our curation decisions. But overall, I really liked this post, and also want to mirror Rohin's comment in that I found this version more useful than the version where you got everything right, because this way I got to see the contrast between your interpretation and Paul's responses, which feels like it helped me locate the right hypothesis more effective than either would have on its own (even if more fleshed out).
Yep, fixed. Thank you!
Judging from the URL of those links, those images were hosted on a domain that you could access, but others could not, namely they were stored as Gmail image attachments, to which of course you as the recipient have access, but random LessWrong users do not.
Oh no! The two images starting from this point are broken for me:
Promoted to curated: These additions are really great, and they fill in a lot of the most confusing parts of the original Embedded Agency sequence, which was already one of my favorite pieces of content on all of Lesswrong. So it seems fitting to curate this update to it, which improves it even further.
Promoted to curated: This post is answering (of course not fully, but in parts) what seems to me one of the most important open questions in theoretical rationality, and I think does so in a really thorough and engaging way. It also draws connections to a substantial number of other parts of your and Scott's work in a way that has helped me understand those much more thoroughly.
I am really excited about this post. I kind of wish I could curate it two or three times because I do really want a lot of people to have read this, and expect that it will change how I think about a substantial number of topics.
This sounds fun! I probably won't have enough time to participate, but I do wish I had enough time.
I much prefer Rohin's alternative version of: "Are OpenAI's efforts to reduce existential risk counterproductive?". The current version does feel like it screens off substantial portions of the potential risk.
Promoted to curated: I think the question of whether we are in an AI overhang is pretty obviously relevant to a lot of thinking about AI Risk, and this post covers the topic quite well. I particularly liked the use of a lot of small fermi estimate, and how it covered a lot of ground in relatively little writing.
I also really appreciated the discussion in the comments, and felt that Gwern's comment on AI development strategies in particular help me build a much map of the modern ML space (though I wouldn't want it to be interpreted as a complete map o... (read more)
(Really minor formatting nitpick, but it's the kind of thing that really trips me up while reading, but you forgot a closing parenthesis somewhere in your comment)
Mod Note: Because of a bug in our RSS Import script, this post didn't import properly last week, which is why you are seeing it now. Sorry for the confusion and delay!
Note: This post was originally posted to the DeepMind blog, so presumably the target audience is a broader audience of Machine Learning researchers and people in that broad orbit. I pinged Vika about crossposting it because it also seemed like a good reference post that I expected would get linked to a bunch more frequently if it was available on LessWrong and the AIAF.
Promoted to curated: This is a technique I've seen mentioned in a bunch of places, but I haven't seen a good writeup for it, and I found it quite valuable to read.
Promoted to curated: I've been thinking about this post a lot since it has come out, and it is just a really good presentation of all the research on AI Alignment via debate. It is quite long, which has made me hesitant about curating it for a while, but I now think curating it is the right choice. I also think while it is reasonably technical, it's approachable enough that the basic idea of it should be understandable by anyone giving it a serious try.
Promoted to curated: This post is a bit more technical than the usual posts we curate, but I think it is still quite valuable to read for a lot of people, since it's about a topic that has already received some less formal treatment on LessWrong.
I also am very broadly excited about trying move beyond a naive bayesianism paradigm, and felt like this post helped me significantly in understanding what that would look like.
Promoted to curated: I really like this post. I already linked to it two times, and it clearly grounds some examples that I've seen people use informally in AI Alignment discussions in a way that will hopefully create a common reference, and allow us to analyze this kind of argument in much more detail.
Promoted to curated: I really liked this sequence. I think in many ways it has helped me think about AI Alignment from a new perspective, and I really like the illustrations and the way it was written, and how it actively helped me along the way thing actively about the problems, instead of just passively telling me solutions.
Now that the sequence is complete, it seemed like a good time to curate the first post in the sequence.
Oops, sorry for that. I restored your original comment.
You have an entire copy of the post in the commenting guidelines, fyi :)
Oops, sorry. My bad. Fixed.
Done! Daniel should now be able to edit the post.
Mod note: I edited the abstract into the post, since that makes the paper more easily searchable in the site-search, and also seems like it would help people get a sense of whether they want to click through to the link. Let me know if you want me to revert that.
I quite liked this paper, and read through it this morning. It also seems good to link to the accompanying Medium post, which I found a good introduction into the ideas:
I will leave this here, but not add it to the AI Alignment Newsletter sequence, since presumably the content there has already been edited.
Promoted to curated: This post has been an amazing resource for almost anyone who is working on AI Alignment every year. It also turns out to be one of the most comprehensive bibliographies for AI Alignment research (next to the AI Alignment Newsletter spreadsheet), which in itself is a various resource that I've made use of myself a few times.
Thank you very much for writing this!