Coding day in and out on LessWrong 2.0. You can reach me at firstname.lastname@example.org
When I begin a comment with a quotation, I don't know how to insert new un-quoted text at the top (other than by cutting the quotation, adding some blank lines, then pasting the quotation back). That would be great.
You can do this by pressing enter in an empty paragraph of a quoted block. That should cause you to remove the block. See this gif:
This is great, thank you!
Minor formatting note: The italics font on both the AI Alignment Forum and LessWrong isn't super well suited to large block of text, so I took the liberty to unitalicize a bunch of the large blockquotes (which should be sufficiently distinguishable as blockquotes without the italics). Though I am totally happy to reverse it if you prefer the previous formatting.
Promoted to curated: As Adele says, this feels related to a bunch of the Jeffery-Bolker rotation ideas, which I've referenced many many times since then, but in a way that feels somewhat independent, which makes me more excited about there being some deeper underlying structure here.
I've also had something like this in my mind for a while, but haven't gotten around to formalizing it, and I think I've seen other people make similar arguments in the past, which makes this a valuable clarification and synthesis that I expect to get referenced a bunch.
So secret that even a spoiler tag wasn't good enough.
Promoted to curated: This is a long and dense post, but I really liked it, and find this kind of commentary from a large variety of thinkers in the AI Alignment space quite useful. I found that it really helped me think about the implications of a lot of the topics discussed in the main sequence in much more detail, and in a much more robust way, and I have come back to this post multiple times since it's been published.
Also, of course, the whole original sequence is great and I think currently the best short introduction to AI-Risk that exists out there.
I agree with this, and was indeed kind of thinking of them as one post together.
I have now linked at least 10 times to the heading on "'Generate evidence of difficulty' as a research purpose" section of this post. It was a thing that I kind of wanted to point to before this post came out, but felt confused about it, and this post finally gave me a pointer to it.
I think that section was substantially more novel and valuable to me than the rest of this post, but it is also evidence that others might have also not had some of the other ideas on their map, and so they might found it similarly valuable because of a different section.
I think this post and the Gradient Hacking post caused me to actually understand and feel able to productively engage with the idea of inner-optimizers. I think the paper and full sequence was good, but I bounced off of it a few times, and this helped me get traction on the core ideas in the space.
I also think that some parts of this essay hold up better as a core abstraction than the actual mesa-optimizer paper itself, though I am not at all confident about this. But I just noticed that when I am internally thinking through alignment problems related to inner optimization, I more often think of Utility != Reward than I think of most of the content in the actual paper and sequence. Though the sequence set the groundwork for this, so of course giving attribution is hard.
adamshimi says almost everything I wanted to say in my review, so I am very glad he made the points he did, and I would love for both his review and the top level post to be included in the book.
The key thing I want to emphasize a bit more is that I think the post as given is very abstract, and I have personally gotten a lot of value out of trying to think of more concrete scenarios where gradient hacking can occur.
I think one of the weakest aspects of the post is that it starts with the assumption that an AI system has already given rise to an inner-optimizer that is now taking advantage of gradient hacking. I think while this is definitely a sufficient assumption, I don't think it's a necessary assumption and my current models suggest that we should find this behavior without the need for inner optimizers. This also makes me somewhat more optimistic about studying it.
My thinking about this is still pretty fuzzy and in its early stages, but the reasoning goes as follows:
This makes me think that the first paragraph of the post seems somewhat wrong to me when it says:
"Gradient hacking" is a term I've been using recently to describe the phenomenon wherein a deceptively aligned mesa-optimizer might be able to purposefully act in ways which cause gradient descent to update it in a particular way.
I think gradient hacking should refer to something somewhat broader that also captures situations like the above where you don't have a deceptively aligned mesa-optimizer, but still have dynamics where you select for networks that adversarially use knowledge about the SGD algorithm for competitive advantage. Though it's plausible that Evan intends the term "deceptively aligned mesa-optimizer" to refer to something broader that would also capture the scenario above.
Separately, as an elaboration, I have gotten a lot of mileage out of generalizing the idea of gradient hacking in this post, to the more general idea that if you have a very simple training process whose output can often easily be predicted and controlled, you will run into similar problems. It seems valuable to try to generalize the theory proposed here to other training mechanisms and study more which training mechanisms are more easily hacked like this, and which one are not.
I found this quite compelling. I don't think I am sold on some of the things yet (in particular claims 5 and 6), but thanks a lot for writing this up this clearly. I will definitely take some time to think more about this.