Oliver Habryka

Coding day in and out on LessWrong 2.0

Comments

Understanding “Deep Double Descent”

I agree with this, and was indeed kind of thinking of them as one post together.

Six AI Risk/Strategy Ideas

I have now linked at least 10 times to the "'Generate evidence of difficulty' as a research purpose" section of this post. It was a thing I kind of wanted to point to before this post came out, but felt confused about, and this post finally gave me a pointer to it. 

I think that section was substantially more novel and valuable to me than the rest of this post, but that is also evidence that others might not have had some of the other ideas on their map, and so they might find it similarly valuable because of a different section. 

Utility ≠ Reward

I think this post and the Gradient Hacking post caused me to actually understand and feel able to productively engage with the idea of inner optimizers. I think the paper and full sequence were good, but I bounced off of them a few times, and this helped me get traction on the core ideas in the space. 

I also think that some parts of this essay hold up better as a core abstraction than the actual mesa-optimizer paper itself, though I am not at all confident about this. But I have noticed that when I internally think through alignment problems related to inner optimization, I more often think of Utility != Reward than of most of the content in the actual paper and sequence. Though the sequence laid the groundwork for this, so of course attribution is hard. 

Gradient hacking

adamshimi says almost everything I wanted to say in my review, so I am very glad he made the points he did, and I would love for both his review and the top level post to be included in the book. 

The key thing I want to emphasize a bit more is that I think the post as given is very abstract, and I have personally gotten a lot of value out of trying to think of more concrete scenarios where gradient hacking can occur. 

I think one of the weakest aspects of the post is that it starts from the assumption that an AI system has already given rise to an inner optimizer that is now taking advantage of gradient hacking. While that is definitely a sufficient assumption, I don't think it's a necessary one, and my current models suggest that we should expect to find this behavior even without inner optimizers. This also makes me somewhat more optimistic about studying it. 

My thinking about this is still pretty fuzzy and in its early stages, but the reasoning goes as follows: 

  • If we assume the lottery-ticket hypothesis of neural networks, we initialize our network with a large number of possible models of the world. In a sufficiently large network, some of those models will be accurate models not of the world, but of the training process of the very system that is currently being trained. This is pretty likely, given that SGD isn't very complicated and it doesn't seem very hard to build a model of how it works.
  • From an evolutionary perspective, we are going to be selecting for networks that get positively rewarded by the gradient-descent learning algorithm. Some of the networks that have an accurate model of the training process will stumble upon the strategy of failing hard whenever SGD would reward any competing network, creating a small ridge in the reward landscape that results in that network itself getting most of the reward (this is currently very metaphorical and I feel fuzzy on whether this conceptualization makes sense; I try to make it slightly more concrete in the toy sketch after this list). This strategy is more complicated, so it is less likely to randomly exist in a network, but it is very strongly selected for, since at least from an evolutionary perspective it looks like it would give the network a substantive advantage. 
  • By default, luckily, this will create something I might want to call a "benign gradient hacker", which might degrade the performance of the system but does not obviously give rise to anything like a full inner optimizer. This strategy seems simple enough that you don't actually need anything close to a consequentialist optimizer to run into it. It seems more analogous to cancer, in that it's a way to hijack the natural-selection mechanism of a system from the inside to get more resources, and, like cancer, it seems more likely to just hurt the performance of the overall system than to take systematic control over it.
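
To make the "ridge in the reward landscape" picture above a bit less metaphorical, here is a minimal toy sketch (my own construction, with made-up numbers, not anything from the post or the mesa-optimization paper): a hand-built loss surface over the mixing weights of two "tickets", where the global optimum lies with the competitor, but a steep "fail hard" penalty kicks in as soon as the optimizer starts shifting weight toward it, so plain gradient descent stalls and the hacking ticket keeps most of the weight.

```python
import numpy as np

def loss_and_grad(w_hacker, w_competitor):
    """Toy loss over the mixing weights of two sub-networks ("lottery tickets").

    The competitor's basin is the true global optimum (loss ~0 near
    w_competitor = 2), but a steep penalty ridge around w_competitor = 0.5
    stands in for a component that "fails hard" as soon as SGD starts
    rewarding its rival.
    """
    u = w_competitor - 0.5
    base = (w_hacker - 1.0) ** 2 + 0.5 * (w_competitor - 2.0) ** 2
    ridge = 10.0 * np.exp(-u ** 2 / 0.02)  # the hand-coded "fail hard" penalty
    d_hacker = 2.0 * (w_hacker - 1.0)
    d_competitor = (w_competitor - 2.0) - 1000.0 * u * np.exp(-u ** 2 / 0.02)
    return base + ridge, np.array([d_hacker, d_competitor])

w = np.array([0.0, 0.0])  # start with neither ticket favoured
for _ in range(20_000):
    value, grad = loss_and_grad(*w)
    w = w - 0.01 * grad  # plain gradient descent

print(f"hacker weight {w[0]:.2f}, competitor weight {w[1]:.2f}, loss {value:.2f}")
# Ends up stuck near (1.0, 0.2) at loss ~1.7, even though (1.0, 2.0) with
# loss ~0 exists: the optimizer never crosses the ridge.
```

Of course this hard-codes the ridge instead of deriving it from a trained network, so it only illustrates why such a ridge would trap gradient descent, not how a network would come to implement one.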

This makes the first paragraph of the post seem somewhat wrong to me, where it says: 

"Gradient hacking" is a term I've been using recently to describe the phenomenon wherein a deceptively aligned mesa-optimizer might be able to purposefully act in ways which cause gradient descent to update it in a particular way.

I think gradient hacking should refer to something somewhat broader that also captures situations like the above where you don't have a deceptively aligned mesa-optimizer, but still have dynamics where you select for networks that adversarially use knowledge about the SGD algorithm for competitive advantage. Though it's plausible that Evan intends the term "deceptively aligned mesa-optimizer" to refer to something broader that would also capture the scenario above.

-----

Separately, as an elaboration, I have gotten a lot of mileage out of generalizing the idea of gradient hacking in this post to the broader idea that if you have a very simple training process whose output can often be easily predicted and controlled, you will run into similar problems. It seems valuable to try to generalize the theory proposed here to other training mechanisms, and to study which training mechanisms are more easily hacked like this and which ones are not. 

Eight claims about multi-agent AGI safety

I found this quite compelling. I am not yet sold on some of the claims (in particular claims 5 and 6), but thanks a lot for writing this up so clearly. I will definitely take some time to think more about this.

Reframing Superintelligence: Comprehensive AI Services as General Intelligence

I think the CAIS framing that Eric Drexler proposed gave concrete shape to a set of intuitions that many people have been relying on for their thinking about AGI. I also tend to think that those intuitions and models aren't actually very good at modeling AGI, but I nevertheless think it productively moved the discourse forward a good bit. 

In particular I am very grateful for the comment thread between Wei Dai and Rohin, which really helped me engage with the CAIS ideas, and which I think was necessary for getting me to my current understanding of CAIS and to the point of passing the basic ITT of CAIS (which I think I have managed to do in a few conversations I've had since the report came out). 

An additional reference that has not been brought up in the comments or the post is Gwern's writing on this, under the heading: "Why Tool AIs Want to Be Agent AIs" 

2020 AI Alignment Literature Review and Charity Comparison

Promoted to curated: Even if God and Santa Claus are not real, we do experience a Christmas miracle every year in the form of these amazingly thorough reviews by Larks. Thank you for your amazing work, as this continues to be an invaluable resource to anyone trying to navigate the AI Alignment landscape, whether as a researcher, grantmaker or independent thinker.

TAI Safety Bibliographic Database

Unfortunately, they are only sporadically updated and difficult to consume using automated tools.  We encourage organizations to start releasing machine-readable bibliographies to make our lives easier.

Oh interesting. Would it be helpful to have something on the AI Alignment Forum in the form of some kind of more machine-readable citation system, or did you find the current setup sufficient? 

Also, thank you for doing this!

Soft takeoff can still lead to decisive strategic advantage

Yep, you can revise it any time before we actually publish the book, though ideally you would revise it before the vote, so people can be compelled by your amazing updates!

Evolution of Modularity

Coming back to this post, I have some thoughts related to it that connect it more directly to AI Alignment, which I want to write up, and which I think make this post more important than I initially thought. Hence I am nominating it for the review. 
