Oliver Habryka

Coding day in and out on LessWrong 2.0. You can reach me at habryka@lesswrong.com

Comments

AMA: Paul Christiano, alignment researcher

When I begin a comment with a quotation, I don't know how to insert new un-quoted text at the top (other than by cutting the quotation, adding some blank lines, then pasting the quotation back). A way to do that directly would be great.

You can do this by pressing Enter in an empty paragraph of the quoted block, which should remove that paragraph from the block. See this gif: 

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

This is great, thank you! 

Minor formatting note: The italics font on both the AI Alignment Forum and LessWrong isn't super well suited to large blocks of text, so I took the liberty of unitalicizing a bunch of the large blockquotes (which should be sufficiently distinguishable as blockquotes without the italics). Though I am totally happy to reverse this if you prefer the previous formatting. 

Utility Maximization = Description Length Minimization

Promoted to curated: As Adele says, this feels related to a bunch of the Jeffrey-Bolker rotation ideas, which I've referenced many, many times since then, but in a way that feels somewhat independent, which makes me more excited about there being some deeper underlying structure here.

I've also had something like this in my mind for a while, but haven't gotten around to formalizing it, and I think I've seen other people make similar arguments in the past, which makes this a valuable clarification and synthesis that I expect to get referenced a bunch.

Deducing Impact

So secret that even a spoiler tag wasn't good enough.

Commentary on AGI Safety from First Principles

Promoted to curated: This is a long and dense post, but I really liked it, and I find this kind of commentary from a wide variety of thinkers in the AI Alignment space quite useful. It really helped me think about the implications of a lot of the topics discussed in the main sequence in much more detail, and in a much more robust way, and I have come back to this post multiple times since it was published. 

Also, of course, the whole original sequence is great, and I think it is currently the best short introduction to AI risk out there.

Understanding “Deep Double Descent”

I agree with this, and was indeed kind of thinking of them as one post together.

Six AI Risk/Strategy Ideas

I have now linked at least 10 times to the "'Generate evidence of difficulty' as a research purpose" section of this post. It was something I kind of wanted to point to before this post came out, but felt confused about, and this post finally gave me a pointer to it. 

I think that section was substantially more novel and valuable to me than the rest of this post, but that is also evidence that others might not have had some of the other ideas on their map either, and so might find the post similarly valuable because of a different section. 

Utility ≠ Reward

I think this post and the Gradient Hacking post caused me to actually understand and feel able to productively engage with the idea of inner optimizers. I think the paper and full sequence were good, but I bounced off of them a few times, and this helped me get traction on the core ideas in the space. 

I also think that some parts of this essay hold up better as a core abstraction than the actual mesa-optimizer paper itself, though I am not at all confident about this. But I have noticed that when I am internally thinking through alignment problems related to inner optimization, I more often think of Utility ≠ Reward than of most of the content in the actual paper and sequence. Though the sequence set the groundwork for this, so of course giving attribution is hard. 

Gradient hacking

adamshimi says almost everything I wanted to say in my review, so I am very glad he made the points he did, and I would love for both his review and the top level post to be included in the book. 

The key thing I want to emphasize a bit more is that I think the post as given is very abstract, and I have personally gotten a lot of value out of trying to think of more concrete scenarios where gradient hacking can occur. 

I think one of the weakest aspects of the post is that it starts from the assumption that an AI system has already given rise to an inner optimizer that is now taking advantage of gradient hacking. While that is definitely a sufficient assumption, I don't think it's a necessary one, and my current models suggest we should expect to find this behavior even without inner optimizers. This also makes me somewhat more optimistic about studying it. 

My thinking about this is still pretty fuzzy and in its early stages, but the reasoning goes as follows: 

  • If we assume the lottery-ticket hypothesis of neural networks, we initialize our network with a large number of possible models of the world. In a sufficiently large network, some of those models will be accurate models not of the world, but of the training process of the very system that is currently being trained. This is pretty likely, given that SGD isn't very complicated and it doesn't seem very hard to build a model of how it works.
  • From an evolutionary perspective, we are going to be selecting for networks that get positively rewarded by the gradient descent learning algorithm. Some of the networks that have an accurate model of the training process will stumble upon the strategy of failing hard if SGD would reward any other competing network, creating a small ridge in the reward landscape that results in the network itself getting most of the reward (this is currently very metaphorical, and I feel fuzzy on whether this conceptualization makes sense; see the toy sketch after this list). This strategy seems more complicated, so it is less likely to randomly exist in a network, but it is very strongly selected for, since at least from an evolutionary perspective it appears it would give the network a substantive advantage. 
  • By default, luckily, this will create something I might want to call a "benign gradient hacker" that might degrade the performance of the system, but not obviously give rise to anything like a full inner optimizer. This strategy seems simple enough that you don't need anything close to a consequentialist optimizer to run into it. Instead it seems more analogous to cancer, in that it's a way to hijack the natural-selection mechanism of a system from the inside to get more resources, and like cancer it seems more likely to just hurt the performance of the overall system than to take systematic control of it.
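
To make the "ridge in the reward landscape" metaphor a bit more concrete, here is a minimal toy sketch. It is my own construction, not something from the post or the mesa-optimization paper: two scalar "lottery tickets" `w_a` and `w_b` can each fit the task on their own, but the `b` circuit also contributes a hand-coded term that raises the loss whenever `w_a` is large, standing in for a sub-network that "fails hard" when its competitor is rewarded. Plain gradient descent on the combined loss then drives `w_a` to zero and hands the whole solution to `w_b`:

```python
import numpy as np

SABOTAGE = 0.5  # strength of b's hand-coded "fail hard if a is rewarded" term

def total_loss(w_a, w_b):
    task = (w_a + w_b - 1.0) ** 2   # either ticket alone could solve the task
    hack = SABOTAGE * w_a ** 2      # b's circuit adds loss that grows with a's weight
    return task + hack

def grads(w_a, w_b):
    d_task = 2.0 * (w_a + w_b - 1.0)
    return d_task + 2.0 * SABOTAGE * w_a, d_task  # d/dw_a, d/dw_b

w_a, w_b, lr = 0.5, 0.5, 0.1
for _ in range(200):
    g_a, g_b = grads(w_a, w_b)
    w_a -= lr * g_a
    w_b -= lr * g_b

# w_a is driven toward 0 and w_b toward 1, even though a alone could have fit the task
print(f"w_a={w_a:.3f}, w_b={w_b:.3f}, loss={total_loss(w_a, w_b):.5f}")
```

The point of the sketch is just that a term which punishes competing sub-networks is itself favored by the training signal; nothing in it requires the `b` circuit to be doing any optimization, which is why it looks more like cancer than like a deceptive inner optimizer.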

This makes me think the first paragraph of the post is somewhat wrong when it says: 

"Gradient hacking" is a term I've been using recently to describe the phenomenon wherein a deceptively aligned mesa-optimizer might be able to purposefully act in ways which cause gradient descent to update it in a particular way.

I think gradient hacking should refer to something somewhat broader that also captures situations like the above, where you don't have a deceptively aligned mesa-optimizer but still have dynamics that select for networks that adversarially use knowledge about the SGD algorithm for competitive advantage. Though it's plausible that Evan intends the term "deceptively aligned mesa-optimizer" to refer to something broader that would also capture the scenario above.

-----

Separately, as an elaboration, I have gotten a lot of mileage out of generalizing the idea of gradient hacking in this post to the more general idea that if you have a very simple training process whose output can often be easily predicted and controlled, you will run into similar problems. It seems valuable to try to generalize the theory proposed here to other training mechanisms and to study which training mechanisms are more easily hacked like this, and which ones are not. 

Eight claims about multi-agent AGI safety

I found this quite compelling. I don't think I am sold on some of the claims yet (in particular claims 5 and 6), but thanks a lot for writing this up so clearly. I will definitely take some time to think more about it.
