Evan Hubinger on Homogeneity in Takeoff Speeds, Learned Optimization and Interpretability

Michaël Trazzi

Below is the transcript of my chat with Evan Hubinger, interviewed in the context of The Inside View, a podcast about AI Alignment.

Michael: To give a bit of context to the viewers, MIRI is the Machine Intelligence Research Institute. Can you give a brief recap of how long you have been working there, what you do there, and what you were doing before, like in the past few years?

Evan: Yeah, so I work at MIRI, I'm a research fellow there. Very broadly, I think about inner alignment, which is sort of the problem of how do we align the models that we train with the sort of objectives that we're trying to train them on? I tend to think about this problem from a prosaic perspective, from the perspective of thinking about concrete machine learning systems, and also from a theoretical perspective, so trying to think about machine learning systems and understand what they're doing using more theoretical tools and abstract reasoning rather than concrete experiments. So that's broadly what I work on and what I think about.

Michael: You're talking about empirical work, so I remember when I first learned about you, it was because I was at a conference in 2018 with Vladimir Mikulik, and it was just after the MIRI Summer Fellows program and you guys were writing the mesa-alignment paper. It was mostly theoretical. And then I think you worked at OpenAI, on theory there, but the whole company was more, you know, experiment focused. And also I think before that you had more of a software engineering background. So you have different interests and different expertise in both domains.

Evan: Yeah, so I did do a bunch of software engineering stuff before I got into AI Safety. The biggest thing that I might be known for in that domain is that I wrote a somewhat popular functional programming language called Coconut. And then, actually, the first thing I did in AI Safety was an internship at MIRI doing sort of more functional programming, type theory stuff, but all sort of software engineering. Then I went to the MIRI Summer Fellows program and worked with Vlad and other people on the Risks from Learned Optimization paper. And then after that, when I finished my undergrad, I went to OpenAI, and I did some theoretical research with Paul Christiano there. And then when that was done, I joined MIRI as a full-time researcher. And that's where I have been for the past year and a bit.

Michael: For people not familiar with AI Alignment, which I think is not most of the listeners, Paul Christiano is one of the OGs in empirical AI Alignment research now, after Yudkowsky. So interning with him is a pretty high bar, and it's pretty good to have done that after your undergrad. And yes, so the library you built was about functional programming, and the stuff at MIRI was also functional programming. If I remember, MIRI has had one of the leading programmers in functional programming, mostly on Haskell, maybe I'm wrong.

Evan: Are you talking about Ed Kmett?

Michael: Yes.

Evan: Yep, he works at MIRI. He is a really big Haskell guy.

Michael: Coconut is more like an interpreter or something on top of Python?

Evan: It compiles to Python. The syntax is a superset of Python 3, it compiles to any Python version, and it also has lots of functional features and stuff.

Evan: It compiles to Python source.

Michael: OK, Python source. So is it very poorly optimized? If you need to, like, put something in between that converts to Python code, then it would be super slow, maybe.

Evan: Not super, it's like the same speed as Python because you just compile it to Python and then you run the Python.

Michael: OK, Python source. But when you're compiling, you're not doing perfect work.

Evan: But you don't compile at run time. Like with C, you're not like, well, the speed of C is, first you have to compile it and compiling takes a really long time, then you have to link it and linking takes a really long time, and then you have to run it. You're like, no, the speed of C is you compile it beforehand, and then you check how long it takes to run.
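Editor's illustration of the compile-once, run-later point above. This is only an analogy under stated assumptions: it uses Python's built-in `compile` (source to bytecode), not Coconut itself, and the `source` string and `namespace` dictionary are hypothetical names chosen for the sketch.

```python
# Analogy only: Python's built-in compile() stands in for an ahead-of-time
# compiler. Coconut itself is not used or imported here.
source = "total = sum(i * i for i in range(1000))"

# Compilation happens once, before the program is run.
code = compile(source, "<example>", "exec")

# Running the already-compiled code does not pay the compilation cost again,
# so run-time speed is the speed of the compiled output.
namespace = {}
exec(code, namespace)
```

The design point is the same one Evan makes about C: what matters for run-time speed is the cost of the final `exec` step, not the one-off `compile` step.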

Michael: OK, well, I get it now. Um, yeah, and I think that's especially interesting to me because I think there's a lack of open source, at least in the AI Alignment space. So even if Coconut is not specifically for AI Alignment, I think functional programming might be useful for MIRI at some point.

Evan: Yeah, it was definitely how I first got into doing stuff at MIRI, because MIRI was doing a lot of functional programming stuff and I had a strong functional programming background and they were like, you should come do some stuff here.

Michael: I think at some point they were hiring more programmers, I don't know if that's still the case. Are they still hiring more programmers?

Evan: Yeah, I think that has changed. I don't know exactly what the current status is, I'm not the person to talk to about that.

Michael: OK, no problem. Yeah. So I guess now your job is mostly writing interesting research on the AI Alignment Forum, sometimes posting it on arXiv. And yeah, I think your posts are very precise in terms of vocabulary and terminology, and clear, and also short. So you can read one without spending too long and still understand most of the points. That's a good thing. Most people don't try to distill what they think about. And you also try to give concrete proposals for how to solve this, which is kind of a shift I've seen in the past three or four years, with people having their concrete agendas on how to solve things. So one contrarian view, not contrarian, but some view you had that was opposite to most of the AI Alignment Forum, which is a forum for talking about AI Alignment: most people were talking about takeoff speeds, like how fast, trying to clarify what Paul Christiano meant by fast, what Bostrom meant by fast or slow takeoff, and then you mentioned something else, which was homogeneous versus heterogeneous takeoff. So maybe you can talk a bit about that, like summarize the blog post a bit.

Evan: There's a lot of discussion where people talk about takeoff speeds, about fast versus slow takeoff, continuous takeoff versus discontinuous takeoff.

Michael: You can even, like, summarize takeoff. Takeoff is maybe a bit poorly defined. You can even define that if you want.

Evan: When people talk about AI takeoff they're sort of talking about, well, at some point we're going to build really powerful AI systems and then something happens after that. What does that look like? Do we build the first really powerful AI system and then it very quickly dominates the world, it has a decisive strategic advantage, and the only thing that really matters is what that system does? Or do we build the first AI system and it's really powerful and transformative, but then we build another system and another system, and there's a lot of different systems that all exist in the world at the same time?

Michael: Multipolar scenarios.

Evan: Yes, unipolar versus multipolar. There's a lot of different things you can talk about. So how quickly do the systems get better? Are there sort of big discontinuities in how they get better? How concentrated is power over these systems, et cetera, et cetera. One thing that I have sort of talked about in the past in this regard is this homogeneity idea, which is, I guess in my eyes, the axis that I care about most. It feels like the most relevant and also the one that I feel most confident in and can make more definitive statements about, where homogeneity is saying: a homogeneous takeoff is one where all of the AIs are basically equivalently aligned, and an inhomogeneous takeoff is one where we have a bunch of different AIs that are sort of varying degrees of aligned. So there's a lot of different things that happen in these different situations, and there's also sort of different aspects of homogeneity. By default, I sort of mean alignment, but we can also talk about homogeneity of other aspects of how the AIs are built. So I expect quite a lot of homogeneity, I think, by default. I expect us to be in a situation where we have a lot of different AIs running around, but all of those AIs are basically just kind of copies of each other, or very similar. Or in a situation where we just have a very small number of AIs. If you have only one AI, then it's sort of homogeneous by definition.
And so I think this is in some sense the more important dimension. A lot of times when people talk about this really fast takeoff scenario, it means that we have to get that first AI system totally correct, because if we don't get the first AI system totally correct in a really fast takeoff scenario, it very quickly controls all the resources, and the only thing that matters is that system. Whereas in a slow takeoff scenario we get the one system, but then there's a lot of other systems that are competing for power and resources, and we sort of have the opportunity to intervene and control things more as it continues to develop. And my take is something like: I don't think, even in the second scenario, that we actually have the ability to really do much, even if there's lots of different AI systems running around competing for resources after the point at which we build the first powerful advanced AI system, given that I expect all of those other AI systems to basically just be copies of the first system. Because if they're all just kind of copies of each other, then what really matters is, did we align that first system properly? So do we have a bunch of systems running around that are all basically aligned, or do we have a bunch of systems running around that are all basically misaligned? And so, therefore, I'm like, well, if you believe in homogeneity, then that basically means you have to believe that the first powerful, advanced system that we build is really important and critical, and aligning it is the most important thing, regardless of whether you believe we're actually going to end up in a very fast takeoff scenario. And so then there's the question of why do I believe in homogeneity? I think basically I believe in homogeneity for a couple of reasons.
First is that I expect pretty strongly that we'll be in a regime where the cost of training your AI is just much, much, much higher than the cost of running it. And this creates a bunch of particular economic incentives. So one thing that it does is it means that you would generally rather use somebody else's AI whenever possible than have to train your own. And also in the situations where you do have to train your own AI, because, let's say, you don't want to use your geopolitical rivals' AI or whatever, then you're probably going to do so very conservatively. Like, if you have to spend a trillion dollars to train your one AI system just because you don't trust theirs, like you're the US and you don't trust China's AI, then you're going to spend that trillion dollars pretty conservatively. You're just going to be like, well, we basically kind of know what they did. Let's just do the same thing, because we really don't want to spend a trillion dollars and have it not work. I think there's some other assumption here, which is like, well, I think basically if you're running essentially the same training procedure, then you get essentially the same degree of alignment. And the really small piddly details are not what contributes very much to whether your thing is aligned. It's mostly like, do you basically get the incentives and biases right or not.

Michael: Right. So I guess your example with the US and China would be something like GPT-3 taking $5M or $20M to train, maybe much more to like pay the salaries and the experiments that went beforehand. This is like relatively cheap for a government now. But if we're talking about billions of dollars, then maybe it's like more expensive or trillions for an entire country.

Evan: So we get to the point where it's like a substantial fraction of the entire GDP of your country. You really don't want to like be in a situation where it's like we're going to spend one percent of our GDP, like there's this other country that has already done this, like built a really powerful AI and has demonstrated that it works in this particular way, like, let's say they use Transformers or whatever. Like you really don't want to then spend one percent of your entire GDP trying to make it work with LSTMs. You'll just go like "no they did it with Transformers, we're gonna do the same thing. Like we just want to have our own version of what they did." And so my default expectation is this sort of conservatism, which is like, well, probably people are just gonna copy what other people are doing. And so, like, it really matters what the first thing is that's like super impressive enough that it gets everyone to copy it.

Michael: Sure. So if we take the example of GPT-3, then they didn't release the weights and it was super expensive to reproduce. According to different sources, Google might have reproduced it pretty quickly, but it's not public information. Then there's EleutherAI, who tried to reproduce it for months. And then after like six months or something they produced something that had somewhat similar results. I'm not sure how close it is.

Evan: Then what they're aiming for is basically a reproduction. It's not worth spending all of those resources if you don't already have the evidence that it's like going to succeed.

Michael: They know it's going to succeed, there's a paper where it's known to succeed. So at least you have the architecture and, you know, the outputs, the loss, you know what the expected behavior is. But they're surely not sharing the weights or the data. So you have to both scrape the data on the internet and then get the source code.

Evan: But still the data collection procedure was pretty similar, right. Like, the basic data collection procedure is we're just going to scrape a large, you know, swath of Internet data. You know, maybe they didn't explicitly release the data. But I guess my take is that if you have essentially the same data collection procedure, you have essentially the same architecture, you have essentially the same training process, then you're basically going to get the same alignment properties out the other end. And some people would disagree with that. And we can talk about why they would believe that.

Michael: I think I broadly agree with that. I might just be pointing at, maybe in the future when we get closer and closer to some kind of human-level AI, then maybe people might share less of the research about, like, how they collect the data and stuff, and we just have super-hard-to-replicate results because we don't even have the architecture or something. We just have the output or, hum. I'm not sure if the OpenAI of 2025 or 2030 will actually share everything with the other companies, and how long it will take to reproduce it will depend... [Maybe there will be] a big enough gap that they can have a comparative advantage that leads them to lead the race or something.

Evan: It's hard to keep just your basic methods secret. I mean, currently, most of these companies don't even try. Right. Like you're saying, you know, OpenAI just publishes papers on exactly what they did. They don't publish the weights, but they publish the paper saying what they did. Same thing with DeepMind, Google, basically. And I think a part of the reason for this is that even if you wanted to keep it secret, it's pretty hard to keep these very general ideas secret, because one person leaves and then, you know, they can explain the general ideas. They don't need to steal anything from you. It's just like they already have the idea, and the idea is very general.

Michael: The moment you hear about it, it's like, when GPT-3 became mainstream, let's say July, it was released in maybe May, and maybe they had the results in January or something. They had some close results and then they had to make it better, improve it for publishing it at NeurIPS or something. For Dall-E maybe they knew about multimodal for a while. And at DeepMind I think the policy is that they keep their research private, a bunch of research is going on for like months, and then they try to publish it in Nature Neuroscience, or like Nature. And so you have all those deadlines for papers where they have a six-month advantage, or maybe a year advantage, where they all have private information that they don't share with other companies.

Evan: Yeah, there's definitely a delay. But like at the point where you have this really powerful system and you start doing things with it such that other people start wanting to build their own really powerful systems. I expect convergence.

Michael: Convergence of people sharing stuff?

Evan: Like people are going to figure out what you're doing, even if you don't try to share things. It's just too hard to keep it secret when you have this big model, you're out there and you're doing all of this stuff. People are going to figure out, like, what the basic process was that led you to...

Michael: Like reverse engineering or social hacking?

Evan: Like you just need one person to basically describe the idea. Honestly, I expect, it's so insecure. Like, can you imagine trying to keep the basic idea, like "we used attention", secret? It's just not gonna happen.

Michael: We used a transformer, but please don't tell anyone.

Evan: I think it's totally practical to keep your weights secret, because if somebody wants to steal your weights, they have to actually copy them and exfiltrate the data. And it's clearly illegal. And, you know, it might still happen. I think it's still quite possible that you'll have people hacking, trying to steal your weights or whatever.

Evan: It's at least plausible that you could keep your weights secret, but there's just no way to keep "we used attention to do really big language models" secret.

Michael: Hmm, I think one example that goes in your direction is Dall-E. This GitHub account, lucidrains, reproduced Dall-E in a couple of weeks I think, maybe less than two months, something like that. I think Dall-E was maybe end of December, beginning of January, and lucidrains published it in maybe February or something. So for some experiments it gets faster and faster to reproduce them. I was talking to someone else at OpenAI who told me that he expects multipolar scenarios to be kind of the default because, as an entire community, we get better and better at reproducing what the best labs do. So the time between something like GPT-3 and its reproduction gets shorter over time. I guess one counterargument or question I had about this blog post was: you say that when the first one is aligned, people will copy it, and by default the copies will be aligned because they have the same architecture. But imagine the thing that makes it aligned is like some kind of laws of robotics, or "don't kill people", or "be aligned with human values" and stuff. If you're trying to be adversarial and beat the first AI, or the smartest AI alive, by not implementing these alignment features, you could just be very adversarial and attack the first one, who would not attack you back because it's human-aligned, right?

Evan: One thing which you could imagine is a situation where the first person, the first organization or whatever, builds an AI, they successfully make it aligned, and then somebody else decides to build an AI and not do the alignment part. This relies on there being some separability between the alignment parts and the non-alignment parts, which is unclear. I'm not sure if I actually expect there to be this sort of separability. But then it also relies on you not caring about alignment as a good desideratum, and that seems really unlikely. Alignment is pretty useful for almost anything you want to use your AI for, like you would really like your AI to do the things you want it to do. And it doesn't really seem much different when we're in a situation, which is what I expect, where these training runs have a truly astronomical cost, right. Where your little research lab in a university or whatever isn't going to be able to replicate the biggest runs because the thing costs, you know, billions of dollars for a single training run, trillions of dollars or whatever. Then you're in a situation where you really don't want to risk spending a billion, a trillion dollars or whatever and have it not be aligned, have it not do the thing which you want it to do. You're going to copy whatever alignment features the other team did that they successfully demonstrated. It might be the case that, in fact, the first thing only looks aligned, maybe it's not really aligned. You're still going to copy it, I think. And so this is why I sort of have this view, which is like, well, even in a very multipolar, continuous, slow takeoff world, we still basically just care about making sure that the first really powerful system is aligned in the traditional, conventional way that we think about in the fast, discontinuous takeoff world.

Michael: In one of your posts you're trying to clarify alignment terms. I think Paul Christiano tried to clarify it in one of his blog posts, on Medium, which is that alignment is basically the AI doing what you tell it to do. So very basic terms. And then you distinguish something like impact alignment and intent alignment, where intent alignment is trying to do what the human wants you to do, and impact alignment is not destroying the universe and not causing harm to humans. Is that essentially correct, or do you want to maybe nuance it?

Evan: Yeah. So impact alignment is in fact not doing bad things. Intent alignment is like trying to not do bad things, in my definition. And then we can further split each of these things down, so intent alignment can be split into outer alignment, which is like, is the objective that it's trained on aligned, and this objective robustness thing, which is like, is it actually robust according to that objective across different possible situations.

Evan: The whole post has a sort of breakdown where it looks at, ok, how do we break down these different ways of thinking about alignment into different sub-problems.

Michael: Yeah, I was trying to find the actual picture from the blog post because it's pretty good, but yeah, I think it is in clarifying Alignment terminology. I was just trying to see if I could make it my background for fun. But I think it's pretty hard, let's see. Yeah. So what you were mentioning before is that companies would want something like intent alignment. Or like something that does what the Chinese government wants it to do, and if the Chinese government wants to kill everyone in the US, or take over the world, then it must be intent aligned in the sense of trying to do the same thing that the Chinese government wants it to do. But it doesn't mean that it won't kill people in other countries, right? The second actor might just want to have something useful without it being beneficial for the entire human race?

Evan: Yes. So even if you solve the intent alignment style problem, and even if you have a very homogeneous world, you still have situations where you have standard human coordination problems, where you have to be able to coordinate when, you know, one human organization is trying to fight some other human organization. And then the hope is that human organizations fighting other human organizations is a problem that we can mostly solve via normal social and political dynamics.

Evan: It's still hard. We did make it through the Cold War without any nuclear bombs being dropped, but it was close. Hopefully, if we're in a similar situation, we can make it through something like that again. But at least we want the opportunity to be able to do so, right? We want humans to be the ones that are in control of the levers of power, such that it is even possible for humanity to coordinate and make it through a Cold War style situation. If humans don't have access to those levers of power at all, then we can't even do that.

Michael: It's a necessary condition to have a peaceful outcome, not a sufficient one? I think that's essentially right. I agree with it. And then I guess some people might say that the political problem is maybe the hardest one. Like, if we have some kind of very authoritarian regime that, I don't know, says you can get into trouble if you work on AI, on too-advanced AI, and everyone is doing good old manufacturing jobs or agriculture jobs. Then if we solve the political stuff first and we have some large peace on [inaudible] government, then we have more time for AI safety. That's a bit like the steelman of the other position.

Evan: Yeah, I mean, certainly I think there's lots of ways which you can approach this problem, or just like AI existential risk in general. There are like social and political aspects of things that are worth solving, as well as technical aspects. I think that currently I'm like, well, I feel like most of the existential risk comes from we're just not going to be able to solve the technical problem.

Michael: Most researchers working on this, I guess, think similar things, that it's pretty tractable. I think the solutions you give in your other blog posts seem pretty tractable. Solving politics or human coordination or fighting against Moloch seems a bit harder. I agree that it's worth thinking about. I think we covered most of this blog post, and then, sure, you had another one which was interesting to me because I worked a bit on AI Alignment research myself and did some open sourcing as well, on quantilizers. And so that's what I'm a bit familiar with, in terms of research. And you worked on a post on, like how to... Quantilizers, if I were to summarize, would be "we try to have AIs that perform in some kind of human way without being too bad at the task". So if you have a human demonstrating a task, let's say a robotic task or some game playing or feeding a baby or something, you want the AI not to find the optimal action, because the optimal action would kind of hack the game. But you also want it to perform well and not do stupid things. So if the human is, like, drinking alcohol a tenth of the day, you don't want the AI to drink alcohol. You want the AI to do the normal human stuff in the afternoon. So a quantilizer is essentially taking those 10 percent, or this quantile of actions that are good and are still human-like, so the AI tries to imitate those actions. I guess there are other ways of seeing quantilizers, this is one way of looking at it. And so in your post, you're kind of talking about how Yudkowsky defines bits of optimization, which is: when you have this interval of length one, of probability mass, then if you have something that is in the half with the highest utility, you're somehow cutting the space in half. So you have one bit of optimization or something. And then the closer you get to the optimum, the more bits of optimization you need. Yeah, maybe you can say that better and nuance it.
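Editor's illustration of the quantilizer description above: a minimal sketch, assuming a finite list of sampled human actions and a known utility function. The names `quantilize`, `bits_of_optimization`, and `human_actions` are hypothetical, and the uniform choice over the top fraction is a simplification of sampling from a base distribution conditioned on its top-q quantile.

```python
import math
import random

def bits_of_optimization(q):
    """Bits needed to narrow a base distribution down to its top-q fraction."""
    return -math.log2(q)

def quantilize(actions, utility, q, rng):
    """Pick uniformly among the top-q fraction of `actions` ranked by `utility`.

    `actions` stands in for samples from a human (base) distribution, so the
    result stays human-like while avoiding the worst behaviour.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(actions, key=utility, reverse=True)
    keep = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:keep])

rng = random.Random(0)
human_actions = list(range(100))  # hypothetical actions; utility is the action itself
picked = quantilize(human_actions, utility=lambda a: a, q=0.1, rng=rng)
```

Under this reading, keeping the top half (`q = 0.5`) corresponds to one bit of optimization, and the 10 percent quantilizer from the example corresponds to a little over three bits.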

Evan: I guess you're referring to my post on operationalising compatibility with strategy stealing.

Michael: Exactly.

Evan: In that post I talk a little bit about optimization power and quantilizers. I give a definition of optimization power in terms of quantilizers, and then I try to relate this to the strategy stealing assumption and value-neutrality verification. I'm happy to talk a little about this post, I think it's certainly interesting, I wrote it. Maybe I'll start with strategy stealing. I think strategy stealing is an interesting thing because there's a lot of different ways to think about it. The very simple thing is that there's this formal result, which is, well, you have a game and it's symmetric, we have a bunch of different players and they all have, you know, access to essentially the same actions and stuff. Then you can always just sort of copy the strategies of any other players, such that you can never do worse in expectation. So if there are N players, you can always get 1/N of the total resources. In any situation, as long as it's symmetric, you can sort of make this work.
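Editor's illustration of the strategy-stealing argument above, as a toy sketch under stated assumptions rather than the formal result. It assumes a hypothetical symmetric contest in which each player's strategy has a numeric strength and one unit of resources is split in proportion to strength; `shares`, `steal_best`, and the strengths in `others` are made up for the example.

```python
from fractions import Fraction

def shares(strengths):
    """Split one unit of resources in proportion to each player's strength."""
    total = sum(strengths)
    return [Fraction(s, total) for s in strengths]

def steal_best(others):
    """Player 0 copies the strongest strategy used by any other player."""
    return [max(others)] + list(others)

others = [3, 7, 5]                      # hypothetical strategy strengths
result = shares(steal_best(others))     # result[0] is the copying player
n_players = len(result)
```

Because the copied strength is at least as large as every other player's, the copier's share in this toy game is at least 1/N. The symmetry assumption is doing the work: if some players could not use the copied strategy equally well, the guarantee would no longer hold.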

Evan: What does this mean? Well, one of the ways in which we can interpret this, and Paul sort of talks a lot about this, is that we can think about this as giving us a general desideratum for AI safety, because we are currently in a situation where humans control most of the resources. And so we should expect that, by default, that distribution of resources shouldn't change over time, because, you know, if other agents are introduced and they start with very little resources, then we can just sort of steal whatever strategy they have. And in expectation, we should have the same proportion of total resources at any future point in time that we have right now. But then this fails for AI, because AI has this property that it might be systematically biased towards certain values over other values. So, for example, values which are very easy to specify, such that we can build reward functions for them really easily, will win out systematically. And so this breaks strategy stealing, because now it's no longer symmetric, because some people's values that are easy to specify will systematically [inaudible]. Similarly, values that are really simple, such that they're really easy to find by default in training processes, will systematically [inaudible]. One thing you might want from an aligned training process is that it's not systematically better for some of these sorts of values than other values. And in particular, that it's not systematically better for simple or easy to specify things than for the actual human values that we care about. And so one way in which we can define that notion is using this concept of optimization power and asking, you know, to what extent is it applying more optimization power?
Is it able to apply more optimization power to some sorts of tasks than other tasks? I have this example, we consider Sundar Pichai, who is the CEO of Google, and he wants a bunch of different things. He wants to be happy and for his family to be happy, but also he wants Google to make money. And so he has a really powerful AI system and it's trying to do what he wants. And so he's like, ok, you know, here are some things I want you to do. I want you to find some ways to make Google money and also find some ways to help me, you know, figure out my life. And also, he probably cares about humanity and wants humanity to be in a good spot overall, but he also wants to make money, obviously. And so this AI goes out and tries to do this. But there's a real problem if the AI is just much, much, much better at making money for Google than it is at any of these other things. Because then it goes out and it makes a bunch of money for Google, and it's really, really good at making money for Google, but it's very bad at doing any of those other things. It doesn't really know how to put the world in a good spot for humanity. It doesn't really know how to make Sundar happy. It doesn't really know how to do any of these other things that Sundar cares about. And so from Sundar's perspective, what ends up happening is that some of his values lose out to his other values. And so in the long run, we end up in a situation where we've built AIs that systematically favor the enhancement of certain values, the enhancement of competitive values, like getting more money for Google, at the expense of these other values, like, you know, is the world good? And this is bad, so we'd like to avoid this.

Michael: Right. I think that it kind of resonates with how easy it is to hack the human brain and optimize for Facebook ads or TikTok views, while it's harder to specify "make humans happy in the long term". So we would kind of converge towards easy-to-hack-brains behaviors, and maybe even optimizing for the crypto market or the trading market is something with very little information and very few dimensions compared to visual inputs. So maybe AI would be good at things that are easy to do now and that are tractable in terms of input space, I guess. But then for Sundar, what you're saying is that AI would converge to what is easy, and if what is easy is maximizing profit then it will do that instead of other things. But if it understood that what Sundar wanted was actually Google making money to benefit the world and make his life good, it would not create a bad Google doing bad for humanity and having Sundar work overnight and not spend time with his family or something.

Evan: That's unclear, right? You can imagine a situation that is just kind of like the current world, but where we know how to build AI systems that do very simple tasks that we can specify, and we don't know how to build systems that do really complex, hard-to-specify tasks. Then we could very easily end up in a situation where, due to competitive pressure, Sundar is just kind of forced to use these systems and sacrifice all these other values to make sure that Google is able to make money. Because the only powerful actors in the world are these AIs, but they can only be used for these sort of simple tasks, you're forced competitively to keep deferring to them and giving them more power and resources to be able to give you more power and resources, such that you never get to a point where you can actually use that power and those resources for what you actually care about. To be clear, this isn't the sort of world that I expect by default, but it's worth pointing out as a way of thinking about a particular type of alignment problem that is not the traditional alignment problem and isn't necessarily solved even if you solve other aspects of the alignment problem.

Michael: Interesting. So, yeah, if you don't solve other problems, then you might end up here. And I guess the thing is some kind of Google... I see it as a very powerful Siri or Google Home, where it would be like a good oracle, like Sundar Pichai coming home and asking his Google Home "What's the best strategy for tomorrow?" I guess somehow it's not that far away. Maybe strategy-wise, running a company like Google is hard, but chatbots that you can talk to and ask for simple decisions, I don't know. And what was the link with the optimization thing?

Evan: Mathematically, we can use optimization power to give a definition of this.

Michael: I think that's interesting because in the first two episodes I had the thing I called Connor's rule, because Connor Leahy had been on other podcasts, like Machine Learning Street Talk with Yannic Kilcher and others, and they went on defining intelligence multiple times. The rule is that you shouldn't talk about intelligence, you should talk about optimization, or other stuff that the AI would do, and not argue about words. So I feel like optimization is a good word, and you give a bunch of different useful terminology in Risks from Learned Optimization, where you kind of introduce mesa-optimization amongst other stuff, and then you clarify it even more in "Clarifying AI Alignment terminology", which is a reference to the diagram behind me.

Evan: It appears to be inverted for me, but I can see it.

Michael: Oh it's inverted for you. Sorry, it was inverted for me, so I inverted it back. So I need to invert it back. I don't know what the camera will do at the end. I can have both. Yeah. Go ahead. I will just remove that.

Michael: What do you want me to say?

Michael: Maybe you can start with the mesa-optimization term. How do you define optimizers and what is a mesa-optimizer?

Evan: Risks from Learned Optimization takes a stance, which is something like: optimization is the important thing to think about. It's not intelligence or agency or any of these other things. Optimization is the key thing, which is similar to what you were describing. Certainly there's a lot of discussion around this stance and whether it is a good stance. I think it is a stance that lets you say a lot of things, because optimization is a reasonably concrete, coherent phenomenon. And so you can say a lot of stuff about optimization. And this is what Risks from Learned Optimization tries to do: say a bunch of stuff about optimization. I can say more; I'm happy to talk more generally about what Risks from Learned Optimization is, basically saying what the inner alignment problem is, et cetera, et cetera.

Michael: I recently re-read the actual introduction... I think there's a sequence on the AI Alignment Forum, with an introduction where you define all these concepts pretty precisely. I feel like an optimizer is, maybe you said that already, anything that actively searches over a space to find the best solution according to some objective. So, for instance, there was this example from Daniel Filan about "is a bottle cap an optimizer", because it's kind of preventing water from getting out. So it's not actually optimizing for anything, but it's something that humans optimized for. And so it's the result of an optimization process by humans.

Michael: And humans are something evolution optimized for, whereas we're optimizing for different things, like instrumental things, like having sex without making kids or other things. And so that's maybe some kind of disagreement I have on your examples. And I think in your podcast with Daniel Filan, AXRP, you said it is kind of useful to see humans as optimizing, searching for some solution that is not directly evolution's objective, in terms of alleles, chromosomes and stuff, because some humans don't make kids. My counterargument to that would be that even humans that don't make kids are still kind of trying to optimize for evolution's pressure, in a bad way. So imagine very good researchers. They don't care about making kids at all, but they're just very passionate about math. So they will end up producing value for the world with their math papers, which will end up in more GDP or more kids in the future for other humans.

Evan: Yeah, I think this is not how it works, though, right? I think it is just actually true that if you really let evolution keep running, it would not select for this sort of behavior. Evolution certainly wants some altruism, but it also doesn't want you to live down the street from a sperm bank and not go there every day, right? That's insane from an evolutionary perspective, regardless of whatever else you're doing.

Michael: But I still feel like we're trying. Like our instincts, our primal, our lizard brain still wants to optimize for evolution. It's just that we're bad at it. Or that we've evolved for building those tribes and society, which is a proxy for making more kids.

Evan: The keyword there is proxy. The things that we care about are in fact proxies for what evolution cares about. But they're not the same, right? You can certainly tell lots of stories about it, and they're true, because they are proxies. You can think about status, and power, and sex: all of those things are proxies for, and were related in the ancestral environment to, passing your genes on. But we don't actually care about passing our genes on, at least most of us don't. I think something like donating sperm or eggs is a good example: most people don't, or would have to be paid to do it. And evolution would never want that. Evolution is clearly like "This is the most important thing you should be doing. You should be doing nothing but that". But from a human perspective, we care about the proxy. We're like, what I care about isn't actually literally just my genes in the next generation. Even humans that really care about having children usually care about things like: I want to be able to raise my children, I want to have a connection with my children, not just literally "I want more of my DNA in the next generation".

Michael: Those proxies are actually good at making more humans long term. Evolution found this new solution in search space, which is: no, the actual good stuff is not just to be numerous. We actually need to be in some kind of tribe and have social defense to fight dinosaurs or monkeys or something. And then if everyone was spending sperm, would give sperm to solving...

Evan: No, evolution doesn't work at a group level though, it works primarily on an individual level. And so evolution is happy to evolve to extinction on a group level because it's primarily selecting on an individual level.

Michael: Hmm. But wait, so if you're selecting genes.

Evan: This is why we have things like selfish genes. A selfish gene doesn't actually help you, it just copies itself from place to place. Evolution isn't just selecting for the performance of the whole group, it's very explicitly selecting for your individual performance. Another example of this is sex ratios. In theory, for the maximum production of additional children, evolution would want significantly more females in each generation than males. But in fact, what we see is that across species, the sex ratio converges to 50 percent. And the reason it converges to 50 percent is that, from a selfish, individualistic perspective, if you're in a population where there are more than 50 percent females, then you are at an advantage in passing on your genes to the next generation if you have a male child, and at a disadvantage if you have a female child. And so despite the fact that evolution from a group perspective would rather have a sex ratio that is not 50 percent, from an individual perspective it has to be 50 percent, because that's the only stable equilibrium from a selfish, individualistic perspective, and evolution primarily selects on the individual.
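The sex-ratio argument can be checked with a little arithmetic. The sketch below is only an illustration of the reasoning Evan describes, under simplifying assumptions that are not from the conversation: a fixed population size, a fixed number of children in the next generation, and every child having exactly one mother and one father. The function name and the numbers are hypothetical.

```python
def expected_offspring_per_child(female_fraction, population=1000, next_gen_children=2000):
    """Return (per_son, per_daughter): expected offspring of one male and one female.

    Assumption: every child in the next generation has exactly one father and
    one mother, so males as a group and females as a group each account for
    all `next_gen_children`.
    """
    females = female_fraction * population
    males = population - females
    per_son = next_gen_children / males
    per_daughter = next_gen_children / females
    return per_son, per_daughter


if __name__ == "__main__":
    for fraction in (0.5, 0.6, 0.8):
        son, daughter = expected_offspring_per_child(fraction)
        print(f"female fraction {fraction}: per son {son:.2f}, per daughter {daughter:.2f}")
```

Whenever females make up more than half of the population, a son has more expected offspring than a daughter, so parents who produce sons are favored until the ratio drifts back toward 50 percent, which is the individual-level equilibrium described above.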

Michael: It's like a bunch of individuals with egoistic genes that converge to some Nash equilibrium at the society level.

Evan: Well, so we can also certainly talk about why it is that humans are altruistic. Where did that come from evolutionarily? I think the leading theory is something like: it's useful for cooperation. Being altruistic is helpful for your ability to cooperate with the rest of the group, because if you care about the rest of the group and they care about you, then you can cooperate really seriously with them. And so in some sense, altruism is selfishly useful from an evolutionary perspective. Evolution would rather have each individual be more altruistic because it helps them work better with the group and be less ostracized by the group, and therefore makes it more likely for that individual to have more children. And so this is an individualistic story of why, from the perspective of a single individual, evolution would rather that individual be more altruistic.

Michael: And what about, like, people being like the opposite of altruistic and just like kind of defecting all the time with altruistic people? Like this would be like the better position, right?

Evan: No, the point that I'm making is that this is not the case. For evolution, for each individual, altruism serves the purpose of helping that specific individual have more children.

Michael: I think there are other distinctions you make that are interesting. So just to define the basic terms again, because I think most of the listeners are not familiar with the paper. A good analogy for evolution is what we call the base objective. Maybe a neural network is an easier example.

Evan: Maybe it is better to start with neural networks and in Risks from Learned Optimization, we're really trying to ground everything on optimization. I think one of the big things that Risks from Learned Optimization does, that sort of all previous discussion didn't do is really carefully ground everything in Machine Learning.

Michael: So, let's talk about machine learning. What's interesting is when we have optimizers like Adam or stochastic gradient descent, then you're trying to change parameters theta so that you can better classify cats and dogs. At the end of the day, you change your parameters, they might end up at inference time doing something like optimization. The example for me would be something like a recurrent neural net where you do backprop through time, where you're optimizing and you're at inference time only using the latent cells or something? Some are frozen and some are not. And then you can adapt to, um, what you get at test time. And I think that was one example of a blog on LessWrong trying to reproduce mesa-optimization. Do you have better examples, maybe, of this sort of optimization.

Evan: So the classic example that I like to use to really explain what's going on with Risks from Learned Optimization is this maze example. We can imagine a situation where we train a model on a bunch of small, randomly generated mazes, and we put a green arrow at the end of the maze. It gets a picture of the maze, and we put a green arrow at the end to say "this is the end of the maze, you're supposed to go here". We train a bunch in this environment, and then we deploy to a new environment which has the following properties. It has larger mazes. The green arrow is no longer at the end, it is at some random location inside the maze. But we still want the agent to get to the end, we still want the agent to solve the maze. Or you can flip it and say we still want it to go to the green arrow and not go to the end. Either way, the point is there are a bunch of different ways in which this agent can generalize. Here's one generalization: it goes to the larger mazes and it doesn't know how to solve them, it just fails to solve big mazes. I would call this "its capabilities do not generalize": it didn't learn a general-purpose maze-solving capability. Or, it could learn a general-purpose maze-solving capability, so its capabilities generalize, and it could learn to go to the end of the maze, which is what we want it to do. And so its objective, going to the end of the maze, also generalizes properly. But then there's another situation, which is: its capabilities generalize, it knows how to navigate the maze successfully, but its objective fails to generalize and it learns to go to the green arrow rather than to the end.
And what's scary about this situation is that we now have a very capable model, with a general-purpose optimization procedure, but it's deploying that optimization procedure for the wrong objective, to get to the wrong goal: to get to the green arrow instead of what we wanted, which is to go to the end of the maze. And so this is really dangerous, because we have a powerful, competent model which is directed in a direction we never intended. What's happening here is that there is this unidentifiability, where on the training distribution we couldn't tell whether what it was really doing was going to the green arrow or going to the end. When we deploy it in an environment where these two things come apart, now we can tell. And if it does the wrong thing, it could be really capably doing the wrong thing in this new environment. So this is one example of a way in which a model can fail to have objective generalization. Its objective can fail to generalize properly while its capabilities still generalize properly, which is the general heading under which the paper is trying to address this as a problem.
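To make the maze example concrete, here is a minimal toy sketch. It is not code from the paper or from the conversation: the grids, the dictionary layout and the function names are all hypothetical, and the "agent" is simply a breadth-first search that competently navigates to whichever target it has internalized. The point is only that "go to the arrow" and "go to the exit" are indistinguishable on the training maze and come apart on the deployment maze.

```python
from collections import deque


def reachable(grid, start, goal):
    """Breadth-first search: True if `goal` can be reached from `start` via open cells."""
    rows, cols = len(grid), len(grid[0])
    queue, seen = deque([start]), {start}
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


def run_policy(env, target_key):
    """A capable agent that navigates to env[target_key]; returns the cell it ends in."""
    goal = env[target_key]
    return goal if reachable(env["grid"], env["start"], goal) else env["start"]


def reaches_exit(env, final_cell):
    """The objective we intended: did the agent end at the exit of the maze?"""
    return final_cell == env["exit"]


# Training maze: small, and the green arrow sits on the exit.
TRAIN = {
    "grid": ["..#",
             ".#.",
             "..."],
    "start": (0, 0), "exit": (2, 2), "arrow": (2, 2),
}

# Deployment maze: larger, and the arrow is somewhere in the middle.
DEPLOY = {
    "grid": ["....#",
             ".##..",
             ".#..#",
             ".#.#.",
             "....."],
    "start": (0, 0), "exit": (4, 4), "arrow": (2, 2),
}

if __name__ == "__main__":
    for name, env in (("train", TRAIN), ("deploy", DEPLOY)):
        for target in ("arrow", "exit"):
            final = run_policy(env, target)
            print(name, f"policy=go-to-{target}", "reaches exit:", reaches_exit(env, final))
```

In this toy, the go-to-arrow policy still navigates the larger maze successfully (its capabilities generalize), but it ends up at the arrow rather than the exit (its objective does not).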

Michael: So to summarize, it's good at finding green arrows, but it's not good at finding the end of the maze.

Evan: That would be a situation where we're unhappy, because it's very powerful and knows how to solve mazes properly, but it isn't using those capabilities for the right purpose. It's not using them for the one we wanted it to use them for. It's using them for this other thing instead.

Michael: I feel like it's somehow similar to people who criticize GPT-3 for not understanding what it's saying, and just repeating and memorizing things. You could say that GPT-3 doesn't have what we want [it] to have, which is natural language processing or a human-like understanding of words and concepts, but just has memorization. It kind of memorized the way of finding the green arrow without understanding the actual task we wanted it to solve. Does it fall into this same category or is it different?

Evan: You can certainly think about it that way. I think it's a little bit tricky to really think about... you know, in some sense, the objective of GPT-3 is predictive accuracy on the next token. It's a little bit hard to understand what it would actually look like to generalize well or poorly according to that. I mean, it's just like if you have an actual distribution that is similar... You know, I guess in some sense, if we only trained it on this web text corpus and then put it in some new setting where the underlying generators of the text are different, then it might still be trying to do predictive accuracy, or it might have learned a bunch of heuristics that are not particularly about accuracy. Maybe what it really learned is that it should try, in any situation, to output coherent sentences or whatever, and then it doesn't actually try to model the dynamics of this new setting and get good predictive accuracy. It just does the simple thing: well, I learned these heuristics for how good sentences work, and so I'm just going to keep outputting that.

Michael: Right. It found those heuristics. Two things I remember from Connor's interviews, I'm not sure if it was with me or with other people: one was "we don't really know the entropy of human language", of English. We don't even know how hard the problem is. So it's very hard to say exactly how successful it is at predicting words or understanding them, because we don't have a good model of what English is. And the other one, which is kind of a funny trick, is that I think it took something like one epoch or less than an epoch... it only passed through each example once. Maybe I'm wrong. Maybe it took more than one epoch, but it kind of learned to generalize from little data, just passing through once... like one-shot learning.

Evan: I think one epoch is in fact compute optimal in most of these really big language models.

Michael: So, yeah, that was something impressive in terms of... for people who say it's memorizing. It's memorizing, yes, maybe, but from one-shot learning or something.

Evan: I don't think one epoch is very meaningful here. It still got to see every data point in the training data. It has seen the whole training distribution. The fact that it hasn't seen it multiple times is more just saying "Oh, our training process just performs better this way, because it has sort of already extracted the information from that data point the first time through. And there are diminishing returns in trying to run it through a second time. And so it's not compute optimal to do so."

Michael: Just running it for one epoch is enough and compute optimal, and otherwise you would just lose money because you wouldn't get as much value per dollar or something. I'm trying to find the post from Matthew Barnett, because he sent me the code at some point on how to do it, and I'm just trying to put it behind me, as I've been doing so far. Putting stuff behind me. So give me just one second. So it's a map with treasure and chests, keys and chests. I don't know if you remember it; if so maybe we can talk about it, because I kind of remember some form of this environment, but maybe you also remember it.

Evan: Yeah, it's very similar to my maze example, where there's a set of objectives which are indistinguishable in training, and then when we move to deployment, you can see that it is this one and not the other.

Michael: Yeah. So if I remember correctly, it would stumble on keys, and because there would be more keys than chests, it would open the chests without actually knowing what opening a chest is. And then on big environments it wouldn't really know how to do it. Or it could still do it in bigger environments. Yes, something like that. So yeah, if you're a listener, I have some code for it. And there are people with problems to demonstrate mesa-optimization or inner alignment failure. Is it inner alignment failure?

Evan: Well, so inner alignment the way we define it sort of requires there to be optimization. We don't know, especially in the keys versus chest, where it's simple, the model probably isn't doing any real optimization internally.

Michael: The problem with calling it optimization is that we're kind of assuming some form of complexity, or that it's doing some thinking or some elaborate task, or finding some optimum somewhere for some precise task. So I remember there was this LessWrong post about a paper from DeepMind on meta-learning, meta-reinforcement learning. It was at the top of the Alignment Forum for a bit, and it argued that it was similar to some kind of mesa-optimization. And then some people commented that it was basically reinforcement learning that the thing was doing. It was not some kind of very special trick. It was just an LSTM plus some RL, and at the end you [freeze] the weights and then you get some stuff that's going to adapt to environments. I guess researchers can always say, you know, yeah, GPT-3 is not intelligent, it's just memorizing sentences, or this thing is not optimization, it is just doing whatever it was trained to do at the beginning.

Evan: But there is a truth of the matter. It's an empirical question, so we can look inside of models, if we have good enough transparency tools, and discover how they work. Are they running an optimization algorithm? Are they not running an optimization algorithm? That is something I hope we can eventually do. I don't think we are currently able to quite get there, but I am hoping that we will eventually be able to actually answer these sorts of questions by literally looking inside of our models.

Michael: Just to close a bit on this, I think that this terminology is super important, so I'm just going to put that back behind me one last time because I think that's useful for the listeners. So is it on the right side for you now? What we want is alignment, which is kind of what you said about impact alignment, which is AI that doesn't do bad stuff. Then capability robustness is (you can correct me at any time) the ability to generalize to harder environments or out-of-distribution environments. Is this correct? You need it to be capable enough to generalize well.

Evan: Yeah, but it's sort of like generalize "according to what?" is the question and capability robustness just says generalize according to whatever objective you learn. It isn't saying that you actually learn the correct objective. It's just saying according to whatever objective you learned, generalize well according to that.

Michael: Right. So, yeah, you're capable of maximizing that reward, of minimizing this loss in a more general setting than from the training data.

Evan: Importantly, capability robustness has no reference to the actual reward. It's not saying that according to the actual reward, you generalize well; that's robustness. Robustness in general says that according to the actual reward, you generalize well. Capability robustness is the subset of robustness that says not according to the actual reward, just according to whatever utility function, internal objective... what I call in this post "a behavioral objective", which is just the objective that actually describes what you're optimizing for in practice. Do you generalize well according to that? Which is just asking: do you just make a bunch of mistakes and not really know what you're doing and not have anything coherent, or are you coherently trying to do something, regardless of whether that thing is the correct thing?

Michael: Generalizing in doing what you were previously trying to achieve in a new setting. And then intent alignment is what we described before as "doing what the human wants it to do". Like if you say "bring me some tea", it brings some tea, assuming it's not killing the baby between you and the tea. Objective robustness is... your objective is robust to what? I forgot.

Evan: Maybe a useful thing also for distinguishing between capability robustness and objective robustness would be... there's like another version of this picture where I have it in terms of just robustness.

Evan: There's the version of the top, which is how I actually think about it, but then I think if you think about these things in terms of robustness a lot, then like it may be a little bit better to start with the robustness centric version—they're equivalent.

Evan: I was trying to say... so we have robustness. In the robustness-centric version, we split alignment into outer alignment and robustness at the top level, where outer alignment asks "is the base objective the right thing?" And then robustness asks "does it generalize well according to the base objective?", that is, on a [new] distribution, does it continue to pursue the base objective? And then we can split that into objective robustness and capability robustness. Here I think the distinction between objective robustness and capability robustness is maybe a little bit easier. Previously we had the notion of the base objective, which is just the reward function or loss function, and then we also introduced the notion of the behavioral objective, which is: what does it appear to be optimizing? And then we say it's capability robust if it's robust according to its behavioral objective. So whatever it looks like it's optimizing, it does a really good job at optimizing that, no matter where it is. So if we look at its behavior in general, it looks like it's going to the green arrow. And so we can say its behavioral objective is to try to go to the green arrow. That's not what we want. We want it to go to the end of the maze. But when we look at what it's doing, it's clearly trying to go to the green arrow, and then we can ask how good it is at going to the green arrow. And if it's really good at going to the green arrow, then it's very capability robust, even though it's doing the wrong thing. We didn't want it to go to the green arrow. And so the other part of robustness is objective robustness, which is: how closely does that behavioral objective match the base objective, which is the one we actually want?
And then a sub-problem of objective robustness is inner alignment, which is saying "ok, but what if specifically we have a model which is an optimizer, which is running an optimization process, and therefore has some objective that the optimization process is optimizing for, which we call the mesa-objective?" And then inner alignment asks: how close is the base objective to the mesa-objective? And the overall point of both of these diagrams is that if we get both inner and outer alignment, then... this is the part that's harder to see on this version of the diagram; on the other version of the diagram, it is very clear that inner alignment and outer alignment imply intent alignment. Which is, I think, a good justification for why it makes sense to split the problem into inner and outer alignment in the situation where your model is a mesa-optimizer, that is, it's doing optimization. If it's not a mesa-optimizer, then you don't have inner alignment as a concrete phenomenon, you just have objective robustness, and then maybe it makes more sense to look at it from the robustness picture. But I think if you're thinking mostly about mesa-optimizers, then outer alignment plus inner alignment gives you intent alignment. Both pictures are equivalent, they're just two different ways of looking at the same thing.

Michael: I think, for the listeners, the arrows are kind of sufficient ways of achieving X. If you have robustness and outer alignment, you get alignment, and you don't even need to have inner alignment. If there's no mesa-optimization going on and you just have one optimization process, then you can just have that one optimization process being robust and outer aligned. So those are sufficient ways of achieving alignment, not necessary ones. Is this correct?

Evan: Yeah.

Michael: I think what you were saying is interesting because I studied Inverse Reinforcement Learning (IRL), where we have a human having some behavior, doing some stuff, and the goal is to guess [the human's] reward function from his behavior. And [the human's] reward function is kind of what he wants to do and could be mapped to his values or something of that sort. If the human was performing optimally according to his reward function, then from his behavior you could infer his reward function. And so this kind of behavioral objective is what an AI would be doing if it was optimizing for the human's objective function, if IRL was tractable in some way.

Evan: Right, yes, you can think of the behavioral objective as being related to IRL (Inverse Reinforcement Learning) in the sense that if you did Inverse Reinforcement Learning on some model, then you can think of the objective that you get as a result of doing that as that model's behavioral objective.
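As a rough illustration of reading a behavioral objective off behavior, here is a deliberately tiny sketch. It is not a real IRL algorithm and not anything described in the conversation: it assumes a one-dimensional world, a hypothetical family of candidate objectives of the form "reach position g", and it simply picks the candidate under which the most observed (state, action) pairs would have been optimal. All names and data are made up for the example.

```python
def optimal_action(state, goal):
    """Optimal move on a 1-D line for the objective 'reach goal': -1, 0 or +1."""
    return (goal > state) - (goal < state)


def consistency(demos, goal):
    """Fraction of observed (state, action) pairs that are optimal for this goal."""
    return sum(optimal_action(state, goal) == action for state, action in demos) / len(demos)


def infer_behavioral_goal(demos, candidate_goals):
    """Pick the candidate objective that best explains the observed behavior."""
    return max(candidate_goals, key=lambda goal: consistency(demos, goal))


# Hypothetical observations of an agent: (position, action taken there).
demos = [(0, 1), (1, 1), (2, 1), (5, -1), (3, 0)]

if __name__ == "__main__":
    for goal in range(6):
        print(f"goal {goal}: consistency {consistency(demos, goal):.1f}")
    print("inferred behavioral objective: reach position", infer_behavioral_goal(demos, range(6)))
```

If the last two observations are removed, several candidate goals explain the remaining behavior equally well, which is a small version of the point that many reward functions can be consistent with the same behavior.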

Michael: For any sequence of actions... for any mapping from state to action, you can construct a set of reward functions, or utility functions, under which that policy is optimal, right? Cool, I think we covered that pretty well. Sorry for saying basic stuff, but I think most of the audience doesn't know about this paper anyway. That doesn't mean it's not a very essential one, and one of the most important ones of the past few years. It just means that my audience is not literate on that. So you talked a bit about transparency and how it's important for solving the AI Alignment problem. And, I guess, Chris Olah is an important actor in that space. There are other people I met or talked to in the Clarity space, I think of Nick Cammarata. And I think it takes a lot of time to write a good Distill post, to explain this stuff well. It's a lot of effort. And somehow... maybe you get less exposure than with a tweet or something... but somehow you can say that gaining understanding of how ML models work is accelerating ML research, and it's also giving a good feedback loop on how we align those models. And I think in your post, you kind of gave both the arguments and counter-arguments for Chris Olah's views. And you're the best proxy, or one of the best proxies, for Chris Olah's views on that today.

Evan: Yeah. So, yes, I think you're referring to the post I wrote sort of summarizing...

Michael: Chris Olah's views on AGI Safety I think.

Evan: Yeah, it was a while ago, just after talking with Chris. I think Chris has a lot of interesting stuff to say about AI safety. And I think it's like under... I don't know, at least at the time I felt like it was underappreciated, and people weren't engaging with this sort of way of looking at the [interpretability problem as much as] AI Safety, as much as I wish they were. It's been a while since I talked with Chris about this stuff, so I'm not necessarily up to date on all the stuff these days.

Michael: I think it's from November 2019. So one and a half years old.

Evan: Yeah, I think it was reasonably accurate. I gave the draft to Chris a bunch of times, going back and forth, trying to make sure he agrees with it and stuff. So I think it was reasonably accurate, at least a good summary of the sort of stuff he was thinking about then. So, yes, I think it's definitely worth... Yeah. And I definitely think Chris is doing a lot of transparency stuff and is still probably the person who's done the most stuff in the [explainability?] space of general transparency stuff, that is at least... that is relevant to AI Safety. There are a lot of other people that are also doing related stuff, like Daniel Filan and other people. Yeah, I'm happy to talk about any other specific questions about like...

Michael: Yeah, so what I remember from his post was... there was this word in English that I had to Google, which is Mulligan. I don't know if I'm pronouncing it right. A "mulligan"?

Evan: A mulligan.

Michael: A Mulligan is like a second chance or something. So if we don't... if we build something that breaks, or if we build AI that is not something we can correct, it gives us a chance to correct stuff when we mess up. Being able to introspect and debug it is instrumentally useful in some way.

Evan: I think that this is like... this is sort of one of the arguments that Chris makes for why interpretability is useful is... it gives you the chance to catch problems before you deploy your system.

Michael: You can catch problems, and there was something about auditing. Let me go back to see what it was: catching problems with auditing. So, yes, we can see early on if it's not aligned, which is very similar to this thing about [Mulligan], I forgot the word in English, the second chance thing. And I think the more debatable thing is whether it is worth it or not... the acceleration in ML understanding... is it worth the gains in AI safety? I think he says that it's worth looking into it. I feel like... I don't know how much we've gained from [looking] at Inception or ResNet embeddings. So I don't think ML researchers are much more competent from looking at this, but I'm also not sure how much better AI researchers are. So, yeah, I'm not so sure about the tradeoffs right now, but maybe in the future it is very important to be able to debug it. So I don't know, what do you think are the upsides and downsides, if you remember?

Evan: Yes, I can sort of talk about my views, and my sort of perspective on transparency is that we need to be able to train models in a way that doesn't just look at the model's behavior. So I think Chris has this view, you know, with catching problems via auditing, where he's like, well, we trained some model and then we can check to see if we did a good job with transparency tools. I think I have a somewhat different view, which is we don't want to use transparency to check after the fact. We have to use transparency tools to solve the problem in the first place, because if we just try to train models to do the right thing via, like, behaviorally making sure they look like they're doing the right thing, we can end up with models that aren't actually doing the right thing and are just pretending to do the right thing. And the only way to eliminate models that are pretending to do the right thing is via looking inside of the models internally and training on: is the model actually trying to do the right thing, not just sort of looking like it's doing the right thing? And so I'm sort of in favor of approaches where we directly train models using transparency tools, whereas I think Chris is sort of more in favor of trying to use transparency tools as a way to check behavior, as a sort of independent check after we have attempted to train a [state?] model using some other approach.

Michael: Right. So you're more like interested in kind of looking at it while you're building it, so you're not like doing something like mesa-optimization or bad things or deceptive behaviors, whereas Chris is like more like in a post mortem, you see why it didn't work.

Evan: Yeah.

Michael: I think this diagram behind me, I hope it's the right way around for everyone, so we start from some kind of model. On the Y axis is how interpretable things are, and on the X axis is how strong or capable the AI is. So at the beginning you understand what it is doing, then you start doing something like MNIST, handwritten digit recognition, and you don't understand the neurons because they're not expressive enough, or maybe you have some understanding but you're a bit confused, or like those big models like Inception, ResNet or Transformers are a bit more abstract. And then what we're learning is that when you look at the latent space of GANs or Transformers, we're seeing something close to knowledge, and we're understanding more and more because it's more and more expressive, right? And at the end, when it's becoming superhuman, it's very hard for humans to understand because it's like super optimized in a totally different language.

Evan: Yeah.

Michael: So it's useful to do interpretability, to be in this crisp abstractions regime of understanding AI, when it's still kind of human level or before human level. It doesn't go alien before it goes human level, so we have some time.

Evan: So I have some belief that we can probably just avoid the drop off at the end if we use an amplified overseer to do transparency.

Michael: An oversight overseer?

Evan: An amplified overseer.

Michael: Oh, amplified overseer, yes. I think that is like most of your proposals. We'll get to that later, I think this will be the last part of the podcast, in the last 20 minutes. It's about your 11 proposals on how to combine things like an amplified overseer and interpretability. I found there was also some field building, so both of us are trying to do some field building in AI Alignment, and Chris Olah is maybe thinking more about field building in the interpretability space. If I remember correctly, there are two arguments that make it attractive for researchers. One is, if researchers are in a lab at a university, they can still do interpretability research without having billions of dollars to spend. They can just look at neural networks and make [them] understandable. And I think one assumption he has is that there are some low-hanging fruits in doing interpretability research now because not many people... it's pretty neglected, or at least it was in 2019.

Evan: Yeah, I definitely think yes, this is something Chris has certainly worked on, like the point of Distill is to try to get like interpretability research more like... get more attention and more prestigious and like more cool.

Michael: I think he succeeded. The post about... I think it was CoinRun, where they visualize features in CoinRun and they map the reward. That was pretty cool. And I think that Microscope from OpenAI, where you see the features of all those like [inaudible] models, was pretty cool. I don't know if they've done some representations for CLIP. I think CLIP was... I think CLIP is only like pictures, it didn't use Microscope, I'm not sure.

Michael: And yeah, I think that there's another argument, which is... so if you're forcing your models to be interpretable, a good analogy would be forcing your students to show that they've done good work. So, like, show their papers or show their process, so they're not Goodharting the actual optimisation, but they're showing everything, so it's harder for them to lie if they're transparent. Explicitly transparent.

Evan: Yeah, I think that's sort of closer to the sort of thing I was talking about, where I want to use transparency sort of as a training.

Michael: Maybe not, Chris, maybe like more you in this post.

Evan: Chris also is interested in this, but it's not his primary motivation.

Michael: Right. Let's talk about your stuff. So your most important post, in my opinion, on the AI Alignment Forum or LessWrong was "An overview of 11 proposals for building safe advanced AI", and you have 11... so maybe... I think there are like five or something... key points, like transparency, amplification, imitative amplification. And then there's something like adversarial training. And then you kind of combine those with microscopes, STEM AI, reward modeling. You have like five or six things that you combine... maybe we can start with the first one, which is the one that talks about amplification, and I can put the slides behind me.

Evan: Yes, there are 11 proposals, so the second one is the first one that is about amplification, and it talks about imitative amplification, which is a specific sort of amplification where, very simply, we train a model on the objective of imitating a human consulting the model. And so I have a bunch of these different proposals. They're not unique to me. I try to take a bunch of proposals from throughout the literature and then I try to compare them. I think the main thing that this post does is compare all the proposals on four axes, where these axes are outer alignment and inner alignment, which we've talked about, and then training competitiveness and performance competitiveness, where training competitiveness is how hard it is to train and performance competitiveness is, if we do train it, how good it is. And so these four conditions are the sort of central things that we need if we want to be able to have a competitive and aligned procedure for building artificial intelligence, and so we can look at all these different things that people have talked about and try to assess, do they satisfy these sorts of things? I think the general answer is, well, it's unclear, and certainly for none of these proposals do we have a really strong case that they definitely do. But certainly it seems like we can say, you know, some are more promising, some are less promising, and that's going to depend on your particular research taste. I'm happy to talk about any of the particular proposals.

Michael: I think there's more than just like... We can talk about whether it would work or not, but is there any concrete feedback loop that will tell us if something works or not? Are there any empirical environments or research that can give us feedback? I feel like the whole debate, like amplification, more from Paul was pretty empirical, whereas... most of the stuff you post is empirical, but some of them are maybe easier to test in the next years. I don't know about amplification because that would require some kind of recursive loop or... I don't know where we are in terms of trying to do IDA empirically, but maybe just basically, I think the first proposal was somehow doing multi-agent safety with a bunch of agents, like the cooperative hide and seek from OpenAI, and the second one is about imitative amplification, so maybe you can explain a bit what is going on here with the H and M, A and Q, because I think it's one of the most interesting and useful for thinking about the other ones.

Evan: Yes. What you have there is on the second proposal, which is about imitative amplification, and it's describing how imitative amplification works. So in imitative amplification you have... a sort of... you first need to define the amplification operator, which takes in some model M and produces a more powerful version of M, and the way it does that is it says we have some human which consults multiple copies of M to produce an answer to some question. And then this process of a human consulting M is what we refer to as the amplified M. So this amplification operator applied to M produces a more powerful version of M, which is a human [with access to that?], and then you can implement this amplification operator in different ways. In imitative amplification this is having [inaudible].
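
As a rough illustration of the amplification operator described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the toy "question" is a list of numbers to add up, `base_model` stands in for a weak M, and `human_with_copies` stands in for the human H consulting copies of M. It only shows how Amp(M) can answer questions that M alone cannot; it does not include the training step in which a new model learns to imitate Amp(M).

```python
from typing import Callable, Sequence

Question = Sequence[int]              # toy "question": add up these numbers
Model = Callable[[Question], int]     # a model maps a question to an answer


def base_model(q: Question) -> int:
    """Hypothetical weak M: only handles questions with at most one number."""
    if len(q) > 1:
        raise ValueError("question too hard for this model")
    return q[0] if q else 0


def human_with_copies(q: Question, model: Model) -> int:
    """Toy human H: split the question in two, ask a copy of M about
    each half, and combine the two answers."""
    if len(q) <= 1:
        return model(q)
    mid = len(q) // 2
    return model(q[:mid]) + model(q[mid:])


def amplify(model: Model) -> Model:
    """Amp(M): the system 'H consulting copies of M', viewed as a new model."""
    return lambda q: human_with_copies(q, model)


def iterate(model: Model, rounds: int) -> Model:
    """Apply Amp repeatedly. Real imitative amplification would train a
    fresh model to imitate Amp(M) at each round; here we just wrap it."""
    for _ in range(rounds):
        model = amplify(model)
    return model


if __name__ == "__main__":
    m2 = iterate(base_model, 2)
    print(m2([1, 2, 3, 4]))
```

In this toy setup each round of amplification doubles the length of question the system can handle, which is one simple way to picture "a more powerful version of M".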

Michael: And, yeah, we can go back to our example of Sundar Pichai having five AIs helping him to do stuff. I don't know if it's a good example, but it's like he's kind of amplified by AIs. I think one important thing is that in the case of amplification, you can get more intelligence from ten agents than from, let's say, one. And then... but ten agents will be less able to take over the world because, like, you could kind of control them. Right. So it's easy to see how each individual one is aligned, like each M is aligned... but then... the sum of them [is] smaller than just one M. Is that basically the intuition or?

Evan: Yeah, it's complicated, like, why do you think amplification actually produces better models? I mean, so I think that, like, you know, at least in imitative amplification, we have this argument about HCH where we can be like, in the limit it converges to an infinite tree of humans consulting humans. And then there are, you know, arguments you can make, like, you know, it seems like this is a reasonable sort of idealized reflection process, and so it seems like a good thing to trust.

Michael: Oh, is it like humans consulting HCH? The thing you're saying.

Evan: HCH is a recursive acronym which stands for Humans Consulting HCH.
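
The recursive definition can be sketched in a few lines of Python. This is only a toy under stated assumptions: the real HCH is an idealized, unbounded tree, whereas this version takes a hypothetical `depth` budget so that it terminates, and the `toy_human` (who adds up nested lists by delegating each sub-list) is invented for the example.

```python
from typing import Callable, List, Union

Question = Union[int, List["Question"]]   # toy question: a nested list of ints
Ask = Callable[[Question], int]           # how a human consults HCH


def hch(question: Question, human: Callable[[Question, Ask], int], depth: int) -> int:
    """Depth-limited HCH: a human answers `question` and may pass
    sub-questions to further copies of HCH, each with a smaller budget."""
    def ask(subquestion: Question) -> int:
        if depth == 0:
            raise RuntimeError("consultation budget exhausted")
        return hch(subquestion, human, depth - 1)
    return human(question, ask)


def toy_human(question: Question, ask: Ask) -> int:
    """Hypothetical human: answers a single number directly and
    delegates every element of a list to a consulted copy."""
    if isinstance(question, int):
        return question
    return sum(ask(part) for part in question)


if __name__ == "__main__":
    print(hch([1, [2, 3], [4, [5]]], toy_human, depth=3))
```

Each call to `ask` corresponds to one edge in the tree of humans consulting humans; letting `depth` grow without bound is the limit Evan refers to.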

Michael: Ok, right. So then this is like an infinite loop. And then you're like, we have different ways of doing amplification by approval, by different [inaudible] training or something. And let me see if I have other ones, which are interesting. I think one of the funniest one is the one about STEM. So you basically tell the AI "stop thinking about humans, just think about science". This is like a very bad summary... this is a strawman, this is what I got from just like reading the first paragraph.

Evan: STEM AI is a proposal where we're like, well, it seems kind of dangerous to train models in environments where they have access to a bunch of information about humans, so maybe we can just get away with training models in environments where they only have access to information on math or something, or science or technology or whatever, and then we just use that. And then the problem with this, obviously, is that you can't use it for a lot of things. Like it doesn't... you can't use it for, like, you know, geopolitics or running a company or anything that would involve human modeling. But you can still use it for a lot of things. And so maybe it's enough. Though maybe not.

Michael: I think if we have a very good AI... maybe it's from Robin Hanson... it's like you have this advanced super-accelerated human civilization in your computer, like brain emulation, and it runs for like billions of years or a thousand years. And at the end, it produces some output of all the research it has found over the years. If we have some kind of oracle AI that is very good at science, you would get all those insights about science without having the problems of it trying to find our values or something. But then we would still have some kind of boxing problem, for it to not escape and, you know, turn the earth into computronium or something. But I think it's a good objective to be good at science. I think the other ones are about debate or amplification... I think one thing that is interesting to me is reward modeling. So I think DeepMind, at least, used to have this different approach. So CHAI, the Center for Human Compatible AI at Berkeley, does inverse reinforcement learning, trying to find a reward function, whereas the DeepMind Safety team was mostly trying to do reward modeling, and I don't fully understand the difference. In your blogpost, you give some solutions with reward modeling. So if you could explain that to me, that would be helpful on a personal level.

Evan: Yeah. So here's how I think about recursive reward modeling. So in imitative amplification, we have this amplification operator Amp(M) where we just have a human consulting the model, and this is how we produce a more powerful version of the model. In recursive reward modeling, we have a new version of the amplification operator, where now what the amplification operator does is the following thing: it trains a reward model on the human's feedback. It then trains an agent to optimize that reward model, and then it gives the human the opportunity to look at what that agent does and give additional feedback to our [finding?] agent. And then once this sort of converges, the human is given a lot of feedback and trying to [agent?], which is like we found the reward model and we found an agent which is in fact trying to optimize for this reward that the human is trying to give feedback for. Then we call the resulting agent the amplified version of the original model. And so we have this new amplification operator, which does this whole reward modeling process, and then we just do a sort of standard iterated amplification on top of that new amplification operator.
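
The loop Evan describes (fit a reward model to human feedback, train an agent against it, let the human look at the agent and give more feedback, repeat until it settles) can be sketched as below. This is a deliberately tiny stand-in, not the procedure from any paper: the action space, the `toy_human` rating function, the lookup-table "reward model" and the argmax "agent" are all assumptions chosen so the example runs. In the full proposal the human giving feedback would also be assisted by the previous model, which is omitted here.

```python
from typing import Callable, Dict, Tuple

ACTIONS = range(10)   # hypothetical action space


def toy_human(action: int) -> float:
    """Hypothetical human feedback: the closer to action 7, the better."""
    return -abs(action - 7)


def train_reward_model(feedback: Dict[int, float]) -> Callable[[int], float]:
    """'Learn' a reward model from ratings. Unrated actions get an
    optimistic default of 0.0 so the agent will try them."""
    table = dict(feedback)
    return lambda action: table.get(action, 0.0)


def train_agent(reward_model: Callable[[int], float]) -> Callable[[], int]:
    """'Train' an agent: pick the action the reward model scores highest."""
    best = max(ACTIONS, key=reward_model)
    return lambda: best


def reward_modeling_amp(
    human: Callable[[int], float], max_rounds: int = 50
) -> Tuple[Callable[[], int], Callable[[int], float]]:
    """One application of the reward-modeling amplification operator."""
    feedback: Dict[int, float] = {}
    reward_model = train_reward_model(feedback)
    agent = train_agent(reward_model)
    previous = None
    for _ in range(max_rounds):
        action = agent()
        if action == previous:               # behaviour stopped changing
            break
        feedback[action] = human(action)     # human looks at the agent's behaviour
        previous = action
        reward_model = train_reward_model(feedback)   # refine the reward model
        agent = train_agent(reward_model)             # re-train the agent on it
    return agent, reward_model


if __name__ == "__main__":
    agent, reward_model = reward_modeling_amp(toy_human)
    print(agent(), reward_model(0))
```

The point of the sketch is the alternation between the three steps; a real system would replace the lookup table and the argmax with learned models.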

Michael: But, what's the difference? Maybe it's a layman question, but what's the difference between trying to infer, like deep RL from human preferences, like human saying yes or no, like trying to tell the AI... cooperative IRL where we're trying to have a human say what is the correct behavior and reward modeling. [There's always?] a human in the loop saying what he wants to do in a reward model, which is kind of a reward function... Is a reward model a reward function? Or is it different?

Evan: A reward model is not a reward function because we learned it. So a reward model is a learned model.

Michael: Right. So it's like a model. When you're trying to do IRL you also have like a model of the reward function so you have parameters that define your reward. It seems to me very similar, but maybe I'm missing something.

Evan: It's similar to IRL, but it's not quite the same, because we also train an agent to then optimize for that reward function. And then we also refine the reward function by letting the human look at what that agent does.

Michael: Right, ok, cool. And, yeah, I think the last paper that was kind of interesting in terms of AI Alignment... where I'm also having trouble understanding... is Learning to summarize from human feedback, where there is this kind of human feedback, where I think the AI does summaries and the human says which summaries are good or not. And so there's a mixture of kind of RL and NLP, and at the end there's like human feedback in the loop. So if you can give some good information on that, great, otherwise I can read it on my own. I think there's a similar diagram to... let me find it.

Evan: Yeah. I mean, something similar. I think it's a little bit simpler. They're just saying we learn like, well, you know, it's actually very similar because they're learning a... I don't know if that actually goes through the step of having a reward model though, I think what it does is it just learns an agent and then the human gets to look at the agent's actions and then give some preferences and then you refine the agent based on the human's preferences after looking at the agent's behavior. So it's sort of similar, but skips the step where you [actively?] a separate reward model. At least if I'm remembering the paper correctly.

Michael: I'm just trying to find... I think I have the thing but I'm not sure. They're giving me an SVG image, which is a bit hard, ok but let's not go into this paper if we haven't both looked at it, anyway. This is my closing question. What is the most underappreciated sub-problem of AI alignment you would want people to work more on?

Evan: So, this is a little bit of a weird question, because it depends, I think, very heavily on who I'm giving the advice to. So like, I know there's the problems that I work on, which obviously I'm working on them because I think that they're the most important things to work on, which are things like myopia and how do we sort of understand what would be...

Michael: What's myopia?

Evan: Sort of how do we understand what it would look like for an agent to only care about a single action and not optimize for anything else. I think this is exciting, but it's not necessarily something I would want like, I don't know. I think it's complicated and it's like I don't know. I think if you wanted to work on this or something, the right thing to do is just talk to me a bunch. Whereas, say, the more general advice, if I want to, just like... if you're trying to get into AI Safety and you have some machine learning experience and just want to do something, I think that my current advice is to try to do transparency and interpretability research, sort of in the style of the circuits stuff.

Michael: So yeah. You're referring to like your post or Paul Christiano's work on circuits, right?

Evan: No.

Michael: Chris Olah's work on circuits.

Evan: Yes, I'm referring to Chris Olah's work on circuits.

Michael: Cool. I think it is hard for people to actually give precise answers to that. Do you have... are your timelines aligned with kind of Ajeya Cotra's report... I don't know if you've read this.

Evan: I have a very high degree of uncertainty on AI timelines.

Michael: It's hard to talk about it publicly.

Evan: It's not that it's hard to talk about it publicly. I have a high degree of uncertainty. I do not know what the correct AI timelines are. And in fact, I think that it's very, very difficult in practice to estimate good AI timelines. And I think Ajeya has done an admirable job, and if I had to pick a number, like, as a [inaudible] guess, I would probably pick Ajeya's number, but I don't actually put very much stake in any particular analysis of how long things are going to take, because I think it is very, very difficult to predict these sorts of things.

Michael: You can say that she did a very good job and it was very rigorous, but it was before something like Dall-E came out. So I think most people I've talked to in the ML space kind of updated a lot on Dall-E, or CLIP, at least as being multi-modal and able to understand concepts, like doing an avocado chair or something. And when I look at a bunch of art stuff on Eleuther.ai's Discord, I'm kind of amazed at how good it is at understanding concepts. Even if you have very conservative timelines and are very uncertain, have you updated on Dall-E or not? That's my question. That's my real final question.

Evan: I don't think it's that big of an update, I guess? I feel like, I don't know, like starting from like GPT-2, you know, and even from like BERT. We've seen really impressive [feats?] from language models going far back now, I think at this point. I think like... I guess I feel like you shouldn't have been that surprised that, like, it also works to like say things in multimodal settings. Like you just feel like that's the obvious thing that's going to happen next. I guess, like, I didn't feel like that for... or Dall-E was like extremely surprising, I guess?

Michael: What would be something that would surprise you?

Evan: What would be something that would surprise me? I don't know. I mean, lots of things would surprise me, I guess. In like, hindsight, what are things that were surprising to me? Well, like I said, I definitely think that the success of Transformer based language models was surprising. I definitely think that... I think that like AlphaGo was somewhat surprising.

Michael: [inaudible]

Evan: Yeah, right.

Michael: Go ahead.

Evan: No, nothing.

Michael: Transformers were surprising, and according to Connor, there is that hypothesis that Transformers is all you need. He didn't say that, but that's like a meme of like, if Transformers is all you need for AGI, then maybe we did the most important part, but then like plugging RL into it is the easy part.

Evan: It's the very strong version of attention is all you need. Attention is really all you will ever need.

Michael: Yeah. It is all you will ever need. So if it's right then we don't need... you will never be surprised, and just transformers is enough. Anyhow, I won't take any more of your time. It was very good to have you. And I'll probably send you a link to the video before the end of the week.

Evan: Ok, yeah, definitely.
