Garrabrant and Shah on human modeling in AGI

Rob Bensinger

This is an edited transcript of a conversation between Scott Garrabrant (MIRI) and Rohin Shah (DeepMind) about whether researchers should focus more on approaches to AI alignment that don’t require highly capable AI systems to do much human modeling. CFAR’s Eli Tyre facilitated the conversation.

To recap, and define some terms:

The alignment problem is the problem of figuring out "how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world" (outcome alignment) or the problem of building powerful AI systems that are trying to do what their operators want them to do (intent alignment).
In 2016, Hadfield-Menell, Dragan, Abbeel, and Russell proposed that we think of the alignment problem in terms of “Cooperative Inverse Reinforcement Learning” (CIRL), a framework where the AI system is initially uncertain of its reward function, and interacts over time with a human (who knows the reward function) in order to learn it.
In 2016-2017, Christiano proposed “Iterated Distillation and Amplification” (IDA), an approach to alignment that involves iteratively training AI systems to learn from human experts assisted by AI helpers. In 2018, Irving, Christiano, and Amodei proposed AI safety via debate, an approach based on similar principles.
In early 2019, Scott Garrabrant and DeepMind’s Ramana Kumar argued in “Thoughts on Human Models” that we should be “cautious about AGI designs that use human models” and should “put more effort into developing approaches that work well in the absence of human models”.
In early February 2021, Scott and Rohin talked more about human modeling and decided to have the real-time conversation below.

You can find a recording of the Feb. 28 discussion below (sans Q&A) here.

1. IDA, CIRL, and incentives

Eli: I guess I want to first check what our goal is here. There was some stuff that happened online. Where are we according to you guys?

Scott: I think Rohin spoke last and I think I have to reload everything. I'm coming at it fresh right now.

Eli: Yeah. So we'll start by going from “What are the positions that each of you were thinking about?” But I guess I'm most curious about: from each of your perspectives, what would a win of this conversation be like? What could happen that you would go away and be like, "Yes, this was a success."

Rohin: I don't know. I generally feel like I want to have more models of what AI safety should look like, and that's usually what I'm trying to get out of conversations with other people. I also don't feel like we have even narrowed down on what the disagreement is yet. So, first goal would be to figure that out.

Eli: Definitely. Scott?

Scott: Yeah. So, I have two things. One is that I feel frustrated with the relationship between the AI safety community and the human models question. I mean, I feel like I haven't done a good job arguing about it or something. And so on one side, I'm trying to learn what the other position is because I'm stuck there.

The other thing is that I feel especially interested in this conversation now because I feel like some of my views on this have recently shifted.

My worry about human-modeling systems has been that modeling humans is in some sense "close" to behaviors like deception. I still see things that way, but now I have a clearer idea of what the relevant kind of “closeness” is: human-modeling is close to unacceptable behavior on some measure of closeness that involves your ability to observe the system’s internals and figure out whether the system is doing the intended thing or doing something importantly different.

I feel like this was an update to my view on the thing relative to how I thought about it a couple of years ago, and I want similar updates.

Eli: Awesome. As I'm thinking about it, my role is to try and make this conversation go well, according to your goals.

I want to encourage both of you to take full responsibility for making the conversation go well as well. Don't have the toxic respect of, "Ah, someone else is taking care of it." If something seems interesting to you, definitely say so. If something seems boring to you, say so. If there's a tack that seems particularly promising, definitely go for that. If I recommend something you think is stupid, you should say, "That's a stupid thing." I will not be offended. That is, in fact, helpful for me. Sound good?

All right. Scott, do you want to start with your broad claim?

Scott: Yeah. So, I claim that it is plausible that the problem of aligning a super capable system that is working in some domain like physics is significantly easier than aligning a super capable system that's working in contexts that involve modeling humans. Either contexts involving modeling humans because they're working on inherently social tasks, or contexts involving modeling humans because part of our safety method involves modeling humans.

And so I claim with moderate probability that it's significantly easier to align systems that don’t do any human-modeling. I also claim, with small but non-negligible probability, that one can save the world using systems that don’t model humans. Which is not the direction I want the conversation to really go, because I think that's not really a crux, because I don't expect my probability on “non-human-modeling systems can be used to save the world” to get so low that I conclude researchers shouldn’t think a lot about how to align and use such systems.

And I also have this observation: most AI safety plans feel either adjacent to a Paul Christiano IDA-type paradigm or adjacent to a Stuart Russell CIRL-type paradigm.

I think that two of the main places that AI safety has pushed the field have been towards human models, either because of IDA/debate-type reasons or because we want to not specify the goal, we want it to be in the human, we want the AI to be trying to do “what we want” in quotation marks, which requires a pointer to modeling us.

And I think that plausibly, these avenues are mistakes. And all of our attention is going towards them.

Rohin: Well, I can just respond to that maybe.

So I think I am with you on the claim that there are tasks which incentivize more human-modeling. And I wouldn't argue this with confidence or anything like that, but I'm also with you that plausibly, we should not build AI systems that do those tasks because it's close to manipulation of humans.

This seems pretty distinct from the algorithms that we use to build the AI systems. You can use iterated amplification to build a super capable physics AI system that doesn't know that much about humans and isn't trying to manipulate them.

Scott: Yeah. So you're saying that the IDA paradigm is sufficiently divorced from the dangerous parts of human modeling. And you're not currently making the same claim about the CIRL paradigm?

Rohin: It depends on how broadly you interpret the CIRL paradigm. The version where you need to use a CIRL setup to infer all of human values or something, and then deploy that agent to optimize the universe—I'm not making the claim for that paradigm. I don't like that paradigm. To my knowledge, I don't think Stuart likes that paradigm. So, yeah, I'm not making claims about that paradigm.

I think in the world where we use CIRL-style systems to do fairly bounded, narrow tasks—maybe narrow is not the word I want to use here—but you use that sort of system for a specific task (which, again, might only involve reasoning about physics), I would probably make the same claim there, yes. I have thought less about it.

I might also say I'm not claiming that there is literally no incentive to model humans in any of these cases. My claim is more like, "No matter what you do, if you take a very broad definition of “incentive,” there will always be an incentive to model humans because you need some information about human preferences for the AI system to do a thing that you like."

Scott: I think that it's not for me about an incentive to model humans and more just about, is there modeling of humans at all?

Well, no, so I'm confused again, because... Wait, you're saying there's not an incentive to model humans in IDA?

Rohin: It depends on the meaning of the word “incentive.”

Scott: Yeah. I feel like the word “incentive” is distracting or something. I think the point is that there is modeling of humans.

(reconsidering) There might not be modeling of humans in IDA. There's a sense in which, in doing it right, maybe there's not... I don't know. I don't think that's where the disagreement is, though, is it?

Eli: Can we back up for a second and check what's the deal with modeling humans? Presumably there's something at least potentially bad about that. There's a reason why we care about it. Yes?

Rohin: Yeah. I think the reason that I'm going with, which I feel fairly confident that Scott agrees with me on, at least as the main reason, is that if your AI system is modeling humans, then it is “easy” or “close” for it to be manipulating humans and doing something that we don't want that we can't actually detect ahead of time, and that therefore causes bad outcomes.

Scott: Yeah. I want to be especially clear about the metric of closeness being about our ability to oversee/pre-oversee a system in the way that we set up the incentives or something.

The closeness between modeling humans and manipulating humans is not in the probability that some system that’s doing one spontaneously changes to doing the other. (Even though I think that they are close in that metric.) It's more in the ability to be able to distinguish between the two behaviors.

And I think that there's a sense in which my model is very pessimistic about oversight, such that maybe if we really try for the next 20 years or something, we can distinguish between “model thinking superintelligent thoughts about physics” and “model thinking about humans.” And we have no hope of actually being able to distinguish between “model that's trying to manipulate the human” and “model that's trying to do IDA-type stuff (or whatever) the legitimate way.”

Rohin: Right. And I'm definitely a lot more optimistic about oversight than you, but I still agree directionally that it's harder to oversee a model when you're trying to get it to do things that are very close to manipulation. So this feels not that cruxy or something.

Scott: Yeah, not that cruxy... I don't know, I feel like I want to hear more about what Rohin thinks or something instead of responding to him. Yeah, not that cruxy, but what's the cruxy part, then? I feel like I'm already granting the thing about incentives and I'm not even talking about whether you have incentives to model humans. I'm assuming that there's systems that model humans, there's systems that don't model humans. And it's a lot easier to oversee the ones that don't.

2. Mutual information with humans

Rohin: I think my main claim is: the determining factor about whether a system is modeling humans or not—or, let's say there is an amount of modeling the system does. I want to talk about spectrums because I feel like you say too many wrong things if you think in binary terms in this particular case.

Scott: All right.

Rohin: So, there's an amount of modeling humans. Let's call it a scale—

Scott: We can have an entire one dimension instead of zero! (laughs)

Rohin: Yes, exactly. The art of choosing the right number of dimensions. (laughs) It's an important art.

So, we'll have it on a scale from 0 to 10. I think my main claim is that the primary determinant of where you are on the spectrum is what you are trying to get your AI system to do, and not the source of the feedback by which you train the system. And IDA is the latter, not the former.

Scott: Yeah, so I think that this is why I was distinguishing between the two kinds of “closeness.” I think that the probability of spontaneously manipulating humans is stronger when there are humans in the task than when there are just humans in the IDA/CIRL way of pointing at the task or something like that. But I think that the distance in terms of ability to oversee is not large...

Rohin: Yeah, I think I am curious why.

Scott: Yeah. Do I think that? (pauses)

Hm. Yeah. I might be responding to a fake claim right now, I'm not sure. But in IDA, you aren't keeping some sort of structure of HCH. You're not trying to do this, because they're trying to do oversight, but you're distilling systems and you're allowing them to find what gets the right answer. And then you have these structures that get the right answer on questions about what humans say, and maybe rich enough questions that contain a lot of needing to understand what's going on with the human. (I claim that yes, many domains are rich enough to require modeling the human; but maybe that's false, I don't know.)

And basically I'm imagining a black box. Not entirely black box—it's a gray box, and we can look at some features of it—but it just has mutual information with a bunch of stuff that humans do. And I almost feel like, I don't think this is the way that transparency will actually work, but I think just the question “is there mutual information between the humans and the models?” is the extent to which I expect to be able to do the transparency or something.

Maybe I'm claiming that in IDA, there's not going to be mutual information with complex models of humans.

Rohin: Yep. That seems right. Or more accurately, I would say it depends on the task that you're asking IDA to do primarily.

Scott: Yeah, well, it depends on the task and it depends on the IDA. It depends on what instructions you give to the humans.

Rohin: Yes.

Scott: If you imagined that there's some core of how to make decisions or something, and humans have access to this core, and this core is not specific to humans, but humans don't have access to it in such a way that they can write it down in code, then you would imagine a world in which I would be incentivized to do IDA and I would be able to do so safely (kind of) because I'm not actually putting mutual information with humans in the system. I'm just asking the humans to follow their “how to make decisions” gut that doesn't actually relate to the humans.

Rohin: Yes. That seems right.

What do I think about that? Do I actually think humans have a core?...

Scott: I almost wasn't even trying to make that claim. I was just trying to say that there exists a plausible world where this might be the case. I'm uncertain about whether humans have a core like that.

But I do think that in IDA, you're doing more than just asking the humans to use their “core.” Even if humans had a core of how to solve differential equations that they couldn't write down in code and then wanted to use IDA to solve differential equations via only asking humans to use their core of differential equations.

If this were the case… Yeah, I think the IDA is asking more of the human than that. Because regardless of the task, the IDA is asking the human to also do some oversight.

Rohin: … Yes.

Scott: And I think that the “oversight” part is capturing the richness of being human even if the task manages to dodge that.

And the analog of oversight in debates is the debates. I think it's more clear in debates because it's debate, but I think there's the oversight part and that's going to transfer over.

Rohin: That seems true. One way I might rephrase that point is that if you have a results-based training system, if your feedback to the agent is based on results, the results can be pretty independent of humans. But if you have a feedback system based not just on results, but also on the process by which you got the results—for whatever reason, it just happens to be an empirical fact about the world that there's no nice, correct, human-independent core of how to provide feedback on process—then it will necessarily contain a bunch of information about humans the agent will then pick up on.

Scott: Yeah.

Yeah, I think I believe this claim. I'm not sure. I think that you said back a claim that not only is what I said, but also I temporarily endorse, which is stronger. (laughs)

Rohin: Cool. (laughs) Yeah. It does seem plausible. It seems really rough to not be giving feedback on the process, too.

I will note that it is totally possible to do IDA, debate, and CIRL without process-level feedback. You just tell your humans to only write the results, or you just replace the human with an automated reward function that only evaluates the results if you can do that.

Scott: I mean...

Rohin: I agree, that's sort of losing the point of those systems in the first place. Well, maybe not CIRL, but at least IDA and debate.

Scott: Yeah. I feel like I can imagine "Ah, do something like IDA without the process-level feedback." I wouldn't even want to call debate “debate” without the process-level feedback.

Rohin: Yeah. At that point it's like two-player zero-sum optimization. It's like AlphaZero or something.

Scott: Yeah. I conjecture that the various people who are very excited about each of these paradigms would be unexcited about the version that does not have process-level feedback—

Rohin: Yes. I certainly agree with that.

I also would be pretty unexcited about them without the process-level feedback. Yeah, that makes sense. I think I would provisionally buy that for at least physics-style AI systems.

Scott: What do you mean?

Rohin: If your task is like, "Do good physics," or something. I agree that the process-level feedback will, like, 10x the amount of human information you have. (I made up the number 10.)

Whereas if it was something else, like if it was sales or marketing, I'd be like, "Yeah, the process-level feedback makes effectively no difference. It increases the information by, like, 1%."

Scott: Yeah. I think it makes very little difference in the mutual information. I think that it still feels like it makes some difference in some notion of closeness-to-manipulation to me.

So I'm in a position where if it's the case that one could do superhuman STEM work without humans, I want to know this fact. Even if I don't know what to do with it, that feels like a question that I want to know the answer to, because it seems plausibly worthwhile.

Rohin: Well, I feel like the answer is almost certainly yes, right? AlphaFold is an example of it.

Scott: No, I think that one could make the claim... All right, fine, I'll propose an example then. One could make the claim that IDA is a capability enhancement. IDA is like, "Let's take the humans' information about how to break down and solve problems and get it into the AI via the IDA process."

Rohin: Yep. I agree you could think of it that way as well.

Scott: And so one could imagine being able to answer physics questions via IDA and not knowing how to make an AI system that is capable of answering the same physics questions without IDA.

Rohin: Oh. Sure. That seems true.

Scott: So at least in principle, it seems like…

Rohin: Yeah.

Eli: There's a further claim I hear you making, Scott—and correct me if this is wrong—which is, “We want to explore the possibility of solving physics problems without something like IDA, because that version of how to solve physics problems may be safer.”

Scott: Right. Yeah, I have some sort of conjunction of “maybe we can make an AGI system that just solves physics problems” and also “maybe it's safer to do so.” And also “maybe AI that could only solve physics problems is sufficient to save the world.” And even though I conjuncted three things there, it's still probable enough to deserve a lot of attention.

Rohin: Yeah. I agree. I would probably bet against “we can train an AI system to do good STEM reasoning in general.” I'm certainly on board that we can do it some of the time. As I've mentioned, AlphaFold is an example of that. But it does feel like, yeah, it just seems really rough to have to go through an evolution-like process or something to learn general reasoning rather than just learning it from humans. Seems so much easier to do the latter that that's probably going to be the best approach in most cases.

Scott: Yeah. I guess I feel like some sort of automated working with human feedback—I feel like we can be inspired by humans when trying to figure out how to figure out some stuff about how to make decisions or something. And I'm not too worried about mutual information with humans leaking in from the fact that we were inspired by humans. We can use the fact that we know some facts about how to do decision-making or something, to not have to just do a big evolution.

Rohin: Yeah. I think I agree that you could definitely do a bit better that way. What am I trying to say here? I think I'm like, “For every additional bit of good decision-making you specify, you get correspondingly better speed-ups or something.” And IDA is sort of the extreme case where you get lots and lots of bits from the humans.

Scott: Oh! Yeah... I don't think that the way that humans make decisions is that much more useful as a thing to draw inspiration from than, like, how humans think ideal decision-making should be.

Rohin: Sure. Yeah, that's fair.

Scott: Yeah. It’s not obvious to me that you have that much to learn from humans relative to the other things in your toolbox.

Rohin: That's fair. I think I do disagree, but I wouldn't bet confidently one way or the other.
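
(For readers who want the term pinned down: the "mutual information" Scott and Rohin refer to in this section is, roughly, how much knowing one variable tells you about another. The snippet below is only a minimal sketch of the standard empirical estimate for discrete samples; it is not something from the conversation. The toy data and the names `human`, `copies_human`, and `ignores_human` are hypothetical stand-ins for "what a human would say" and "a feature read off a model.")

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical mutual information I(X; Y), in bits, from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)               # counts of each (x, y) pair
    px = Counter(x for x, _ in pairs)    # marginal counts of x
    py = Counter(y for _, y in pairs)    # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # Each term compares the joint probability to the product of marginals.
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# Hypothetical toy data (an assumption for illustration only).
human = ["yes", "yes", "no", "no"]          # what a human would say
copies_human = ["yes", "yes", "no", "no"]   # a model feature that tracks the human
ignores_human = ["a", "b", "a", "b"]        # a model feature unrelated to the human

mi_dependent = mutual_information(list(zip(human, copies_human)))
mi_independent = mutual_information(list(zip(human, ignores_human)))
print(mi_dependent, mi_independent)
```

On this toy data the feature that tracks the human's answers shares one full bit with them, and the unrelated feature shares none. Real systems would be far messier, and as Scott says above, this is probably not how transparency will actually work; the sketch only fixes what the quantity means.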

3. The default trajectory

Eli: I guess I want to back up a little bit and just take stock of where we are in this thread of the conversation. Scott made a conjunctive claim of three things. One is that it's possible to train an AI system to do STEM work without needing to have human models in the mix. Two, this might be sufficient for saving the world or otherwise doing pretty powerful stuff. And three, this might be substantially safer than doing it IDA-style or similar. I guess I just want to check, Rohin. Do you agree or disagree with each of those points?

Rohin: I probably disagree somewhat with all of them, but the upstream disagreement is how optimistic versus pessimistic Scott and I are about AI safety in general. If you're optimistic the way I am, then you're much more into trying not to change the trajectory of AI too much, and instead make the existing trajectory better.

Scott: Okay. (laughs) I don't know. I feel like I want to say this, because it's ironic or something. I feel like the existing framework of AI before AI safety got involved was “just try to solve problems,” and then AI safety's like, "You know what we need in our ‘trying to solve problems’? We need a lot of really meta human analysis." (laughs)

It does feel to me like the default path if nobody had ever thought about AI safety looks closer to the thing that I'm advocating for in this discussion.

Rohin: I do disagree with that.

Scott: Yeah. For the default thing, you do need to be able to say, "Hey, let's do some transparency and let's watch it, and let's make sure it's not doing the human-modeling stuff."

Rohin: I also think people just would have started using human feedback.

Scott: Okay. Yeah, that might be true. Yeah.

Rohin: It's just such an easy way to do things, relative to having to write down the reward function.

Scott: Yeah, I think you're right about that.

Rohin: Yeah. But I think part of it is that Scott's like, "Man, all the things that are not this sort of three-conjunct approach are pretty doomed, therefore it's worth taking a plan that might be unlikely to work because that's the only thing that can actually make a substantial difference." Whereas I'm like, "Man, this plan is unlikely to work. Let's go with a likely one that can cut the risk in half."

Scott: Yeah. I mean, I basically don't think I have any kind of plan that I think is likely to work… But I'm also not in the business of making plans. I want other people to do that. (laughs)

Rohin: I should maybe respond to Eli's original question. I think I am more pessimistic than Scott on each of the three claims, but not by much.

Eli: But the main difference is you're like, "But it seems like there's this alternative path which seems like it has a pretty good shot."

Rohin: I think Scott and I would both agree that the no-human-models STEM AI approach is unlikely to work out. And I’m like, "There's this other path! It's likely to work out! Let's do that." And Scott's like, "This is the only thing that has any chance of working out. Let's do this."

Eli: Yeah. Can you tell if it's a crux, whether or not your optimism about the no-human-models path would affect how optimistic you feel about the human-models path? It’s a very broad question, so it may be kind of unfair.

Rohin: It would seriously depend on why I became less optimistic about what I'm calling the usual path, the IDA path or something.

There are just a ton of beliefs that are correlated that matter. Again, I'm going to state things more confidently than I believe them for the sake of faster and clearer communication. There's a ton of beliefs like, "Oh man, good reasoning is just sort of necessarily messy and you're not going to get nice, good principles, and so you should be more in the business of incrementally improving safety rather than finding the one way to make a nice, sharp, clear distinction between things."

Then other parts are like, "It sure seems like you can't make a sharp distinction between ‘good reasoning’ and ‘what humans want.’" And this is another reason I'm more optimistic about the first path.

So I could imagine that I get unconvinced of many of these points, and definitely some of them, if I got unconvinced of those, I would be more optimistic about the path Scott's outlining.

4. Two kinds of risk

Scott: Yeah. I want to clarify that I'm not saying, like, "Solve decision theory and then put the decision theory into AI," or something like that, where you're putting the “how to do good reasoning” in directly. I think that I'm imagining that you have to have some sort of system that's learning how to do reasoning.

Rohin: Yep.

Scott: And I'm basically distinguishing between a system that's learning how to do reasoning while being overseen and kept out of the convex hull of human modeling versus… And there are definitely trade-offs here, because you have more of a daemon problem or something if you're like, "I'm going to learn how to do reasoning," as opposed to, "I'm going to be told how to do reasoning from the humans." And so then you have to search over this richer space or something of how to do reasoning, which makes it harder.

I'm largely not thinking about the capability cost there. There's a safety cost associated with running a big evolution to discover how reasoning works, relative to having the human tell you how reasoning works. But there's also a safety cost to having the human tell you how reasoning works.

And it's a different kind of safety cost in the two cases. And I'm not even sure that I believe this, but I think that I believe that the safety cost associated with learning how to do reasoning in a STEM domain might be lower than the safety cost associated with having your reasoning directly point to humans via the path of being able to have a thick line between your good and bad behavior.

Rohin: Yeah.

Scott: And I can imagine you saying, like, "Well, but that doesn't matter because you can't just say, 'Okay, now we're going to run the evolution on trying to figure out how to solve STEM problems,' because you need to actually have the capabilities or something."

Rohin: Oh. I think I lost track of what you were saying at the last sentence. Which probably means I failed to understand the previous sentences too.

Eli: I heard Scott to be saying—obviously correct me if I've also misapprehended you, Scott—that there's at least two axes that you can compare things along. One is “What's the safety tax? How much AI risk do you take on for this particular approach?” And this other axis is, “How hard is it to make this approach work?” Which is a capabilities question. And Scott is saying, "I, Scott, am only thinking about the safety tax question, like which of these is—”

Scott: “Safety tax” is a technical term that you're using wrong. But yeah.

Rohin: If you call it “safety difficulty”...

Scott: Or “risk factor.”

Eli: Okay. Thank you. “I'm only thinking about the risk factor, and not thinking about how much extra difficulty with making it work comes from this approach.” And I imagine, Rohin, that you're like, "Well, even—”

Rohin: Oh, I just agree with that. I'm happy making that decomposition for now. Scott then made a further claim about the delta between the risk factor from the IDA approach and the STEM AI approach.

Scott: Well, first I made the further claim that the risk from the IDA approach and the risk from the STEM approach are kind of different in kind. It's not like you can just directly compare them; they're different kinds of risks.

And so because of that, I'm uncertain about this, but I then further made the claim, at least in this current context, that the IDA approach might have more risk.

Rohin: Yeah. And what were the two risks? I think that's the part where I got confused.

Scott: Oh. One of the risks is we have to run an evolution because we're not learning from the humans, and so the problem's harder. And that introduces risks because it's harder for us to understand how it's working, because it's working in an alien way, because we had to do an evolution or something.

Rohin: So, roughly, we didn't give it process-level feedback, and so its process could be wild?

Scott: … Yeahh... I mean, I’m imagining... Yeah. Right.

Rohin: Cool. All right. I understand that one. Well, maybe not...

Scott: Yeah. I think that if you phrase it as "We didn't give it any process-level feedback," I'm like, “Yeah, maybe this one is obviously the larger risk.” I don't know.

But yeah, the risk associated with "You have to do an evolution on your system that’s thinking about STEM." Yeah, it's something like, "We had to do an evolution."

And the risk of the other one is that we made it such that we can't oversee it in the required ways. We can't run the policy, watch it carefully, and ensure it doesn’t reason about humans.

Rohin: Yeah.

Scott: And yeah, I think I'm not even making a claim about which of these is more or less risky. I'm just saying that they're sufficiently different risks, that we want to see if we can mitigate either of them.

Rohin: Yeah. That makes sense.

5. Scaling up STEM AI

Rohin: So the thing I wanted to say, and now I'm more confident that it actually makes sense to say, is that it feels to me like the STEM AI approach is lower-risk for a somewhat-smarter-than-human system, but if I imagine scaling up to arbitrarily smarter-than-human systems, I'm way more scared of the STEM AI version.

Scott: Yeah.

Eli: Can you say why, Rohin?

Rohin: According to me, the reason for optimism here is that your STEM AI system isn't really thinking about humans, doesn't know about them. Whatever it's doing, it's not going to successfully execute a treacherous turn, because successfully doing that requires you to model humans. That's at least the case that I would make for this being safe.

And when we're at somewhat superhuman systems, I'm like, "Yeah, I mostly buy that." But when we talk about the limit of infinite intelligence or something, then I'm like, "But it's not literally true that the STEM AI system has no reason to model the humans, to the extent that it has goals (which seems like a reasonable thing to assume)." Or at least a reasonable thing to worry about, maybe not assume.

It will maybe want more resources. That gives it an incentive to learn about the external world, which includes a bunch of humans. And so it could just start learning about humans, do it very quickly, and then execute a treacherous turn. That seems entirely possible, whereas with the IDA approach, you can hope that we have successfully instilled the notion of, "Yes, you are really actually trying to—”

Scott: Yeah. I definitely don’t want us to do this forever.

Rohin: Yeah.

Scott: I'm imagining—I don't know, for concreteness, even though this is a silly plan, there's the plan of “turn Jupiter into a supercomputer, invent some scanning tech, run a literal HCH.”

Rohin: Yeah.

Scott: And I don't know, this probably seems like it requires super-superhuman intelligence by your scale. Or maybe not.

Rohin: I don't really have great models for those scales.

Scott: Yeah, I don't either. It's plausible to me that... Yeah, I don't know.

Rohin: Yeah. I'm happy to say, yes, we do not need a system to scale to literally the maximal possible intelligence that is physically possible. And I do not—

Scott: Further, we don't want that, because at some point we need to get the human values into the future, right? (laughs)

Rohin: Yes, there is that too.

Scott: At some point we need to... Yeah. There is a sense in which it's punting the problem.

Rohin: Yeah. And to be clear, I am fine with punting the problem.

It does feel a little worrying though that we... Eh, I don't know. Maybe not. Maybe I should just drop this point.

6. Optimism about oversight

Scott: Yeah. I feel like I want to try to summarize you or something. Yeah, I feel like I don't want to just say this optimism thing that you've said several times, but it's like...

I think that I predict that we're in agreement about, "Well, the STEM plan and the IDA plan have different and incomparable-in-theory risk profiles." And I don't know, I imagine you as being kind of comfortable with the risk profile of IDA. And probably also more comfortable than me on the risk profile of basically every plan.

I think that I'm coming from the perspective, "Yep, the risk profile in the STEM plan also seems bad, but it seems higher variance to me."

It seems bad, but maybe we can find something adjacent that's less bad, or something like that. And so I suspect there’s some sort of, "I'm pessimistic, and so I'm looking for things such that I don't know what the risk profiles are, because maybe they'll be better than this option."

Rohin: Yeah.

Scott: Yeah. Yeah. So maybe we're mostly disagreeing about the risk profile of the IDA reference class or something—which is a large reference class. It's hard to bring in one thing.

Rohin: I mean, for what it's worth, I started out this conversation not having appreciated the point about process-level feedback requiring more mutual information with humans. I was initially making a stronger claim.

Scott: Okay.

Eli: Woohoo! An update.

Scott: (laughs) Yeah. I feel like I want to flag that I am suspicious that neither of us were able to correctly steel-man IDA, because of... I don't know, I'm suspicious that the IDA in Paul's heart doesn't care about whether you build your IDA out of humans versus building it out of some other aliens that can accomplish things, because it's really trying to... I don't know. That's just a flag that I want to put out there that it's plausible to me that both Rohin and I were not able to steel-man IDA correctly.

Rohin: I think I wasn't trying to. Well, okay, I was trying to steel-man it inasmuch as I was like, “Specifically the risks of human manipulation seem not that bad,” which, I agree, Paul might have a better case for than me.

Scott: Yeah.

Rohin: I do think that the overall case for optimism is something like, “Yep, you'll be learning, you'll be getting some mutual info about humans, but the oversight will be good enough that it just is not going to manipulate you.”

Scott: I also suspect that we're not just optimistic versus pessimistic on AI safety. I think that we can zoom in on optimism and pessimism about transparency, and it might be that “optimism versus pessimism on transparency” is more accurate.

Rohin: I think I'm also pretty optimistic about process-level feedback or something—which, you might call it transparency...

Scott: Yeah. It’s almost like when I'm saying “transparency,” I mean “the whole reference class that lets you do oversight” or something.

Rohin: Oh. In that case, yes, that seems right.

So it includes adversarial training, it includes looking at the concepts that the neural net has learned to see if there's a deception neuron that's firing, it includes looking at counterfactuals of what the agent would have done and seeing whether those would have been bad. All that sort of stuff falls under “transparency” to you?

Scott: When I was saying that sentence, kind of.

Rohin: Sure. Yeah, I can believe that that is true. I think there's also some amount of optimism on my front that the intended generalization is just the one that is usually learned, which would also be a difference that isn't about transparency.

Scott: Yeah.

Rohin: Oh, man. To everyone listening, don't quote me on that. There are a bunch of caveats about that.

Scott: (laughs)

Eli: It seems like this is a pretty good place to wrap up, unless we want to dive back in and crux on whether or not we should be pessimistic about transparency and generalization.

Scott: Even if we are to do that, I want to engage with the audience first.

Eli: Yeah, great. So let us open the fishbowl, and people can turn on their cameras and microphones and let us discuss.

7. Q&A: IDA and getting useful work from AI

Scott: Wow, look at all this chat with all this information.

Rohin: That is a lot of chat.

So, I found one of the chat questions, which is, “How would you build an AI system that did not model humans? Currently, to train the model, we just gather a big data set, like a giant pile of all the videos on YouTube, and then throw it at a neural network. It's hard to know what generalizations the network is even making, and it seems like it would be hard to classify generalizations into a ‘modeling humans’ bucket or not.”

Scott: Yeah. I mean, you can make a Go AI that doesn't model humans. It's plausible that you can make something that can do physics using methods that are all self-play-adjacent or something.

Rohin: In a simulated environment, specifically.

Scott: Right. Yeah.

Or even—I don't know, physics seems hard, but you could make a thing that tries to learn some math and answer math questions, you could do something that looks like self-play with how useful limits are or something in a way that's not... I don't know.

Rohin: Yeah.

Donald Hobson: I think a useful trick here is: anything that you can do in practice with these sorts of methods is easy to brute-force. You can do Go with AlphaGo Zero, and you can easily brute-force Go given unlimited compute. AlphaGo Zero is just a clever—

Scott: Yeah. I think there's some part where the human has to interact with it, but I think I'm imagining that your STEM AI might just be, "Let's get really good at being able to do certain things that we could be able to brute-force in principle, and then use this to do science or something.”

Donald Hobson: Yeah. Like a general differential equation solver or something.

Scott: Yeah. I don't know how to get from this to being able to get, I don't know, reliable brain scanning technology or something like that. But I don't know, it feels plausible that I could turn processes that are about how to build a quantum computer into processes that I could turn into math questions. I don't know.

Donald Hobson: “Given an arbitrary lump of biochemicals, find a way to get as much information as possible out of it.” Of a specific certain kind of information—not thermal noise, obviously.

Scott: Yeah. Also in terms of the specifications, you could actually imagine just being able to specify, similar to being able to specify Go, the problem of, "Hey, I have a decent physics simulation and I want to have an algorithm that is Turing-complete or something, and use this to try to make more…” I don't know. Yeah, it's hard. It's super hard.

Ben Pace: The next question in the chat is from Charlie Steiner saying, "What's with IDA being the default alternative rather than general value learning? Is it because you're responding to imaginary Paul, or is it because you think IDA is particularly likely or particularly good?"

Scott: I think that I was engaging with IDA especially because I think that IDA is among the best plans I see. I think that IDA is especially good, and I think that part of the reason why I think that IDA is especially good is because it seems plausible to me that basically you can do it without the core of humanity inside it or something.

I think the version of IDA that you're hoping to be able to do in a way that doesn't have the core of humanity inside the processes that it's doing seems plausible to exist. I like IDA.

Donald Hobson: By that do you mean IDA, like, trained on chess? Say, you take Deep Blue and then train IDA on that? Or do you mean IDA that was originally trained on humans? Because I think the latter is obviously going to have a core of humanity, but IDA trained on Deep Blue—

Scott: Well, no. IDA trained on humans, where the humans are trained to follow an algorithm that's sufficiently reliable.

Donald Hobson: Even if the humans are doing something, there will be side channels. Suppose the humans are solving equations—

Scott: Yeah, but you might be able to have the system watch itself and be able to have those side channels not actually... In theory, there are those side channels, and in the limit, you'd be worried about them, but you might be able to get far enough to be able to get something great without having those side channels actually interfere.

Donald Hobson: Fair enough.

Ben Pace: I'm going to ask a question that I want answered, and there's a good chance the answer is obvious and I missed it, but I think this conversation initially came from, Scott wrote a post called “Thoughts on Human Models,” where he was—

Scott: Ramana wrote the post. I ranted at Ramana, and then he wrote the post.

Ben Pace: Yes. Where he was like, "What if we look more into places that didn't rely on human models in a bunch of ways?" And then later on, Rohin was like, "Oh, I don't think I agree with that post much at all, and I don't think that's a promising area." I currently don't know whether Rohin changed his mind on whether that was a promising area to look at. I'm interested in Rohin's say on that.

Rohin: So A, I'll note that we mostly didn't touch on the things I wrote in those comments because Scott's reasons for caring about this have changed. So I stand by those comments.

Ben Pace: Gotcha.

Rohin: In terms of “Do I think this is a promising area to explore?”: eh, I'm a little bit more interested in it, mostly based on “perhaps process-level feedback is introducing risks that we don't need.” But definitely my all-things-considered judgment is still, "Nah, I would not be putting resources into this," and there are a bunch of other disagreements that Scott and I have that are feeding into that.

Ben Pace: And Scott, what are your current feelings towards... I can't remember, in the post you said "there's both human models and not human models", and then you said something like, "theory versus"... I can't remember what that dichotomy was.

But how much is your feeling a feeling of “this is a great area with lots of promise” versus “it's plausible and I would like to see some more work in it,” versus “this is the primary thing I want to be thinking about”?

Scott: So, I think there is a sense in which I am primarily doing science and not engineering. I'm not following the plan of “how do you make systems that are…” I'm a little bit presenting a plan, but I'm not primarily thinking about that plan as a plan, as opposed to trying to figure out what's going on.

So I'm not with my main attention focused on an area such that you can even say whether or not there are human models in the plan, because there's not a plan.

I think that I didn't update much during this conversation and I did update earlier. Where the strongest update I've had recently, and partially the reason why I was excited about this, was: The post that was written two years ago basically is just talking about "here's why human models are dangerous", and it's not about the thing that I think is the strongest reason, which is the ability to oversee and the ability to be able to implement the strategy “don't let the thing have human models at all.”

It's easy to separate “thinking about physics” from “thinking about manipulating humans” if you put “think about modeling humans” in the same cluster as “think about manipulating humans.” And it's a lot harder to draw the line if you want to put “think about modeling humans” on the other side, with “thinking about physics.”

And I think that this idea just didn't show up in the post. So to the extent that you want me to talk about change from two years ago, I think that the center of my reasoning has changed, or the part that I'm able to articulate has changed.

There was a thing during this conversation... I think that I actually noticed that I probably don't, even in expectation, think that... I don't know.

There was a part during this conversation where I backpedaled and was like, "Okay, I don't know about the expectation of the risk factor of these two incomparable risks. I just want to say: first, they're different in kind, therefore we should analyze both. They're different in kind, therefore if we're looking for a good strategy, we should keep looking at both because we should keep that OR gate in our abilities.

“And two: It seems like there might be some variance things, where we might be able to learn more about the risk factors on this other side, AI that doesn’t use human models. It might turn out that the risks are lower, so we should want to keep attention on both types of approach.”

I think maybe I would have said this before. But I think that I'm not confident about what side is better, and I'm more arguing from “let's try lots of different options because I'm not satisfied with any of them.”

Rohin: For what it's worth, if we don't talk just about "what are the risks" and we also include "what is the plan that involves this story", then I'm like, "the IDA one seems many orders of magnitude more likely to work". “Many” being like, I don’t know, over two.

Ben Pace: It felt like I learned something when Scott explained his key motivation being around the ability to understand what the thing is doing and blacklist or whitelist certain types of cognition, being easier when it's doing no human modeling, as opposed to when it is trying to do human modeling but you also don't want it to be manipulative.

Rohin: I think I got that from the comment thread while we were doing the comment thread, whenever that was.

Ben Pace: The other question I have in chat is: Steve—

Scott: Donald was saying mutual information is too low a bar, and I want to flag that I did not mean that mutual information is the metric that should be used to determine whether or not you're modeling humans. But the type of thing that I think that we could check for, has some common intuition with trying to check for mutual information.

Donald Hobson: Yeah, similar to mutual information. I'd say it was closer to mutual information conditional on a bunch of physics and other background facts about the universe.

Scott: I want to say something like, there's some concept that we haven't invented yet, which is like “logical mutual information,” and we don't actually know what it means yet, but it might be something.

Ben Pace: I'm going to go to the next question in chat, which is Steve Byrnes: “Scott and Rohin both agreed/posited right at the start that maybe we can make AIs that do tasks that do not require modeling humans, but I'm still stuck on that. An AI that can't do all of these things, like interacting with people, is seemingly uncompetitive and seemingly unable to take over the world or help solve all of AI alignment. That said, ‘use it to figure out brain scanning’ was a helpful idea by Scott just now. But I'm not 100% convinced by that example. Is there any other plausible example, path, or story?”

Scott: Yeah, I feel like I can't give a good story in terms of physics as it exists today. But I could imagine being in a world where physics was different such that it would help. And because of that I'm like, “It doesn't seem obviously bad in principle,” or something.

I could imagine a world where basically we could run HCH and HCH would be great, but we can't run HCH because we don't have enough compute power. But also, if you can invert this one hash you get infinity compute because that's how physics works.

Group: (laughs)

Scott: Then your plan is: let's do some computer science until we can invert this one hash and then we unlock the hyper-computers. Then we use the hyper-computers to run a literal HCH as opposed to just an IDA.

And that's not the world we live in, but the fact that I can describe that world means that it's an empirical about-the-world fact as opposed to an empirical about-how-intelligence-works fact, or something. So I'm, like, open to the possibility. I don’t know.

Gurkenglas: On what side of the divide would you put reasoning about decision theory? Because thinking about agents in general seems like it might not count as modeling humans, and thinking about decision theory is kind of all that's required in order to think about AI safety, and if you can solve AI safety that's the win condition that you were talking about.

But also it's kind of hard to be sure that an AI that is thinking about agency is not going to be able to manipulate us, even if it doesn't know that we are humans. It might just launch attacks against whoever is simulating its box. So long as that person is an agent.

Scott: Yeah. I think that thinking about agents in general is already scary, but not as scary. And I'm concerned about the fact that it might be that thinking about agents is just convergent and you're not going to be able to invert the hash that gives you access to the hyper-computer, unless you think about agents.

That's a particularly pessimistic world, where it's like, you need to have things that are thinking about agency and thinking about things in the context of an agent in order to be able to do anything super powerful.

I'm really uncertain about the convergence of agency. I think there are certain parts of agency that are convergent and there are certain parts of agency that are not, and I'm confused about what the parts of agency even are, enough that I kind of just have this one cluster. And I'm like, yeah, it seems like that cluster is kind of convergent. But maybe you can have things that are optimizing “figuring out how to divert cognitive resources” and that's kind of like being an agent, but it's doing so in such a way that's not self-referential or anything, and maybe that's enough? I don’t know.

There's this question of the convergency of thinking about agents that I think is a big part of a hole in my map. A hole in my prioritization map is, I'm not actually sure what I think about what parts of agency are convergent, and this affects a lot, because if certain parts of agency are convergent then it feels like it really dooms a lot of plans.

Ben Pace: Rohin, is there anything you wanted to add there?

Rohin: I don’t know. I think I broadly agree with Scott in that, if you're going to go down the path of “let's exclude all the types of reasoning that could be used to plan or execute a treacherous turn,” you probably want to put general agency on the outside of that barrier. That seems roughly right if you're going down this path.

And on the previous question of: can you do it, can this even be done? I'm super with, I think it was Steve, on these sorts of AI systems probably being uncompetitive. But I sort of have a similar position as Scott: maybe there's just a way that you can leverage great knowledge of just science in order to take over the world for example. It seems like that might be possible. I don't know, one way or the other. I would bet against, unless you have some pretty weird scenarios. But I wouldn't bet against at, like, 99% confidence. Or maybe I would bet against it that much. But there I'm getting to “that's probably a bit too high.”

And like, Scott is explicitly saying that most of these things are not high-probability things. They're just things that should be investigated. So that didn't seem like an avenue to push down.

Ben Pace: I guess I just disagree. I feel like if you were the first guy to invent nukes, and you invented them in 1800 or something, I feel I could probably tell various stories about having advanced technologies in a bunch of ways helping strategically—not even just weaponry.

Rohin: Yeah, I'd be interested in a story like this. I don't feel like you can easily make it. It just seems hard unless you're already a big power in the world.

Ben Pace: Really! Okay. I guess maybe I'll follow up with you on that sometime.

Joe Collman: Doesn't this slightly miss the point though, in that you'd need to be able to take over the world but then also make it safe afterwards, and the “make it safe afterwards” seems to be the tricky part there? The AI safety threat is still there if you through some conventional means—

Donald Hobson: Yeah, and you want to do this without massive collateral damage. If you invent nukes... If I had a magic map, where I pressed it and the city on it blew up or something, I can't see a way of taking over the world with that without massive collateral damage. Actually, I can't really see a way of taking over the world with that with massive collateral damage.

Ben Pace: I agree that the straightforward story with nukes sounds fairly unethical. I think there's probably ways of doing it that aren't unethical but are more about just providing sufficient value that you're sort of in charge of how the world goes.

Joe Collman: Do you see those ways then making it safer? Does that solve AI safety or is it just sort of, “I'm in charge but there's still the problem?”

Scott: I liked the proof of concept, even though this is not what I would want to do, between running literal HCH and IDA. I feel like literal HCH is just a whole lot safer, and the only difference between literal HCH and IDA is compute and tech.

8. Q&A: HCH’s trustworthiness

Joe Collman: May I ask what your intuition is about motivation being a problem for HCH? Because to me it seems like we kind of skip this and we just get into what it's computationally capable of doing, and we ignore the fact that if you have a human making the decisions, that they're not going to care about the particular tasks you give them necessarily. They're going to do what they think is best, not what you think is best.

Scott: Yeah, I think that that's a concern, but that's a concern in IDA too.

If I think that IDA is the default plan to compare things to, then I'm like, “Well, you could instead do HCH, which is just safer—if you can safely get to the point where you can get to HCH.”

Joe Collman: Right, yeah. I suppose my worry around this is more if we're finding some approximation to it and we're not sure we've got it right yet. Whereas I suppose if you're uploading and you're sure that the uploading process is actually—

Scott: Which is not necessarily the thing that you would do with STEM AI, it's just a proof of concept.

Donald Hobson: If you're uploading, why put the HCH structure on it at all? Why not just have a bunch of uploaded humans running around a virtual village working on AI safety? If you've got incredibly powerful nano-computers, can't you just make uploaded copies of a bunch of AI safety researchers and run them for a virtual hundred years, but real time five minutes, doing AI safety research?

Scott: I mean, yeah, that's kind of like the HCH thing. It's different, but I don't know...

Charlie Steiner: I'm curious about your intuitive, maybe not necessarily a probability, but gut feeling on HCH success chances, because I feel like it's quite unlikely to preserve human value.

Donald Hobson: I think a small HCH will probably work and roughly preserve human value if it's a proper HCH, no approximations. But with a big one, you're probably going to get ultra-viral memes that aren't really what you wanted.

Gurkenglas: We have a piece of evidence on the motivation problem for HCH and IDA, namely GPT. When it pretends to write for a human, we can easily make it pretend to be any kind of human that we want by simply specifying it in the prompt. The hard part is making the pretend human capable enough.

Donald Hobson: I'm not sure that it's that easy to pick what kind of human you actually get from the prompt.

Scott: I think that the version of IDA that I have any hope for has humans following sufficiently strict procedures and such that this isn't actually a thing that... I don’t know, I feel like this is a misunderstanding of what IDA is supposed to do.

I think that you're not supposed to get the value out of the individual pieces in IDA. You're supposed to use that to be able to… Like, you have some sort of task, and you're saying, "Hey, help me figure out how to do this task." And the individual humans inside the HCH/IDA system are not like, "Wait, do I really want to do this task?". The purpose of the IDA is to have the capabilities come from a human-like method as opposed to just, like, a large evolution, so that you can oversee it and trust where it’s coming from and everything.

Donald Hobson: Doesn't that mean that the purpose is to have the humans provide all the implicit values so obvious no one bothered to mention them? So if you ask IDA to put two strawberries on the plate, it's the humans’ implicit values that do it in a way that doesn't destroy the world.

Scott: I think that it's partially that. I think it's not all that. I think that you only get the basics of human values out of the individual components of the HCH or IDA, and the actual human values are coming from the humans using the IDA system.

I think that IDA as intended should work almost as well if you replace all the humans with well-intentioned aliens.

Ben Pace: Wait, can you say that again? That felt important to me.

Scott: I think, and I might be wrong about this, that the true IDA, the steel-man_Scott IDA, should work just as well or almost as well if you replaced all of the humans with well-intentioned aliens.

Donald Hobson: Well-intentioned aliens wouldn't understand English. Your IDA is getting its understanding of English from the humans in it.

Scott: No, I think that if I wanted to use IDA, I might use IDA to for example solve some physics problems, and I would do so with humans inside overseeing the process. There's not supposed to be a part of the IDA that’s making sure that human values are being brought into our process of solving these physics problems. The IDA is just there to be able to safely solve physics problems.

And so if you replaced it with well-intentioned aliens, you still solve the physics problems. And if you direct your IDA towards a problem like “solve this social problem”, you need information about social dynamics in your system, but I think that you should think of that as coming from a different channel than the core of the breaking-problems-up-and-stuff, and the core of the breaking-problems-up-and-stuff should be thought of as something that could be done just as well by well-intentioned aliens.

Donald Hobson: So you've got humans that are being handed a social problem and told to break it up, that is the thing you're imitating, but the humans are trying to pretend that they know absolutely nothing about how human societies work except for what's on this external piece of paper that you handed them.

Scott: It's not necessarily a paper. You do need some human-interacting-with-human in order to think about social stuff or something... Yeah, I don't know what I'm saying.

It's more complicated on the social thing than on the physics thing. I think the physics thing should work just as well. I think the physics thing should work just as well with well-intentioned aliens, and the human thing should work just as well if you have an HCH that's built out of humans and well-intentioned aliens and the humans never do any decomposition, they only ask the well-intentioned aliens questions about social facts or something. And the process that's doing decomposition is the well-intentioned aliens' process of doing decomposition.

I also might be wrong.

Ben Pace: I like this thread. I also kind of liked it when Joe Collman asked questions that he cared about. If you had more questions you cared about, Joe, I would be interested in you asking those.

Joe Collman: I guess, sure. I was just thinking again with the HCH stuff—I would guess, Scott, that you probably think this isn't a practical issue, but I would be worried about the motivational side of HCH in the limit of infinite training of IDA.

Are you thinking that, say, you've trained a thousand iterations of IDA, you're training the thousand-and-first, you've got the human in there. They've got this system that's capable of answering arbitrarily amazing, important questions, and you feed them the question “Do you like bananas?” or something like that, some irrelevant question that's just unimportant. Are we always trusting that the human involved in the training will precisely follow instructions? Are you seeing that as a non-issue, that we can just say, well that's a sort of separate thing—

Scott: I think that you could make your IDA kind of robust to “sometimes the humans are not following the instructions”, if it's happening a small amount of the time.

Joe Collman: But if it's generalizing from that into other cases as well, and then it learns, “OK, you don't follow instructions a small amount of the time…”—if it generalizes from that, then in any high-stakes situation it doesn't follow instructions, and if we're dealing with an extremely capable system, it might see every situation as a high-stakes situation because it thinks, “I can answer your question or I can save the world. Then I'm going to choose to save the world rather than answering your question directly, providing useful information.”

Scott: Yeah, I guess if there's places where the humans reliably... I was mostly just saying, if you're collecting data from humans and there's noise in your data from humans, you could have systems that are more robust to that. But if you have it be the case that humans reliably don't follow instructions on questions of type X, then the thing that you're training is a thing that reliably doesn't follow instructions on questions of type X. And if you have assumptions about what the decomposition is going to do, you might be wrong based on this, and that seems bad, but...

Rohin: I heard Joe as asking a slightly different question, which is: “In not HCH but IDA, which is importantly—”

Joe Collman: To me it seems to apply to either. In HCH it seems clear to me that this will be a problem, because if you give it any task and the task is not effectively “Give me the most valuable information that you possibly can,” then the morality of the H—assuming you've got an H that wants the best for the world—then if you ask it, “Do you like bananas?”, it's just going to give you the most useful information.

Rohin: But an HCH is made up of humans. It should do whatever humans would do. Humans don't do that.

Joe Collman: No, it's going to do what the HCH tree would do. It's not going to do what the top-level human would do.

Rohin: Sure.

Joe Collman: So the top-level human might say, “Yes, I like bananas,” but the tree is going to think, “I have infinite computing power. I can tell you how to solve the world's problems, or I can tell you, ‘Yes, I like bananas.’ Of those two, I'm going to tell you how to solve the world's problems, not answer your question.”

Rohin: Why?

Joe Collman: Because that's what a human would do—this is where it comes back into the IDA situation, where I'm saying, eventually it seems... I would assume probably this isn't going to be a practical problem and I would imagine your answer would be, “Okay, maybe in the limit this…”

Rohin: No, I'm happy with focusing on the limit. Even eventually, why does a human do this? The top-level human.

Joe Collman: The top-level human: if it's me and you've asked me a question “Do you like bananas?”, and I'm sitting next to a system that allows me to give you information that will immediately allow you to take action to radically improve the world, then I'm just not going to tell you, “Yes, I like bananas”, when I have the option to tell you something that's going to save lives in the next five minutes or otherwise radically improve the world.

It seems that if we assume we've got a human that really cares about good things happening in the world, and you ask a trivial question, you're not going to get an answer to that question, it seems to me.

Rohin: Sure, I agree if you have a human who is making decisions that way, then yes, that's the decision that would come out of it.

Joe Collman: The trouble is, isn't that the kind of human that we want in these situations—is one precisely that does make decisions that way?

Rohin: No, I don't think so. We don't want that. We want a human who's trying to do what the user wants them to do.

Joe Collman: Right. The thing is, they have to be applying their own... HCH is basically a means of getting enlightened judgment. So the enlightened judgment has to come from the HCH framework rather than from the human user.

Rohin: I mean, I think if your HCH is telling the user something and the user is like, "Dammit, I didn't like this", then your HCH is not aligned with the user and you have failed.

Joe Collman: Yeah, but the HCH is aligned with the enlightenment. I suppose the thing I'm saying is that the enlightened judgment as specified by HCH will not agree with me myself. An HCH of me—its enlightened judgment is going to do completely different things than the things I would want. So, yes, it's not aligned, but it's more enlightened than me. It's doing things that on “long reflection” I would agree with.

Rohin: I mean, sure, if you want to define terms that way then I would say that HCH is not trying to do the enlightened judgment thing, and if you've chosen a human who does the enlightened judgment thing you've failed.

Donald Hobson: Could we solve this by just never asking HCH trivial questions?

Joe Collman: The thing is, this is only going to be a problem where there's a departure between solving the task that's been assigned versus doing the thing that sort of maximally improves the world. Obviously if those are both the same, if you've asked wonderfully important questions or if you asked the HCH system “What is the most important question I can ask you?” and then you ask that, then it's all fine—

Scott: I think that most of the parts of the HCH will have the correct belief that the way to best help the world is to follow instructions corrigibly or something.

Joe Collman: Is that the case though, in general?

Scott: Well, it's like, if you're part of this big network that's trying to answer some questions, it's—

Joe Collman: So are you thinking from a UDT point of view, if I reasoned such that every part of this network is going to reason in the same way as me, therefore I need the network as a whole to answer the questions that are assigned to it or it's not going to come up with anything useful at all. Therefore I need to answer the question assigned to me, otherwise the system is going to fail. Is that the kind of…?

Scott: Yeah, I don't even think you have to call on UDT in order to get this answer right. I think it's just: I am part of this big process and it's like, "Hey solve this little chemistry problem," and if I start trying to do something other than solve the chemistry problem I'm just going to add noise to the system and make it less effective.

Joe Collman: Right. That makes sense to me if the actual top-level user has somehow limited the output so that it can only respond in terms of solving a chemistry problem. Then I guess I'd go along with you. But if the output is just free text output and I'm within the system and I get the choice to solve the chemistry problem as assigned, or I have the choice to provide the maximally useful information, just generally, then I'm going to provide the maximally useful information, right?

Scott: Yeah, I think I wouldn't build an HCH out of you then. (laughs)

Rohin: Yeah, that’s definitely what I’m thinking.

Scott: I think that I don't want to have the world-optimization in the HCH in that way. I want to just... I think that the thing that gives us the best shot is to have a corrigible HCH. And so—

Joe Collman: Right, but my argument is basically that eventually the kind of person you want to put in is not going to be corrigible. With enough enlightenment—

Rohin: I think I continue to be confused about why you assume that we want to put a consequentialist into the HCH.

Joe Collman: I guess I assume that with sufficient reflection, pretty much everyone is a consequentialist. I think the idea that you can find a human who at least is reliably going to stay not a consequentialist, even after you amplify their reasoning hugely, that seems a very suspicious idea to me.

Essentially, let's say for instance, due to the expansion of the universe we're losing—I calculated this very roughly, but we're losing about two stars per second, in the amount of the universe we can access, something like that.

So in theory, that means every second we delay in space colonization and the rest of it, is a huge loss in absolute terms. And so, any system that could credibly make the argument, “With this approach we can do this, and we can not lose these stars”—it seems to me that if you're then trading that against, “Oh, you can do this, and you can answer this question, solve this chemistry problem, or you can improve the world in this huge way”...

Rohin: So, if your claim is like, there are two people, Alice and Bob. And if you put Alice inside a giant HCH, and really just take the limit all the way to infinity, then that HCH is not going to be aligned with Bob, because Alice is probably going to be a consequentialist. Then yes, sure, that seems probably true.

Joe Collman: I'm saying it's not really going to be aligned with Alice in the sense that it will do what Alice wants. It will do what the HCH of Alice wants, but it won't do what Alice wants.

Ben Pace: Can I move to Donald’s question?

Rohin: Sure.

Donald Hobson: I was just going to say that, I think the whole reason that a giant HCH just answering the question is helpful: Suppose that the top-level question is “solve AI alignment.” And somewhere down from that you get “design better computers.” And somewhere down from that you get “solve this simple chemistry problem to help the [inaudible] processes or whatever.”

And so, all of the other big world-improvement stuff is already being done above you in some other parts of the tree. So, literally the best thing you can do to help the world, if you find yourself down at the bottom of the HCH, is just solving that simple chemistry problem. Because all the other AI stuff is being done by some other copy of you.

Joe Collman: That's plausible, sure. Yeah, if you reason that way. The only thing that I'm claiming quite strongly is that eventually, with suitable amplification, everyone is going to come around to the idea, “I should give the response that is best for the world,” or something like that.

So, if your chain of reasoning takes you to “solving that chemistry problem is making a contribution that at the top level is best for the world,” then sure, that's plausible. It's hard to see exactly how you reliably get to draw that conclusion, that solving the question you've been asked is the best for the world. But yeah.

Rob Miles: Is it necessary for HCH that the humans in it have no idea where in the tree they are? Or could you pass in some context that just says, "Solve this chemistry problem that will improve—"

Scott: You actually want to send that context in. There's discussion about one of the things that you might do in solving HCH, which is: along with the problems, you pass in an ordinal that says how much of HCH's resources you're allowed to use. And so you say, like, “Hey, solve this problem. And you get 10¹⁰⁰ resources.” And then you're like, “Here, sub-routine, solve this problem. You get 10¹⁰⁰ / 2 resources.” And then, once you get down to zero resources, you aren't allowed to make any further calls.

And you could imagine working this into the system, or you could imagine just corrigible humans following this instruction, and saying, “Whenever you get input in some amount of resources, you don't spend more than that.”

And this resource kind of tells you some information about where you are in the tree. And not having this resource is a bad thing. The purpose of this resource is to make it so that there's a unique fixed point of HCH. And without it, there's not necessarily a unique fixed point. Which is basically to say that, I think that individual parts of an HCH are not intended to be fully thinking about all the different places in the tree, because they're supposed to be thinking locally, because the tree is big and complex. But I think that there's no “Let's try to hide information about where you are in the tree from the components.”
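
The resource-passing scheme Scott describes can be sketched roughly as follows. This is only a toy illustration under stated assumptions: the budget is a plain integer rather than an ordinal, each sub-call gets exactly half the caller's budget, and the names `hch`, `Policy`, `always_delegate`, and `count_depth` are hypothetical and not from the conversation.

```python
from typing import Callable, Optional

# A node's policy sees the question, its remaining budget, and an `ask`
# function for delegating subquestions (None once the budget is zero).
Policy = Callable[[str, int, Optional[Callable[[str], str]]], str]


def hch(question: str, budget: int, policy: Policy) -> str:
    """Answer `question` with a tree of identical nodes, each running `policy`."""
    if budget > 0:
        # Every sub-call receives half the caller's budget (integer division),
        # so the budget strictly decreases and the recursion must bottom out.
        def ask(subquestion: str) -> str:
            return hch(subquestion, budget // 2, policy)
    else:
        ask = None  # zero resources: no further calls are allowed
    return policy(question, budget, ask)


def always_delegate(question: str, budget: int, ask) -> str:
    """A deliberately unhelpful policy that passes the question on whenever it can."""
    if ask is None:
        return f"answered '{question}' locally"
    return ask(question)


def count_depth(question: str, budget: int, ask) -> str:
    """Report how many nodes lie on the chain from this node to the bottom."""
    if ask is None:
        return "1"
    return str(1 + int(ask(question)))
```

Even a policy that always delegates terminates here, because the budget shrinks on every call. In this toy version, that well-foundedness is what makes the tree's answer well-defined without a separate fixed-point argument.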

Rob Miles: It feels like that would partly solve Joe's problem. If you're given a simple chemistry problem and 10¹⁰⁰ resources, then you might be like, "This is a waste of resources. I'm going to do something smarter." But if you're being allocated some resources that seem reasonable for the difficulty of the problem you're being set, then you can assume that you're just part of the tree—

Scott: Well, if you started out with—

Ben Pace: Yeah, Rob is right that in real life, if you give me a trillion dollars to solve a simple chemistry problem, I'll primarily use the resources to do cooler shit. (laughs)

Joe Collman: I think maybe the difficulty might be here that if you're given a problem where the amount of resources is about right for the difficulty of solving the problem, but the problem isn't actually important.

So obviously “Do you like bananas?” is a bad example, because you could set up an IDA training scheme that learns how many resources to put into it. And there the top-level human could just reply, "Yes," immediately. And so, you don't—

Scott: I'm definitely imagining the resources are, it's reasonable to give more resources than you need. And then you just don't even use them.

Joe Collman: Right, yeah. But I suppose, just to finish off the thing that I was going to say is that yes, it still seems, with my kind of worry, it's difficult if you have a question which has a reasonable amount of resources applied to it, but is trivial, is insignificant. It seems like the question wouldn't get answered then.

Charlie Steiner: I want to be even more pessimistic, because of the unstable gradient problem, or any fixed point problem. In the limit of infinite compute or infinite training, we have at each level some function being applied to the input, and then that generates an output. It seems like you're going to end up outputting the thing that most reliably leads to a cycle that outputs itself, or eventually outputs itself. I don't know. This is similar to what Donald said about super-virulent memes.

9. Q&A: Disagreement and mesa-optimizers

Ray Arnold: So it seems like a lot of stuff was reducing to, “Okay, are we pessimistic or not pessimistic about safe AGI generally being hard?”

Rohin: Partly that, and partly me being plan-oriented, and Scott being science-oriented.

Scott: I mean, I'm trying to be plan-oriented in this conversation, or something.

Ray Arnold: It seems like a lot of these disagreements reduce to this ur-disagreement of, "Is it going to be hard or easy?", or a few different axes of how it's going to be hard or easy. And I'm curious, how important is it right now to be prioritizing ability to resolve the really deep, gnarly disagreements, versus just “we have multiple people with different paradigms, and maybe that's fine, and we hope one of them works out”?

Eli: I'll say that I have updated downward on thinking that that's important.

Ray Arnold: (laughs) As Mr. Double Crux.

Eli: Yeah. It seems like things that cause concrete research progress are good, and conversations like this one do seem to cause insights that are concrete research progress, but...

Ray Arnold: Resolving the disagreement isn't concrete research progress?

Eli: Yeah. It's like, “What things help with more thinking?” Those things seem good. But I used to think there were big disagreements—I don't know, I still have some probability mass on this—but I used to think there were big disagreements, and there was a lot of value on the table of resolving them.

Gurkenglas: Do you agree that if you can verify whether a system is thinking about agents, you could also verify whether it has a mesa-optimizer?

Scott: I kind of think that systems will have mesa-optimizers. There's a question of whether or not mesa-optimizers will be there explicitly or something, but that kind of doesn't matter.

It would be nice if we could understand where the base level is. But I think that the place where the base level is, we won't even be able to point a finger at what a “level” is, or something like that.

And we're going to have processes that make giant spaghetti code that we don't understand. And that giant spaghetti code is going to be doing optimization.

Gurkenglas: And therefore we won’t be able to tell whether it's thinking about agency.

Scott: Yeah, I don't know. I want to exaggerate a little bit less what I'm saying. Like, I said, "Ah, maybe if we work at it for 20 years, we can figure out enough transparency to be able to distinguish between the thing that's thinking about physics and the thing that's thinking about humans." And maybe that involves something that's like, “I want to understand logical mutual information, and be able to look at the system, and be able to see whether or not I can see humans in it.” I don't know. Maybe we can solve the problem.

Donald Hobson: I think that actually conventional mutual information is good enough for that. After all, a good bit of quantum random noise went into human psychology.

Scott: Mutual information is intractable. You need some sort of, “Can I look at the system…” It's kind of like this—

Donald Hobson: Yeah. Mutual information [inaudible] Kolmogorov complexity, sure. But computable approximations to that.

Scott: I think that there might be some more information on what computable approximations, as opposed to just saying “computable approximations,” but yeah.

Charlie Steiner: I thought the point you were making was more like “mutual information requires having an underlying distribution.”
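
Charlie's point can be made concrete: Shannon mutual information is a property of a joint distribution, so any estimate has to start by choosing one. Below is a minimal sketch, assuming small discrete variables and using the empirical distribution of observed samples. The function name and toy data are hypothetical, and this plug-in estimate says nothing about the “logical” mutual information Scott mentions.

```python
import math
from collections import Counter


def empirical_mutual_information(pairs):
    """Plug-in estimate of I(X;Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)                   # counts of each (x, y) pair
    count_x = Counter(x for x, _ in pairs)   # marginal counts of x
    count_y = Counter(y for _, y in pairs)   # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


# Perfectly correlated bits share one bit of information;
# independent uniform bits share none.
correlated = [(0, 0), (1, 1)] * 50
independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25
```

The estimate depends entirely on which variables are sampled and how they are discretized, which is one reason the choice of underlying distribution carries so much weight.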

Scott: Yeah. That too. I'm conflating that with “the underlying distributions are tractable,” or something.

But I don't know. There's this thing where you want to make your AI system assign credit scores, but you want it to not be racist. And so, you check whether, by looking at the last layer, you can determine the race of the participants. And then you optimize against that in your adversarial network, or something like that.

And there's like that, but for thinking about humans. And much better than that.
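
The "check" half of the credit-score example can be sketched as a probe: train a small classifier to predict the protected attribute from last-layer features and see how well it does. This is a minimal sketch only, assuming a logistic-regression probe trained by plain gradient descent on hand-made toy features. The names `train_probe` and `probe_accuracy` are hypothetical, and the adversarial step that optimizes the main model against the probe is not shown.

```python
import math


def train_probe(features, labels, steps=200, lr=0.5):
    """Fit a logistic-regression probe by batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(label = 1)
            for i in range(dim):
                grad_w[i] += (p - y) * x[i] / n
            grad_b += (p - y) / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def probe_accuracy(features, labels, w, b):
    """Fraction of examples where the probe recovers the attribute."""
    correct = 0
    for x, y in zip(features, labels):
        z = b + sum(wi * xi for wi, xi in zip(w, x))
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)


# Toy "last layer" features. In `leaky`, the first coordinate encodes the attribute;
# in `clean`, every feature vector occurs once with each attribute value.
leaky = [[1.0, 0.3], [1.0, -0.3], [-1.0, 0.3], [-1.0, -0.3]]
leaky_attr = [1, 1, 0, 0]
clean = [[1.0, 0.3], [1.0, 0.3], [-1.0, -0.3], [-1.0, -0.3]]
clean_attr = [1, 0, 1, 0]
```

Probe accuracy well above chance suggests the attribute is linearly recoverable from the representation. Chance-level accuracy only shows that this particular probe failed, not that the information is absent.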

Gurkenglas: Let's say we have a system that we can with our interpretability tools barely tell has some mesa-optimizer at the top level, that's probably doing some further mesa-optimization in there. Couldn't we then make that outer mesa-optimization explicit, part of the architecture, but then train the whole thing anew and repeat?

Scott: I feel like this process might cause infinite regress, but—

Gurkenglas: Surely every layer of mesa-optimization stems from the training process discovering either a better prior or a better architecture. And so we can increase the performance of our architecture. And surely that process converges. Or like, perhaps we just—

Scott: I’m not sure that just breaking it down into layers is going to hold up as you go into the system.

It's like: for us, maybe we can think about things with meta-optimization. But your learned model might be equivalent to having multiple levels of mesa-optimization, while actually, you can't clearly break it up into different levels.

The methods that are used by the mesa-optimizers to find mesa-optimizers might be sufficiently different from our methods, that you can't always just... I don't know.

Charlie Steiner: Also Gurkenglas, I'm a little pessimistic about baking a mesa-optimizer into the architecture, and then training it, and then hoping that that resolves the problem of distributional shift. I think that even if you have your training system finding these really good approximations for you, even if you use those approximations within the training distribution and get really good results, it seems like you're still going to get distributional shift.

Donald Hobson: Yeah. I think you might want to just get the mesa-optimizers out of your algorithm, rather than having an algorithm that's full of mesa-optimizers, but you've got some kind of control over them somehow.

Gurkenglas: Do you all think that by your definitions of mesa-optimization, AlphaZero has mesa-optimizers?

Donald Hobson: I don't know.

Gurkenglas: I'm especially asking Scott, because he said that this is going to happen.

Scott: I don't know what I think. I feel not-optimistic about just going deeper and deeper, doing mesa-optimization transparency inception.
