MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models"

Rob Bensinger

Below, I’ve copied comments left by MIRI researchers Eliezer Yudkowsky and Evan Hubinger on March 1–3 on a draft of Ajeya Cotra’s "Case for Aligning Narrowly Superhuman Models." I've included back-and-forths with Cotra, and interjections by me and Rohin Shah.

The section divisions below correspond to the sections in Cotra's post.

0. Introduction

How can we train GPT-3 to give “the best health advice it can give” using demonstrations and/or feedback from humans who may in some sense “understand less” about what to do when you’re sick than GPT-3 does?

Eliezer Yudkowsky: I've had some related conversations with Nick Beckstead. I'd be hopeful about this line of work primarily because I think it points to a bigger problem with the inscrutable matrices of floating-point numbers, namely, we have no idea what the hell GPT-3 is thinking and cannot tell it to think anything else. GPT-3 has a great store of medical knowledge, but we do not know where that medical knowledge is; we do not know how to tell it to internally apply its medical knowledge rather than applying other cognitive patterns it has stored. If this is still the state of opacity of AGI come superhuman capabilities, we are all immediately dead. So I would be relatively more hopeful about any avenue of attack for this problem that used anything other than an end-to-end black box - anything that started to address, "Well, this system clearly has a bunch of medical knowledge internally, can we find that knowledge and cause it to actually be applied" rather than "What external forces can we apply to this solid black box to make it think more about healthcare?"

Evan Hubinger: +1 I continue to think that language model transparency research is the single most valuable current research direction within the class of standard ML research, for similar reasons to what Eliezer said above.

Ajeya Cotra: Thanks! I'm also excited about language model transparency, and would love to find ways to make it more tractable as a research statement / organizing question for a field. I'm not personally excited about the connotations of transparency because it evokes the neuroscience-y interpretability tools, which don't feel scalable to situations when we don't get the concepts the model is using, and I'm very interested in finding slogans to keep researchers focused on the superhuman stuff.

Ajeya Cotra: I've edited the description of the challenge to emphasize human feedback less. It now reads "How can we get GPT-3 to give “the best health advice it can give” when humans in some sense “understand less” about what to do when you’re sick than GPT-3 does? And in that regime, how can we even tell/verify that it’s “doing the best it can”?"

Rob Bensinger: Nate and I tend to talk about "understandability" instead of "transparency" exactly because we don't want to sound like we're talking about normal ML transparency work.

Eliezer Yudkowsky: Other possible synonyms: Clarity, legibility, cognitive readability.

Ajeya Cotra: Thanks all -- I like the project of trying to come up with a good handle for the kind of language model transparency we're excited about (and have talked to Nick, Evan, etc about it too) but I think I don't want to push it in this blog post right now because I haven't hit on something I believe in and I want to ship this.

In the end, we probably want to find ways to meaningfully supervise (or justifiably trust) models that are more capable than ~all humans in ~all domains.

Eliezer Yudkowsky: (I think you want an AGI that is superhuman in engineering domains and infrahuman in human-modeling-and-manipulation if such a thing is at all possible.)

Ajeya Cotra: Fair point, added a footnote:

“Though if we could pull off a path where we build an AI system that is superhuman in certain engineering capabilities but not yet human-level in modeling and manipulating people, and use that system to cut down on x-risk from other AI projects without having to figure out how to supervise arbitrary superhuman models, that could be really good.”

1. What aligning narrowly superhuman models could look like

First of all, it’s important to note that not all narrowly superhuman models are going to be equally interesting as alignment case studies. AlphaGoZero (AGZ) is narrowly superhuman in an extremely strong sense: it not only makes Go moves better than the moves made by top human players, but also probably makes moves that top players couldn’t even reliably recognize as good. But there isn’t really an alignment problem for Go: a precise, algorithmically-generated training signal (the win/loss signal) is capable of eliciting the “full Go-playing potential” of AGZ given enough training, and would keep working even as the model got much bigger.

Eliezer Yudkowsky: Having access to an incorruptible ground truth solves some of the alignment problems but not all of them, in particular inner alignment problems. In the limit of optimizing infinitely hard on logical Tic-Tac-Toe, it won't kill you because it hits a capability bound early and stops; in the limit of optimizing infinitely hard on any real-world problem, there's no capability bound lower than extreme superintelligence so the thing inside keeps getting smarter and kills you. It is not obvious to me where logical Go falls on this spectrum, or rather, it is obvious to me that the answer is "it depends on the outer optimization method". (That is, some ways of optimizing a system to play Go will create an inner optimizer that plays Go and that will kill you; some ways might create a system that just played arbitrarily good Go.)

Ajeya Cotra: Good point, changed to:

"But there isn’t really an outer alignment problem for Go: a precise, algorithmically-generated training signal (the win/loss signal) is capable of eliciting the “full Go-playing potential” of AGZ given enough training, and would keep working even as the model got much bigger (although at a certain scale inner alignment issues may crop up)."

Choose a helpful “fuzzy” task (e.g. summarization, question-answering, advice-giving, story-writing) for which we have suggestive evidence that makes us suspect a state-of-the-art model has the capacity to significantly outperform some reference set of humans (e.g. Mechanical Turk workers) given the right training signal. Then,

Eliezer Yudkowsky: I'd feel more cheerful about an open call "Yo, try to do something about the fact that text transformers seem like they in some sense contain the capability to solve this problem but we can't make them use that capability to [do] what we want" than "Yo, retrain GPT-3 to do this using an outer training signal in a way that scales well with task and model complexity". The latter call is tuned to one particular approach to the first problem, which some, such as myself, would worry is not the most promising approach in the long run.

Evan Hubinger: +1 I think this is very similar to my objection that a focus on solving downstream tasks is less likely to translate into important insights than focusing on the core problem directly—though having talked with Ajeya in the comment thread at the bottom, I think I'm now convinced that the alternative pitch of “get language models to actually answer questions honestly to the best of their ability” also has some problems getting people to work on the important part of the problem.

Ajeya Cotra: Added some language to the bottom of this section in purple to emphasize that this is just one formulation.

I want a concrete formulation and specific project ideas in this post, even if they're not the best ones, but I agree that the broader question of "What the hell do we do about situations where models can do stuff to help us but don't want to?" is where the focus should be.

1.2. What kinds of projects do and don’t “count”

I think that if you can get the model to achieve superhuman performance at some task without collecting any human feedback or human demonstrations, the task is probably not “fuzzy” enough.

Eliezer Yudkowsky: What we have with GPT-3 is a case of a clear outer optimization, "predict the next character", which creates a wide-ranging variety of inner knowledge that seems like it could in principle be useful for many many tasks; but we can't directly make GPT-3 use that knowledge for other tasks, because we have no idea what the hell GPT-3 is thinking or how to make it think anything in particular. Instead we have to do elaborate dances to shoehorn our task into the shape of the predicted next character, i.e., prompt engineering. If you pick a new mechanical task with no humans in the loop and a clear outer loss function, it doesn't force you to confront any interesting part of this problem - we already know how to pretrain and retrain nets.

Things you do with humans in the loop can in principle also be boringly straightforward. But humans in the loop are expensive; so this potentially forces you to do something more data-efficient than usual, and find an efficient way to apply leverage. (This is how I'd interpret "Learning to summarize." It didn't poke at the internals, but it at least found a higher-leverage way to apply external pressures.)

Ajeya Cotra: I agree with what you've said -- I can't tell if your comment here was "I agree and here's some elaboration", or if it's objecting to the way I've framed it here?

Eliezer Yudkowsky: Something like, "I might have phrased that differently". I don't see a simple change of local phrasing that fits with the rest of the message in context; my inner editor worries that the message sounds here like "Go make it wantonly fuzzy to impress me" rather than "Here is why going for the easy clarity is a bad sign."

1.3. Potential near-future projects: “sandwiching”

In all of these cases, my guess is that the way to get the less-capable group of humans to provide training signals of a similar quality to the more-capable group will involve some combination of:

Eliezer Yudkowsky: In some ways this strikes me as a much more ambitious project than getting GPT-3 to cough up what it actually knows about health. You're trying to make the system perform "better" than its training data, in a certain sense. This focuses on some interesting parts of a capability problem. It potentially leads into a much more important alignment problem: "Given that humans can be fooled by some inputs, how do you operate a more intelligent input-finder without that fooling the humans?", where we could potentially test this by building a system that didn't fool easy-to-fool humans, while the capability levels are still too low to fool harder-to-fool humans, which gives us a possible check on whether the methodology actually succeeded, so long as we don't try too many times or try with an overly capable system.

But this all strikes me as a different project than getting better transparency and steering into a system that has only learned things implicit in the training data. Of course it may be that the solutions to both problems end up related, somehow, but their outer form at least looks different.

The different problem is potentially interesting in its own right, if its solution forces a non-black-box system that is then more transparent or steerable (but note that you are basically asking for a capability solution and praying it will be more alignable); or the variant problem of "How do you make a system not fool easy-to-fool humans, in a domain where there are hard-to-fool humans to check the answer" (and the system is not an AGI yet and does not anticipate the existence of the hard-to-fool humans checking its answers &c).

Ajeya Cotra: I think there's room for both types of work in this umbrella. I'm mostly trying here to create a slogan that points researchers at the exciting types of work within ML, so I think outer alignment issues and figuring out how to get humans who are less-foolable (which I've focused on a lot in this post) and inner alignment stuff / transparency could fit under this umbrella. I've emphasized the less-foolable humans stuff here because I think there's something juicily concrete about the problem statement, and it's very easy to tell if you've made progress.

2. How this work could reduce long-term AI x-risk

On the outside view, I think we should be quite excited about opportunities to get experience with the sort of thing we want to eventually be good at (aligning models that are smarter than humans). In general, it seems to me like building and iterating on prototypes is a huge part of how R&D progress is made in engineering fields, and it would be exciting if AI alignment could move in that direction.

If there are a large number of well-motivated researchers pushing forward on making narrowly superhuman models as helpful as possible, we improve the odds that we first encounter serious problems like the treacherous turn in a context where a) models are not smart enough to cause actually catastrophic harm yet, and b) researchers have the time and inclination to really study them and figure out how to solve them well rather than being in a mode of scrambling to put out fires and watching their backs for competitors. Holistically, this seems like a much safer situation to be in than one where the world has essentially procrastinated on figuring out how to align systems to fuzzy goals, doing only the minimum necessary to produce commercial products.

Eliezer Yudkowsky: So as not to speak disagreements only, I remark that I agree with these two paragraphs. I worry about how very far underwater we are on the logistic success curve, here, but that doesn't mean we should not throw resources at any hope of starting to swim (upward).

Ajeya Cotra: Thanks -- honestly I felt like your comments were radically more positive than I was expecting overall, not just this one.

Chance of discovering or verifying long-term solution(s): I’m not sure whether a “one shot” solution to alignment (that is, a single relatively “clean” algorithm which will work at all scales including for highly superintelligent models) is possible. But if it is, it seems like starting to do a lot of work on aligning narrowly superhuman models probably allows us to discover the right solution sooner than we otherwise would have.

Eliezer Yudkowsky: It's not possible. Not for us, anyways. A textbook that fell out of a wormhole from the future might have the simplest straightforward working solution with no extra gears, all of whose pieces work reliably. We won't get it in time because it takes multiple decades to go from sigmoid activation functions to ReLUs, and so we will definitely be working with the AGI equivalent of sigmoid activation functions instead of ReLUs while the world is ending. Hope that answers your question!

It also seems plausible that a solution will emerge directly from this line of work rather than the conceptual work -- the latter is mostly focused on finding a one-shot solution that will work under ~pessimal empirical assumptions,

Eliezer Yudkowsky: "Pessimal" is a strange word to use for this apt description of humanity's entire experience with ML to date. Unless by "generalize" you mean "generalize correctly to one new example from the same distribution" rather than "generalize the underlying concept that a human would".

Ajeya Cotra: I used "pessimal" here in the technical sense that it's assuming if there are N generalizations equally valid on the training distribution the model will pick the one which is worst for humans. Even if there's a very high probability that the worst one is in fact picked, assuming the worst one will be picked is still "assuming the worst case."

Fwiw, I know lots of ML alignment researchers who would not agree this is highly likely and e.g. point to lots of "human-like" generalization going on in neural networks that makes them hopeful about the empirical situation. I haven't found those arguments super compelling but am generally in a state of uncertainty rather than confidence one way or another.

I recognize this is the sort of thing MIRI has been arguing with other people about for ages though, and I don't have an inside view here, so it probably doesn't make sense to have this discussion in the comments.
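
As an illustrative aside (not part of the original comment thread), the "pessimal" assumption described above can be sketched in a few lines of Python: among all candidate generalizations that fit the training data equally well, assume the model ends up with the one that is worst for humans off-distribution. Everything here is a hypothetical toy: the names `pessimal_choice`, `fits_training_data`, and `human_utility`, the three candidate functions, and the off-distribution inputs are assumptions made for the example, not anything proposed by the commenters.

```python
from typing import Callable, Dict, List, Tuple

# A "generalization" is modeled as a function from an input to an output.
# This is a toy stand-in for a trained model's behavior (hypothetical).
Hypothesis = Callable[[int], int]


def fits_training_data(h: Hypothesis, train: List[Tuple[int, int]]) -> bool:
    """Return True if the hypothesis reproduces every training label."""
    return all(h(x) == y for x, y in train)


def pessimal_choice(
    candidates: Dict[str, Hypothesis],
    train: List[Tuple[int, int]],
    human_utility: Callable[[Hypothesis], float],
) -> str:
    """Name of the training-consistent hypothesis with the LOWEST utility.

    This encodes the worst-case assumption: if several generalizations are
    equally valid on the training distribution, assume we get the worst one.
    """
    consistent = {
        name: h for name, h in candidates.items() if fits_training_data(h, train)
    }
    if not consistent:
        raise ValueError("no candidate fits the training data")
    return min(consistent, key=lambda name: human_utility(consistent[name]))


# Toy training set: on inputs 0..2 the intended behavior is "copy the input".
TRAIN = [(0, 0), (1, 1), (2, 2)]

CANDIDATES: Dict[str, Hypothesis] = {
    "identity": lambda x: x,          # the generalization a human intends
    "clipped": lambda x: min(x, 2),   # agrees on TRAIN, diverges for x > 2
    "off_by_one": lambda x: x + 1,    # does not fit TRAIN, so it is excluded
}

# Inputs outside the training range, where the candidates can disagree.
OFF_DISTRIBUTION = [5, 10]


def human_utility(h: Hypothesis) -> float:
    """Negative total error relative to the intended behavior (identity)."""
    return -float(sum(abs(h(x) - x) for x in OFF_DISTRIBUTION))


if __name__ == "__main__":
    print(pessimal_choice(CANDIDATES, TRAIN, human_utility))  # prints "clipped"
```

In this toy, "identity" and "clipped" are indistinguishable on the training set, and the pessimal assumption selects "clipped" because it does worse on the unseen inputs. The disagreement in the thread is over how likely such a worst-case pick is in practice, which this sketch does not address.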

3. Advantages over other genres of alignment research

I’m broadly supportive of all three of these other lines of work, but none of these approaches come as close to “practicing the thing we eventually want to be good at” as aligning narrowly superhuman models does.

Eliezer Yudkowsky: I remark that it seems to me that work like "Learning to summarize", plus this, falls into a subcategory that's much closer to the subcategories of "reliability+robustness" and the subcategory "interpretability", than to gridworlds; and in turn, the category of gridworlds is much closer to that aforesaid supercategory than either of those categories is to "conceptual research".

4. Objections and responses

4.1. How would this address treachery by a superintelligence?

I'm very uncertain how relevant the near-term work will turn out to be for more exotic problems like the treacherous turn, and I want to think more about ways to nudge it to be more relevant.

Eliezer Yudkowsky: Having some idea of what the hell the system was thinking internally might be one good start. It's not a one-shot solution, but you'd at least get a couple of warning lights before you went on ahead anyways and died.

To be more precise, I'd imagine the warning lights shut off at a higher capability level if that capability level is "smart enough to conceal own thoughts from transparency mechanisms, without that intention itself lighting up in the transparency mechanisms" instead of "smart enough to socially manipulate operators through outer acts and shape the quoted inner intentions that operators infer from acts". Plausibly the former case results in everybody dying because surrounding arms race conditions caused people to press ahead ignoring warning signs; instead of everybody dying because they were wholly ignorant of the dangerous thoughts their AGI was thinking, and not especially focused on the missing knowledge. Trying to die in a more dignified and self-aware fashion can be a first step in climbing the logistic success curve towards eventually surviving, or at least that would be the optimistic take.

Ajeya Cotra: I added a footnote about this idea, and also mentioned it in my description of your views. I am interested in it but I don't think I'm the right person to speak for this idea since I don't totally "get it" or have good inside view project ideas. I'd be excited for you or Evan to leave a comment on the blog post discussing this.

[Added footnote: “One idea a couple of others have suggested here and which I’m generally interested in is ‘transparency in (narrowly superhuman) language models’: finding ways to understand ‘what models are thinking and why,’ especially when they know more about something than humans do. I like this idea but am very unsure about what execution could look like. E.g., would it look like Chris Olah’s work, which essentially ‘does neuroscience’ on neural networks? Would it look like training models to answer our questions about what they’re thinking? Something else?”]

Evan Hubinger: I think that if we're in a situation where our model is actively trying to trick our transparency tools, we've already lost—but that we can potentially use transparency tools as a mechanism to understand what's happening during training enough to avoid ever entering that regime. Thus, I think current transparency research is likely to be useful because I think it will eventually enable us to build better training processes that avoid deception, rather than helping us detect deception directly. My guess is that that's sort of what Eliezer is saying in terms of knowing what's happening being a step towards preventing it, though I'm not sure I fully understand Eliezer's beliefs here.

Ajeya Cotra: That's a good point re using transparency to create better training processes, and makes me more positive.

Rohin Shah: The way I've summarized this point previously is that if our source of feedback can see how the model is reasoning, then we can select for models that get the right answer for the right reason, and not just models that get the right answer. One would hope that this selects heavily against treachery.

Ajeya Cotra: Well, I'm kind of wary of selecting for looking good on the transparency tools we currently have available without more information, tbc: I'd rather set things up so the transparency tools themselves are not under optimization pressure.

Evan Hubinger: I agree that ideally we wouldn't want to be applying optimization pressure against our transparency tools since we risk Goodharting on them, but imo I think this is a very manageable risk. I expect our training processes (e.g. SGD) to be dumb enough that a good overseer can mostly avoid being tricked by them, but I don't expect the same thing to be true about the model being overseen—that is, if you have to fight against SGD trying to trick you, that's probably fine, but if you have to fight against a deceptive model trying to trick you, that's probably not.

Though you could think that in an absolute sense it and all the other approaches that aren’t tackling treachery head-on are doomed.

Eliezer Yudkowsky: Doomed, I tell you! Doomed!

4.3. Isn’t this not neglected because lots of people want useful AI?

Eliezer Yudkowsky: Strong agree with this entire section (as it appeared at the time this comment was written).

[Editor’s note: this section is more or less completely unchanged.]

5. Current state of opinion on this work

I think most researchers at MIRI probably feel that this work is very unpromising or net-harmful -- I don’t totally understand their full reasoning, but my sense is that a strong version of the treachery objection plays a part (although they might frame it differently in important ways). But MIRI’s view on virtually all non-MIRI research is that it’s near-useless, and if anything Eliezer personally seems maybe marginally more positive on this type of thing than other non-MIRI alignment work -- so while it’s unendorsed by MIRI, I don’t think it’s particularly unendorsed relative to other things.

Eliezer Yudkowsky: Doesn't seem especially fair? Every year or two I check again with Chris Olah whether he's currently receiving all the resources he thinks he can use, and/or remind OpenPhil that I wish it was possible to spend a billion dollars on that research.

Ajeya Cotra: I actually genuinely wasn't aware of this and thought that MIRI would pretty happily endorse this statement, my bad. I was also pretty surprised by your positive engagement here -- thanks for that!

I was wrong about the MIRI view here and glad I checked with Rob. Will propose an edit.

Ajeya Cotra: Okay I added in purple a different summary of MIRI's view: "My understanding of Eliezer Yudkowsky’s position is one of “cautious relative optimism” about something in this general space compared to other non-MIRI alignment work, though he would frame the core concern differently, with more emphasis on transparency and interpretability (“GPT-3 has somewhere buried inside it knowledge of what to do when you’re sick; how do you extract all of that and how can you tell when you’ve succeeded?”). He was reasonably positive on Stiennon et al., 2020 when it came out, and would be happy to see more work like that. Evan Hubinger’s position seems similar, and I’m not sure where others at MIRI would land on this work."

Eliezer Yudkowsky: I endorse this statement of my position.

Ajeya Cotra: Awesome, thanks!

6. Takeaways and possible next steps

If you disagree with this argument, say so -- especially if you think it would be harmful or would be dominated by a different line of work that shares similar practical advantages of tangibility, good feedback loops, and potential-for-scale.

Eliezer Yudkowsky: The closest thing I have to a disagreement with this draft is a general sense that 19 out of 20 bright ideas in machine learning don't work especially well; so I have more hope in calls for proposals that pinpoint key problems in a way that seems sufficiently crisp to technical thinkers, possibly including examples of potential bright ideas to show what the problem is, but leave wide open how those problems are to be addressed.

Compared to saying, "The problem is that we have no idea what the hell GPT-3 is thinking, and if that's still true at AGI then everybody dies", I think you get better results if you say "GPT-3 obviously contains a lot of medical knowledge but we would like a better way than prompt engineering to get GPT-3 to think using that knowledge, something that scales better than prompt engineering using a lot of human labor, and doesn't leave us wondering if today was the day that GPT-3 decided to emulate an Internet troll despite our best prompting". But you get worse results if you have your own bright idea for how to solve that, and ask for proposals to carry out your bright idea. This draft doesn't descend particularly far into that - it's not proposing a particular exact architecture - but it also doesn't explicitly distinguish "Here is what I think the problem is" and "Here is one example idea of how to go about it, which the call for proposals definitely isn't limited to because 19 out of 20 ideas don't work".

For that matter, I think you want to explicitly signal that you are open to people reframing the problem (providing that you think they are still directly challenging a key step and not just telling you to care about something else instead; but in real life I imagine you'd get a lot of lousy distantly-related proposals with strained arguments, no matter what you say in the CfP, if you are getting lots of proposals at all).

Ajeya Cotra: Thanks -- I basically agree with all of this on reflection (except for, ironically/appropriately, your particular framing of the problem). I've suggested some edits already (highlighted in purple) to make it more agnostic, and will look back at this and add more.

Ajeya Cotra: I tweaked various things, and I don't think it's totally nailing the balance between not being too prescriptive and not being too vague, but I'll probably ship it anyway to get the ball rolling.

[Editor’s note: Evan made the following comment before Ajeya made the Q&A section "Why not focus on testing a long-term solution?".]

Evan Hubinger: Some comments after reading this, which I think broadly fall into the category of thinking that this is less valuable than other work that could be done:

I think there is a very important problem that this line of research is pointing at, which is figuring out how to get models that are ascription universal, that is, models that tell you everything they know. Paul has a whole sequence of posts on this (which I was surprised not to see linked anywhere, though maybe I missed it; see https://ai-alignment.com/towards-formalizing-universality-409ab893a456).

That being said, I feel pretty skeptical about telling people to just go work on something useful and hope that good work on universality comes out as a result. There's a lot of engineering work that goes into making useful systems that's pretty unrelated to actually solving universality-style problems and my guess is that most of the effort on a project like that would end up being unrelated engineering work rather than useful universality-relevant work.

I think that the distinctions you draw between for-profit work and what you're proposing here might help mitigate the above problem somewhat, but I'm still not sure why this would be better than just working directly on the universality problem. I feel like I would much rather have someone doing a bunch of research trying to figure out how to get GPT-3 to give truthful answers than someone trying to get GPT-3 to give legitimately helpful medical advice. There's just so much random, domain-specific stuff you'd need to do to get GPT-3 to give good medical advice that doesn't feel particularly alignment-relevant to me.

Another way of thinking about this: if you're doing research directly on universality, you can have a much tighter feedback loop, where you can iterate with lots of different techniques and try different things. If you have to produce something that's actually going to be useful, I don't think you can really afford to do that same sort of open-ended iteration. Instead, I expect you'll have to do a lot of very specific fine-tuning for the downstream tasks you're trying to solve that will prevent you from being able to iterate quickly.

Ajeya Cotra: Thanks for the thoughts! I don't think I expect most of the value of this work to be directly testing the specific concept of ascription universality or otherwise directly feeding into Paul's agenda, which is why I didn't link the post. The thing I'm saying is specifically trying to be pretty common sense and atheoretical, rather than focusing on resolving a particular uncertainty within a particular one shot solution agenda.

I think telling people to train models to give truthful answers to a wide range of questions would be good, but not obviously better than the other projects and not obviously very optimized for testing IDA / ascription universality (which I think most people don't really understand well enough to have a good sense of whether they're testing it or what they can learn about it). I think that whole line of work needs more conceptual clarification, at which point Paul will probably propose specific projects that would test his views (which might end up looking very different than his current views when the dust settles).

With that said, I don't consider the point of this work to be making models useful -- I just consider the usefulness of the final models a litmus test for whether you were choosing sufficiently challenging alignment problems to work on, and I agree it would be bad to waste a bunch of time on faff productizing the thing. A model which answers a bunch of questions well and honestly would pass that bar.

If you have ideas for research projects that would provide good tests of ascription universality and could be executed by someone who doesn't deeply understand ascription universality or really buy into Paul's agenda, I'd be excited to add that to the list of suggested projects. But I think I would frame it as "Here's a thing that looks like aligning a narrowly superhuman model. A special benefit of this project is that Evan Hubinger thinks it's particularly well-suited for shedding light on ascription universality", rather than leading with "testing out ascription universality" as the main benefit you could get out of any of this work.

Another point here is that I'm focusing a lot on the long-run field growth potential, which "directly testing (particular theory)" seems to have less of.

Evan Hubinger: Hmmm... I guess I'm just having difficulty, from an inside-view perspective, understanding what non-universality-related insights we would get from building models which are useful in this sense. I think I do understand the outside-view argument, though I would feel a lot happier if I also had an inside-view argument.

Perhaps my difficulty here is just coming from the fact that the primary example I have in mind is of trying to get GPT-3 to give more honest/helpful/useful answers. Do you have another example of a useful short-term alignment project that wouldn't fall into that category?

If it is just that category, though, it feels to me like it would just be better to tell people “work on getting GPT-3 to give more honest/helpful/useful answers” than “work on getting GPT-3 to be useful for concrete downstream task X.”

Ajeya Cotra: Coding is an example that doesn't seem like it's about "honest helpful answers": I'm excited to figure out how to get non-software engineers to effectively supervise a coding model to write better code. I'm also interested in e.g. getting GPT-3 to write a good mystery story using feedback from people who aren't good writers, or maybe don't even speak the language.

What's your take on the three sources of value I list in the "How this work could reduce long-term x-risk" section? They were: 1) Practical know-how and infrastructure, 2) Better AI situation in the run up to superintelligence, and 3) Chance of discovering or verifying a long-term solution.

It seems like your take is something like "#3 is the main hope for impact, and getting insight about ascription universality is the main hope for #3, so if I don't see a strong route to getting insight about ascription universality from this work I'm not very bought in." Is that right? I think I feel much more agnostic about both of those claims: #1 and #2 seem pretty big to me, and it seems plausible that we'll discover routes to long-term solutions that look pretty different from the current Paul agenda.

E.g. I think it's pretty likely the thing Paul is trying to do is impossible because he is setting a certain bar like "If we don't have an a priori argument that it should be safe, assume it won't be", but at the same time something less stringent and more contingent on the actual empirical situation would work fine. It also seems plausible to me that the ascription universality stuff "just works out" and the main difficulty is elsewhere, etc.

As a more minor point, I think I don't like the frame of "train GPT-3 to give honest, helpful answers" because that doesn't put the focus on the "narrowly superhuman" part. That's kind of what I was getting at in my response to the "Why not just stick with getting models not to do bad things?" objection.

I think the interesting part of the challenge is getting GPT-3 to add something to humans -- to be better than they are at something. I think a frame of "get it to be honest" is more likely to create a bunch of ho hum work on correcting errors humans can notice pretty easily themselves.

Evan Hubinger:

> I think the interesting part of the challenge is getting GPT-3 to add something to humans -- to be better than they are at something. I think a frame of "get it to be honest" is more likely to create a bunch of ho hum work on correcting errors humans can notice pretty easily themselves.

That's a good point that I hadn't considered; I definitely feel more convinced that this is a good idea now.

> What's your take on the three sources of value I list in the "How this work could reduce long-term x-risk" section?

I think I feel less convinced by (1) and (2) since they seem reasonably likely to just happen by default.

> I think it's pretty likely the thing Paul is trying to do is impossible

Fwiw, I agree with this—I think this is one of my current biggest cruxes with Paul. Nevertheless, I think that a better understanding of the extent to which it is possible is likely to be very useful and teach us a lot.

Ajeya Cotra: I agree that digging into the Paul thing is likely to teach us a lot even if it's impossible, fwiw. I think Paul should keep doing his thing, and I hope he comes out with experiment ideas soon. I think there's limited room for that kind of thinking and a limited number of people who could do well at it though, and the room to do this work (which could explore in parallel a bunch of worlds where Paul's thing is impossible but we're not doomed anyway) is much larger.

I think of this stuff as a pretty strong baseline to beat -- if you think you have privileged insight that beats the baseline you should probably try that out for at least a couple years and return to the baseline if it's not going well. (I think even if you have privileged insight, I'd want to see more experimental work and less pure pen-and-paper stuff, but that's a deeper worldview disagreement that I'm less confident in.)

I agree that 1) and 2) are likely to happen by default, but I also think 3) is pretty likely to happen by default too -- I see some gap there, just not that large of one. For all of #1-3 I'm thinking in terms of "speed up" relative to the default. I think right now if you look around, people could be doing this aligning narrowly superhuman models stuff and aren't, and I'd guess that if EAs don't make a push for it it'll get delayed by a couple more years still, and less of it will get done. E.g. Paul was the one to start making human feedback a thing at all at OpenAI.

Evan Hubinger: Yeah—I definitely agree that Paul's work is good but hard to replicate. “Do more Paul stuff” was definitely not the alternative I was proposing. My thinking was just that it seems like focusing on downstream tasks with the hope of that leading to good insights into how to align language models feels less direct—and thus less likely to yield good insights by default—than just focusing on aligning language models directly. I buy that just framing it as “get GPT-3 to give honest answers” might lead people to not work on anything superhuman, though I don't yet feel fully convinced that there isn't still a yet better framing than either of those—that both emphasizes the importance of doing superhuman things but also focuses on the directly useful alignment task rather than things that are downstream of that.

Ajeya Cotra: Hm I think there's still some disconnect between us here -- the way I'm thinking about it, the activities I'm proposing here simply are aligning large models (most of them are language models but I don't want to be married to that). I don't see it as "doing things that might give us insight into how to align narrowly superhuman models"; it just seems like aligning them, i.e. getting them to try their best to use all their faculties to help us. I want to find ways to just directly practice the long-run thing.

I'm definitely open to a better framing of the thing I'm saying that's more likely to inspire productive work, though.

Evan Hubinger: Taking your examples from earlier:

> Coding is an example that doesn't seem like it's about "honest helpful answers": I'm excited to figure out how to get non-software engineers to effectively supervise a coding model to write better code. I'm also interested in e.g. getting GPT-3 to write a good mystery story using feedback from people who aren't good writers, or maybe don't even speak the language.

I think I wouldn't say that any of these things “simply are aligning large models”—they all feel like particular tasks which likely require some amount of alignment to get them to work, but also require lots of other non-alignment stuff as well. Ideally, I'd much prefer work that cuts out the non-alignment stuff and just focuses on the alignment stuff, but I think it's basically impossible to do that if you're trying to actually produce a model which is useful in practice for some downstream use case, since you're just not going to be able to do that without a ton of non-alignment-relevant work. I certainly think it's reasonable, even if you're just trying to do alignment work, to have some particular task in mind that you focus on, but once you start trying to do something like produce a product (not necessarily even for profit, just something that you want to be useful for real users), I expect most of the work that you'll end up needing to do for that won't be alignment-relevant.

Ajeya Cotra: Hm, my take is like "alignment = trying to do what we want." In the end we want models that are trying to run countries and companies and shepherd the future the way we want; in the near-term we could get models that are trying to do what we want as personal assistants and research assistants, and right now I think we should be tackling the hardest tasks where we could get models to try to do what we want, ideally where they are already better than us at the task.

I think of "seeing a glimmer of usefulness" as a proxy for "did you pick the hardest tasks", not as the end goal. I agree you'll need to do a bunch of stuff to make it maximally useful that isn't the core part that seems like "aligning it" to me (basically training it with human feedback augmented/arranged in a certain way), and I think researchers working on aligning narrowly superhuman models should skip that work. But I don't think I understand the view that alignment is something that can totally be divorced from a task.

As a concrete example of the usefulness bar I'm thinking will usually make sense, take the Paul lab paper Stiennon et al. 2020: it's definitely not like a summarization product, and they left tons of things on the table (like collecting demonstrations from expert writers and hardcoding certain heuristics) that would feed into a real product. But it feels sort of markedly more useful than raw GPT-3, as a direct result of the fact that the model is now trying to use its faculties to be helpful instead of play an improv game. That's kind of the threshold that I think we should be aiming for.

Evan Hubinger: I certainly don't think that alignment can be totally divorced from a task, at least no more so than capabilities—and in fact I think the analogy to capabilities is very apt here. When you focus on solving Go, if you try to do it in a task-agnostic way, you learn a lot about AI capabilities in general. On the other hand, if you try to do it in a way that is very specific to Go, you don't learn very much about AI capabilities at all. Similarly, I expect relatively open-ended research on getting large language models to do helpful things in a relatively generic way to be useful for learning about alignment in general, but narrow work on getting large language models to solve specific tasks in a relatively task-specific way to not be very useful.

That being said, if you think “Learning to summarize from human feedback” is a good example of what you're talking about, then maybe we just agree, because I feel like it's a good example of what I'm talking about also—that is, an example of relatively open-ended research that was just trying to get a large language model to do a helpful alignment thing, rather than actually produce anything that might actually be used in the real world as a summarization tool.

Ajeya Cotra: Yeah I agree with trying to do task-general techniques and using tasks as case studies; I tried to emphasize this in the post but maybe I can tweak to make more clear or prominent.

Evan Hubinger: Yeah, I definitely think that's good—though I think a part of what I'm saying also is that the message of “solve specific task X” is likely to lead to that sort of task-specific work, whereas something more like “figure out how to make GPT-3 honest” is less likely to do that, in my opinion.

Ajeya Cotra: I want the one sentence summary to be "align narrowly superhuman models" rather than "solve specific task X"; the "align narrowly superhuman models" line doesn't seem less likely to lead to good work than the "Make GPT-3 honest" line to me (and right now I think it would lead to better work because of the problem of "honesty" calling to mind "avoid certain specific lies" and not sufficiently pushing the researcher to consider the hardest cases).

Evan Hubinger: Yeah, I think that's a good pitch to try, but I still worry that you'll end up with a lot of useless product engineering.

Ajeya Cotra: Cool thanks, this was a helpful discussion -- I just added a section on "Why not test the Paul stuff".

Evan Hubinger: I'm glad I was able to be helpful! :)

As I said above, Open Phil is not soliciting grant applications right now from people who want to try it out -- this blog post is my personal viewpoint, and institutionally we’re still figuring out how much we want to prioritize this (discussion and arguments surrounding this post will feed into that).

Eliezer Yudkowsky: If I thought a proposal in this subarea looked as good as "learning to summarize" and OpenPhil wasn't picking it up or was throwing paperwork in front of it, I'd start sending emails to other funding sources or possibly even have MIRI fund it directly. We obviously have a lot less total funding, but not getting done what can be done in ML alignment is kind of... not acceptable at this point.

7. Appendix: beyond sandwiching?

I’m definitely very unsure what this would look like, but an important starting assumption I have is that whatever techniques worked well to get less-capable humans to reproduce the judgments of more-capable humans in a “sandwich” setting stand a good chance of just continuing to work.

Eliezer Yudkowsky: I remark that this is the kind of thought that needs to be hedged around very carefully on pain of undignified planetary extinction. Did you scale the answers of less-capable humans to results checkable by more-capable humans, while operating the AI under the capability threshold for modeling human psychology in detail, and are you assuming the same technique will generalize if an AGI is that smart? Did you scale worse human answers to checkable better answers while applying a small amount of optimization power, and are you assuming the same method will scale to using much more power than that? Did you scale worse to better across an identical environmental distribution, and are you hoping the same applies when the environmental distribution is being effectively altered between training and testing by the impact of an AGI that's smarter than when it was trained? And so on and so on.

I'm not saying it's useless to poke around and find things that seem to work for scaling unreliable human answers to better-than-those-humans answers, that smarter humans can still check to see if the whole method worked. I'm saying that if researchers actually believe the part where the journalists are like "and lo the entire alignment problem has been solved!", and the authors don't explicitly list out five stability conditions that held inside the experiment that might be necessary conditions, that's not what I'd call dignified.
