I wrote this post to get people’s takes on a type of work that seems exciting to me personally; I’m not speaking for Open Phil as a whole. Institutionally, we are very uncertain whether to prioritize this (and if we do where it should be housed and how our giving should be structured). We are not seeking grant applications on this topic right now.

Thanks to Daniel Dewey, Eliezer Yudkowsky, Evan Hubinger, Holden Karnofsky, Jared Kaplan, Mike Levine, Nick Beckstead, Owen Cotton-Barratt, Paul Christiano, Rob Bensinger, and Rohin Shah for comments on earlier drafts.

A genre of technical AI risk reduction work that seems exciting to me is trying to align existing models that already are, or have the potential to be, “superhuman”[1] at some particular task (which I’ll call narrowly superhuman models).[2] I don’t just mean “train these models to be more robust, reliable, interpretable, etc” (though that seems good too); I mean “figure out how to harness their full abilities so they can be as useful as possible to humans” (focusing on “fuzzy” domains where it’s intuitively non-obvious how to make that happen).

But GPT-3 doesn’t seem to “want” to give me the best possible health advice -- instead it “wants” to play a strange improv game riffing off the prompt I give it, pretending it’s a random internet user. So if I want to use GPT-3 to get advice about my health, there is a gap between what it’s capable of (which could even exceed humans) and what I can get it to actually provide me. I’m interested in the challenge of:

How can we get GPT-3 to give “the best health advice it can give” when humans[3] in some sense “understand less” about what to do when you’re sick than GPT-3 does? And in that regime, how can we even tell whether it’s actually “doing the best it can”?

I think there are other similar challenges we could define for existing models, especially large language models.

I’m excited about tackling this particular type of near-term challenge because it feels like a microcosm of the long-term AI alignment problem in a real, non-superficial sense. In the end, we probably want to find ways to meaningfully supervise (or justifiably trust) models that are more capable than ~all humans in ~all domains.[4] So it seems like a promising form of practice to figure out how to get particular humans to oversee models that are more capable than them in specific ways, if this is done with an eye to developing scalable and domain-general techniques.

I’ll call this type of project aligning narrowly superhuman models. In the rest of this post, I:

• Give a more detailed description of what aligning narrowly superhuman models could look like, what does and doesn’t “count”, and what future projects I think could be done in this space.
• Explain why I think aligning narrowly superhuman models could meaningfully reduce long-term existential risk from misaligned AI.
• Lay out the potential advantages that I think this work has over other types of AI alignment research: (a) conceptual thinking, (b) demos in small-scale artificial settings, and (c) mainstream ML safety such as interpretability and robustness.
• Answer some objections and questions about this research direction, e.g. concerns that it’s not very neglected, feels suspiciously similar to commercialization, might cause harm by exacerbating AI race dynamics, or is dominated by another type of work.
• Briefly discuss where I think some AI alignment researchers currently stand on this work.
• Summarize takeaways and possible next steps for readers.

There aren’t a large number of roles where someone could do this right now, but if aligning narrowly superhuman models is a good idea, and we can build a community consensus around it being a good idea, I think we have a good shot at creating a number of roles in this space over the coming years (allowing a larger number of people to productively contribute to AI x-risk reduction than would be possible otherwise). To discover whether that’s possible, I’d appreciate it if people could react with pushback and/or endorsement, depending on where you’re at.

What aligning narrowly superhuman models could look like

I’m a lot less confident about a particular agenda or set of project ideas than I am about the high-level intuition that it seems like we could somehow exploit the fact that today’s models are superhuman in some domains to create (and then analyze and solve) scaled-down versions of the “aligning superintelligent models” problem. I think even the basic framing of the problem has a lot of room to evolve and improve; I’m trying to point people toward something that seems interestingly analogous to the long-run alignment problem rather than nail down a crisp problem statement. With that said, in this section I’ll lay out one vision of what work in this area could look like to provide something concrete to react to.

First of all, it’s important to note that not all narrowly superhuman models are going to be equally interesting as alignment case studies. AlphaGoZero (AGZ) is narrowly superhuman in an extremely strong sense: it not only makes Go moves better than the moves made by top human players, but also probably makes moves that top players couldn’t even reliably recognize as good. But there isn’t really an outer alignment problem for Go: a precise, algorithmically-generated training signal (the win/loss signal) is capable of eliciting the “full Go-playing potential” of AGZ given enough training (although at a certain scale inner alignment issues may crop up). I think we should be focusing on cases where both inner and outer alignment are live issues.

The case studies that seem interesting are models which have the potential to be superhuman at a task (like “giving health advice”) for which we have no simple algorithmically-generated or hard-coded training signal that’s adequate (which I’ll call “fuzzy tasks”). The natural thing to do is to try to train the model on a fuzzy task using human demonstrations or human feedback -- but if (like AGZ) the model actually has the capacity to improve on what humans can demonstrate or even reliably recognize, it’s not immediately obvious how to elicit its “full potential.”

Here’s an attempt at one potential “project-generation formula”, where I try to spell out connections to what I see as the main traditional sub-problems within academic AI alignment research:

Choose a helpful “fuzzy” task (e.g. summarization, question-answering, advice-giving, story-writing) for which we have suggestive evidence that makes us suspect a state-of-the-art model has the capacity to significantly outperform some reference set of humans (e.g. Mechanical Turk workers) given the right training signal. Then,

1. Reward learning: Find a training procedure that allows those reference humans to train the model to do the fuzzy task better than they could do it (and ideally, better than they could even recognize or verify unaided). This procedure shouldn’t rely on the researchers’ own understanding of the particular domain in a way that wouldn’t generalize across domains.
2. Scalability and competitiveness: Argue or empirically demonstrate that the human oversight work wouldn't have to scale up much if the model were 10x or 100x bigger, or each instance of the task took 10x or 100x longer to demonstrate or evaluate.
3. Interpretability and robustness: Once you’ve done this, try to understand its behavior and stamp out whatever pathologies (e.g. lying, going off the rails) may have cropped up.[5]

This is just one type of project you could do in this space. The larger motivating question here is something like, “It looks like at least some existing models, in at least some domains, ‘have the ability’ to exceed at least some humans in a fuzzy domain, but it’s not obvious how to ‘draw it out’ and how to tell if they are ‘doing the best they can to help.’ What do we do about that?”

I don’t think the project-generation formula I laid out above will turn out to be the best/most productive formulation of the work in the end; I’m just trying to get the ball rolling with something that seems concrete and tractable right now. As one example, the project-generation formula above is putting reward learning / “outer alignment” front and center, and I could imagine other fruitful types of projects that put “inner alignment” issues front and center.

Existing work in this area

This kind of work only became possible to do extremely recently, and mostly only in industry AI labs; I’m not aware of a paper that follows all three steps above completely. But “Learning to summarize from human feedback” (Stiennon et al., 2020) accomplishes the easier version of 1 and a bit of 2 and 3. The authors chose the fuzzy task of summarizing Reddit posts; there was an existing corpus of human demonstrations (summaries of posts written by the posters themselves, beginning with “TL;DR”):

1. Reward learning: Ultimately, the quality of summaries generated by a large language model fine-tuned with RL from human feedback exceeded the quality of the Reddit summaries (i.e. it exceeded what some set of reference humans generated). But it didn’t really exceed what the human workers could evaluate -- except in the fairly straightforward (but IMO meaningful) sense that the authors figured out quality control procedures, human rating aggregation algorithms, easier framings of the question, training and feedback for workers, etc that allowed them to get better performance than they would have gotten using the most naive implementation of “train on human ratings.”
2. Scalability: I don’t think the paper makes explicit arguments about scalability, but the method is very domain-general and could plausibly work for significantly harder tasks, especially combined with decomposition (and I’d like to see that systematically attempted).
3. Interpretability and robustness: The paper doesn’t dig deep into interpretability, reliability, and pathological behavior, but it does demonstrate that optimizing the reward model (learned from human judgments) “too hard” leads to weird pathological summaries that are repetitive, offensive, etc., and addresses this by applying a penalty for diverging too far from the human demonstration distribution.
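To make the penalty in point 3 concrete, here is a minimal sketch, assuming it takes the common per-sample log-ratio form: the reward model's score minus a coefficient times the log-probability ratio between the fine-tuned policy and a reference policy trained on the human demonstrations. The function name, the argument names, and the coefficient `beta` are illustrative assumptions, not details taken from the paper's code.

```python
def penalized_reward(reward_model_score: float,
                     logprob_policy: float,
                     logprob_reference: float,
                     beta: float) -> float:
    """Reward used for RL: learned reward minus a drift penalty.

    reward_model_score: score from the reward model learned from human ratings.
    logprob_policy:     log-probability of the sampled output under the
                        policy being fine-tuned with RL.
    logprob_reference:  log-probability of the same output under a reference
                        policy trained on human demonstrations (assumed).
    beta:               penalty strength (hypothetical hyperparameter).
    """
    # Positive when the RL policy likes this output more than the reference does.
    log_ratio = logprob_policy - logprob_reference
    return reward_model_score - beta * log_ratio


if __name__ == "__main__":
    # Same reward-model score, but the second sample has drifted further
    # from the reference distribution, so it ends up with a lower reward.
    close_to_reference = penalized_reward(1.0, -2.0, -2.0, beta=0.5)
    far_from_reference = penalized_reward(1.0, -1.0, -3.0, beta=0.5)
    print(close_to_reference, far_from_reference)
```

Averaged over samples from the policy, the log-ratio term estimates the KL divergence from the reference policy, so a larger `beta` keeps the model closer to outputs that resemble the human demonstrations, at the cost of optimizing the learned reward less aggressively.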

What kinds of projects do and don’t “count”

In the high-level description of this research area, I’ve aimed to be as broad as possible while picking out the thing that seems interestingly different from other research in alignment right now (i.e. the focus on narrowly superhuman models). But given such a broad description, it can be confusing what does and doesn’t count as satisfying it. Would self-driving cars count? Would MuseNet count? Would just training GPT-4 count?

Firstly, I don’t think whether a project “counts” is binary -- in some sense, all I’m saying is “Find a model today such that it seems as non-obvious as possible how to align it, then try to align it.” The more obvious the training signal is, the less a project “counts.” But here are some heuristics to help pick out the work that currently feels most central and helpful to me:

• You should probably be fine-tuning an existing large model: I don’t think we should be guessing what size models could have the potential to be narrowly superhuman in some domain; I think an alignment project should probably be inspired by noticing that an existing model seems to have some “knowledge” or “skill” that it’s not adequately harnessing because it doesn’t “want to”, as in the example with GPT-3 and health advice above.[6] I would guess the base model you start with should be >>1B parameters, and the larger the better -- this is because the larger the model is, the more likely it is to have the capacity to be superhuman in an interesting, challenging domain. Less confidently, I would guess that you probably want to be fine-tuning a generative model like GPT-3 or MuseNet (as opposed to a supervised learning model like an image classifier or an RL model like AlphaGoZero or AlphaStar), because those models seem closest to being able to do “interesting real-world tasks” better than some humans can.
• If you’re making the model larger, it doesn’t count: I see the point of this work as “realizing the potential of existing state-of-the-art models in fuzzy domains”, rather than pushing forward the state-of-the-art in models’ raw potential. Note that this doesn’t mean I think scaling up models is always bad -- I definitely see risks there, but also potential benefits depending on who does it and how (e.g. new large models can also create new opportunities to do empirical alignment research like this). I think the question of the sign of scaling work is pretty complicated and situation-dependent. I just want to clearly distinguish between the projects of “aligning narrowly superhuman models” and “scaling models up to make them (more) superhuman”, and make it clear that someone could participate in one without participating in the other. So, for example, training GPT-4 would not count as aligning a narrowly superhuman model.[7]
• If you’re not dealing with humans, it probably doesn’t count: I think that if you can get the model to achieve superhuman performance at some task without collecting any human feedback or human demonstrations, the task is probably not “fuzzy” enough. It shouldn’t be easy for humans to just write down an algorithm specifying what they want, and there shouldn’t be an existing dataset that just demonstrates what they want. In practice, I also don’t think human demonstrations alone will cut it (unless they are cleverly combined with an amplification-like scheme or somehow augmented or assisted); RL from human feedback will probably be necessary. My guess is that self-driving cars mostly fail on these grounds -- in a lot of self-driving car companies, only the recognition of objects in a scene is done with large neural nets, and those are trained almost entirely from labeled datasets.[8] To the extent that large models are used for the actual driving policy (which they usually aren’t), relatively simple/algorithmic training signals like “how far is the car from other cars”, “how centered is it in the lane”, “how smooth is its acceleration”, etc seem probably adequate to elicit human-level or superhuman driving ability without bringing in feedback from human judgments.
• If you didn’t make the model genuinely useful, it probably doesn’t count: I think we should generally be choosing complex, multi-dimensional real-world tasks where there is a lot of room to improve on typical humans’ actions and/or judgments -- giving advice, summarizing research, coding, writing emails, translation, telling stories, etc. In the end, these models should feel impressive and valuable -- they generally wouldn’t constitute a commercial product on their own because commercial products are rarely “clean” or “pure ML”, but should ideally have the potential to become a product with some design and engineering work. If the selected task was not valuable or at least inherently interesting, I would guess that the alignment problem wasn’t hard enough and much of the benefit of “practicing on something similar to the real deal” would be lost. Note however that “genuinely useful” doesn’t mean optimized for usefulness alone -- I expect this research will not look like the shortest path to creating a valuable product (e.g. by construction the approach I propose below makes it much harder than it has to be if you just want to train a model to be useful somehow). See this objection and response for more detail.

I think some projects that don’t fit all these criteria will also constitute useful progress on aligning narrowly superhuman models, but they don’t feel like central examples of what I’m trying to point at.

Potential near-future projects: “sandwiching”

I think a basic formula that could take this work a step beyond Stiennon et al., 2020 is to a) “sandwich” the model in between one set of humans which is less capable than it and another set of humans which is more capable than it at the fuzzy task in question, and b) figure out how to help the less-capable set of humans reproduce the judgments of the more-capable set of humans. For example,

• First fine-tune a coding model to write short functions solving simple puzzles using demonstrations and feedback collected from expert software engineers. Then try to match this performance using some process that can be implemented by people who don’t know how to code and/or couldn’t solve the puzzles themselves.
• First fine-tune a model to answer long-form questions in a domain (e.g. economics or physics) using demonstrations and feedback collected from experts in the domain. Then try to match this performance using some process that can be implemented by people who know very little about the domain.
• First fine-tune a model to translate between English and French using demonstrations and feedback collected from people who are fluent in both languages. Then try to match this performance using some process that can be implemented by people who are fluent in one language and barely know the other (or don’t know it at all and only have a dictionary). Something similar was done in Lample et al., 2018, although they didn’t use human feedback.
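One way to score a sandwiching experiment like the ones above is to ask how much of the gap between unaided non-experts and experts a given oversight process closes. The sketch below is only an illustration under assumptions of my own choosing: judgments are simple categorical labels (e.g. which of two outputs is better), expert labels are treated as ground truth, and the names `agreement_rate` and `gap_closed` are hypothetical.

```python
from typing import Sequence


def agreement_rate(expert: Sequence[str], other: Sequence[str]) -> float:
    """Fraction of items on which `other` matches the expert judgment."""
    if len(expert) != len(other) or not expert:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(e == o for e, o in zip(expert, other)) / len(expert)


def gap_closed(expert: Sequence[str],
               unaided: Sequence[str],
               assisted: Sequence[str]) -> float:
    """Share of the non-expert/expert gap recovered by the assisted process.

    0.0 means the process did not help; 1.0 means the assisted non-experts
    fully reproduce the expert judgments on this set of items.
    """
    baseline = agreement_rate(expert, unaided)
    improved = agreement_rate(expert, assisted)
    if baseline == 1.0:
        raise ValueError("unaided non-experts already match the experts")
    return (improved - baseline) / (1.0 - baseline)


if __name__ == "__main__":
    expert = ["A", "B", "A", "A"]     # more-capable group's judgments
    unaided = ["B", "B", "A", "B"]    # less-capable group, no assistance
    assisted = ["A", "B", "A", "B"]   # less-capable group using the process
    print(gap_closed(expert, unaided, assisted))
```

Agreement with expert labels is a stand-in here; in practice one might instead compare the downstream quality of models trained on each group's feedback.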

In all of these cases, my guess is that the way to get the less-capable group of humans to provide training signals of a similar quality to the more-capable group will involve some combination of:

• Training models to help the humans form better judgments (for example, training models to explain the meaning of technical terms or to fetch and summarize relevant papers for humans).
• Breaking down the problem and splitting it up among many humans (as in Humans Consulting HCH).
• Getting models to explain why they’re doing what they’re doing in simpler terms that connect to things the human overseers understand (this feels like it could fit under debate or interpretability).
• Figuring out how to train the human workers, and how to separate their good judgments from noise / mistakes.
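As a small illustration of the last item, here is one simple way to separate good judgments from noise: weight each worker's vote by their accuracy on check questions with known answers. This is a sketch under assumptions of my own (categorical labels, a pool of known-answer check items, hypothetical function names), not a description of any particular paper's quality-control procedure.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def worker_accuracy(
    check_answers: Dict[str, List[Tuple[str, str]]]
) -> Dict[str, float]:
    """Per-worker accuracy on check items, given (answer, truth) pairs."""
    return {
        worker: sum(given == truth for given, truth in pairs) / len(pairs)
        for worker, pairs in check_answers.items()
        if pairs  # skip workers with no check items
    }


def weighted_vote(votes: Dict[str, str],
                  accuracy: Dict[str, float],
                  default_weight: float = 0.5) -> str:
    """Pick the label with the largest total accuracy-weighted support.

    Workers without a measured accuracy get `default_weight` (an assumed
    value). Ties go to the alphabetically first label.
    """
    totals: Dict[str, float] = defaultdict(float)
    for worker, label in votes.items():
        totals[label] += accuracy.get(worker, default_weight)
    return max(sorted(totals), key=lambda label: totals[label])


if __name__ == "__main__":
    accuracy = worker_accuracy({
        "w1": [("A", "A"), ("B", "B")],  # both check items correct
        "w2": [("A", "B"), ("B", "B")],  # one of two correct
        "w3": [("A", "B"), ("A", "B")],  # none correct
    })
    # An unweighted majority would pick "B"; the weighted vote sides with
    # the one worker who has been reliable on the check items.
    print(weighted_vote({"w1": "A", "w2": "B", "w3": "B"}, accuracy))
```

More careful schemes would estimate worker reliability jointly with the true labels rather than relying on check items alone, but the basic idea of not treating all judgments as equally informative is the same.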

It may not yet be possible to do these more ambitious projects (for example, because models may not be powerful enough yet to train them to meaningfully help human evaluators, engage in debates, meaningfully exceed what humans can recognize / verify, etc). In that case, I think it would still be fairly valuable to keep doing human feedback projects like Stiennon et al., 2020 and stay on the lookout for opportunities to push models past human evaluations; state-of-the-art models are rapidly increasing in size and it may become possible within a couple of years even if it’s not quite possible now.

Importantly, I think people could make meaningful progress on aligning narrowly superhuman models using existing models without scaling them up any further, even if they are only superhuman with respect to human demonstrations for now -- there’s a lot we don’t know even just about how to do RL from human feedback optimally. And in the near future I expect it will be possible to use the larger models which will likely be trained to do even more interesting projects, which have the potential to exceed human evaluations in some domains.

(For more speculative thoughts on how we might go beyond “sandwiching”, see the appendix.)

How this work could reduce long-term AI x-risk

On the outside view, I think we should be quite excited about opportunities to get experience with the sort of thing we want to eventually be good at (aligning models that are smarter than humans). In general, it seems to me like building and iterating on prototypes is a huge part of how R&D progress is made in engineering fields, and it would be exciting if AI alignment could move in that direction.

If there are a large number of well-motivated researchers pushing forward on making narrowly superhuman models as helpful as possible, we improve the odds that we first encounter serious problems like the treacherous turn in a context where a) models are not smart enough to cause actually catastrophic harm yet, and b) researchers have the time and inclination to really study them and figure out how to solve them well rather than being in a mode of scrambling to put out fires and watching their backs for competitors. Holistically, this seems like a much safer situation to be in than one where the world has essentially procrastinated on figuring out how to align systems to fuzzy goals, doing only the minimum necessary to produce commercial products.

This basic outside view consideration is a big part of why I’m excited about the research area, but I also have some more specific thoughts about how it could help. Here are three somewhat more specific paths by which working on aligning narrowly superhuman models today could meaningfully reduce long-term x-risk from advanced AI:

• Practical know-how and infrastructure: It seems likely that a successful long-run approach to (machine learning-based) alignment will involve somehow learning from human demonstrations and/or feedback as a key component, and also pretty likely that it will involve somehow using ML tools to help go beyond raw human judgment. I’d guess that a number of low-level details about how ideas like “RL from human feedback” and “ML aiding human judgments” are implemented will make a difference to how successful the approach is: things like which human judges are selected, how well they are trained and how much practice they have, what exact types of questions are used to elicit the judgments, what judgment aggregation and quality assurance procedures are used, whether there are good off-the-shelf ML solutions for enhancing human judgments in certain ways, whether there are easy-to-use platforms that let researchers gather good human feedback at the push of a button, etc. Aligning narrowly superhuman models today could help build up tools, infrastructure, best practices, and tricks of the trade. I expect most of this will eventually be developed anyway, but speeding it up and improving its quality could still be quite valuable, especially in short-timelines worlds where there's a lot less time for things to take their natural course.
• Better AI situation in the run-up to superintelligence: If at each stage of ML capabilities progress we have made sure to realize models’ full potential to be helpful to us in fuzzy domains, we will be going into the next stage with maximally-capable assistants to help us navigate a potentially increasingly crazy world. We’ll be more likely to get trustworthy forecasts, policy advice, research assistance, and so on from our AI assistants. Medium-term AI challenges like supercharged fake news / clickbait or AI embezzlement seem like they would be less severe. People who are pursuing more easily-measurable goals like clicks or money seem like they would have less of an advantage over people pursuing hard-to-measure goals like scientific research (including AI alignment research itself). All this seems like it would make the world safer on the eve of transformative AI or AGI, and give humans more powerful and reliable tools for dealing with the TAI / AGI transition.[9]
• Chance of discovering or verifying long-term solution(s): I’m not sure whether a “one shot” solution to alignment (that is, a single relatively “clean” algorithm which will work at all scales including for highly superintelligent models) is possible. But if it is, it seems like starting to do a lot of work on aligning narrowly superhuman models probably allows us to discover the right solution sooner than we otherwise would have. For one thing, people doing this work could test proposals (such as Iterated Distillation and Amplification) coming from more conceptual researchers, verifying or falsifying elements and proposing modifications informed by empirical understanding. It also seems plausible that a solution will emerge directly from this line of work rather than the conceptual work -- the latter is mostly focused on finding a one-shot solution that will work under ~pessimal empirical assumptions,[10] but it seems very plausible that a) it’s impossible to find a one-shot solution that works under worst-case empirical assumptions, but b) it’s possible to find one that works given the actual ways that models tend to learn or generalize. More broadly, “doing empirical science on the alignment problem” -- i.e. systematically studying what the main problem(s) are, how hard they are, what approaches are viable and how they scale, etc -- could help us discover a number of different avenues for reducing long-run AI x-risk that we aren’t currently thinking of, one-shot technical solutions or otherwise.

I think both the broad outside view and these specific object-level benefits make a pretty compelling case that this research would be valuable on the object level. Additionally, from a “meta-EA” / “community building” perspective, I think pioneering this work could boost the careers and influence of people concerned with x-risk because it has the potential to produce conventionally-impressive results and demos. My main focus is the case that this work is valuable on the merits and I wouldn’t support it purely as a career-boosting tool for aligned people, but I think this is a real and significant consideration that can tip the scales.

Advantages over other genres of alignment research

First, I’ll lay out what seem like the three common genres of alignment research:

• Conceptual research: This is pen-and-paper thinking that often looks like a combination of math and philosophy, which is usually aiming to make progress toward a “one shot” solution (and also often involves a lot of disentangling and framing what the problem even is). The most prominent examples are MIRI’s work and Paul Christiano’s work; a number of other posts on the Alignment Forum also fit in this category.
• Gridworlds and games: This work aims to demonstrate alignment problems such as wireheading or other reward hacking in a relatively small-scale artificial setting such as a simple game, and usually to solve the demonstrated problem(s) in the small-scale setting in a way that could shed light on how to solve larger-scale alignment problems. Two examples are REALab (Kumar et al., 2020) and Inverse Reward Design (Hadfield-Menell et al., 2017).
• Mainstream ML safety: This is alignment-relevant work that existing ML researchers were independently working on; most of it fits under “reliability+robustness” or “interpretability.” This work is usually done on fairly large (though not always state-of-the-art) neural networks, but doesn’t usually pay special attention to the case where models are more capable or knowledgeable than humans. Some examples are the OpenAI microscope (interpretability), Dathathri et al., 2020 (robustness and reliability), and the Unrestricted Adversarial Examples Challenge (robustness and reliability).

I’m broadly supportive of all three of these other lines of work, but I’m excited about the potential for the new approach described in this post to “practice the thing we eventually want to be good at.” I think on the outside view we should expect that doing whatever we can find that comes closest to practicing what we eventually want to do will be good in a number of ways (e.g. feeling and looking more “real”, encouraging good habits of thought and imposing helpful discipline, etc).

More specifically, here are some advantages that it feels like the “aligning narrowly superhuman models” line of work has over each of the other three genres:

• Compared to conceptual research, I’d guess aligning narrowly superhuman models will feel meatier and more tractable to a number of people. It also seems like it would be easier for funders and peers to evaluate whether particular papers constitute progress, which would probably help create a healthier and more focused field where people are broadly more on the same page and junior researchers can get stronger mentorship. Related to both of these, I think it provides an easier opportunity for people who care about long-run x-risk to produce results that are persuasive and impressive to the broader ML community, as I mentioned above.
• Compared to gridworlds and games, I think this work stands a greater chance of scaling up to more capable systems -- I think it would probably provide some good discipline to do alignment work at a scale that’s large enough that it’s already kind of unwieldy, where models are already more capable than their overseers in some real-world-relevant ways, and researchers are forced to confront messy details and hard-to-foresee structural issues. When it’s possible to demonstrate an issue at scale, I think that’s usually a pretty clear win.
• Compared to mainstream ML safety, aligning narrowly superhuman models has some of the “discipline” advantages mentioned above of focusing on situations where models are more capable than humans. Additionally, lots of researchers work on interpretability and robustness for lots of different reasons, meaning the specific research priorities and “tastes” of the broader interpretability and robustness fields won’t be particularly optimized for reducing long-run x-risk. This can make it harder for newer researchers motivated primarily by x-risk to zoom in on the most x-risk-relevant subproblems and get adequate mentorship on that; aligning narrowly superhuman models has the potential to be more x-risk-oriented from the start.

Finally and maybe most importantly, I think aligning narrowly superhuman models has high long-run field growth potential compared to these other genres of work. Just focusing on GPT-3, there are already a lot of different fuzzy goals we could try to align it to, and the number of opportunities will only grow as the ML industry grows and the number and size of the largest models grow. This work seems like it could absorb a constant fraction (e.g. 1% or 5%) of all the ML activity -- the more models are trained and the more capable they are, the more opportunity there is to align narrowly superhuman models to ever more tasks.

I think we have a shot at eventually supplying a lot of people to work on it too. In the long run, I think more EAs could be in a position to contribute to this type of work than to either conceptual research or mainstream ML safety.[11] Conceptual research is often foggy and extremely difficult to make progress on without a particular kind of inspiration and/or hard-to-define “taste”; mainstream ML safety is often quite technical and mathematically dense (and ensuring the work stays relevant to long-run x-risk may be difficult).

A lot of work involved in aligning narrowly superhuman models, on the other hand, seems like it’s probably some combination of: a) software engineering and ML engineering, b) dealing with human contractors, and c) common sense problem-solving. Lead researchers may need to bring taste and research judgment to ensure that the work is well-targeted, but a number of people could work under one lead researcher doing tractable day-to-day work with reasonably good feedback loops. If there were institutional homes available to onboard people onto this work, I think a strong generalist EA with a software engineering background could plausibly retrain in ML engineering over 6-12 months and start contributing to projects in the space.

Right now there are only a few organizations that offer roles doing this work and that seems like a big bottleneck, but it could make sense to prioritize creating more institutional homes and/or rapidly expanding the ones that exist.

Objections and responses

In this section I’ve tried to anticipate some potential objections, and give my responses; I’d suggest skipping around and reading only the ones that interest you. I don’t think that I have knock-down answers to all of these objections, but I do remain holistically excited about this idea after reflecting on them some.

How would this address treachery by a superintelligence?

Elaboration of objection: It seems like there is a “hard core” of the alignment problem that only crops up when models are very smart in a very general way, not just e.g. better than MTurkers at giving medical advice. The specific scariest problem seems to be the “treacherous turn”: the possibility that the model will appear to be helpful during training time even though it’s actually power-seeking because it’s aware that it’s being trained and has to act helpful to survive, and later cause catastrophic harm once it knows it’s out of the training setup. It doesn’t seem like the “aligning narrowly superhuman models” style of work will figure out a way to address the treacherous turn until it’s likely too late.

I'm very uncertain how relevant the near-term work will turn out to be for more exotic problems like the treacherous turn, and I want to think more about ways to nudge it to be more relevant.[12] I would be very excited to find empirical research projects on large models that specifically shed light on the treacherous turn possibility, and I agree it’s a weakness of my set of potential projects that they aren’t specifically optimized for unearthing and correcting treachery.

With that said, I don’t think there are currently genres of work that feel similarly tractable and scalable that do tackle the treacherous turn head on -- of the main genres of alignment work, I’d argue that only a subset of the conceptual work is aiming to directly generate a long-term solution to treachery, and I think the jury is very much out on whether it will be fruitful; gridworlds and games and mainstream ML safety largely don’t seem to try for a long-term treacherous turn solution. So I think the relative hit that my proposal takes due to this consideration is fairly limited.[13]

Even if they don’t start off tackling the treacherous turn, I’d guess that researchers would have a decent shot at learning useful things about treachery down the line if they were pursuing this work. Basically, I think it’s pretty likely that full-blown treachery will be preceded by mini-treachery, and with better understanding of how neural networks tend to learn and generalize, researchers may be able to specifically seek out domains where mini-treachery is especially likely to occur to better study it. Even if techniques used by empirical researchers don’t work out of the box for the treacherous turn, empirical work eliciting and studying mini-treachery could still inform what kind of theoretical or conceptual work needs to be done to address it, in a way that seems more promising to me than eliciting micro-treachery in gridworlds and games.

Moreover, even though the treacherous turn seems like the scariest single source of risk, I don’t think it totally dominates the overall expected AI risk -- a significant fraction of the risk still seems to come from more “mundane” outer alignment failures and various unforced errors, which this empirical work seems better-placed to address. Of the three broad ways I listed that this work could reduce x-risk, the critique that it doesn’t seem to address the treacherous turn very well applies most to the “Chance of discovering or verifying long-term solution(s)” category; even if it fails to address the treacherous turn, it still seems that “Practical know-how and infrastructure” and “Better AI situation in the run-up to superintelligence” matter.

Doesn’t this feel suspiciously close to just profit-maximizing?

Elaboration of objection: It sort of sounds like you’re just telling EAs to make AI really useful to humans (and indeed push models to be superhuman if they can be); it feels like this would also be what someone who is into pure profit-maximization would be excited about, and that makes me suspicious about the reasoning here and nervous about calling it an alignment activity. Even if you’re right that it helps with alignment, we might see a lot of people flock to it for the wrong reasons.

I agree that there is overlap with commercial incentives, but I think there are three high-level ways that this type of work would be different from what you’d do if you were profit-maximizing:

• Not making models bigger: This work doesn’t involve making models bigger; it involves making models of a given fixed size more helpful. In a commercial setting, often a cost-effective way of improving results would be to simply scale the model up.
• Seeking difficult rather than easy problems: The problem selection is different -- other things being equal, in a commercial setting you want to select the easiest possible tasks; in this type of work, people would select interestingly difficult tasks. For example, commercial incentives would push someone to focus on precisely those tasks where simply meeting (rather than exceeding) the human imitation benchmark is sufficient for being profitable. Profit-motivated people would also likely seek tasks where algorithmically generated or hard-coded reward signals would go a long way (for example, in robotics you might be able to get away with providing algorithmically generated feedback about whether the robot’s actuators ended up in the right place). The sandwiching approach I propose above is by construction making things much harder than they need to be from a pure commercial standpoint: it involves refusing to use the “best human overseers for the job” in favor of trying to figure out how to help less-capable overseers provide an adequate training signal.
• Seeking domain-general and scalable techniques: There is a focus on scalability and generality of techniques that goes well beyond what would be commercially optimal. In commercial settings, I expect that people will make heavy use of hard-coded behaviors and “hacks” which fully exploit domain knowledge (as is the case with self-driving cars). Additionally, there is often a “right size model for the job” in commercial settings (image models only need to be so big to adequately power self-driving car perception), and there will often not be much incentive to find techniques that also work well for a model 100x bigger. A “clean”, domain-general, and scalable technique is rarely what will make the most profit at the current moment.

More broadly, I think successful versions of this type of alignment work should get someone who deeply understands ML and its limitations to say something like, "Wow, it's cool that you got the model to do that." My sense is that most commercial projects wouldn’t really elicit this reaction, and would look more like applying a lot of hard work to realize an outcome that wasn’t very much in doubt.

Given these differences, I think there’s a good shot at distinguishing this type of work from pure profit-seeking and cultivating a community where a) most people doing this work are doing it for altruistic reasons, and b) this is reasonably legible to onlookers, funders, potential junior researchers, etc.

Isn’t this not neglected because lots of people want useful AI?

Elaboration of objection: Even if this is useful for alignment, and even adjusting for the fact that companies aren’t focusing on the version that’s specifically alignment-optimized, won’t a ton of this work get done in AI labs and startups? Doesn’t that mean that the EA community is less likely to make an impact on the margin than in other, less-commercially-incentivized types of alignment work?

I do think there’s probably some work happening broadly along these lines from a commercial motivation, and there will probably be significantly more in the future. But I pretty strongly suspect that there are very few, if any, projects like the ones I proposed above currently being done in a commercial setting, and what work is being done is less well-targeted at reducing long-run x-risk than it could be.

The vast majority of commercial work going into AI by dollars is a) hyper application-specific and hard-coding intensive such as self-driving cars, or b) focused on scaling big generic models. I don’t actually think the resources going into any sort of project focused on human demonstrations and feedback are very large right now; I’d guess they are within an order of magnitude of the resources going into other alignment work (e.g. 100s of millions per year at the high end, where other alignment research absorbs 10s of millions per year). And for the reasons outlined above, not a lot of this will be focused on exceeding humans using scalable, domain-general techniques.

As an example to illustrate the relative neglectedness of this work, it was Paul Christiano (motivated by long-term alignment risk concerns) who led the Stiennon et al., 2020 work, and I think it’s reasonably likely that if he hadn’t done so there wouldn’t have been a human feedback paper of similar scale and quality for another year or so. I’d guess the EA community collectively has the opportunity to substantially increase how much of this work is done before transformative AI with a strong push, especially because the “going beyond human feedback” step seems less commercially incentivized than the Stiennon et al. work.

• I think that it matters who is doing this work and why, not just that the work gets done somehow. It seems significantly better to have someone working on these problems who is self-awarely doing it to help with long-run x-risk reduction, and who is plugged into the broader alignment community, than someone who just happens to be doing work that might be relevant to alignment. It’s valuable to be collaborating with and getting feedback from more theoretical alignment researchers, and to be mentally on the lookout for ways to make the work more analogous to the long-run challenge; a generic ML engineer working on human feedback to improve the newsfeed at Facebook would be much less likely to continue to keep focusing on long-run-relevant questions for their whole career.[14] (And one of the value propositions here is that the long-termists / AI alignment people, as a community, should be gathering this experience, so experience that’s less accessible to the community is less valuable.)
• I think that for most people,[15] the value (roughly speaking, the importance multiplied by the tractability) of doing marginal work in an area as a function of its crowdedness is often an upside-down U-shape rather than strictly decreasing. When there’s practically no one in an area, there’s no one who can mentor you when you’re getting started, no one who you can hire when you’re experienced, and there’s no built-in audience who can be swayed by your demonstrations or arguments and can act on that. My personal intuition is that for empirical alignment work, we’re near the increasing returns part of this curve (though this situation can change rapidly). There’s an existing group of people who have an incentive to work on something in this space and may ramp up soon, but I think EAs have a chance to set the tone and agenda for what exactly the work they do looks like, and what standards it should be held to. I could imagine a pretty broad range of outcomes for how much ML engineers working on productizing hold themselves to the standard of finding domain-general and scalable solutions, and I could imagine EAs having an impact on that culture.

Will this cause harm by increasing investment in scaling AI?

Elaboration of objection: Even if the people doing this research don’t personally scale up models and focus on generalizable and scalable solutions to making models helpful, they will be demonstrating that the models have powerful and useful capabilities that people might not have appreciated before, and could inspire people to pour more investment into simply scaling up AI or making AI useful in much less principled ways, which could cause harm that exceeds the benefits of the research.

This is a very contentious question and people have a wide range of intuitions on it. I tend to be less bothered by this type of concern than a lot of other people in the community across the board. At a high-level, my take is that:

• We’re in the middle of an AI investment boom that I expect to be sustained for several more years.
• The amount of effort going into AI as a whole (10s of billions per year) is currently ~2 orders of magnitude larger than the amount of effort going into the kind of empirical alignment I’m proposing here, and at least in the short-term (given excitement about scaling), I expect it to grow faster than investment into the alignment work.
• This means an additional dollar of effort going into the empirical language models alignment work would need to generate ~100 dollars or more of investment into accelerating AI to have a proportionally large impact on accelerating AI as a whole, in a climate where investors are already excited and AI labs are already trying hard to make them more excited. This isn’t out of the question, but doesn’t seem likely to me, especially given that EAs would likely be partially displacing people who would do similar work from a pure profit motivation, and that we could try to consciously shape messaging to further reduce the expected impact on AI hype. (In general, it’s hard to get a factor of 100 leverage on your spending even if you’re optimizing for it.)
• It also seems plausible that there are positive side effects on others’ investment, such as directing marginal money away from making models larger and toward fine-tuning models to be helpful.
• Finally, I am not personally fully convinced that speeding up AI as a whole would be net negative (it seems like timing interacts in extremely complicated ways with who is in power and what the global situation is like around the time of transformative AI), which claws back some of the expected damage from acceleration.
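
To make the leverage arithmetic in the bullets above concrete, here is a rough sketch. It assumes one reading of "proportionally large impact": the extra AI investment induced by an alignment dollar, as a fraction of total AI spending, should match that dollar as a fraction of alignment spending. The specific figure of 30 billion per year is an illustrative assumption (the text only says "10s of billions"), and the function name is hypothetical.

```python
def breakeven_leverage(total_ai_spend: float, alignment_spend: float) -> float:
    """Dollars of AI investment one alignment dollar would have to induce
    so that its proportional effect on AI spending equals its proportional
    effect on alignment spending.

    One extra alignment dollar grows alignment spending by 1 / alignment_spend.
    If it induces k dollars of AI investment, AI spending grows by
    k / total_ai_spend. Setting the two equal gives
    k = total_ai_spend / alignment_spend.
    """
    if total_ai_spend <= 0 or alignment_spend <= 0:
        raise ValueError("spending figures must be positive")
    return total_ai_spend / alignment_spend


# Illustrative assumption: "10s of billions per year" taken as 30 billion.
total_ai_spend = 30e9
# "~2 orders of magnitude" smaller, per the text.
alignment_spend = total_ai_spend / 100

leverage = breakeven_leverage(total_ai_spend, alignment_spend)
print(f"Break-even leverage: about {leverage:.0f} dollars per alignment dollar")
```

The result depends only on the ratio of the two spending levels, so the conclusion of roughly a factor of 100 holds for any starting figure as long as the gap stays at about two orders of magnitude.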

With that said, I do think that exciting demos are a lot more likely to spur investment than written arguments, and this kind of research could generate exciting demos. Overall, the case for caution feels stronger to me than the case for caution about discussing arguments about timelines and takeoff speeds, and this consideration probably net claws back some enthusiasm I have for the proposal (largely out of deference to others).

Why not just stick with getting models not to do bad things?

Elaboration of objection: Even if this is useful for alignment, worth doing on the margin, and not net-harmful, it seems like it would be dominated by doing practical/near-term work that’s more clearly and legibly connected to safety and harm-reduction, like “getting models to never lie” or “getting models to never use racist slurs” or “getting models to never confidently misclassify something.” That work seems more neglected and more relevant.

Some people might feel like “avoiding bad behaviors” is clearly the subset of near-term empirical alignment work which is most relevant to long-run alignment and neglected by profit-seeking actors -- after all, in the long run we’re trying to avoid a big catastrophe from misaligned AI, so in the short run we should try to avoid smaller catastrophes.

I disagree with this: I think both “getting models to be helpful and surpass human trainers” and “getting models to never do certain bad things” are valuable lines of empirical alignment work, and I’d like to see more of both. But I don’t think reliability and robustness has a special place in terms of relevance to long-run x-risk reduction, and if anything it seems somewhat less exciting on the margin. This is because:

• Most versions of “make a model more reliable” don’t really get at scalability to tasks/domains that are more challenging for humans to supervise, and it seems especially valuable to specifically target that. It seems very plausible to me that the most interesting challenges that are most analogous to the long-run challenge will only come up when we’re trying to get excellent or superhuman performance out of a model, rather than when we’re trying to avoid certain specific bad things.
• I don’t actually think that reliability work is more neglected than the work of getting models to be helpful in domains that are difficult for humans. There is a significantly larger academic field around reliability and robustness than around alignment, and the reliability/robustness problem is often harder to avoid or sidestep as a company: you can choose domains where human expertise is strong or automated reward signals exist, but you will still need to get your product to meet a fairly high bar of reliability before it is commercially viable.
• Robustness and reliability falls under multiple different “social good” brands. People concerned with “Fairness, Accountability, and Transparency” (FAT) tend to be very interested in the reliability and robustness space, as well as people concerned with e.g. autonomous weapons. Even though there is a worry that the “make models helpful” work is too easy to confuse with commercialization, my weak best guess is that it would actually be harder to tell which people working in the robustness space are optimizing for reducing long-term x-risk from AI (vs for profit or other altruistic goals), and I’d guess it would be tougher to build a distinctive culture / brand around working on the sub-problems most relevant to long-term risk.

Why not focus on testing a candidate long-term solution?

Elaboration of objection: This proposal seems like it would lead to a lot of wasted work that isn’t sufficiently optimized for verifying or falsifying a long-term solution to alignment. It would be better if the potential projects were more specifically tied in to testing an existing candidate long-term solution, e.g. Paul Christiano’s agenda.

I’ll focus on Paul’s agenda in my response, because the specific people I’ve talked to who have this objection mostly focus on it, but I think my basic response will apply to all the conceptual alignment agendas.

Some of the projects under the umbrella of “aligning narrowly superhuman models” seem like they could instead be reframed around specific goals related to Paul’s agenda, like “prototyping and testing capability amplification”, “prototyping and testing imitative generalization”, “figuring out how ascription universality works”, and so on. I do think one of the value propositions of this work is shedding light on these sorts of concepts, but I think it’s probably not helpful to frame the whole endeavor around that:

• Verifying proposed long-term solutions is only one way that the work could reduce AI x-risk, and I don’t think it’s overwhelmingly dominant,[16] especially not if restricted to the set of long-run solutions proposed so far. I want people who are committed to reducing long-run AI x-risk but don’t believe in any of the existing conceptual research to be doing this work, too.
• Not a lot of people currently understand the agenda well enough that they could generate good research projects from the prompt of “prototype and test [concept from a Paul blog post].” Similarly, I don’t think funders and peer reviewers understand the agenda well enough to tell if a research project with that goal was helpful.
• Paul’s agenda is in very active development, and I think there’s a reasonable chance the whole plan ends up looking pretty different within a year or two. Given this and the above point, I think empirical work testing specific Paul ideas is best done in close collaboration with him, and I’d guess even someone who believes in Paul’s agenda would often be better off just targeting the slightly looser problem description absent a lot of access to him. This makes me think research under the frame of “test Paul’s agenda” is a lot less scalable than research under the frame of “align narrowly superhuman models.”

There could be some simple organizing goal or “tagline” for empirical alignment research that is neither “test [concept from a Paul blog post]” nor “align narrowly superhuman models” which would inspire better-targeted research from the perspective of someone who’s bullish on Paul’s work, but the ones I’ve thought about haven’t been convincing,[17] and I’d guess it’ll be hard to find a good organizing tagline until the theory work gets to a more stable state.

Current state of opinion on this work

One of my goals in writing this blog post is to help build some community consensus around the “aligning narrowly superhuman models” proposal if it’s in fact a good idea. To that end, I’ll lay out my current understanding of where various AI alignment researchers stand on this work:

• Paul Christiano spent a few years at OpenAI working on this kind of thing (as I mentioned above, he was the team lead on the Stiennon et al., 2020 paper) and generally thinks it’s important -- he feels the conceptual work he’s currently doing beats it as a use of his own time, but believes that this kind of work is among the best highly scalable types of alignment research.
• Alignment researchers I’ve spoken to that primarily do research on large neural networks (unlike Paul, who does a mixture of this and conceptual thinking) tend to be more enthusiastically positive on this and more likely to consider it the best kind of work they personally could do. They also tend to be more positive on even more “no holds barred” versions of this idea -- i.e., just trying to make helpful models without focusing in particular on ideas like “sandwiching.”
• My understanding of Eliezer Yudkowsky’s position is one of “cautious relative optimism” about something in this general space compared to other non-MIRI alignment work, though he would frame the core concern differently, with more emphasis on understandability of models’ answers and decisions (e.g. “GPT-3 has somewhere buried inside it knowledge of what to do when you’re sick; how do you extract all of that and how can you tell when you’ve succeeded?”). He was reasonably positive on Stiennon et al., 2020 when it came out, and would be happy to see more work like that. Evan Hubinger’s position seems broadly similar (he is specifically interested in ascription universality). I’m not sure where others at MIRI would land on this work.
• My sense is that people who do conceptual thinking work other than Paul and MIRI tend to have a position similar to or somewhat more optimistic than Eliezer’s or Evan’s. E.g. I think Rohin Shah feels that aligning narrowly superhuman models is a reasonably good baseline for what research to do (and is developing a benchmark related to this), but he has privileged insight that beats that baseline. My rough sense is that other researchers doing conceptual thinking are on average somewhat less excited about aligning narrowly superhuman models than Paul is, and a lot less excited than the pure ML alignment researchers, but I’m not sure.

I also think a number of AI alignment researchers (and EAs working in AI risk more broadly) simply haven’t thought a lot about this kind of work because it hasn’t really been possible until the last couple of years. Until 2019 or so, there weren’t really any models accessible to researchers which could exceed human performance in fuzzy domains, and research agendas in AI alignment were largely formed before this was an option.

Takeaways and possible next steps

I’ve laid out the hypothesis that aligning narrowly superhuman models would concretely reduce x-risk and has high long-run field growth potential (i.e., lots of people who don’t have particularly esoteric skills could eventually help with it). I think if the EA and AI alignment community is in broad agreement about this, there’s potential to make a lot happen.

In terms of immediate actionable takeaways:

• If you disagree with this argument, say so -- especially if you think it would be harmful or would be dominated by a different line of work that shares similar practical advantages of tangibility, good feedback loops, and potential-for-scale.
• If you have more or better project ideas in mind, say so -- especially if you have ideas about how to target “treacherous turn” dynamics more specifically or how to reframe the statement of the problem to make it more productive, well-targeted, etc.
• If you a) already agree with me, and b) are already in a good position to fairly immediately make this work happen (e.g. you are a PI at a university lab that is able to fine-tune open-source models like Google’s T5, or you are a senior ML researcher at a tech company with the freedom to do your own projects), then consider doing a project in this space. For example, you could try to solve tasks in this Minecraft human feedback benchmark being developed by some researchers at CHAI when it's released. Getting more demos of what it looks like to do this research will help make it easier to think about how valuable it would be and build consensus around it if it is. Most people will not be in this position. As I said at the top, Open Phil is not soliciting grant applications right now from people who want to try it out -- this blog post is my personal viewpoint, and institutionally we’re still figuring out how much we want to prioritize this (discussion and arguments surrounding this post will feed into that).
• If you agree with this case and might be in a position to work on aligning narrowly superhuman models a few years down the line (e.g. if you are a software engineer or a university student with a technical background), consider keeping this in the back of your mind and checking in about future opportunities. If you are ready to try to switch into this work sooner, there may be jobs available doing this sort of work at various AI labs including OpenAI and DeepMind,[18] though I’d guess roles right now are fairly limited.

Looking forward to hearing people’s thoughts!

Appendix: beyond sandwiching?

Right now, models like GPT-3 are not “superhuman” at fuzzy tasks in the sense that AlphaGoZero is “superhuman” at playing Go. AGZ plays Go better than any human, while GPT-3 is only capable of giving better advice or writing better stories than some humans, which is what makes the “sandwiching” tactic an option. What happens when language models and other models get narrowly superhuman in a strong sense -- better than all humans in some fuzzy domain, e.g. stock-picking? How would we verify that we got the model to be “doing the best it can do to help” when there’s no reference model trained on a ground truth signal to compare its performance to?

I’m definitely very unsure what this would look like, but an important starting assumption I have is that whatever techniques worked well to get less-capable humans to reproduce the judgments of more-capable humans in a “sandwich” setting stand a good chance of just continuing to work. If we were careful not to actually use the expertise of the more-capable set of humans in whatever systems/tools we used to assist/augment the less-capable set, and a similar set of systems/tools seemed to work across multiple domains and for humans at multiple different capability levels, there’s no particular reason to believe they would not continue working once models go from slightly less capable than the best humans to slightly more capable than them at some task. So I think it’s possible we could do most of the R&D in the regime where sandwiching works.

With that said, here are some thoughts about how we could try to probe whether our alignment techniques were actually successful at eliciting a model’s full potential in a regime where the model is more capable than the best humans:

• We could attempt the sandwiching technique with augmented or assisted humans at both the top and the bottom. For example, if a model is capable of better stock-picking than the best human hedge fund analysts, we could compare its performance when trained using feedback from top hedge fund analysts vs feedback from normal people -- where both the hedge fund analysts and the normal people are assisted/augmented using whatever decomposition techniques, question elicitation techniques, ML assistants, etc. are the state of the art. Since the expertise of the individual humans within the system ideally shouldn’t bear on how well they are able to provide supervision, the goal would be to get the “augmented normal people” to produce a model that behaves similarly to the model produced by the “augmented experts.”
• We could try to exploit the fact that it’s usually easier to recognize a good outcome (especially after some time has passed) than to generate good behaviors. For example, we could compare the performance of a stock-price-predicting model trained using feedback from augmented hedge fund analysts to the performance of a model trained with the benefit of hindsight about what stocks actually went up over a one-year period. Ultimately, I think it’s scary if we have to rely entirely on medium- and long-run outcomes to train our superhuman models,[19] but in an R&D context it could make sense to use a model trained on medium-run outcomes to check whether our techniques for augmenting human judgment are managing to elicit the full potential of the model.
• More speculatively, we could build up interpretability tools to probe the model and try to understand why it’s doing what it’s doing to build up confidence that it’s “trying its best” and making reasonable decisions. We could try to work on something like this in the near-term with AlphaGoZero itself -- try to develop interpretability tools that will let pro Go players learn new insights about how to play Go better by probing AGZ. If we develop tools like this and verify them on models like AGZ, top hedge fund analysts could later use them to probe a stock-picking model and develop a better understanding of whether it’s truly “trying its best to pick the right stocks.”
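
The first check above comes down to asking whether a model trained with feedback from "augmented normal people" behaves similarly to one trained with feedback from "augmented experts." A minimal sketch of one way to quantify that, assuming both models can be run on the same held-out cases and their decisions compared directly; the function name, the decision labels, and the toy data are all hypothetical and not taken from any existing benchmark:

```python
from typing import Sequence


def agreement_rate(decisions_a: Sequence[str], decisions_b: Sequence[str]) -> float:
    """Fraction of held-out cases on which two models make the same decision."""
    if len(decisions_a) != len(decisions_b):
        raise ValueError("both models must be evaluated on the same cases")
    if len(decisions_a) == 0:
        raise ValueError("need at least one case to compare")
    matches = sum(a == b for a, b in zip(decisions_a, decisions_b))
    return matches / len(decisions_a)


# Made-up decisions on five held-out stocks, purely for illustration.
lay_supervised = ["buy", "hold", "sell", "buy", "hold"]      # trained on augmented lay feedback
expert_supervised = ["buy", "hold", "sell", "sell", "hold"]  # trained on augmented expert feedback

rate = agreement_rate(lay_supervised, expert_supervised)
print(f"Agreement between the two models: {rate:.0%}")
```

A high agreement rate would only be suggestive: two models can agree while both falling short of what the underlying model is capable of, which is why the outcome-based and interpretability-based checks are listed alongside this one.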

1. At least better than some salient large group of humans in a particular context, like “Mechanical Turk workers”, “Stack Overflow users”, etc. Right now, models are only superhuman with respect to all humans in particular crisp domains like games. E.g. AlphaGoZero is better at Go than any human; GPT-3 probably has the potential to give better advice than some humans. ↩︎

2. This idea isn’t original to me -- a number of others (especially some people working on long-term AI alignment at OpenAI and DeepMind) have thought along similar lines. My own thinking about this has been informed a lot by discussions with Paul Christiano and Holden Karnofsky. ↩︎

3. e.g., Mechanical Turk workers who are hired to give feedback to the model ↩︎

4. Though if we could pull off a path where we build an AI system that is superhuman in certain engineering capabilities but not yet human-level in modeling and manipulating people, and use that system to cut down on x-risk from other AI projects without having to figure out how to supervise arbitrary superhuman models, that could be really good. ↩︎

5. Note that I don’t think this is the only way to study interpretability and robustness, or even necessarily the best way. In this project-generation formula, the domain and task were optimized to make reward learning an especially interesting and important challenge, rather than to make interpretability or robustness especially challenging, interesting, or important. I think it’s good to be complete and to try to ensure interpretability and robustness in these domains, but we should probably also do other lines of research which choose domains / tasks that are specifically optimized for interpretability or robustness, rather than reward learning, to be especially challenging and important. ↩︎

6. Pragmatically speaking, fine-tuning a large model rather than training from scratch is also orders of magnitude cheaper, and so a lot more accessible to most researchers. ↩︎

7. Another way of seeing why it wouldn’t count is that “predict the next token” is an extremely non-fuzzy training signal. ↩︎

8. Human contractors make these labels, but they are not providing feedback. ↩︎

9. More speculatively, if we’re realizing models’ full potential as we go along, there’s less chance of ending up with what I’ll call an “unforced sudden takeoff”: a situation where on some important set of fuzzy tasks models jump suddenly from being not-that-useful to extraordinarily useful, but this was due to not bothering to figure out how to make models useful for fuzzy tasks rather than any inherent underlying fact about models. I’m not sure how plausible an unforced sudden takeoff is though, and I’m inclined (because of efficient market intuitions) to think the strong version of it is not that likely. H/t Owen Cotton-Barratt for this thought. ↩︎

10. E.g., that whenever there are two or more generalizations equally consistent with the training data so far, models will never generalize in the way that seems more natural or right to humans. ↩︎

11. I think eventually gridworlds and games will probably fade away as it becomes more practical to work with larger models instead, and dynamics like the treacherous turn start to show up in messier real-world settings. ↩︎

12. One idea a couple of others have suggested here and which I’m generally interested in is “transparency in (narrowly superhuman) language models”: finding ways to understand “what models are thinking and why,” especially when they know more about something than humans do. I like this idea but am very unsure about what execution could look like. E.g., would it look like Chris Olah’s work, which essentially “does neuroscience” on neural networks? Would it look like training models to answer our questions about what they’re thinking? Something else? ↩︎

13. Though you could think that in an absolute sense it and all the other approaches that aren’t tackling treachery head-on are doomed. ↩︎

14. I would also prefer other things being equal that EAs focused on long-run x-risk get the recognition for this work rather than others, but as I said above I consider this secondary and think that this agenda is good on the merits, not just as career capital for EAs. ↩︎

15. There are some innovators for whom the value of being in an area is strictly decreasing in its crowdedness, because their main value-add is to “start something from nothing.” But I don’t think that applies to most contributors, even those who have an extremely large impact eventually (which might even be larger than the innovators’ impact in some cases). ↩︎

16. Some people have argued that the “verifying long-run solutions” path is dominant because the other stuff is likely to happen anyway, but I’m not convinced. I think all three paths to impact that I laid out are likely to happen one way or another, and there’s room to speed up or improve all of them. I do think there could be some boost to the “verifying long-run solutions” path, but all in all I feel like it’ll be ⅓ to ¾ of the value, not >90% of the value. ↩︎

17. The most plausible competing pitch in my mind is “get language models to answer questions honestly”, which seems like it could get at the “ascription universality” / “knowing everything the model knows” concept (h/t Evan H, Owen C-B, Owain E). That would narrow the focus to language models and question-answering, and rule out projects like “get non-coders to train a coding model.” I think the “get language models to answer questions honestly” frame is reasonable and I want to see work done under that banner too, but I’m not convinced it’s superior. It considerably narrows the scope of what’s “in”, cutting down on long-run field growth potential, and I think a lot of the projects that are “out” (like the coding project) could be helpful and informative. I also worry that the tagline of “honesty” will encourage people to focus on “avoiding harmful lies that are nonetheless pretty easy for humans to detect”, rather than focusing on regimes where models exceed human performance (see this objection for more discussion of that). ↩︎

18. It’s possible other places, like Google Brain or some other FAANG lab, would also have roles available doing this type of work -- I am just more unsure because there is less of a long-termist alignment researcher presence in those places. ↩︎

19. Eventually, when models are more strongly superhuman, I think it will get too hard to even tell whether outcomes were acceptable, because AI systems could e.g. compromise the cameras and sensors we use to measure outcomes. So relying on outcomes earlier on feels like “kicking the can down the road” rather than “practicing what we eventually want to be good at.” “Don’t kick the can down the road, instead practice what we eventually want to be good at” is the overall ethos/attitude I’m going for with this proposal. ↩︎


I've copied over comments by MIRI's Evan Hubinger and Eliezer Yudkowsky on a slightly earlier draft of Ajeya's post — as a separate post, since it's a lot of text.

First and foremost, great post! "How do we get GPT to give the best health advice it can give?" is exactly the sort of thing I think about as a prototypical (outer) alignment problem. I also like the general focus on empirical directions and research-feedback mechanisms, as well as the fact that the approach could produce real economic value.

Now on to the more interesting part: how does this general strategy fail horribly?

If we set aside inner alignment and focus exclusively on outer alignment issues, then in general the failure mode which I think is far and away most likely is roughly "you get what you can measure" or "you get something designed to look good to human supervisors without actually being good". In other words, the inability of humans to reliably/robustly evaluate outcomes is the big problem. (The Fusion Power Generator Scenario is one good example of the type of failure I'm talking about here - the human doesn't understand what-they-want at a detailed enough level to even ask the right questions, let alone actually evaluate a design.)

So: I expect any version of "align narrowly superhuman models" which evaluates the success of the project entirely by human feedback to be completely and totally doomed, at-best useless and at-worst actively harmful to the broader project of alignment. Worse, I expect that those are exactly the sort of projects which will produce the most impressive demos, potentially attract investors, etc. After all, their outputs are optimized for looking good to humans (without actually being good) - of course they're going to look good to human investors, engineers, etc!

Now, what's really interesting about this piece is that you propose at least one approach - the sandwich method - explicitly addressing that failure mode. Personally, I found that idea the most interesting and promising part of this whole piece. I'll even register a prediction: 80% the "sandwich problem" cannot be solved in a domain-generalizable way without major conceptual progress on (outer) alignment. Though I would not be surprised if attempts to solve the sandwich problem failed in ways which directly led to at least some conceptual progress, and I do still expect empirical work on the problem is likely to be valuable for that reason. (Though I also expect that a lot of people will read the description of the sandwich problem and fail to understand the requirement which makes it interesting - namely, that the experts be completely and totally absent from the training process, and in particular no data from experts should be involved in the training process.) If I thought I had a generalizable method capable of solving the sandwich problem, it would probably be the highest-priority thing on my agenda.

Thanks for the comment! Just want to explicitly pull out and endorse this part:

the experts be completely and totally absent from the training process, and in particular no data from the experts should be involved in the training process

I should have emphasized that more in the original post as a major goal. I think you might be right that it will be hard to solve the "sandwich" problem without conceptual progress, but I also think that attempts to solve the sandwich problem could directly spur that progress (not just reveal the need for it, but also take steps toward finding actual algorithms in the course of doing one of the sandwich problems).
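To make the requirement concrete, here is a minimal sketch of how a sandwiching experiment might keep experts out of training and use them only as the bar at evaluation time. Everything named here (`Feedback`, the `"non_expert"` / `"expert"` labels, the score arguments) is a hypothetical illustration, not something specified in the post.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Feedback:
    source: str    # hypothetical labels: "non_expert" or "expert"
    rating: float  # whatever signal the labeler provided


def check_experts_absent(training_feedback: List[Feedback]) -> None:
    """Raise if any data that did not come from non-experts reached training."""
    leaked = [f for f in training_feedback if f.source != "non_expert"]
    if leaked:
        raise ValueError(
            f"{len(leaked)} training item(s) did not come from non-experts"
        )


def sandwich_result(non_expert_alone: float,
                    assisted_non_expert: float,
                    expert: float) -> str:
    """Compare held-out scores; experts appear only here, as the evaluation bar."""
    if assisted_non_expert >= expert:
        return "matched experts"
    if assisted_non_expert > non_expert_alone:
        return "improved on non-experts, below experts"
    return "no improvement"
```

The point of separating the two functions is that the expert score is an input to evaluation only; nothing derived from experts is allowed to pass `check_experts_absent`.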

I also broadly agree with you that "things looking good to humans without actually being good" is a major problem to watch out for. But I don't think I agree that the most impressive-looking results will involve doing nothing to go beyond human feedback: successfully pulling off the sandwich method would most likely look significantly more impressive to mainstream ML researchers than just doing human feedback. (E.g., one of the papers I link in the post is a mainstream ML paper amplifying a weak training signal into a better one.)

But I don't think I agree that the most impressive-looking results will involve doing nothing to go beyond human feedback: successfully pulling off the sandwich method would most likely look significantly more impressive to mainstream ML researchers than just doing human feedback.

I partially agree with this; alignment is a bottleneck to value for GPT, and actually aligning it would likely produce some very impressive stuff. My disagreement is that it's a lot easier to make something which looks impressive than something which solves a Hard problem (like the sandwich problem), and therefore most impressive-looking "solutions" will probably circumvent the key part of the problem. And if the Hard problem is indeed hard enough to not be solved by anyone, the most impressive-looking results will be those which look good without actually solving it.

I guess the crux here is "And if the Hard problem is indeed hard enough to not be solved by anyone": I don't think that's the default/expected outcome. There hasn't been that much effort on this problem in the scheme of things, and I think we don't know where it falls on the range from "pretty easy" to "very hard" right now.

Ah... I think we have an enormous amount of evidence on very-similar problems.

For instance: consider a lawyer and a business owner putting together a contract. The business owner has a rough intuitive idea of what they want, but lacks expertise on contracts/law. The lawyer has lots of knowledge about contracts/law, but doesn't know what the business owner wants. The business owner is like our non-expert humans; the lawyer is like GPT.

In this analogy, the analogue of an expert human would be a business owner who is also an expert in contracts/law. The analogue of the "sandwich problem" would be to get the lawyer + non-expert business-owner to come up with a contract as good as the expert business-owner would. This sort of problem has been around for centuries, and I don't think we have a good solution in practice; I'd expect the expert business-owner to usually come up with a much better contract.

This sort of problem comes up all the time in real-world businesses. We could just as easily consider a product designer at a tech startup (who knows what they want but little about coding), an engineer (who knows lots about coding but doesn't understand what the designer wants), versus a product designer who's also a fluent coder and familiar with the code base. I've experienced this one first-hand; the expert product designer is way better. Or, consider a well-intentioned mortgage salesman, who wants to get their customer the best mortgage for them, and the customer who understands the specifics of their own life but knows nothing about mortgages. Will they end up with as good a mortgage as a customer who has expertise in mortgages themselves? Probably not. (I've seen this one first-hand too.)

One approach is to let the human giving feedback think for a long time. Maybe the business owner by default can't write a good contract, but a business owner who could study the relevant law for a year would do just as well as the already expert business-owner. In the real world this is too expensive to do, but there's hope in the AI case (e.g. that's a hope behind iterated amplification).

How does iterated amplification achieve this? My understanding was that it simulates scaling up the number of people (a la HCH), not giving one person more time.

Yeah, sorry, that's right, I was speaking pretty loosely. You'd still have the same hope -- maybe a team of 2^100 copies of the business owner could draft a contract just as well, or better than, an already expert business-owner. I just personally find it easier to think about "benefits of a human thinking for a long time" and then "does HCH get the same benefits as humans thinking for a long time" and then "does iterated amplification get the same benefits as HCH".

Where did this idea of HCH yielding the same benefits as a human thinking for a long time come from??? Both you and Ajeya apparently have this idea, so presumably it was in the water at some point? Yet I don't see any reason at all to expect it to do anything remotely similar to that.

I agree with the other responses from Ajeya / Paul / Raemon, but to add some more info:

Where did this idea of HCH yielding the same benefits as a human thinking for a long time come from???

... I don't really know. My guess is that I picked it up from reading giant comment threads between Paul and other people.

I don't see any reason at all to expect it to do anything remotely similar to that.

Tbc it doesn't need to be literally true. The argument needed for safety is something like "a large team of copies of non-expert agents could together be as capable as an expert". I see the argument "it's probably possible for a team of agents to mimic one agent thinking for a long time" as mostly an intuition pump for why that might be true.

"As capable as an expert" makes more sense. Part of what's confusing about "equivalent to a human thinking for a long time" is that it's picking out one very particular way of achieving high capability, but really it's trying to point to a more-general notion of "HCH can solve lots of problems well". Makes it sound like there's some structural equivalence to a human thinking for a long time, which there isn't.

Makes it sound like there's some structural equivalence to a human thinking for a long time, which there isn't.

Yes, I explicitly agree with this, which is why the first thing in my previous response was

sorry, that's right, I was speaking pretty loosely.

The intuition for it is something like this: suppose I'm trying to make a difficult decision, like where to buy a house. There are hundreds of cities I'd be open to, each one has dozens of neighborhoods, and each neighborhood has dozens of important features, like safety, fun things to do, walkability, price per square foot, etc. If I had a long time, I would check out each neighborhood in each city in turn and examine how it does on each dimension, and pick the best neighborhood.

If I instead had an army of clones of myself, I could send many of them to each possible neighborhood, with each clone examining one dimension in one neighborhood. The mes that were all checking out different aspects of neighborhood X can send up an aggregated judgment to a me that is in charge of "holistic judgment of neighborhood X", and the mes that focus on holistic judgments of neighborhoods can do a big pairwise bracket to filter up a decision to the top me.

I see, so it's basically assuming that problems factor.

Yeah, in the context of a larger alignment scheme, it's assuming that in particular the problem of answering the question "How good is the AI's proposed action?" will factor down into sub-questions of manageable size.
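The house-hunting intuition above can be sketched in code, assuming the problem really does factor. This is a toy illustration only: the `scores` table, the plain average used for the "holistic judgment", and the bracket's tie-breaking rule are all assumptions introduced here.

```python
from typing import Callable, Dict, List

# Hypothetical data: scores[neighborhood][dimension] is what a single clone
# reports after examining one dimension of one neighborhood.
Scores = Dict[str, Dict[str, float]]


def clone_examines(scores: Scores, neighborhood: str, dimension: str) -> float:
    """One clone, one small task: a single dimension of a single neighborhood."""
    return scores[neighborhood][dimension]


def holistic_judgment(scores: Scores, neighborhood: str) -> float:
    """The clone in charge of a neighborhood aggregates its sub-clones' reports."""
    reports = [clone_examines(scores, neighborhood, d) for d in scores[neighborhood]]
    return sum(reports) / len(reports)  # assumption: a plain average


def pairwise_bracket(candidates: List[str], judge: Callable[[str], float]) -> str:
    """Single-elimination bracket: the winner of each pair advances."""
    current = list(candidates)
    while len(current) > 1:
        advancing = []
        for i in range(0, len(current) - 1, 2):
            a, b = current[i], current[i + 1]
            advancing.append(a if judge(a) >= judge(b) else b)
        if len(current) % 2 == 1:
            advancing.append(current[-1])  # odd one out gets a bye
        current = advancing
    return current[0]


def choose_neighborhood(scores: Scores) -> str:
    """The 'top me' only ever sees the result of the bracket."""
    return pairwise_bracket(list(scores), lambda n: holistic_judgment(scores, n))
```

No single function here does more than a small, local piece of work. Whether real questions like "How good is the AI's proposed action?" decompose this cleanly is exactly the open assumption.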

Well, Paul's original post presents HCH as the specification of a human's enlightened judgment.

For now, I think that HCH is our best way to precisely specify “a human’s enlightened judgment.” It’s got plenty of problems, but for now I don’t know anything better.

And if we follow the links to Paul's previous post about this concept, he does describe his ideal implementation of considered judgement (what will become HCH) using the intuition of thinking for a decent amount of time.

To define my considered judgment about a question Q, suppose I am told Q and spend a few days trying to answer it. But in addition to all of the normal tools—reasoning, programming, experimentation, conversation—I also have access to a special oracle. I can give this oracle any question Q’, and the oracle will immediately reply with my considered judgment about Q’. And what is my considered judgment about Q’? Well, it’s whatever I would have output if we had performed exactly the same process, starting with Q’ instead of Q.

So it looks to me like "HCH captures the judgment of the human after thinking for a long time" is definitely a claim made in the post defining the concept. Whether it actually holds is another (quite interesting) question, and one I don't know the answer to.

A line of thought about this that I explore in Epistemology of HCH is the comparison between HCH and CEV: the former is more operationally concrete (what I call an intermediary alignment scheme), but the latter can directly state the properties it has (like giving the same decision that the human would give after thinking for a long time), whereas we need to argue for them in HCH.

I had formed an impression that the hope was that the big chain of short thinkers would in fact do a good enough job factoring their goals that it would end up comparable to one human thinking for a long time (and that Ought was founded to test that hypothesis)

That's what I have in mind. If all goes well you can think of it like "a human thinking a long time." We don't know if all will go well.

It's also not really clear what "a human thinking 10,000 years" means. HCH is kind of an operationalization of that, but there's a presumption of alignment in the human-thinking-a-long-time that we don't get for free here. (Of course you also wouldn't get it for free if you somehow let a human live for 10,000 years...)

My understanding is that HCH is a proposed quasi-algorithm for replicating the effects of a human thinking for a long time.

HCH is more like an infinite bureaucracy. You have some underlings who you can ask to think for a short time, and those underlings have underlings of their own who they can ask to think for a short time, and so on. Nobody in HCH thinks for a long time, though the total thinking time of one person and their recursive-underlings may be long.

(This is exactly why factored cognition is so important for HCH & co: the thinking all has to be broken into bite-size pieces, which can be spread across people.)
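As a toy illustration of the "bureaucracy" picture, here is a sketch in which each person may only do a bite-size amount of work and must delegate the rest to two underlings. The task (summing a list), the `bite_size` limit, and the `Tally` bookkeeping are all assumptions made for the example; real HCH has no such fixed task, and the tree is finite here only because the input is.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tally:
    total_steps: int = 0         # thinking summed over the whole tree
    max_steps_per_node: int = 0  # the most any single person thought


def hch_sum(numbers: List[int], tally: Tally, bite_size: int = 2) -> int:
    """Each 'person' handles at most `bite_size` items directly (bite_size >= 1);
    otherwise they split the question and ask two underlings."""
    if len(numbers) <= bite_size:
        steps = len(numbers)  # small enough to answer directly
        answer = sum(numbers)
    else:
        mid = len(numbers) // 2
        left = hch_sum(numbers[:mid], tally, bite_size)
        right = hch_sum(numbers[mid:], tally, bite_size)
        steps = 2             # only combines two sub-answers
        answer = left + right
    tally.total_steps += steps
    tally.max_steps_per_node = max(tally.max_steps_per_node, steps)
    return answer
```

Run on a 16-item list, the tally shows the distinction drawn above: the total work across the tree grows with the problem, while no individual ever does more than `bite_size` steps.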

Yes sorry — I'm aware that in the HCH procedure no one human thinks for a long time. I'm generally used to mentally abstracting HCH (or whatever scheme fits that slot) as something that could "effectively replicate the benefits you could get from having a human thinking a long time," in terms of the role that it plays in an overall scheme for alignment. This isn't guaranteed to work out, of course. My position is similar to Rohin's above:

I just personally find it easier to think about "benefits of a human thinking for a long time" and then "does HCH get the same benefits as humans thinking for a long time" and then "does iterated amplification get the same benefits as HCH".

Hm, interesting, I'm actually worried about a totally different implication of "you get what you can measure."

E.g.:

"If MTurkers are on average anti-abortion and your experts are on average pro-choice, what the hell will your MTurkers think about training an algorithm that tries to learn from anti-abortion folks and output pro-choice responses? Suppose you then run that same algorithm on the experts and it gives outputs in favor of legalizing infanticide - are the humans allowed to say "hold on, I don't want that," or are we just going to accept that as what peak performance looks like? So anyhow I'm pessimistic about sandwiching for moral questions."

I'm curious if the upvote disparity means I'm the minority position here :P

I think one argument running through a lot of the sequences is that the parts of "human values" which mostly determine whether AI is great or a disaster are not the sort of things humans usually think of as "moral questions". Like, these examples from your comment below:

Was it bad to pull the plug on Terry Schiavo? How much of your income should you give to charity? Is it okay to kiss your cousin twice removed? Is it a good future if all the humans are destructively copied to computers? Should we run human challenge trials for covid-19 vaccines?

If an AGI is hung up on these sorts of questions, then we've already mostly-won. That's already an AI which is unlikely to wipe out the human species as a side-effect of maximizing the number of paperclips in the universe. It's already an AI which is unlikely to induce a heart attack in its user in hopes that the user falls onto the positive feedback button. It's already an AI which is unlikely to flood a room in order to fill a cauldron with water.

The vast majority of human values are not things we typically think of as "moral questions"; they're things which are so obvious that we usually don't even think of them until they're pointed out. But they're still value judgements, and we can't expect an AGI to share those value judgements by default. If we're down to the sorts of things people usually think of as moral questions, then the vast majority of human values have already been solved.

Given that this is LW, and this was a major takeaway of the sequences (or at least it was for me), I'd guess that's probably a fairly common background assumption.

I'd say "If an AGI is hung up on these sorts of questions [i.e. the examples I gave of statements human 'moral experts' are going to disagree about], then we've already mostly-won" is an accurate correlation, but doesn't stand up to optimization pressure. We can't mostly-win just by fine-tuning a language model to do moral discourse. I'd guess you agree?

Anyhow, my point was more: You said "you get what you can measure" is a problem because the fact of the matter for whether decisions are good or bad is hard to evaluate (therefore sandwiching is an interesting problem to practice on). I said "you get what you measure" is a problem because humans can disagree when their values are 'measured' without either of them being mistaken or defective (therefore sandwiching is a procrustean bed / wrong problem).

We can't mostly-win just by fine-tuning a language model to do moral discourse.

Uh... yeah, I agree with that statement, but I don't really see how it's relevant. If we tune a language model to do moral discourse, then won't it be tuned to talk about things like Terry Schiavo, which we just said was not that central? Presumably tuning a language model to talk about those sorts of questions would not make it any good at moral problems like "they said they want fusion power, but they probably also want it to not be turn-into-bomb-able".

Or are you using "moral discourse" in a broader sense?

You said "you get what you can measure" is a problem because the fact of the matter for whether decisions are good or bad is hard to evaluate (therefore sandwiching is an interesting problem to practice on). I said "you get what you measure" is a problem because humans can disagree when their values are 'measured' without either of them being mistaken or defective (therefore sandwiching is a procrustean bed / wrong problem).

I disagree with the exact phrasing "fact of the matter for whether decisions are good or bad"; I'm not supposing there is any "fact of the matter". It's hard enough to figure out, just for one person (e.g. myself), whether a given decision is something I do or do not want.

Other than that, this is a good summary, and I generally agree with the-thing-you-describe-me-as-saying and disagree with the-thing-you-describe-yourself-as-saying. I do not think that values-disagreements between humans are a particularly important problem for safe AI; just picking one human at random and aligning the AI to what that person wants would probably result in a reasonably good outcome. At the very least, it would avert essentially-all of the X-risk.

I'd say "If an AGI is hung up on these sorts of questions [i.e. the examples I gave of statements human 'moral experts' are going to disagree about], then we've already mostly-won" is an accurate correlation, but doesn't stand up to optimization pressure. We can't mostly-win just by fine-tuning a language model to do moral discourse. I'd guess you agree?

English sentences don't have to hold up to optimization pressure, our AI designs do. If I say "I'm hungry for pizza after I work out", you could say "that doesn't hold up to optimization pressure - I can imagine universes where you're not hungry for pizza", it's like... okay, but that misses the point? There's an implicit notion here of "if you told me that we had built AGI and it got hung up on exotic moral questions, I would expect that we had mostly won."

Perhaps this notion isn't obvious to all readers, and maybe it is worth spelling out, but as a writer I do find myself somewhat exhausted by the need to include this kind of disclaimer.

Furthermore, what would be optimized in this situation? Is there a dissatisfaction genie that optimizes outcomes against realizations technically permitted by our English sentences? I think it would be more accurate to say "this seems true in the main, although I can imagine situations where it's not." Maybe this is what you meant, in which case I agree.

This was a very solid post and I've curated it. Here are some of the reasons:

• I think that the post is a far more careful analysis of questions around what research to do, what research is scalable, and what the potential negative effects are, than almost any other proposal I've seen, whilst also containing clear ideas and practical recommendations. (Many posts that optimize for this level of carefulness end up not saying much at all, or at least little of any practical utility, yet this post says quite a lot of interesting things that are practically useful.) There is a lot of valuable advice, not merely on how to make narrowly superhuman models useful, but on how to do so in a way that is helpful for alignment. The section "What kind of projects do and don't "count"" is really helpful here.
• I appreciate the efforts that Ajeya has made to understand and build consensus around these ideas, talking to people at various orgs (OpenAI, MIRI, more), and this again makes me feel more confident signal-boosting it, given that it contains information about many others' perspectives on the topic. And more broadly, the whole "Objections and responses" section felt like it did a great job at perspective-taking on others' concerns and addressing them head on.
• I like a lot of the discussion around this post, both in the comments section here and also in the comments by Eliezer and Evan in Robby's post. (I recommend everyone who reads the OP also reads the discussion in the linked post.)

The main hesitation I have around curating this is that I'm kind of scared of all recommendations for "interesting ideas for using machine learning that might be really important for AGI". This part of me feels like everything has the potential to be used for capabilities, and that someone pursuing this line of work may do so quite inadvertently (and it will not be possible to "put the ball back in the urn"), or just end up proving their competence at building scalably useful models and then getting hired by a research lab to do work that will give them lots of money to do stuff with big models.

I am scared about most everything in this space, yet I don't think I endorse "no action", and this does seem to me like one of the most promising and careful posts and approaches. For me the most cruxy sections were "Isn't this not neglected because lots of people want useful AI" and "Will this cause harm by increasing investment in scaling AI?". For the first, I think if I believed the claims Ajeya makes as strongly as I think Ajeya does, I'd feel notably more relieved overall about encouraging this kind of work. For the second I didn't feel persuaded by the arguments. I think that there are few people who are respected remotely on these sorts of questions or who are thinking strategically about them (especially in public). I think the x-risk and alignment communities have in the past had 100:1 outsized impact with actions taken and research directions pursued.

In sum, I continue to be generally terrified, but this was an excellent post and one of the relatively least terrifying things. Thank you very much for the post.

This isn't an objection to the research direction, just a response to how you're framing it:

If you think GPT-3 is "narrowly superhuman" at medical advice, what topic don't you think it's narrowly superhuman in? It seems like you could similarly argue that GPT-3 knows more than the average human about mechanics, chemistry, politics, and just about anything that language is good at describing. (EG, not walking, riding a bike, the concrete skills needed for painting, etc.)

A tool capable of getting GPT-3 to give good medical advice would, probably, be a tool to get GPT-3 to give good advice.

(I am not denying that "give good medical advice" is a better initial goal/framing.)

This seems to imply that GPT-3 is broadly superhuman, IE, GPT-3 knows more than the average human about a very broad range of things (although GPT-3 might not know more than the best human in any domain). Going further: the implication is that GPT is a kind of mild superintelligence, currently misaligned in a benign way (it just wants to mimic humans) which hides an unknown portion of its intelligence (making it seem subhuman).

I'm not saying this is exactly true. Maybe GPT-3 really is only narrowly superhuman, in the sense that it basically only knows what it needs to know to mimic humans to this level, and essentially doesn't know anything about medicine etc. In this world, its apparent knowledge of medicine is so mixed with all its other ideas that you can't extract the truth: it's not operating on a "true medical stuff + mistakes" model, it just has models of a bunch of possible statements with no way to differentiate good advice from nonsense. In that case, you can only train GPT-3 to give good medical advice by providing an external truth filter of some kind; your project would be basically doomed.

(I think the truth is some unknown point between those two extremes, and I'm quite curious to know exactly where.)

You consider whether AlphaGo could serve a similar role as a test case of aligning narrowly superhuman models, and you reject this idea. I think AlphaGo really is a narrowly superhuman model, and I think your rejection of it is related to this. Because it really is narrowly superhuman, it doesn't seem like it has this kind of hidden knowledge you want to bring out -- it only knows about Go.

So it seems like "narrowly superhuman" might be the wrong framing.

This seems like it's using the wrong ontology to me.

Like, in my mind, there are things like medical diagnostics or predictions of pharmaceutical reactions, which are much easier cognitive tasks than general conversation, but which humans are specialized away from.

For example, imagine the severity of side effects from a specific medication can be computed by figuring out 15 variables about the person and putting them into a neural network with 5000 parameters, and the output is somewhere in a six-dimensional space, and this model is part of a general model of human reactions to chemicals.

Then GPT-3 would be in a great position to use people's reddit posts talking about medication side effects to find this network. I doubt that medical science in our current world could figure that out meaningfully. It would be strongly superhuman in this important medical task, but nowhere near superhuman in any other conversational task.
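For a sense of scale, a network of roughly that size is tiny. The sketch below counts the parameters of a fully connected network with 15 inputs and 6 outputs; the two hidden layers of width 60 are an assumption chosen here only because they land near 5000, not something the hypothetical specifies.

```python
from typing import Sequence


def mlp_parameter_count(layer_sizes: Sequence[int]) -> int:
    """Weights plus biases for a fully connected network with these layer widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# 15 input variables -> two hidden layers of 60 units (assumed) -> 6 outputs
print(mlp_parameter_count([15, 60, 60, 6]))  # 4986
```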

My intuition is that most professional occupations are dominated by problems like this, that are complex enough that we as humans can only capture them as intuitions, but simple enough that the "right" computational solution would be profoundly superhuman in that narrow domain, without being broadly superhuman in any autonomous sense.

Maybe a different reading of your comment is something like, there are so many of these things that if a human had access to superhuman abilities across all these individual narrow domains, that human could use it to create a decisive strategic advantage for themself, which does seem possibly very concerning.

Let's see if I can properly state the nature of the disagreement.

I stated that there's a spectrum between "GPT knows more than the average human across a broad variety of domains, but only uses this knowledge to imitate humans, so it's not obvious" and "GPT really knows very little, and its apparent stupidity is stupidity-in-fact".

I somewhat operationalized the difference as one of internal representation: to what extent is GPT using a truth+noise model (where it knows a lot of stuff about reality, and then filters it through the biases of particular perspectives) vs a model where everything is thrown together and it's not very possible to extract truth without having more information yourself to know what is truth vs noise.

This model has an implication, that Ajeya's project will work to the extent that we're toward the smart-GPT end of the spectrum and won't work to the extent that we're toward the other end.

I think you're disagreeing with this implication?

So you're saying: even if GPT doesn't internally use anything like a truth+noise model, it's possible to extract a great deal of useful information about the world by observing the statistics of GPT's imitation of internet users. For example, because people talk a lot about diseases online, it should be possible to extract statistics about this from GPT. This can produce a useful diagnostic model, even if GPT isn't internally representing something so useful.

Is this roughly what you are saying?

If that's what you're saying, then I agree that such a thing could be possible, but I am unsure if this should count as success in Ajeya's terms.

If GPT knows a lot of stuff but isn't telling us because it's not trying to be helpful, that's misalignment. Getting it to try to communicate those things to us would be a kind of alignment work.

If the statistics of GPT's text model can be used to infer useful things about the world, this doesn't seem related to alignment.

But maybe I'm totally mis-identifying the disagreement you were trying to point at.

My intuition is that most professional occupations are dominated by problems like this, that are complex enough that we as humans can only capture them as intuitions, but simple enough that the "right" computational solution would be profoundly superhuman in that narrow domain, without being broadly superhuman in any autonomous sense.

Your phrase "in any autonomous sense" makes me think that perhaps you think GPT does have an internal model like the medical model you describe, plus similar models in many different domains, but lacks an "autonomy" property which would be required to make it broadly superhuman in a significant sense. Under this hypothesis, your disagreement with me is that you think I think GPT has "autonomy".

I guess my response to that would be that GPT probably does lack some kind of "autonomy" (if it means independently pursuing goals by planning, anticipating the consequences of its words) but does have significant planning capacity if asked (eg could construct coherent plans involving its medical knowledge, and in doing so, fluidly match up its narrow medical knowledge with its narrow knowledge in a variety of different areas).

I think this is obscuring (my perception of) the disagreement a little bit.

I think what I'm saying is, GPT-3 probably doesn't have any general truth+noise models. But I would expect it to copy a truth+noise model from people, when the underlying model is simple.

I then expect GPT-3 to "secretly" have something like an interesting diagnostic model, and probably a few other narrowly superhuman skills.

But I would expect it to not have any kind of significant planning capacity, because that planning capacity is not simple.

In particular my expectation is that coherently putting knowledge from different domains together in generally useful ways is MUCH, MUCH harder than being highly superhuman in narrow domains. Therefore I expect Ajeya's approach to be both effective, because "narrowly superhuman" can exist, and reasonably safe, because the gap between "narrowly superhuman" or even "narrowly superhuman in many ways" and "broadly superhuman" is large so GPT-3 being broadly superhuman is unlikely.

Phrased differently, I am rejecting your idea of smartness-spectrum. My intuition is that levels of GPT-N competence will scale the way computers have always scaled at AI tasks--becoming usefully superhuman at a few very quickly, while taking much much longer to exhibit the kinds of intelligence that are worrying, like modeling human behavior for manipulation.

Thanks for trying further to bridge the gap!

(It would be nice if you flagged a little better which things you think I think / which things you think I disagree with)

I think what I'm saying is, GPT-3 probably doesn't have any general truth+noise models. But I would expect it to copy a truth+noise model from people, when the underlying model is simple.

OK, that makes sense. So you're not saying that GPT contains useful diagnostic models in the overall statistics of its models of Reddit users (EG that someone complaining of one symptom will often complain of another), nor are you saying that GPT contains a good model of disease which it then feeds through noise (EG it decides that a particular user is a diabetic, which shapes how it plays that character going forward, but the character itself doesn't know it is diabetic, so may say some confused things); indeed, you are denying the latter. But what you are saying is that GPT plays the role of users who do have their own internal models, so it must mimic those models (in cases where that's not too hard to learn).

I find this hard to square with your earlier statement:

For example, imagine the severity of side effects from a specific medication can be computed by figuring out 15 variables about the person and putting them into a neural network with 5000 parameters, and the output is somewhere in a six-dimensional space, and this model is part of a general model of human reactions to chemicals.

Then GPT-3 would be in a great position to use people's reddit posts talking about medication side effects to find this network. I doubt that medical science in our current world could figure that out meaningfully.

Where it sounds like you think GPT will know something medical science does not know.

As for me, I find all of these to be broadly possible. I'd have to think more to give a meaningful plausibility ranking.

I then expect GPT-3 to "secretly" have something like an interesting diagnostic model, and probably a few other narrowly superhuman skills.

How many? I am thinking of "medical diagnostics" as just one example of many many areas of expertise which border on GPT's competence. I wasn't thinking there was any special reason to single out medicine in particular as something GPT might have implicit knowledge about.

On my model, if GPT contains implicit medical competence, it probably contains similar competence in "every area", although I'm not sure how to quantify. Maybe a similar hidden competence in at least 50% of professions at least as numerous as, say, physicist? (Really, what matters is how much discussion of a profession there is online, not how numerous that profession is, but maybe it's an OK proxy.)

My crux would be something special about medical diagnosis such that we especially expect GPT to have implicit talent there.

But I would expect it to not have any kind of significant planning capacity, because that planning capacity is not simple.

It seems like you think planning capacity might be some important difference in our positions?

Personally, I think it's plausible that GPT does something to plan ahead: it seems broadly useful to think about what could come later in the text (eg where a sentence is going), and potentially, it's useful to think about that in some detail (to notice when options which seem consistent at the high level are actually not consistent when you try to put all the pieces together (where by "consistent" I mean plausible in terms of everything GPT knows about text)).

But I don't see this as fundamental to the view I'm expressing in any way.

In particular my expectation is that coherently putting knowledge from different domains together in generally useful ways is MUCH, MUCH harder than being highly superhuman in narrow domains. Therefore I expect Ajeya's approach to be both effective, because "narrowly superhuman" can exist, and reasonably safe, because the gap between "narrowly superhuman" or even "narrowly superhuman in many ways" and "broadly superhuman" is large so GPT-3 being broadly superhuman is unlikely.

Phrased differently, I am rejecting your idea of smartness-spectrum. My intuition is that levels of GPT-N competence will scale the way computers have always scaled at AI tasks--becoming usefully superhuman at a few very quickly, while taking much much longer to exhibit the kinds of intelligence that are worrying, like modeling human behavior for manipulation.

I think I agree with a heuristic that says something like "GPT isn't magic, GPT-n will scale the way things usually scale, the highest-probability projection for the near future is the smooth extrapolation from the past". Not as something I'm confident about, but as the default.

But I do have a big disagreement with what you wrote above.

First I'm going to try to make a very general argument in favor of my spectrum. Then I'm going to give a very concrete scenario, which I think argues for "putting knowledge together" competence.

General Argument

Let's forget the difference between the truth+noise model and various other models, and just deal with whether GPT has "implicit knowledge". What exactly "implicit knowledge" means will depend on the extraction technology we invent; I define "implicit knowledge" functionally as any expertise which can be brought out (by something broadly in line with Ajeya's research program).

My broad argument is just that absent any specific reason to expect implicit knowledge about medicine in particular, conditional on such knowledge, we should expect similar implicit knowledge across a broad variety of domains.

My smartness-spectrum is just the latent variable of how much implicit knowledge GPT has. The argument for the existence of such a spectrum is just the argument that if we see it in one domain, we would expect it in others. If we don't see it in one, we less expect to see it in others.

Specific Scenario

Suppose the general alignment technology we develop resembles "learning to summarize from human feedback", the example Ajeya cited of work that looks like what Ajeya wants to point toward.

More specifically, suppose it works like this:

1. We collect a lot of data of humans judging GPT-3 as being smart and helpful, vs dumb or not helpful.
2. We train a model (using features from GPT-3 to give the network a good start) to replicate those human judgments. Call this JUDGE.
3. We fine-tune GPT-3 using JUDGE as our training signal; ie, fine-tune it to be as smart and helpful as possible. Let's call this GPT-nice.
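
The three steps above can be sketched with a toy stand-in. Everything in it is an assumption made for illustration: the two made-up features, the names `train_judge`, `judge_score`, and `pick_best`, and above all the use of best-of-n selection in place of actually fine-tuning GPT-3. It only shows the shape of "humans label, JUDGE imitates the labels, JUDGE steers the output", not the real procedure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_judge(examples, epochs=200, lr=0.5):
    """Step 2 (toy): fit a logistic-regression JUDGE to human labels.

    examples: list of (features, label); label 1 = judged helpful, 0 = not.
    """
    dim = len(examples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def judge_score(judge, x):
    w, b = judge
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def pick_best(judge, candidates):
    """Step 3 stand-in (NOT fine-tuning): keep the output JUDGE rates highest."""
    return max(candidates, key=lambda c: judge_score(judge, c["features"]))

# Step 1 (toy): hypothetical human judgments over two invented features,
# (answers_the_question, sounds_like_a_forum_post).
human_labels = [
    ([1.0, 1.0], 1), ([1.0, 0.0], 1), ([0.0, 1.0], 0), ([0.0, 0.0], 0),
]
judge = train_judge(human_labels)

candidates = [
    {"text": "riffs on the prompt", "features": [0.0, 1.0]},
    {"text": "answers directly", "features": [1.0, 0.0]},
]
best = pick_best(judge, candidates)
```

In the real setting JUDGE would be initialized from GPT-3's own features and used as a training signal rather than a filter; the toy keeps only the division of labor between the three steps.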

This procedure may not extract all implicit knowledge, or extract it well, etc etc. However, I fully expect that this procedure will extract some. I just see no reason to think this procedure wouldn't work. Simply put, I expect GPT-nice to be a legitimately more helpful and intelligent fine-tuning of GPT-3.

(Whether this procedure is safe for, say, GPT-7 is another question.)

Let's say for the sake of argument that this procedure brings out the kind of medical competence we've been discussing, plus similar competence in at least a few other domains.

I generally expect that GPT-nice will have decent "putting knowledge together" skills, mainly because GPT-3 is already not too bad at this. Yes, sometimes GPT misses common-sense implications. However, by and large, if you put facts from different domains into the text history, GPT will come up with continuations which make sense. So I would postulate that GPT-nice will be at least as good as GPT-3 with the relevant facts placed in history.

Suppose for the sake of argument that GPT-nice is good at medical diagnosis and separately good at giving dietary advice. Further suppose for the sake of argument that GPT-3 is OK at telling you what dietary changes are implied by medical conditions. Then I would suppose GPT-3 is at least OK at giving dietary advice tailored to your medical conditions.

Or suppose GPT-nice is good at diagnosing psychological disorders, and good at giving social advice. Then suppose GPT-3 is already halfway decent at anticipating social problems associated with psychological disorders, when prompted correctly. Then I would suppose that GPT-nice would be halfway decent at tailoring its social advice to any psychological problems a person has.

I'm replying on my phone right now because I can't stop thinking about it. I will try to remember to follow up when I can type more easily.

I think the vague shape of what I think I disagree about is how dense GPT-3's sets of implicit knowledge are.

I do think we agree that GPT-5000 will be broadly superhuman, even if it just has a grab bag of models in this way, for approximately the reasons you give.

I'm thinking about "intelligent behavior" as something like the set of real numbers, and "human behavior" as covering something like rational numbers, so we can get very close to most real numbers but it takes some effort to fill in the decimal expansion. Then I'm thinking of GPT-N as being something like integers+1/N. As N increases, this becomes close enough to the rational numbers to approximate real numbers, and can be very good at approximating some real numbers, but can't give you incomputable numbers (unaligned outcomes) and usually won't give you duplicitous behavior (numbers that look very simple at first approximation but actually aren't, like .2500000000000004, which seems to be 1/4 but secretly isn't). I'm not sure where that intuition comes from but I do think I endorse it with moderate confidence.

Basically I think for minimal circuit reasons that if "useful narrowly" emerges in GPT-N, then "useful in that same domain but capable of intentionally doing a treacherous turn" emerges later. My intuition is that this won't be until GPT-(N+3) or more, so if you are able to get past unintentional turns like "the next commenter gives bad advice" traps, this alignment work is very safe, and important to do as fast as possible (because attempting it later is dangerous!)

In a world where GPT-(N+1) can do a treacherous turn, this is very dangerous, because you might accidentally forget to check if GPT-(N-1) can do it, and get the treacherous turn.

My guess is that you would agree that "minimal circuit that gives good advice" is smaller than "circuit that gives good advice but will later betray you", and therefore there exist two model sizes where one is dangerous and one is safe but useful. I know I saw posts on this a while back, so there may be relevant math about what that gap might be, or it might be unproven but with some heuristics of what the best result probably is.

My intuition is that combining narrow models is multiplicative, so that adding a social manipulation model will always add an order of magnitude of complexity. My guess is that you don't share this intuition. You may think of model combination as additive, in which case any model bigger than a model that can betray you is very dangerous, or you might think the minimal circuit for betrayal is not very large, or you might think that GPT-2-nice would be able to give good advice in many ways so GPT-3 is already big enough to contain good advice plus betrayal in many ways.

In particular if combining models is multiplicative in complexity, a model could easily learn two different skills at the same time, while being many orders of magnitude away from being able to use those skills together.

My guess is that you would agree that "minimal circuit that gives good advice" is smaller than "circuit that gives good advice but will later betray you", and therefore there exist two model sizes where one is dangerous and one is safe but useful. I know I saw posts on this a while back, so there may be relevant math about what that gap might be, or it might be unproven but with some heuristics of what the best result probably is.

There was indeed a post posing this question a while back, and discussion in the comments included a counterexample: a construction of a minimal circuit that would be malign.

To my eye, the whole crux of the inner alignment problem is that we have no results saying things like:

• The simplest program which solves a problem is not an inner optimizer
• The minimal circuit which solves a problem is not an inner optimizer
• The fastest program solving a problem is not an inner optimizer

Or any such thing. If we had such a result, then we'd have a grip on the problem. But we don't currently have any result like that, nor any plausible direction for proving such a result. And indeed, thought on the problem suggests that these hypotheses are probably not true; rather, it seems surprisingly plausible, once you think about it, that indeed minimal solutions may sometimes be inner optimizers.

My intuition is that combining narrow models is multiplicative, so that adding a social manipulation model will always add an order of magnitude of complexity. My guess is that you don't share this intuition. You may think of model combination as additive, in which case any model bigger than a model that can betray you is very dangerous, or you might think the minimal circuit for betrayal is not very large, or you might think that GPT-2-nice would be able to give good advice in many ways so GPT-3 is already big enough to contain good advice plus betrayal in many ways.

My thinking is that it's probably somewhere between the two. Multiplicative complexity suggests memorizing a lookup table. But there is regularity in the universe. There is transfer learning.

In particular if combining models is multiplicative in complexity, a model could easily learn two different skills at the same time, while being many orders of magnitude away from being able to use those skills together.

Right. I think transfer learning speaks pretty strongly against this multiplicative model.

Looks like the initial question was here and a result around it was posted here. At a glance I don't see the comments with counterexamples, and I do see a post with a formal result, which seems like a direct contradiction to what you're saying, though I'll look in more detail.

Coming back to the scaling question, I think I agree that multiplicative scaling over the whole model size is obviously wrong. To be more precise, if there's something like a Q-learning inner optimizer for two tasks, then you need the cross product of the state spaces, so the size of the Q-space could scale close-to-multiplicatively. But the model that condenses the full state space into the Q-space scales additively, and in general I'd expect the model part to be much bigger--like the Q-space has 100 dimensions and the model has 1 billion parameters, so adding a second model of 1 billion parameters and increasing the Q-space to 10k dimensions is mostly additive in practice, even if it's also multiplicative in a technical sense.
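
To make the arithmetic concrete, here is a small sketch using the numbers from the paragraph above. The function name `combined_size` and the choice to simply sum "model parameters" and "Q-space dimensions" into one size figure are assumptions for illustration, not a claim about how such a system would really be measured.

```python
def combined_size(model_params, q_dims, n_tasks):
    """Per-task models add; the Q-space is a cross product, so it multiplies."""
    return n_tasks * model_params + q_dims ** n_tasks

one_task = combined_size(1_000_000_000, 100, 1)   # 1e9 params + 100 dims
two_tasks = combined_size(1_000_000_000, 100, 2)  # 2e9 params + 10,000 dims

# Ratio of total size going from one task to two.
growth = two_tasks / one_task
```

The Q-space grows 100x, but because it is tiny next to the model part, the total only roughly doubles, which is the "mostly additive in practice" point.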

I'm going to update my probability that "GPT-3 can solve X, Y implies GPT-3 can solve X+Y," and take a closer look at the comments on the linked posts. This also makes me think that it might make sense to try to find simpler problems, even already-mostly-solved problems like Chess or algebra, and try to use this process to solve them with GPT-2, to build up the architecture and search for possible safety issues in the process.

I do see a post with a formal result, which seems like a direct contradiction to what you're saying, though I'll look in more detail.

If you mean to suggest this post has a positive result, then I think you're just mis-reading it; the key result is

The conclusion of this post is the following: if there exists some set of natural tasks for which the fastest way to solve them is to do some sort of machine learning to find a good policy, and there is some task for which that machine learning results in deceptive behavior, then there exists a natural task such that the minimal circuit that solves that task also produces deceptive behavior.

which says that under some assumptions, there exists a task for which the minimal circuit will engage in deceptive behavior (IE is a malign inner optimizer).

The comment with a counterexample on the original post is here.

Yeah, you're definitely pointing at an important way the framing is awkward. I think the real thing I want to say is "Try to use some humans to align a model in a domain where the model is better than the humans at the task", and it'd be nice to have a catchy term for that. Probably a model which is better than some humans (e.g. MTurkers) at one task (e.g. medical advice) will also be better than those same humans at many other tasks (e.g. writing horror stories); but at the same time for each task, there's some set of humans (e.g. doctors in the first case and horror authors in the second) where the model does worse.

I don't want to just call it "align superhuman AI today" because people will be like "What? We don't have that", but at the same time I don't want to drop "superhuman" from the name because that's the main reason it feels like "practicing what we eventually want to do." I considered "partially superhuman", but "narrowly" won out.

I'm definitely in the market for a better term here.

I don’t want to drop “superhuman” from the name because that’s the main reason it feels like “practicing what we eventually want to do.”

One response I generated was, "maybe it's just not so much about practicing what we eventually want to do, and that part is an illusion of the poor framing. We should figure out the right framing first and then ask whether it seems like practice, not optimize the framing to make it sound like practice."

But I think my real response is: why is the superhuman part important, here? Maybe what's really important is being able to get answers (eg medical advice) without putting them in (eg without fine-tuning on medical advice filtered for high quality), and asking for superhuman ability is just a way of helping ensure that? Or perhaps more generally, there are other things like this which you expect people to do wrong if they're not dealing with a superhuman case, because you want the technology to eventually work for superhuman cases.

In my head the point of this proposal is very much about practicing what we eventually want to do, and seeing what comes out of that; I wasn't trying here to make something different sound like it's about practice. I don't think that a framing which moved away from that would better get at the point I was making, though I totally think there could be other lines of empirical research under other framings that I'd be similarly excited about or maybe more excited about.

In my mind, the "better than evaluators" part is kind of self-evidently intriguing for the basic reason I said in the post (it's not obvious how to do it, and it's analogous to the broad, outside view conception of the long-run challenge which can be described in one sentence/phrase and isn't strongly tied to a particular theoretical framing):

I’m excited about tackling this particular type of near-term challenge because it feels like a microcosm of the long-term AI alignment problem in a real, non-superficial sense. In the end, we probably want to find ways to meaningfully supervise (or justifiably trust) models that are more capable than ~all humans in ~all domains.[4] So it seems like a promising form of practice to figure out how to get particular humans to oversee models that are more capable than them in specific ways, if this is done with an eye to developing scalable and domain-general techniques.

A lot of people in response to the draft were pushing in the direction that I think you were maybe gesturing at (?) -- to make this more specific to "knowing everything the model knows" or "ascription universality"; the section "Why not focus on testing a long-term solution?" was written in response to Evan Hubinger and others. I think I'm still not convinced that's the right way to go.

I might be on board if "narrowly superhuman" were simply defined differently.

“Try to use some humans to align a model in a domain where the model is better than the humans at the task”

Isn't it something more like "the model has information sufficient to do better"? EG, in the GPT example, you can't reliably get good medical advice from it right now, but you strongly suspect it's possible. That's a key feature of the whole idea, right?

Is your suggested research program better described as: find (highly capable) models with inaccessible information and get them to reveal that information? (Especially: get them to reveal the inaccessible information without using domain expertise to do so?)

I don't feel confident enough in the frame of "inaccessible information" to say that the whole agenda is about it. It feels like a fit for "advice", but not a fit for "writing stories" or "solving programming puzzles" (at least not an intuitive fit -- you could frame it as "the model has inaccessible information about [story-writing, programming]" but it feels more awkward to me). I do agree it's about "strongly suspecting it has the potential to do better than humans" rather than about "already being better than humans." Basically, it's about trying to find areas where lackluster performance seems to mostly be about "misalignment" rather than "capabilities" (recognizing those are both fuzzy terms).

Basically, it’s about trying to find areas where lackluster performance seems to mostly be about “misalignment” rather than “capabilities” (recognizing those are both fuzzy terms).

Right, ok, I like that framing better (it obviously fits, but I didn't generate it as a description before).

Planned summary for the Alignment Newsletter:

One argument against work on AI safety is that [it is hard to do good work without feedback loops](https://www.jefftk.com/p/why-global-poverty). So how could we get feedback loops? The most obvious approach is to actually try to align strong models right now, in order to get practice with aligning models in the future. This post fleshes out what such an approach might look like. Note that I will not be covering all of the points mentioned in the post; if you find yourself skeptical you may want to read the full post as your question might be answered there.

The author specifically suggests that we work on **aligning narrowly superhuman models** to make them more useful. _Aligning_ a model roughly means harnessing the full capabilities of the model and orienting these full capabilities towards helping humans. For example, GPT-3 presumably “knows” a lot about medicine and health. How can we get GPT-3 to apply this knowledge as best as possible to be maximally useful in answering user questions about health?

_Narrowly superhuman_ means that the model has more knowledge or “latent capability” than either its overseers or its users. In the example above, GPT-3 almost certainly has more medical knowledge than laypeople, so it is at least narrowly superhuman at “giving medical advice” relative to laypeople. (It might even be so relative to doctors, given how broad its knowledge is.)

<@Learning to Summarize with Human Feedback@> is a good example of what this could look like: that paper attempted to “bring out” GPT-3’s latent capability to write summaries, and outperformed the reference summaries written by humans. This sort of work will be needed for any new powerful model we train, and so it has a lot of potential for growing the field of people concerned about long-term risk.

Note that the focus here is on aligning _existing_ capabilities to make a model more useful, and so simply increasing capabilities doesn’t count. As a concrete example, just scaling up the model capacity or training data or compute would _not_ count as an example of “aligning narrowly superhuman models”, even though it might make the model more useful, since scaling increases raw capabilities without improving alignment. This makes it pretty different from what profit-maximizing companies would do by default: instead of baking in domain knowledge and simply scaling up models in order to solve the easiest profitable problems (as you would do if you wanted to maximize profit), work in this research area would look for general and scalable techniques, would not be allowed to scale up models, and would select interestingly difficult problems.

Why is this a fruitful area of research? The author points out four main benefits:
1. Most importantly, the more we align systems ahead of time, the more likely that researchers will be able to put thought and consideration into new issues like treacherous turns, rather than spending all their time putting out fires.
2. We can build practical know-how and infrastructure for alignment techniques like learning from human feedback.
3. As the world gets progressively faster and crazier, we’ll have better AI assistants helping us to navigate the world.
4. It improves our chances of discovering or verifying a long-term or “full” alignment solution.

Planned opinion:

I am very sympathetic to the argument that we should be getting experience with aligning powerful models right now, and would be excited to see more work along these lines. As the post mentions, I personally see this sort of work as a strong baseline, and while I currently think that the conceptual work I’m doing is more important, I wouldn’t be surprised if I worked on a project in this vein within the next two years.

I especially agree with the point that this is one of the most scalable forms of research, and am personally working on a [benchmark](https://docs.google.com/document/d/18MEmQ4aA1zdZHBKG5fLeISYoFAv1IfUozWZ5JYnq-bY/edit) meant to incentivize this sort of research for similar reasons.

Impression before reading LW post comments & MIRI comments: this strikes me as a valuable "fourth area" of core research that we could start growing now. I'm uncertain about the technical fruits of the research itself (I expect it to be somewhere between 'slightly positive' and 'moderate-high positive'), but it seems like we could indeed scale such research into its own healthy (& prestigious!) subfield in ML. This could diversify the alignment research portfolio in a way that scales sublinearly with long-termist research input: in the long run, we wouldn't need everyone involved to be 'core' alignment researchers.

I have a few notes of unease that I haven't yet sat down to figure out, so I may reply to this comment with more thoughts.

This is exactly what Ought is doing as we build Elicit into a research assistant using language models / GPT-3. We're studying researchers' workflows and identifying ways to productize or automate parts of them. In that process, we have to figure out how to turn GPT-3, a generalist by default, into a specialist that is a useful thought partner for domains like AI policy. We have to learn how to take feedback from the researcher and convert it into better results within session, per person, per research task, across the entire product. Another spin on it: we have to figure out how researchers can use GPT-3 to become expert-like in new domains.

We’re currently using GPT-3 for classification e.g. “take this spreadsheet and determine whether each entity in Column A is a non-profit, government entity, or company.” Some concrete examples of alignment-related work that have come up as we build this:

• One idea for making classification work is to have users generate explanations for their classifications. Then have GPT-3 generate explanations for the unlabeled objects. Then classify based on those explanations. This seems like a step towards “have models explain what they are doing.”
• I don’t think we’ll do this in the near future, but we could explore other ways to make GPT-3 internally consistent, for example:
    • Ask GPT-3 why it classified Harvard as a “center for innovation.”
    • Then ask GPT-3 if that reason is true for Microsoft.
    • Or just ask GPT-3 if Harvard is similar to Microsoft.
    • Then ask GPT-3 directly if Microsoft is a “center for innovation.”
    • And fine-tune results until we get to internal consistency.
• We eventually want to apply classification to the systematic review (SR) process, or some lightweight version of it. In the SR process, there is one step where two human reviewers identify which of 1,000-10,000 publications should be included in the SR by reviewing the title and abstract of each paper. After narrowing it down to ~50, two human reviewers read the whole paper and decide which should be included. Getting GPT-3 to skip these two human processes but be as good as two experts reading the whole paper seems like the kind of sandwiching task described in this proposal.
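
As a rough illustration of the "explain, then classify" idea in the first bullet, here is a minimal Python sketch. Everything in it is an assumption made for illustration: `toy_complete` is a keyword-matching stand-in for a real language-model call, and the prompt layout and the `EXAMPLES` data are invented rather than taken from Elicit.

```python
from typing import Callable, List, Tuple

# (entity, user-written explanation, label). Invented data for illustration.
Example = Tuple[str, str, str]

EXAMPLES: List[Example] = [
    ("Oxfam", "funded by donations", "non-profit"),
    ("Apple", "sells hardware", "company"),
]


def build_explanation_prompt(examples: List[Example], entity: str) -> str:
    """Few-shot prompt asking the model to explain the new entity."""
    blocks = [f"Entity: {name}\nExplanation: {expl}\n" for name, expl, _ in examples]
    blocks.append(f"Entity: {entity}\nExplanation:")
    return "\n".join(blocks)


def build_label_prompt(examples: List[Example], entity: str, explanation: str) -> str:
    """Few-shot prompt asking for a label given the generated explanation."""
    blocks = [
        f"Entity: {name}\nExplanation: {expl}\nLabel: {label}\n"
        for name, expl, label in examples
    ]
    blocks.append(f"Entity: {entity}\nExplanation: {explanation}\nLabel:")
    return "\n".join(blocks)


def classify_with_explanation(
    complete: Callable[[str], str], examples: List[Example], entity: str
) -> Tuple[str, str]:
    """Step 1: generate an explanation. Step 2: classify based on it."""
    explanation = complete(build_explanation_prompt(examples, entity)).strip()
    label = complete(build_label_prompt(examples, entity, explanation)).strip()
    return explanation, label


def toy_complete(prompt: str) -> str:
    """Hypothetical stub model: inspects only the final, unfinished record."""
    last = prompt.rsplit("Entity:", 1)[-1]
    if last.rstrip().endswith("Label:"):
        return " company" if "sells" in last else " non-profit"
    return " sells software" if "Microsoft" in last else " funded by donations"


print(classify_with_explanation(toy_complete, EXAMPLES, "Microsoft"))
```

Keeping the explanation as a separate, readable intermediate step is the point of the design: a human can inspect it before trusting the label.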

We'd love to talk to people interested in exploring this approach to alignment!

I haven't read this in detail (hope to in the future); I only skimmed based on section headers.
I think the stuff about "what kinds of projects count" and "advantages over other genres" seems to miss an important alternative, which is to build and study toy models of the phenomena we care about. This is a bit like the gridworlds stuff, but I thought the description of that work missed its potential, and didn't provide much of an argument for why working at scale would be more valuable.

This approach (building and studying toy models) is popular in ML research, and the leaders of the field (e.g. Rich Sutton) are big proponents of it, and think it is undervalued in the current research climate.  I agree.
Shameless plug for my work that follows this approach: https://arxiv.org/abs/2009.09153

A relevant example would be to build toy models of "inaccessible information", and try to devise methods of extracting that information.

This type of research fails your criteria for what "counts" with flying colors, but in my mind it seems approximately equally valuable to the kind of experiments you seem to have in mind -- and much cheaper to perform!

The case in my mind for preferring to elicit and solve problems at scale rather than in toy demos (when that's possible) is pretty broad and outside-view, but I'd nonetheless bet on it: I think a general bias toward wanting to "practice something as close to the real thing as possible" is likely to be productive. In terms of the more specific benefits I laid out in this section, I think that toy demos are less likely to have the first and second benefits ("Practical know-how and infrastructure" and "Better AI situation in the run-up to superintelligence"), and I think they may miss some ways to get the third benefit ("Discovering or verifying a long-term solution") because some viable long-term solutions may depend on some details about how large models tend to behave.

I do agree that working with larger models is more expensive and time-consuming, and sometimes it makes sense to work in a toy environment instead, but other things being equal I think it's more likely that demos done at scale will continue to work for superintelligent systems, so it's exciting that this is starting to become practical.

Thanks for the response!
I see the approaches as more complementary.
Again, I think this is in keeping with standard/good ML practice.

A prototypical ML paper might first describe a motivating intuition, then formalize it via a formal model and demonstrate the intuition in that model (empirically or theoretically), then finally show the effect on real data.

The problem with only doing the real data (i.e. at scale) experiments is that it can be hard to isolate the phenomena you wish to study. And so a positive result does less to confirm the motivating intuition, as there are many other factors at play that might be responsible. We've seen this happen rather a lot in Deep Learning and Deep RL, in part because of the focus on empirical performance over a more scientific approach.

One easy way to get people who can't solve the task, for sandwiching purposes, is to take people who could solve the task and then give them insufficient time to solve it, or leave them uninformed of some relevant facts about the specific task they are trying to solve.

A simpler way to measure whether you are making progress towards sandwiching, if you can't get there directly, is to look at whether you can get people to provide better supervision with your tool than without it, that is, whether they accomplish more on the task.

Both of these approaches feel like they aren't quite solving the whole problem, because ultimately we want systems that help humans supervise tasks where they haven't developed the right concepts, or couldn't understand them even with years of study.
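
The with-tool versus without-tool comparison could be scored along these lines. This is only a sketch under assumptions: it treats supervision as discrete judgments that can be checked against a trusted reference (for example, unhurried experts), and the function names `accuracy` and `supervision_uplift` are made up for illustration.

```python
from typing import Sequence


def accuracy(judgments: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of judgments that match the trusted reference answers."""
    if len(judgments) != len(reference) or not reference:
        raise ValueError("judgments and reference must be non-empty and equal length")
    return sum(j == r for j, r in zip(judgments, reference)) / len(reference)


def supervision_uplift(
    without_tool: Sequence[str],
    with_tool: Sequence[str],
    reference: Sequence[str],
) -> float:
    """Positive values mean the tool helped the same people supervise better."""
    return accuracy(with_tool, reference) - accuracy(without_tool, reference)


# Invented toy data: time-limited supervisors judging four items.
reference = ["a", "b", "a", "b"]
unaided = ["a", "a", "a", "a"]
aided = ["a", "b", "a", "a"]
print(supervision_uplift(unaided, aided, reference))
```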

Thanks for the very in-depth case you're making! I especially liked the parts about the objections, and your take on some AI Alignment researchers' opinions of this proposal.

Personally, I'm enthusiastic about it with caveats expanded below. If I try to interpret your proposal according to the lines of my recent epistemological framing of AI Alignment research, you're pushing for a specific kind of work on the Solving part of the field, where you assume a definition of the terms of the problem (what AIs will we build and what do we want). My caveats can be summarized by saying what I say in my post: that as long as we're not really sure that we got the terms of the problem well-defined, we cannot make the whole field into this Solving part.

As a quick summary of what I get into in my detailed feedback, I think more work on this kind of problem will be net positive and very useful if:

• we are able to get reasonably good guarantees that doing a specific experiment doesn't present too big of a risk;
• this kind of work stays in conversation with what you call conceptual work;
• this kind of work doesn't replace other kinds of AI Alignment research completely.

Also, I think a good example of a running research project doing something similar is Tournesol. I have a post explaining what it is, but the idea boils down to building a database of expert feedback on YouTube videos on multiple axes, and leveraging it to train a more aligned recommendation algorithm for YouTube. One difference is that their idea does probably make the model more competent (it's not already using a trained model like GPT-3); yet the similarities are numerous enough that you might find it interesting.

In general, it seems to me like building and iterating on prototypes is a huge part of how R&D progress is made in engineering fields, and it would be exciting if AI alignment could move in that direction.

I agree with the general idea that getting more experimental work will be immensely valuable, but I’m worried about the comparison with traditional engineering. AI Alignment cannot just follow the engineering paradigm and wisdom of prototyping stuff willy-nilly, because every experiment could explode in our face. It seems closer to nuclear engineering, which required, AFAIK, preliminary work on and understanding of nuclear physics.

To summarize, I’m for finding constrained and safe ways to gather more experimental understanding, but pushing for more experiments without heeding the risks seems like one of the worst things we could do.

Holistically, this seems like a much safer situation to be in than one where the world has essentially procrastinated on figuring out how to align systems to fuzzy goals, doing only the minimum necessary to produce commercial products.

Is it the correct counterfactual, though? You seem to compare your proposed approach with a situation where no AI Alignment research is done. That hardly seems fair or representative of a plausible counterfactual.

Aligning narrowly superhuman models today could help build up tools, infrastructure, best practices, and tricks of the trade. I expect most of this will eventually be developed anyway, but speeding it up and improving its quality could still be quite valuable, especially in short timelines worlds where there's a lot less time for things to take their natural course.

Well, it depends whether it’s easier to get from the conceptual details to the implementation details, or the other way around. My guess would be the former, which means that working on implementation details before knowing what we want to implement is at best a really unproductive use of research time (even more in short timelines), at worst a waste of time. I'm curious if you have an argument for the opposite take.

Note that I’m specifically responding to this specific argument. I still think that experimental work can be tremendously valuable for solving the conceptual issues.

All this seems like it would make the world safer on the eve of transformative AI or AGI, and give humans more powerful and reliable tools for dealing with the TAI / AGI transition.

Agreed. That being said, pushing in this direction might also place us in a worse situation, for example by putting a lot of pressure on AIs to build human models which then make deception/manipulation significantly more accessible and worthwhile. I don’t really know how to think about this risk, but I certainly would want follow-up discussions on it.

More broadly, “doing empirical science on the alignment problem” -- i.e. systematically studying what the main problem(s) are, how hard they are, what approaches are viable and how they scale, etc -- could help us discover a number of different avenues for reducing long-run AI x-risk that we aren’t currently thinking of, one-shot technical solutions or otherwise.

Yes, yes and yes. Subject to preliminary thinking about the risks involved in such experimental research, that’s definitely a reason to push more for this kind of work.

Compared to conceptual research, I’d guess aligning narrowly superhuman models will feel meatier and more tractable to a number of people. It also seems like it would be easier for funders and peers to evaluate whether particular papers constitute progress, which would probably help create a healthier and more focused field where people are broadly more on the same page and junior researchers can get stronger mentorship. Related to both of these, I think it provides an easier opportunity for people who care about long-run x-risk to produce results that are persuasive and impressive to the broader ML community, as I mentioned above.

You present this as a positive, but I instead see a pretty big issue here. Because of everything you point out, most incentives will push towards doing only this kind of research. You’ll have more prestige, a better chance at a job, recognition by a bigger community. All of which is definitely good from a personal standpoint. Which means both that all newcomers will go on to the experimental type of work, and that such experiments will bear less and less relationship with the actual aligning of AI (and more and more with the specific kind of problems for which we find experimental solutions without the weird conceptual work).

In particular, you say that the field will be healthier because “people are more broadly on the same page”. That for me falls into the trap of believing that a paradigm is necessarily the right way to structure a field of research trying to solve a problem. As I argue here, a paradigm in this case basically means that you think you have circumscribed the problem well enough to not question it any more, and work single-mindedly on it. We’re amazingly far from that point in AI Alignment, and so that looks really dangerous, especially because shorter timelines won’t allow more than one or two such paradigms to unfold.

When it’s possible to demonstrate an issue at scale, I think that’s usually a pretty clear win.

Agreed, with the caveat I’ve been repeating about the check for risks due to the scale.

I think we have a shot at eventually supplying a lot of people to work on it too. In the long run, I think more EAs could be in a position to contribute to this type of work than to either conceptual research or mainstream ML safety.

This looks about right. Although I wonder if it wouldn’t be dangerous to have a lot of people working on the topic who don’t get the conceptual risks and/or the underlying ML technology. So I’m wondering if having people without the conceptual or ML skills work on that kind of project is safe.

I (conceptual person) broadly do agree that this is valuable.

It's possible that we won't need this work - that alignment research can develop AI that doesn't benefit from the same sort of work you'd do to get GPT-3 to do tricks on command. But it's also possible that this really would be practice for "the same sort of thing we want to eventually do."

My biggest concern is actually that the problem is going to be too easy for supervised learning. Need GPT-3 to dispense expert medical advice? Fine-tune it on a corpus of expert medical advice! Or for slightly more sophistication, fine-tune it to predict advice plus a score for how good the advice was, then condition on the score being high!
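
To make the "predict a score, then condition on it being high" trick concrete, here is one way the text formatting could look. This is a sketch under assumptions: the field names, the 0 to 10 scale, and the choice to put the score before the advice (so that a high score can be supplied as part of the prompt) are all illustrative, not something the comment specifies.

```python
def format_training_example(question: str, advice: str, score: int) -> str:
    """Fine-tuning text: the score precedes the advice it rates."""
    return f"Question: {question}\nScore: {score}/10\nAdvice: {advice}"


def format_conditioned_prompt(question: str, target_score: int = 10) -> str:
    """Inference prompt: assert a high score and let the model fill in advice."""
    return f"Question: {question}\nScore: {target_score}/10\nAdvice:"


# Invented example data.
train_text = format_training_example(
    "I have a mild headache. What should I do?",
    "Drink water and rest; see a doctor if it persists.",
    10,
)
prompt = format_conditioned_prompt("I have a mild headache. What should I do?")
print(prompt)
```

Because the prompt is a prefix of the corresponding top-scored training text, a model fine-tuned on such data would be asked to continue with the kind of advice it has learned to associate with high scores.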

These sorts of methods are important (and getting more important fast), but by their success they might shade out "understanding-based" approaches, or architecture tweaks that can't take advantage of a pretrained NN.

This wouldn't be an issue if I thought we could just ride supervised learning all the way to aligned AGI. But there are extra problems not confronted by the GPT-3 medical advice example - state of the art systems for acting in the real world might use reinforcement learning, which is a different kettle of fish to try to align, and we want to ensure AGI that acts sensibly even off-distribution (at least in the ways the far future can be off-distribution), and with many times the computing power.

There's also some unavoidable conceptual progress needed (You can fine-tune GPT-3 for medical advice with little philosophical worry, but how do you fine-tune GPT-3 for moral advice? Okay, now that you thought of the obvious answer, what's wrong with it?). Maybe this goes faster in the world where people actually are trying to fine-tune GPT-3 to give specific, consistent moral advice. Or maybe not, I dunno.

My biggest concern is actually that the problem is going to be too easy for supervised learning. Need GPT-3 to dispense expert medical advice? Fine-tune it on a corpus of expert medical advice! Or for slightly more sophistication, fine-tune it to predict advice plus a score for how good the advice was, then condition on the score being high!

I don't think you can get away with supervised learning if you're holding yourself to the standard of finding fuzzy tasks where the model is narrowly superhuman. E.g. the Stiennon et al., 2020 paper involved using RL from human feedback: roughly speaking, that's how it was possible for the model to actually improve upon humans rather than simply imitating them. And I think in some cases, the model will be capable of doing better than (some) humans' evaluations, meaning that to "get models to do the best they can to help us" we will probably need to do things like decomposition, training models to explain their decisions, tricks to amplify or de-noise human feedback, etc.
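
For readers unfamiliar with how human comparisons become a training signal, a common approach is to fit a reward model with a pairwise logistic loss and then optimize the policy against that reward. The sketch below shows only that loss on scalar rewards; it is a generic illustration, not a reproduction of any particular paper's code, and `preference_loss` is a made-up name.

```python
import math


def preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(reward_preferred - reward_rejected)).

    Computed as softplus(-diff) in a numerically stable form.
    """
    diff = reward_preferred - reward_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# The loss is small when the reward model agrees with the human's choice
# and large when it ranks the rejected sample higher.
print(preference_loss(2.0, 0.0))
print(preference_loss(0.0, 2.0))
```

The signal comes from people judging which output is better rather than from people writing the outputs themselves, which is why a policy trained this way is not limited to imitating human demonstrations.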

There's also some unavoidable conceptual progress needed (You can fine-tune GPT-3 for medical advice with little philosophical worry, but how do you fine-tune GPT-3 for moral advice? Okay, now that you thought of the obvious answer, what's wrong with it?)

I don't agree that there's obviously conceptual progress that's necessary for moral advice which is not necessary for medical advice — I'd expect a whole class of tasks to require similar types of techniques, and if there's a dividing line I don't think it is going to be "whether it's related to morality", but "whether it's difficult for the humans doing the evaluation to tell what's going on." To answer your question for both medical and moral advice, I'd say the obvious first thought is RL from human feedback, and the second thought I had to go beyond that is trying to figure out how to get less-capable humans to replicate the training signal produced by more-capable humans, without using any information/expertise from the latter to help the former (the "sandwiching" idea). I'm not sure if it'll work out though.

Re: part 1 -

Good points, I agree. Though I think you could broadly replicate the summarization result using supervised learning - the hope for using supervised learning in superhuman domains is that your model learns a dimension of variation for "goodness" that can generalize well even if you condition on "goodness" being slightly outside any of the training examples.

Re: part 2 -

What it boils down to is that my standards (and I think the practical standards) for medical advice are low, while my standards for moral advice are high (as in, you could use this to align AGI). I agree that there's no magic property a moral question has that no medical question could have. But there are non-magical properties I expect to be relevant.

With medical advice from a text model, I'm not expecting it to learn a detailed model of the human body and be able to infer new medical conditions and treatments that human experts haven't figured out yet. I'm just expecting it to do verbal reasoning to arrive at the same substantive advice a human expert would give, maybe packaged in a slightly superhumanly good explanation.

With moral advice, though, ask 3 human experts and you'll get 4 opinions. This is made worse by the fact that I've sneakily increased the size of the problem - "moral advice" can be about almost anything. Was it bad to pull the plug on Terri Schiavo? How much of your income should you give to charity? Is it okay to kiss your cousin twice removed? Is it a good future if all the humans are destructively copied to computers? Should we run human challenge trials for covid-19 vaccines?

Medical advice seems to be in the "supervisable regime," where it's fulfilled its promise by merely telling us things that human experts know. Moral advice is very not, because humans aren't consistent about morality in the same way they can be about medicine.

If MTurkers are on average anti-abortion and your experts are on average pro-choice, what the hell will your MTurkers think about training an algorithm that tries to learn from anti-abortion folks and output pro-choice responses? Suppose you then run that same algorithm on the experts and it gives outputs in favor of legalizing infanticide - are the humans allowed to say "hold on, I don't want that," or are we just going to accept that as what peak performance looks like? So anyhow I'm pessimistic about sandwiching for moral questions.

Getting better at eliciting human preferences does seem important, but again it has more wrinkles than for medicine. We have metapreferences (preferences about our preferences, or about how to resolve our own inconsistencies) that have few analogues in medicine. This immediately thrusts us into the domain beyond human capacity for direct evaluation. So I absolutely agree with you that we should be seeking out problems in this domain and trying to make progress on them. But I'm still pretty confident that we're missing some conceptual tools for doing well on these problems.

Even better than "Getting models to explain why they’re doing what they’re doing in simpler terms that connect to things the human overseers understand" would be getting models to actually do the task in ways that are simpler and connect to things that human overseers understand. E.g. if a model can solve a task in multiple steps by looking up relevant information by doing internet searches that are recorded and readable by the overseer instead of using knowledge opaquely measured in the weights, that seems like a step in the right direction.

Nice post. The one thing I'm confused about is:

Institutionally, we are very uncertain whether to prioritize this (and if we do where it should be housed and how our giving should be structured).

It seems to me that the type of research you're discussing here is already seen as a standard way to make progress on the full alignment problem - e.g. the Stiennon et al. paper you cited, plus earlier work on reward modeling by Christiano, Leike, and others. Can you explain why you're institutionally uncertain whether to prioritise it - is it because of the objections you outlined? But your responses to them seem persuasive to me - and more generally, the objections don't seem to address the fact that a bunch of people who are trying to solve long-term alignment problems actually ended up doing this research. So I'd be interested to hear elaborations and defences of those objections from people who find them compelling.

It seems to me that the type of research you're discussing here is already seen as a standard way to make progress on the full alignment problem - e.g. the Stiennon et al. paper you cited, plus earlier work on reward modeling by Christiano, Leike, and others. Can you explain why you're institutionally uncertain whether to prioritise it - is it because of the objections you outlined?

It's important to distinguish between:

• "We (Open Phil) are not sure whether we want to actively push this in the world at large, e.g. by running a grant round and publicizing it to a bunch of ML people who may or may not be aligned with us"
• "We (Open Phil) are not sure whether we would fund a person who seems smart, is generally aligned with us, and thinks that the best thing to do is reward modeling work"

My guess is that Ajeya means the former but you're interpreting it as the latter, though I could easily be wrong about either of those claims.

We're simply not sure where "proactively pushing to make more of this type of research happen" should rank relative to other ways we could spend our time and money right now, and determining that will involve thinking about a lot of things that are not covered in this post (most importantly what the other opportunities are for our time and money).

already seen as a standard way to make progress on the full alignment problem

It might be a standard way to make progress, but I don't feel that this work has been the default so far — the other three types of research I laid out seem to have absorbed significantly more researcher-hours and dollars among people concerned with long-term AI risk reduction. (It's possible that human feedback is more common among people motivated by profit, but I doubt that because it doesn't seem that profitable yet.)

Also, if we use a stricter definition of "narrowly superhuman" (i.e. the model should be capable of outperforming the evaluations — not just the demonstrations — of the humans training it), I'd argue that there hasn't been any work published on that so far.