Frequent arguments about alignment

John Schulman

Here, I’ll review some arguments that frequently come up in discussions about alignment research, involving one person skeptical of the endeavor (called Skeptic) and one person advocating to do more of it (called Advocate). I mostly endorse the views of the Advocate, but the Skeptic isn't a strawman and makes some decent points. The dialog is mostly based on conversations I've had with people who work on machine learning but don't specialize in safety and alignment.

This post has two purposes. First, I want to cache good responses to these questions, so I don't have to think about them each time the topic comes up. Second, I think it's useful for people who work on safety and alignment to be ready for the kind of pushback they'll get when pitching their work to others.

Just to introduce myself, I'm a cofounder of OpenAI and lead a team that works on developing and applying reinforcement learning methods; we're working on improving truthfulness and reasoning abilities of language models.

1. Does alignment get solved automatically as our models get smarter?

Skeptic: I think the alignment problem gets easier as our models get smarter. When we train sufficiently powerful generative models, they'll learn the difference between human smiles and human wellbeing; the difference between the truth and common misconceptions; and various concepts they'll need for aligned behavior. Given all of this internal knowledge, we just have to prompt them appropriately to get the desired behavior. For example, to get wise advice from a powerful language model, I just have to set up a conversation between myself and "a wise and benevolent AI advisor."

Advocate: The wise AI advisor you described has some basic problems, and I'll get into those shortly. But more generally, prompting an internet-trained generative model (like raw GPT-3) is a very poor way of getting aligned behavior, and we can easily do much better. It'll occasionally do something reasonable, but that's not nearly good enough.

Let's start with the wise AI advisor. Even if our model has internal knowledge about the truth and human wellbeing, that doesn't mean that it'll act on that knowledge the way we want. Rather, the model has been trained to imitate the training corpus, and therefore it'll repeat the misconceptions and flaws of typical authors, even if it knows that they're mistaken about something.

Another problem with prompting is that it's an unreliable method. Coming up with the perfect prompt is hard, and it requires evaluating each candidate prompt on a dataset of possible inputs. But if we do that, we're effectively training the prompt on this dataset, so we're hardly "just prompting" the model; we're training it (poorly). A nice recent paper studied the issue quantitatively.
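To make the "searching over prompts is training" point concrete, here is a minimal simulation sketch. It is not taken from the paper mentioned above, and every name and number in it (`simulate_prompt_selection`, 50 candidate prompts, a 20-example validation set, a true accuracy of 0.6) is a hypothetical chosen for illustration. All candidate prompts are assumed to be equally good, so any advantage the "best" prompt shows on the validation set comes purely from selecting on that set.

```python
import random


def simulate_prompt_selection(num_prompts=50, val_size=20, test_size=2000,
                              true_accuracy=0.6, seed=0):
    """Pick the best of several equally good prompts using a small validation set."""
    rng = random.Random(seed)

    def measured_accuracy(num_examples):
        # Assumption: each example is answered correctly with probability
        # `true_accuracy`, independently of which prompt is used.
        correct = sum(rng.random() < true_accuracy for _ in range(num_examples))
        return correct / num_examples

    # Score every candidate prompt on a small validation set.
    val_scores = [measured_accuracy(val_size) for _ in range(num_prompts)]
    best_prompt = max(range(num_prompts), key=val_scores.__getitem__)

    # Re-measure the chosen prompt on fresh data that played no role in selection.
    test_score = measured_accuracy(test_size)
    return best_prompt, val_scores[best_prompt], test_score


best_prompt, val_score, test_score = simulate_prompt_selection()
print(f"selected prompt #{best_prompt}: "
      f"validation={val_score:.2f}, held-out={test_score:.2f}")
```

Under these assumptions, the selected prompt should look noticeably better on the validation set than on held-out data. That gap is the same overfitting we would expect from any other procedure that fits parameters to a small dataset.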

So there's no getting around the fact that we need a final training step to get the model to do what we want (even if this training step just involves searching over prompts). And we can do much better than prompt design at selecting and reinforcing the correct behavior.

(1) Fine-tune to imitate high-quality data from trusted human experts.
(2) Optimize the right objective, which is usually hard to measure and optimize, and is not the logprob of the human-provided answer. (We'll need to use reinforcement learning.)
(3) Leverage models' own capabilities to help humans to demonstrate correct behavior and judge the models' behavior as in (1) and (2). Proposals for how to do this include debate, IDA, and recursive reward modeling. One early instantiation of this class of ideas involves retrieving evidence to help human judges.
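As a rough illustration of the second item above, one common way to get a trainable stand-in for a hard-to-measure objective is to fit a reward model to pairwise human comparisons, then optimize the policy against it with RL. The sketch below covers only the reward-fitting half, using a Bradley-Terry style pairwise loss on a linear model. The two features (`is_truthful`, `relative_length`), the toy comparisons, and the function names are assumptions made up for this example, not a description of any particular system.

```python
import math


def train_reward_model(comparisons, num_features, learning_rate=0.5, epochs=200):
    """Fit a linear reward r(x) = w . x from (preferred, rejected) feature pairs."""
    weights = [0.0] * num_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            diff = [p - r for p, r in zip(preferred, rejected)]
            margin = sum(w * d for w, d in zip(weights, diff))
            # Bradley-Terry model: P(preferred beats rejected) = sigmoid(margin).
            prob_correct = 1.0 / (1.0 + math.exp(-margin))
            # Gradient ascent on log P(preferred beats rejected).
            for i, d in enumerate(diff):
                weights[i] += learning_rate * (1.0 - prob_correct) * d
    return weights


def reward(weights, features):
    """Score one answer, described by its feature vector."""
    return sum(w * f for w, f in zip(weights, features))


# Hypothetical features per answer: [is_truthful, relative_length].
# In each pair, the (imagined) human judge preferred the first answer.
comparisons = [
    ([1.0, 0.2], [0.0, 0.9]),
    ([1.0, 0.8], [0.0, 0.3]),
    ([1.0, 0.5], [0.0, 0.5]),
    ([1.0, 0.1], [0.0, 0.1]),
]

weights = train_reward_model(comparisons, num_features=2)
print("learned weights:", weights)
```

The learned reward is only as good as the comparisons and features behind it, which is one reason the third item (helping humans judge better) matters.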

Honing these techniques will require a lot of thought and practice, regardless of the performance improvements we get from making our models bigger.

So far, my main point has been that just prompting isn't enough -- there are better ways of doing the final alignment step that fine-tunes models for the right objective. Returning to the original question, there was the claim that alignment gets easier as the models get smarter. It does get easier in some ways, but it also gets harder in others. Smarter models will be better at gaming our reward functions in unexpected and clever ways -- for example, producing the convincing illusion of being insightful or helpful, while actually being the opposite. And eventually they'll be capable of intentionally deceiving us.

2. Is alignment research distinct from capabilities research?

Skeptic: A lot of research that's called "alignment" or "safety" could've easily been motivated by the goal of making an AI-based product work well. And there's a lot of overlap between safety research and other ML research that's not explicitly motivated by safety. For example, RL from human preferences can be used for many applications like email autocomplete; its biggest showcase so far is summarization, which is a long-standing problem that's not safety-specific. AI alignment is about "getting models to do what we really want", but isn't the rest of AI research about that too? Is it meaningful to make a distinction between alignment and other non-safety ML research?

Advocate: It's true that it's impossible to fully disentangle alignment advances from other capabilities advances. For example, fine-tuning GPT-3 to answer questions or follow instructions is a case study of alignment, but it's also useful for many commercial applications.

While alignment research is useful for building products, it's usually not the lowest-hanging fruit. That's especially true for the hard alignment problems, like aligning superhuman models. To ensure that we keep making progress on these problems, it's important to have a research community (like Alignment Forum) that values this kind of work. And it's useful for organizations like OpenAI to have alignment-focused teams that can champion this work even when it doesn't provide the best near-term ROI.

While alignment and capabilities aren't distinct, they correspond to different directions that we can push the frontier of AI. Alignment advances make it easier to optimize hard-to-measure objectives like being helpful or truthful. Capabilities advances also sometimes make our models more helpful and more accurate, but they also make the models more potentially dangerous. For example, if someone fine-tunes a powerful model to maximize an easy-to-measure objective like ad click-through rate, or maximize the time users spend talking to a chatbot, or persuade people to support a political agenda, it can do a lot more damage than a weak model.

3. Is it worth doing alignment research now?

Skeptic: It seems like alignment advances are bottlenecked by model power. We can't make useful progress on alignment until our models are powerful enough to exhibit the relevant phenomena. For example, one of the big questions is how to align super-human models that are hard to supervise. We can think about this problem now, but it's going to be pretty speculative. We might as well just wait until we have such models.

Advocate: It's true that we can make faster progress on alignment problems when we can observe the relevant phenomena. But we can also make progress preemptively, with a little effort and cleverness. What if we don't?

In the short term, companies will deploy products that optimize simple objectives like revenue and engagement. Doing this responsibly while looking out for customer wellbeing is harder and requires a more sophisticated methodology, including alignment techniques. Unless the methodology is well-known, companies won't bother, and no public or regulatory pressure can force them to use something that doesn't exist.
In the long term, aligning superhuman AIs might end up being very hard, and it might require major conceptual advances. If those advances aren't ready in time, the consequences could be severe. If AI progress continues at its current fast pace, we might end up in a situation where AI is so useful that we're obligated to use it to run companies and make many decisions for us. But if we don't have sufficiently aligned AI by that point, then this could turn out badly.

In my experience, the "tech tree" of alignment research isn't bottlenecked by the scale of training. For example, I think it's possible to do a lot of useful research on improving the RL from human preferences methodology using existing models. "The case for aligning narrowly superhuman models" proposes some directions that seem promising to me.

Is there an even better critique that the Skeptic could make? Are Advocate's responses convincing? Does the Advocate have a stronger response to any of these questions? Let me know.

Thanks to Avery Pan, Beth Barnes, Jacob Hilton, and Jan Leike for feedback on earlier drafts.

Skeptic: It seems to me that the distinction between "alignment" and "misalignment" has become something of a motte and bailey. Historical arguments that AIs would be misaligned used it in sense 1: "AIs having sufficiently general and large-scale motivations that they acquire the instrumental goal of killing all humans (or equivalently bad behaviour)". Now people are using the word in sense 2: "AIs not quite doing what we want them to do". But when our current AIs aren't doing quite what we want them to do, is that mainly evidence that future, more general systems will be misaligned1 (which I agree is bad) or misaligned2?

Advocate: Concepts like agency are continuous spectra. GPT-3 is a little bit agentic, and we'll eventually build AGIs that are much more agentic. Insofar as GPT-3 is trying to do something, it's trying to do the wrong thing. So we should expect future systems to be trying to do the wrong thing in a much more worrying way (aka be misaligned1) for approximately the same reason: that we trained them on loss functions that incentivised the wrong thing.

Skeptic: I agree that this is possible. But what should our update be after observing large language models? You could look at the difficulties of making GPT-3 do exactly what we want, and see this as evidence that misalignment is a big deal. But actually, large language models seem like evidence against misalignment1 being a big deal (because they seem to be quite intelligent without being very agentic, but the original arguments for worrying about misalignment1 relied on the idea that intelligence and agency are tightly connected, making it very hard to build superintelligent systems which don't have large-scale goals).

Advocate: Even if that's true for the original arguments, it's not for more recent arguments.

Skeptic: These newer arguments rely on assumptions about economic competition and coordination failures which seem quite speculative to me, and which haven't been vetted very much.

Advocate: These assumptions seem like common sense to me - e.g. lots of people are already worried about the excesses of capitalism. But even if they're speculative, they're worth putting a lot of effort into understanding and preparing for.

In case it wasn't clear from inside the dialogue, I'm quite sympathetic to both sides of this conversation (indeed, it's roughly a transcript of a debate that I've had with myself a few times). I think more clarity on these topics would be very valuable.

On 2, I would probably make a stronger additional claim, that even the parts of alignment research that are "just capabilities" don't seem to happen by default, and the vast majority of work done in this space (at least with large models) seems to have been driven by people motivated to work on alignment. Yes, in some ideal world, those working towards AI-based products would have done things like RL from human feedback, but empirically they don't seem to be doing that.

(I also agree with the response you've given in the post.)

Yeah, that's also a good point, though I don't want to read too much into it, since it might be a historical accident.

Whether this is a point for the advocate or the skeptic depends on whether advances in RL from human feedback unlock other alignment work more than they unlock other capabilities work. I think there's room for reasonable disagreement on this question, although I favour the former.

Thanks for writing this. I've been having a lot of similar conversations, and found your post clarifying in stating a lot of core arguments clearly.

Is there an even better critique that the Skeptic could make?

Focusing first on human preference learning as a subset of alignment research: I think most ML researchers "should" agree on the importance of simple human preference learning, both from a safety and capabilities perspective. If we take the narrower question "should we do human preference learning, or is pretraining + minimal prompt engineering enough?", I feel confident in the answer you give as Advocate: To the extent prompt engineering works, it's because it's preference learning in disguise, and leaning into preference learning (including supervised / RL finetuning) will work much better. Both the theoretical and empirical pictures to-date agree with this.

(My sense is that not all ML researchers immediately agree with this / maybe just haven't considered the question in this frame, but that most researchers are pretty receptive to it and will agree in discussion.)

So I think a more challenging Skeptic might say: "Perhaps simple human preference learning is enough, and we can focus all alignment research there. Why do we need the other research directions in the alignment portfolio like handling inaccessible information, deceptive mesa-optimizers, or interpretability?" Here, "simple" human preference learning is referring to something like supervised (your step 1 for Question 1) + RL finetuning (step 2) + ad hoc ways of making it easier for humans to supervise models (limited versions of step 3).

I again side with Advocate here, but I think making the case is more difficult (and also perhaps requires different arguments for different research directions). I don't have a response for this as short or convincing as what you have here. My typical response would expand on your points that more capable models will be more dangerous and that alignment might turn out to be very hard, so it's important to consider these potential difficulties in advance. The hardness claim would probably involve failure stories (along these lines) or more abstract hardness arguments (along these lines).

Agree with what you've written here -- I think you put it very well.

I think this is a really useful post, thanks for making this! I maybe have a few things I'd add but broadly I agree with everything here.

You might want to reference Ajeya's post on 'Aligning Narrowly Superhuman Models' where you're discussing alignment research that can be done with current models

yup, added a sentence about it

Planned summary for the Alignment Newsletter:

This post outlines three AI alignment skeptic positions and corresponding responses from an advocate. Note that while the author tends to agree with the advocate’s view, they also believe that the skeptic makes good points.
1. The alignment problem gets easier as models get smarter, since they start to learn the difference between, say, human smiles and human well-being. So all we need to do is to prompt them appropriately, e.g. by setting up a conversation with “a wise and benevolent AI advisor”.
_Response:_ We can do a lot better than prompting: in fact, <@a recent paper@>(@True Few-Shot Learning with Language Models@) showed that prompting is effectively (poor) finetuning, so we might as well finetune. Separately from prompting itself, alignment does get easier in some ways as models get smarter, but it also gets harder: for example, smarter models will game their reward functions in more unexpected and clever ways.
2. What’s the difference between alignment and capabilities anyway? Something like <@RL from human feedback for summarization@>(@Learning to Summarize with Human Feedback@) could equally well have been motivated through a focus on AI products.
_Response:_ While there’s certainly overlap, alignment research is usually not the lowest-hanging fruit for building products. So it’s useful to have alignment-focused teams that can champion the work even when it doesn’t provide the best near-term ROI.
3. We can’t make useful progress on aligning superhuman models until we actually have superhuman models to study. Why not wait until those are available?
_Response:_ If we don’t start now, then in the short term, companies will deploy products that optimize simple objectives like revenue and engagement, which could be improved by alignment work. In the long term, it is plausible that alignment is very hard, such that we need many conceptual advances that we need to start on now to have them ready by the point that we feel obligated to use powerful AI systems. In addition, empirically there seem to be many alignment approaches that aren’t bottlenecked by the capabilities of models -- see for example <@this post@>(@The case for aligning narrowly superhuman models@).

Planned opinion:

I generally agree with these positions and responses, and I’m especially happy that the arguments are specific to the actual models we use today, which grounds the discussion a lot more and makes it easier to make progress. On the second point in particular, I’d also [say](https://www.alignmentforum.org/posts/6ccG9i5cTncebmhsH/frequent-arguments-about-alignment?commentId=cwgpCBfwHaYravLpo) that empirically, product-focused people don’t do e.g. RL from human feedback, even if it could be motivated that way.

This post has two purposes. First, I want to cache good responses to these questions, so I don't have to think about them each time the topic comes up. Second, I think it's useful for people who work on safety and alignment to be ready for the kind of pushback they'll get when pitching their work to others.

Great idea, thanks for writing this!

Thanks for the post, it's a great idea to have both arguments.

My personal preference would be to have both arguments be the same length, to properly compare their strength (the skeptic gets one paragraph, the advocate gets 3-6x more), and not always in the same skeptic-then-advocate order, but also advocate -> skeptic or even skeptic -> advocate -> skeptic -> ..., so it does not appear like one is the "haven't thought about it much" view.

Thanks for this post! I have to admit that I took some time to read it because I believed that it would be basic, but I really like the focus on more current techniques (which makes sense since you cofounded and work at OpenAI).

Let's start with the wise AI advisor. Even if our model has internal knowledge about the truth and human wellbeing, that doesn't mean that it'll act on that knowledge the way we want. Rather, the model has been trained to imitate the training corpus, and therefore it'll repeat the misconceptions and flaws of typical authors, even if it knows that they're mistaken about something.

That doesn't feel as bad as you describe it for me. Sure, if you literally call a "wise old man" from the literature (or god forbid, reddit), that might end up pretty badly. But we might go for a tighter control around the sort of "language producer" we're trying to instantiate. Or go microscope AI.

All these do require more alignment-focused work though. I'm particularly excited about perspectives of language models as simulators of many small models of things producing/influencing language, and about techniques related to that view, like meta-prompts or counterfactual parsing.

I also feel like this answer from the Advocate disparages a potentially very big deal for language models: the fact that they might pick up the human abstractions because they learn to model language, and our use of language is littered with these abstractions. This is a potentially strong version of the natural abstraction hypothesis, which seems like it makes the problem easier in some ways. For example, we have more chance of understanding what the model might do because it's trying to predict a system (language) that we use constantly at that level of granularity, as opposed to images, which we never think of pixel by pixel.

Optimize the right objective, which is usually hard to measure and optimize, and is not the logprob of the human-provided answer. (We'll need to use reinforcement learning.)

I want to point out that from an alignment standpoint, this looks like a very dangerous step. One thing language models have going for them is that what they optimize for isn't exactly what we use them for, and so they avoid potential issues like goodharting. This would be completely destroyed by adding an explicit optimization step at the end.

Returning to the original question, there was the claim that alignment gets easier as the models get smarter. It does get easier in some ways, but it also gets harder in others. Smarter models will be better at gaming our reward functions in unexpected and clever ways -- for example, producing the convincing illusion of being insightful or helpful, while actually being the opposite. And eventually they'll be capable of intentionally deceiving us.

I think this is definitely an important point that goes beyond the special case of language models that you mostly discuss before.

While alignment and capabilities aren't distinct, they correspond to different directions that we can push the frontier of AI. Alignment advances make it easier to optimize hard-to-measure objectives like being helpful or truthful. Capabilities advances also sometimes make our models more helpful and more accurate, but they also make the models more potentially dangerous.

One thing I would want to point out is that another crucial difference lies in the sort of conceptual research that is done in alignment. Deconfusion of ideas like power-seeking, enlightened judgment and goal-directedness is rarely that useful for capabilities, but I'm pretty convinced it is crucial for better understanding the alignment risks and how to deal with them.

For #2, not sure if this is a skeptic or an advocate point: why have a separate team at all? When designing a bridge you don't have one team of engineers making the bridge, and a separate team of engineers making sure the bridge doesn't fall down. Within OpenAI, isn't everyone committed to good things happening, and not just strictly picking the lowest-hanging fruit? If alignment-informed research is better long-term, why isn't the whole company the "safety team" out of simple desire to do their job?

We could make this more obviously skeptical by rephrasing it as a wisdom-of-the-crowds objection. You say we need people focused on alignment because it's not always the lowest-hanging fruit. But other people aren't dumb and want things to go well - are you saying they're making a mistake?

And then you have to either say yes, they're making a mistake because people are (e.g.) both internally and externally over-incentivized to do things that have flashy results now, or no, they're not making a mistake, in fact having a separate alignment group is a good idea even in a perfect world because of (e.g.) specialization of basic research, or some combination of the two.

In my experience, you need separate teams doing safety research because specialization is useful -- it's easiest to make progress when both individuals and teams specialize a bit and develop taste and mastery of a narrow range of topics.
