Alignment As A Bottleneck To Usefulness Of GPT-3

by johnswentworth, 21st Jul 2020

So there’s this thing where GPT-3 is able to do addition: it has the internal model to do addition, but it takes a little poking and prodding to actually get it to use that model. “Few-shot learning”, as the paper calls it. Rather than prompting the model with

Q: What is 48 + 76? A:

… instead prompt it with

Q: What is 48 + 76? A: 124

Q: What is 34 + 53? A: 87

Q: What is 29 + 86? A:

The same applies to lots of other tasks: arithmetic, anagrams and spelling correction, translation, assorted benchmarks, etc. To get GPT-3 to do the thing we want, it helps to give it a few examples, so it can “figure out what we’re asking for”.
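As a minimal sketch of what that looks like mechanically, here is one way to assemble such a few-shot prompt from solved examples. The function name `build_few_shot_prompt` and the exact Q/A layout are assumptions made for illustration; the resulting string would then be handed to whatever text-completion interface is available.

```python
def build_few_shot_prompt(examples, query):
    """Format solved addition problems, followed by one unsolved problem.

    `examples` is a list of (a, b) pairs whose answers get filled in;
    `query` is the (a, b) pair we actually want the model to answer.
    The "Q: ... A: ..." layout is just one plausible convention.
    """
    lines = [f"Q: What is {a} + {b}? A: {a + b}" for a, b in examples]
    a, b = query
    # Leave the final answer blank so the model's completion supplies it.
    lines.append(f"Q: What is {a} + {b}? A:")
    return "\n\n".join(lines)


prompt = build_few_shot_prompt([(48, 76), (34, 53)], (29, 86))
print(prompt)
```

The solved examples carry no new information about addition itself; their only job is to make "the next thing written is the sum" the most likely continuation.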

This is an alignment problem. Indeed, I think of it as the quintessential alignment problem: to translate what-a-human-wants into a specification usable by an AI. The hard part is not to build a system which can do the thing we want; the hard part is to specify the thing we want in such a way that the system actually does it.

The GPT family of models are trained to mimic human writing. So the prototypical “alignment problem” on GPT is prompt design: write a prompt such that actual human writing which started with that prompt would likely contain the thing you actually want. Assuming that GPT has a sufficiently powerful and accurate model of human writing, it should then generate the thing you want.

Viewed through that frame, “few-shot learning” just designs a prompt by listing some examples of what we want - e.g. listing some addition problems and their answers. Call me picky, but that seems like a rather primitive way to design a prompt. Surely we can do better?

Indeed, people are already noticing clever ways to get better results out of GPT-3 - e.g. TurnTrout recommends conditioning on writing by smart people, and the right prompt makes the system complain about nonsense rather than generating further nonsense in response. I expect we’ll see many such insights over the next month or so.

Capabilities vs Alignment as Bottleneck to Value

I said that the alignment problem on GPT is prompt design: write a prompt such that actual human writing which started with that prompt would likely contain the thing you actually want. Important point: this is worded to be agnostic to the details of the GPT algorithm itself; it’s mainly about predictive power. If we’ve designed a good prompt, the current generation of GPT might still be unable to solve the problem - e.g. GPT-3 doesn’t understand long addition no matter how good the prompt, but some future model with more predictive power should eventually be able to solve it.

In other words, there’s a clear distinction between alignment and capabilities:

  • alignment is mainly about the prompt, and asks whether human writing which started with that prompt would be likely to contain the thing you want
  • capabilities are mainly about GPT’s model, and ask about how well GPT-generated writing matches realistic human writing
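
One rough way to write that distinction down, as a sketch rather than a definition: the symbols $P_{\text{human}}$, $P_{\text{GPT}}$, and the mismatch measure $D$ are notation introduced here for illustration, and $D$ is left deliberately unspecified.

```latex
% Alignment: a property of the prompt alone, judged against real human writing.
\Pr_{x \sim P_{\text{human}}(\cdot \mid \text{prompt})}
  \bigl[\, x \text{ contains the thing you want} \,\bigr]
  \quad \text{should be high}

% Capabilities: a property of the model, judged by how closely it tracks human writing.
D\bigl( P_{\text{human}}(\cdot \mid \text{prompt}),\;
        P_{\text{GPT}}(\cdot \mid \text{prompt}) \bigr)
  \quad \text{should be small}
```

The first quantity does not mention GPT at all, which is the sense in which the alignment question is agnostic to the model.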

Interesting question: between alignment and capabilities, which is the main bottleneck to getting value out of GPT-like models, both in the short term and the long(er) term?

In the short term, it seems like capabilities are still pretty obviously the main bottleneck. GPT-3 clearly has pretty limited “working memory” and understanding of the world. That said, it does seem plausible that GPT-3 could consistently do at least some economically-useful things right now, with a carefully designed prompt - e.g. writing ad copy or editing humans’ writing.

In the longer term, though, we have a clear path forward for better capabilities. Just continuing along the current trajectory will push capabilities to an economically-valuable point on a wide range of problems, and soon. Alignment, on the other hand, doesn’t have much of a trajectory at all yet; designing-writing-prompts-such-that-writing-which-starts-with-the-prompt-contains-the-thing-you-want isn’t exactly a hot research area. There’s probably low-hanging fruit there for now, and it’s largely unclear how hard the problem will be going forward.

Two predictions on this front:

  • With this version of GPT and especially with whatever comes next, we’ll start to see a lot more effort going into prompt design (or the equivalent alignment problem for future systems)
  • As the capabilities of GPT-style models begin to cross beyond what humans can do (at least in some domains), alignment will become a much harder bottleneck, because it’s hard to make a human-mimicking system do things which humans cannot do

Reasoning for the first prediction: GPT-3 is right on the borderline of making alignment economically valuable - i.e. it’s at the point where there’s plausibly some immediate value to be had by figuring out better ways to write prompts. That means there’s finally going to be economic pressure for alignment - there are going to be ways to make money by coming up with better alignment tricks. That won’t necessarily mean economic pressure for generalizable or robust alignment tricks, though - most of the economy runs on ad-hoc barely-good-enough tricks most of the time, and early alignment tricks will likely be the same. In the longer run, focus will shift toward more robust alignment, as the low-hanging problems are solved and the remaining problems have most of their value in the long tail.

Reasoning for the second prediction: how do I write a prompt such that human writing which began with that prompt would contain a workable explanation of a cheap fusion power generator? In practice, writing which claims to contain such a thing is generally crackpottery. I could take a different angle, maybe write some section-headers with names of particular technologies (e.g. electric motor, radio antenna, water pump, …) and descriptions of how they work, then write a header for “fusion generator” and let the model fill in the description. Something like that could plausibly work. Or it could generate scifi technobabble, because that’s what would be most likely to show up in such a piece of writing today. It all depends on which is “more likely” to appear in human writing. Point is: GPT is trained to mimic human writing; getting it to write things which humans cannot currently write is likely to be hard, even if it has the requisite capabilities.
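
The section-header idea can be sketched as a prompt builder. Everything below is illustrative: the function name `build_catalog_prompt`, the header layout, and the one-line technology descriptions are all assumptions, and nothing here says whether a model would complete the final section with a workable design or with technobabble.

```python
def build_catalog_prompt(known, target):
    """Lay out header/description sections, ending on an empty target section.

    `known` is a list of (header, description) pairs for technologies that
    humans can already describe; `target` is the header left for the model.
    """
    sections = [f"{header}\n{description}" for header, description in known]
    # The final header has no description: that is the part to be completed.
    sections.append(f"{target}\n")
    return "\n\n".join(sections)


# Placeholder descriptions, written only to make the sketch concrete.
known_technologies = [
    ("Electric motor",
     "Current through coils in a magnetic field produces a torque on a rotor."),
    ("Radio antenna",
     "An oscillating current in a conductor radiates electromagnetic waves."),
    ("Water pump",
     "A spinning impeller raises fluid pressure and pushes water along a pipe."),
]

prompt = build_catalog_prompt(known_technologies, "Fusion generator")
print(prompt)
```

The design choice is to make the surrounding document look like one in which every section is a sober, accurate explanation, in the hope that this shifts which continuation is most likely.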
