# 46

So there’s this thing where GPT-3 is able to do addition, it has the internal model to do addition, but it takes a little poking and prodding to actually get it to do addition. “Few-shot learning”, as the paper calls it. Rather than prompting the model with

Q: What is 48 + 76? A:

it works better to prompt it with a few worked examples first:

Q: What is 48 + 76? A: 124

Q: What is 34 + 53? A: 87

Q: What is 29 + 86? A:

The same applies to lots of other tasks: arithmetic, anagrams and spelling correction, translation, assorted benchmarks, etc. To get GPT-3 to do the thing we want, it helps to give it a few examples, so it can “figure out what we’re asking for”.
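As a concrete illustration, the few-shot prompt above can be assembled mechanically from a list of solved examples plus one unsolved query. This is only a minimal sketch: the function name `build_few_shot_prompt` and the blank-line separator are my own assumptions, not anything from the paper or the OpenAI API.

```python
def build_few_shot_prompt(examples, query):
    """Format solved (question, answer) pairs, then one unsolved question.

    Hypothetical helper for illustration; the "Q: ... A: ..." layout simply
    mirrors the addition example in the text.
    """
    lines = [f"Q: {question} A: {answer}" for question, answer in examples]
    # The final question is left unanswered so the model continues from "A:".
    lines.append(f"Q: {query} A:")
    return "\n\n".join(lines)


prompt = build_few_shot_prompt(
    [("What is 48 + 76?", "124"), ("What is 34 + 53?", "87")],
    "What is 29 + 86?",
)
print(prompt)
```

The resulting string is what would be sent to the model as its prompt; everything the model is "told" about the task lives in those example lines.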

This is an alignment problem. Indeed, I think of it as the quintessential alignment problem: to translate what-a-human-wants into a specification usable by an AI. The hard part is not to build a system which can do the thing we want, the hard part is to specify the thing we want in such a way that the system actually does it.

The GPT family of models are trained to mimic human writing. So the prototypical “alignment problem” on GPT is prompt design: write a prompt such that actual human writing which started with that prompt would likely contain the thing you actually want. Assuming that GPT has a sufficiently powerful and accurate model of human writing, it should then generate the thing you want.

Viewed through that frame, “few-shot learning” just designs a prompt by listing some examples of what we want - e.g. listing some addition problems and their answers. Call me picky, but that seems like a rather primitive way to design a prompt. Surely we can do better?

Indeed, people are already noticing clever ways to get better results out of GPT-3 - e.g. TurnTrout recommends conditioning on writing by smart people, and the right prompt makes the system complain about nonsense rather than generating further nonsense in response. I expect we’ll see many such insights over the next month or so.

## Capabilities vs Alignment as Bottleneck to Value

I said that the alignment problem on GPT is prompt design: write a prompt such that actual human writing which started with that prompt would likely contain the thing you actually want. Important point: this is worded to be agnostic to the details of the GPT algorithm itself; it’s mainly about predictive power. If we’ve designed a good prompt, the current generation of GPT might still be unable to solve the problem - e.g. GPT-3 doesn’t understand long addition no matter how good the prompt, but some future model with more predictive power should eventually be able to solve it.

In other words, there’s a clear distinction between alignment and capabilities:

• alignment is mainly about the prompt, and asks whether human writing which started with that prompt would be likely to contain the thing you want
• capabilities are mainly about GPT’s model, and ask about how well GPT-generated writing matches realistic human writing

Interesting question: between alignment and capabilities, which is the main bottleneck to getting value out of GPT-like models, both in the short term and the long(er) term?

In the short term, it seems like capabilities are still pretty obviously the main bottleneck. GPT-3 clearly has pretty limited “working memory” and understanding of the world. That said, it does seem plausible that GPT-3 could consistently do at least some economically-useful things right now, with a carefully designed prompt - e.g. writing ad copy or editing humans’ writing.

In the longer term, though, we have a clear path forward for better capabilities. Just continuing along the current trajectory will push capabilities to an economically-valuable point on a wide range of problems, and soon. Alignment, on the other hand, doesn’t have much of a trajectory at all yet; designing-writing-prompts-such-that-writing-which-starts-with-the-prompt-contains-the-thing-you-want isn’t exactly a hot research area. There’s probably low-hanging fruit there for now, and it’s largely unclear how hard the problem will be going forward.

Two predictions on this front:

• With this version of GPT and especially with whatever comes next, we’ll start to see a lot more effort going into prompt design (or the equivalent alignment problem for future systems)
• As the capabilities of GPT-style models begin to cross beyond what humans can do (at least in some domains), alignment will become a much harder bottleneck, because it’s hard to make a human-mimicking system do things which humans cannot do

Reasoning for the first prediction: GPT-3 is right on the borderline of making alignment economically valuable - i.e. it’s at the point where there’s plausibly some immediate value to be had by figuring out better ways to write prompts. That means there’s finally going to be economic pressure for alignment - there are going to be ways to make money by coming up with better alignment tricks. That won’t necessarily mean economic pressure for generalizable or robust alignment tricks, though - most of the economy runs on ad-hoc barely-good-enough tricks most of the time, and early alignment tricks will likely be the same. In the longer run, focus will shift toward more robust alignment, as the low-hanging problems are solved and the remaining problems have most of their value in the long tail.

Reasoning for the second prediction: how do I write a prompt such that human writing which began with that prompt would contain a workable explanation of a cheap fusion power generator? In practice, writing which claims to contain such a thing is generally crackpottery. I could take a different angle, maybe write some section-headers with names of particular technologies (e.g. electric motor, radio antenna, water pump, …) and descriptions of how they work, then write a header for “fusion generator” and let the model fill in the description. Something like that could plausibly work. Or it could generate scifi technobabble, because that’s what would be most likely to show up in such a piece of writing today. It all depends on which is "more likely" to appear in human writing. Point is: GPT is trained to mimic human writing; getting it to write things which humans cannot currently write is likely to be hard, even if it has the requisite capabilities.
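To make the section-header idea concrete, here is a rough sketch of how such a prompt might be laid out. Everything in it is an assumption for illustration: the helper name `build_catalog_prompt`, the `##` header style, and the one-line descriptions are mine, and whether a model would continue it with a workable design rather than technobabble is exactly the open question.

```python
def build_catalog_prompt(known, target):
    """Render (name, description) entries as sections, then an open section.

    Hypothetical helper: the model is meant to continue the text after the
    final header, in the same register as the entries above it.
    """
    sections = [f"## {name}\n\n{description}" for name, description in known]
    # The last header has no body; the model's completion fills it in.
    sections.append(f"## {target}\n\n")
    return "\n\n".join(sections)


known_technologies = [
    ("Electric motor",
     "Converts electrical energy into rotation by passing current through "
     "coils in a magnetic field."),
    ("Radio antenna",
     "Converts between electrical currents in a conductor and radio waves."),
    ("Water pump",
     "Moves water by mechanical action, for example a spinning impeller."),
]

catalog_prompt = build_catalog_prompt(known_technologies, "Fusion generator")
print(catalog_prompt)
```

The design choice here is to surround the target with real, working technologies, in the hope that "human writing which began with this text" would most plausibly continue with another real description.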


I wonder how long we'll be in the "prompt programming" regime. As Nick Cammarata put it:

We should actually be programming these by manipulating the hidden layers and prompts are a stand-in until we can.

My guess is that OpenAI will pretty quickly (within the next year) find a much better way to interface with what GPT-3 has learned.

Do others agree? Any reason to think that wouldn't be possible (or wouldn't give significant benefits)?

The problem with directly manipulating the hidden layers is reusability. If we directly manipulate the hidden layers, then we have to redo that whenever a newer, shinier model comes out, since the hidden layers will presumably be different. On the other hand, a prompt is designed so that human writing which starts with that prompt will likely contain the thing we want - a property mostly independent of the internal structure of the model, so presumably the prompt can be reused.

I think the eventual solution here (and a major technical problem of alignment) is to take an internal notion learned by one model (i.e. found via introspection tools), back out a universal representation of the real-world pattern it represents, then match that real-world pattern against the internals of a different model in order to find the "corresponding" internal notion. Assuming that the first model has learned a real pattern which is actually present in the environment, we should expect that "better" models will also have some structure corresponding to that pattern - otherwise they'd lose predictive power on at least the cases where that pattern applies. Ideally, this would all happen in such a way that the second model can be more accurate, and that increased accuracy would be used.

In the shorter term, I agree OpenAI will probably come up with some tricks over the next year or so.

Planned summary for the Alignment Newsletter:

Currently, many people are trying to figure out how to prompt GPT-3 into doing what they want -- in other words, how to align GPT-3 with their desires. GPT-3 may be capable of the task, but that doesn’t mean it will do it (potential example). This suggests that alignment will soon be a bottleneck on our ability to get value from large language models.
Certainly GPT-3 isn’t perfectly capable yet. The author thinks that in the immediate future the major bottleneck will still be its capability, but we have a clear story for how to improve its capabilities: just scale up the model and data even more. Alignment on the other hand is much harder: we don’t know how to translate (see “Alignment as Translation”) the tasks we want into a format that will cause GPT-3 to “try” to accomplish that task.
As a result, in the future we might expect a lot more work to go into prompt design (or whatever becomes the next way to direct language models at specific tasks). In addition, once GPT is better than humans (at least in some domains), alignment in those domains will be particularly difficult, as it is unclear how you would get a system trained to mimic humans to do better than humans (see “The easy goal inference problem is still hard”).

Planned opinion:

The general point of this post seems clearly correct and worth pointing out. I’m looking forward to the work we’ll see in the future figuring out how to apply these broad and general methods to real tasks in a reliable way.

LGTM

I think it's important to take a step back and notice how AI risk-related arguments are shifting.

In the sequences, a key argument (probably the key argument) for AI risk was the complexity of human value, and how it would be highly anthropomorphic for us to believe that our evolved morality was embedded in the fabric of the universe in a way that any intelligent system would naturally discover.  An intelligent system could just as easily maximize paperclips, the argument went.

No one seems to have noticed that GPT actually does a lot to invalidate the original complexity-of-value-means-FAI-is-super-difficult argument.

You write:

To get GPT-3 to do the thing we want, it helps to give it a few examples, so it can “figure out what we’re asking for”.

This is an alignment problem.

We've gotten from "the alignment problem is about complexity of value" to "the alignment problem is about programming by example" (also known as "supervised learning", or Machine Learning 101).

There's actually a long history of systems which combine

observing-lots-of-data-about-the-world (GPT-3's training procedure, "unsupervised learning")

with

programming-by-example ("supervised learning")

The term for this is "semi-supervised learning".  When I search for it on Google Scholar, I get almost 100K results. ("Transfer learning" is a related literature.)

The fact that GPT-3's API only does text completion is, in my view, basically just an API detail that we shouldn't particularly expect to be true of GPT-4 or GPT-5.  There's no reason why OpenAI couldn't offer an API which takes in a list of (x, y) pairs and then given some x it predicts y.  I expect if they chose to do this as a dedicated engineering effort, getting into the guts of the system as needed, and collected a lot of user feedback on whether the predicted y was correct for many different problems, they could exceed the performance gains you can currently get by manipulating the prompt.
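For concreteness, here is roughly what that interface could look like from the caller's side. This is a sketch under loud assumptions: `PairPredictor` is a hypothetical name, it merely wraps a text-completion function (i.e. it is the prompt-manipulation baseline, not the dedicated engineering effort described above), and `echo_stub` is a toy stand-in that is not a language model.

```python
from typing import Callable, List, Tuple


class PairPredictor:
    """Hypothetical (x, y)-pair interface layered over text completion."""

    def __init__(self, complete: Callable[[str], str],
                 pairs: List[Tuple[str, str]]):
        self.complete = complete  # any function mapping prompt -> completion
        self.pairs = list(pairs)

    def prompt_for(self, x: str) -> str:
        """Lay out the known pairs, then the new input with an open output."""
        shots = [f"Input: {px}\nOutput: {py}" for px, py in self.pairs]
        shots.append(f"Input: {x}\nOutput:")
        return "\n\n".join(shots)

    def predict(self, x: str) -> str:
        """Return the first line of the completion as the predicted y."""
        completion = self.complete(self.prompt_for(x)).strip()
        return completion.splitlines()[0] if completion else ""


def echo_stub(prompt: str) -> str:
    """Toy stand-in for a model: upper-cases the final input, then rambles."""
    last_input = prompt.rsplit("Input: ", 1)[1].split("\nOutput:")[0]
    return " " + last_input.upper() + "\n\nInput: something else"


predictor = PairPredictor(echo_stub, [("cat", "CAT"), ("dog", "DOG")])
print(predictor.predict("bird"))
```

Swapping `echo_stub` for a real completion call would leave the caller's code unchanged, which is the sense in which text completion is "just an API detail" here.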

In other words, there’s a clear distinction between alignment and capabilities:

• alignment is mainly about the prompt, and asks whether human writing which started with that prompt would be likely to contain the thing you want
• capabilities are mainly about GPT’s model, and ask about how well GPT-generated writing matches realistic human writing

I'm wary of a world where "the alignment problem" becomes just a way to refer to "whatever the difference is between our current system and the ideal system". (If I trained a supervised learning system to classify word vectors based on whether they're things that humans like or dislike, and the result didn't work very well, I can easily imagine some rationalists telling me this represented a failure to "solve the alignment problem"--even if the bottleneck was mainly in the word vectors themselves, as evidenced e.g. by large performance improvements on switching to higher-dimensional word vectors.)  I'm reminded of a classic bad argument.

As the capabilities of GPT-style models begin to cross beyond what humans can do (at least in some domains), alignment will become a much harder bottleneck, because it’s hard to make a human-mimicking system do things which humans cannot do

If it's hard to make a human-mimicking system do things which humans cannot do, why should we expect the capabilities of GPT-style models to cross beyond what humans can do in the first place?

My steelman of what you're saying:

Over the course of GPT's training procedure, it incidentally acquires superhuman knowledge, but then that superhuman knowledge gets masked as it sees more data and learns which specific bits of its superhuman knowledge humans are actually ignorant of (and even after catastrophic forgetting, some bits of superhuman knowledge remain at the end of the training run).  If that's the case, it seems like we could mitigate the problem by restricting GPT's training to textbooks full of knowledge we're very certain in (or fine-tuning GPT on such textbooks after the main training run, or simply increasing their weight in the loss function).  Or replace every phrase like "we don't yet know X" in GPT's training data with "X is a topic for a more advanced textbook", so GPT never ends up learning what humans are actually ignorant about.

Or simply use a prompt which starts with the letterhead for a university press release: "Top MIT scientists have made an important discovery related to X today..." Or a prompt which looks like the beginning of a Nature article. Or even: "Google has recently released a super advanced new AI system which is aligned with human values; given X, it says Y." (Boom! I solved the alignment problem! We thought about uploading a human, but uploading an FAI turned out to work much better.)

(Sorry if this comment came across as grumpy.  I'm very frustrated that so much upvoting/downvoting on LW seems to be based on what advances AI doom as a bottom line. It's not because I think superhuman AI is automatically gonna be safe. It's because I'd rather we did not get distracted by a notion of "the alignment problem" which OpenAI could likely solve with a few months of dedicated work on their API.)

My main response to this needs a post of its own, so I'm not going to argue it in detail here, but I'll give a summary and then address some tangential points.

Summary: the sense in which human values are "complex" is not about predictive power. A low-level physical model of some humans has everything there is to know about human values embedded within it; it has all the predictive power which can be had by a good model of human values. The hard part is pointing to the thing we consider "human values" embedded within that model. In large part, that's hard because it's not just a matter of predictive power.

Looking at it as semi-supervised learning: it's not actually hard for an unsupervised learner to end up with some notion of human values embedded in its world-model, but finding the embedding is hard, and it's hard in a way which cannot-even-in-principle be solved by supervised learning (because that would reduce it to predictive power).

On to tangential points...

We've gotten from "the alignment problem is about complexity of value" to "the alignment problem is about programming by example" (also known as "supervised learning", or Machine Learning 101).

Tangential point 1: a major thing which I probably didn't emphasize enough in the OP is the difference between "the problem of aligning with X" and "the problem of aligning with human values". For any goal, we can talk about the problem of aligning a system with that goal. So e.g. if I want a system to solve addition problems, then I can talk about aligning GPT-3 with that particular goal.

This is not the same problem as aligning with human values. And for very powerful AI, the system has to be aligned with human values, because aligning it with anything else gives us paperclip-style problems.

That said, it does seem like alignment problems in general have certain common features - i.e. there are some major common subproblems between aligning a system with the goal of solving addition problems and aligning a system with human values. That's why it makes sense to talk about "alignment problems" as a natural category of problems. To the extent that supervised/semi-supervised learning are ways of specifying what we want, we can potentially learn useful things about alignment problems in general by thinking about those approaches - in particular, we can notice a lot of failure modes that way!

The fact that GPT-3's API only does text completion is, in my view, basically just an API detail that we shouldn't particularly expect to be true of GPT-4 or GPT-5.  There's no reason why OpenAI couldn't offer an API which takes in a list of (x, y) pairs and then given some x it predicts y.  I expect if they chose to do this as a dedicated engineering effort, getting into the guts of the system as needed, and collected a lot of user feedback on whether the predicted y was correct for many different problems, they could exceed the performance gains you can currently get by manipulating the prompt.

Tangential point 2: interfaces are a scarce resource, and "programming by example" is a bad interface. Sure, the GPT team could offer this, but it would be a step backward for problems which are currently hard, not a step forward. The sort of problems which can be cheaply and usefully expressed as supervised learning problems already have lots of support; it's the rest of problem-space that's interesting here.

Specifically in the context of alignment-to-human-values, this goes back to the main point: "human values" are complex in a way which does not easily reduce to predictive power, so offering a standard supervised-learning-style interface does not really get us any closer to solving that problem.

(If I trained a supervised learning system to classify word vectors based on whether they're things that humans like or dislike, and the result didn't work very well, I can easily imagine some rationalists telling me this represented a failure to "solve the alignment problem"--even if the bottleneck was mainly in the word vectors themselves, as evidenced e.g. by large performance improvements on switching to higher-dimensional word vectors.)

Tangential point 3: the formulation in the OP specifically avoids that failure mode, and that's not an accident. Any failure which can be solved by using a system with better general-purpose predictive power is not an alignment failure. If a system is trained to predict X, then the "alignment problem" (at least as I'm using the phrase) is about aligning X with what we want, not about aligning the system itself. (I think this is also what we usually mean by "outer alignment".)

If it's hard to make a human-mimicking system do things which humans cannot do, why should we expect the capabilities of GPT-style models to cross beyond what humans can do in the first place?

Tangential point 4: at a bare minimum, GPT-style models should be able to do a lot of different things which a lot of different people can do, but which no single person can do. I find it plausible that GPT-3 has some such capabilities already - e.g. it might be able to translate between more different languages with better fluency than any single human.

Sorry if this comment came across as grumpy.

It was usefully and intelligently grumpy, which is plenty good enough for me.

Summary: the sense in which human values are "complex" is not about predictive power. A low-level physical model of some humans has everything there is to know about human values embedded within it; it has all the predictive power which can be had by a good model of human values. The hard part is pointing to the thing we consider "human values" embedded within that model. In large part, that's hard because it's not just a matter of predictive power.

1. This still sounds like a shift in arguments to me. From what I remember, the MIRI-sphere take on uploads is (was?): "if uploads come before AGI, that's probably a good thing, as long as it's a sufficiently high-fidelity upload of a benevolent individual, and the technology is not misused prior to that person being uploaded". (Can probably dig up some sources if you don't believe me.)

2. I still don't buy it. Your argument proves too much--how is it that transfer learning works? Seems that pointing to relevant knowledge embedded in an ML model isn't super hard in practice.

Tangential point 1: a major thing which I probably didn't emphasize enough in the OP is the difference between "the problem of aligning with X" and "the problem of aligning with human values". For any goal, we can talk about the problem of aligning a system with that goal. So e.g. if I want a system to solve addition problems, then I can talk about aligning GPT-3 with that particular goal.

This is not the same problem as aligning with human values.

Is there a fundamental difference? You say: 'The hard part is pointing to the thing we consider "human values" embedded within that model.' What is it about pointing to the thing we consider "human values" which makes it fundamentally different from pointing to the thing we consider a dog?

The main possible reason I can think of is because a dog is in some sense a more natural category than human values. That there are a bunch of different things which are kind of like human values, but not quite, and one has to sort through a large number of them in order to pinpoint the right one ("will the REAL human values please stand up?") (I'm not sure I buy this, but it's the only way I see for your argument to make sense.)

As an example of something which is not a natural category, consider a sandwich. Or, to an even greater degree: "Tasty ice cream flavors" is not a natural category, because everyone has their own ice cream preferences.

And for very powerful AI, the system has to be aligned with human values

Disagree, you could also align it with corrigibility.

"programming by example" is a bad interface

A big part of this post is about how people are trying to shoehorn programming by example into text completion, wasn't it? What do you think a good interface would be?

The sort of problems which can be cheaply and usefully expressed as supervised learning problems already have lots of support

I think perhaps you're confusing the collection of ML methods that conventionally fall under the umbrella of "supervised learning", and the philosophical task of predicting (x, y) pairs. As an example, from a philosophical perspective, I could automate away most software development if I could train a supervised learning system where x=the README of a Github project and y=the code for that Github project. But most of the ML methods that come to the mind of an average ML person when I say "supervised learning" are not remotely up to that task.

(BTW note that such a collection of README/code pairs comes pretty close to pinpointing the notion of "do what I mean". Which could be a very useful building block--remember, a premise of the classic AI safety arguments is that "do what I mean" doesn't exist. Also note that quality code on Github is a heck of a lot more interpretable than the weights in a neural net--and restricting the training set to quality code seems pretty easy to do.)

It was usefully and intelligently grumpy, which is plenty good enough for me.

Glad to hear it. I get so demoralized commenting on LW because it so often feels like a waste of time in retrospect.

The main possible reason I can think of is because a dog is in some sense a more natural category than human values. That there are a bunch of different things which are kind of like human values, but not quite, and one has to sort through a large number of them in order to pinpoint the right one ("will the REAL human values please stand up?")

This is close, though not exactly what I want to claim. It's not that "dogs" are a "more natural category" in the sense that there are fewer similar categories which are hard to tell apart. Rather, it's that "dogs" are a less abstract natural category. Like, "human" is a natural category in roughly the same way as "dog", but in order to get to "human values" we need a few more layers of abstraction on top of that, and some of those layers have different type signatures than the layers below - e.g. we need not just a notion of "human", but a notion of humans "wanting things", which requires an entirely different kind of model from recognizing dogs.

And that's exactly the kind of complexity which is hard for something based on predictive power. Lower abstraction levels should generally perform better in terms of raw prediction, but the thing we want to point to lives at a high abstraction level.

We are able to get systems to learn some abstractions just by limiting compute - today's deep nets have nowhere near the compute to learn to do Bayesian updates on low-level physics, so they need to learn some abstraction. But the exact abstraction level learned is always going to be a tradeoff between available compute and predictive power. I do think there's probably a wide range of parameters which would end up using the "right" level of abstraction for human values to be "natural", but we don't have a good way to recognize when that has/hasn't happened, and relying on it happening would be a crapshoot.

(Also, sandwiches are definitely a natural category. Just because a cluster has edge cases does not make it any less of a cluster, even if a bunch of trolls argue about the edge cases. "Tasty ice cream flavors" is also a natural category if we know who the speaker is, which is exactly how humans understand the phrase in practice.)

I think perhaps you're confusing the collection of ML methods that conventionally fall under the umbrella of "supervised learning", and the philosophical task of predicting (x, y) pairs...

Part of what I mean here by "already have lots of support" is that there's already a path to improvement on these sorts of problems, not necessarily that they're already all solved. The point is that we already have a working interface for this sort of thing, and for an awful lot of problems, the interface is what's hard.

As for how close a collection of README/code pairs comes to pinpointing "do what I mean"... imagine picking a random github repo, giving its README to a programmer, and asking them to write code implementing the behavior described. This is basically equivalent to a step in the waterfall model of development: create a specification, then have someone implement it in a single step without feedback. The general consensus among developers seems to be that this works very badly.

Indeed, this is an example of interfaces being hard: you'd think "I give you a spec, you give me code" would be a good interface, but it actually works pretty badly.

And that's exactly the kind of complexity which is hard for something based on predictive power. Lower abstraction levels should generally perform better in terms of raw prediction, but the thing we want to point to lives at a high abstraction level.

You told me that "it's not actually hard for an unsupervised learner to end up with some notion of human values embedded in its world-model".  Now you're saying that things based on "predictive power" have trouble learning things at high abstraction levels.  Doesn't this suggest that your original statement is wrong, and the predict-the-next-word training method used by GPT-3 means it will not develop notions such as human value?

(BTW, I think this argument proves too much.  Recognizing a photo of a dog does require learning various lower-level abstractions such as legs, nose, ears, etc. which in turn require even lower-level abstractions such as fur textures and object boundaries.  In any case, if you think things based on "predictive power" have trouble learning things at high abstraction levels, that suggests that it should also have trouble understanding e.g. Layer 7 in the OSI networking model.)

We are able to get systems to learn some abstractions just by limiting compute - today's deep nets have nowhere near the compute to learn to do Bayesian updates on low-level physics, so they need to learn some abstraction. But the exact abstraction level learned is always going to be a tradeoff between available compute and predictive power. I do think there's probably a wide range of parameters which would end up using the "right" level of abstraction for human values to be "natural", but we don't have a good way to recognize when that has/hasn't happened, and relying on it happening would be a crapshoot.

It sounds like you're talking about the bias/variance tradeoff?  The standard solution is to use cross validation, do you have any reason to believe it wouldn't work here?

The point is that we already have a working interface for this sort of thing, and for an awful lot of problems, the interface is what's hard.

I'm very unpersuaded that interfaces are the hard part of creating superhuman AI.

As for how close a collection of README/code pairs comes to pinpointing "do what I mean"... imagine picking a random github repo, giving its README to a programmer, and asking them to write code implementing the behavior described. This is basically equivalent to a step in the waterfall model of development: create a specification, then have someone implement it in a single step without feedback. The general consensus among developers seems to be that this works very badly.

I mean, if you don't like the result, tweak the README and run the code-generating AI a second time.

The reason waterfall sucks is because human programmers are time-consuming and costly, and people who hire them don't always know precisely what they want.  And testing intermediate versions of the software can help them figure that out.  But if generating a final version of the software is costless, there's no reason not to test the final version instead of an intermediate version.

You told me that "it's not actually hard for an unsupervised learner to end up with some notion of human values embedded in its world-model".  Now you're saying that things based on "predictive power" have trouble learning things at high abstraction levels.  Doesn't this suggest that your original statement is wrong, and the predict-the-next-word training method used by GPT-3 means it will not develop notions such as human value?

The original example is a perfect example of what this looks like: an unsupervised learner, given crap-tons of data and compute, should have no difficulty learning a low-level physics model of humans. That model will have great predictive power, which is why the model will learn it. Human values will be embedded in that model in exactly the same way that they're embedded in physical humans.

Likewise, GPT-style models should have no trouble learning some model with human values embedded in it. But that embedding will not necessarily be simple; there won't just be a neuron that lights up in response to humans having their values met. The model will have a notion of human values embedded in it, but it won't actually use "human values" as an abstract object in its internal calculations; it will work with some lower-level "components" which themselves implement/embed human values.

It sounds like you're talking about the bias/variance tradeoff?  The standard solution is to use cross validation, do you have any reason to believe it wouldn't work here?

I am definitely not talking about bias-variance tradeoff. I am talking about compute-accuracy tradeoff. Again, think about the example of Bayesian updates on a low-level physical model: there is no bias-variance tradeoff there. It's the ideal model, full stop. The reason we can't use it is because we don't have that much compute. In order to get computationally tractable models, we need to operate at higher levels of abstraction than "simulate all these quantum fields".

I'm very unpersuaded that interfaces are the hard part of creating superhuman AI.

Aligning superhuman AI, not just creating it. If you're unpersuaded, you should go leave feedback on Alignment as Translation, which directly talks about alignment as an interface problem.

Likewise, GPT-style models should have no trouble learning some model with human values embedded in it. But that embedding will not necessarily be simple; there won't just be a neuron that lights up in response to humans having their values met. The model will have a notion of human values embedded in it, but it won't actually use "human values" as an abstract object in its internal calculations; it will work with some lower-level "components" which themselves implement/embed human values.

If it's read moral philosophy, it should have some notion of what the words "human values" mean.

In any case, I still don't understand what you're trying to get at. Suppose I pretrain a neural net to differentiate lots of non-marsupial animals. It doesn't know what a koala looks like, but it has some lower-level "components" which would allow it to characterize a koala. Then I use transfer learning and train it to differentiate marsupials. Now it knows about koalas too.

This is actually a tougher scenario than what you're describing (GPT will have seen human values yet the pretrained net hasn't seen koalas in my hypothetical), but it's a boring application of transfer learning.

Locating human values might be trickier than characterizing a koala, but the difference seems quantitative, not qualitative.

If it's read moral philosophy, it should have some notion of what the words "human values" mean.

GPT-3 and systems like it are trained to mimic human discourse. Even if (in the limit of arbitrary computational power) it manages to encode an implicit representation of human values somewhere in its internal state, in actual practice there is nothing tying that representation to the phrase "human values", since moral philosophy is written by (confused) humans, and in human-written text the phrase "human values" is not used in the consistent, coherent manner that would be required to infer its use as a label for a fixed concept.

This is essentially the "tasty ice cream flavors" problem, am I right?  Trying to check if we're on the same page.

If so: John Wentworth said

"Tasty ice cream flavors" is also a natural category if we know who the speaker is

So how about instead of talking about "human values", we talk about what a particular moral philosopher endorses saying or doing, or even better, what a committee of famous moral philosophers would endorse saying/doing.

No, this is not the "tasty ice cream flavors" problem. The problem there is that the concept is inherently relative to a person. That problem could apply to "human values", but that's a separate issue from what dxu is talking about.

The problem is that "what a committee of famous moral philosophers would endorse saying/doing", or human written text containing the phrase "human values", is a proxy for human values, not a direct pointer to the actual concept. And if a system is trained to predict what the committee says, or what the text says, then it will learn the proxy, but that does not imply that it directly uses the concept.

Well, the moral judgements of a high-fidelity upload of a benevolent human are also a proxy for human values--an inferior proxy, actually.  Seems to me you're letting the perfect be the enemy of the good.

It doesn't matter how high-fidelity the upload is or how benevolent the human is, I'm not happy giving them the power to launch nukes without at least two keys, and a bunch of other safeguards on top of that. "Don't let the perfect be the enemy of the good" is advice for writing emails and cleaning the house, not nuclear security.

The capabilities of powerful AGI will be a lot more dangerous than nukes, and merit a lot more perfectionism.

Humans themselves are not aligned enough that I would be happy giving them the sort of power that AGI will eventually have. They'd probably be better than many of the worst-case scenarios, but they still wouldn't be a best or even good scenario. Humans just don't have the processing power to avoid shooting themselves (and the rest of the universe) in the foot sooner or later, given that kind of power.

It doesn't matter how high-fidelity the upload is or how benevolent the human is, I'm not happy giving them the power to launch nukes without at least two keys, and a bunch of other safeguards on top of that.

Here are some of the people who have the power to set off nukes right now:

• Donald Trump

• Kim Jong-un

• Both parties in this conflict

"Don't let the perfect be the enemy of the good" is advice for writing emails and cleaning the house, not nuclear security.

“A good plan violently executed now is better than a perfect plan executed at some indefinite time in the future.” - George Patton

Just because it's in your nature (and my nature, and the nature of many people who read this site) to be a cautious nerd, does not mean that the cautious nerd orientation is always the best orientation to have.

In any case, it may be that the annual amount of xrisk is actually quite low, and no one outside the rationalist community is smart enough to invent AGI, and we have all the time in the world. In which case, yes, being perfectionistic is the right strategy. But this still seems to represent a major retreat from the AI doomist position that AI doom is the default outcome. It's a classic motte-and-bailey:

"It's very hard to build an AGI which isn't a paperclipper!"

"Well actually here are some straightforward ways one might be able to create a helpful non-paperclipper AGI..."

"Yeah but we gotta be super perfectionistic because there is so much at stake!"

Your final "humans will misuse AI" worry may be justified, but I think naive deployment of this worry is likely to be counterproductive. Suppose there are two types of people, "cautious" and "incautious". Suppose that the "humans will misuse AI" worry discourages cautious people from developing AGI, but not incautious people. So now we're in a world where the first AGI is most likely controlled by incautious people, making the "humans will misuse AI" worry even more severe.

Humans just don't have the processing power to avoid shooting themselves (and the rest of the universe) in the foot sooner or later, given that kind of power.

If you're willing to grant the premise of the technical alignment problem being solved, shooting oneself in the foot would appear to be much less of a worry, because you can simply tell your FAI "please don't let me shoot myself in the foot too badly", and it will prevent you from doing that.

It's a classic motte-and-bailey:

"It's very hard to build an AGI which isn't a paperclipper!"

"Well actually here are some straightforward ways one might be able to create a helpful non-paperclipper AGI..."

"Yeah but we gotta be super perfectionistic because there is so much at stake!"

There is a single coherent position here in which it is very hard to build an AGI which reliably is not a paperclipper. Yes, there are straightforward ways one might be able to create a helpful non-paperclipper AGI. But that "might" is carrying a lot of weight. All those straightforward ways have failure modes which will definitely occur in at least some range of parameters, and we don't know exactly what those parameter ranges are.

It's sort of like saying:

"It's very hard to design a long bridge which won't fall down!"

"Well actually here are some straightforward ways one might be able to create a long non-falling-down bridge..." <shows picture of a wooden truss>

What I'm saying is, that truss design is 100% going to fail once it gets big enough, and we don't currently know how big that is. When I say "it's hard to design a long bridge which won't fall down", I do not mean a bridge which might not fall down if we're lucky and just happen to be within the safe parameter range.

In any case, it may be that the annual amount of xrisk is actually quite low, and no one outside the rationalist community is smart enough to invent AGI, and we have all the time in the world. In which case, yes, being perfectionistic is the right strategy. But this still seems to represent a major retreat from the AI doomist position that AI doom is the default outcome.

These are sufficient conditions for a careful strategy to make sense, not necessary conditions. Here's another set of sufficient conditions, which I find more realistic: the gains to be had in reducing AI risk are binary. Either we find the "right" way of doing things, in which case risk drops to near-zero, or we don't, in which case it's a gamble and we don't have much ability to adjust the chances/payoff. There are no significant marginal gains to be had.

There is a single coherent position here in which it is very hard to build an AGI which reliably is not a paperclipper.

This is simultaneously

• a major retreat from the "default outcome is doom" thesis which is frequently trotted out on this site (the statement is consistent with an AGI design that's 99.9% likely to be safe, which is very much incompatible with "default outcome is doom")
• unrelated to our upload discussion (an upload is not an AGI, but you said even a great upload wasn't good enough for you)

You've picked a position vaguely in between the motte and the bailey and said "the motte and the bailey are both equivalent to this position!"  That doesn't look at all true to me.

All those straightforward ways have failure modes which will definitely occur in at least some range of parameters, and we don't know exactly what those parameter ranges are.

This is a very strong claim which to my knowledge has not been well-justified anywhere.  Daniel K agreed with me the other day that there isn't a standard reference for this claim.  Do you know of one?

There are a couple problems I see here:

• Simple is not the same as obvious.  Even if someone at some point tried to think of every obvious solution and justifiably discarded them all, there are probably many "obvious" solutions they didn't think of.
• Nothing ever gets counted as evidence against this claim.  Simple proposals get rejected on the basis that everyone knows simple proposals won't work.

A MIRI employee openly admitted here that they apply different standards of evidence to claims of safety vs claims of not-safety.  Maybe there are good arguments for that, but the problem is that if you're not careful, your view of reality is gonna get distorted.  Which means community wisdom on claims such as "simple solutions never work" is likely to be systematically wrong.  "Everyone knows X", without a good written defense of X, or a good answer to "what would change the community's mind about X", is fertile ground for information cascades etc.  And this is on top of standard ideological homophily problems (the AI safety community is a very self-selected subset of the broader AI research world).

What I'm saying is, that truss design is 100% going to fail once it gets big enough, and we don't currently know how big that is. When I say "it's hard to design a long bridge which won't fall down", I do not mean a bridge which might not fall down if we're lucky and just happen to be within the safe parameter range.

My perception of your behavior in this thread is: instead of talking about whether the bridge can be extended, you changed the subject and explained that the real problem is that the bridge has to support very heavy trucks.  This is logically rude.  And it makes it impossible to have an in-depth discussion about whether the bridge design can actually be extended or not.  From my perspective, you've pulled this conversational move multiple times in this thread.  It seems to be pretty common when I have discussions with AI safety people.  That's part of why I find the discussions so frustrating.  My view is that this is a cultural problem which has to be solved for the AI safety community to do much useful AI safety work (as opposed to "complaining about how hard AI safety is" work, which is useful but insufficient).

Anyway, I'll let you have the last word in this thread.

This is a very strong claim which to my knowledge has not been well-justified anywhere.  Daniel K agreed with me the other day that there isn't a standard reference for this claim.  Do you know of one?

There isn't a standard reference because the argument takes one sentence, and I've been repeating it over and over again: what would Bayesian updates on low-level physics do? That's the unique solution with best-possible predictive power, so we know that anything which scales up to best-possible predictive power in the limit will eventually behave that way.

My perception of your behavior in this thread is: instead of talking about whether the bridge can be extended, you changed the subject and explained that the real problem is that the bridge has to support very heavy trucks.  This is logically rude.  And it makes it impossible to have an in-depth discussion about whether the bridge design can actually be extended or not.

The "what would Bayesian updates on a low-level model do?" question is exactly the argument that the bridge design cannot be extended indefinitely, which is why I keep bringing it up over and over again.

This does point to one possibly-useful-to-notice ambiguous point: the difference between "this method would produce an aligned AI" vs "this method would continue to produce aligned AI over time, as things scale up". I am definitely thinking mainly about long-term alignment here; I don't really care about alignment on low-power AI like GPT-3 except insofar as it's a toy problem for alignment of more powerful AIs (or insofar as it's profitable, but that's a different matter).

I've been less careful than I should be about distinguishing these two in this thread. All these things which we're saying "might work" are things which might work in the short term on some low-power AI, but will definitely not work in the long term on high-power AI. That's probably part of why it seems like I keep switching positions - I haven't been properly distinguishing when we're talking short-term vs long-term.

A second comment on this:

instead of talking about whether the bridge can be extended, you changed the subject and explained that the real problem is that the bridge has to support very heavy trucks

If we want to make a piece of code faster, the first step is to profile the code to figure out which step is the slow one. If we want to make a beam stronger, the first step is to figure out where it fails. If we want to extend a bridge design, the first step is to figure out which piece fails under load if we just elongate everything.

Likewise, if we want to scale up an AI alignment method, the first step is to figure out exactly how it fails under load as the AI's capabilities grow.

I think you currently do not understand the failure mode I keep pointing to by saying "what would Bayesian updates on low-level physics do?". Elsewhere in the thread, you said that optimizing "for having a diverse range of models that all seem to fit the data" would fix the problem, which is my main evidence that you don't understand the problem. The problem is not "the data underdetermines what we're asking for", the problem is "the data fully determines what we're asking for, and we're asking for a proxy rather than the thing we actually want".

Locating human values might be trickier than characterizing a koala, but the difference seems quantitative, not qualitative.

I generally agree with this. The things I'm saying about human values also apply to koala classification. As with koalas, I do think there's probably a wide range of parameters which would end up using the "right" level of abstraction for human values to be "natural". On the other hand, for both koalas and humans, we can be fairly certain that a system will stop directly using those concepts once it has sufficient available compute - again, because Bayesian updates on low-level physics are just better in terms of predictive power.

Right now, we have no idea when that line will be crossed - just an extreme upper bound. We have no idea how wide/narrow the window of training parameters is in which either "koalas" or "human values" is a natural level of abstraction.

It doesn't know what a koala looks like, but it has some lower-level "components" which would allow it to characterize a koala. Then I use transfer learning and train it to differentiate marsupials. Now it knows about koalas too.

Ability to differentiate marsupials does not imply that the system is directly using the concept of koala. Yet again, consider how Bayesian updates on low-level physics would respond to the marsupial-differentiation task: it would model the entire physical process which generated the labels on the photos/videos. "Physical process which generates the label koala" is not the same as "koala", and the system can get higher predictive power by modelling the former rather than the latter.

When we move to human values, that distinction becomes a lot more important: "physical process which generates the label 'human values satisfied'" is not the same as "human values satisfied". Confusing those two is how we get Goodhart problems.
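
To make the distinction concrete, here is a minimal Python sketch. Everything in it is a made-up toy: `true_concept` stands in for "koala", `labeler` stands in for the physical process which generates the label, and the mislabeled band between 0.5 and 0.6 is an assumed labeler quirk, not anything measured. The point is only that, scored against the labels we actually have, the model of the labeling process beats the model of the concept.

```python
def true_concept(x):
    # Hypothetical ground truth: the thing we actually care about ("koala").
    return x > 0.5

def labeler(x):
    # Hypothetical label-generating process: tracks the concept, except for
    # an assumed systematic error on a narrow band of inputs.
    if 0.5 < x <= 0.6:
        return False
    return true_concept(x)

def accuracy(model, data):
    # Fraction of examples on which the model reproduces the given label.
    return sum(model(x) == y for x, y in data) / len(data)

# Training labels come from the labeling process, not from the concept itself.
xs = [i / 1000 for i in range(1000)]
train = [(x, labeler(x)) for x in xs]

acc_concept = accuracy(true_concept, train)  # model that uses the concept
acc_process = accuracy(labeler, train)       # model of the labeling process

print(acc_concept, acc_process)
```

A learner selected purely on predictive power over `train` prefers the second model, even though the first is the one we wanted.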

We don't need to go all the way to low-level physics models in order for all of that to apply. In order for a system to directly use the concept "koala", rather than "physical process which generates the label koala", it has to be constrained on compute in a way which makes the latter too expensive - despite the latter having higher predictive power on the training data. Adding in transfer learning on some lower-level components does not change any of that; it should still be possible to use those lower-level components to model the physical process which generates the label koala without directly reasoning about koalas.

I've now written essentially the same response at least four times to your objections, so I recommend applying the general pattern yourself:

• Consider how Bayesian updates on a low-level physics model would behave on whatever task you're considering. What would go wrong?
• Next, imagine a more realistic system (e.g. current ML systems) failing in an analogous way. What would that look like?
• What's preventing ML systems from failing in that way already? The answer is probably "they don't have enough compute to get higher predictive power from a less abstract model" - which means that, if things keep scaling up, sooner or later that failure will happen.

You say: "we can be fairly certain that a system will stop directly using those concepts once it has sufficient available compute".  I think this depends on specific details of how the system is engineered.

"Physical process which generates the label koala" is not the same as "koala", and the system can get higher predictive power by modelling the former rather than the latter.

Suppose we use classification accuracy as our loss function.  If all the koalas are correctly classified by both models, then the two models have equal loss function scores.  I suggested that at that point, we use some kind of active learning scheme to better specify the notion of "koala" or "human values" or whatever it is that we want.  Or maybe just be conservative, and implement human values in a way that all our different notions of "human values" agree with.
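
Here is a rough Python sketch of what I mean, under toy assumptions: `model_a` and `model_b` are two hypothetical notions of "koala" defined on a single made-up feature, and `conservative` is an illustrative helper that abstains whenever they disagree. Both models get the same score on the training set, so the loss alone cannot separate them.

```python
def model_a(x):
    # Hypothetical notion 1: anything at or above the threshold counts.
    return x >= 3

def model_b(x):
    # Hypothetical notion 2: identical on the training range, different beyond it.
    return 3 <= x <= 10

# Toy training set; both notions classify every example correctly.
train = [(0, False), (1, False), (2, False), (3, True), (4, True), (5, True)]

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

def conservative(models, x):
    """Return the shared verdict if every model agrees, else None (abstain)."""
    verdicts = {m(x) for m in models}
    return verdicts.pop() if len(verdicts) == 1 else None

models = [model_a, model_b]
print(accuracy(model_a, train), accuracy(model_b, train))
print(conservative(models, 4), conservative(models, 20))
```

Inputs where `conservative` returns `None` are exactly the ones where we would want to ask for more labels before acting.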

You seem to be imagining a system that throws out all of its more abstract notions of "koala" once it has the capability to do Bayesian updates on low-level physics.  I don't see why we should engineer our system in this way.  My expectation is that human brains have many different computational notions of any given concept, similar to an ensemble (for example, you might give me a precise definition of a sandwich, and I show you something and you're like "oh actually that is/is not a sandwich, guess my definition was wrong in this case"--which reveals you have more than one way of knowing what "a sandwich" is), and AGI will work the same way (at least, that's how I would design it!)

I've now written essentially the same response at least four times to your objections

I was trying to understand what you were getting at.  This new argument seems pretty different from the "alignment is mainly about the prompt" thesis in your original post--another shift in arguments?  (I don't necessarily think it is bad for arguments to shift, I just think people should acknowledge that's going on.)

You seem to be imagining a system that throws out all of its more abstract notions of "koala" once it has the capability to do Bayesian updates on low-level physics.  I don't see why we should engineer our system in this way.

It's certainly conceivable to engineer systems some other way, and indeed I hope we do. Problem is:

• if we just optimize for predictive power, then abstract notions will definitely be thrown away once the system can discover and perform Bayesian updates on low-level physics. (In principle we could engineer a system which never discovers that, but then it will still optimize predictive power by coming as close as possible.)
• if we're not just optimizing for predictive power, then we need some other design criteria, some other criteria for whether/how well the system is working.

In one sense, the goal of all this abstract theorizing is to identify what those other criteria need to be in order to reliably end up using the "right" abstractions in the way we want. We could probably make up some ad-hoc criteria which work at least sometimes, but then as architectures and hardware advance over time we have no idea when those criteria will fail.

for example, you might give me a precise definition of a sandwich, and I show you something and you're like "oh actually that is/is not a sandwich, guess my definition was wrong in this case"--which reveals you have more than one way of knowing what "a sandwich" is

(Probably tangential) No, this reveals that my verbal definition of a sandwich was not a particularly accurate description of my underlying notion of sandwich - which is indeed the case for most definitions most of the time. It certainly does not prove the existence of multiple ways of knowing what a sandwich is.

Also, even if there's some sort of ensembling, the concept "sandwich" still needs to specify one particular ensemble.

This new argument seems pretty different from the "alignment is mainly about the prompt" thesis in your original post--another shift in arguments?

We've shifted to arguing over a largely orthogonal topic. The OP is mostly about the interface by which GPT can be aligned to things. We've shifted to talking about what alignment means in general, and what's hard about aligning systems to the kinds of things we want. An analogy: the OP was mostly about programming in a particular language, while our current discussion is about what kinds of algorithms we want to write.

Prompts are a tool/interface via which one can align a certain kind of system (i.e. GPT-3) with certain kinds of goals (addition, translation, etc). Our current discussion is about the properties of a certain kind of goal - goals which are abstract in an analogous way to human values.

if we're not just optimizing for predictive power, then we need some other design criteria, some other criteria for whether/how well the system is working.

Optimize for having a diverse range of models that all seem to fit the data.

How would that fix any of the problems we've been talking about?

To put it another way:

What semisupervised learning and transfer learning have in common is: You find a learning problem you have a lot of data for, such that training a learner for that problem will incidentally cause it to develop generally useful computational structures (often people say "features" but I'm trying to take more of an open-ended philosophical view).  Then you re-use those computational structures in a supervised learning context to solve a problem you don't have a lot of data for.

From an AI safety perspective, there are a couple obvious ways this could fail:

• Training a learner for the problem with lots of data might cause it to develop the wrong computational structures.  (Example: GPT-3 learns a meaning of the word "love" which is subtly incorrect.)
• While attempting to re-use the computational structures, you end up pinpointing the wrong one, even though the right one exists.  (Example: computational structures for both "Effective Altruism" and "maximize # grandchildren" have been learned correctly, but your provided x/y pairs which are supposed to indicate human values don't allow for differentiating between the two, and your system arbitrarily chooses "maximize # grandchildren" when what you really wanted was "Effective Altruism").

I don't think this post makes a good argument that we should expect the second problem to be more difficult in general.  Note that, for example, it's not too hard to have your system try to figure out where the "Effective Altruism" and "maximize # grandchildren" theories of how (x, y) arose differ, and query you on those specific data points ("active learning" has 62,000 results on Google Scholar).
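
As a minimal sketch of that kind of querying (a toy version of query-by-disagreement, not a claim about how any real system does it): the candidate "theories" below are hypothetical threshold rules, `oracle` stands in for the human answering queries, and the true threshold of 6 is an assumption made up for the example. The learner repeatedly asks about the point where the surviving theories split most evenly, then discards the theories that the answer rules out.

```python
def make_threshold(t):
    # One hypothetical "theory" of how the (x, y) pairs arose: y is True iff x >= t.
    return lambda x: x >= t

hypotheses = {t: make_threshold(t) for t in range(0, 11)}
labeled = [(2, False), (8, True)]   # the initial, ambiguous (x, y) pairs
pool = list(range(0, 11))           # points we are allowed to ask about

def oracle(x):
    # Stand-in for the human being queried; the true threshold 6 is assumed.
    return x >= 6

def version_space(hyps, data):
    # Keep only the theories consistent with every labeled pair so far.
    return {t: h for t, h in hyps.items() if all(h(x) == y for x, y in data)}

def most_contested(hyps, candidates):
    # Pick the point on which the surviving theories split most evenly.
    def split(x):
        yes = sum(h(x) for h in hyps.values())
        return min(yes, len(hyps) - yes)
    return max(candidates, key=split)

alive = version_space(hypotheses, labeled)
queries = []
while len(alive) > 1:
    x = most_contested(alive, pool)
    queries.append(x)
    labeled.append((x, oracle(x)))
    alive = version_space(hypotheses, labeled)

print(queries, list(alive))
```

In this toy setup two queries are enough to go from six surviving theories to one, because each query is chosen where the theories actually differ.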

Incidentally, I'm most worried about non-obvious failure modes, I expect obvious failure modes to get a lot of attention.  (As an example of a non-obvious thing that could go wrong, imagine a hypothetical super-advanced AI that queries you on some super enticing scenario where you become global dictator, in order to figure out if the (x, y) pairs it's trying to predict correspond to a person who outwardly behaves in an altruistic way, but is secretly an egoist who will succumb to temptation if the temptation is sufficiently strong.  In my opinion the key problem is to catalogue all the non-obvious ways in which things could fail like this.)

This is almost, but not quite, the division of failure-modes which I see as relevant. If my other response doesn't clarify sufficiently, let me know and I'll write more of a response here.

I think it's important to take a step back and notice how AI risk-related arguments are shifting.
In the sequences, a key argument (probably the key argument) for AI risk was the complexity of human value, and how it would be highly anthropomorphic for us to believe that our evolved morality was embedded in the fabric of the universe in a way that any intelligent system would naturally discover.  An intelligent system could just as easily maximize paperclips, the argument went.
No one seems to have noticed that GPT actually does a lot to invalidate the original complexity-of-value-means-FAI-is-super-difficult argument.

As far as I can see, GPT-3 did absolutely nothing to invalidate the argument about complexity of value. GPT-3 is able to correctly predict the kind of things we want it to predict within a context window of at most 1000 words, on a very small time scale. So it can predict what we want it to do in basically one or a couple of abstract steps. That seems to guarantee nothing whatsoever about the ability of GPT-3 to infer our exact values at the time scales and complexity relevant even to human-level AI, let alone AGI.

But I'm very interested in any experience that seems to invalidate this point.

I'm not claiming GPT-3 understands human values, I'm saying it's easy to extrapolate from GPT-3 to a future GPT-N system which basically does.

Curated. Simple, crucially important point, I'm really glad you wrote it up.

There are infinitely many distributions from which the training data of GPT could have been sampled [EDIT: including ones that could be catastrophic as the distribution our AGI learns], so it's worth mentioning an additional challenge on this route: making the future AGI-level-GPT learn the "human writing distribution" that we have in mind.