the scaling “inconsistency”: openAI’s new insight

[-]gwern5y*100

This makes sense to me and is what I've been considering as the implication of sample-efficiency (one of the blessings of scale), coming at it from another direction of meta-learning/Bayesian RL: if your model gets more sample-efficient as it gets larger & n gets larger, it's because it's increasingly approaching a Bayes-optimal learner and so it gets more out of the more data, but then when you hit the Bayes-limit, how are you going to learn more from each datapoint? You have to switch over to a different and inferior scaling law. You can't squeeze blood from a stone; once you approach the intrinsic entropy, there's not much to learn. Steeply diminishing returns are built into compiling large text datasets and just training on random samples. It looks like the former is the regime we've been in up to GPT-3 and beyond, and the latter is when the slower data-only scaling kicks in.

Aside from multimodal approaches, the crossover raises the question of whether it becomes time to invest in improvements like active learning. What scaling curve in L(D)/L(C) could we get with even a simple active learning approach like running a small GPT over Common Crawl and throwing out datapoints which are too easily predicted?
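
A minimal sketch of the filtering idea in the question above, under loud assumptions: a smoothed unigram model stands in for the "small GPT", whitespace splitting stands in for tokenization, and the threshold and the function names (`train_unigram`, `mean_loss`, `filter_easy`) are hypothetical choices for illustration. It only shows the shape of the filter (score each datapoint by average per-token loss, drop the ones below a threshold) and says nothing about what scaling curve would result.

```python
import math
from collections import Counter

def train_unigram(corpus, smoothing=1.0):
    """Fit an add-one-smoothed unigram model (toy stand-in for a small GPT)."""
    counts = Counter(tok for doc in corpus for tok in doc.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def logprob(tok):
        return math.log((counts.get(tok, 0) + smoothing) / (total + smoothing * vocab))

    return logprob

def mean_loss(doc, logprob):
    """Average negative log-likelihood per token (in nats) of one document."""
    toks = doc.split()
    if not toks:
        return 0.0  # treat empty documents as trivially predictable
    return -sum(logprob(t) for t in toks) / len(toks)

def filter_easy(docs, logprob, threshold):
    """Keep only documents the small model finds hard (loss above threshold)."""
    return [d for d in docs if mean_loss(d, logprob) > threshold]

# Toy usage: the scoring model has only ever seen one repetitive sentence.
reference = ["the cat sat on the mat"] * 50
scorer = train_unigram(reference)
candidates = ["the cat sat on the mat", "quantum chromodynamics lattice gauge"]
kept = filter_easy(candidates, scorer, threshold=3.0)
```

In this toy run the repetitive sentence scores well under the threshold and is dropped, while the sentence made of unseen tokens is kept. A real version would need a real language model as the scorer and some care that "hard to predict" does not just select noise.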

[-]David Scott Krueger (formerly: capybaralet)5y30

if your model gets more sample-efficient as it gets larger & n gets larger, it's because it's increasingly approaching a Bayes-optimal learner and so it gets more out of the more data, but then when you hit the Bayes-limit, how are you going to learn more from each datapoint? You have to switch over to a different and inferior scaling law. You can't squeeze blood from a stone; once you approach the intrinsic entropy, there's not much to learn.

I found this confusing. It sort of seems like you're assuming that a Bayes-optimal learner achieves the Bayes error rate (are you?), which seems wrong to me.

What do you mean "the Bayes-limit"? At first, I assumed you were talking about the Bayes error rate (https://en.wikipedia.org/wiki/Bayes_error_rate), but that is (roughly) the error you could expect to achieve with infinite data, and we're still talking about finite data.
What do you mean "Bayes-optimal learner"? I assume you just mean something that performs Bayes rule exactly (so depends on the prior/data).
I'm confused by you talking about "approach[ing] the intrinsic entropy"... it seems like the figure in OP shows L(C) approaching L(D). But is L(D) supposed to represent intrinsic entropy? Should we trust it as an estimate of intrinsic entropy?
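
One way to make the distinction in these questions concrete is a toy coin-flip model. The sketch below is an illustration under stated assumptions, not anything from the original post: the data source is a biased coin with a hypothetical bias `p = 0.7`, and the learner applies Bayes' rule exactly with a uniform Beta(1, 1) prior, so its prediction after seeing `k` heads in `n` flips is `(k + 1) / (n + 2)`. The expected log loss of that learner stays strictly above the coin's entropy at any finite `n` and only approaches it as `n` grows.

```python
import math

def entropy(p):
    """Intrinsic entropy (in nats) of a coin with heads-probability p."""
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def expected_bayes_loss(p, n):
    """Expected log loss on the next flip for an exact Bayesian learner with a
    uniform Beta(1, 1) prior, averaged over all datasets of n flips."""
    total = 0.0
    for k in range(n + 1):
        # log of the Binomial(n, p) probability of seeing k heads
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log(1 - p))
        q = (k + 1) / (n + 2)  # posterior predictive probability of heads
        loss = -p * math.log(q) - (1 - p) * math.log(1 - q)
        total += math.exp(log_pmf) * loss
    return total

p = 0.7  # hypothetical true bias
floor = entropy(p)
gaps = {n: expected_bayes_loss(p, n) - floor for n in (10, 100, 1000)}
```

Here `gaps` holds the excess loss over the entropy floor for each dataset size. In this toy setting the learner is Bayes-optimal with respect to its prior at every `n`, yet the gap is positive for each of them and shrinks as `n` increases, which is the sense in which being a Bayes-optimal learner and having reached the irreducible error are different things.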

I also don't see how active learning is supposed to help (unless you're talking about actively generating data)... I thought the whole point you were trying to make is that once you reach the Bayes error rate there's literally nothing you can do to keep improving without more data.
You talk about using active learning to throw out data-points... but I thought the problem was not having enough data? So how is throwing out data supposed to help with that?

[-]nostalgebraist5y30

What scaling curve in L(D)/L(C) could we get with even a simple active learning approach like running a small GPT over Common Crawl and throwing out datapoints which are too easily predicted?

IIUC, this is trying to make L(D) fall faster by making every data point more impactful (at lowering test loss). This will help if

1. you get most of the way to intrinsic entropy L(D) on your first pass over D points
2. you can downsample your full dataset without lowering the total number of examples seen in training, i.e. you have too many points to do one full epoch over them

I can imagine this regime becoming the typical one for non-text modalities like video that have huge data with lots of complex redundancy (which the model will learn to compress).

With text data, though, I'm concerned that (2) will fail soon.

The number of train steps taken by GPT-3 was the same order of magnitude as the size of Common Crawl. I haven't seen convincing evidence that comparably good/diverse text datasets can be constructed which are 10x this size, 100x, etc. The Pile is an interesting experiment, but they're mostly adding large quantities of single-domain text like Github, which is great for those domains but won't help outside them.

[-]gwern5y*40

The Pile is an interesting experiment, but they're mostly adding large quantities of single-domain text like Github, which is great for those domains but won't help outside them.

I disagree. Transfer learning is practically the entire point. 'Blessings of scale' etc.

[-]nostalgebraist5y50

I disagree. Transfer learning is practically the entire point. 'Blessings of scale' etc.

Sure -- my point is to contrast two cases:

a counterfactual world with a much larger "regular" web, so WebText and Common Crawl are 1000x their real size
the real world, where we have to go beyond "regular" web scrapes to add orders of magnitude

Many, including OpenAI, argue that general web crawls are a good way to get high domain diversity for free. This includes domains the researchers would never have come up with themselves.

If we switch to manually hunting down large specialized datasets, this will definitely help, but we're no longer getting broad domain coverage for free. At best we get broad domain coverage through manual researcher effort and luck, at worst we don't get it at all.

I see your point about active learning "telling us" when we need more data -- that's especially appealing if it can point us to specific domains where more coverage would help.

[-]gwern5y*40

I think I see 'domain-specific datasets' as broader than you do. You highlight Github, and yet, when I think of Github, I think of thousands of natural and artificial languages, tackling everything related to software in the world (which is increasingly 'everything'), by millions of people, doing things like uploading banned books for evading the Great Firewall or organizing protests against local officials, filing bugs and discussing things back and forth, often adversarially, all reliant on common sense and world knowledge. A GPT trained on Github at hundreds of gigabytes I would expect to induce meta-learning, reasoning, and everything else, for exactly the same reasons CC/books1/books2/WP do; yes, it would know 'source code' well (not a trivial thing in its own right), but that is a mirror of the real world. I see plenty of broad domain coverage from 'just' Github, or 'just' Arxiv. (Literotica, I'm less sure about.) I don't see Github as having much of a disadvantage over CC in terms of broadness or what a model could learn from it. Indeed, given what we know about CC's general quality and how default preprocessing can screw it up (I see a lot of artifacts in GPT-3's output I think are due to bad preprocessing), I expect Github to be more useful than an equivalent amount of CC!

(It's true Codex does not do this sort of thing beyond what it inherits from GPT-3 pretraining. But that's because it is aimed solely at programming, and so they deliberately filter out most of Github by trying to detect Python source files and throw away everything else etc etc, not because there's not an extremely diverse set of data available on raw Github.)

The big advantage of Common Crawl over a Github scrape is that, well, CC already exists. Someone has to invest the effort at some point for all datasets, after all. You can go download pre-cleaned versions of it - aside from EleutherAI's version (which they expect to be substantially better than CC on a byte for byte basis), Facebook and Google recently released big multilingual CC. But of course, now that they've done it and added it to the Pile, that's no longer a problem.

[-]moridinamael5y80

I really appreciated the degree of clarity and the organization of this post.

I wonder how much the slope of L(D) is a consequence of the structure of the dataset, and whether we have much power to meaningfully shift the nature of L(D) for large datasets. A lot of the structure of language is very repetitive, and once it is learned, the model doesn't learn much from seeing more examples of the same sort of thing. But, within the dataset are buried very rare instances of important concept classes. (In other words, the Common Crawl data has a certain perplexity, and that perplexity is a function of both how much of the dataset is easy/broad/repetitive/generic and how much is hard/narrow/unique/specific.) For example: I can't, for the life of me, get GPT-3 to give correct answers on the following type of prompt:

You are facing north. There is a house straight ahead of you. To your left is a mountain. In what cardinal direction is the mountain?

No matter how much priming I give or how I reframe the question, GPT-3 tends to either give a basically random cardinal direction, or just repeat whatever direction I mentioned in the prompt. If you can figure out how to do it, please let me know, but as far as I can tell, GPT-3 really doesn't understand how to do this. I think this is just an example of the sort of thing which simply occurs so infrequently in the dataset that it hasn't learned the abstraction. However, I fully suspect that if there were some corner of the Internet where people wrote a lot about the cardinal directions of things relative to a specified observer, GPT-3 would learn it.

It also seems that one of the important things that humans do but transformers do not is actively seek out more surprising subdomains of the learning space. The big breakthrough in transformers was attention, but currently the attention is only within-sequence, not across-dataset. What does L(D) look like if the model is empowered to notice, while training, that its loss on sequences involving words like "west" and "cardinal direction" is bad, and then to search for and prioritize other sequences with those tokens, rather than simply churning through the next 1000 examples of sequences from which it has essentially already extracted the maximum amount of information? At a certain point, you don't need to train it on "The man woke up and got out of {bed}"; it knew what the last token was going to be long ago.
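
A rough sketch of what "prioritize the sequences the model is still bad at" could look like, assuming per-example losses are already tracked somewhere. The loss values, the `floor` parameter, and the function names (`sampling_weights`, `draw_batch`) are hypothetical; this is plain loss-proportional sampling, one simple way to bias training batches toward surprising examples, and not a claim about how any existing model is trained.

```python
import random

def sampling_weights(losses, floor=1e-3):
    """Turn per-example losses into sampling probabilities proportional to loss.
    The floor keeps already-learned examples from vanishing entirely."""
    clipped = [max(loss, floor) for loss in losses]
    total = sum(clipped)
    return [c / total for c in clipped]

def draw_batch(examples, losses, batch_size, rng):
    """Sample a training batch (with replacement), favoring high-loss examples."""
    weights = sampling_weights(losses)
    return rng.choices(examples, weights=weights, k=batch_size)

# Toy usage with made-up losses: one sequence is nearly solved, one is not.
examples = [
    "The man woke up and got out of bed",
    "You are facing north. To your left is a mountain, which lies to the west",
]
losses = [0.05, 2.0]  # hypothetical current per-example losses
batch = draw_batch(examples, losses, batch_size=8, rng=random.Random(0))
```

With these made-up numbers almost all of the sampling probability goes to the cardinal-direction sequence. In a real training loop the losses would have to be refreshed as the model improves, and the "search for other sequences with those tokens" step would need some index over the dataset, which this sketch leaves out.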

It would be good to know if I'm completely missing something here.

[-]nostalgebraist5y20

I don't think you're completely missing something. This is the active learning approach, which gwern also suggested -- see that thread for more.

AI ALIGNMENT FORUM