the scaling “inconsistency”: openAI’s new insight

I think I see 'domain-specific datasets' as broader than you do. You highlight Github, and yet, when I think of Github, I think of thousands of natural and artificial languages, tackling everything related to software in the world (which is increasingly 'everything'), by millions of people, doing things like uploading banned books for evading the Great Firewall, filing bugs and discussing things back and forth, often adversarially, all reliant on common sense and world knowledge. A GPT trained on Github at hundreds of gigabytes I would expect to induce meta-learning, reasoning, and everything else, for exactly the same reasons CC/books1/books2/WP do; yes, it would know 'source code' well (not a trivial thing in its own right), but that is a mirror of the real world. I see plenty of broad domain coverage from 'just' Github, or 'just' Arxiv. (Literotica, I'm less sure about.) I don't see Github as having much of a disadvantage over CC in terms of broadness or what a model could learn from it. Indeed, given what we know about CC's general quality and how default preprocessing can screw it up (I see a lot of artifacts in GPT-3's output I think are due to bad preprocessing), I expect Github to be more useful than an equivalent amount of CC!

The big advantage of Common Crawl over a Github scrape is that, well, CC already exists. Someone has to invest the effort at some point for all datasets, after all. You can go download pre-cleaned versions of it - aside from EleutherAI's version (which they expect to be substantially better than CC on a byte for byte basis), Facebook and Google recently released big multilingual CC. But of course, now that they've done it and added it to the Pile, that's no longer a problem.

the scaling “inconsistency”: openAI’s new insight

Yes, my hypothesis is that active learning should have a different asymptotic because in a lot of admittedly-simple scenarios like logistic regression, active learning has a much nicer asymptotic. Right now, it's not too hard to run in <=1 epoch, and GPT-3 did, and that's using just CC and books1/2/WP. There's loads of other text datasets. (I think someone in EleutherAI was saying that Literotica alone was 500GB...?) Even if active learning 'runs out' of useful data before available compute, that will (a) save a whole lot of compute/time. and (b) tell us explicitly that we've 'used up' the default data and need to revise our approaches.

The Pile is an interesting experiment, but they're mostly adding large quantities of single-domain text like Github, which is great for those domains but won't help outside them.

I disagree. Transfer learning is practically the entire point. 'Blessings of scale' etc.

Filtering for difficulty like that is tricky. In particular the most difficult samples are random noise or Chinese or something that the model can't begin to comprehend.

I would point out that GPT-2 nontrivially, and GPT-3 surprisingly well, understand Chinese. And see my link: GPT-2 is able to filter out garbage really well. It doesn't have to be perfect. Even a ratio of, say, only 99:1 garbage:good data deleted is a big win. You're trying to filter out really egregious horrible nonsense data of the sort that you can't even imagine exists until you've actually waded through the sewer of Common Crawl and understood what garbage data really is out there. (Another fun example is: when you go looking for rare languages in Common Crawl, given the base rate, what do even really good natural-language identifier models pull up for rare models? Mostly garbage/natural adversarial examples...)

the scaling “inconsistency”: openAI’s new insight

This makes sense to me and is what I've been considering as the implication of sample-efficiency (one of the blessings of scale), coming at it from another direction of meta-learning/Bayesian RL: if your model gets more sample-efficient as it gets larger & n gets larger, it's because it's increasingly approaching a Bayes-optimal learner and so it gets more out of the more data, but then when you hit the Bayes-limit, how are you going to learn more from each datapoint? You have to switch over to a different and inferior scaling law. You can't squeeze blood from a stone; once you approach the intrinsic entropy, there's not much to learn. Steeply diminishing returns is built into compiling large text datasets and just training on random samples. It looks like the former is the regime we've been in up to GPT-3 and beyond, and the latter is when the slower data-only scaling kicks in.

Aside from multimodal approaches, the crossover raises the question of whether it becomes time to invest in improvements like active learning. Bayesian RL is so sample-efficient because it actively optimizes choice of data points to acquire, it doesn't just passively acquire ever-more-redundant i.i.d. samples. Active learning is the supervised equivalent, and active learning has different and much better asymptotics than random sampling.

What scaling curve in L(D)/L(C) could we get with even a simple active learning approach like running a small GPT over Common Crawl and throwing out datapoints which are too easily predicted? (A forward pass of something like GPT-2-1.5b will cost <<1% of the cost of forwards+backwards GPT-3, and so on, and is practically free if we consider scaling a GPT-4 to where the crossover used to be. I've suggested this to EleutherAI to optimize their Pile dataset, and even as simple an approach as looking at gzip compression ratios to throw out extremely poorly/highly-compressed data to trim the outliers seems to work fairly well in throwing away spam but not ham; however, they've been too busy getting the first version working to experiment with any real refinements.)

One interesting example from my 'data' category is "Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study", Bahri et al 2020:

Large generative language models such as GPT-2 are well-known for their ability to generate text as well as their utility in supervised downstream tasks via fine-tuning. Our work is twofold: firstly we demonstrate via human evaluation that classifiers trained to discriminate between human and machine-generated text emerge as unsupervised predictors of "page quality", able to detect low quality content without any training. This enables fast bootstrapping of quality indicators in a low-resource setting. Secondly, curious to understand the prevalence and nature of low quality pages in the wild, we conduct extensive qualitative and quantitative analysis over 500 million web articles, making this the largest-scale study ever conducted on the topic.

Mesa-Search vs Mesa-Control

Because you don't train the inputs, you're trying to train parameters, but the gradients stop cold there if you just treat them as blackboxes, and this seems like it's abusing the term 'stochastic' (what does the size of minibatches being smaller than the full dataset have to do with this?). I still don't understand what you think Transformers are doing differently vs RNNs in terms of what kind of processing of history they are doing and why Transformers can't meta-learn in the same way as RNNs internally.

Mesa-Search vs Mesa-Control

An RNN is deterministic, usually (how else are you going to backprop through it to train it? not too easily), and even if it's not, I don't see why that would make a difference, or why a Transformer couldn't be 'not deterministic' in the same sense given access to random bits (talking about stochastic units merely smuggles in bits by the back door) nor why it can't learn 'Monte Carlo iterations' internally (say, one per head).

Reply to Jebari and Lundborg on Artificial Superintelligence

I skimmed the paper when they announced it on Twitter. It seemed like it fundamentally ignores every possibility vaguely like mesa-optimization or imitation learning, and can't deal with things like, say, GPT-3 meta-learning agency to better predict data derived from agents (ie. humans). They leave themselves an out by handwaving away all such inconveniences as 'iron ore agents', but then it's thoroughly useless and circular; "what's an iron ore agent?" "It's one which has dangerous outcomes due to hidden agency." "OK, which agents are those, how can you tell AlphaZero from GPT-3 from AGI?" "Well, try them and see!"

[AN #120]: Tracing the intellectual roots of AI and AI alignment

The authors then develop their own method, Maia. They talk about it as a “modification of the AlphaZero architecture”, but as far as I can tell it is simply behavior cloning using the neural net architecture used by Leela. As you might expect, this does significantly better, and finally satisfies the property we would intuitively want: the best predictive model for a human of some skill level is the one that was trained on the data from humans at that skill level.

Yeah, I think that's all they mean: the CNN and input/output are the same as Leela the same as AlphaZero. But it does differ from behavioral cloning in that they stratify the samples - typically, behavior cloning dumps in all available expert samples (perhaps with a minimum cutoff rating, which is how AlphaGo filtered its KGS pretraining) and trains on them all equally.

Personally, I would've trained a single conditional model with a specified player-Elo for each move, instead of arbitrarily bucketing into 9 levels of Elo ranges, but perhaps they have so many games that each bucket is enough (12m each as they emphasize) and they preferred to keep it simple and spend data/compute instead of making the training & runtime more complicated.

Environments as a bottleneck in AGI development

"Blessings of scale" observations aside, it seems like right now, environments are not the bottleneck to DL/DRL work. No one failed to solve Go because gosh darn it, they just lacked a good Go simulator which correctly implemented the rules of the game; the limits to solving ALE-57 (like Montezuma's Revenge) in general or as a single multi-task agent do not seem to be lack of Atari games where what we really need is ALE-526*; Procgen performance is not weak because of insufficient variation in levels; OpenAI Universe failed not for lack of tasks, to say the least; the challenge in creating or replicating GPT-3 is not in scraping the text (and GPT-3 didn't even run 1 epoch!). Datasets/environments sometimes unlock new performance, like ImageNet, but even when one saturates, there's typically more datasets which are not yet solved and cannot be solved simultaneously (JFT-300M, for example), and in the case of RL of course compute=data. If you went to any DRL researcher, I don't think many of them would name "we've solved all the existing environments to superhuman level and have unemployed ourselves!" as their biggest bottleneck.

Is it really the case that at some point we will be drowning in so many GPUs and petaflops that our main problem will become coming up with ever more difficult tasks to give them something useful to train on? Or is this specifically a claim about friendly AGI, where we lack any kind of environment which would seem to force alignment for maximum score?

* Apparently the existing ALE suite was chosen pretty haphazardly:

Our testing set was constructed by choosing semi-randomly from the 381 games listed on Wikipedia at the time of writing. Of these games, 123 games have their own Wikipedia page, have a single player mode, are not adult-themed or prototypes, and can be emulated in ALE. From this list, 50 games were chosen at random to form the test set.

I wonder how the history of DRL would've changed if they had happened to select from the other 73, or if Pitfall & Montezuma's Revenge had been omitted? I don't however, think it would've been a good use of their time in 2013 to work on adding more ALE games rather than, say, debugging GPU libraries to make it easier to run NNs at all...

Why GPT wants to mesa-optimize & how we might change this

It still is, it's just that beam search (or other search strategies) seem to be mostly useful for closed-end short text generation; translating a sentence apparently is a task with enough of a right-or-wrong-ness to it that beam search apparently taps into no pathologies. But they get exposed for open-ended longform generation.

Why GPT wants to mesa-optimize & how we might change this

Why is beam search missing? One possibility is that GPT-3 already does internal lookahead. OpenAI tried beam search, found it didn't improve text generation, and didn't bother adding it as an option. In other words, GPT-3 is already mesa-optimizing 😲

Beam search has never worked for likelihood-trained NNs, since at least char-RNNs back in 2015. Beam search does trigger repetition and other pathologies in GPT, see "The Curious Case of Neural Text Degeneration", Holtzman et al 2019. And while unlikelihood training seems to help, it's not a silver bullet, and is a bit ad hoc (especially if you think of it in terms of reinforcement learning).

Load More