What happens to variance as neural network training is scaled? What does it imply about "lottery tickets"?

'Variance' is used in an amusing number of ways in these discussions. You use 'variance' in one sense (the bias-variance tradeoff), but "Explaining Neural Scaling Laws", Bahri et al 2021 talks about a different kind of variance limit in scaling, while the toy model in "Learning Curve Theory", Hutter 2021 provides statements on yet other kinds of variance about the scaling curves themselves (and I think you could easily dig up a paper from the neural tangent kernel people about scaling approximating infinite-width models which only need to make infinitesimally small linear updates or something like that, because variance in a different sense goes down...). Meanwhile, my original observation was about the difficulty of connecting benchmarks to practical real-world capabilities: regardless of whether the 'variance of increases in practical real-world capabilities' goes up or down with additional scaling, we still have no good way to say that an X% increase on benchmarks ought to yield qualitatively new capability Y - almost a year later, still no one has shown how you would have predicted in advance that pushing GPT-3 to a particular likelihood loss would yield all these cool new things. As we cannot predict that at all, it would not be of terribly much use to say whether that variance increases or decreases as we continue scaling (since either way, we may wind up being surprised).

2020 AI Alignment Literature Review and Charity Comparison

OpenAI was initially funded with money from Elon Musk as a not-for-profit.

This is commonly said on the basis of his $1b pledge, but AFAICT Musk wound up contributing little or nothing before he resigned ~2018. If you look at the OA Form 990s, Musk is never listed as a donor, only a board member; the only entities that are listed as contributing money or loans are Sam Altman, Y Combinator Research, and OpenAI LP.

Extrapolating GPT-N performance

Finally, the scramble task is about shuffling around letters in the right way, and arithmetic is about adding, subtracting, dividing, and multiplying numbers. The main interesting thing about these tasks is that performance doesn’t improve at all in the beginning, and then starts improving very fast. This is some evidence that we might expect non-linear improvements on particular tasks, though I mostly interpret it as these tasks being quite narrow, such that when a model starts getting the trick, it’s quite easy to systematically get right.

To beat my usual drum: I think the Arithmetic/Scramble task curves are just due to BPEs. The 'trick' here is not that scrambling or arithmetic are actually all that difficult, but that it needs to memorize enough of the encrypted number/word representations to finally crack the BPE code and then once it's done that, the task itself is straightforward. The 'breakthrough', so to speak, is seeing through the scrambled BPE representations. I predict that using tricks like rewriting numbers to individual digits or BPE-dropout to expose all possible tokenizations, or better yet, character-level representations, would show much smoother learning curves and that much smaller models would achieve the GPT-3-175b performance.
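As a minimal sketch of the 'rewrite numbers to individual digits' preprocessing mentioned above (the function name `split_digits` is hypothetical, and this does not implement BPE-dropout or character-level modeling), one could space out every digit before tokenization so that a BPE tokenizer has no chance to merge digits into opaque multi-digit tokens:

```python
import re


def split_digits(text: str) -> str:
    """Rewrite every run of digits as space-separated single digits.

    Illustrative preprocessing only: the idea is that a BPE tokenizer
    then sees one token per digit instead of arbitrary digit chunks.
    """
    return re.sub(r"\d+", lambda m: " ".join(m.group(0)), text)


print(split_digits("What is 1234 + 5678?"))
# prints: What is 1 2 3 4 + 5 6 7 8?
```

Whether this actually smooths the learning curves is the prediction being made, not something the sketch demonstrates.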

the scaling “inconsistency”: openAI’s new insight

I think I see 'domain-specific datasets' as broader than you do. You highlight Github, and yet, when I think of Github, I think of thousands of natural and artificial languages, tackling everything related to software in the world (which is increasingly 'everything'), by millions of people, doing things like uploading banned books for evading the Great Firewall, filing bugs and discussing things back and forth, often adversarially, all reliant on common sense and world knowledge. A GPT trained on Github at hundreds of gigabytes I would expect to induce meta-learning, reasoning, and everything else, for exactly the same reasons CC/books1/books2/WP do; yes, it would know 'source code' well (not a trivial thing in its own right), but that is a mirror of the real world. I see plenty of broad domain coverage from 'just' Github, or 'just' Arxiv. (Literotica, I'm less sure about.) I don't see Github as having much of a disadvantage over CC in terms of broadness or what a model could learn from it. Indeed, given what we know about CC's general quality and how default preprocessing can screw it up (I see a lot of artifacts in GPT-3's output I think are due to bad preprocessing), I expect Github to be more useful than an equivalent amount of CC!

The big advantage of Common Crawl over a Github scrape is that, well, CC already exists. Someone has to invest the effort at some point for all datasets, after all. You can go download pre-cleaned versions of it - aside from EleutherAI's version (which they expect to be substantially better than CC on a byte for byte basis), Facebook and Google recently released big multilingual CC. But of course, now that they've done it and added it to the Pile, that's no longer a problem.

the scaling “inconsistency”: openAI’s new insight

Yes, my hypothesis is that active learning should have a different asymptotic because in a lot of admittedly-simple scenarios like logistic regression, active learning has a much nicer asymptotic. Right now, it's not too hard to run in <=1 epoch, and GPT-3 did, and that's using just CC and books1/2/WP. There are loads of other text datasets. (I think someone in EleutherAI was saying that Literotica alone was 500GB...?) Even if active learning 'runs out' of useful data before available compute, that will (a) save a whole lot of compute/time, and (b) tell us explicitly that we've 'used up' the default data and need to revise our approaches.

The Pile is an interesting experiment, but they're mostly adding large quantities of single-domain text like Github, which is great for those domains but won't help outside them.

I disagree. Transfer learning is practically the entire point. 'Blessings of scale' etc.

Filtering for difficulty like that is tricky. In particular the most difficult samples are random noise or Chinese or something that the model can't begin to comprehend.

I would point out that GPT-2 nontrivially, and GPT-3 surprisingly well, understand Chinese. And see my link: GPT-2 is able to filter out garbage really well. It doesn't have to be perfect. Even a ratio of, say, only 99:1 garbage:good data deleted is a big win. You're trying to filter out really egregious horrible nonsense data of the sort that you can't even imagine exists until you've actually waded through the sewer of Common Crawl and understood what garbage data really is out there. (Another fun example is: when you go looking for rare languages in Common Crawl, given the base rate, what do even really good natural-language identifier models pull up for rare languages? Mostly garbage/natural adversarial examples...)

the scaling “inconsistency”: openAI’s new insight

This makes sense to me and is what I've been considering as the implication of sample-efficiency (one of the blessings of scale), coming at it from another direction of meta-learning/Bayesian RL: if your model gets more sample-efficient as it gets larger & n gets larger, it's because it's increasingly approaching a Bayes-optimal learner and so it gets more out of the more data, but then when you hit the Bayes-limit, how are you going to learn more from each datapoint? You have to switch over to a different and inferior scaling law. You can't squeeze blood from a stone; once you approach the intrinsic entropy, there's not much to learn. Steeply diminishing returns are built into compiling large text datasets and just training on random samples. It looks like the sample-efficient regime is the one we've been in up to GPT-3 and beyond, and the Bayes-limit regime is when the slower data-only scaling kicks in.

Aside from multimodal approaches, the crossover raises the question of whether it becomes time to invest in improvements like active learning. Bayesian RL is so sample-efficient because it actively optimizes choice of data points to acquire, it doesn't just passively acquire ever-more-redundant i.i.d. samples. Active learning is the supervised equivalent, and active learning has different and much better asymptotics than random sampling.
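To make 'actively optimizes choice of data points' concrete, here is a minimal sketch of uncertainty sampling, one textbook active-learning heuristic for a binary classifier such as logistic regression. It is an illustration under stated assumptions, not a claim about how anyone has done this for language models: the names `predictive_entropy` and `select_most_uncertain` are hypothetical, and `pool_probs` stands in for a model's predicted probabilities on an unlabeled pool.

```python
import math


def predictive_entropy(p: float) -> float:
    """Binary entropy (in bits) of a predicted probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_most_uncertain(probs, k):
    """Return indices of the k pool items with the highest predictive entropy."""
    ranked = sorted(
        range(len(probs)),
        key=lambda i: predictive_entropy(probs[i]),
        reverse=True,
    )
    return ranked[:k]


# Made-up predicted probabilities for five unlabeled examples.
pool_probs = [0.99, 0.52, 0.10, 0.45, 0.80]
print(select_most_uncertain(pool_probs, 2))
# prints: [1, 3]
```

The learner asks for labels on the examples it is least sure about (here the ones nearest 0.5) rather than on a random i.i.d. draw, which is where the difference in asymptotics is supposed to come from.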

What scaling curve in L(D)/L(C) could we get with even a simple active learning approach like running a small GPT over Common Crawl and throwing out datapoints which are too easily predicted? (A forward pass of something like GPT-2-1.5b will cost <<1% of the cost of forwards+backwards GPT-3, and so on, and is practically free if we consider scaling a GPT-4 to where the crossover used to be. I've suggested this to EleutherAI to optimize their Pile dataset, and even as simple an approach as looking at gzip compression ratios to throw out extremely poorly/highly-compressed data to trim the outliers seems to work fairly well in throwing away spam but not ham; however, they've been too busy getting the first version working to experiment with any real refinements.)
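The compression-ratio filter mentioned above could look roughly like the following. This is a sketch, not EleutherAI's actual pipeline: the function names and the `low`/`high` thresholds are hypothetical and would need tuning on real data, and it uses `zlib` (the same DEFLATE algorithm gzip uses, minus the file header) from the Python standard library.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size; lower means more redundant."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw, 9)) / len(raw)


def keep_document(text: str, low: float = 0.2, high: float = 0.9) -> bool:
    """Keep a document only if its ratio is not an outlier.

    Very low ratios suggest repetitive spam or boilerplate; very high
    ratios suggest noise-like data. The thresholds are assumed values.
    Empty documents get a ratio of 0.0 and are therefore dropped.
    """
    return low <= compression_ratio(text) <= high


spam = "buy now " * 500
print(keep_document(spam))
# prints: False
```

The same outlier-trimming shape would apply if the score were a small model's per-token loss instead of a compression ratio, at the cost of a forward pass per document.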

One interesting example from my 'data' category is "Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study", Bahri et al 2020:

Large generative language models such as GPT-2 are well-known for their ability to generate text as well as their utility in supervised downstream tasks via fine-tuning. Our work is twofold: firstly we demonstrate via human evaluation that classifiers trained to discriminate between human and machine-generated text emerge as unsupervised predictors of "page quality", able to detect low quality content without any training. This enables fast bootstrapping of quality indicators in a low-resource setting. Secondly, curious to understand the prevalence and nature of low quality pages in the wild, we conduct extensive qualitative and quantitative analysis over 500 million web articles, making this the largest-scale study ever conducted on the topic.

Mesa-Search vs Mesa-Control

Because you don't train the inputs, you're trying to train parameters, but the gradients stop cold there if you just treat them as blackboxes, and this seems like it's abusing the term 'stochastic' (what does the size of minibatches being smaller than the full dataset have to do with this?). I still don't understand what you think Transformers are doing differently vs RNNs in terms of what kind of processing of history they are doing and why Transformers can't meta-learn in the same way as RNNs internally.

Mesa-Search vs Mesa-Control

An RNN is deterministic, usually (how else are you going to backprop through it to train it? not too easily), and even if it's not, I don't see why that would make a difference, or why a Transformer couldn't be 'not deterministic' in the same sense given access to random bits (talking about stochastic units merely smuggles in bits by the back door) nor why it can't learn 'Monte Carlo iterations' internally (say, one per head).

Reply to Jebari and Lundborg on Artificial Superintelligence

I skimmed the paper when they announced it on Twitter. It seemed like it fundamentally ignores every possibility vaguely like mesa-optimization or imitation learning, and can't deal with things like, say, GPT-3 meta-learning agency to better predict data derived from agents (ie. humans). They leave themselves an out by handwaving away all such inconveniences as 'iron ore agents', but then it's thoroughly useless and circular; "what's an iron ore agent?" "It's one which has dangerous outcomes due to hidden agency." "OK, which agents are those, how can you tell AlphaZero from GPT-3 from AGI?" "Well, try them and see!"

[AN #120]: Tracing the intellectual roots of AI and AI alignment

The authors then develop their own method, Maia. They talk about it as a “modification of the AlphaZero architecture”, but as far as I can tell it is simply behavior cloning using the neural net architecture used by Leela. As you might expect, this does significantly better, and finally satisfies the property we would intuitively want: the best predictive model for a human of some skill level is the one that was trained on the data from humans at that skill level.

Yeah, I think that's all they mean: the CNN and input/output are the same as Leela's, which are in turn the same as AlphaZero's. But it does differ from behavioral cloning in that they stratify the samples - typically, behavior cloning dumps in all available expert samples (perhaps with a minimum cutoff rating, which is how AlphaGo filtered its KGS pretraining) and trains on them all equally.

Personally, I would've trained a single conditional model with a specified player-Elo for each move, instead of arbitrarily bucketing into 9 levels of Elo ranges, but perhaps they have so many games that each bucket is enough (12m each as they emphasize) and they preferred to keep it simple and spend data/compute instead of making the training & runtime more complicated.
