Compute Trends Across Three eras of Machine Learning

Pablo Villalobos; lennart; Marius Hobbhahn; Tamay Besiroglu; anson.ho

This is a very helpful resource and an insightful analysis! It would also be interesting to study computing trends for research that leverages existing large models whether through fine-tuning, prefix tuning, prompt design, e.g., "Fine-Tuning Language Models from Human Preferences", "Training language models to follow instructions with human feedback", "Prefix-Tuning: Optimizing Continuous Prompts for Generation", "Improving language models by retrieving from trillions of tokens" (where they retrofit baseline models) and indeed work referenced in Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing

Ron

[-]StellaAthena4y00

The distinction between "large scale era" and the rest of DL looks rather suspicious to me. You don't give a meaningful defense of which points you label "large scale era" in your plot and largely it looks like you took a handful of the most expensive models each year to give a different label to.

On what basis can you conclude that Turing NLG, GPT-J, GShard, and Switch Transformers aren't part of the "large scale era"? The fact that they weren't literally the largest models trained that year?

There's also a lot of research that didn't make your analysis, including work explicitly geared towards smaller models. What exclusion criteria did you use? I feel like if I was to perform the same analysis with a slightly different sample of papers I could come to wildly divergent conclusions.

[-]Jsevillamol4y00

There's also a lot of research that didn't make your analysis, including work explicitly geared towards smaller models. What exclusion criteria did you use? I feel like if I was to perform the same analysis with a slightly different sample of papers I could come to wildly divergent conclusions.

It is not feasible to do an exhaustive analysis of all milestone models. We necessarily are missing some important ones, either because we are not aware of them, because they did not provide enough information to deduce the training compute or because we haven't gotten to annotate them yet.

Our criteria for inclusion is outlined in appendix A. Essentially it boils down to ML models that have been cited >1000 times, models that have some historical significance and models that have been deployed in an important context (eg something that was deployed as part of Bing search engine would count). For models in the last two years we were more subjective, since there hasn't been enough time for the more relevant work to stand out the test of time.

We also excluded 5 models that have abnormally low compute, see figure 4.

We tried playing around with the selection of papers that was excluded and it didn't significantly change our conclusions, though obviously the dataset is biased in many ways. Appendix G discusses the possible biases that may have crept in.

[-]Jsevillamol4y00

Great questions! I think it is reasonable to be suspicious of the large-scale distinction.

I do stand by it - I think the companies discontinuously increased their training budgets around 2016 for some flagship models.^[1] If you mix these models with the regular trend, you might believe that the trend was doubling very fast up until 2017 and then slowed down. It is not an entirely unreasonable interpretation, but it explains worse the discontinuous jumps around 2016. Appendix E discusses this in-depth.

The way we selected the large-scale models is half intuition and half convenience. We compare the compute of each model to the log compute of nearby papers (within 2 years), and we call it large scale if its log compute exceeds 0.72 standard deviations of the mean of that sample.

I think there is a reasonable case for including NASv3, Libratus, Megatron-LM, T5-3B, OpenAI Five, Turing NLG, iGPT-XL, GShard (dense), Switch, DALL-E, Pangu-α, ProtT5-XXL and HyperClova on either side of this division.

Arguably we should have been more transparent about the effects of choosing a different threshold - we will try to look more into this in the next update of the paper.

^{^}
See appendix F for a surface discussion

[-]Jsevillamol4y30

Following up on this: we have updated appendix F of our paper with an analysis of different choices of the threshold that separates large-scale and regular-scale systems. Results are similar independently of the threshold choice.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

33

Compute Trends Across Three eras of Machine Learning

33