This is a linkpost for

New LM scaling paper from DeepMind (abs, pdf).  

Abstract (my emphasis):

We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.

Brief comments on my blog here.

Presumably has implications for Bio Anchors?

New Comment
7 comments, sorted by Click to highlight new comments since: Today at 5:05 PM

The first-order implication for Bio Anchors is that the number of training datapoints appears to scale linearly with parameter count, rather than in proportion to paramter count ^ 0.8, as estimated in the report. So for example, if you think that TAI models will be 100,000 times larger than current models, then they'll need 10 times more compute to train than was previously estimated. This pushes out timelines on the order of a few years, to the extent that you put weight on the neural network model.

Overall I guess this should shorten timelines, because the effect you explain here is counteracted by the other first-order effect of "oh geez it looks like our earlier scaling projections were inefficient; for any performance level, we now know how to reach that level for less compute cost than the earlier projections said." What do you think?

It ought to shorten actual timelines, for the reason you say.  (Except insofar as data sourcing could actually become a practical problem.)

However, it lengthens the Bio Anchors timeline, because the parameter count in Bio Anchors is fixed.  (It's the parameter count of a model that uses about as much inference compute as the brain.)

This is a weird thing about Bio Anchors -- it asks when models will cross a threshold for the compute required to run them, so efficiency improvements of various kinds will lengthen its timeline.  It's always waiting for its "sufficiently expensive model" (and it does not care that this model keeps "getting better" in terms of loss/etc as the efficiency improvements roll in).

Anyway, I'd forgotten the prior used for dataset scaling in Bio Anchors, but it's pretty broad (page 39 of part 2), with substantial mass on linear/super-linear scaling.  So this news is less relevant than I had thought.

I suppose that depends on whether you think this constitutes several years of progress over and above what you would have expected. I don't think this comes close to that, so I think the effect is much smaller.

OK, good to know. I look forward to seeing the performance trends updated with the new scaling paradigm/law.

(In terms of the neural network model, this means lowering our estimate for how many parameters will be needed.)

Thinking back to the "inconsistency" from the Kaplan et al papers...

  • In Appendix E of the new paper, we see the loss-vs-compute frontier start to "bend" from a straight line on a log-log plot, with returns to additional compute getting smaller at large scales.
  • I suspect this bending is the transition from the faster "L(C) law" to the slower "L(D) law."
    • A brief recap of that below:
      • Adding more params can help in two ways: it makes your model's loss decline toward its asymptotic minimum faster, and it can lower that minimum itself.
      • As models get bigger, the first effect dies off -- the loss curves converge to a fixed shape, rather than getter ever steeper.  The second effect keeps going, but with it alone, the overall rate of return is lower.
  • Presumably, the learning rate issue in Kaplan et. al. also affected their estimated L(D) law.
    • The issue made Kaplan et al underestimate optimal model performance.  The underestimate was worst when considering models for which the optimal number of training steps was small.
    • The L(D) law came from early stopping experiments.  The early stopping step is lower for smaller data sizes.
    • So the L(D) experiments with smaller D values look artificially bad, relative to the ones with large D values.  Thus the estimated L(D) curve declines faster than the true L(D) curve.
    • If this is correct, then L(D) improves more slowly with data than we had believed.
    • Note that this does contradict the "use more data!" result from the paper -- that is about the relative rate at which N and D affect L(N, D).