New LM scaling paper from DeepMind (abs, pdf).

Abstract (my emphasis):

We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget.

We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens,wefind that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled.We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more more data.Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks.This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage.As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.

Brief comments on my blog here.

Presumably has implications for Bio Anchors?

The first-order implication for Bio Anchors is that the number of training datapoints appears to scale linearly with parameter count, rather than in proportion to paramter count ^ 0.8, as estimated in the report. So for example, if you think that TAI models will be 100,000 times larger than current models, then they'll need 10 times more compute to train than was previously estimated. This pushes out timelines on the order of a few years, to the extent that you put weight on the neural network model.

Overall I guess this should shorten timelines, because the effect you explain here is counteracted by the other first-order effect of "oh geez it looks like our earlier scaling projections were inefficient; for any performance level, we now know how to reach that level for less compute cost than the earlier projections said." What do you think?

It ought to shorten actual timelines, for the reason you say. (Except insofar as data sourcing could actually become a practical problem.)

However, it lengthens the Bio Anchors timeline, because the parameter count in Bio Anchors is fixed. (It's the parameter count of a model that uses about as much inference compute as the brain.)

This is a weird thing about Bio Anchors -- it asks when models will cross a threshold for the

compute required to run them, so efficiency improvements of various kinds will lengthen its timeline. It's always waiting for its "sufficiently expensive model" (and it does not care that this model keeps "getting better" in terms of loss/etc as the efficiency improvements roll in).Anyway, I'd forgotten the prior used for dataset scaling in Bio Anchors, but it's pretty broad (page 39 of part 2), with substantial mass on linear/super-linear scaling. So this news is less relevant than I had thought.

I suppose that depends on whether you think this constitutes several years of progress over and above what you would have expected. I don't think this comes close to that, so I think the effect is much smaller.

OK, good to know. I look forward to seeing the performance trends updated with the new scaling paradigm/law.

(In terms of the neural network model, this means lowering our estimate for how many parameters will be needed.)

Thinking back to the "inconsistency" from the Kaplan et al papers...

alsoaffected their estimated L(D) law.