This is a linkpost for https://arxiv.org/abs/2203.15556

[Link] Training Compute-Optimal Large Language Models

7 comments

The first-order implication for Bio Anchors is that the number of training datapoints appears to scale linearly with parameter count, rather than in proportion to parameter count ^ 0.8, as estimated in the report. So for example, if you think that TAI models will be 100,000 times larger than current models, then they'll need 10 times more compute to train than was previously estimated. This pushes out timelines on the order of a few years, to the extent that you put weight on the neural network model.
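The arithmetic behind that factor of 10 can be checked directly: with training compute C ∝ N·D, moving from D ∝ N^0.8 to D ∝ N changes C ∝ N^1.8 to C ∝ N^2, and at a 100,000x scale-up the ratio is 100,000^0.2 = 10. A quick sanity check (the 100,000x figure is the hypothetical from the comment above, not a number from the report):

```python
# Training compute scales as C ~ N * D (parameters x training datapoints).
# Bio Anchors estimated D ~ N^0.8; the new paper suggests D ~ N.
scale_up = 100_000  # hypothetical TAI model size relative to current models

old_compute = scale_up * scale_up**0.8   # C ~ N^1.8
new_compute = scale_up * scale_up**1.0   # C ~ N^2.0

ratio = new_compute / old_compute        # = scale_up**0.2
print(ratio)  # ~10
```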

Overall I guess this should shorten timelines, because the effect you explain here is counteracted by the other first-order effect of "oh geez it looks like our earlier scaling projections were inefficient; for any performance level, we now know how to reach that level for less compute cost than the earlier projections said." What do you think?

It ought to shorten actual timelines, for the reason you say. (Except insofar as data sourcing could actually become a practical problem.)

However, it lengthens the Bio Anchors timeline, because the parameter count in Bio Anchors is fixed. (It's the parameter count of a model that uses about as much inference compute as the brain.)

This is a weird thing about Bio Anchors -- it asks when models will cross a threshold for the *compute required to run them*, so efficiency improvements of various kinds will lengthen its timeline. It's always waiting for its "sufficiently expensive model" (and it does not care that this model keeps "getting better" in terms of loss/etc as the efficiency improvements roll in).

Anyway, I'd forgotten the prior used for dataset scaling in Bio Anchors, but it's pretty broad (page 39 of part 2), with substantial mass on linear/super-linear scaling. So this news is less relevant than I had thought.

Thinking back to the "inconsistency" from the Kaplan et al papers...

- In Appendix E of the new paper, we see the loss-vs-compute frontier start to "bend" from a straight line on a log-log plot, with returns to additional compute getting smaller at large scales.
- I suspect this bending is the transition from the faster "L(C) law" to the slower "L(D) law."
- A brief recap of that below:
- Adding more params can help in two ways: it makes your model's loss decline toward its asymptotic minimum faster, and it can lower that minimum itself.
- As models get bigger, the first effect dies off -- the loss curves converge to a fixed shape, rather than getting ever steeper. The second effect keeps going, but with it alone, the overall rate of return is lower.
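The two effects in that recap can be sketched with a toy parameterization (illustrative numbers only; nothing here is fitted to real data): write the loss at step S for model size N as L(S; N) = L_inf(N) + c(N) * S^(-alpha), where the asymptote L_inf(N) keeps falling with N while the shape coefficient c(N) saturates at large N.

```python
# Toy loss curves, purely illustrative: L(S; N) = L_inf(N) + c(N) * S**-alpha.
# Effect 2: L_inf(N) keeps falling with N (the asymptotic minimum keeps improving).
# Effect 1: c(N) stops changing once N is large (curves converge to a fixed shape).

def L_inf(n):
    # asymptotic loss: keeps improving with scale (made-up constants)
    return 2.0 + 400.0 / n**0.34

def curve_coef(n):
    # shape term: changes a lot at small n, saturates near 1.0 at large n
    return 1.0 + 100.0 / n

def loss(steps, n, alpha=0.5):
    return L_inf(n) + curve_coef(n) * steps**-alpha

# Effect 2 never stops: a bigger model has a lower floor.
print(L_inf(1e6), L_inf(1e9))

# Effect 1 dies off: the shape coefficient barely differs between large sizes.
print(curve_coef(1e6), curve_coef(1e9))
```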

- Presumably, the learning rate issue in Kaplan et al. *also* affected their estimated L(D) law.
- The issue made Kaplan et al underestimate optimal model performance. The underestimate was worst when considering models for which the optimal number of training steps was small.
- The L(D) law came from early stopping experiments. The early stopping step is lower for smaller data sizes.
- So the L(D) experiments with smaller D values look artificially bad, relative to the ones with large D values. Thus the estimated L(D) curve declines faster than the true L(D) curve.
- If this is correct, then L(D) improves more slowly with data than we had believed.
- Note that this does not contradict the "use more data!" result from the paper -- that is about the relative rate at which N and D affect L(N, D).
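The bias argument in the list above can be simulated with toy numbers (made up for illustration; the functional forms and constants are hypothetical, not Kaplan et al's data): take a true power law L(D), add a distortion term that is worst at small D (short early-stopped runs), and fit a power law to the distorted points. The fitted exponent comes out steeper than the true one.

```python
# Toy simulation: early-stopping bias makes the fitted L(D) law too steep.
import math

TRUE_EXP = 0.095  # hypothetical true data-scaling exponent

def true_loss(d):
    return 5.0 * d**-TRUE_EXP

def measured_loss(d):
    # hypothesized bias: inflates loss most for small D (short runs)
    return true_loss(d) * (1.0 + 0.5 / d**0.2)

# Fit a power law by linear regression in log-log space.
ds = [10**k for k in range(4, 10)]
xs = [math.log(d) for d in ds]
ys = [math.log(measured_loss(d)) for d in ds]
n = len(ds)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx)**2 for x in xs)

fitted_exp = -slope
print(fitted_exp)  # steeper (larger) than the true 0.095
```

So the small-D points looking artificially bad tilts the fitted curve downward, exaggerating how fast loss improves with data.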

New LM scaling paper from DeepMind (abs, pdf).

Abstract (my emphasis):

Brief comments on my blog here.

Presumably has implications for Bio Anchors?