I am not sure if this paper is flying under the radar for many people, but has anyone read "Transcending Scaling Laws with 0.1% Extra Compute"? If so, how do you think it compares to the scaling laws presented in DeepMind's "An empirical analysis of compute-optimal large language model training"? Does it make you rethink the importance of dataset size (again)?
 


tl;dr: The shape of the curve probably doesn't change, but the compute-optimal LM training will use less data than the Chinchilla scaling law suggests. 

One of the takeaways from the last two years of LM progress is that GPT-3/Chinchilla's next-token-prediction objective is not the most efficient way to use data.* Instead, objectives that require the model to infill missing tokens in the middle of a text string, like the T5 objective or the UL2 objective, are much more efficient per unit of data.
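
To make the distinction concrete, here is a toy sketch (my own illustration, not the actual T5/UL2 preprocessing code) of what an infilling/span-corruption target looks like compared to next-token prediction. The sentinel-token names follow T5's convention, and the masked span is hard-coded here rather than sampled:

```python
# Toy contrast between next-token prediction and a T5-style span-corruption
# ("infilling") objective. Purely illustrative; real pipelines sample span
# locations and lengths rather than hard-coding them.

tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

# Next-token prediction: every position's target is simply the next token.
next_token_pairs = list(zip(tokens[:-1], tokens[1:]))
# -> [("The", "quick"), ("quick", "brown"), ...]

# Span corruption: replace a contiguous span with a sentinel and train the
# model to reconstruct the missing span (T5-style <extra_id_*> sentinels).
span_start, span_end = 2, 5  # hard-coded for clarity
corrupted_input = tokens[:span_start] + ["<extra_id_0>"] + tokens[span_end:]
target = ["<extra_id_0>"] + tokens[span_start:span_end] + ["<extra_id_1>"]

print(corrupted_input)  # ['The', 'quick', '<extra_id_0>', 'over', 'the', 'lazy', 'dog']
print(target)           # ['<extra_id_0>', 'brown', 'fox', 'jumps', '<extra_id_1>']
```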

Figure 2 of the Tay et al. UL2R paper shows UL2 finetuning acting as the equivalent of either a multiplicative or a constant increase in training FLOPs. Assuming that the improvement holds across the board, this means that UL2 finetuning makes models ~1.5-3x more data-efficient. So if before, the optimal trade-off for X FLOPs was Y params times Z tokens, then with a better objective (or with this kind of finetuning on top), we might instead see ~1.5 Y params and ~0.66 Z tokens.
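
To make the arithmetic concrete, here's a back-of-the-envelope sketch (my own numbers, not from either paper), assuming the standard C ≈ 6·N·D FLOPs approximation and a rough Chinchilla-style heuristic of ~20 tokens per parameter; the 2.25x data-efficiency factor is an assumed illustrative value, chosen only because sqrt(2.25) = 1.5 reproduces the 1.5x / 0.66x figures above:

```python
import math

# Back-of-the-envelope sketch, assuming C ~= 6 * N * D training FLOPs and a
# rough Chinchilla-style heuristic of ~20 effective tokens per parameter.
# The data-efficiency factor is an assumed illustrative value, not a number
# taken from the UL2R paper.

TOKENS_PER_PARAM = 20.0  # rough ratio of effective training tokens to params

def optimal_allocation(flops: float, data_efficiency: float = 1.0) -> tuple[float, float]:
    """Return (params N, raw training tokens D) for a given FLOP budget.

    If each raw token is worth `data_efficiency` effective tokens, the optimum
    shifts: N grows by sqrt(data_efficiency), D shrinks by the same factor,
    and 6 * N * D still equals the budget.
    """
    n = math.sqrt(data_efficiency * flops / (6 * TOKENS_PER_PARAM))
    d = flops / (6 * n)
    return n, d

C = 5.76e23  # roughly Chinchilla's budget: 6 * 70e9 params * 1.4e12 tokens

print(optimal_allocation(C))        # ~ (7e10 params, 1.4e12 tokens)
print(optimal_allocation(C, 2.25))  # ~ (1e11 params, 9.2e11 tokens): 1.5x / 0.66x
```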

It's worth noting that this still implies a linear relationship between the optimal param count and token count; it's just that with a better objective, the optimum shifts toward more params and fewer tokens than the Chinchilla scaling laws (which are fit to next-token log loss) would predict.
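
Spelling that out (my own derivation, again under the assumed ~20-effective-tokens-per-parameter heuristic, with k the assumed data-efficiency multiple): keeping k·D effective tokens at 20 per parameter gives

$$D = \frac{20}{k}\,N, \qquad C \approx 6\,N D = \frac{120}{k}\,N^2 \;\Rightarrow\; N_{\mathrm{opt}} = \sqrt{\frac{k\,C}{120}}, \quad D_{\mathrm{opt}} = \sqrt{\frac{10\,C}{3k}},$$

so tokens stay proportional to params, just with a smaller slope (20/k instead of 20) when the objective is more data-efficient.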

* Arguably, we knew this from BERT, where you'd get better finetuned performance on downstream tasks if you pretrained with bidirectional objectives, but I think the result that the next-token prediction objective is worse for text generation tasks is new.