Comments

The smooth graphs seem like good evidence that there are much smoother underlying changes in the model, and that the abruptness of the change is about behavior or evaluation rather than what gradient descent is learning.

That does not seem true to me, and seems as much of a leap as the OP's. A priori, if I see a smooth curve in one metric and a discontinuous or abrupt change in another, I do not see why that should make me more confident that it is 'about behavior or evaluation'. Why should I conclude that? Why can't it reflect a non-smooth underlying change in the model first? I would only conclude that if I had already ruled out internal changes because I was already committed to the position that NNs can only learn and change internally in smooth, small ways... which unfortunately we already know is a false position, because of things like Anthropic's induction bump, which shows a phase transition in the internals of the model that is nearly invisible in the loss. (Incidentally, because the bump is so small and the training curve still so smooth, it also falsifies the more modest claim that small changes in perplexity must reflect small changes in the model internals: maybe small perplexity changes usually do not reflect non-smooth underlying changes, but it is entirely possible, it does happen, and we would surely find many more routine examples if we had better interpretability, so that examining a single instance didn't take man-years.) And also a priori, from the old statistical-mechanics literature, you should expect abrupt phase changes of various sorts in NN models (which may or may not be visible in the training curve), such as parity models, where the task is so simple and clearly defined that it cannot have anything to do with the 'behavior' or 'evaluation' being wrong, and which arise from effects like symmetry-breaking (often associated with plateaus and flat curves...).
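To see how a genuine internal phase transition can be nearly invisible in the aggregate loss, here is a minimal toy sketch (my own illustration with made-up numbers, not Anthropic's data): treat the overall loss as an average of many smoothly-improving components plus one small component that changes abruptly; the abrupt component barely registers in the average.

```python
import numpy as np

steps = np.arange(1000)

# 99 loss components that improve smoothly (power-law decay).
smooth = 3.0 * (steps + 10.0) ** -0.3

# 1 rare component with an abrupt transition around step 600 (sharp sigmoid).
abrupt = 3.0 - 2.5 / (1.0 + np.exp(-(steps - 600) / 5.0))

aggregate = (99 * smooth + abrupt) / 100.0

# The rare component drops by ~2.5 nats within a few dozen steps...
print("component drop, steps 570-630:", abrupt[570] - abrupt[630])
# ...but it contributes only ~1/100th of that (~0.025 nats) to the aggregate:
# a barely visible bump on the aggregate's own smooth downward trend.
print("aggregate drop, steps 570-630:", aggregate[570] - aggregate[630])
```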

Yes, you could definitely have misleading perplexities, like improving on a subset which is rare but vital and does not overcome noise in the evaluation (you are stacking multiple layers of measurement error/variance when you evaluate a single checkpoint on a single small heldout set of datapoints); after all, this is the entire problem to begin with: our overall perplexity has a very unclear relationship to various kinds of performance, so your overall BIG-bench perplexity would tell you little about whether there are any jaggies when you break it down to individual BIG-bench components, and there is no reason to think the individual components are 'atomic', so the measurement regress continues... The fact that someone like Paul can come along afterwards and tell you "ah, but the perplexity would have been smooth if only you had chosen the right subset of datapoints to measure progress on as your true benchmark" would not matter.
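The measurement-error point is easy to quantify with a toy example (again my own made-up numbers): an improvement confined to a rare-but-vital 1% subset shifts the overall heldout loss by less than the sampling noise of a single small heldout set, so single-checkpoint evaluations can miss it or even reverse its sign.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rare_frac = 500, 0.01          # small heldout set, rare-but-vital 1% subset

def heldout_mean_loss(rare_mean):
    """Mean log-loss of one checkpoint on one small heldout sample."""
    common = rng.normal(2.0, 1.0, size=int(n * (1 - rare_frac)))
    rare = rng.normal(rare_mean, 1.0, size=max(1, int(n * rare_frac)))
    return np.concatenate([common, rare]).mean()

# Checkpoint A: rare subset at 4.0 nats; checkpoint B: rare subset improved to
# 1.0 nats. The true overall improvement is only 0.01 * 3.0 = 0.03 nats, less
# than the ~0.045-nat standard error of a 500-example evaluation, so repeated
# measurements of "A minus B" bounce around zero.
for trial in range(5):
    print(heldout_mean_loss(4.0) - heldout_mean_loss(1.0))
```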

I'm not sure that makes sense or is justified by anything here either. You aren't looking at all the other lines. You are selectively presenting the jagged lines' counterparts, which are smooth (just like the overall perplexity is smooth), but you don't show that the flatlined lines' counterparts are flatlined, or indeed in any way different-looking. (The Wason selection task comes to mind here.) Maybe all the perplexity lines look similarly smooth, and if you shuffled them, no one would be able to tell you which perplexity line matched up with which jagged or non-jagged line. If all you can say is that a smooth perplexity line is a necessary condition but otherwise not even weakly correlated with jaggedness, then that seems to boil down to the claim "a broken model which isn't improving in any way also won't improve in some particular ways, and a model which is improving in many ways may or may not improve in some particular ways", which is not useful or interesting. (I do not worry about subhuman harmless models which are not improving, such as n-grams, and I do not think anyone else spends much time worrying about them either.)

The authors actually observe smooth increases in answer log-likelihood, even for tasks which showed emergent behavior according to the natural performance metric for the task (e.g. accuracy). These results are evidence that we can predict that emergent behaviors will occur in the future before models are actually “capable” of those behaviors.

So by what perplexity should one predict each of those having a sharp left turn at future scale-ups, exactly? What is the critical point on each smooth line from which you think you can predict the abrupt jagged line, and why do you think these two plots show that one can be predicted from the other, instead of showing the opposite?
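To make the question concrete, here is a minimal toy model (my own, not the authors'): even if the per-token log-likelihood of the correct answer improves perfectly smoothly with scale, the induced exact-match accuracy curve jumps abruptly, and where it jumps depends on details like answer length, which the smooth line by itself does not tell you.

```python
import numpy as np

# Pretend per-token log-likelihood of the correct answer improves smoothly
# with "scale" (arbitrary units), from -3 nats to 0 nats.
scale = np.linspace(0, 1, 200)
p_token = np.exp(-3.0 * (1 - scale))

# Exact-match accuracy on an L-token answer (assuming independent tokens)
# stays near zero and then shoots up -- and *where* it shoots up depends on L.
for L in (1, 5, 20):
    acc = p_token ** L
    crossing = scale[np.argmax(acc > 0.5)]   # first scale where accuracy > 50%
    print(f"answer length {L:2d}: accuracy crosses 50% at scale {crossing:.2f}")
```

So even granting a perfectly smooth underlying metric, reading a 'critical point' off it requires further assumptions about how the downstream metric is computed from it.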

Interesting new paper: "What Can Transformers Learn In-Context? A Case Study of Simple Function Classes", Garg et al 2022:

In-context learning refers to the ability of a model to condition on a prompt sequence consisting of in-context examples (input-output pairs corresponding to some task) along with a new query input, and generate the corresponding output. Crucially, in-context learning happens only at inference time without any parameter updates to the model. While large language models such as GPT-3 exhibit some ability to perform in-context learning, it is unclear what the relationship is between tasks on which this succeeds and what is present in the training data.

To make progress towards understanding in-context learning, we consider the well-defined problem of training a model to in-context learn a function class (e.g., linear functions): that is, given data derived from some functions in the class, can we train a model to in-context learn "most" functions from this class?

We show empirically that standard Transformers can be trained from scratch to perform in-context learning of linear functions -- that is, the trained model is able to learn unseen linear functions from in-context examples with performance comparable to the optimal least squares estimator. In fact, in-context learning is possible even under two forms of distribution shift: (i) between the training data of the model and inference-time prompts, and (ii) between the in-context examples and the query input during inference.

We also show that we can train Transformers to in-context learn more complex function classes -- namely sparse linear functions, two-layer neural networks, and decision trees -- with performance that matches or exceeds task-specific learning algorithms.

Our code and models are available at this https URL.

...Curriculum: ...Notably, when training Transformers without curriculum, there is an initial—relatively long—period in training where the loss does not decrease, followed by a period of sharp decrease. The length of this period varies with training randomness and seems to increase on average with problem dimension. Understanding the model just before and after this transition moment is a promising future direction, which can give insights into the emergence of in-context learning. Interestingly, Olsson et al 2022 observe a similar jump in the in-context learning ability of a language model which they attribute to the formation of “induction heads”.
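For concreteness, a minimal sketch of the task setup as I understand it (my own code, not theirs): each prompt is a sequence of (x, w·x) pairs from a freshly-sampled linear function, and the baseline the trained Transformer is compared against is the least-squares estimator fit on just those in-context examples.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 40                 # input dimension, number of in-context examples

# A freshly-sampled linear function w and a prompt of (x_i, w @ x_i) pairs.
w = rng.standard_normal(d)
X = rng.standard_normal((k, d))
y = X @ w

# Query input whose output must be predicted from the prompt alone,
# with no parameter updates.
x_query = rng.standard_normal(d)

# The least-squares baseline: fit w from the in-context examples, predict.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("least-squares prediction:", x_query @ w_hat)
print("true value:              ", x_query @ w)
```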

(Is that just because they get attacked and killed by other chimp groups?)

For instance, it seems plausible that if “adding arabic numerals” and “translating words into arabic numerals” are two groups but “adding numbers written as words” is not, performance on the latter could nonetheless develop smoothly as the model gets better at the others. It would certainly be weird if performance on “adding numbers written as words” advanced as a sudden leap in this case.

I wouldn't say this is weird. This is kind of the point of meta-learning, or 'transfer' in a broad sense: you train on X, and Y gets better! Or look at emergent capabilities: they don't spike because of additional data being added (the token count is similar or identical), so it has to be because larger models are in some way transferring from other datapoints.

There also seems to be a premise running through this proposal that learning is simple and independent, in some sense, and that you are mostly just oversampling/undersampling as a throttle, as it were, to avoid spikes by throttling each task individually instead of only the global loss, which is too loose and leaves too much wiggle room because individual tasks are a minuscule fraction of the overall average 'task'. But we have plenty of evidence that how you weight or group data changes the dynamics and capabilities quantitatively and qualitatively; the most striking recent research result implying that how you group data can change what is learned qualitatively is DM's "Data Distributional Properties Drive Emergent In-Context Learning in Transformers", Chan et al 2022:

Large transformer-based models are able to perform in-context few-shot learning, without being explicitly trained for it. This observation raises the question: what aspects of the training regime lead to this emergent behavior? Here, we show that this behavior is driven by the distributions of the training data itself.

In-context learning emerges when the training data exhibits particular distributional properties such as burstiness (items appear in clusters rather than being uniformly distributed over time) and having large numbers of rarely occurring classes. In-context learning also emerges more strongly when item meanings or interpretations are dynamic rather than fixed. These properties are exemplified by natural language, but are also inherent to naturalistic data in a wide range of other domains. They also depart significantly from the uniform, i.i.d. training distributions typically used for standard supervised learning.

In our initial experiments, we found that in-context learning traded off against more conventional weight-based learning, and models were unable to achieve both simultaneously. However, our later experiments uncovered that the two modes of learning could co-exist in a single model when it was trained on data following a skewed Zipfian distribution -- another common property of naturalistic data, including language. In further experiments, we found that naturalistic data distributions were only able to elicit in-context learning in transformers, and not in recurrent models.

In sum, our findings indicate how the transformer architecture works together with particular properties of the training data to drive the intriguing emergent in-context learning behaviour of large language models, and how future work might encourage both in-context and in-weights learning in domains beyond language.

Here, the distribution of tasks (known image classes) affects the kind of learning of other tasks (classes): the presence of a common class or a rare class, as opposed to a middle class, skews the model as a whole, across all future classes, away from meta-learning.

I take this as implying that if you did something like extract the implicit tasks of a big Internet scrape and did the obvious thing of rebalancing classes away from a Zipfian distribution to a uniform distribution (closer to something like ImageNet, with 1000 classes of roughly the same size), you would get models which might be much more efficient to train, or might have the same or lower training loss, but would have a very different set of strengths and weaknesses; possibly, in the extreme case, they might have no few-shot capability at all! (This alternative model is probably very far away in model space from the normal meta-learning one, having learned a fundamentally different approach, so I doubt any consideration of local gradients or model properties is going to be useful.) This is a more extreme version of my concern with MoEs, that using experts to solve specific problems rather than a single universal dense model will tend to sabotage the learning of interesting capabilities: here, it's not merely that MoEs seem to do slightly better on memorization-heavy benchmarks than reasoning ones, it's that the meta-learning doesn't happen at all!
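(To be concrete about 'the obvious thing', something like the following sketch, with made-up numbers: take the implicit classes' Zipfian frequencies and reweight samples by inverse class frequency so every class is seen about equally often.)

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Implicit "classes"/tasks extracted from a scrape, with Zipfian frequencies.
n_classes = 1000
ranks = np.arange(1, n_classes + 1)
zipf = (1.0 / ranks) / (1.0 / ranks).sum()
labels = rng.choice(n_classes, size=1_000_000, p=zipf)

# The obvious rebalancing: sample each example with probability inversely
# proportional to its class frequency, so classes become roughly uniform.
counts = Counter(labels)
weights = np.array([1.0 / counts[c] for c in labels])
weights /= weights.sum()
rebalanced = rng.choice(labels, size=100_000, p=weights)

print("top classes before:", Counter(labels).most_common(3))
print("top classes after: ", Counter(rebalanced).most_common(3))
```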

And the strangeness probably doesn't stop there. If you trained some large model in such a manner and it was completely crippled in some respects (while presumably having perhaps more than offsetting gains elsewhere), what would happen if you then further trained it on a Zipfian dataset which hadn't been rebalanced? I would hazard the guess that it might learn the suppressed capabilities relatively rapidly. This would be very bad for safety purposes if you thought you had trained a safe model you could release publicly, say, one which did all sorts of useful things but couldn't be made to do dangerous new things, when in fact all you had done was create a capabilities overhang for the first person to come along and unlock it by finetuning.

We could probably use a term or a phrase for this concept since it keeps coming up and is a fundamental problem. How about:

Any model simple enough to be interpretable is too simple to be useful.

Corollary:

Any model which appears both useful and interpretable is uninterpretable.

If anyone was wondering whether DM planned to follow it up in the obvious way, because of the obvious implications of its obvious generality and obvious scalability, Hassabis says on the Lex Fridman podcast: "It's just the beginning really; it's our most general agent, one could call it, so far, but you know, that itself can be scaled up massively, more than we've done so far; obviously we're in the middle of doing that."
