Same person as nostalgebraist2point0, but now I have my account back.
I'm don't think this step makes sense:
Then we look at the scaling law chart you just provided us, and we look at those L-shaped indifference curves, and we think: OK, so a task which can't be done for less than 10e15 params is a task which requires 10e15 data points also.
In the picture, it looks like there's something special about having a 1:1 ratio of data to params. But this is a coincidence due to the authors' choice of units.
They define "one data point" as "one token," which is fine. But it seems equally defensible to define "one data point" as "what the model can process in one forward pass," which is ~1e3 tokens. If the authors had chosen that definition in their paper, I would be showing you a picture that looked identical except with different numbers on the data axis, and you would conclude from the picture that the brain should have around 1e12 data points to match its 1e15 params!
To state the point generally, the functional form of the scaling law says nothing about the actual ratio D/N where the indifference curves have their cusps. This depends on your choice of units. And, even if we were careful to use the same units, this ratio could be vastly different for different systems, and people would still say the systems "have the same scaling law." Scaling is about relationships between differences, not relationships between absolute magnitudes.
On the larger topic, I'm pessimistic about our ability to figure out how many parameters the brain has, and even more pessimistic about our ability to understand what a reasonable scale for "a data point" is. This is mostly for "Could a Neuroscientist Understand a Microprocessor?"-type reasons. I would be more interested in an argument that starts with upper/lower bounds that feel absurdly extreme but relatively certain, and then tries to understand if (even) these weak bounds imply anything interesting, rather than an argument that aims for an point estimate or a subjective distribution.
Actually, I think I spoke too soon about the visualization... I don't think your image of L(D) and L(N) is quite right.
Here is what the actual visualization looks like. More blue = lower loss, and I made it a contour plot so it's easy to see indifference curves of the loss.
In these coordinates, L(D) and L(N) are not really straight lines, but they are close to straight lines when we are far from the diagonal line:
To restate my earlier claims...
If either N or D is orders of magnitude larger than the other, then you get close to the same loss you would get from N ~ D ~ (whichever OOM is lower). So, setting eg (N, D) = (1e15, 1e12) would be sort of a waste of N, achieving only slightly lower loss than (N, D) = (1e12, 1e12).
This is what motives the heuristic that you scale D with N, to stay on the diagonal line.
On the other hand, if your goal is to reach some target loss and you have resource constraints, what matters is whichever resource constraint is more restrictive. For example, if we were never able to scale D above 1e12, then we would be stuck achieving a loss similar to GPT-3, never reaching the darkest colors on the graph.
When I said that it's intuitive to think about L(D) and L(N), I mean that I care about which target losses we can reach. And that's going to be set, more or less, by the highest N or the highest D we can reach, whichever is more restrictive.
Asking "what could we do with a N=1e15 model?" (or any other number) is kind of a weird question from the perspective of this plot. It could mean either of two very different situations: either we are in the top right corner with N and D scaled together, hitting the bluest region ... or we are just near the top somewhere, in which case our loss is entirely determined by D and can be arbitrarily low.
In Ajeya's work, this question means "let's assume we're using an N=1e15 model, and then let's assume we actually need that many parameters, which must mean we want to reach the target losses in the upper right corner, and then let's figure out how big D has to be to get there."
So, the a priori choice of N=1e15 is driving the definition of sufficient performance, defined here as "the performance which you could only reach with N=1e15 params".
What feels weird to me -- which you touched on above -- is the way this lets the scaling relations "backset drive" the definition of sufficient quality for AGI. Instead of saying we want to achieve some specific thing, then deducing we would need N=1e15 params to do it... we start with an unspecified goal and the postulate that we need N=1e15 params to reach it, and then derive the goal from there.
You can't have more D than you have compute, in some sense, because D isn't the amount of training examples you've collected, it's the amount you actually use to train... right? So... isn't this a heuristic for managing compute? It sure seemed like it was presented that way.
This is a subtle and confusing thing about the Kaplan et al papers. (It's also the subject of my post that I linked earlier, so I recommend you check that out.)
There are two things in the papers that could be called "optimal compute budgeting" laws:
I said the D vs N law was "not a heuristic for managing compute" because the S vs N law is more directly about compute, and is what the authors mean when they talk about compute optimal budgeting.
However, the D vs N law does tell you about how to spend compute in an indirect way, for the exact reason you say, that D is related to how long you train. Comparing the two laws yields the "breakdown" or "kink point."
Do you agree or disagree? ... I take [you] to mean that you think the human brain could have had almost identical performance with much fewer synapses, since it has much more N than is appropriate given its D?
Sorry, why do you expect I disagree? I think I agree. But also, I'm not really claiming the scaling laws say or don't say anything about the brain, I'm just trying to clarify what they say about (specific kinds of) neural nets (on specific kinds of problems). We have to first understand what they predict about neural nets before we can go on to ask whether those predictions generalize to explain some other area.
Perhaps it would help me if I could visualize it in two dimensions
This part is 100% qualitatively accurate, I think. The one exception is that there are two "optimal compute" lines on the plot with different slopes, for the two laws referred to above. But yeah, I'm saying we won't be on either of those lines, but on the L(N) or the L(D) line.
The scaling laws, IIRC, don't tell us how much data is needed to reach a useful level of performance.
The scaling laws from the Kaplan et al papers do tell you this.
The relevant law is L(N,D), for the early-stopped test loss given parameter count N and data size D. It has the functional form
The result that you should scale D∝N0.74 comes from trying to keep the two terms in this formula about the same size.
This is not exactly a heuristic for managing compute (since D is not dependent on compute, it's dependent on how much data you can source). It's more like a heuristic for ensuring that your problem is the right level of difficulty to show off the power of this model size, as compared to smaller models.
You always can train models that are "too large" on datasets that are "too small" according to the heuristic, and they won't diverge or do poorly or anything. They just won't improve much upon the results of smaller models.
In terms of the above, you are setting N∼1015 and then asking what D ought to be. If the heuristic gives you an answer that seems very high, that doesn't mean the model is "not as data efficient as you expected." Rather, it means that you need a very large dataset if you want a good reason to push the parameter count up to N∼1015 rather than using a smaller model to get almost identical performance.
I find it more intuitive to think about the following, both discussed in the papers:
If the Kaplan et al scaling results are relevant for AGI, I expect one of these two limits to provide the relevant constraint, rather than a careful balance between N and D to ensure we are not in either limit.
Ultimately, we expect AGI to require some specific-if-unknown level of performance (ie crossing some loss threshold LAGI). Ajeya's approach essentially assumes that we'll cross this threshold at a particular value of N, and then further assumes that this will happen in a regime where data and compute limitations are around the same order of magnitude.
I'm not sure why that ought to be true: it seems more likely that one side of the problem will become practically difficult to scale in proportion to the other, after a certain point, and we will essentially hug tight to either the L(N) or the L(D) curve until it hits LAGI.
See also my post here.
I don't think you're completely missing something. This is the active learning approach, which gwern also suggested -- see that thread for more.
I disagree. Transfer learning is practically the entire point. 'Blessings of scale' etc.
Sure -- my point to contrast two cases
Many, including OpenAI, argue that general web crawls are a good way to get high domain diversity for free. This includes domains the research would never have come up with themselves.
If we switch to manually hunting down large specialized datasets, this will definitely help, but we're no longer getting broad domain coverage for free. At best we get broad domain coverage through manual researcher effort and luck, at worst we don't get it at all.
I see your point about active learning "telling us" when we need more data -- that's especially appealing if it can point us to specific domains where more coverage would help.
What scaling curve in L(D)/L(C) could we get with even a simple active learning approach like running a small GPT over Common Crawl and throwing out datapoints which are too easily predicted?
IIUC, this is trying to make L(D) faster by making every data point more impactful (at lowering test loss). This will help if
I can imagine this regime becoming the typical one for non-text modalities like video that have huge data with lots of complex redundancy (which the model will learn to compress).
With text data, though, I'm concerned that (2) will fail soon.
The number of train steps taken by GPT-3 was the same order of magnitude as the size of Common Crawl. I haven't seen convincing evidence that comparably good/diverse text datasets can be constructed which are 10x this size, 100x, etc. The Pile is an interesting experiment, but they're mostly adding large quantities of single-domain text like Github, which is great for those domains but won't help outside them.
Are you saying that GPT-3's training corpus was preprocessed to remove information about the author, title, and publication venue? Or are you only talking about what happens when this info is outside the context window?
No, it's a more philosophical point. Even if such things appear in the context window, they're simply more text, and convey the same kind of information: not "the denotation of these words is factually true," but "these words are part of the text."
For example, the mere appearance of something like
Title: Why GPT wants to mesa-optimize & how we might change this
does not guarantee that the text following it bears that title, or was written by that author. (As I am illustrating right now.)
Of course, one can design datasets where information like this is provided more authoritatively -- say, always at the start of each text, curated for quality, etc. (GPT isn't like that, but Grover and CTRL kind of are, in different ways.)
But even that can only go so far. If the author is "Julius Caesar," does that mean the historical figure, some internet poster with that handle, or any number of other possibilities? A passage of fiction written in a character's voice -- is the appropriate author cue the actual writer (who may have written in many different voices over their career) or the character? (Note that the character is a much better answer to the question "who does this sound like?") And doesn't the date matter too, so we know whether this post in the venue "Less Wrong" was on 2010's LW or 2020's?
Fundamentally, language modeling is about understanding structures in decontextualized blocks of contiguous words. You can try to hack in some sidechannels to provide context, but there's no way they will capture everything needing to locate the text fully in its social, physical, and temporal position within the broader world. And just as a definitional manner, these sidechannels are modifications to "language modeling," which in its purest sense is just about filling in an arbitrary text from substrings of it (and no other information).
My intuition is that small-L lookahead could be close to large-L lookahead in programspace for something like an RNN, but not for GPT-3's transformer architecture.
Yeah, not for transformers I think.
Anyway, the question here isn't whether lookahead will be perfectly accurate, but whether the post-lookahead distribution of next words will allow for improvement over the pre-lookahead distribution.
capybaralet's point about conservation of expected evidence applies here -- GPT is trying to be optimal at next-step prediction, and an optimal next-step predictor should not get improved by lookahead, it should already have those facts priced in to its next-step prediction.
If we then say "the mechanism for pricing them in is doing internal lookahead," then we are imagining that lookahead operating over some predictor that is otherwise good but hasn't priced in lookahead yet. But I don't know why we should imagine the computation would naturally factor this way, when the benefits of lookahead are small and it beam search take a lot of parameters to implement internally.
I'm skeptical that internal beam search would help in language modeling.
Language modeling is like predicting the weather, in the sense that even if you are literally as good as possible at it, your prediction accuracy still degrades rapidly as a function of the number of steps ahead you're looking. So a predictor which seems (and is) frighteningly powerful at some short range L will do little better than random guessing if you chain its predictions up to some small multiple of L.
Weather is like this because of chaotic dynamics. Language modeling is like this because
(a) Text is used to communicate: the writer expects the audience to learn something from the last X% of a text that they couldn't extrapolate from reading the first (100-X)%, or else they'd just stop and not write the remaining X%.
(b) By construction, language modeling gives you nothing to work with except the text itself, so you don't know who produced it or for whom. So even if you were smart enough to guess what any individual human would say next (!), you don't know which human produced the text you're looking at. (Or even whether it was a human at all.)
Thus (IMO), language modeling is not really about thinking ahead to find some "objectively correct" next move as in Chess/Go. It's more about trying to guess what the author of this text will do in the very next step. The author and the LM are almost sure to diverge after a few more steps, so even if the LM had a beam search oracle, I expect it wouldn't find it very useful.
To make the point concrete, I don't think "orange" is necessarily a bad guess here -- among other things, it would be the correct guess if the author were trying to illustrate the point of your example!
And if we were predicting this post itself, the true next token would not be orange or any other word but an ellipsis "...", which seems bizarre from the narrow perspective of the example, but is typical of the wild world LMs operate in. (Which also contains typos, actually-incoherent writers, mangled formatting, the list goes on . . . )
I also thought of PCA/SVD, but I imagine matrix decompositions like these would be misleading here.
What matters here (I think) is not some basis of N_emb orthogonal vectors in embedding space, but some much larger set of ~exp(N_emb) almost orthogonal vectors. We only have 1600 degrees of freedom to tune, but they're continuous degrees of freedom, and this lets us express >>1600 distinct vectors in vocab space as long as we accept some small amount of reconstruction error.
I expect GPT and many other neural models are effectively working in such space of nearly orthogonal vectors, and picking/combining elements of it. A decomposition into orthogonal vectors won't really illuminate this. I wish I knew more about this topic -- are there standard techniques?