Same person as nostalgebraist2point0, but now I have my account back.
My current understanding is that all major AI labs have already figured out the Chinchilla results on their own, but that younger or less in-the-loop AI orgs may have needed to run experiments that took a couple months of staff time. This post was one of the most-read posts on LW this month, and was shared heavily around Twitter. It's plausible to me that spreading these arguments speeds up AI timelines by 1-4 weeks on average.
What is the mechanism you're imagining for this speedup? What happens that would not have happened without this post?
I'm struggling to imagine a situation where a relevant AI org is doing Chinchilla-like scaling experiments, yet somehow has managed to miss this paper (or to ignore/misunderstand it) for 4+ months. The paper is not exactly a secret, and it's not even especially difficult to read as these things go.
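For readers who haven't internalized the paper's headline result: under the common C ≈ 6ND approximation for training compute, the Chinchilla fit prescribes roughly 20 training tokens per parameter. A minimal sketch (the 20-tokens-per-parameter figure is the usual rounded summary, not an exact constant):

```python
# Back-of-envelope Chinchilla allocation, assuming C ~= 6*N*D and the
# paper's roughly-20-tokens-per-parameter compute-optimal fit.
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    # C = 6*N*D and D = k*N  =>  N = sqrt(C / (6*k))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla itself used ~5.76e23 FLOPs of training compute
n, d = chinchilla_optimal(5.76e23)
print(f"{n:.2e} params, {d:.2e} tokens")  # roughly 7e10 and 1.4e12
```

This recovers Chinchilla's own 70B-params / 1.4T-tokens configuration, which is the whole point: the prescription is simple enough that any org running scaling experiments would stumble onto it.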
More broadly, I doubt LW has significant leverage to decrease the overall supply of these kinds of conversations. There are lots of venues for cutting-edge ML discussion, and the conversation is going to happen somewhere. (See Connor's comments here.)
Now I’m inclined to think that just automating most of the tasks in ML research and engineering -- enough to accelerate the pace of AI progress manyfold -- is sufficient.
This seems to assume that human labor is currently the limiting bottleneck in AI research, and by a large multiplicative factor.
That doesn't seem likely to me. Compute is a nontrivial bottleneck even in many small-scale experiments, and in particular is a major bottleneck for research that pushes the envelope of scale, which is generally how new SOTA results and such get made these days.
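One way to make the bottleneck claim precise is an Amdahl's-law-style calculation. The shares below are illustrative placeholders, not estimates from the post or the literature:

```python
# Amdahl-style sketch with illustrative numbers: if some fraction of
# research throughput is gated on compute rather than human labor,
# automating the labor k-fold caps the achievable overall speedup.
def overall_speedup(labor_speedup, compute_share):
    return 1.0 / (compute_share + (1.0 - compute_share) / labor_speedup)

# Even 100x-faster "researchers" yield only ~2x overall progress
# if half the pipeline is compute-bound.
print(overall_speedup(100, 0.5))  # ~1.98
print(overall_speedup(100, 0.1))  # ~9.2
```

The limiting speedup as labor automation goes to infinity is 1 / compute_share, which is why the "manyfold" acceleration claim leans on human labor being the dominant bottleneck.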
To be concrete, consider this discussion of "the pace of AI progress" elsewhere in the post:
But progress on some not-cherry-picked benchmarks was notably faster than what forecasters predicted, so that should be some update toward shorter timelines for me.
That post is about four benchmarks. Of the four, it's mostly MATH and MMLU that are driving the sense of "notably faster progress" here. In both cases, the SOTAs were established by combining relatively simple and mostly non-original techniques with massive compute. Even if you remove the humans entirely, the computers still only go as far as they go.
(Human labor is definitely a bottleneck in making the computers go faster -- like hardware development, but also specialized algorithms for large-scale training. But this is a much more specialized area than "AI research" generally, so there's less available pretraining data on it -- especially since a large[r] fraction of this kind of work is likely to be private IP.)
The correct answer is the annoyingly trivial one: "it would be the best possible model of this type, at the task of language modeling on data sampled from the same distribution as MassiveText."
How good is that, though? Well, it depends entirely on how good you think transformer LMs are capable of being, in principle.
If you're Gary Marcus and you think transformer LMs will always suck in some ways, then you think the 1.69 model will also suck in those ways. Whereas, if you think a perfect transformer LM would be an AGI (even if only trained on MassiveText-like data), then you think the 1.69 model would be an AGI. Both of these people are right, conditional on their other beliefs.
The key distinction here is that "1.69 loss" may not be the best achievable loss on this dataset. It's just an estimate of the best loss achievable by this kind of model.
The question "what would a model be like, if it got the best achievable loss, period?" is more interesting, but nothing in this post or these papers really touches on it.
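For reference, the 1.69 figure is the irreducible term E in Chinchilla's parametric loss fit, L(N, D) = E + A/N^α + B/D^β. The constants below are the paper's fitted values as I recall them; treat them as approximate:

```python
# Chinchilla's parametric loss fit (constants as reported in the paper,
# to my recollection: E=1.69, A=406.4, alpha=0.34, B=410.7, beta=0.28).
def chinchilla_loss(n_params, n_tokens,
                    E=1.69, A=406.4, alpha=0.34, B=410.7, beta=0.28):
    return E + A / n_params**alpha + B / n_tokens**beta

print(chinchilla_loss(70e9, 1.4e12))  # Chinchilla-scale model: ~1.94
print(chinchilla_loss(1e30, 1e30))    # infinite-scale limit: -> E = 1.69
```

So 1.69 is what this family of fits predicts in the limit of infinite parameters and infinite data — a statement about transformer LMs trained this way, not about the best loss achievable on the data, period.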
There are a few things in the calculation that seem wrong to me:
In any case, yeah, this does not seem like a huge amount of data. But there's enough order-of-magnitude fuzziness in the estimate that it does seem like it's worth someone's time to look into more seriously.
I definitely think it makes LM --> AGI less likely, although I didn't think it was very likely to begin with.
I'm not sure that the AI interacting with the world would help, at least with the narrow issue described here.
If we're talking about data produced by humans (perhaps solicited from them by an AI), then we're limited by the timescales of human behavior. The data sources described in this post were produced by millions of humans writing text over the course of decades (in rough order-of-magnitude terms).
All that text was already there in the world when the current era of large LMs began, so large LMs got to benefit from it immediately, "for free." But once it's exhausted, producing more is slow.
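To put rough numbers on "producing more is slow" — every figure below is an illustrative assumption of mine, chosen only to land in the right order of magnitude:

```python
# Illustrative order-of-magnitude sketch: how long would it take humans
# to regenerate a web-scale text corpus from scratch?
writers        = 1e7   # assumed number of people writing publicly
tokens_per_day = 1e3   # assumed ~750 words/day per writer
target_tokens  = 1e13  # rough scale of a large LM training set

days = target_tokens / (writers * tokens_per_day)
print(days / 365)  # ~2.7 years of sustained writing, just to do it once
```

Decades of accumulated writing can be consumed in a single training run, but replenishing it happens at human speed.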
IMO, most people are currently overestimating the potential of large generative models -- including image models like DALLE2 -- because of this fact.
There was all this massive data already sitting around from human activity (the web, Github, "books," Instagram, Flickr, etc) long before ML compute/algorithms were anywhere near the point where they needed more data than that.
When our compute finally began to catch up with our data, we effectively spent all the "stored-up potential energy" in that data all at once, and then confused ourselves into thinking that compute was the only necessary input for the reaction.
But now compute has finally caught up with data, and it wants more. We are forced for the first time to stop thinking of data as effectively infinite and free, and to face the reality of how much time and how many people it took to produce our huge-but-finite store of "data startup capital."
I suppose the AI's interactions with the world could involve soliciting more data of the kind it needs to improve (ie active learning), which is much more valuable per unit than generic data.
I would still be surprised if this approach could get much of anywhere without requiring solicitation-from-humans on a massive scale, but it'd be nice to see a back-of-the-envelope calculation using existing estimates of the benefit of active learning.
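A placeholder version of that back-of-the-envelope, with every number an assumption rather than an estimate from the active-learning literature:

```python
# Sketch: suppose active learning makes each solicited token worth k
# generic tokens. How many people must you solicit, for how long?
def people_needed(target_generic_tokens, k, tokens_per_person_day, days):
    solicited = target_generic_tokens / k  # k-fold value per solicited token
    return solicited / (tokens_per_person_day * days)

# Placeholders: equivalent of 1e13 generic tokens, 10x active-learning
# multiplier, 1e3 tokens/person/day, one year of solicitation.
people = people_needed(1e13, k=10, tokens_per_person_day=1e3, days=365)
print(people)  # ~2.7 million people
```

Even with a generous 10x multiplier, the headcount stays in the millions — consistent with the guess that this still requires solicitation-from-humans on a massive scale.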
I don't have anything especially insightful to contribute, but I wanted to thank you (TurnTrout and Quintin) for this post. I agree with it, and I often find myself thinking things like this when I read alignment posts by others on LW/AF.
When people present frameworks for thinking about AGIs or generic "intelligent agents," I often want to ask them: "are humans expressible in your framework?" Often it seems like the answer is "no."
And a common symptom of this is that the framework cannot express entities with human-level capabilities that are as well aligned with other such agents as humans are with one another. Deception, for example, is much less of a problem for humans in practice than it is claimed to be for AGIs in theory. Yes, we do engage in it sometimes, but we could do it a lot more than (most of us) do. Since this state of affairs is possible, and since it's desirable, it seems important to know how it can be achieved.
Thinking back to the "inconsistency" from the Kaplan et al papers...
It ought to shorten actual timelines, for the reason you say. (Except insofar as data sourcing could actually become a practical problem.)
However, it lengthens the Bio Anchors timeline, because the parameter count in Bio Anchors is fixed. (It's the parameter count of a model that uses about as much inference compute as the brain.)
This is a weird thing about Bio Anchors -- it asks when models will cross a threshold for the compute required to run them, so efficiency improvements of various kinds will lengthen its timeline. It's always waiting for its "sufficiently expensive model" (and it does not care that this model keeps "getting better" in terms of loss/etc as the efficiency improvements roll in).
Anyway, I'd forgotten the prior used for dataset scaling in Bio Anchors, but it's pretty broad (page 39 of part 2), with substantial mass on linear/super-linear scaling. So this news is less relevant than I had thought.
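To spell out the mechanism: with the parameter count N pinned by an inference-compute threshold, the assumed dataset-scaling exponent directly sets the training compute C ≈ 6ND. The numbers below are placeholders of mine, not Bio Anchors' actual values:

```python
# Illustrative: with params N fixed, training compute C ~= 6*N*D depends
# on the assumed dataset-scaling law D = c * N**p.
def training_compute(n_params, p, c):
    return 6.0 * n_params * c * n_params**p

N = 1e14  # placeholder "brain-scale" parameter count
kaplan_like = training_compute(N, p=0.74, c=1.0)   # sublinear data scaling
linear      = training_compute(N, p=1.0,  c=20.0)  # Chinchilla-style D=20N
print(linear / kaplan_like)  # linear scaling demands vastly more compute
```

That's why moving probability mass from sublinear toward linear data scaling makes the "sufficiently expensive model" more expensive, lengthening the Bio Anchors timeline even as it shortens real ones.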
I found this story tough to follow on a technical level, despite being familiar with most of the ideas it cites (and having read many of the papers before).
Like, I've read and re-read the first few sections a number of times, and I still can't come up with a mental model of HXU's structure that fits all of the described facts. By "HXU's structure" I mean things like:
Since I can't answer these questions in a way that makes sense, I also don't know how to read the various lines that describe "HXU" doing something, or attribute mental states to "HXU."
For instance, the thing in "1 Day" that has a world model -- is this a single rollout of the Meta-RNN-ish thing, which developed its world model as it chewed its way along a task sequence? In which case, the world model(s) are being continually discarded (!) at the end of every such rollout and then built anew from scratch in the next one? Are we doing the search problem of finding-a-world-model inside of a second search problem?
Where the outer search is (maybe?) happening through ES, which is stupid and needs gajillions of inner rollouts to get anywhere, even on trivial problems?
If the smart-thing-that-copies-itself called "HXU" is a single such rollout, and the 20XX computers can afford gajillions of such rollouts, then what are the slightly less meta 20XX models like, and why haven't they already eaten the world?
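For concreteness on the rollout-count point, here is a minimal OpenAI-ES-style loop — my own toy sketch, not anything from the story — which already burns thousands of evaluations on a trivial 10-dimensional quadratic:

```python
import numpy as np

# Minimal evolution strategies: estimate a gradient from perturbed
# rollouts only, never touching analytic gradients.
rng = np.random.default_rng(0)

def es_minimize(f, theta, sigma=0.1, lr=0.02, pop=50, iters=200):
    evals = 0
    for _ in range(iters):
        eps = rng.standard_normal((pop, theta.size))
        rewards = np.array([-f(theta + sigma * e) for e in eps])
        evals += pop
        # rank-free reward normalization, then the ES gradient estimate
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        theta = theta + (lr / (pop * sigma)) * eps.T @ rewards
    return theta, evals

theta, evals = es_minimize(lambda x: float(np.sum(x ** 2)), np.ones(10))
print(evals)  # 10,000 rollouts for a 10-dim quadratic
```

Scaling this up to "evolve a smart thing" is exactly where the gajillions of inner rollouts come from, which is why the economics of the slightly-less-meta models matter so much.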
(Less important, but still jumped out at me: in "1 Day," why is HXU doing "grokking" [i.e. overfitting before the phase transition], as opposed to some other kind of discontinuous capability gain that doesn't involve overfitting? Like, sure, I suppose it could be grokking here, but this is another one of those paper references that doesn't seem to be "doing work" to motivate story events.)
I dunno, maybe I'm reading the whole thing more closely or literally than it's intended? But I imagine you intend the ML references to be taken somewhat more "closely" than the namedrops in your average SF novel, given the prefatory material:
grounded in contemporary ML scaling, self-supervised learning, reinforcement learning, and meta-learning research literature
And I'm not alleging that it is "just namedropping like your average SF novel." I'm taking the references seriously. But, when I try to view the references as load-bearing pieces in a structure, I can't make out what that structure is supposed to be.
I'm confused by your notation for feed-forward layers.
What justifies re-using the same labels ("apple" etc.) for the components of both the original basis (1) and the nonlinearity basis (2)?
If we want to express what the individual components of basis (2) mean in terms of the original space, we can either talk about which vectors/semes are mapped to them by A, or which vectors/semes they get mapped to by B.
But your labels don't correspond to either of these interpretations. Instead, it looks like you are following rules of the form "the 4th component of every basis is called 'yum'," which leads you to label a coordinate "yum" even though it's neither mapped from "yum" by A, nor mapped to "yum" by B.
This notation also seems to require the basis (2) to have the same number of elements as (1), which generally will not be the case. In transformers, (2) is typically larger by a factor of 4. The logic of your example, meanwhile, can be expressed using a smaller nonlinearity basis of 3 elements:
with some arbitrary choices about which multiplicative constants to absorb into A and a vs. which to absorb into B.
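To make the labeling point concrete, here is a minimal numpy sketch of such a layer with a 3-element nonlinearity basis. The matrices and the non-"apple"/"yum" seme names are my own illustrative choices, not a reconstruction of the parent post's actual example:

```python
import numpy as np

# Residual-stream basis (1) with 4 illustrative "semes"; only "apple"
# and "yum" are labels from the discussion above, the rest are mine.
semes = ["apple", "seme_b", "seme_c", "yum"]
d_model, d_hidden = 4, 3           # basis (2) need not match basis (1) in size

A = np.zeros((d_hidden, d_model))  # reads from the residual stream
B = np.zeros((d_model, d_hidden))  # writes back to the residual stream
A[0, 0] = A[1, 1] = A[2, 2] = 1.0  # hidden unit i fires on seme i
B[3, 0] = B[3, 1] = B[3, 2] = 1.0  # every hidden unit writes into "yum"

def ffn(x):
    # hidden coordinates are meaningfully labeled by what A maps to them,
    # or by what B maps them to -- not by their position in the basis
    return B @ np.maximum(A @ x, 0.0)

x = np.eye(d_model)[0]  # pure "apple" input
print(ffn(x))           # all output lands on the "yum" coordinate
```

Here no hidden coordinate deserves the label "yum" by position: each one is "reads seme i, writes yum," and there are only 3 of them against 4 residual-stream semes.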