Same person as nostalgebraist2point0, but now I have my account back.


Wiki Contributions


My current understanding is that all major AI labs have already figured out the chinchilla results on their own, but that younger or less in-the-loop AI orgs may have needed to run experiments that took a couple months of staff time. This post was one of the most-read posts on LW this month, and shared heavily around twitter. It's plausible to me that spreading these arguments plausibly speeds up AI timelines by 1-4 weeks on average.

What is the mechanism you're imagining for this speedup?  What happens that would not have happened without this post?

Consider that

  • The Chinchilla paper was released over four months ago, on 3/29/22.
  • It did not take long for the paper to get noticed among people interested in ML scaling, including here on LW. 

I'm struggling to imagine a situation where a relevant AI org is doing Chinchilla-like scaling experiments, yet somehow has managed to miss this paper (or to ignore/misunderstand it) for 4+ months.  The paper is not exactly a secret, and it's not even especially difficult to read as these things go.

More broadly, I doubt LW has significant leverage to decrease the overall supply of these kinds of conversations.  There are lots of venues for cutting-edge ML discussion, and the conversation is going to happen somewhere.  (See Connor's comments here.)

Now I’m inclined to think that just automating most of the tasks in ML research and engineering -- enough to accelerate the pace of AI progress manyfold -- is sufficient.

This seems to assume that human labor is currently the limiting bottleneck in AI research, and by a large multiplicative factor.

That doesn't seem likely to me.  Compute is a nontrivial bottleneck even in many small-scale experiments, and in particular is a major bottleneck for research that pushes the envelope of scale, which is generally how new SOTA results and such get made these days.

To be concrete, consider this discussion of "the pace of AI progress" elsewhere in the post:

But progress on some not-cherry-picked benchmarks was notably faster than what forecasters predicted, so that should be some update toward shorter timelines for me.

That post is about four benchmarks.  Of the four, it's mostly MATH and MMLU that are driving the sense of "notably faster progress" here.  The SOTAs for these were established by

  • MATH: Minerva, which used a finetuned PaLM-540B model together with already existing (if, in some cases, relatively recently introduced) techniques like chain-of-thought
  • MMLU: Chinchilla, a model with the same design and (large) training compute cost as the earlier Gopher, but with different hyperparameters chosen through a conventional (if unusually careful) scaling law analysis

In both cases, relatively simple and mostly non-original techniques were combined with massive compute.  Even if you remove the humans entirely, the computers still only go as far as they go.

(Human labor is definitely a bottleneck in making the computers go faster -- like hardware development, but also specialized algorithms for large-scale training.  But this is a much more specialized area than "AI research" generally, so there's less available pretraining data on it -- especially since a large[r] fraction of this kind of work is likely to be private IP.)

The correct answer is the annoyingly trivial one: "it would be the best possible model of this type, at the task of language modeling on data sampled from the same distribution as MassiveText."

How good is that, though?  Well, it depends entirely on how good you think transformer LMs are capable of being, in principle.

If you're Gary Marcus and you think transformer LMs will always suck in some ways, then you think the 1.69 model will also suck in those ways.  Whereas, if you think a perfect transformer LM would be an AGI (even if only trained on MassiveText-like data), then you think the 1.69 model would be an AGI.  Both of these people are right, conditional on their other beliefs.

The key distinction here is that "1.69 loss" may not the best achievable loss on this dataset.  It's just an estimate of the best loss achievable by this kind of model.

The question "what would a model be like, if it got the best achievable loss, period?" is more interesting, but nothing in this post or these papers really touches on it.

Very interesting!

There are a few things in the calculation that seem wrong to me:

  • If I did things right,15 years * (365 days/yr) * (24 hours/day) * (60 mins/hour) * (50 youtube!hours / min) * (60 youtube!mins / youtube!hour) = 24B youtube!minutes, not 200B.
  • I'd expect much less than 100% of Youtube video time to contain speech.  I don't know what a reasonable discount for this would be, though.
  • In the opposite direction, 1% useful seems too low.  IIRC, web scrape quality pruning discards less than 99%, and this data is less messy than a web scrape.

In any case, yeah, this does not seem like a huge amount of data.  But there's enough order-of-magnitude fuzziness in the estimate that it does seem like it's worth someone's time to look into more seriously.

I definitely think it makes LM --> AGI less likely, although I didn't think it was very likely to begin with.

I'm not sure that the AI interacting with the world would help, at least with the narrow issue described here.

If we're talking about data produced by humans (perhaps solicited from them by an AI), then we're limited by the timescales of human behavior.   The data sources described in this post were produced by millions of humans writing text over the course of decades (in rough order-of-magnitude terms).

All that text was already there in the world when the current era of large LMs began, so large LMs got to benefit from it immediately, "for free."  But once it's exhausted, producing more is slow.

IMO, most people are currently overestimating the potential of large generative models -- including image models like DALLE2 -- because of this fact.

There was all this massive data already sitting around from human activity (the web, Github, "books," Instagram, Flickr, etc) long before ML compute/algorithms were anywhere near the point where they needed more data than that.

When our compute finally began to catch up with our data, we effectively spent all the "stored-up potential energy" in that data all at once, and then confused ourselves into thinking that compute was only necessary input for the reaction.

But now compute has finally caught up with data, and it wants more.  We are forced for the first time to stop thinking of data as effectively infinite and free, and to face the reality of how much time and how many people it took to produce our huge-but-finite store of "data startup capital."

I suppose the AI's interactions with the world could involve soliciting more data of the kind it needs to improve (ie active learning), which is much more valuable per unit than generic data.

I would still be surprised if this approach could get much of anywhere without requiring solicitation-from-humans on a massive scale, but it'd be nice to see a back-of-the-envelope calculation using existing estimates of the benefit of active learning.

I don't have anything especially insightful to contribute, but I wanted to thank you (TurnTrout and Quinton) for this post.  I agree with it, and I often find myself thinking things like this when I read alignment posts by others on LW/AF.

When people present frameworks for thinking about AGIs or generic "intelligent agents," I often want to ask them: "are humans expressible in your framework?"  Often it seems like the answer is "no."

And a common symptom of this is that the framework cannot express entities with human-level capabilities that are as well aligned with other such agents are humans are with one another.  Deception, for example, is much less of a problem for humans in practice than it is claimed to be for AGIs in theory.  Yes, we do engage in it sometimes, but we could do it a lot more than (most of us) do.  Since this state of affairs is possible, and since it's desirable, it seems important to know how it can be achieved.

Thinking back to the "inconsistency" from the Kaplan et al papers...

  • In Appendix E of the new paper, we see the loss-vs-compute frontier start to "bend" from a straight line on a log-log plot, with returns to additional compute getting smaller at large scales.
  • I suspect this bending is the transition from the faster "L(C) law" to the slower "L(D) law."
    • A brief recap of that below:
      • Adding more params can help in two ways: it makes your model's loss decline toward its asymptotic minimum faster, and it can lower that minimum itself.
      • As models get bigger, the first effect dies off -- the loss curves converge to a fixed shape, rather than getter ever steeper.  The second effect keeps going, but with it alone, the overall rate of return is lower.
  • Presumably, the learning rate issue in Kaplan et. al. also affected their estimated L(D) law.
    • The issue made Kaplan et al underestimate optimal model performance.  The underestimate was worst when considering models for which the optimal number of training steps was small.
    • The L(D) law came from early stopping experiments.  The early stopping step is lower for smaller data sizes.
    • So the L(D) experiments with smaller D values look artificially bad, relative to the ones with large D values.  Thus the estimated L(D) curve declines faster than the true L(D) curve.
    • If this is correct, then L(D) improves more slowly with data than we had believed.
    • Note that this does contradict the "use more data!" result from the paper -- that is about the relative rate at which N and D affect L(N, D).

It ought to shorten actual timelines, for the reason you say.  (Except insofar as data sourcing could actually become a practical problem.)

However, it lengthens the Bio Anchors timeline, because the parameter count in Bio Anchors is fixed.  (It's the parameter count of a model that uses about as much inference compute as the brain.)

This is a weird thing about Bio Anchors -- it asks when models will cross a threshold for the compute required to run them, so efficiency improvements of various kinds will lengthen its timeline.  It's always waiting for its "sufficiently expensive model" (and it does not care that this model keeps "getting better" in terms of loss/etc as the efficiency improvements roll in).

Anyway, I'd forgotten the prior used for dataset scaling in Bio Anchors, but it's pretty broad (page 39 of part 2), with substantial mass on linear/super-linear scaling.  So this news is less relevant than I had thought.

I found this story tough to follow on a technical level, despite being familiar with most of the ideas it cites (and having read many of the papers before).

Like, I've read and re-read the first few sections a number of times, and I still can't come up with a mental model of HXU's structure that fits all of the described facts.  By "HXU's structure" I mean things like:

  • The researcher is running an "evolutionary search in auto-ML" method.  How many nested layers of inner/outer loop does this method (explicitly) contain?
  • Where in the nested structure are (1) the evolutionary search, and (2) the thing that outputs "binary blobs"?
  • Are the "binary blobs" being run like Meta RNNs, ie they run sequentially in multiple environments?
    • I assume the answer is yes, because this would explain what it is that (in the 1 Day section) remembers a "history of observation of lots of random environments & datasets."
  • What is the type signature of the thing-that-outputs-binary-blobs?  What is its input?  A task, a task mixture, something else?
    • Much of the story (eg the "history of observations" passage) makes it sound like we're watching a single Meta-RNN-ish thing whose trajectories span multiple environment/tasks.
    • If this Meta-RNN-ish thing is "a blob," what role is left for the thing-that-outputs-blobs?
    • That is: in that case, the thing-that-outputs-blobs just looks like .  It's simply a constant, we can eliminate it from the description, and we're really just doing optimization over blobs. Presumably that's not the case, so what is going on here?
  • What is it that's made of "GPU primitives"?
    • If the blobs (bytecode?) are being viewed as raw binary sequences and we're flipping their bits, that's a lower level than GPU primitives.
    • If instead the thing-that-outputs-blobs is made of GPU primitives which something else is optimizing over, what is that "something else"?
  • Is the outermost training loop (the explicitly implemented one) using evolutionary search, or (explicit) gradient descent?
    • If gradient descent: then what part of the system is using evolutionary search?
    • If evolutionary search (ES): then how does the outermost loop have a critical batch size?  Is the idea that ES exhibits a trend like eqn. 2.11 in the OA paper, w/r/t population size or something, even though it's not estimating noisy gradients?  Is this true?  (It could be true, and doesn't matter for the story . . . but since it doesn't matter for the story, I don't know why we'd bothering to assume it)
    • Also, if evolutionary search (ES): how is this an extrapolation of 2022 ML trends?  Current ML is all about finding ways to make things differentiable, and then do GD, which Works™.  (And which can be targeted specially by hardware development.  And which is assumed by all the ML scaling laws.  Etc.)  Why are people in 20XX using the "stupidest" optimization process out there, instead?
  • In all of this, which parts are "doing work" to motivate events in the story?
    • Is there anything in "1 Day" onward that wouldn't happen in a mere ginormous GPT / MuZero / whatever, but instead requires this exotic hybrid method?
    • (If the answer is "yes," then that sounds like an interesting implicit claim about what currently popular methods can't do...)

Since I can't answer these questions in a way that makes sense, I also don't know how to read the various lines that describe "HXU" doing something, or attribute mental states to "HXU."

For instance, the thing in "1 Day" that has a world model -- is this a single rollout of the Meta-RNN-ish thing, which developed its world model as it chewed its way along a task sequence?  In which case, the world model(s) are being continually discarded (!) at the end of every such rollout and then built anew from scratch in the next one?  Are we doing the search problem of finding-a-world-model inside of a second search problem?

Where the outer search is (maybe?) happening through ES, which is stupid and needs gajillions of inner rollouts to get anywhere, even on trivial problems?

If the smart-thing-that-copies-itself called "HXU" is a single such rollout, and the 20XX computers can afford gajillions of such rollouts, then what are the slightly less meta 20XX models like, and why haven't they already eaten the world?

(Less important, but still jumped out at me: in "1 Day," why is HXU doing "grokking" [i.e. overfitting before the phase transition], as opposed to some other kind of discontinuous capability gain that doesn't involve overfitting?  Like, sure, I suppose it could be grokking here, but this is another one of those paper references that doesn't seem to be "doing work" to motivate story events.)

I dunno, maybe I'm reading the whole thing more closely or literally than it's intended?  But I imagine you intend the ML references to be taken somewhat more "closely" than the namedrops in your average SF novel, given the prefatory material:

grounded in contemporary ML scaling, self-supervised learning, reinforcement learning, and meta-learning research literature

And I'm not alleging that it is "just namedropping like your average SF novel."  I'm taking the references seriously.  But, when I try to view the references as load-bearing pieces in a structure, I can't make out what that structure is supposed to be.

I'm confused by your notation for feed-forward layers.

What justifies re-using the same labels ("apple" etc.) for

  1. the coordinates of  
  2. the coordinates of , i.e. the basis in which the nonlinearity operates


If we want to express what the individual components of basis (2) mean in terms of the original space, we can either talk about which vectors/semes are mapped to them by , or which vectors/semes they get mapped to by .

But your labels don't correspond to either of these interpretations.  Instead, it looks like you are following rules of the form "the 4th component of every basis is called 'yum'," which leads you to label a coordinate "yum" even though it's neither mapped from "yum" by , nor mapped to "yum" by .

This notation also seems to require the basis (2) to have the same number of elements as (1), which generally will not be the case.  In transformers, (2) is typically larger by a factor of 4.   The logic of your example, meanwhile, can be expressed using a smaller nonlinearity basis of 3 elements:

with some arbitrary choices about which multiplicative constants to absorb into  and  vs. which to absorb into .

Load More