This is a linkpost for https://www.gwern.net/Clippy

    This story was originally posted as a response to this thread.

    It might help to imagine a hard takeoff scenario using only known sorts of NN & scaling effects...

    In A.D. 20XX. Work was beginning. "How are you gentlemen !!"... (Work. Work never changes; work is always hell.)

    Specifically, a MoogleBook researcher has gotten a pull request from Reviewer #2 on his new paper in evolutionary search in auto-ML, for error bars on the auto-ML hyperparameter sensitivity like larger batch sizes, because more can be different and there's high variance in the old runs with a few anomalously high performance values. ("Really? Really? That's what you're worried about?") He can't see why worry, and wonders what sins he committed to deserve this asshole Chinese (given the Engrish) reviewer, as he wearily kicks off yet another HQU experiment...

    Rest of story moved to gwern.net.

    I found this story tough to follow on a technical level, despite being familiar with most of the ideas it cites (and having read many of the papers before).

    Like, I've read and re-read the first few sections a number of times, and I still can't come up with a mental model of HXU's structure that fits all of the described facts.  By "HXU's structure" I mean things like:

    • The researcher is running an "evolutionary search in auto-ML" method.  How many nested layers of inner/outer loop does this method (explicitly) contain?
    • Where in the nested structure are (1) the evolutionary search, and (2) the thing that outputs "binary blobs"?
    • Are the "binary blobs" being run like Meta RNNs, ie they run sequentially in multiple environments?
      • I assume the answer is yes, because this would explain what it is that (in the 1 Day section) remembers a "history of observation of lots of random environments & datasets."
    • What is the type signature of the thing-that-outputs-binary-blobs?  What is its input?  A task, a task mixture, something else?
      • Much of the story (eg the "history of observations" passage) makes it sound like we're watching a single Meta-RNN-ish thing whose trajectories span multiple environments/tasks.
      • If this Meta-RNN-ish thing is "a blob," what role is left for the thing-that-outputs-blobs?
      • That is: in that case, the thing-that-outputs-blobs just looks like a function that ignores its input and always returns the same blob.  It's simply a constant, we can eliminate it from the description, and we're really just doing optimization over blobs. Presumably that's not the case, so what is going on here?
    • What is it that's made of "GPU primitives"?
      • If the blobs (bytecode?) are being viewed as raw binary sequences and we're flipping their bits, that's a lower level than GPU primitives.
      • If instead the thing-that-outputs-blobs is made of GPU primitives which something else is optimizing over, what is that "something else"?
    • Is the outermost training loop (the explicitly implemented one) using evolutionary search, or (explicit) gradient descent?
      • If gradient descent: then what part of the system is using evolutionary search?
      • If evolutionary search (ES): then how does the outermost loop have a critical batch size?  Is the idea that ES exhibits a trend like eqn. 2.11 in the OA paper, w/r/t population size or something, even though it's not estimating noisy gradients?  Is this true?  (It could be true, and doesn't matter for the story . . . but since it doesn't matter for the story, I don't know why we'd bother to assume it.)
      • Also, if evolutionary search (ES): how is this an extrapolation of 2022 ML trends?  Current ML is all about finding ways to make things differentiable, and then do GD, which Works™.  (And which can be targeted specially by hardware development.  And which is assumed by all the ML scaling laws.  Etc.)  Why are people in 20XX using the "stupidest" optimization process out there, instead?
    • In all of this, which parts are "doing work" to motivate events in the story?
      • Is there anything in "1 Day" onward that wouldn't happen in a mere ginormous GPT / MuZero / whatever, but instead requires this exotic hybrid method?
      • (If the answer is "yes," then that sounds like an interesting implicit claim about what currently popular methods can't do...)
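
    To make the questions above concrete, here is a minimal sketch of just one possible reading of the structure: an outer evolutionary search over "blobs," with each blob run as a tiny recurrent unit across a sequence of tasks.  Everything in it is an assumption for illustration (the names `run_blob` and `evolve`, the three-weight blob, the toy guess-the-target task) and nothing here is taken from the story itself.  Under this reading the memory `h` is rebuilt from scratch in every rollout, and the thing-that-outputs-blobs is nothing more than the mutation step.

```python
import math
import random


def run_blob(blob, tasks):
    """Inner loop: one rollout of a tiny recurrent 'blob' over several tasks.

    The hidden state h persists across tasks within the rollout (the
    Meta-RNN-ish part) and is discarded when the rollout ends.
    """
    w_in, w_rec, w_out = blob
    h = 0.0  # fresh memory at the start of every rollout
    total_reward = 0.0
    for target in tasks:
        for _ in range(5):  # a few steps per task
            guess = w_out * h
            error = target - guess
            total_reward -= error * error
            # the only "learning" inside a rollout is this state update
            h = math.tanh(w_rec * h + w_in * error)
    return total_reward


def fitness(blob, task_sequences):
    """Average reward of a blob over a fixed set of task sequences."""
    return sum(run_blob(blob, seq) for seq in task_sequences) / len(task_sequences)


def evolve(generations=20, population=10, sigma=0.3, seed=0):
    """Outer loop: (1+lambda)-style evolutionary search over blobs, no gradients."""
    rng = random.Random(seed)
    # hypothetical task mixture: 4 sequences of 3 scalar targets each
    task_sequences = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(4)]
    parent = [0.0, 0.0, 0.0]
    start_fit = parent_fit = fitness(parent, task_sequences)
    rollouts = len(task_sequences)
    for _ in range(generations):
        # mutate the parent; this is the whole "thing that outputs blobs" here
        children = [[w + rng.gauss(0.0, sigma) for w in parent]
                    for _ in range(population)]
        scored = [(fitness(child, task_sequences), child) for child in children]
        rollouts += population * len(task_sequences)
        best_fit, best = max(scored, key=lambda pair: pair[0])
        if best_fit > parent_fit:  # keep the parent unless a child beats it
            parent, parent_fit = best, best_fit
    return parent, parent_fit, start_fit, rollouts


if __name__ == "__main__":
    blob, fit, start_fit, rollouts = evolve()
    print(f"fitness {start_fit:.3f} -> {fit:.3f} after {rollouts} rollouts")
```

    Even this toy version spends hundreds of inner rollouts on a three-parameter blob, which is the cost concern behind the ES questions above.  Other readings (for example, gradient descent in the outer loop with evolution somewhere inside) would look quite different.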

    Since I can't answer these questions in a way that makes sense, I also don't know how to read the various lines that describe "HXU" doing something, or attribute mental states to "HXU."

    For instance, the thing in "1 Day" that has a world model -- is this a single rollout of the Meta-RNN-ish thing, which developed its world model as it chewed its way along a task sequence?  In which case, the world model(s) are being continually discarded (!) at the end of every such rollout and then built anew from scratch in the next one?  Are we doing the search problem of finding-a-world-model inside of a second search problem?

    Where the outer search is (maybe?) happening through ES, which is stupid and needs gajillions of inner rollouts to get anywhere, even on trivial problems?

    If the smart-thing-that-copies-itself called "HXU" is a single such rollout, and the 20XX computers can afford gajillions of such rollouts, then what are the slightly less meta 20XX models like, and why haven't they already eaten the world?

    (Less important, but still jumped out at me: in "1 Day," why is HXU doing "grokking" [i.e. overfitting before the phase transition], as opposed to some other kind of discontinuous capability gain that doesn't involve overfitting?  Like, sure, I suppose it could be grokking here, but this is another one of those paper references that doesn't seem to be "doing work" to motivate story events.)

    I dunno, maybe I'm reading the whole thing more closely or literally than it's intended?  But I imagine you intend the ML references to be taken somewhat more "closely" than the namedrops in your average SF novel, given the prefatory material:

    "grounded in contemporary ML scaling, self-supervised learning, reinforcement learning, and meta-learning research literature"

    And I'm not alleging that it is "just namedropping like your average SF novel."  I'm taking the references seriously.  But, when I try to view the references as load-bearing pieces in a structure, I can't make out what that structure is supposed to be.

    Curated. I like fiction. I like that this story is fiction. I hope that all stories even at all vaguely like this one remain fiction.