Same person as nostalgebraist2point0, but now I have my account back.
Thinking back to the "inconsistency" from the Kaplan et al papers...
It ought to shorten actual timelines, for the reason you say. (Except insofar as data sourcing could actually become a practical problem.)
However, it lengthens the Bio Anchors timeline, because the parameter count in Bio Anchors is fixed. (It's the parameter count of a model that uses about as much inference compute as the brain.)
This is a weird thing about Bio Anchors -- it asks when models will cross a threshold for the compute required to run them, so efficiency improvements of various kinds will lengthen its timeline. It's always waiting for its "sufficiently expensive model" (and it does not care that this model keeps "getting better" in terms of loss/etc as the efficiency improvements roll in).
Anyway, I'd forgotten the prior used for dataset scaling in Bio Anchors, but it's pretty broad (page 39 of part 2), with substantial mass on linear/super-linear scaling. So this news is less relevant than I had thought.
I found this story tough to follow on a technical level, despite being familiar with most of the ideas it cites (and having read many of the papers before).
Like, I've read and re-read the first few sections a number of times, and I still can't come up with a mental model of HXU's structure that fits all of the described facts. By "HXU's structure" I mean things like:
Since I can't answer these questions in a way that makes sense, I also don't know how to read the various lines that describe "HXU" doing something, or attribute mental states to "HXU."
For instance, the thing in "1 Day" that has a world model -- is this a single rollout of the Meta-RNN-ish thing, which developed its world model as it chewed its way along a task sequence? In which case, the world model(s) are being continually discarded (!) at the end of every such rollout and then built anew from scratch in the next one? Are we doing the search problem of finding-a-world-model inside of a second search problem?
Where the outer search is (maybe?) happening through ES, which is stupid and needs gajillions of inner rollouts to get anywhere, even on trivial problems?
If the smart-thing-that-copies-itself called "HXU" is a single such rollout, and the 20XX computers can afford gajillions of such rollouts, then what are the slightly less meta 20XX models like, and why haven't they already eaten the world?
(Less important, but still jumped out at me: in "1 Day," why is HXU doing "grokking" [i.e. overfitting before the phase transition], as opposed to some other kind of discontinuous capability gain that doesn't involve overfitting? Like, sure, I suppose it could be grokking here, but this is another one of those paper references that doesn't seem to be "doing work" to motivate story events.)
I dunno, maybe I'm reading the whole thing more closely or literally than it's intended? But I imagine you intend the ML references to be taken somewhat more "closely" than the namedrops in your average SF novel, given the prefatory material:
grounded in contemporary ML scaling, self-supervised learning, reinforcement learning, and meta-learning research literature
And I'm not alleging that it is "just namedropping like your average SF novel." I'm taking the references seriously. But, when I try to view the references as load-bearing pieces in a structure, I can't make out what that structure is supposed to be.
I'm confused by your notation for feed-forward layers.
What justifies re-using the same labels ("apple" etc.) for
If we want to express what the individual components of basis (2) mean in terms of the original space, we can either talk about which vectors/semes are mapped to them by A, or which vectors/semes they get mapped to by B.
But your labels don't correspond to either of these interpretations. Instead, it looks like you are following rules of the form "the 4th component of every basis is called 'yum'," which leads you to label a coordinate "yum" even though it's neither mapped from "yum" by A, nor mapped to "yum" by B.
This notation also seems to require the basis (2) to have the same number of elements as (1), which generally will not be the case. In transformers, (2) is typically larger by a factor of 4. The logic of your example, meanwhile, can be expressed using a smaller nonlinearity basis of 3 elements:
with some arbitrary choices about which multiplicative constants to absorb into A and a vs. which to absorb into B.
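To make the dimensional point concrete, here's a minimal numpy sketch of a feed-forward layer y = B @ relu(A @ x). The dimensions, matrices, and the "apple" direction are arbitrary choices of mine, just to show that the nonlinearity basis (2) can have a different size than the input basis (1):

```python
import numpy as np

# Arbitrary sizes: the only point is that the nonlinearity basis (2)
# need not match the input basis (1) in dimension.
d_model, d_hidden = 5, 3
rng = np.random.default_rng(0)
A = rng.normal(size=(d_hidden, d_model))  # maps basis (1) -> basis (2)
B = rng.normal(size=(d_model, d_hidden))  # maps basis (2) -> back to basis (1)

def ff(x):
    # elementwise relu sits between the two linear maps
    return B @ np.maximum(A @ x, 0.0)

x = np.zeros(d_model)
x[0] = 1.0  # e.g. the "apple" direction of basis (1)
y = ff(x)
print(y.shape)  # output lives back in the original d_model-dim space
```

Any labeling scheme for the 3 hidden coordinates has to be justified via A's rows or B's columns; nothing forces them to inherit names from the 5 input coordinates.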
I agree with Eliezer's recommendation to double-check results in papers that one finds surprising.
So, I looked into the claim of a 10x - 100x gain for transformers, using Table 2 from the paper. Detailed results are in this Colab.
Briefly, I don't think the claim of 10x - 100x is well supported. Depending on what exactly you compute, you get anywhere from "no speedup" to "over 300x speedup." All the estimates you can make have obvious problems, and all show a massive gap between French and German.
I imagine this question has been investigated much more rigorously outside the original paper. The first Kaplan scaling paper does this for LMs; I dunno who has done it for MT, but I'd be surprised if no one has.

EDIT: something I want to know is why ensembling was popular before transformers, but not after them. If ensembling older models was actually better than scaling them, that would weaken my conclusion a lot.
I don't know if ensembling vs. scaling has been rigorously tested, either for transformers or older models.
It will still only provide a lower bound, yes, but only in the trivial sense that presence is easier to demonstrate than absence.
All experiments that try to assess a capability suffer from this type of directional error, even prototype cases like "giving someone a free-response math test."
The distinction I'm meaning to draw is not that there is no directional error, but that the RL/SL tasks have the right structure: there is an optimization procedure which is "leaving money on the table" if the capability is present yet ends up unused.
I'm glad you liked the post! And, given that you are an avowed "enthusiast," I'm pleasantly surprised that we agree about as many things as we do.
The second [source of discontinuous performance scaling] is that many tasks happen over multiple inferential steps where small improvements in single step accuracy translate into large changes in multistep capabilities.
Thanks for pointing out this argument -- I hadn't thought about it before. A few thoughts:
Ordinary text generation is also a multi-step process. (The token length generally isn't fixed in advance, but could be, i.e. we could define a task "write convincingly for N tokens.") So, why does generation quality scale so smoothly?
Part of the answer is that single-token success is not fully binary: there are choices that are suboptimal / "weird" without constituting instant failure. Due to the "delusion" phenomenon, weird choices can pile on themselves and lead to failure, but "weirdness" is a continuous variable so this effect can scale more gradually.
But also, part of the answer must be that generation is relatively easy, with single-token success probabilities very close to 1 even for small models.
(Why is generation easy, when it potentially includes every other task as a subtask? Well, it samples other tasks in proportion to their frequency in natural text, which ≈ their relative volume in pre-training data, which ≈ how easy they are for the model.)
This shows how the relevance of the argument depends on the success probabilities living in the right "transitional regime," like your 90% vs 99% vs 99.9%. More precisely, the argument is relevant at the point where, for a given task and set of model scales, the scaling moves us across this range. I suppose by continuity this has to happen somewhere for any multi-step task, which makes me wonder whether we could "induce" discontinuous scaling for any task by forcing it to be done in a multi-step way.
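The "transitional regime" point is easy to check numerically: with per-step success probability p and N independent steps, overall success is p^N, and small per-step gains swing the multi-step rate from near-zero to near-one:

```python
# Multi-step success under independent per-step failures: p ** N.
# Going from 90% to 99.9% per-step accuracy takes a 100-step task
# from essentially-always-failing to usually-succeeding.
N = 100
for p in (0.9, 0.99, 0.999):
    print(p, p ** N)
```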
Last thought: this might explain why one-step arithmetic scales discontinuously. Suppose it can only be done by some sequential multi-step algorithm (and that this is not true of most tasks). Presumably the model implements the steps along the "time axis" of successive layers. The model has some failure probability at each step, and the argument goes through.
I wonder if your thoughts [on abstract reasoning] have changed since Codex was released after you originally drafted this post.
I didn't update much on Codex. Part of that was because I'd already seen this paper, which strikes me as a comparably impressive feat of abstraction in the code generation domain.
Also, the Codex model available in the API feels very much like GPT in the way it "reasons," and is roughly what I'd expect from a GPT extended to code. It has that same quality where it frequently but not predictably does the right thing, where I often see it doing many separate things right but I can't rely on it doing any one of them stably across all contexts. As with GPT, I get the best results when I stop asking "does it know X or not?" and instead ask "can I express X in a form likely to be common in the training data?"
I'm interested in knowing more about your reasons for thinking that little will come of scaled LLMs' abstract reasoning capabilities.[...] While I agree that language models are very prone to spit out text that looks superficially more like legitimate abstract reasoning than it is [...], why does this imply that they cannot also learn the “real” patterns? What exactly are the "real" patterns?
This is going to get speculative and hand-wavey. I don't know what abstract reasoning really is, any more than anyone does. But I have some ideas :)
First, something I have noticed since I started working with these models is that my own mind contains a module much like GPT, and this module plays a role in my reasoning process.
When I reflect on my own thought processes, they often look like a game played between a GPT-like "babbler" and an evaluating "critic."
The babbler produces an interior monologue that sounds like my own voice, but (unlike when I'm speaking out loud) is only lightly conditioned at best on things like "concepts I want to express." Instead, it just . . . says words that sound like me, making some argument with the confidence I'd have if I actually believed it, but it's not trying to express an idea I already have -- it's just generating text that sounds like me.
I let the babbler run for a while, and then I step back and assess the monologue, asking "does this make sense? is this really a new idea? does this prove too much? can I think of counterexamples?" Like generating code and then checking if it compiles. Most babbler-monologues are rejected by the critic, at which point the babbler tries again, conditioned (in some way I don't understand) on the critic's rejection.
Most of my actually-believed-ideas originated in this game, I think. Also, I often do a short-range, purely linguistic variant of this when I'm writing: I ask the babbler for the next word or phrase, and there are several rounds of "no that doesn't work" before I pick one. Even my mathematical reasoning is often like this, though it also involves other babbler-like modules that eg generate mental imagery which can be interpreted (by the critic) as expressing a mathematical argument.
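The loop I'm describing can be caricatured as a generate-and-test sketch. This is entirely my own toy framing, with canned strings standing in for the babbler and the critic:

```python
import random

random.seed(0)

def babbler(context, rejected):
    """Stand-in for a fluent generative model: proposes monologues,
    avoiding ones the critic has already rejected."""
    options = ["so the claim follows", "which proves too much",
               "hence a counterexample exists", "and this is circular"]
    options = [o for o in options if o not in rejected]
    return random.choice(options)

def critic(candidate):
    """Stand-in for the slow evaluator: 'does this make sense?'"""
    return candidate == "so the claim follows"

rejected = []
while True:
    candidate = babbler("some context", rejected)
    if critic(candidate):
        break
    rejected.append(candidate)  # condition the next attempt on the rejection

print(candidate)
```

The important structural feature is that rejection discards the whole candidate sequence; nothing the babbler said is irrevocably committed until the critic passes it.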
Now, I highly doubt this is the only way that one can do abstract reasoning. (I don't even think that all humans do it like this.) However, this is the source of my intuitions about the components involved in "true abstract reasoning" and how it differs from what LMs tend to do.
When I do "true abstract reasoning" as described above, there is a distinction between timesteps of candidate generation (inner loop), timesteps of candidate evaluation (outer loop), and timesteps of actually selecting the next idea (increments on some passes of the outer loop but not others). This seems important for avoiding "delusive" effects.
I have to run the babbler for a while to even get a coherent idea that's possible to assess. By that point, the babbler is already conditioning on its earlier output in a self-deluding way. Unlike in GPT, though, these earlier outputs are not irrevocably written in stone at the moment we receive the later outputs; the critic is free to reject the entire sequence. With GPT, by the time it would be possible to notice "hey, I'm making a bad argument," it's already ... making a bad argument, and there's no going back.
(I think there's an analogy here to AlphaZero/MuZero's value head vs. its MCTS rollouts, where GPT is like the value head / "intuitive hunches," lacking the slower search wrapper.)
Of course, in principle, you could imagine bundling this entire procedure inside an LM. Indeed, any sufficiently good LM would eventually have to solve the problems this procedure is designed to solve. Why don't I expect transformer LMs to develop this structure internally?
One reason: the existence of my babbler seems like (weak) evidence that it's better to use an LM inside a bigger non-LM algorithm. My babbler itself feels very much like a likelihood-trained causal generative model, with the same virtuosity at surface mimicry, and the same lack of conditioning latents besides its own output. I suspect that making these kinds of models comes naturally to the cerebral cortex, and that if the brain could just implement reasoning end-to-end with such a model, it would have done it that way.
A second reason is ... okay, this is a whole separate point and the comment's already long. I'll try to make this brief.
I think transformer LMs do a lot of what they do through a kind of "compressed memorization" of very large amounts of data. Early on, they learn many different ways that text is regular; some of this may look like "truly learning (eg syntactic) rules." This low-level knowledge allows them to store training sequences in a vastly compressed form. Then, a lot of what they do in training is actual memorization of the data, in a compressed and noisy/interleaved form. Inference looks like mapping the input to the compressed space, and then doing a shallow-ish ensemble in that space over a massive number of texts the input is "reminiscent of" along various dimensions. The huge model capacity allows for a huge ensemble, so many superficial patterns cancel out in the ensemble, while deeper patterns stack.
This perspective is inspired by the way logit lens looks in later layers, by this paper which is similar to logit lens, and also by work like this showing you can extract exact strings from trained models that were only seen a few times in training.
The key point here is that you can compress things you can't yet abstractively understand, using easier things you do understand. I can't use abstractive summarization to compress (say) Grothendieck's EGA, since I don't understand it . . . but I can still run gzip on it, and that goes a long way! Hence, the frontier of the model's apparent abstractive capability will outrun its actual abstractive capability: this frontier consists of texts the model can't compress via facility with their content, but can simply memorize in bulk using easier compression.
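The gzip point is concrete enough to demo: zlib shrinks highly regular text a great deal while understanding none of it. (The sample text below is a made-up stand-in, not actual EGA.)

```python
import zlib

# Repetitive, structured text: zlib exploits the surface regularity
# without any model of what the text means.
text = ("Proposition 4.1.2. Let X be a scheme and let F be a quasi-coherent "
        "sheaf on X. Then the canonical morphism is an isomorphism. ") * 200
raw = text.encode()
compressed = zlib.compress(raw, 9)
print(len(raw), len(compressed))  # bulk compression, zero comprehension
```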
In something like your list sorting example, I suspect the model doesn't "have" an internal list sorter that looks anything like an algorithm. Instead, it has heavily compressed memories of many actual programming tutorials that included short example lists in unsorted and sorted form, and taking an average over these will usually "sort" a short list of small numbers -- with help from low-level abstract operations like "greater than over small numbers," but without any idea that a list can be arbitrary length / can contain any ordered type.
(EDIT to clarify: the context-dependence and flakiness of the capability is how we can tell it's coming from the compressed ensemble. Contrast with the reliability of something like English syntax, which I believe is part of the compressor itself. This is my distinction between abstraction that's "real" and "fake")
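To caricature the list-sorting picture in code (my own toy construction, not a claim about any actual model's internals): a "sorter" that only retrieves memorized tutorial examples works on in-distribution inputs and fails silently elsewhere:

```python
# Hypothetical memorized (unsorted, sorted) pairs from "programming tutorials".
memorized = {
    (3, 1, 2): (1, 2, 3),
    (5, 4): (4, 5),
    (2, 9, 7): (2, 7, 9),
}

def memorized_sort(xs):
    """No algorithm inside: just lookup over compressed 'training' memories."""
    return memorized.get(tuple(xs))

print(memorized_sort([3, 1, 2]))    # looks like sorting: (1, 2, 3)
print(memorized_sort([10**6, -5]))  # out-of-distribution: None
```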
Anyway, I think transformers are very good at this kind of compressive memorization -- but not nearly as good at doing other kinds of computational work, like search or (obviously?) recursion. Like, whenever I think about how to "program" some routine using attn+FFs, I tend to despair. Even simple things often need to be spread across >1 layer/"step" or >1 head, and the number of heads/layers in huge models feels tiny relative to the diversity of abstraction we expect out of them. (See this paper for some actual transformer "programs.")

This is hand-wavey, but my intuition is that the "right abstractions for abstraction" are hard to fit in a transformer or similar modern NN, while memorizing instances of abstraction-use is far cheaper. And yes, eventually, at large enough scale, the models will have to do the right thing or they'll stop progressing. But there is so much more left to smoothly learn with memorization that I think this architectural deficit will be invisible for a long time, over which LMs will continue to (unreliably) imitate abstraction better and better.
I agree with the critiques you make of specific papers (in section 2), but I'm less convinced by your diagnosis that these papers are attempting to manage/combat hype in a misguided way.
IMO, "underclaiming" is ubiquitous in academic papers across many fields -- including fields unrelated to NLP or ML, and fields where there's little to no hype to manage. Why do academics underclaim? Common reasons include:
I suspect 1+2+3 above, rather than hype management, explains the specific mistakes you discuss.
For example, Zhang et al 2020 seems like a case of #2. They cite Jia and Liang as evidence about a problem with earlier models, a problem they are trying to solve with their new method. It would be strange to "manage hype" by saying NLP systems can't do X, and then in the same breath present a new system which you claim does X!
Jang and Lukasiewicz (2021) is also a case of #2, describing a flaw primarily in order to motivate their own proposed fix.
Meanwhile, Xu et al 2020 seems like #3: it's a broad review paper on "adversarial attacks" which gives a brief description of Jia and Liang 2017 alongside brief descriptions of many other results, many of them outside NLP. It's true that the authors should not have used the word "SOTA" here, but it seems more plausible that this is mere sloppiness (they copied other, years-old descriptions of the Jia and Liang result) rather than an attempt to push a specific perspective about NLP.
I think a more useful framing might go something like:
Most complexity measures give roughly similar values for the (relative) complexity of most objects
I'll write mostly about this statement, as I think it's the crux of our disagreement.
The statement may be true as long as we hold the meaning of "objects" constant as we vary the complexity measure.
However, if we translate objects from one mathematical space to another (say by discretizing, or by adding/removing a metric structure), we can't simply say that the complexity measures for space A on the original A-objects inevitably agree with those of space B on the translated B-objects. Whether this is true depends on our choice of translation.
(This is clear in the trivial cases of bad translation where we, say, map every A-object onto the same B-object. Now, obviously, no one would consider this a correct or adequate way to associate A-objects with B-objects. But the example shows that the claim about complexity measures will only hold if our translation is "good enough" in some sense. If we don't have any idea what "good enough" means, something is missing from the story.)
In the problem at hand, the worrying part of the translation from real to boolean inputs is the loss of metric structure. (More precisely, the hand-waviness about what metric structure survives the translation, if any.) Losing the metric destroys the information needed by complexity measures that care about how easy it is to reconstruct an object "close to" the specified one.
Basic information theory doesn't require a metric, only a measure. There's no sense of "getting an output approximately right," only of "getting the exactly right output with high probability." If you care about being approximately right according to some metric, this leads you to rate-distortion theory.
Both of these domains -- information theory without a metric, and with one -- define notions of incompressibility/complexity, but they're different. Consider two distributions on R: (1) a standard normal; (2) the standard normal's density chopped into pieces, with the pieces translated very far apart from one another.
According to basic information theory, these are equally simple/compressible. (They have the same differential entropy, or the same K-L divergence from a uniform distribution if you want to be pedantic.)
But in rate-distortion theory, (1) is way more simple/compressible than (2). If you're coding (2) over a noisy channel, you have to distinguish really hard between (say) a piece that stayed in place at [0, 0.1] and another piece that got translated to [1e8, 1e8 + 0.1]. Whereas if you're coding a standard normal, with its light tails, a 1e8-magnitude mistake is effectively impossible.
If you do all your analysis in the metric-less space, hoping it will cleanly pass over to the metric space at the end, you have no way of distinguishing these two possibilities. When you remove the metric, they're identical. So you have limited power to predict what the rate-distortion theory notion of complexity is going to say, once you put the metric back in.
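Here's a rough numeric illustration of the rate-distortion contrast (my own toy setup, not from the paper): two sample sets with the same local structure, one with half its mass translated out to 1e8, quantized with the same budget of levels:

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) samples concentrated like a standard normal
x1 = rng.normal(size=100_000)
# (2) same local structure, but half the mass translated very far away
x2 = x1.copy()
x2[:50_000] += 1e8

def quantize(x, levels=256):
    """Fixed-rate scalar quantizer: 'levels' grid points over the support."""
    edges = np.linspace(x.min(), x.max(), levels)
    idx = np.clip(np.searchsorted(edges, x), 0, levels - 1)
    return edges[idx]

mse1 = np.mean((x1 - quantize(x1)) ** 2)
mse2 = np.mean((x2 - quantize(x2)) ** 2)
print(mse1, mse2)  # same rate budget, vastly larger distortion for (2)
```

The metric-less analysis can't see any difference between the two cases; the distortion gap only appears once the metric is back in play.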
Like Rohin, I'm not impressed with the information theoretic side of this work.
Specifically, I'm wary of the focus on measuring complexity for functions between finite sets, such as binary functions.
Mostly, we care about NN generalization on problems where the input space is continuous, generally R^n. The authors argue that the finite-set results are relevant to these problems, because one can always discretize R^n to get a finite set. I don't think this captures the kinds of function complexity we care about for NNs.
This is much too coarse a lens for distinguishing NNs from other statistical learning techniques, since all of them are generally going to involve putting a metric on the input space.
Let's see how this goes wrong in the Shannon entropy argument from this paper.
Sort of similar remarks apply to the other complexity measure used by the authors, LZ complexity. Unlike the complexity measure discussed above, this one does implicitly put a structure on the input space (by fixing an enumeration of it, where the inputs are taken to be bit vectors, and the enumeration reads them off in binary).
"Simple" functions in the LZ sense are thus ones that respond to binary vectors in (roughly) a predictable way,. What does it mean for a function to respond to binary vectors in a predictable way? It means that knowing the values of some of the bits provides information about the output, even if you don't know all of them. But since our models are encoding the inputs as binary vectors, we are already setting them up to have properties like this.