Same person as nostalgebraist2point0, but now I have my account back.


Wiki Contributions


[Link] Training Compute-Optimal Large Language Models

Thinking back to the "inconsistency" from the Kaplan et al papers...

  • In Appendix E of the new paper, we see the loss-vs-compute frontier start to "bend" from a straight line on a log-log plot, with returns to additional compute getting smaller at large scales.
  • I suspect this bending is the transition from the faster "L(C) law" to the slower "L(D) law."
    • A brief recap of that below:
      • Adding more params can help in two ways: it makes your model's loss decline toward its asymptotic minimum faster, and it can lower that minimum itself.
      • As models get bigger, the first effect dies off -- the loss curves converge to a fixed shape, rather than getter ever steeper.  The second effect keeps going, but with it alone, the overall rate of return is lower.
  • Presumably, the learning rate issue in Kaplan et. al. also affected their estimated L(D) law.
    • The issue made Kaplan et al underestimate optimal model performance.  The underestimate was worst when considering models for which the optimal number of training steps was small.
    • The L(D) law came from early stopping experiments.  The early stopping step is lower for smaller data sizes.
    • So the L(D) experiments with smaller D values look artificially bad, relative to the ones with large D values.  Thus the estimated L(D) curve declines faster than the true L(D) curve.
    • If this is correct, then L(D) improves more slowly with data than we had believed.
    • Note that this does contradict the "use more data!" result from the paper -- that is about the relative rate at which N and D affect L(N, D).
[Link] Training Compute-Optimal Large Language Models

It ought to shorten actual timelines, for the reason you say.  (Except insofar as data sourcing could actually become a practical problem.)

However, it lengthens the Bio Anchors timeline, because the parameter count in Bio Anchors is fixed.  (It's the parameter count of a model that uses about as much inference compute as the brain.)

This is a weird thing about Bio Anchors -- it asks when models will cross a threshold for the compute required to run them, so efficiency improvements of various kinds will lengthen its timeline.  It's always waiting for its "sufficiently expensive model" (and it does not care that this model keeps "getting better" in terms of loss/etc as the efficiency improvements roll in).

Anyway, I'd forgotten the prior used for dataset scaling in Bio Anchors, but it's pretty broad (page 39 of part 2), with substantial mass on linear/super-linear scaling.  So this news is less relevant than I had thought.

It Looks Like You're Trying To Take Over The World

I found this story tough to follow on a technical level, despite being familiar with most of the ideas it cites (and having read many of the papers before).

Like, I've read and re-read the first few sections a number of times, and I still can't come up with a mental model of HXU's structure that fits all of the described facts.  By "HXU's structure" I mean things like:

  • The researcher is running an "evolutionary search in auto-ML" method.  How many nested layers of inner/outer loop does this method (explicitly) contain?
  • Where in the nested structure are (1) the evolutionary search, and (2) the thing that outputs "binary blobs"?
  • Are the "binary blobs" being run like Meta RNNs, ie they run sequentially in multiple environments?
    • I assume the answer is yes, because this would explain what it is that (in the 1 Day section) remembers a "history of observation of lots of random environments & datasets."
  • What is the type signature of the thing-that-outputs-binary-blobs?  What is its input?  A task, a task mixture, something else?
    • Much of the story (eg the "history of observations" passage) makes it sound like we're watching a single Meta-RNN-ish thing whose trajectories span multiple environment/tasks.
    • If this Meta-RNN-ish thing is "a blob," what role is left for the thing-that-outputs-blobs?
    • That is: in that case, the thing-that-outputs-blobs just looks like .  It's simply a constant, we can eliminate it from the description, and we're really just doing optimization over blobs. Presumably that's not the case, so what is going on here?
  • What is it that's made of "GPU primitives"?
    • If the blobs (bytecode?) are being viewed as raw binary sequences and we're flipping their bits, that's a lower level than GPU primitives.
    • If instead the thing-that-outputs-blobs is made of GPU primitives which something else is optimizing over, what is that "something else"?
  • Is the outermost training loop (the explicitly implemented one) using evolutionary search, or (explicit) gradient descent?
    • If gradient descent: then what part of the system is using evolutionary search?
    • If evolutionary search (ES): then how does the outermost loop have a critical batch size?  Is the idea that ES exhibits a trend like eqn. 2.11 in the OA paper, w/r/t population size or something, even though it's not estimating noisy gradients?  Is this true?  (It could be true, and doesn't matter for the story . . . but since it doesn't matter for the story, I don't know why we'd bothering to assume it)
    • Also, if evolutionary search (ES): how is this an extrapolation of 2022 ML trends?  Current ML is all about finding ways to make things differentiable, and then do GD, which Works™.  (And which can be targeted specially by hardware development.  And which is assumed by all the ML scaling laws.  Etc.)  Why are people in 20XX using the "stupidest" optimization process out there, instead?
  • In all of this, which parts are "doing work" to motivate events in the story?
    • Is there anything in "1 Day" onward that wouldn't happen in a mere ginormous GPT / MuZero / whatever, but instead requires this exotic hybrid method?
    • (If the answer is "yes," then that sounds like an interesting implicit claim about what currently popular methods can't do...)

Since I can't answer these questions in a way that makes sense, I also don't know how to read the various lines that describe "HXU" doing something, or attribute mental states to "HXU."

For instance, the thing in "1 Day" that has a world model -- is this a single rollout of the Meta-RNN-ish thing, which developed its world model as it chewed its way along a task sequence?  In which case, the world model(s) are being continually discarded (!) at the end of every such rollout and then built anew from scratch in the next one?  Are we doing the search problem of finding-a-world-model inside of a second search problem?

Where the outer search is (maybe?) happening through ES, which is stupid and needs gajillions of inner rollouts to get anywhere, even on trivial problems?

If the smart-thing-that-copies-itself called "HXU" is a single such rollout, and the 20XX computers can afford gajillions of such rollouts, then what are the slightly less meta 20XX models like, and why haven't they already eaten the world?

(Less important, but still jumped out at me: in "1 Day," why is HXU doing "grokking" [i.e. overfitting before the phase transition], as opposed to some other kind of discontinuous capability gain that doesn't involve overfitting?  Like, sure, I suppose it could be grokking here, but this is another one of those paper references that doesn't seem to be "doing work" to motivate story events.)

I dunno, maybe I'm reading the whole thing more closely or literally than it's intended?  But I imagine you intend the ML references to be taken somewhat more "closely" than the namedrops in your average SF novel, given the prefatory material:

grounded in contemporary ML scaling, self-supervised learning, reinforcement learning, and meta-learning research literature

And I'm not alleging that it is "just namedropping like your average SF novel."  I'm taking the references seriously.  But, when I try to view the references as load-bearing pieces in a structure, I can't make out what that structure is supposed to be.

Hard-Coding Neural Computation

I'm confused by your notation for feed-forward layers.

What justifies re-using the same labels ("apple" etc.) for

  1. the coordinates of  
  2. the coordinates of , i.e. the basis in which the nonlinearity operates


If we want to express what the individual components of basis (2) mean in terms of the original space, we can either talk about which vectors/semes are mapped to them by , or which vectors/semes they get mapped to by .

But your labels don't correspond to either of these interpretations.  Instead, it looks like you are following rules of the form "the 4th component of every basis is called 'yum'," which leads you to label a coordinate "yum" even though it's neither mapped from "yum" by , nor mapped to "yum" by .

This notation also seems to require the basis (2) to have the same number of elements as (1), which generally will not be the case.  In transformers, (2) is typically larger by a factor of 4.   The logic of your example, meanwhile, can be expressed using a smaller nonlinearity basis of 3 elements:

with some arbitrary choices about which multiplicative constants to absorb into  and  vs. which to absorb into .

More Christiano, Cotra, and Yudkowsky on AI progress

I agree with Eliezer's recommendation to double-check results in papers that one finds surprising.

So, I looked into the claim of a 10x - 100x gain for transformers, using Table 2 from the paper.  Detailed results are in this Colab.

Briefly, I don't think the claim of 10x - 100x is well supported.  Depending on what exactly you compute, you get anywhere from "no speedup" to "over 300x speedup."  All the estimates you can make have obvious problems, and all show a massive gap between French and German.

In detail:

  • The appearance of a large speedup is heavily affected by the fact that previous SOTAs were ensembles, and ensembling is a very inefficient way to spend compute.
    • In terms of simple BLEU / compute, the efficiency gain from transformers looks about 10x smaller if we compare to non-ensembled older models.
  • Simple BLEU / compute is not a great metric because of diminishing marginal returns.
    • By this metric, the small transformer is ~6x "better" than the big one!
    • By this metric, small transformer has a speedup of ~6x to ~40x, while big transformer has a speedup of ~1x to ~6x.
  • We can try to estimate marginal returns by comparing sizes for transformers, and ensembled vs. not for older methods.
    • This gives a speedup of ~5x for German and ~100x to ~300x for French
    • But this is not an apples-to-apples comparison, as the transformer is scaled while the others are ensembled.

I imagine this question has been investigated much more rigorously outside the original paper.  The first Kaplan scaling paper does this for LMs; I dunno who has done it for MT, but I'd be surprised if no one has.

EDIT: something I want to know is why ensembling was popular before transformers, but not after them.  If ensembling older models was actually better than scaling them, that would weaken my conclusion a lot.

I don't know if ensembling vs. scaling has been rigorously tested, either for transformers or older models.

larger language models may disappoint you [or, an eternally unfinished draft]

It will still only provide a lower bound, yes, but only in the trivial sense that presence is easier to demonstrate than absence.

All experiments that try to assess a capability suffer from this type of directional error, even prototype cases like "giving someone a free-response math test."

  • They know the material, yet they fail the test: easy to imagine (say, if they are preoccupied by some unexpected life event)
  • They don't know the material, yet they ace the test: requires an astronomically unlikely coincidence

The distinction I'm meaning to draw is not that there is no directional error, but that the RL/SL tasks have the right structure: there is an optimization procedure which is "leaving money on the table" if the capability is present yet ends up unused.

larger language models may disappoint you [or, an eternally unfinished draft]

I'm glad you liked the post!  And, given that you are an avowed "enthusiast," I'm pleasantly surprised that we agree about as many things as we do.

The second [source of discontinuous performance scaling] is that many tasks happen over multiple inferential steps where small improvements in single step accuracy translate into large changes in multistep capabilities.

Thanks for pointing out this argument -- I hadn't thought about it before.  A few thoughts:

Ordinary text generation is also a multi-step process.  (The token length generally isn't fixed in advance, but could be, i.e. we could define a task "write convincingly for N tokens.")  So, why does generation quality scale so smoothly?

Part of the answer is that single-token success is not fully binary: there are choices that are suboptimal / "weird" without constituting instant failure.  Due to the "delusion" phenomenon, weird choices can pile on themselves and lead to failure, but "weirdness" is a continuous variable so this effect can scale more gradually. 

But also, part of the answer must be that generation is relatively easy, with single-token success probabilities very close to 1 even for small models.

(Why is generation easy, when it potentially includes every other task as a subtask?  Well, it samples other tasks in proportion to their frequency in natural text, which≈ their relative volume in pre-training data, which≈ how easy they are for the model.)

This shows how the relevance of the argument depends on the success probabilities living in the right "transitional regime," like your 90% vs 99% vs 99.9%.  More precisely, the argument is relevant at the point where, for a given task and set of model scales, the scaling moves us across this range.  I suppose by continuity this has to happen somewhere for any multi-step task, which makes me wonder whether we could "induce" discontinuous scaling for any task by forcing it to be done in a multi-step way.

Last thought: this might explain why one-step arithmetic scales discontinuously.  Suppose it can only be done by some sequential multi-step algorithm (and that this is not true of most tasks).  Presumably the model implements the steps along the "time axis" of successive layers.  The model has some failure probability at each step, and the argument goes through.

I wonder if your thoughts [on abstract reasoning] have changed since Codex was released after you originally drafted this post.

I didn't update much on Codex.  Part of that was because I'd already seen this paper, which strikes me as a comparably impressive feat of abstraction in the code generation domain.

Also, the Codex model available in the API feels very much like GPT in the way it "reasons," and is roughly what I'd expect from a GPT extended to code.  It has that same quality where it frequently but not predictably does the right thing, where I often see it doing many separate things right but I can't rely on it doing any one of them stably across all contexts.  As with GPT, I get the best results when I stop asking "does it know X or not?" and instead ask "can I express X in a form likely to be common in the training data?"

I'm interested in knowing more about your reasons for thinking that little will come of scaled LLMs' abstract reasoning capabilities.

[...] While I agree that language models are very prone to spit out text that looks superficially more like legitimate abstract reasoning than it is [...], why does this imply that they cannot also learn the “real” patterns? What exactly are the "real" patterns?

This is going to get speculative and hand-wavey.  I don't know what abstract reasoning really is, any more than anyone does.  But I have some ideas :)

First, something I have noticed since I started working with these models is that my own mind contains a module much like GPT, and this module plays a role in my reasoning process.

When I reflect on my own thought processes, they often look like a game played between a GPT-like "babbler" and an evaluating "critic."

The babbler produces an interior monologue that sounds like my own voice, but (unlike when I'm speaking out loud) is only lightly conditioned at best on things like "concepts I want to express."  Instead, it just . . . says words that sound like me, making some argument with the confidence I'd have if I actually believed it, but it's not trying to express an idea I already have -- it's just generating text that sounds like me.

I let the babbler run for a while, and then I step back and assess the monologue, asking "does this make sense? is this really a new idea? does this prove too much? can I think of counterexamples?"  Like generating code and then checking if it compiles.  Most babbler-monologues are rejected by the critic, at which point the babbler tries again, conditioned (in some way I don't understand) on the critic's rejection.

Most of my actually-believed-ideas originated in this game, I think.  Also, I often do a short-range, purely linguistic variant of this when I'm writing: I ask the babbler for the next word or phrase, and there are several rounds of "no that doesn't work" before I pick one.  Even my mathematical reasoning is often like this, though it also involves other babbler-like modules that eg generate mental imagery which can be interpreted (by the critic) as expressing a mathematical argument.

Now, I highly doubt this is the only way that one can do abstract reasoning.  (I don't even think that all humans do it like this.)  However, this is the source of my intuitions about the components involved in "true abstract reasoning" and how it differs from what LMs tend to do.

When I do "true abstract reasoning" as described above, there is a distinction between timesteps of candidate generation (inner loop), timesteps of candidate evaluation (outer loop), and timesteps of actually selecting the next idea (increments on some passes of the outer loop but not others).  This seems important for avoiding "delusive" effects.

I have to run the babbler for a while to even get a coherent idea that's possible to assess.  By that point, the babbler is already conditioning on its earlier output in a self-deluding way.  Unlike in GPT, though, these earlier outputs are not irrevocably written in stone at the moment we receive the later outputs; the critic is free to reject the entire sequence.  With GPT, by the time it would be possible to notice "hey, I'm making a bad argument," it's already ... making a bad argument, and there's no going back. 

(I think there's an analogy here to AlphaZero/MuZero's value head vs. its MCTS rollouts, where GPT is like the value head / "intuitive hunches," lacking the slower search wrapper.)

Of course, in principle, you could imagine bundling this entire procedure inside an LM.  Indeed, any sufficiently good LM would eventually have to solve the problems this procedure is designed to solve.  Why don't I expect transformer LMs to develop this structure internally?

One reason: the existence of my babbler seems like (weak) evidence that it's better to use an LM inside a bigger non-LM algorithm.

My babbler itself feels very much like a likelihood-trained causal generative model, with the same virtuosity at surface mimicry, and the same lack of conditioning latents besides its own output.  I suspect that making these kinds of models comes naturally to the cerebral cortex, and that if the brain could just implement reasoning end-to-end with such a model, it would have done it that way.

A second reason is ... okay, this is a whole separate point and the comment's already long.  I'll try to make this brief.

I think transformer LMs do a lot of what they do through a kind of "compressed memorization" of very large amounts of data.  Early on, they learn many different ways that text is regular; some of this may look like "truly learning (eg syntactic) rules."  This low-level knowledge allows them to store training sequences in a vastly compressed form.  Then, a lot of what they do in training is actual memorization of the data, in a compressed and noisy/interleaved form.  Inference looks like mapping the input to the compressed space, and then doing a shallow-ish ensemble in that space over a massive number of texts the input is "reminiscent of" along various dimensions.  The huge model capacity allows for a huge ensemble, so many superficial patterns cancel out in the ensemble, while deeper patterns stack.

This perspective is inspired by the way logit lens looks in later layers, by this paper which is similar to logit lens, and also by work like this showing you can extract exact strings from trained models that were only seen a few times in training.

The key point here is that you can compress things you can't yet abstractively understand, using easier things you do understand.  I can't use abstractive summarization to compress (say) Grothendieck's EGA, since I don't understand it . . . but I can still run gzip on it, and that goes a long way!  Hence, the frontier of the model's apparent abstractive capability will outrun its actual abstractive capability: this frontier consists of texts the model can't compress via facility with their content, but can simply memorize in bulk using easier compression.

In something like your list sorting example, I suspect the model doesn't "have" an internal list sorter that looks anything like an algorithm.  Instead, it has heavily compressed memories of many actual programming tutorials that included short example lists in unsorted and sorted form, and taking an average over these will usually "sort" a short list of small numbers -- with help from low-level abstract operations like "greater than over small numbers," but without any idea that a list can be arbitrary length / can contain any ordered type.

(EDIT to clarify: the context-dependence and flakiness of the capability is how we can tell it's coming from the compressed ensemble.  Contrast with the reliability of something like English syntax, which I believe is part of the compressor itself.  This is my distinction between abstraction that's "real" and "fake")

Anyway, I think transformers are very good at this kind of compressive memorization -- but not nearly as good at doing other kinds of computational work, like search or (obviously?) recursion.  Like, whenever I think about how to "program" some routine using attn+FFs, I tend to despair.  Even simple things often to be spread across >1 layer/"step" or >1 head, and the number of heads/layers in huge models feels tiny relative to the diversity of abstraction we expect out of them.  (See this paper for some actual transformer "programs.")

This is hand-wavey, but my intuition is that the "right abstractions for abstraction" are hard to fit in a transformer or similar modern NN, while memorizing instances of abstraction-use is far cheaper.  And yes, eventually, at large enough scale, the models will have to do the right thing or they'll stop progressing.  But there is so much more left to smoothly learn with memorization that I think this architectural deficit will be invisible for a long time, over which LMs will continue to (unreliably) imitate abstraction better and better.

NLP Position Paper: When Combatting Hype, Proceed with Caution

I agree with the critiques you make of specific papers (in section 2), but I'm less convinced by your diagnosis that these papers are attempting to manage/combat hype in a misguided way.

IMO, "underclaiming" is ubiquitous in academic papers across many fields -- including fields unrelated to NLP or ML, and fields where there's little to no hype to manage.  Why do academics underclaim?  Common reasons include:

  1. An incentive to make the existing SOTA seem as bad as possible, to maximize the gap between it and your own new, sparkly, putatively superior method.
    Anyone who's read papers in ML, numerical analysis, statistical inference, computer graphics, etc. is familiar with this phenomenon; there's a reason this tweet is funny.
  2. An incentive to frame one's own work as solving a real, practically relevant problem which is not adequately addressed by existing approaches.  This is related to #1, but tends to affect motivating discussion, whereas #1 affects the presentation of results.
  3. General sloppiness about citations.  Academics rarely do careful background work on the papers they cite, especially once it becomes "conventional" to cite a particular paper in a particular context.  Even retracted papers often go on being cited year after year, with no mention made of the retraction.

I suspect 1+2+3 above, rather than hype management, explains the specific mistakes you discuss.

For example, Zhang et al 2020 seems like a case of #2.  They cite Jia and Liang as evidence about a problem with earlier models, a problem they are trying to solve with their new method.  It would be strange to "manage hype" by saying NLP systems can't do X, and then in the same breath present a new system which you claim does X! 

Jang and Lukasiewicz (2021) is also a case of #2, describing a flaw primarily in order to motivate their own proposed fix.

Meanwhile, Xu et al 2020 seems like #3: it's a broad review paper on "adversarial attacks" which gives a brief description of Jia and Liang 2017 alongside brief descriptions of many other results, many of them outside NLP.  It's true that the authors should not have used the word "SOTA" here, but it seems more plausible that this is mere sloppiness (they copied other, years-old descriptions of the Jia and Liang result) rather than an attempt to push a specific perspective about NLP.

I think a more useful framing might go something like:

  • We know very little about the real capabilities/limits of existing NLP systems.  The literature does not discuss this topic with much care or seriousness; people often cite outdated results, or attach undue significance to correct-but-narrow philosophical points about limitations.
  • This leads to some waste of effort, as people work on solving problems that have already been solved (like trying to "fix" Jia and Liang issues as if it were still 2017).  Note that this is a point NLP researchers ought to care about, whether they are interested in AI safety or not.
  • This is also bad from an AI safety perspective.
  • We should study the capabilities of existing systems, and the likely future trajectory of those capabilities, with more care and precision.
Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Most complexity measures give roughly similar values for the (relative) complexity of most objects


I'll write mostly about this statement, as I think it's the crux of our disagreement.

The statement may be true as long as we hold the meaning of "objects" constant as we vary the complexity measure.

However, if we translate objects from one mathematical space to another (say by discretizing, or adding/removing a metric structure), we can't simply say the complexity measures for space A on the original A-objects inevitably agree with those space B on the translated B-objects.  Whether this is true depends on our choice of translation.

(This is clear in the trivial cases of bad translation where we, say, map every A-object onto the same B-object.  Now, obviously, no one would consider this a correct or adequate way to associate A-objects with B-objects.  But the example shows that the claim about complexity measures will only hold if our translation is "good enough" in some sense.  If we don't have any idea what "good enough" means, something is missing from the story.)

In the problem at hand, the worrying part of the translation from real to boolean inputs is the loss of metric structure.  (More precisely, the hand-waviness about what metric structure survives the translation, if any.)  If there's no metric, this destroys the information needed by complexity measures that care about how easy it is to reconstruct an object "close to" the specified one.

Basic information theory doesn't require a metric, only a measure.  There's no sense of "getting an output approximately right," only of "getting the exactly right output with high probability."  If you care about being approximately right according to some metric, this leads you to rate-distortion theory.

Both of these domains -- information theory without a metric, and with one -- define notions of incompressibility/complexity, but they're different.  Consider two distributions on R:

  1. The standard normal,
  2. The standard normal, but you chop it into a trillion pieces on the x axis, and translate the pieces to arbitrary locations in R

According to basic information theory, these are equally simple/compressible.  (They have the same differential entropy, or the same K-L divergence from a uniform distribution if you want to be pedantic.)

But in rate-distortion theory, (1) is way more simple/compressible than (2).  If you're coding (2) over a noisy channel, you have to distinguish really hard between (say) a piece that stayed in place at [0, 0.1] and another piece that got translated to [1e8, 1e8 + 0.1].  Whereas if you're coding a standard normal, with its light tails, a 1e8-magnitude mistake is effectively impossible.

If you do all your analysis in the metric-less space, hoping it will cleanly pass over to the metric space at the end, you have no way of distinguishing these two possibilities.  When you remove the metric, they're identical.  So you have limited power to predict what the rate-distortion theory notion of complexity is going to say, once you put the metric back in.

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Like Rohin, I'm not impressed with the information theoretic side of this work.

Specifically, I'm wary of the focus on measuring complexity for functions between finite sets, such as binary functions.

Mostly, we care about NN generalization on problems where the input space is continuous, generally R^n.  The authors argue that the finite-set results are relevant to these problems, because one can always discretize R^n to get a finite set.  I don't think this captures the kinds of function complexity we care about for NNs.


  • If  are finite sets, then there are a finite number of functions .   Let's write  for the finite set of such functions.
  • The authors view the counting measure on  -- where every function is equally likely -- as "unbiased."
  • This choice makes sense if  are truly unstructured collections of objects with no intrinsic meaning.
  • However, if there is some extra structure on them like a metric, it's no longer clear that "all functions are equally likely" is the right reference point.
  • Imposing a constraint that functions should use/respect the extra structure, even in some mild way like continuity, may pick out a tiny subset of  relative to the counting measure.
  • Finally, if we pick a measure of simplicity that happens to judge this subset to be unusually simple, then any prior that prefers mildly reasonable functions (eg continuous ones) will look like a simplicity prior.

This is much too coarse a lens for distinguishing NNs from other statistical learning techniques, since all of them are generally going to involve putting a metric on the input space.

Let's see how this goes wrong in the Shannon entropy argument from this paper.

  • The authors consider (a quantity equivalent to) the fraction of inputs in  for which a given function outputs .
  • They consider a function simpler if this fraction is close to 1 or 0, because then it's easier to compress.
  • With the counting measure, "most" functions output  about half of the time.  (Like the binomial distribution -- there are lots of different ways you can flip 5 tails and 5 heads, but only one way to flip 10 heads.)
  • To learn binary functions with an NN, they encode the inputs as binary vectors like .  They study what happens when you feed these to either (A) linear model, or (B) a ReLu stack, with random weights.
  • It turns out that the functions expressed by these models are much more likely than the counting measure to assign a single label ( or ) to most outputs.
  • Why?
    • For an random function on an input space of size , you need to roll  independent random variables.  Each roll affects only one input element.
    • But when you encode the inputs as vectors of length  and feed them into a model, the layers of the model have weights that are also -vectors.  Each of their components affects many input elements at once, in the same direction.  This makes it likely for the judgments to clump towards  or .
    • For example, with the linear model with no threshold, if we roll a weight vector whose elements are all positive, then every input maps to .  This happens a fraction  of the time.  But only one boolean function maps every input to , so the counting measure would give this probability .
    • This doesn't seem like a special property of neural nets.  It just seems like a result of assigning a normed vector space structure to the inputs, and preferring functions that "use" the structure in their labeling rule.  "Using" the structure means any decision you make about how to treat one input element has implications for others (because they're close to it, or point in the same direction, or something).  Thus you have fewer independent decisions to make, and there's a higher probability they all push in the same direction.

Sort of similar remarks apply to the other complexity measure used by authors, LZ complexity.  Unlike the complexity measure discussed above, this one does implicitly put a structure on the input space (by fixing an enumeration of it, where the inputs are taken to be bit vectors, and the enumeration reads them off in binary).

"Simple" functions in the LZ sense are thus ones that respond to binary vectors in (roughly) a predictable way,.  What does it mean for a function to respond to binary vectors in a predictable way?  It means that knowing the values of some of the bits provides information about the output, even if you don't know all of them.  But since our models are encoding the inputs as binary vectors, we are already setting them up to have properties like this.

Load More