Currently, we do not have a good theoretical understanding of how or why neural networks actually work. For example, we know that large neural networks are sufficiently expressive to compute almost any kind of function. Moreover, most functions that fit a given set of training data will not generalise well to new data. And yet, if we train a neural network we will usually obtain a function that gives good generalisation. What is the mechanism behind this phenomenon?

There has been some recent research which (I believe) sheds some light on this issue. I would like to call attention to this blog post:

Neural Networks Are Fundamentally Bayesian

This post provides a summary of the research in these three papers, which provide a candidate for a theory of generalisation:

https://arxiv.org/abs/2006.15191 
https://arxiv.org/abs/1909.11522 
https://arxiv.org/abs/1805.08522

(You may notice that I had some involvement with this research, but the main credit should go to Chris Mingard and Guillermo Valle-Perez!)

I believe that research of this type is very relevant for AI alignment. It seems quite plausible that neural networks, or something similar to them, will be used as a component of AGI. If that is the case, then we want to be able to reliably predict and reason about how neural networks behave in new situations, and how they interact with other systems, and it is hard to imagine how that would be possible without a deep understanding of the dynamics at play when neural networks learn from data. Understanding their inductive bias seems particularly important, since this is the key to understanding everything from why they work in the first place, to phenomena such as adversarial examples, to the risk of mesa-optimisation. I hence believe that it makes sense for alignment researchers to keep an eye on what is happening in this space.

If you want some more stuff to read in this genre, I can also recommend these two posts:

Recent Progress in the Theory of Neural Networks
Understanding "Deep Double Descent"

EDIT: Here is a second post, which talks more about the "prior" of neural networks:

Deep Neural Networks are biased, at initialisation, towards simple functions


Like Rohin, I'm not impressed with the information theoretic side of this work.

Specifically, I'm wary of the focus on measuring complexity for functions between finite sets, such as binary functions.

Mostly, we care about NN generalization on problems where the input space is continuous, generally R^n.  The authors argue that the finite-set results are relevant to these problems, because one can always discretize R^n to get a finite set.  I don't think this captures the kinds of function complexity we care about for NNs.

Consider:

  • If X and Y are finite sets, then there are a finite number of functions X → Y.  Let's write Y^X for the finite set of such functions.
  • The authors view the counting measure on Y^X -- where every function is equally likely -- as "unbiased."
  • This choice makes sense if X and Y are truly unstructured collections of objects with no intrinsic meaning.
  • However, if there is some extra structure on them like a metric, it's no longer clear that "all functions are equally likely" is the right reference point.
  • Imposing a constraint that functions should use/respect the extra structure, even in some mild way like continuity, may pick out a tiny subset of Y^X relative to the counting measure.
  • Finally, if we pick a measure of simplicity that happens to judge this subset to be unusually simple, then any prior that prefers mildly reasonable functions (eg continuous ones) will look like a simplicity prior.

This is much too coarse a lens for distinguishing NNs from other statistical learning techniques, since all of them are generally going to involve putting a metric on the input space.
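
As a toy illustration of how quickly a mild structural constraint thins out Y^X (this example is not from the papers, and "monotonicity" here is just a stand-in for any structure-respecting condition): on the 8-element input space {0,1}^3, requiring only that a function respect the bitwise partial order cuts the 256 possible boolean functions down to 20.

```python
from itertools import product

# Toy illustration: on the 8-element input space {0,1}^3, count how many of the
# 2^8 = 256 boolean functions respect a mild piece of extra structure -- here,
# monotonicity with respect to the bitwise partial order
# (x <= y componentwise implies f(x) <= f(y)).

inputs = list(product([0, 1], repeat=3))  # the 8 possible inputs

def leq(x, y):
    """Componentwise partial order on bit-vectors."""
    return all(a <= b for a, b in zip(x, y))

def is_monotone(f):
    """f is a dict mapping each input tuple to 0 or 1."""
    return all(f[x] <= f[y] for x in inputs for y in inputs if leq(x, y))

total = monotone = 0
for outputs in product([0, 1], repeat=len(inputs)):  # all 256 functions
    f = dict(zip(inputs, outputs))
    total += 1
    monotone += is_monotone(f)

print(f"{monotone} of {total} functions respect the structure")  # 20 of 256
```

Under the counting measure this subset has probability 20/256, and the corresponding fraction shrinks extremely fast as the input space grows, so any prior that prefers structure-respecting functions looks heavily biased relative to the counting-measure baseline.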

Let's see how this goes wrong in the Shannon entropy argument from this paper.

  • The authors consider (a quantity equivalent to) the fraction of inputs in X for which a given function outputs 1.
  • They consider a function simpler if this fraction is close to 1 or 0, because then it's easier to compress.
  • With the counting measure, "most" functions output 1 about half of the time.  (Like the binomial distribution -- there are lots of different ways you can flip 5 tails and 5 heads, but only one way to flip 10 heads.)
  • To learn binary functions with an NN, they encode the inputs as binary vectors like (0, 1, 1, 0, 1, 0, 0).  They study what happens when you feed these to either (A) a linear model, or (B) a ReLU stack, with random weights.
  • It turns out that the functions expressed by these models are much more likely than the counting measure to assign a single label (0 or 1) to most inputs.
  • Why?
    • For a random function on an input space of size 2^n, you need to roll 2^n independent random variables.  Each roll affects only one input element.
    • But when you encode the inputs as binary vectors of length n and feed them into a model, the layers of the model have weights that are also n-vectors.  Each of their components affects many input elements at once, in the same direction.  This makes it likely for the judgments to clump towards 1 or towards 0.
    • For example, with the linear model with no threshold, if we roll a weight vector whose elements are all positive, then every input maps to 1.  This happens a fraction 2^-n of the time.  But only one boolean function maps every input to 1, so the counting measure would give this probability 2^-(2^n).
    • This doesn't seem like a special property of neural nets.  It just seems like a result of assigning a normed vector space structure to the inputs, and preferring functions that "use" the structure in their labeling rule.  "Using" the structure means any decision you make about how to treat one input element has implications for others (because they're close to it, or point in the same direction, or something).  Thus you have fewer independent decisions to make, and there's a higher probability they all push in the same direction.
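
A rough simulation of this clumping effect (the layer sizes, sample counts, and thresholds below are arbitrary illustrative choices, and the exact numbers will vary):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_bits = 7
X = np.array(list(product([0, 1], repeat=n_bits)), dtype=float)  # all 128 inputs
n_samples = 10_000

# For each way of generating a random function on these inputs, record the
# fraction of inputs it labels 1.

# (1) Counting measure: every one of the 2^128 functions equally likely,
#     i.e. each input gets an independent fair coin flip.
counting = rng.integers(0, 2, size=(n_samples, len(X))).mean(axis=1)

# (2) Random linear model with no threshold term: label = 1 iff w . x > 0.
W = rng.standard_normal((n_samples, n_bits))
linear = (X @ W.T > 0).mean(axis=0)

# (3) Random two-layer ReLU network.
relu = []
for _ in range(n_samples):
    W1 = rng.standard_normal((n_bits, 32)) / np.sqrt(n_bits)
    w2 = rng.standard_normal(32) / np.sqrt(32)
    relu.append((np.maximum(X @ W1, 0) @ w2 > 0).mean())
relu = np.array(relu)

for name, fracs in [("counting", counting), ("linear", linear), ("relu", relu)]:
    clumped = np.mean((fracs < 0.05) | (fracs > 0.95))
    print(f"{name:9s} mean={fracs.mean():.2f}  std={fracs.std():.2f}  "
          f"P(frac < 5% or > 95%) = {clumped:.3f}")
```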

Sort of similar remarks apply to the other complexity measure used by the authors, LZ complexity.  Unlike the complexity measure discussed above, this one does implicitly put a structure on the input space (by fixing an enumeration of it, where the inputs are taken to be bit vectors, and the enumeration reads them off in binary).

"Simple" functions in the LZ sense are thus ones that respond to binary vectors in (roughly) a predictable way,.  What does it mean for a function to respond to binary vectors in a predictable way?  It means that knowing the values of some of the bits provides information about the output, even if you don't know all of them.  But since our models are encoding the inputs as binary vectors, we are already setting them up to have properties like this.

Like Rohin, I'm not impressed with the information theoretic side of this work.

Specifically, I'm wary of the focus on measuring complexity for functions between finite sets, such as binary functions.

Mostly, we care about NN generalization on problems where the input space is continuous, generally R^n.  The authors argue that the finite-set results are relevant to these problems, because one can always discretize R^n to get a finite set.  I don't think this captures the kinds of function complexity we care about for NNs.

We’re not saying that discrete complexity measures fully capture what we care about for NNs! We do however think that they are sufficiently relevant to be informative for the bigger picture, even if just as a proxy for what we actually care about.

Most complexity measures give roughly similar values for the (relative) complexity of most objects, so our assumption is that if something is the case for a bunch of different tractable complexity measures, then this is also likely to be the case for whatever the "ideal" complexity measure would be in the relevant case. In particular, if we find that P(f) ≲ 2^-K(f) regardless of whether K represents Boolean complexity, or LZ complexity, etc, then this is also likely to be true for the "ideal" complexity measure for neural networks.
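
As a toy sanity check of the "measures roughly agree" intuition, here is a comparison using zlib's compressed length as a crude LZ proxy and the entropy of the label frequencies as a second proxy; neither is exactly one of the measures from the papers, and the two can disagree on special cases (e.g. a periodic string), but on generic strings the relative ordering tends to match.

```python
import zlib
import numpy as np

def lz_proxy(bits):
    """Crude LZ-style complexity proxy: zlib-compressed length in bytes."""
    return len(zlib.compress(np.asarray(bits, dtype=np.uint8).tobytes()))

def entropy_proxy(bits):
    """Entropy (bits per symbol) of the empirical label frequencies."""
    p = float(np.mean(bits))
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

rng = np.random.default_rng(0)
n = 1024  # think of these as a function's outputs on an enumerated input space
strings = {
    "all zeros":       np.zeros(n, dtype=int),
    "mostly zeros":    (rng.random(n) < 0.05).astype(int),
    "fair coin flips": rng.integers(0, 2, size=n),
}

for name, bits in strings.items():
    print(f"{name:15s} zlib bytes = {lz_proxy(bits):4d}   entropy/bit = {entropy_proxy(bits):.3f}")
```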

Also: since we’re estimating various probabilities by sampling, we basically need to discretise the function space. If you have any concrete suggestions for how to get around this then we’re all ears!

As for the rest of your comment -- what you’re saying here seems true to me, but I’m not sure I see how any of this is a counterpoint to anything we’re saying?

Most complexity measures give roughly similar values for the (relative) complexity of most objects

 

I'll write mostly about this statement, as I think it's the crux of our disagreement.

The statement may be true as long as we hold the meaning of "objects" constant as we vary the complexity measure.

However, if we translate objects from one mathematical space to another (say by discretizing, or adding/removing a metric structure), we can't simply say the complexity measures for space A on the original A-objects inevitably agree with those space B on the translated B-objects.  Whether this is true depends on our choice of translation.

(This is clear in the trivial cases of bad translation where we, say, map every A-object onto the same B-object.  Now, obviously, no one would consider this a correct or adequate way to associate A-objects with B-objects.  But the example shows that the claim about complexity measures will only hold if our translation is "good enough" in some sense.  If we don't have any idea what "good enough" means, something is missing from the story.)

In the problem at hand, the worrying part of the translation from real to boolean inputs is the loss of metric structure.  (More precisely, the hand-waviness about what metric structure survives the translation, if any.)  If there's no metric, this destroys the information needed by complexity measures that care about how easy it is to reconstruct an object "close to" the specified one.

Basic information theory doesn't require a metric, only a measure.  There's no sense of "getting an output approximately right," only of "getting the exactly right output with high probability."  If you care about being approximately right according to some metric, this leads you to rate-distortion theory.

Both of these domains -- information theory without a metric, and with one -- define notions of incompressibility/complexity, but they're different.  Consider two distributions on R:

  1. The standard normal,
  2. The standard normal, but you chop it into a trillion pieces on the x axis, and translate the pieces to arbitrary locations in R

According to basic information theory, these are equally simple/compressible.  (They have the same differential entropy, or the same K-L divergence from a uniform distribution if you want to be pedantic.)

But in rate-distortion theory, (1) is way more simple/compressible than (2).  If you're coding (2) over a noisy channel, you have to distinguish really hard between (say) a piece that stayed in place at [0, 0.1] and another piece that got translated to [1e8, 1e8 + 0.1].  Whereas if you're coding a standard normal, with its light tails, a 1e8-magnitude mistake is effectively impossible.

If you do all your analysis in the metric-less space, hoping it will cleanly pass over to the metric space at the end, you have no way of distinguishing these two possibilities.  When you remove the metric, they're identical.  So you have limited power to predict what the rate-distortion theory notion of complexity is going to say, once you put the metric back in.
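
A toy numerical version of this example (with 1,000 pieces standing in for a trillion, and "how many width-1 bins cover 99% of the samples" standing in for the rate needed at a fixed distortion level):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_pieces = 200_000, 1_000

# (1) A standard normal.
x1 = rng.standard_normal(n)

# (2) The same normal, chopped into 1,000 equal-mass pieces on the x axis,
#     with each piece translated to an arbitrary location in [-1e8, 1e8].
#     This is a rearrangement of the density, so the differential entropy
#     is identical to (1) by construction.
z = rng.standard_normal(n)
edges = np.quantile(z, np.linspace(0, 1, n_pieces + 1))
piece = np.clip(np.searchsorted(edges, z, side="right") - 1, 0, n_pieces - 1)
shifts = rng.uniform(-1e8, 1e8, size=n_pieces)
x2 = z + shifts[piece]

def bins_to_cover(x, width=1.0, mass=0.99):
    """How many width-`width` bins does it take to cover `mass` of the samples?
    log2 of this is a crude stand-in for the bits needed to locate a sample
    to within +/- width/2, i.e. the rate at a fixed distortion level."""
    _, counts = np.unique(np.floor(x / width), return_counts=True)
    counts = np.sort(counts)[::-1]
    return int(np.searchsorted(np.cumsum(counts), mass * len(x)) + 1)

for name, x in [("standard normal", x1), ("scrambled normal", x2)]:
    k = bins_to_cover(x)
    print(f"{name:16s}  bins = {k:5d}   ~bits = {np.log2(k):.1f}")
```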

The work linked in this post was IMO the most important work done on understanding neural networks at the time it came out, and it has also significantly changed the way I think about optimization more generally.

That said, there's a lot of "noise" in the linked papers; it takes some digging to see the key ideas and the data backing them up, and there's a lot of space spent on things which IMO just aren't that interesting at all. So, I'll summarize the things which I consider central.

When optimizing an overparameterized system, there are many many different parameter settings which achieve optimality. Optima are not peaks, they're ridges; there's a whole surface on which optimal performance is achieved. In this regime, the key question is which of the many optima an optimized system actually converges to.

Here's a kind-of-silly way to model it. First, we sample some random point in parameter space from the distribution P[θ]; in the neural network case, this is the parameter initialization. Then, we optimize: we find some new parameter values θ* such that performance on the training data is maximized. But which of the many optimal θ* values does our optimizer end up at? If we didn't know anything about the details of the optimizer, one simple guess would be that θ* is sampled from the initialization distribution, but updated on the point being optimal, i.e.

θ* ~ P[θ | θ is optimal]

... so the net effect of randomly initializing and then optimizing is equivalent to using the initialization distribution as a prior, doing a Bayesian update on θ being optimal, and then sampling from that posterior.
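
To make the comparison concrete, here is a toy sketch (not from the papers): a two-parameter model whose optimum is a whole ridge, comparing the optima reached by gradient descent from random initializations against samples from the prior conditioned on (near-)optimality.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(theta):
    """Overparameterized toy problem: fit y = t1 * t2 * x to the single data
    point (x, y) = (1, 1).  The optimum is the whole ridge t1 * t2 = 1."""
    return (theta[..., 0] * theta[..., 1] - 1.0) ** 2

# Procedure A: gradient descent from a random (prior) initialization.
def run_gd(theta0, lr=0.01, steps=3000):
    t = theta0.copy()
    for _ in range(steps):
        t1, t2 = t
        r = t1 * t2 - 1.0
        t -= lr * np.array([2 * r * t2, 2 * r * t1])  # gradient of the loss
    return t

gd_final = np.array([run_gd(rng.standard_normal(2)) for _ in range(1000)])
gd_final = gd_final[loss(gd_final) < 1e-4]  # keep only runs that reached the ridge

# Procedure B: sample the prior and condition on (near-)optimality.
prior = rng.standard_normal((5_000_000, 2))
bayes = prior[loss(prior) < 1e-4]

# Compare the two distributions over optima, e.g. via quantiles of theta_1.
qs = [0.1, 0.25, 0.5, 0.75, 0.9]
print("theta_1 quantiles, gradient descent: ", np.round(np.quantile(gd_final[:, 0], qs), 2))
print("theta_1 quantiles, conditioned prior:", np.round(np.quantile(bayes[:, 0], qs), 2))
```

Whether, and how closely, the two sets of quantiles agree is exactly the kind of question the linked papers ask empirically for real networks trained with SGD.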

The linked papers show that this kind-of-silly model is basically accurate. It didn't have to be this way a priori; we could imagine that the specifics of SGD favored some points over others, so that the distribution of θ* was not proportional to the prior. But that mostly doesn't happen (and to the extent it does, it's a relatively weak effect); the data shows that θ* values are sampled roughly in proportion to their density in the prior, exactly as we'd expect from the Bayesian-update-on-optimality model.

One implication of this is that the good generalization of neural nets must come mostly from the prior, not from some bias in SGD, because bias in SGD mostly just doesn't impact the distribution of optimized parameters values. The optimized parameter value distribution is approximately-determined by the initialization prior, so any generalization must come from that prior. And indeed, the papers also confirm that the observed generalization error lines up with what we'd expect from the Bayesian-update-on-optimality model.

For me, the most important update from this work has not been specific to neural nets. It's about overparameterized optimization in general: we can think of overparameterized optimization as sampling from the initialization prior updated on optimality, i.e. θ* ~ P[θ | θ is optimal]. This is a great approximation to work with analytically, and the papers here show that it is realistic for real complicated systems like SGD-trained neural nets.

This is great! This definitely does seem to me like strong evidence that SGD is the wrong place to look for understanding neural networks' inductive biases and that we should be focusing more on the architecture instead. I do wonder the extent to which that insight is likely to scale, though—perhaps the more gradient descent you do, the more it starts to look different from random search and the more you see its quirks.

Scott's “How does Gradient Descent Interact with Goodhart?” seems highly relevant here. Perhaps these results could serve as a partial answer to that question, in the sense that SGD doesn't seem to differ very much from random search (with a Gaussian prior on the weights) for deep neural networks on MNIST. I'm not sure how to reconcile that with the other arguments in that post for why it should be different, though, such as the experiments that Peter and I did.

To make sure I am following:

--The stuff linked in this post hypothesizes that simple functions enjoy bigger volume in parameter-space, i.e. there are more possible combinations of neuron weights that add up to a simple function than a complex one. Combined with the plausible hypothesis that real-world data tends to have simple functions that fit it, this explains why neural networks generalize well, and the explanation has nothing to do with SGD and everything to do with the prior, so to speak, implicit in the parameter space.

--However, it doesn't prove this hypothesis, or give a theoretical argument for it of the sort "Here's why, given what we know about math and about complexity theory, simple functions should enjoy bigger volume in parameter-space." Rather, the post (and the related papers) argues that this hypothesis is true because it's an elegant hypothesis that explains the data well (and they have various experiments to back this up.)

--Hence, you think there still remains the mystery of why this hypothesis is true -- why simple functions enjoy bigger volume -- and you think the answer probably has something to do with the architecture rather than SGD, but you link to some stuff about SGD that seems relevant.

I'm not at all sure I'm following, which is why I'm asking. :)

We do have a lot of empirical evidence that simple functions indeed have bigger volumes in parameter-space (exponentially bigger volumes, in fact). See especially the 2nd and 3rd paper linked in the OP.

It's true that we don't have a rigorous mathematical proof for "why" simple functions should have larger volumes in parameter-space, but I can give an intuitive argument for why I think it would make sense to expect this to be the case.

First of all, we can look at the claim "backwards"; suppose a function has a large volume in parameter-space. This means that a lot of the parameters in a sense are redundant, so it should be possible to compress the representation of this function. Conversely, if a function has a small volume in parameter-space, then all the parameters of the network are necessary to pinpoint that function, and so writing out the entire structure of the network might be one of the shortest ways to represent that function. Therefore, we should expect functions with large volumes in parameter-space to be simple, and functions with small volumes to be complex.

There is a mathematical theorem, called the Levin bound, that roughly formalises this intuition. This bound says essentially that if we have a space of functions F, and a space of parameters P, and a parameter-function map m : P -> F, where m itself has low complexity relative to the most complex functions in F (and F is overparameterised by P), then m will be exponentially biased towards simple functions (ie, if an element f of F has low complexity, then there is an exponentially larger number of elements of P that map to f via m).

The Levin bound doesn't apply directly to neural networks, because it assumes that P is finite and discrete, but it gives some extra backing to the intuition above.

Also, check out this.

The Levin bound doesn't apply directly to neural networks, because it assumes that P is finite and discrete, but it gives some extra backing to the intuition above.

In what sense is the parameter space of a neural network not finite and discrete? It is often useful to understand floating-point values as continuous, but in fact they are discrete such that it seems like a theorem which assumes discreteness would still apply.

Yes, it does of course apply in that sense. 

I guess the question then basically is which level of abstraction we think would be the most informative or useful for understanding what's going on here. I mean, we could for example also choose to take into account the fact that any actual computer program runs on a physical computer, which is governed by the laws of electromagnetism (in which case the parameter-space might count as continuous again). 

I'm not sure if accounting for the floating-point implementation is informative or not in this case.

My understanding is that floating-point granularity is enough of a real problem that it does sometimes matter in realistic ML settings, which suggests that it's probably a reasonable level of abstraction on which to analyze neural networks (whereas any additional insights from an electromagnetism-based analysis probably never matter, suggesting that's not a reasonable/useful level of abstraction).

Thanks, this is great! OK, I'm pretty convinced now, but I was biased to believe something like this anyway. I would love to hear what a skeptic thinks.

I'm not sure if I count as a skeptic, but at least for me the only part of this that I find confusing is SGD not making a difference over random search. The fact that simple functions take up a larger volume in parameter space seems obviously true to me and I can't really imagine anyone disagreeing with that part (though I'm still quite glad to have actual analysis to back that up).

Pinging you to see what your current thoughts are! I think that if "SGD is basically equivalent to random search" then that has huge, huge implications.

I guess I would say something like: random search is clearly a pretty good first-order approximation, but there are also clearly second-order effects. I think that exactly how strong/important/relevant those second-order effects are is unclear, however, and I remain pretty uncertain there.

Do you have a good reference for the Levin bound? My attempts at finding a relevant paper all failed.

The second Towards Data Science post references this paper, which is also the main reference through which Levin's result is mentioned in the first paper you post. So I assume reading these references should be enough to get the gist.

There is a proof of it in "An Introduction to Kolmogorov Complexity and Its Applications" by Ming Li & Paul Vitanyi.

Hmmm... I don't know if that's how I would describe what's happening. I would say:

  • The above post provides empirical evidence that there isn't much difference between the generalization performance of “doing SGD on DNNs until you get some level of performance” and “randomly sampling DNN weights until you get some level of performance.”
  • I find that result difficult to reconcile with both theoretical and empirical arguments for why SGD should be different than random sampling, such as the experiments I ran and linked above.
  • The answer to this question, one way or another, is important for understanding the DNN prior, where it's coming from, and what sorts of things it's likely to incentivize—e.g. is there a simplicity bias which might incentivize mesa-optimization or incentivize pseudo-alignment?

OK, thanks! The first bullet point is I think a summary of my first two bullet points, but with different emphasis. I'll check out the experiments you linked.

I'm curious to know what you mean by "focus on the architecture instead." My guess is that if the OP is right, pretty much any neural net architecture will have a simplicity bias.

My guess is that if the OP is right, pretty much any neural net architecture will have a simplicity bias.

In my opinion, I think it's certainly and obviously true that neural net architecture is a huge contributor to inductive biases and comes with a strong simplicity bias. What's surprising to me is that I would expect a similar thing to be true of SGD and yet these results seem to indicate that SGD vs. random search has only a pretty minimal effect on inductive biases.

I'm having trouble parsing your first sentence -- are you saying that yes, pretty much any neural net architecture will have a simplicity bias, but saying also that the biases will be importantly different depending on specifically which architecture you pick?

I think I would have predicted that SGD vs. random search would have a pretty minimal effect on inductive biases. My pet theory for why neural nets have a bias towards simplicity is that there are more ways for neural nets to encode simple functions than complex functions, i.e. larger regions of parameter space for simple functions. As the OP argues (I think) if this is right, then it makes sense that SGD and random search don't affect the bias that much, since larger regions of parameter space will also have larger basins of attraction for SGD to roll down. (As for the justification of my pet theory, well, this is really sketchy but see my top-level comment below)

Important point here: the relevant notion of "simple" is not "low Kolmogorov complexity"; it's more like smoothness. A bias toward "simple" functions, for this notion of "simple", would mainly make interpolation work well, not necessarily other prediction problems.

I'm not sure I agree with this -- I think Kolmogorov complexity is a relevant notion of complexity in this context.

How so? In particular, are you saying that gaussian processes bias towards low Kolmogorov complexity, or are you saying that there's some simplicitly prior here besides what gaussian processes have? If the former, how does a gaussian process bias towards short programs other than the compressibility offered by smoothness?

(Of course smoothness itself implies some compressibility in the Kolmogorov picture; the interesting question is whether there's a bias towards functions which can be compressed in more general ways.)

I'd be interested to hear your take on this comment above, which is talking about compressibility, which to me sounds like kolmogorov complexity.

The things Joar is saying there generally match my intuitions, although there are some things unsaid where my intuitions might diverge more.

There's about a gazillion measures of "simplicity". If we pick some simplicity measure M, the usual rule is that things with low M-complexity have low Kolmogorov complexity, but things with low Kolmogorov complexity don't necessarily have low M-complexity. For instance, if a file can be compressed a lot by a ZIP file encoding, then that file has low Kolmogorov complexity, but not all low Kolmogorov complexity strings can be compressed by ZIP specifically.

In this context, smoothness is one such relevant measure: smooth functions have low Kolmogorov complexity, but there are other ways to have low Kolmogorov complexity without being smooth. I don't know about the Levin bound specifically, but in math these sorts of theorems are usually about smoothness. In complexity theory in particular, theorems often connect smoothness measures to circuit size measures (which are another class of complexity measures which imply low Kolmogorov complexity but not vice versa).

Roughly speaking, if a complexity measure were as general as Kolmogorov complexity, then we'd expect it to be uncomputable. If we can actually find low-complexity functions under a complexity measure M, then that complexity measure is probably less general than Kolmogorov complexity. From there, the natural next question is exactly what generality is kept/lost, and whether the loss of generality actually matters for things in the real world.

Ah, I certainly agree with this.

I do not wish to claim that all functions with low Kolmogorov complexity have large volumes in the parameter-space of a sufficiently large neural network. In fact, I can point to several concrete counterexamples to this claim. To give one example, the identity function certainly has a low Kolmogorov complexity, but it's very difficult for a (fully connected feed-forward) neural network to learn this function (if the input and output are represented in binary form by a bit string). If you try to learn this function by training on only odd numbers then the network will not robustly generalise to even numbers (or vice versa). Similarly, if you train using only numbers in a certain range then the network will not robustly generalise outside this range. This is because a pattern such as "the n'th input neuron is equal to the n'th output neuron" lacks a simple representation in a neural network (and hence this function has a small parameter-space volume, even though it has low Kolmogorov complexity). The same goes for the function that recognises palindromes, and so on.
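
For concreteness, here is a sketch of the kind of experiment described (the architecture, encoding width, and training hyperparameters are illustrative choices, not a specific original setup): train a fully connected network to copy 8-bit inputs, but only show it odd numbers, and then check the bit accuracy on even numbers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def to_bits(ns, width=8):
    """Binary-encode integers as 0/1 vectors of length `width`."""
    return torch.tensor([[(n >> i) & 1 for i in range(width)] for n in ns],
                        dtype=torch.float32)

odd = to_bits(range(1, 256, 2))    # training inputs: odd numbers
even = to_bits(range(0, 256, 2))   # held-out inputs: even numbers

# Fully connected feed-forward net; the target is the input itself (identity).
model = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 8),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(3000):
    opt.zero_grad()
    loss = loss_fn(model(odd), odd)
    loss.backward()
    opt.step()

with torch.no_grad():
    for name, x in [("train (odd) ", odd), ("test  (even)", even)]:
        acc = ((model(x) > 0).float() == x).float().mean().item()
        print(f"{name} bit accuracy: {acc:.3f}")
```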

So, I agree that there are certain functions with low Kolmogorov complexity that a neural network normally cannot "see" properly. I also think one could frame a lot of the research on developing new neural network architectures as being about making neural networks able to "see" more kinds of functions. For example, NALUs give neural networks the ability to "see" arithmetic relations more easily. I hence certainly think it's a very relevant question which complexity measure best describes the bias in neural networks (and I think this actually matters for practical problems). Note that the identity function is very smooth.

This is a bit of a tangent, but the Levin bound is actually about Kolmogorov complexity. It's a fairly simple theorem; the proof is constructive, and basically shows that a given function f which corresponds to many parameters in the parameter-space cannot be too complex, by constructing a simple program which computes f. Very roughly, if the parameter-space is finite and discrete, then we could construct a Huffman code for the function space (where the distribution over the function-space is the distribution that corresponds to the uniform distribution over the parameter-space). We can then make a computer program that computes f by concatenating the Huffman code of f and the parameter-function map m (which gives an upper bound on the Kolmogorov complexity of functions with large volumes). Of course, this theorem does not per se actually apply to neural networks, since it assumes that the parameter-space is finite and discrete, so in this context it's essentially just an intuition pump.
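
Written out, the construction sketched here gives roughly the following (with P finite and discrete, m computable, P(f) the fraction of parameter space mapping to f, and constants absorbed into the O(1) terms):

```latex
% Rough statement of the construction described above.  Assume the parameter
% space P is finite and discrete, the parameter-function map m : P -> F is
% computable, and P(f) denotes the fraction of P that m sends to f.
\begin{align*}
  \ell(f) &\le -\log_2 P(f) + 1
      && \text{(Huffman code for $f$ under the distribution $P$)} \\
  K(f)    &\le \ell(f) + K(m) + O(1)
      && \text{(program: the codeword, plus a decoder built from $m$)} \\
  \Rightarrow \quad P(f) &\le 2^{-K(f) + K(m) + O(1)}
      && \text{(large volume $\Rightarrow$ low complexity)}
\end{align*}
```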

That's a clever example, I like it.

Based on that description, it should be straightforward to generalize the Levin bound to neural networks. The main step would be to replace the Huffman code with a turbocode (or any other near-Shannon-bound code), at which point the compressibility is basically identical to the log probability density, and we can take the limit to continuous function space without any trouble. The main change is that entropy would become relative entropy (as is normal when taking info theory bounds to a continuous limit). Intuitively, it's just using the usual translation between probability theory and minimum description length, and applying it to the probability density of parameter space.

Thanks, that helps. Perhaps an example would be: A purely feed-forward neural network might be "blind" to algorithms that are kolmogorov-simple but which involve repeatedly performing the same procedure a bunch of times (even if it is technically big enough to contain such an algorithm). So the simplicity bias of said network would be importantly different from kolmogorov complexity.

That's exactly right. That exact example would be a case of high circuit size but low Kolmogorov complexity.

Two confusions about what this paper is claiming.

First Confusion: Why does the experimental test involve any test data at all?

It seems like P_SGD(f) and P_B(f) are denoting different things in different places. Writing out the model from the first few sections explicitly:

  • We have a network y = f_θ(x), where θ are the parameters and x, y are input, output
  • Any distribution P[θ] then gives a prior distribution on functions, P[f] = E_θ[ 1[f_θ = f] ], where 1[.] is an indicator function
  • Given the training data S, we can update in two ways: Bayesian update or SGD
  • Bayes update is: P_B[f] = P[f | f is optimal on S]
  • SGD update is: P_SGD[f] = E_θ0[ 1[f_θ*(θ0, S) = f] ], where θ*(θ0, S) is the "optimal" θ-value spit out by SGD on data S when initialized at θ0.

Now for the key confusion: the claim that P_SGD[f] ≈ P_B[f] sounds like it should apply to the definitions above, for all functions f (or at least with high probability), without any reference whatsoever to any test data.

So how does the test data enter?  It sounds like the experiments evaluate the probability assigned to the test labels, P[f(x) = y] for (x, y) pairs from the test set, under the two f-distributions above.  This is a very different notion of "≈" from above, and it's not obvious whether it's relevant at all to understanding DNN performance.  It's asking "are the probabilities assigned to the test data approximately the same?" rather than "are the distributions of trained functions approximately the same?".

My optimistic guess as to what's going on here: the "test set" is completely ignoring the test labels, and instead choosing random labels.  This would be equivalent to comparing P_SGD[f] to P_B[f] at random functions f, but only on a fixed subset of 100 possible X-values (drawn from the true distribution).  If that's what's going on, then it is asking "are the distributions of trained functions approximately the same on realistic X-values?", and it's just confusing to the reader to talk about these random functions as coming from a "test set".  Not a substantive problem, just communication difficulty.

Second Confusion: How does the stated conclusion follow from the figures?

This confusion is more prosaic: I'm not convinced that the conclusion P_SGD[f] ≈ P_B[f] follows from the figures, at least not to enough of an extent to be "strong evidence" against the claim that SGD is the main source of inductive bias.

Many of these figures show awfully large divergence from equality, and I'm not seeing any statistical measure of fit.  Eyeballing them, it's clear that there's a strong enough relation to rule in inductive bias in the network architecture, but that does not rule out inductive bias in SGD as well.  To make the latter claim, there would have to be statistically-small residual inductive bias after accounting for the contribution of the Bayesian posterior P_B[f], and I don't see the paper actually making that case.  I find the claim plausible a priori, but I don't see the right analysis here to provide significant evidence for it.

Chris (author of the linked post) messaged me about these, and I also looked at the full paper and thought about it some more. There are still some stray pieces, but I'm now generally convinced the headline claim is correct: the vast majority of inductive bias comes from the implicit prior on the network's parameter space.

Shortest path I see to that conclusion, from the figures, is a Fermi estimate:

  • figure 5b (in the post; figure 1b in paper) shows that functions with Bayesian posterior above ~10^-10 have typical generalization error on the order of 5% (hard to get more granular than that)
  • figure 5a shows that SGD typically chooses functions with Bayesian posterior above ~10^-10 (remember the x-axis is a cutoff here), and functions with higher Bayesian posterior are generally chosen more often by SGD
  • So, if the Bayesian posterior fully accounts for generalization error and SGD is not contributing any further inductive bias at all, we'd expect to see generalization error on the order of 5% from SGD.
  • Looking at figure 5d, we do indeed see generalization error on the order of 5% from SGD samples (the y-axis is not relevant to the estimate here).

This isn't precise enough to rule out small contributions of SGD to the inductive bias, but it is pretty strong evidence that the Bayesian posterior is the main factor.

Other things I'm updating on, based on the post/paper:

  • This makes it very likely that gaussian processes, as a theoretical model, capture most of what makes DNNs work in practice
  • This makes it very likely that DNNs are "only doing interpolation", in some sense, as opposed to extrapolation. (This already seemed fairly likely based on scaling curves, and the gaussian process model gives us a second line of evidence.)

Yes, I agree with this. 

I should note that SGD definitely does make a contribution to the inductive bias, but this contribution does seem to be quite small compared to the contribution that comes from the implicit prior embedded in the parameter-function map. For example, if you look at Figure 6 in the Towards Data Science post, you can see that different versions of SGD give a slightly different inductive bias. It's also well-known that the inductive bias of neural networks is affected by things like how long you train for, and what step size and batch size you use, etc. However, these effects seem to be quite small compared to the effect that comes from the parameter-function map.

I should also note that I think that the fact that Gaussian processes even work at all already in itself gives us a fairly good reason to expect them to capture most of what makes NNs work in practice. For any given function approximator, if that function approximator is highly expressive then the "null hypothesis" should be that it basically won't generalise at all. The fact that NNs and GPs both work, and the fact that there is a fairly strong correspondence between them, means that it would be kind of weird if they worked for completely different reasons.

I'd be interested to hear your take on why this means NN's are only doing interpolation. What does it mean, to only do interpolation and not extrapolation? I know the toy model definitions of those terms (connecting dots vs. drawing a line off away from your dots) but what does it mean in real life problems? It seems like a fuzzy/graded distinction to me, at best.

Also, if the simplicity prior they use is akin to kolmogorov complexity-based priors, then that means what they are doing is akin to what e.g. Solomonoff Induction does. And I've never heard anyone try to argue that Solomonoff Induction "merely interpolates" before!

I believe Chris has now updated the Towards Data Science blog post to be more clear, based on the conversation you had, but I'll make some clarifications here as well, for the benefit of others:

The key claim, that P_SGD(f) ≈ P_B(f), is indeed not (meant to be) dependent on any test data per se. The test data comes into the picture because we need a way to granularise the space of possible functions if we want to compare these two quantities empirically. If we take "the space of functions" to be all the functions that a given neural network can express on the entire vector space in which it is defined, then there would be an uncountably infinite number of such functions, and any given function would never show up more than once in any kind of experiment we could do. We therefore need a way to lump the functions together into sensible buckets, and we decided to do that by looking at what output the function gives on a set of images not used in training. Stated differently, we look at the partial function that the network expresses on a particular subset of the input vector space, consisting of a bunch of points sampled from the underlying data distribution. So, basically, your optimistic guess is correct -- the test data is only used as a way to lump functions together into a finite number of sensible buckets (and the test labels are not used).
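
As a minimal sketch of this bucketing (with random linear classifiers and randomly generated data standing in for trained networks and real datasets):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def function_bucket(predict_fn, x_heldout):
    """Identify 'the function' with the tuple of its hard predictions on a
    fixed held-out set; its behaviour elsewhere is ignored, and the held-out
    labels play no role."""
    return tuple(int(v) for v in predict_fn(x_heldout))

# Stand-ins for the real setup: random linear classifiers playing the role of
# "networks that fit the training data", on randomly generated data.
x_train = rng.standard_normal((5, 10))
y_train = rng.integers(0, 2, size=5)
x_heldout = rng.standard_normal((100, 10))  # held-out inputs; labels unused

buckets = Counter()
n_kept, n_target = 0, 2000
while n_kept < n_target:
    w = rng.standard_normal(10)
    predict = lambda x: (x @ w > 0).astype(int)
    if np.array_equal(predict(x_train), y_train):  # "zero training error"
        buckets[function_bucket(predict, x_heldout)] += 1
        n_kept += 1

# Empirical probabilities of the most common buckets (partial functions).
for f, count in buckets.most_common(3):
    print(f"bucket {hash(f) & 0xffff:04x}: empirical probability {count / n_kept:.3f}")
```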

The results in Neural Networks Are Fundamentally Bayesian are pretty cool -- that's clever how they were able to estimate the densities.

A couple thoughts on the limitations:

  • There are various priors over functions for which we can calculate the exact posterior. (E.g., Gaussian processes.) However, doing Bayesian inference on these priors doesn't perform as well as neural networks on most datasets. So knowing SGD is Bayesian is only interesting if we also know that the prior is interesting. I think the ideal theoretical result would be to show that SGD on neural nets is an approximation of Solomonoff Induction (or something like SI), and the approximation gets better as the NNs get bigger and deeper. But I have yet to see any theory that connects neural nets/ SGD to something like short programs.
  • If SGD works because it's Bayesian, then making it more Bayesian should make it work better. But according to https://arxiv.org/abs/2002.02405 that's not the case. Lowering the temperature, or taking the MAP (=temperature 0) generalizes better than taking the full Bayesian posterior, as calculated by an expensive MCMC procedure.

We do have empirical data which shows that the neural network "prior" is biased towards low-complexity functions, and some arguments for why it would make sense to expect this to be the case -- see this new blog post, and my comment here. Essentially, low-complexity functions correspond to larger volumes in the parameter-space of neural networks. If functions with large volumes also have large basins of attraction, and if using SGD is roughly equivalent to going down a random basin (weighted by its size), then this would essentially explain why neural networks work.

I haven't seen the paper you link, so I can't comment on it specifically, but I want to note that the claim "SGD is roughly Bayesian" does not imply "Bayesian inference would give better generalisation than SGD". It can simultaneously be the case that the neural network "prior" is biased towards low-complexity functions, that SGD roughly follows the "prior", and that SGD provides some additional bias towards low-complexity functions (over and above what is provided by the "prior"). For example, if you look at Figure 6 in the post I link, you can see that different versions of SGD do provide a slightly different inductive bias. However, this effect seems to be quite small relative to what is provided by the "prior".

More explanation on why I'm not summarizing the paper about Kolmogorov complexity:

I don’t find the Kolmogorov complexity stuff very convincing. In general, algorithmic information theory tends to have arguments of the form “since this measure simulates every computable process, it (eventually) at least matches any particular computable process”. This feels pretty different from normal notions of “simplicity” or “intelligence”, and so I often try to specifically taboo phrases like “Solomonoff induction” or “Kolmogorov complexity” and replace them with something like “by simulating every possible computational process”, and see if the argument still seems convincing. That mostly doesn’t seem to be the case here.

If I try to do this with the arguments here, I get something like:

Since it is possible to compress high-probability events using an optimal code for the probability distribution, you might expect that functions with high probability in the neural network prior can be compressed more than functions with low probability. Since high probability functions are more likely, this means that the more likely functions correspond to shorter programs. Since shorter programs are necessarily more likely in the prior that simulates all possible programs, they should be expected to be better programs, and so generalize well.

This argument just doesn’t sound very compelling to me. It also can be applied to literally any machine learning algorithm; I don’t see why this is specific to neural nets. If this is just meant to explain why it is okay to overparameterize neural nets, then that makes more sense to me, though then I’d say something like “with overparameterized neural nets, many different parameterizations instantiate the same function, and so the ‘effective parameterization’ is lower than you might have thought”, rather than saying anything about Kolmogorov complexity.

(This doesn't capture everything that simplicity is meant to capture -- for example, it doesn't capture the argument that neural networks can express overfit-to-the-training-data models, but those are high complexity and so low likelihood in the prior and so don't happen in general; but as mentioned above I find the Kolmogorov complexity argument for this pretty tenuous.)

This part of the argument is indeed quite general, but it’s not vacuous. For the argument to apply you need it to be the case that the function space is overparameterised, that the parameter-function map has low complexity relative to the functions in the function space, and that parameter-function map is biased in some direction. This will not be the case for all learning algorithms.

But, I do agree that the Levin bound argument doesn’t really capture the “mechanism” of what’s going on here. I can of course only speak for myself (and not the other people involved with this work), but I think of the Levin bound argument as essentially a more formal way to state this intuition. I.e., it is a loose argument for why we needn't be surprised that simple functions have larger measures in parameter-space, even if the argument doesn’t identify all the precise details of how this works in any particular case. For example, while the argument doesn’t apply to all ML algorithms, it does apply to all neural network architectures in exactly the same way, but clearly there are important differences in the inductive bias of eg CNNs and RNNs.

(Also: I don’t think the notion of “low effective parameterisation” really captures what’s going on here, but for the reason you point out yourself.)

I agree it's not vacuous. It sounds like you're mostly stating the same argument I gave but in different words. Can you tell me what's wrong or missing from my summary of the argument?

Since it is possible to compress high-probability events using an optimal code for the probability distribution, you might expect that functions with high probability in the neural network prior can be compressed more than functions with low probability. Since high probability functions are more likely, this means that the more likely functions correspond to shorter programs. Since shorter programs are necessarily more likely in the prior that simulates all possible programs, they should be expected to be better programs, and so generalize well.

(Even if you are talking about the overparameterized case, where this argument is not vacuous and also applies primarily to neural nets and not other ML models, I don't find this argument very compelling a priori, though I agree that based on empirical evidence it is probably mostly correct.)

I agree with your summary. I'm mainly just clarifying what my view is of the strength and overall role of the Algorithmic Information Theory arguments, since you said you found them unconvincing. 

I do however disagree that those arguments can be applied to "literally any machine learning algorithm", although they certainly do apply to a much larger class of ML algorithms than just neural networks. However, I also don't think this is necessarily a bad thing. The picture that the AIT arguments give makes it reasonably unsurprising that you would get the double-descent phenomenon as you increase the size of a model (at small sizes VC-dimensionality mechanisms dominate, but at larger sizes the overparameterisation starts to induce a simplicity bias, which eventually starts to dominate). Since you get double descent in the model size for both neural networks and eg random forests, you should expect there to be some mechanism in common between them (even if the details of course differ from case to case).

Zach's summary for the Alignment Newsletter (just for the SGD as Bayesian sampler paper):

Neural networks have been shown empirically to generalize well in the overparameterized setting, which suggests that there is an inductive bias for the final learned function to be simple. The obvious next question: does this inductive bias come from the _architecture_ and _initialization_ of the neural network, or does it come from stochastic gradient descent (SGD)? This paper argues that it is primarily the former.

Specifically, if the inductive bias came from SGD, we would expect that bias to go away if we replaced SGD with random sampling. In random sampling, we sample an initialization of the neural network, and if it has zero training error, then we’re done, otherwise we repeat.

The authors explore this hypothesis experimentally on the MNIST, Fashion-MNIST, and IMDb movie review databases. They test on variants of SGD, including Adam, Adagrad, and RMSprop. Since actually running rejection sampling for a dataset would take _way_ too much time, the authors approximate it using a Gaussian Process. This is known to be a good approximation in the large width regime.

Results show that the two probabilities are correlated over a wide order of magnitudes for different architectures, datasets, and optimization methods. While correlation isn't perfect over all scales, it tends to improve as the frequency of the function increases. In particular, the top few most likely functions tend to have highly correlated probabilities under both generation mechanisms.

Zach's opinion:

Fundamentally the point here is that generalization performance is explained much more by the neural network architecture, rather than the structure of stochastic gradient descent, since we can see that stochastic gradient descent tends to behave similarly to (an approximation of) random sampling. The paper talks a bunch about things like SGD being (almost) Bayesian and the neural network prior having low Kolmogorov complexity; I found these to be distractions from the main point. Beyond that, approximating the random sampling probability with a Gaussian process is a fairly delicate affair and I have concerns about the applicability to real neural networks.

One way that SGD could differ from random sampling is that SGD will typically only reach the boundary of a region with zero training error, whereas random sampling will sample uniformly within the region. However, in high dimensional settings, most of the volume is near the boundary, so this is not a big deal. I'm not aware of any work that claims SGD uniformly samples from this boundary, but it's worth considering that possibility if the experimental results hold up.

Rohin’s opinion:

I agree with Zach above about the main point of the paper. One other thing I’d note is that SGD can’t have literally the same outcomes as random sampling, since random sampling wouldn’t display phenomena like <@double descent@>(@Deep Double Descent@). I don’t think this is in conflict with the claim of the paper, which is that _most_ of the inductive bias comes from the architecture and initialization.

[Other](https://arxiv.org/abs/1805.08522) [work](https://arxiv.org/abs/1909.11522) by the same group provides some theoretical and empirical arguments that the neural network prior does have an inductive bias towards simplicity. I find those results suggestive but not conclusive, and am far more persuaded by the paper summarized here, so I don’t expect to summarize them.

I have a few comments on this:

Fundamentally the point here is that generalization performance is explained much more by the neural network architecture, rather than the structure of stochastic gradient descent, since we can see that stochastic gradient descent tends to behave similarly to (an approximation of) random sampling. The paper talks a bunch about things like SGD being (almost) Bayesian and the neural network prior having low Kolmogorov complexity; I found these to be distractions from the main point.

The main point, as I see it, is essentially that functions with good generalisation correspond to large volumes in parameter-space, and that SGD finds functions with a probability roughly proportional to their volume. Saying that SGD is “Bayesian” is one way of saying the latter, and the Kolmogorov complexity stuff is a way to formalise some intuitions around the former.

Beyond that, approximating the random sampling probability with a Gaussian process is a fairly delicate affair and I have concerns about the applicability to real neural networks.

This has been done with real neural networks! See this, for example -- they use Gaussian Processes on stuff like Mobilenetv2, Densenet121, and Resnet50. It seems to work well.

One way that SGD could differ from random sampling is that SGD will typically only reach the boundary of a region with zero training error, whereas random sampling will sample uniformly within the region. However, in high dimensional settings, most of the volume is near the boundary, so this is not a big deal. I'm not aware of any work that claims SGD uniformly samples from this boundary, but it's worth considering that possibility if the experimental results hold up.

We have done overtraining, which should allow SGD to penetrate into the region. This doesn’t seem to make much difference for the probabilities we get.

Rohin’s opinion: [...]

I basically agree with what you say here.

The main point, as I see it, is essentially that functions with good generalisation correspond to large volumes in parameter-space, and that SGD finds functions with a probability roughly proportional to their volume.

This seems right, but I'm not sure how that's different from Zach's phrasing of the main point? Zach's phrasing was "SGD approximately equals random sampling", and random sampling finds functions with probability exactly proportional to their volume. Combine that with the fact that empirically we get good generalization and we get the thing you said.

(Maybe you weren't disagreeing with Zach and were just saying the same thing a different way?)

Saying that SGD is “Bayesian” is one way of saying the latter

This feels similar to:

Saying that MLK was a "criminal" is one way of saying that MLK thought and acted as though he had a moral responsibility to break unjust laws and to take direct action.

(This is an exaggeration but I think it is directionally correct. Certainly when I read the title "neural networks are fundamentally Bayesian" I was thinking of something very different.)

the Kolmogorov complexity stuff is a way to formalise some intuitions around the former.

I've discussed this above, I'll continue the discussion there.

The rest of the comment is about stuff that I didn't have a strong opinion on, so I'll leave it for Zach to answer if he wants.

(Maybe you weren't disagreeing with Zach and were just saying the same thing a different way?)

I'm honestly not sure, I just wasn't really sure what he meant when he said that the Bayesian and the Kolmogorov complexity stuff were "distractions from the main point".

This feels similar to:

Saying that MLK was a "criminal" is one way of saying that MLK thought and acted as though he had a moral responsibility to break unjust laws and to take direct action.

(This is an exaggeration but I think it is directionally correct. Certainly when I read the title "neural networks are fundamentally Bayesian" I was thinking of something very different.)

Haha. That's obviously not what we're trying to do here, but I do see what you mean. I originally wanted to express these ideas in more geometric language, rather than probability-theoretic language, but in the end we decided to go for more probability-theoretic language anyway. 

I agree that this arguably could be mildly misleading. For example, the correspondence between SGD and Bayesian sampling only really holds for some initialisation distributions. If you deterministically initialise your neural network to the origin (i.e., all zero weights) then SGD will most certainly not behave like Bayesian sampling with the initialisation distribution as its prior. Then again, the probability-theoretic formulation might make other things more intuitive.