Currently, we do not have a good theoretical understanding of how or why neural networks actually work. For example, we know that large neural networks are sufficiently expressive to compute almost any kind of function. Moreover, most functions that fit a given set of training data will not generalise well to new data. And yet, if we train a neural network we will usually obtain a function that gives good generalisation. What is the mechanism behind this phenomenon?

There has been some recent research which (I believe) sheds some light on this issue. I would like to call attention to this blog post:

Neural Networks Are Fundamentally Bayesian

This post provides a summary of the research in these three papers, which provide a candidate for a theory of generalisation:

https://arxiv.org/abs/2006.15191

https://arxiv.org/abs/1909.11522

https://arxiv.org/abs/1805.08522

(You may notice that I had some involvement with this research, but the main credit should go to Chris Mingard and Guillermo Valle-Perez!)

I believe that research of this type is very relevant for AI alignment. It seems quite plausible that neural networks, or something similar to them, will be used as a component of AGI. If that is the case, then we want to be able to reliably predict and reason about how neural networks behave in new situations, and how they interact with other systems, and it is hard to imagine how that would be possible without a deep understanding of the dynamics at play when neural networks learn from data. Understanding their inductive bias seems particularly important, since this is the key to understanding everything from why they work in the first place, to phenomena such as adversarial examples, to the risk of mesa-optimisation. I hence believe that it makes sense for alignment researchers to keep an eye on what is happening in this space.

If you want some more stuff to read in this genre, I can also recommend these two posts:

Recent Progress in the Theory of Neural Networks

Understanding "Deep Double Descent"

EDIT: Here is a second post, which talks more about the "prior" of neural networks:

Deep Neural Networks are biased, at initialisation, towards simple functions

This is great! This definitely does seem to me like strong evidence that SGD is the wrong place to look for understanding neural networks' inductive biases and that we should be focusing more on the architecture instead. I do wonder the extent to which that insight is likely to scale, though—perhaps the more gradient descent you do, the more it starts to look different from random search and the more you see its quirks.

Scott's “How does Gradient Descent Interact with Goodhart?” seems highly relevant here. Perhaps these results could serve as a partial answer to that question, in the sense that SGD doesn't seem to differ very much from random search (with a Gaussian prior on the weights) for deep neural networks on MNIST. I'm not sure how to reconcile that with the other arguments in that post for why it should be different, though, such as the experiments that Peter and I did.

To make sure I am following:

--The stuff linked in this post

hypothesizesthat simple functions enjoy bigger volume in parameter-space, i.e. there are more possible combinations of neuron weights that add up to a simple function than a complex one. Combined with the plausible hypothesis that real-world data tends to have simple functions that fit it, this explains why neural networks generalize well, and the explanation has nothing to do with SGD and everything to do with the prior, so to speak, implicit in the parameter space.--However, it doesn't

provethis hypothesis, or give a theoretical argument for it of the sort "Here's why, given what we know about math and about complexity theory, simple functions should enjoy bigger volume in parameter-space." Rather, the post (and the related papers) argues that this hypothesis is true because it's an elegant hypothesis that explains the data well (and they have various experiments to back this up.)--Hence, you think there still remains the mystery of

whythis hypothesis is true --whysimple functions enjoy bigger volume -- and you think the answer probably has something to do with the architecture rather than SGD, but you link to some stuff about SGD that seems relevant.I'm not at all sure I'm following, which is why I'm asking. :)

We do have a lot of empirical evidence that simple functions indeed have bigger volumes in parameter-space (exponentially bigger volumes, in fact). See especially the 2nd and 3rd paper linked in the OP.

It's true that we don't have a rigorous mathematical proof for "why" simple functions should have larger volumes in parameter-space, but I can give an intuitive argument for why I think it would make sense to expect this to be the case.

First of all, we can look at the claim "backwards"; suppose a function has a large volume in parameter-space. This means that a lot of the parameters in a sense are redundant, so it should be possible to compress the representation of this function. Conversely, if a function has a small volume in parameter-space, then all the parameters of the network are necessary to pinpoint that function, and so writing out the entire structure of the network might be one of the shortest ways to represent that function. Therefore, we should expect functions with large volumes in parameter-space to be simple, and functions with small volumes to be complex.

There is a mathematical theorem, called the

Levin bound, that roughly formalises this intuition. This bound says essentially that if we have a space of functions F, and a space of parameters P, and a parameter-function map m : P -> F, where m itself has low complexity relative to the most complex functions in F (and F is overparameterised by P), then m will be exponentially biased towards simple functions (ie, if an element f of F has low complexity, then there is an exponentially larger number of elements of P that map to f via m).The Levin bound doesn't apply directly to neural networks, because it assumes that P is finite and discrete, but it gives some extra backing to the intuition above.

Also, check out this.

In what sense is the parameter space of a neural network not finite and discrete? It is often useful to understand floating-point values as continuous, but in fact they are discrete such that it seems like a theorem which assumes discreteness would still apply.

Yes, it does of course apply in that sense.

I guess the question then basically is which level of abstraction we think would be the most informative or useful for understanding what's going on here. I mean, we could for example also choose to take into account the fact that any actual computer program runs on a physical computer, which is governed by the laws of electromagnetism (in which case the parameter-space might count as continuous again).

I'm not sure if accounting for the floating-point implementation is informative or not in this case.

My understanding is that floating-point granularity is enough of a real problem that it does sometimes matter in realistic ML settings, which suggests that it's probably a reasonable level of abstraction on which to analyze neural networks (whereas any additional insights from an electromagnetism-based analysis probably never matter, suggesting that's

nota reasonable/useful level of abstraction).Thanks, this is great! OK, I'm pretty convinced now, but I was biased to believe something like this anyway. I would love to hear what a skeptic thinks.

I'm not sure if I count as a skeptic, but at least for me the only part of this that I find confusing is SGD not making a difference over random search. The fact that simple functions take up a larger volume in parameter space seems obviously true to me and I can't really imagine anyone disagreeing with that part (though I'm still quite glad to have actual analysis to back that up).

Do you have a good reference for the Levin bound? My attempts at finding a relevant paper all failed.

The second Towards Data Science post references this paper, which is also the main reference through which Levin's result is mentioned in the first paper you post. So I assume reading these references should be enough to get the gist.

There is a proof of it in "An Introduction to Kolmogorov Complexity and Its Applications" by Ming Li & Paul Vitanyi.

Hmmm... I don't know if that's how I would describe what's happening. I would say:

OK, thanks! The first bullet point is I think a summary of my first two bullet points, but with different emphasis. I'll check out the experiments you linked.

I'm curious to know what you mean by "focus on the architecture instead." My guess is that if the OP is right, pretty much any neural net architecture will have a simplicity bias.

In my opinion, I think it's certainly and obviously true that neural net architecture is a huge contributor to inductive biases and comes with a strong simplicity bias. What's surprising to me is that I would expect a similar thing to be true of SGD and yet these results seem to indicate that SGD vs. random search has only a pretty minimal effect on inductive biases.

I'm having trouble parsing your first sentence -- are you saying that yes, pretty much any neural net architecture will have a simplicity bias, but saying also that the biases will be importantly different depending on specifically which architecture you pick?

I think I would have predicted that SGD vs. random search would have a pretty minimal effect on inductive biases. My pet theory for why neural nets have a bias towards simplicity is that there are more ways for neural nets to encode simple functions than complex functions, i.e. larger regions of parameter space for simple functions. As the OP argues (I think) if this is right, then it makes sense that SGD and random search don't affect the bias that much, since larger regions of parameter space will also have larger basins of attraction for SGD to roll down. (As for the justification of my pet theory, well, this is really sketchy but see my top-level comment below)

Important point here: the relevant notion of "simple" is not "low Kolmogorov complexity"; it's more like smoothness. A bias toward "simple" functions, for this notion of "simple", would mainly make interpolation work well, not necessarily other prediction problems.

I'm not sure I agree with this -- I think Kolmogorov complexity is a relevant notion of complexity in this context.

How so? In particular, are you saying that gaussian processes bias towards low Kolmogorov complexity, or are you saying that there's some simplicitly prior here besides what gaussian processes have? If the former, how does a gaussian process bias towards short programs other than the compressibility offered by smoothness?

(Of course smoothness itself implies some compressibility in the Kolmogorov picture; the interesting question is whether there's a bias towards functions which can be compressed in more general ways.)

I'd be interested to hear your take on this comment above, which is talking about compressibility, which to me sounds like kolmogorov complexity.

The things Joar is saying there generally match my intuitions, although there are some things unsaid where my intuitions might diverge more.

There's about a gazillion measures of "simplicity". If we pick some simplicity measure M, the usual rule is that things with low M-complexity have low Kolmogorov complexity, but things with low Kolmogorov complexity don't necessarily have low M-complexity. For instance, if a file can be compressed a lot by a ZIP file encoding, then that file has low Kolmogorov complexity, but not all low Kolmogorov complexity strings can be compressed by ZIP specifically.

In this context, smoothness is one such relevant measure: smooth functions have low Kolmogorov complexity, but there are other ways to have low Kolmogorov complexity without being smooth. I don't know about the Levin bound specifically, but in math these sorts of theorems are usually about smoothness. In complexity theory in particular, theorems often connect smoothness measures to circuit size measures (which are another class of complexity measures which imply low Kolmogorov complexity but not vice versa).

Roughly speaking, if a complexity measure were as general as Kolmogorov complexity, then we'd expect it to be uncomputable. If we can actually find low-complexity functions under a complexity measure M, then that complexity measure is probably less general than Kolmogorov complexity. From there, the natural next question is exactly what generality is kept/lost, and whether the loss of generality actually matters for things in the real world.

Ah, I certainly agree with this.

I do not wish to claim that

allfunctions with low Kolmogorov complexity have large volumes in the parameter-space of a sufficiently large neural network. In fact, I can point to several concrete counterexamples to this claim. To give one example, theidentity functioncertainly has a low Kolmogorov complexity, but it's very difficult for a (fully connected feed-forward) neural network to learn this function (if the input and output is represented in binary form by a bit string). If you try to learn this function by training on only odd numbers then the network will not robustly generalise to even numbers (or vice versa). Similarly, if you train using only number in a certain range then the network will not robustly generalise outside this range. This is because a pattern such as "the n'th input neuron is equal to the n'th output neuron" lacks a simple representation in a neural network (and hence this function has a small parameter-space volume, even though it has low Kolmogorov complexity). The same goes for the function that recognises palindromes, and etc.So, I agree that there are certain functions with low Kolmogorov complexity that a neural network normally cannot "see" properly. I also think one could frame a lot of the research on developing new neural network architectures as being about making neural networks able to "see" more kinds of functions. For example, NALUs give neural networks the ability to "see" arithmetic relations more easily. I hence certainly think it's a very relevant question which complexity measure best describes the bias in neural networks (and I think this actually matters for practical problems). Note that the identity function is very smooth.

This is a bit of a tangent, but the Levin bound is actually about Kolmogorov complexity. It's a fairly simple theorem; the proof is constructive, and basically shows that a given function f which corresponds to many parameters in the parameter-space cannot be too complex, by constructing a simple program which computes f. Very roughly, if the parameter-space is finite and discrete, then we could construct a Huffman code for the function space (where the distribution over the function-space is the distribution that corresponds to the uniform distribution over the parameter-space). We can then make a computer program that computes f by concatenating the Huffman code of f and the parameter-function map m (which gives an upper bound on the Kolmogorov complexity of functions with large volumes). Of course, this theorem does not

per seactually apply to neural networks, since it assumes that the parameter-space is finite and discrete, so in this context it's essentially just an intuition pump.That's a clever example, I like it.

Based on that description, it should be straightforward to generalize the Levin bound to neural networks. The main step would be to replace the Huffman code with a turbocode (or any other near-Shannon-bound code), at which point the compressibility is basically identical to the log probability density, and we can take the limit to continuous function space without any trouble. The main change is that entropy would become relative entropy (as is normal when taking info theory bounds to a continuous limit). Intuitively, it's just using the usual translation between probability theory and minimum description length, and applying it to the probability density of parameter space.

Thanks, that helps. Perhaps an example would be: A purely feed-forward neural network might be "blind" to algorithms that are kolmogorov-simple but which involve repeatedly performing the same procedure a bunch of times (even if it is technically big enough to contain such an algorithm). So the simplicity bias of said network would be importantly different from kolmogorov complexity.

That's exactly right. That exact example would be a case of high circuit size but low Kolmogorov complexity.

Two confusions about what this paper is claiming.

## First Confusion: Why does the experimental test involve any test data at all?

It seems like Pβ(f|S) and Popt(f|S) are denoting different things in different places. Writing out the model from the first few sections explicitly:

Now for the key confusion: the claim that Pβ(f|S)≈Popt(f|S) sounds like it should apply to the definitions above, for all functions f (or at least with high probability),

without any reference whatsoever to any test data.So how does the test data enter? It sounds like the experiments evaluate P[Y=f(X)], for (X,Y) values from the test set, under the two f-distributions above. This is a very different notion of "Pβ(f|S)≈Popt(f|S)" from above, and it's not obvious whether it's relevant at all to understanding DNN performance. It's asking "are the probabilities assigned to the test data approximately the same?" rather than "are the distributions of trained functions approximately the same?".

My optimistic guess as to what's going on here: the "test set" is completely ignoring the test labels, and instead choosing random labels. This would be equivalent to comparing Pβ(f|S) to Popt(f|S) at random f, but only on a fixed subset of 100 possible X-values (drawn from the true distribution). If that's what's going on, then it is asking "are the distributions of trained functions approximately the same on realistic X-values?", and it's just confusing to the reader to talk about these random functions as coming from a "test set". Not a substantive problem, just communication difficulty.

## Second Confusion: How does the stated conclusion follow from the figures?

This confusion is more prosaic: I'm not convinced that the conclusion Pβ(f|S)≈Popt(f|S) follows from the figures, at least not to enough of an extent to be "strong evidence" against the claim that SGD is the main source of induction bias.

Many of these figures show awfully large divergence from equality, and I'm not seeing any statistical measure of fit. Eyeballing them, it's clear that there's a strong enough relation to rule

ininductive bias in the network architecture, but that does not ruleoutinductive bias in SGD as well. To make the latter claim, there would have to be statistically-small residual inductive biasafteraccounting for the contribution of bias from Pβ(f|S), and I don't see the paper actually making that case. I find the claim plausible a priori, but I don't see the right analysis here to provide significant evidence for it.Chris (author of the linked post) messaged me about these, and I also looked at the full paper and thought about it some more. There are still some stray pieces, but

I'm now generally convinced the headline claim is correct: the vast majority of inductive bias comes from the implicit prior on the network's parameter space.Shortest path I see to that conclusion, from the figures, is a Fermi estimate:

This isn't precise enough to rule out small contributions of SGD to the inductive bias, but it is pretty strong evidence that the Bayesian posterior is the main factor.

Other things I'm updating on, based on the post/paper:

Yes, I agree with this.

I should note that SGD definitely does make a contribution to the inductive bias, but this contribution does seem to be quite small compared to the contribution that comes from the implicit prior embedded in the parameter-function map. For example, if you look at Figure 6 in the

Towards Data Sciencepost, you can see that different versions of SGD give a slightly different inductive bias. It's also well-known that the inductive bias of neural networks is affected by things like how long you train for, and what step size and batch size you use, etc. However, these effects seem to be quite small compared to the effect that comes from the parameter-function map.I should also note that I think that the fact that Gaussian processes even work at all already in itself gives us a fairly good reason to expect them to capture most of what makes NNs work in practice. For any given function approximator, if that function approximator is highly expressive then the "null hypothesis" should be that it basically won't generalise at all. The fact that NNs and GPs both work, and the fact that there is a fairly strong correspondence between them, means that it would be kind of weird if they worked for completely different reasons.

I'd be interested to hear your take on why this means NN's are only doing interpolation. What does it mean, to only do interpolation and not extrapolation? I know the toy model definitions of those terms (connecting dots vs. drawing a line off away from your dots) but what does it mean in real life problems? It seems like a fuzzy/graded distinction to me, at best.

Also, if the simplicity prior they use is akin to kolmogorov complexity-based priors, then that means what they are doing is akin to what e.g. Solomonoff Induction does. And I've never heard anyone try to argue that Solomonoff Induction "merely interpolates" before!

I believe Chris has now updated the

Towards Data Scienceblog post to be more clear, based on the conversation you had, but I'll make some clarifications here as well, for the benefit of others:The key claim, that Pβ(f∣S)≈Popt(f∣S), is indeed not (meant to be) dependent on any test data

per se. The test data comes into the picture because we need a way to granuralise the space of possible functions if we want to compare these two quantities empirically. If we take "the space of functions" to be all the functions that a given neural network can express on the entire vector space in which it is defined, then there would be an uncountably infinite number of such functions, and any given function would never show up more than once in any kind of experiment we could do. We therefore need a way to lump the functions together into sensible buckets, and we decided to do that by looking at what output the function gives on a set of images not used in training. Stated differently, we look at the partial function that the network expresses on a particular subset of the input vector space, consisting in a bunch of points sampled from the underlying data distribution. So, basically, your optimistic guess is correct -- the test data is only used as a way to lump functions together into a finite number of sensible buckets (and the test labels are not used).The results in Neural Networks Are Fundamentally Bayesian are pretty cool -- that's clever how they were able to estimate the densities.

A couple thoughts on the limitations:

We do have empirical data which shows that the neural network "prior" is biased towards low-complexity functions, and some arguments for why it would make sense to expect this to be the case -- see this new blog post, and my comment here. Essentially, low-complexity functions correspond to larger volumes in the parameter-space of neural networks. If functions with large volumes also have large basins of attraction, and if using SGD is roughly equivalent to going down a random basin (weighted by its size), then this would essentially explain why neural networks work.

I haven't seen the paper you link, so I can't comment on it specifically, but I want to note that the claim "SGD is roughly Bayesian" does not imply "Bayesian inference would give better generalisation than SGD". It can simultaneously be the case that the neural network "prior" is biased towards low-complexity functions, that SGD roughly follows the "prior", and that SGD provides some additional bias towards low-complexity functions (over and above what is provided by the "prior"). For example, if you look at Figure 6 in the post I link, you can see that different versions of SGD do provide a slightly different inductive bias. However, this effect seems to be quite small relative to what is provided by the "prior".

Thanks for this, this is really exciting! I am especially interested to hear pushback against it, because I think it is evidence for short timelines.

Can you say more about the sense in which this is true? There's a sense in which it is not true, IIRC: If you use the number line between 0 and 1 to code for functions, such that every possible function is a number on that line somewhere, and you bundle equivalent functions together (i.e. functions that give the same output on all the relevant data) then simpler functions will have more measure. This is how Solomonoff Induction works I think. And insofar as we think solomonoff induction would in principle generalize well to new data, then it seems that there's a sense in which most functions that fit a given set of training data would generalize well to new data.

Yes, as Adam Shimi suggested below, I'm here talking about "input-output relations" when I say "functions".

The claim can be made formal in a few different ways, but one way is this: let's say we have a classification task with 10 classes, 50k training examples, and 10k test examples. If we consider the domain to be these 60k instances (and the codomain to be the 10 classes), then we have 10^60k functions defined in this space, 10^10k of which fit the training data perfectly. Of these 10^10k functions, the vast vast majority will have very poor accuracy on the test set. Therefore, if we tried to do inference by picking a random function that fits the training data, we would almost certainly get very poor generalisation (in fact, we would expect to get the same accuracy as with random guessing).

We can sometimes ensure that a learning algorithm will generalise well if it can only express a restricted set of functions (see VC dimensionality). However, a neural network is too expressive for this explanation to work (because they can express all those 10^10k functions). Therefore, there must be some further explanation for why neural networks tend to learn functions that generalise well (ie, why they tend to learn low-complexity functions, when they in principle could learn a high-complexity function instead).

Thanks! Yeah, I just forgot the definition of a function, sorry. I was thinking of computer programs. Simple functions are expressed by more computer programs than complex functions. (Which is why I was already inclined to believe your conclusion before reading your stuff -- if neural nets are like computer programs, then it is plausible that simple functions are expressed by more neural net initializations)

The comparison with Solomonoff induction strikes me as wrong, because Solomonoff induction adapts to new data, whereas "most functions that fit a given set of training data will not generalise well to new data." generally means that the function is learned and thus fixed when given new data.

As for why most functions will not generalize well, it's probably because there's far more wrong answers than correct ones for each new datapoint, in general.

SI is bayesianism + universal prior, IIRC. The way that solomonoff induction generalizes to new data is simply that it takes the prior and cuts out all the functions that contradict the new data. (i.e. it does bayesian update). So insofar as it generalizes well, it's because it contains functions in the original universal prior that generalize well,

and they have relatively high prior compared to the functions that don't.And the reason this is true is that the prior is based on complexity, and when you dig into the details you encounter the argument I just sketched, where the prior is actually based on weighting each function equally but proving that simpler functions are more numerous.It's true that there are more wrong answers than right answers, but (I claim) there are more functions that continue the sequence by giving the right answer than functions that continue the sequence by giving a wrong answer. (At least, assuming the right answer is simple, or rather the result of simple laws, which we assume it is if we think that solomonoff induction generalizes well.)

Okay, I think a confusion here is that I (and the OP AFAIK) don't talk about the same things as you do when using the word "function". Solomonoff induction is about programs (usually Turing machines) and from your comment it seems like the sense of functions you're taking. But functions as I'm using here (and I'm pretty sure this is the meaning in the quote) is just an input/output relation. So it doesn't make sense to say that two functions are equivalent (in the way your talking about at least), because they necessarily differ on some input (or they would be the same function). On the other hand, two programs can be equivalent in that they output the same things (they compute similar functions).

So from the input/output standpoint, if functions are coded in a line, the correct function is a single point. In that sense there are indeed far more functions that generalize poorly than ones that generalize well (this becomes a bit more complicated when you consider how well your function generalize, but you still generally have way more possibility to do wrong for each data point than to have the unique right answer).

I think this makes sense for clarifying the quote, as this part of the post explains the relatively classic arguments that neural networks generalize pretty well, are not really limited in terms of input/output relations, yet most input/output relations will not generalize correctly to new data. So there must be something more here.

Does that make sense to you?

Ahhhh, yes, thank you that hits the nail on the head!

So I guess my original question has been answered, and I'm left with a new question about whether the analogy to solomonoff induction might be useful here: Simpler functions get more weight in the universal prior because there are more programs that compute them; perhaps simpler functions get more weight in neural network's implicit prior because there are more parameter-settings that compute them (i.e. bigger region of parameter-space) and maybe both of these facts are true for the same reason.