Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

I agree with your summary. I'm mainly just clarifying what my view is of the strength and overall role of the Algorithmic Information Theory arguments, since you said you found them unconvincing.

I do, however, disagree that those arguments can be applied to "literally any machine learning algorithm", although they certainly do apply to a much larger class of ML algorithms than just neural networks. That said, I don't think this is necessarily a bad thing. The picture that the AIT arguments give makes it reasonably unsurprising that you would get the...

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

I have a few comments on this:

Fundamentally, the point here is that generalization performance is explained much more by the neural network architecture than by the structure of stochastic gradient descent, since we can see that stochastic gradient descent tends to behave similarly to (an approximation of) random sampling. The paper talks a bunch about things like SGD being (almost) Bayesian and the neural network prior having low Kolmogorov complexity; I found these to be distractions from the main point.

The main point, as I see it, is essentially tha...
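The "SGD behaves like random sampling" comparison above can be made concrete at toy scale. The sketch below is a hypothetical miniature setup, not the paper's experiment: the network size, data, learning rate, and the use of a numerical gradient are all choices made for brevity. It samples zero-train-error functions two ways, by rejection-sampling random parameter vectors and by running gradient descent from random inits, and tallies which function (restricted to a held-out input) each method finds.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Tiny 2-4-1 tanh network; parameters flattened into one 17-dim vector.
def net(w, X):
    W1 = w[:8].reshape(2, 4); b1 = w[8:12]
    W2 = w[12:16]; b2 = w[16]
    return np.tanh(X @ W1 + b1) @ W2 + b2

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)  # whole input domain
y = np.array([0, 1, 1, 0])                             # XOR labels
train = [0, 1, 2]                                      # held-out input: [1, 1]

def as_function(w):        # identify a function with its signs on all inputs
    return tuple((net(w, X) > 0).astype(int))

def fits_train(w):         # zero training error?
    f = as_function(w)
    return all(f[i] == y[i] for i in train)

# (a) random sampling: keep Gaussian parameter draws with zero train error
samp = Counter()
while sum(samp.values()) < 30:
    w = rng.normal(size=17)
    if fits_train(w):
        samp[as_function(w)] += 1

# (b) gradient descent from random inits, stopped at zero train error
def loss(w):
    return np.mean((net(w, X[train]) - (2 * y[train] - 1)) ** 2)

def grad(w, eps=1e-5):     # numerical gradient, for brevity
    g = np.zeros_like(w)
    for i in range(17):
        d = np.zeros(17); d[i] = eps
        g[i] = (loss(w + d) - loss(w - d)) / (2 * eps)
    return g

sgd = Counter()
while sum(sgd.values()) < 30:
    w = rng.normal(size=17)
    for _ in range(500):
        if fits_train(w):
            sgd[as_function(w)] += 1
            break
        w -= 0.2 * grad(w)

print("random sampling: ", samp.most_common())
print("gradient descent:", sgd.most_common())
```

With more samples, the two frequency profiles can be compared on a log-log plot, as in the papers under discussion.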

What I'm suggesting is that volume in high dimensions can concentrate on the boundary. To be clear, when I say SGD only typically reaches the boundary, I'm talking about early stopping and the main experimental setup in your paper, where training is stopped upon reaching zero train error.

This does seem to invalidate the model. However, something tells me that the difference here is more about degree. Since you use the word 'should', I'll use the wiggle room to propose an argument for what 'should' happen.

If SGD is run with early stopping, as described above, then my argument is that this is roughly equivalent to random sampling, via an appeal to concentration of measure in high dimensions.

If SGD is not run with early stopping, it's enclosed by the boundary of zero-train-error functions. Because these functions are most likely in the interior, they are unlikely to be produced by random sampling. Thus, on a log-log plot I'd expect overtraining to 'tilt' the correspondence between SGD and random-sampling likelihoods downward.

Falsifiable hypothesis: compare SGD with overtraining to the random sampling algorithm. You will see that functions that are unlikely to be generated by random sampling will be more likely under SGD with overtraining. Moreover, functions that are more likely under random sampling will become less likely under SGD with overtraining.
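The appeal to concentration of measure is a generic high-dimensional fact that can be checked in a few lines: the fraction of a unit d-ball's volume lying within eps of its surface is 1 - (1 - eps)^d, which tends to 1 as the dimension d grows.

```python
# Volume fraction of the unit d-ball within eps of its boundary:
# 1 - (1 - eps)**d  ->  1 as d grows, so nearly all volume sits in a
# thin shell near the surface.
eps = 0.01
for d in (1, 10, 100, 1000, 10000):
    print(f"d = {d:>5}: shell fraction = {1 - (1 - eps) ** d:.4f}")
```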

This seems right, but I'm not sure how that's different from Zach's phrasing of the main point? Zach's phrasing was "SGD approximately equals random sampling", and random sampling finds functions with probability exactly proportional to their volume. Combine that with the fact that empirically we get good generalization, and we get the thing you said.

(Maybe you weren't disagreeing with Zach and were just saying the same thing a different way?)

This feels similar to the noncentral fallacy [https://www.lesswrong.com/posts/yCWPkLi8wJvewPbEp/the-noncentral-fallacy-the-worst-argument-in-the-world]:

Saying that MLK was a "criminal" is one way of saying that MLK thought and acted as though he had a moral responsibility to break unjust laws and to take direct action.

(This is an exaggeration, but I think it is directionally correct. Certainly when I read the title "neural networks are fundamentally Bayesian" I was thinking of something very different.)

I've discussed this above [https://www.lesswrong.com/posts/YSFJosoHYFyXjoYWa/why-neural-networks-generalise-and-why-they-are-kind-of?commentId=KFuGWt3FJ288T4tbB]; I'll continue the discussion there.

The rest of the comment is about stuff that I didn't have a strong opinion on, so I'll leave it for Zach to answer if he wants.

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

This part of the argument is indeed quite general, but it's not vacuous. For the argument to apply, you need it to be the case that the function space is overparameterised, that the parameter-function map has low complexity relative to the functions in the function space, and that the parameter-function map is biased in some direction. This will not be the case for all learning algorithms.

But, I do agree that the Levin bound argument doesn't really capture the "mechanism" of what's going on here. I can of course only speak for myself (and not the other people i...

I agree it's not vacuous. It sounds like you're mostly stating the same argument I gave, but in different words. Can you tell me what's wrong or missing from my summary of the argument?

(Even if you are talking about the overparameterized case, where this argument is not vacuous and also applies primarily to neural nets and not other ML models, I don't find this argument very compelling a priori, though I agree that based on empirical evidence it is probably mostly correct.)

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Like Rohin, I'm not impressed with the information theoretic side of this work.

Specifically, I'm wary of the focus on measuring complexity for functions between finite sets, such as binary functions.

Mostly, we care about NN generalization on problems where the input space is continuous, generally R^n. The authors argue that the finite-set results are relevant to these problems, because one can always discretize R^n to get a finite set. I don't think this captures the kinds of function complexity we care about for NNs.

We’re not saying that discr...

I'll write mostly about this statement, as I think it's the crux of our disagreement.

The statement may be true as long as we hold the meaning of "objects" constant as we vary the complexity measure.

However, if we translate objects from one mathematical space to another (say by discretizing, or adding/removing a metric structure), we can't simply say the complexity measures for space A on the original A-objects inevitably agree with those of space B on the translated B-objects. Whether this is true depends on our choice of translation.

(This is clear in the trivial cases of bad translation where we, say, map every A-object onto the same B-object. Now, obviously, no one would consider this a correct or adequate way to associate A-objects with B-objects. But the example shows that the claim about complexity measures will only hold if our translation is "good enough" in some sense. If we don't have any idea what "good enough" means, something is missing from the story.)

In the problem at hand, the worrying part of the translation from real to boolean inputs is the loss of metric structure. (More precisely, the hand-waviness about what metric structure survives the translation, if any.) If there's no metric, this destroys the information needed by complexity measures that care about how easy it is to reconstruct an object "close to" the specified one.

Basic information theory doesn't require a metric, only a measure. There's no sense of "getting an output approximately right", only of "getting exactly the right output with high probability". If you care about being approximately right according to some metric, this leads you to rate-distortion theory [https://en.wikipedia.org/wiki/Rate%E2%80%93distortion_theory].

Both of these domains -- information theory without a metric, and with one -- define notions of incompressibility/complexity, but they're different. Consider two distributions on R:

1. The standard normal,
2. The standard normal, but you chop it into a...

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Yes, it does of course apply in that sense.

I guess the question then basically is which level of abstraction we think would be the most informative or useful for understanding what's going on here. I mean, we could for example also choose to take into account the fact that any actual computer program runs on a physical computer, which is governed by the laws of electromagnetism (in which case the parameter-space might count as continuous again).

I'm not sure if accounting for the floating-point implementation is informative or not in this case.

My understanding is that floating-point granularity is enough of a real problem that it does sometimes matter in realistic ML settings, which suggests that it's probably a reasonable level of abstraction on which to analyze neural networks (whereas any additional insights from an electromagnetism-based analysis probably never matter, suggesting that's not a reasonable/useful level of abstraction).
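That granularity is easy to exhibit directly. This is a generic fact about float64 (the usual parameter dtype), not specific to any ML framework:

```python
import numpy as np

# Below the spacing of float64 around 1.0 (~2.2e-16), an additive update
# to a weight of size 1.0 is lost entirely: the parameter truly is discrete.
w = np.float64(1.0)
print(w + np.float64(1e-17) == w)    # True: the tiny "gradient step" vanishes
print(np.spacing(np.float64(1.0)))   # gap to the next representable float
```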

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

There is a proof of it in "An Introduction to Kolmogorov Complexity and Its Applications" by Ming Li & Paul Vitanyi.

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Ah, I certainly agree with this.

I do not wish to claim that *all* functions with low Kolmogorov complexity have large volumes in the parameter-space of a sufficiently large neural network. In fact, I can point to several concrete counterexamples to this claim. To give one example, the *identity function* certainly has a low Kolmogorov complexity, but it's very difficult for a (fully connected feed-forward) neural network to learn this function (if the input and output is represented in binary form by a bit string). If you try to learn this function by training...
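The experiment described can be sketched in a few lines of numpy. This is a minimal, hypothetical version (8-bit strings, one hidden layer, full-batch gradient descent; all sizes are my own choices, not from the discussion): train a small net to map bit strings to themselves on a subset of inputs, then check how well the learned map extends to unseen strings.

```python
import numpy as np

rng = np.random.default_rng(1)
n_bits, hidden, lr = 8, 32, 0.5

# Train on 64 of the 256 8-bit strings, mapping each string to itself;
# test on the remaining 192 unseen strings.
X = np.array([[(i >> b) & 1 for b in range(n_bits)] for i in range(256)], float)
perm = rng.permutation(256)
Xtr, Xte = X[perm[:64]], X[perm[64:]]

W1 = rng.normal(0, n_bits ** -0.5, (n_bits, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(0, hidden ** -0.5, (hidden, n_bits)); b2 = np.zeros(n_bits)

def forward(X):
    H = np.tanh(X @ W1 + b1)
    return H, 1.0 / (1.0 + np.exp(-(H @ W2 + b2)))   # per-bit sigmoid

for _ in range(2000):                 # full-batch gradient descent on
    H, P = forward(Xtr)               # per-bit cross-entropy loss
    G = (P - Xtr) / len(Xtr)          # dL/dlogits for sigmoid + CE
    GH = (G @ W2.T) * (1 - H ** 2)
    W2 -= lr * (H.T @ G);   b2 -= lr * G.sum(0)
    W1 -= lr * (Xtr.T @ GH); b1 -= lr * GH.sum(0)

def bit_accuracy(X):
    return float(((forward(X)[1] > 0.5) == X).mean())

print("train bit accuracy:", round(bit_accuracy(Xtr), 3))
print("test bit accuracy: ", round(bit_accuracy(Xte), 3))
```

The interesting quantity is the gap between train and test bit accuracy, i.e. how far the learned function is from the true identity map off the training set.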

That's a clever example, I like it.

Based on that description, it should be straightforward to generalize the Levin bound to neural networks. The main step would be to replace the Huffman code with a turbo code (or any other near-Shannon-bound code), at which point the compressibility is basically identical to the log probability density, and we can take the limit to continuous function space without any trouble. The main change is that entropy would become relative entropy (as is normal when taking info-theory bounds to a continuous limit). Intuitively, it's just the usual translation between probability theory and minimum description length, applied to the probability density of parameter space.
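For reference, the discrete result being generalized here is Levin's coding theorem, which ties the universal a priori probability of an output to its prefix Kolmogorov complexity. Stated from memory (all constants are additive and machine-dependent, so treat this as a sketch rather than a precise citation):

```latex
% Levin's coding theorem: the universal a priori probability m(x) satisfies
\[
  -\log_2 m(x) \;=\; K(x) \;+\; O(1),
\]
% so outputs of low prefix complexity K(x) carry almost all the probability.
% Applied to a parameter-function map f = M(\theta) with \theta drawn at
% random, the corresponding "Levin bound" on the volume of a function reads
\[
  P(f) \;\le\; 2^{-K(f \mid M) \,+\, O(1)},
\]
% i.e. a function of high complexity cannot occupy a large parameter volume.
```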

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

I'm not sure I agree with this -- I think Kolmogorov complexity is a relevant notion of complexity in this context.

How so? In particular, are you saying that Gaussian processes are biased towards low Kolmogorov complexity, or are you saying that there's some simplicity prior here besides what Gaussian processes have? If the former, how does a Gaussian process bias towards short programs, other than through the compressibility offered by smoothness?

(Of course smoothness itself implies some compressibility in the Kolmogorov picture; the interesting question is whether there's a bias towards functions which can be compressed in more general ways.)

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Yes, I agree with this.

I should note that SGD definitely does make a contribution to the inductive bias, but this contribution does seem to be quite small compared to the contribution that comes from the implicit prior embedded in the parameter-function map. For example, if you look at Figure 6 in the *Towards Data Science* post, you can see that different versions of SGD give a slightly different inductive bias. It's also well-known that the inductive bias of neural networks is affected by things like how long you train for, and what step size and bat...

I'd be interested to hear your take on why this means NNs are only doing interpolation. What does it mean to only do interpolation and not extrapolation? I know the toy-model definitions of those terms (connecting dots vs. drawing a line off away from your dots), but what does it mean in real-life problems? It seems like a fuzzy/graded distinction to me, at best.

Also, if the simplicity prior they use is akin to Kolmogorov-complexity-based priors, then what they are doing is akin to what e.g. Solomonoff induction does. And I've never heard anyone try to argue that Solomonoff induction "merely interpolates" before!

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

I believe Chris has now updated the *Towards Data Science* blog post to be more clear, based on the conversation you had, but I'll make some clarifications here as well, for the benefit of others:

The key claim is indeed not (meant to be) dependent on any test data *per se*. The test data comes into the picture because we need a way to granularise the space of possible functions if we want to compare these two quantities empirically. If we take "the space of functions" to be all the functions that a given neural network can express...

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

We do have empirical data which shows that the neural network "prior" is biased towards low-complexity functions, and some arguments for why it would make sense to expect this to be the case -- see this new blog post, and my comment here. Essentially, low-complexity functions correspond to larger volumes in the parameter-space of neural networks. If functions with large volumes also have large basins of attraction, and if using SGD is roughly equivalent to going down a random basin (weighted by its size), then this would essentially explain why neural netw...
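The "random basin weighted by its size" picture can be sketched in one dimension. This is a toy illustration of the mechanism, not the actual experiment: gradient descent from a uniform random init lands in each minimum with frequency roughly proportional to the width of that minimum's basin of attraction.

```python
import random

random.seed(0)

def grad(x):
    # Piecewise-quadratic loss with two minima: a narrow basin (-1, 0)
    # around x = -0.5 and a wide basin (0, 3) around x = 1.5.
    return 2 * (x + 0.5) if x < 0 else 2 * (x - 1.5)

counts = {-0.5: 0, 1.5: 0}
for _ in range(10000):
    x = random.uniform(-1, 3)        # random init over the whole interval
    for _ in range(200):             # plain gradient descent
        x -= 0.05 * grad(x)
    counts[round(x, 1)] += 1

print(counts)  # roughly a 1:3 split, matching basin widths 1 and 3
```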

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

We do have a lot of empirical evidence that simple functions indeed have bigger volumes in parameter-space (exponentially bigger volumes, in fact). See especially the 2nd and 3rd papers linked in the OP.

It's true that we don't have a rigorous mathematical proof for "why" simple functions should have larger volumes in parameter-space, but I can give an intuitive argument for why I think it would make sense to expect this to be the case.

First of all, we can look at the claim "backwards": suppose a function has a large volume in parameter-space. This means tha...

Do you have a good reference for the Levin bound? My attempts at finding a relevant paper have all failed.

In what sense is the parameter space of a neural network not finite and discrete? It is often useful to understand floating-point values as continuous, but in fact they are discrete, such that it seems like a theorem which assumes discreteness would still apply.

Thanks, this is great! OK, I'm pretty convinced now, but I was biased to believe something like this anyway. I would love to hear what a skeptic thinks.

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Yes, as Adam Shimi suggested below, I'm here talking about "input-output relations" when I say "functions".

The claim can be made formal in a few different ways, but one way is this: let's say we have a classification task with 10 classes, 50k training examples, and 10k test examples. If we consider the domain to be these 60k instances (and the codomain to be the 10 classes), then we have 10^60k functions defined on this space, 10^10k of which fit the training data perfectly. Of these 10^10k functions, the vast, vast majority will have very poor accura...
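The same count can be replayed at toy scale (hypothetical miniature sizes of my choosing: 2 classes, 4 training points, 2 test points):

```python
# 2 classes on a 6-point domain: 2**6 = 64 functions in total.
# Fixing the labels on the 4 training points leaves the 2 test points
# free, so 2**2 = 4 functions fit the training data perfectly, and a
# uniformly random one of them gets every test label right with prob 1/4.
n_classes, n_train, n_test = 2, 4, 2

total_fns = n_classes ** (n_train + n_test)
fit_train = n_classes ** n_test
print(total_fns, fit_train)  # 64 4
```

With 10 classes and 60k points, the same arithmetic gives the 10^60k and 10^10k figures above.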

Thanks! Yeah, I just forgot the definition of a function, sorry. I was thinking of computer programs. Simple functions are expressed by more computer programs than complex functions. (Which is why I was already inclined to believe your conclusion before reading your stuff: if neural nets are like computer programs, then it is plausible that simple functions are expressed by more neural-net initializations.)

Thoughts from a Two Boxer

As you may know, CDT has a lot of fans in academia. It might be interesting to consider what they have to say about Newcomb's Problem (and other supposed counter-examples to CDT).

In "The Foundations of Causal Decision Theory", James Joyce argues that Newcomb's Problem is "unfair" on the grounds that it treats EDT and CDT agents differently. An EDT agent is given two good choices ($1,000,000 and $1,001,000) whereas a CDT agent is given two bad choices ($0 and $1,000). If you wanted to represent Newcomb's Problem as a Marko...

Two senses of “optimizer”

I have already (sort of) addressed this point at the bottom of the post. There is a perspective from which any optimizer_1 can (kind of) be thought of as an optimizer_2, but it's unclear how informative this is. It is certainly at least misleading in many cases. Whether or not the distinction is "leaky" in a given case is something that should be carefully examined, not something that should be glossed over.

I also agree with what ofer said.

"even if we can make something safe in optimizer_1 terms, it may still be dangerous as an optimizer_2 b...

I think there is only a question of how leaky, but it is always a non-zero amount of leaky, which is the reason Bostrom and others are concerned about it for all optimizers and don't bother to make this distinction.

Two senses of “optimizer”

I agree with everything you have said.

Two senses of “optimizer”

I don't think that I'm assuming the existence of some sort of Cartesian boundary, and the distinction between these two senses of "optimizer" does not seem to disappear if you think of a computer as an embedded, causal structure. Could you state more precisely why you think that this is a Cartesian distinction?

Sure, let's be super specific about it.

Let's say we have something you consider an optimizer_1: a SAT solver. It operates over a set of variables V arranged in predicates P using an algorithm A. Since this is a real SAT solver that is computed, rather than a purely mathematical one we think about, it runs on some computer C, and thus for each of V, P, and A there is some C(V), C(P), and C(A) that is the manifestation of each on the computer. We can conceptualize what C does to V, P, and A in different ways: it turns them into bytes, it turns A into instructions, it uses C(A) to operate on C(V) and C(P) to produce a solution for V and P.

Now the intention is that the algorithm A is an optimizer_1 that only operates on V and P, but in fact A is never run, properly speaking; C(A) is, and we can only say A is run to the extent that C(A) does something to reality that we can set up an isomorphism to A with. So C(A) is only an optimizer_1 to the extent the isomorphism holds, and it is, as you defined optimizer_1, "solving a computational optimization problem". But properly speaking, C(A) doesn't "know" it's an algorithm: it's just matter arranged in a way that is isomorphic, via some transformation, to A.

So what is C(A) doing, then, to produce a solution? Well, I'd say it "optimizes its environment" -- that is, literally the matter and its configuration that it is in contact with -- so it's an optimizer_2.

You might object that there's something special going on here such that C(A) is still an optimizer_1 because it was set up in a way that isolates it from the broader environment so it stays within the isomorphism, but that's not a matter of classification; that's an engineering problem of making an optimizer_2 behave as if it were an optimizer_1. And a large chunk of AI safety (mostly boxing) is dealing with ways in which, even if we can make something safe in optimizer_1 terms, it may still be dangerous as an optimizer_2 because of unexpected behavior where it "breaks" the isomorphism.

Yes. I imagine this is why overtraining doesn't make a huge difference.

See e.g., page 47 in the main paper.