All of Joar Skalse's Comments + Replies

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

What I'm suggesting is that volume in high-dimensions can concentrate on the boundary.

Yes. I imagine this is why overtraining doesn't make a huge difference.

Falsifiable Hypothesis: Compare SGD with overtraining to the random sampling algorithm. You will see that functions that are unlikely to be generated by random sampling will be more likely under SGD with overtraining. Moreover, functions that are more likely with random sampling will become less likely under SGD with overtraining.

See e.g., page 47 in the main paper.
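
For concreteness, here is a minimal sketch of the kind of comparison this hypothesis describes, on a toy boolean task. The task, network sizes, and hyperparameters below are illustrative assumptions, not the setup from the paper (which uses a Gaussian-process approximation rather than brute-force rejection sampling):

```python
# Sketch: estimate how often each function is found by (a) SGD trained to zero
# train error, optionally with extra "overtraining" epochs, and (b) rejection
# sampling of random parameters. Everything here is an illustrative assumption.
from collections import Counter

import torch

torch.manual_seed(0)
n_bits, width = 5, 32
X = torch.tensor([[int(b) for b in format(i, f"0{n_bits}b")]
                  for i in range(2 ** n_bits)], dtype=torch.float32)
target = (X.sum(dim=1) >= 3).float()            # an arbitrary target labelling
train_idx = torch.randperm(2 ** n_bits)[:20]    # the subset used for training


def make_net():
    return torch.nn.Sequential(
        torch.nn.Linear(n_bits, width), torch.nn.ReLU(), torch.nn.Linear(width, 1))


def implemented_function(net):
    """The full truth table the network implements, as a string key."""
    with torch.no_grad():
        return "".join("1" if v > 0 else "0" for v in net(X).squeeze(1))


def fits_train(net):
    with torch.no_grad():
        preds = (net(X[train_idx]).squeeze(1) > 0).float()
        return bool((preds == target[train_idx]).all())


def run_sgd(overtrain_epochs=0):
    net = make_net()
    loss_fn = torch.nn.BCEWithLogitsLoss()
    opt = torch.optim.SGD(net.parameters(), lr=0.1)

    def step():
        opt.zero_grad()
        loss_fn(net(X[train_idx]).squeeze(1), target[train_idx]).backward()
        opt.step()

    for _ in range(10_000):              # phase 1: train until zero train error
        if fits_train(net):
            break
        step()
    for _ in range(overtrain_epochs):    # phase 2: optional overtraining
        step()
    return implemented_function(net)


def run_random_sampling():
    # Brute-force rejection sampling of random initialisations; this can be
    # slow, which is why the paper uses a Gaussian-process approximation.
    while True:
        net = make_net()
        if fits_train(net):
            return implemented_function(net)


sgd_counts = Counter(run_sgd(overtrain_epochs=0) for _ in range(200))
sampling_counts = Counter(run_random_sampling() for _ in range(200))

# The hypothesis predicts that setting overtrain_epochs > 0 "tilts" the
# log-log relationship between these two sets of frequencies.
for f in sgd_counts:
    print(f, sgd_counts[f], sampling_counts[f])
```

Plotting the two sets of frequencies against each other on log-log axes, with and without overtraining, is the comparison the hypothesis is about.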

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

(Maybe you weren't disagreeing with Zach and were just saying the same thing a different way?)

I'm honestly not sure; I just wasn't really sure what he meant when he said that the Bayesian and the Kolmogorov complexity stuff were "distractions from the main point".

This feels similar to:

Saying that MLK was a "criminal" is one way of saying that MLK thought and acted as though he had a moral responsibility to break unjust laws and to take direct action.

(This is an exaggeration but I think it is directionally correct. Certainly when I read the title "neural ne

... (read more)
Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

I agree with your summary. I'm mainly just clarifying what my view is of the strength and overall role of the Algorithmic Information Theory arguments, since you said you found them unconvincing. 

I do however disagree that those arguments can be applied to "literally any machine learning algorithm", although they certainly do apply to a much larger class of ML algorithms than just neural networks. However, I also don't think this is necessarily a bad thing. The picture that the AIT arguments give makes it reasonably unsurprising that you would get the... (read more)

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

I have a few comments on this:

Fundamentally the point here is that generalization performance is explained much more by the neural network architecture than by the structure of stochastic gradient descent, since we can see that stochastic gradient descent tends to behave similarly to (an approximation of) random sampling. The paper talks a bunch about things like SGD being (almost) Bayesian and the neural network prior having low Kolmogorov complexity; I found these to be distractions from the main point.

The main point, as I see it, is essentially tha... (read more)

1Repetitive Experimenter4moWhat I'm suggesting is that volume in high-dimensions can concentrate on the boundary. To be clear, when I say SGD only typically reaches the boundary, I'm talking about early stopping and the main experimental setup in your paper, where training is stopped upon reaching zero train error. This does seem to invalidate the model. However, something tells me that the difference here is more about degree.

Since you use the word 'should', I'll use the wiggle room to propose an argument for what 'should' happen. If SGD is run with early stopping, as described above, then my argument is that this is roughly equivalent to random sampling, via an appeal to concentration of measure in high dimensions. If SGD is not run with early stopping, it's enclosed by the boundary of zero-train-error functions. Because these functions are most likely in the interior, they are unlikely to be produced by random sampling. Thus, on a log-log plot I'd expect overtraining to 'tilt' the correspondence between SGD and random sampling likelihoods downward.

Falsifiable Hypothesis: Compare SGD with overtraining to the random sampling algorithm. You will see that functions that are unlikely to be generated by random sampling will be more likely under SGD with overtraining. Moreover, functions that are more likely with random sampling will become less likely under SGD with overtraining.
3Rohin Shah4moThis seems right, but I'm not sure how that's different from Zach's phrasing of the main point? Zach's phrasing was "SGD approximately equals random sampling", and random sampling finds functions with probability exactly proportional to their volume. Combine that with the fact that empirically we get good generalization and we get the thing you said. (Maybe you weren't disagreeing with Zach and were just saying the same thing a different way?)

This feels similar [https://www.lesswrong.com/posts/yCWPkLi8wJvewPbEp/the-noncentral-fallacy-the-worst-argument-in-the-world] to:

Saying that MLK was a "criminal" is one way of saying that MLK thought and acted as though he had a moral responsibility to break unjust laws and to take direct action.

(This is an exaggeration but I think it is directionally correct. Certainly when I read the title "neural networks are fundamentally Bayesian" I was thinking of something very different.)

I've discussed this above [https://www.lesswrong.com/posts/YSFJosoHYFyXjoYWa/why-neural-networks-generalise-and-why-they-are-kind-of?commentId=KFuGWt3FJ288T4tbB], I'll continue the discussion there. The rest of the comment is about stuff that I didn't have a strong opinion on, so I'll leave it for Zach to answer if he wants.
Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

This part of the argument is indeed quite general, but it’s not vacuous. For the argument to apply you need it to be the case that the function space is overparameterised, that the parameter-function map has low complexity relative to the functions in the function space, and that the parameter-function map is biased in some direction. This will not be the case for all learning algorithms.

But, I do agree that the Levin bound argument doesn’t really capture the “mechanism” of what’s going on here. I can of course only speak for myself (and not the other people i... (read more)

2Rohin Shah4moI agree it's not vacuous. It sounds like you're mostly stating the same argument I gave but in different words. Can you tell me what's wrong or missing from my summary of the argument? (Even if you are talking about the overparameterized case, where this argument is not vacuous and also applies primarily to neural nets and not other ML models, I don't find this argument very compelling a priori, though I agree that based on empirical evidence it is probably mostly correct.)
Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Like Rohin, I'm not impressed with the information theoretic side of this work.

Specifically, I'm wary of the focus on measuring complexity for functions between finite sets, such as binary functions.

Mostly, we care about NN generalization on problems where the input space is continuous, generally R^n.  The authors argue that the finite-set results are relevant to these problems, because one can always discretize R^n to get a finite set.  I don't think this captures the kinds of function complexity we care about for NNs.

We’re not saying that discr... (read more)

1nostalgebraist4moI'll write mostly about this statement, as I think it's the crux of our disagreement.

The statement may be true as long as we hold the meaning of "objects" constant as we vary the complexity measure. However, if we translate objects from one mathematical space to another (say by discretizing, or adding/removing a metric structure), we can't simply say the complexity measures for space A on the original A-objects inevitably agree with those of space B on the translated B-objects. Whether this is true depends on our choice of translation.

(This is clear in the trivial cases of bad translation where we, say, map every A-object onto the same B-object. Now, obviously, no one would consider this a correct or adequate way to associate A-objects with B-objects. But the example shows that the claim about complexity measures will only hold if our translation is "good enough" in some sense. If we don't have any idea what "good enough" means, something is missing from the story.)

In the problem at hand, the worrying part of the translation from real to boolean inputs is the loss of metric structure. (More precisely, the hand-waviness about what metric structure survives the translation, if any.) If there's no metric, this destroys the information needed by complexity measures that care about how easy it is to reconstruct an object "close to" the specified one.

Basic information theory doesn't require a metric, only a measure. There's no sense of "getting an output approximately right," only of "getting the exactly right output with high probability." If you care about being approximately right according to some metric, this leads you to rate-distortion theory [https://en.wikipedia.org/wiki/Rate%E2%80%93distortion_theory]. Both of these domains -- information theory without a metric, and with one -- define notions of incompressibility/complexity, but they're different.

Consider two distributions on R:
1. The standard normal,
2. The standard normal, but you chop it into a
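
To illustrate the metric-versus-measure point above with a toy computation (my own example, standing in for the truncated one in the comment): the Shannon entropy of a discretised standard normal is unchanged if you arbitrarily relabel its bins, because entropy only sees the measure, whereas a rate-distortion analysis with, say, squared-error distortion would care which bins are close to which.

```python
# Small illustration: plain Shannon entropy ignores metric structure.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(1_000_000)

bins = np.arange(-5, 5.1, 0.1)                 # discretise R with width-0.1 bins
counts, _ = np.histogram(samples, bins=bins)
p = counts / counts.sum()
p = p[p > 0]

entropy = -(p * np.log2(p)).sum()

perm = rng.permutation(len(p))                 # scramble which bin is "next to" which
entropy_scrambled = -(p[perm] * np.log2(p[perm])).sum()

print(entropy, entropy_scrambled)              # identical: entropy sees only the measure
# A rate-distortion analysis with squared-error distortion would treat the two
# arrangements very differently, because it does use the metric on R.
```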
Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Yes, it does of course apply in that sense. 

I guess the question then basically is which level of abstraction we think would be the most informative or useful for understanding what's going on here. I mean, we could for example also choose to take into account the fact that any actual computer program runs on a physical computer, which is governed by the laws of electromagnetism (in which case the parameter-space might count as continuous again). 

I'm not sure if accounting for the floating-point implementation is informative or not in this case.

2Evan Hubinger6moMy understanding is that floating-point granularity is enough of a real problem that it does sometimes matter in realistic ML settings, which suggests that it's probably a reasonable level of abstraction on which to analyze neural networks (whereas any additional insights from an electromagnetism-based analysis probably never matter, suggesting that's not a reasonable/useful level of abstraction).
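
A quick way to see the floating-point granularity point concretely (a small numpy check, nothing specific to the paper):

```python
# float32 parameters can only take finitely many values, and the spacing
# between adjacent representable values grows with magnitude.
import numpy as np

print(np.spacing(np.float32(1.0)))      # gap to the next float32 above 1.0 (~1.19e-7)
print(np.spacing(np.float32(100.0)))    # gaps grow with magnitude (~7.6e-6)
print(np.finfo(np.float32))             # full summary of the float32 format
# So a network with n float32 parameters has at most (2**32)**n distinct
# parameter settings: a finite, discrete (if astronomically large) space.
```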
Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

There is a proof of it in "An Introduction to Kolmogorov Complexity and Its Applications" by Ming Li & Paul Vitanyi.
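
For readers looking it up, one standard statement of the result in question (the coding theorem, due to Levin, in roughly the form given in Li & Vitányi; the constant depends only on the choice of reference universal machine):

```latex
% Coding theorem (Levin): for the universal discrete semimeasure m and prefix
% Kolmogorov complexity K, there is a constant c, depending only on the
% reference universal machine, such that for every finite object x:
\[
  \bigl|\, K(x) + \log_2 m(x) \,\bigr| \;\le\; c .
\]
% In the neural-network context this gets applied in roughly the form
% P(f) \lesssim 2^{-K(f) + O(1)}, for parameter-function maps that are
% themselves simple.
```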

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Ah, I certainly agree with this.

I do not wish to claim that all functions with low Kolmogorov complexity have large volumes in the parameter-space of a sufficiently large neural network. In fact, I can point to several concrete counterexamples to this claim. To give one example, the identity function certainly has a low Kolmogorov complexity, but it's very difficult for a (fully connected feed-forward) neural network to learn this function (if the input and output are represented in binary form by a bit string). If you try to learn this function by training... (read more)
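
A minimal sketch of the kind of experiment described here, assuming a plain fully connected network trained on the identity map over bit strings (the sizes, library, and hyperparameters are my own illustrative choices):

```python
# Hypothetical experiment (not from the paper): can a plain MLP learn the
# identity map on bit strings from examples?
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n_bits = 16

# Random training/test bit strings; the target function is the identity.
X_train = rng.integers(0, 2, size=(2000, n_bits)).astype(float)
X_test = rng.integers(0, 2, size=(500, n_bits)).astype(float)

mlp = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=2000, random_state=0)
mlp.fit(X_train, X_train)          # identity: inputs are their own targets


def bitwise_accuracy(model, X):
    preds = (model.predict(X) > 0.5).astype(float)
    return (preds == X).mean()


print("train bitwise accuracy:", bitwise_accuracy(mlp, X_train))
print("test  bitwise accuracy:", bitwise_accuracy(mlp, X_test))
# The claim in the comment is that, despite the identity's low Kolmogorov
# complexity, generalisation to unseen bit strings tends to be poor.
```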

3johnswentworth6moThat's a clever example, I like it. Based on that description, it should be straightforward to generalize the Levin bound to neural networks. The main step would be to replace the Huffman code with a turbocode (or any other near-Shannon-bound code), at which point the compressibility is basically identical to the log probability density, and we can take the limit to continuous function space without any trouble. The main change is that entropy would become relative entropy (as is normal when taking info theory bounds to a continuous limit). Intuitively, it's just using the usual translation between probability theory and minimum description length, and applying it to the probability density of parameter space.
Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

I'm not sure I agree with this -- I think Kolmogorov complexity is a relevant notion of complexity in this context.

3johnswentworth6moHow so? In particular, are you saying that gaussian processes bias towards low Kolmogorov complexity, or are you saying that there's some simplicity prior here besides what gaussian processes have? If the former, how does a gaussian process bias towards short programs other than the compressibility offered by smoothness? (Of course smoothness itself implies some compressibility in the Kolmogorov picture; the interesting question is whether there's a bias towards functions which can be compressed in more general ways.)
Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Yes, I agree with this. 

I should note that SGD definitely does make a contribution to the inductive bias, but this contribution does seem to be quite small compared to the contribution that comes from the implicit prior embedded in the parameter-function map. For example, if you look at Figure 6 in the Towards Data Science post, you can see that different versions of SGD give a slightly different inductive bias. It's also well-known that the inductive bias of neural networks is affected by things like how long you train for, and what step size and bat... (read more)

1Daniel Kokotajlo6moI'd be interested to hear your take on why this means NNs are only doing interpolation. What does it mean to only do interpolation and not extrapolation? I know the toy model definitions of those terms (connecting dots vs. drawing a line off away from your dots) but what does it mean in real-life problems? It seems like a fuzzy/graded distinction to me, at best. Also, if the simplicity prior they use is akin to Kolmogorov-complexity-based priors, then that means what they are doing is akin to what e.g. Solomonoff Induction does. And I've never heard anyone try to argue that Solomonoff Induction "merely interpolates" before!
Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

I believe Chris has now updated the Towards Data Science blog post to be more clear, based on the conversation you had, but I'll make some clarifications here as well, for the benefit of others:

The key claim, that P_SGD(f | S) ≈ P_B(f | S), is indeed not (meant to be) dependent on any test data per se. The test data comes into the picture because we need a way to granularise the space of possible functions if we want to compare these two quantities empirically. If we take "the space of functions" to be all the functions that a given neural network can express... (read more)

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

We do have empirical data which shows that the neural network "prior" is biased towards low-complexity functions, and some arguments for why it would make sense to expect this to be the case -- see this new blog post, and my comment here. Essentially, low-complexity functions correspond to larger volumes in the parameter-space of neural networks. If functions with large volumes also have large basins of attraction, and if using SGD is roughly equivalent to going down a random basin (weighted by its size), then this would essentially explain why neural netw... (read more)

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

We do have a lot of empirical evidence that simple functions indeed have bigger volumes in parameter-space (exponentially bigger volumes, in fact). See especially the 2nd and 3rd paper linked in the OP.
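
Here is a minimal sketch of the kind of experiment behind this claim, in the style of (but much smaller than) the papers linked in the OP: sample random parameters for a tiny network over all 7-bit inputs, record which boolean function each draw implements, and compare each function's empirical probability with a crude complexity proxy. The network, sample count, and the zlib-based complexity proxy are my own illustrative choices.

```python
# Estimate P(f) under random parameter sampling and compare against a crude
# complexity proxy (compressed length of the function's truth table).
import zlib
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)
n_bits, n_samples = 7, 100_000
X = np.array([[int(b) for b in format(i, f"0{n_bits}b")]
              for i in range(2 ** n_bits)], dtype=float)


def random_function(rng, width=40):
    """One random draw of parameters -> the boolean function it implements."""
    W1 = rng.normal(0, 1, (n_bits, width)) / np.sqrt(n_bits)
    b1 = rng.normal(0, 1, width)
    W2 = rng.normal(0, 1, (width, 1)) / np.sqrt(width)
    b2 = rng.normal(0, 1, 1)
    out = np.maximum(X @ W1 + b1, 0) @ W2 + b2
    return "".join("1" if v > 0 else "0" for v in out.ravel())


counts = Counter(random_function(rng) for _ in range(n_samples))

for f, c in counts.most_common(10):
    complexity = len(zlib.compress(f.encode()))
    print(f"P(f) ~ {c / n_samples:.4f}   LZ-proxy complexity = {complexity}")
# The prediction is that the high-probability functions have low complexity
# proxies, and that P(f) falls off roughly exponentially with complexity.
```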

It's true that we don't have a rigorous mathematical proof for "why" simple functions should have larger volumes in parameter-space, but I can give an intuitive argument for why I think it would make sense to expect this to be the case.

First of all, we can look at the claim "backwards"; suppose a function has a large volume in parameter-space. This means tha... (read more)

1Adam Shimi6moDo you have a good reference for the Levin bound? My attempts at finding a relevant paper all failed.
3Evan Hubinger6moIn what sense is the parameter space of a neural network not finite and discrete? It is often useful to understand floating-point values as continuous, but in fact they are discrete such that it seems like a theorem which assumes discreteness would still apply.
2Daniel Kokotajlo6moThanks, this is great! OK, I'm pretty convinced now, but I was biased to believe something like this anyway. I would love to hear what a skeptic thinks.
Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Yes, as Adam Shimi suggested below, I'm here talking about "input-output relations" when I say "functions". 

The claim can be made formal in a few different ways, but one way is this: let's say we have a classification task with 10 classes, 50k training examples, and 10k test examples. If we consider the domain to be these 60k instances (and the codomain to be the 10 classes), then we have 10^60k functions defined in this space, 10^10k of which fit the training data perfectly. Of these 10^10k functions, the vast vast majority will have very poor accura... (read more)
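
Spelling out the arithmetic behind those counts (for the hypothetical 60k-instance domain with 10 classes and 50k fixed training labels):

```latex
% 60,000 instances, 10 possible classes each:
\[
  \#\{\text{all functions on the domain}\} = 10^{60\,000},
  \qquad
  \#\{\text{functions fitting the training set exactly}\}
    = 10^{60\,000 - 50\,000} = 10^{10\,000}.
\]
```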

2Daniel Kokotajlo6moThanks! Yeah, I just forgot the definition of a function, sorry. I was thinking of computer programs. Simple functions are expressed by more computer programs than complex functions. (Which is why I was already inclined to believe your conclusion before reading your stuff -- if neural nets are like computer programs, then it is plausible that simple functions are expressed by more neural net initializations)
Thoughts from a Two Boxer

As you may know, CDT has a lot of fans in academia. It might be interesting to consider what they have to say about Newcomb's Problem (and other supposed counter-examples to CDT).

In "The Foundations of Causal Decision Theory", James Joyce argues that Newcomb's Problem is "unfair" on the grounds that it treats EDT and CDT agents differently. An EDT agent is given two good choices ($1,000,000 and $1,001,000) whereas a CDT agent is given two bad choices ($0 and $1,000). If you wanted to represent Newcomb's Problem as a Marko... (read more)

Two senses of “optimizer”

I have already (sort of) addressed this point at the bottom of the post. There is a perspective from which any optimizer_1 can (kind of) be thought of as an optimizer_2, but it's unclear how informative this is. It is certainly at least misleading in many cases. Whether or not the distinction is "leaky" in a given case is something that should be carefully examined, not something that should be glossed over.

I also agree with what ofer said.

"even if we can make something safe in optimizer_1 terms, it may still be dangerous as an optimizer_2 b... (read more)

1G Gordon Worley III2yI think there is only a question of how leaky, but it is always non-zero amounts of leaky, which is the reason Bostrom and others are concerned about it for all optimizers and don't bother to make this distinction.
Two senses of “optimizer”

I agree with everything you have said.

Two senses of “optimizer”

I don't think that I'm assuming the existence of some sort of Cartesian boundary, and the distinction between these two senses of "optimizer" does not seem to disappear if you think of a computer as an embedded, causal structure. Could you state more precisely why you think that this is a Cartesian distinction?

4G Gordon Worley III2ySure, let's be super specific about it. Let's say we have something you consider an optimizer_1, a SAT solver. It operates over a set of variables V arranged in predicates P using an algorithm A. Since this is a real SAT solver that is computed rather than a purely mathematical one we think about, it runs on some computer C, and thus for each of V, P, and A there is some C(V), C(P), and C(A) that is the manifestation of each on the computer. We can conceptualize what C does to V, P, and A in different ways: it turns them into bytes, it turns A into instructions, it uses C(A) to operate on C(V) and C(P) to produce a solution for V and P.

Now the intention is that the algorithm A is an optimizer_1 that only operates on V and P, but in fact A is never run, properly speaking; C(A) is, and we can only say A is run to the extent C(A) does something to reality that we can set up an isomorphism to A with. So C(A) is only an optimizer_1 to the extent the isomorphism holds and it is, as you defined optimizer_1, "solving a computational optimization problem". But properly speaking C(A) doesn't "know" it's an algorithm: it's just matter arranged in a way that is isomorphic, via some transformation, to A. So what is C(A) doing then to produce a solution? Well, I'd say it "optimizes its environment", that is, literally the matter and its configuration that it is in contact with, so it's an optimizer_2.

You might object that there's something special going on here such that C(A) is still an optimizer_1 because it was set up in a way that isolates it from the broader environment so it stays within the isomorphism, but that's not a matter of classification, that's an engineering problem of making an optimizer_2 behave as if it were an optimizer_1. And a large chunk of AI safety (mostly boxing) is dealing with ways in which, even if we can make something safe in optimizer_1 terms, it may still be dangerous as an optimizer_2 because of unexpected behavior where it "breaks" the isomorp