What makes you think that's intended to be a counting argument over function space? I usually think of this as a counting argument over infinite bitstrings.
I definitely thought you were making a counting argument over function space, and AFAICT Joe also thought this in his report.
The bitstring version of the argument, to the extent I can understand it, just seems even worse to me. You're making an argument about one type of learning procedure, Solomonoff induction, which is physically unrealizable and AFAICT has not even inspired any serious real-world approximations, and then assuming that somehow the conclusions will transfer over to a mechanistically very different learning procedure, gradient descent. The same goes for the circuit prior thing (although FWIW I think you're very likely wrong that minimal circuits can be deceptive).
Then show me how! If you think there are errors in the math, please point them out.
I'm not aware of any actual math behind the counting argument for scheming. I've only ever seen handwavy informal arguments about the number of Christs vs Martin Luthers vs Blaise Pascals. There certainly was no formal argument presented in Joe's extensive scheming report, which I assumed would be sufficient context for writing this essay.
I obviously don't think the counting argument for overfitting is actually sound, that's the whole point. But I think the counting argument for scheming is just as obviously invalid, and misuses formalisms just as egregiously, if not moreso.
I deny that your Kolmogorov framework is anything like "the proper formalism" for neural networks. I also deny that the counting argument for overfitting is appropriately characterized as a "finite bitstring" argument, because that suggests I'm talking about Turing machine programs of finite length, which I'm not; I'm directly enumerating functions over a subset of the natural numbers. Are you saying the set of functions over {1, ..., 10,000} is not a well-defined mathematical object?
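For concreteness, here's a toy version of that kind of direct enumeration (my own sketch, with the domain shrunk from {1, ..., 10,000} to ten elements and binary outputs so the full enumeration is tractable): almost every function that fits the training labels fails to match the intended generalization, which is the shape of the overfitting counting argument.

```python
from itertools import product

domain = list(range(10))             # small stand-in for {1, ..., 10,000}
train = {0: 0, 1: 1, 2: 0, 3: 1}     # labels seen during training
target = {x: x % 2 for x in domain}  # the "intended" generalization

# Enumerate every function from the domain to {0, 1}: 2^10 = 1024 of them.
all_fns = [dict(zip(domain, vals)) for vals in product([0, 1], repeat=len(domain))]

# Functions consistent with the training labels, and the one that generalizes.
fit_train = [f for f in all_fns if all(f[x] == y for x, y in train.items())]
generalize = [f for f in fit_train if f == target]

print(len(all_fns), len(fit_train), len(generalize))  # 1024 64 1
```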
I never used any kind of bitstring analysis.
I think the infinite bitstring case has zero relevance to deep learning.
There does exist a concept you might call "simplicity" which is relevant to deep learning. The neural network Gaussian process (NNGP) describes the prior distribution over functions that is induced by the initialization distribution over neural net parameters. Under weak assumptions about the activation function and initialization variance, the NNGP is biased toward lower-frequency functions. I think this cuts against scheming, and we plan to write up a post on this in the next month or two.
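As a rough numerical illustration of what that low-frequency bias looks like (my own sketch, not the forthcoming post): sample functions from the initialization distribution of a wide one-hidden-layer tanh network on a 1-D grid and average their power spectra. Under these assumed initialization variances, most of the spectral mass lands at low frequencies.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(-np.pi, np.pi, 256)  # 1-D input grid

def sample_random_net(width=512):
    """One draw from the init distribution of a one-hidden-layer tanh net."""
    w1 = rng.normal(0, 1, size=(width, 1))
    b1 = rng.normal(0, 1, size=(width, 1))
    w2 = rng.normal(0, 1 / np.sqrt(width), size=(1, width))
    return (w2 @ np.tanh(w1 * xs[None, :] + b1)).ravel()

# Average power spectrum over many draws from the prior over functions.
spectra = np.array([np.abs(np.fft.rfft(sample_random_net()))**2 for _ in range(200)])
mean_power = spectra.mean(axis=0)
print(mean_power[:5] / mean_power.sum())  # low frequencies carry most of the mass
```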
Thanks for the reply. A couple remarks:
With respect to which measure, though? You have to define a measure; there are infinitely many possible measures you could define on this space. And then we'll have to debate whether your measure is a good one.
It depends on what you mean by a "way" the model can overfit.
Really, we need to bring in measure theory to talk about this rigorously, and an early draft of this post actually did introduce some measure-theoretic concepts. Basically, we need to define a measure over the space of networks (or functions) we're counting over.
The English locution "lots of ways to do X" can be formalized as "the measure of X-networks is high." And that's going to be an empirical claim that we can actually debate.
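As one possible way to cash that out (a sketch with a made-up property X, not anything from the post): take the initialization distribution as the measure on parameter space and Monte Carlo the probability that a randomly drawn network satisfies X.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_params(width=64):
    """One draw from an assumed Gaussian initialization distribution."""
    return (rng.normal(0, 1, (width, 1)),
            rng.normal(0, 1, (width, 1)),
            rng.normal(0, 1 / np.sqrt(width), (1, width)))

def net(params, x):
    """Evaluate a one-hidden-layer tanh network at a scalar input."""
    w1, b1, w2 = params
    return (w2 @ np.tanh(w1 * x + b1)).item()

def is_X(params, xs=(-1.0, 0.0, 1.0), ys=(0.0, 0.0, 0.0), tol=0.5):
    """A made-up property X: the network roughly fits three training points."""
    return all(abs(net(params, x) - y) < tol for x, y in zip(xs, ys))

# "The measure of X-networks is high" becomes: the probability, under the
# initialization measure, that a sampled network satisfies X.
n_samples = 5000
estimate = sum(is_X(sample_params()) for _ in range(n_samples)) / n_samples
print(estimate)
```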