All of Oliver Sourbut's Comments + Replies

Information Loss --> Basin flatness

Another aesthetic similarity which my brain noted is between your concept of 'information loss' on inputs for layers-which-discriminate and layers-which-don't and the concept of sufficient statistics.

A sufficient statistic $T(X)$ is one for which the posterior over the parameter $\theta$ is independent of the data $X$, given the statistic $T(X)$

which has the same flavour as your notion of 'information loss' for layers-which-don't-discriminate.

In the respective cases, the statistic $T(X)$ and the layer's activations are 'sufficient' and induce an equivalence class between inputs.
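(For concreteness, the standard Bayesian statement I have in mind, as a reference sketch rather than anything from the post:)

```latex
% Sufficiency, Bayesian form: conditioning on T(X) screens off the data itself
\[
  P\big(\theta \mid X = x\big) \;=\; P\big(\theta \mid T(X) = T(x)\big)
  \quad \text{for all } x,
\]
% or equivalently, the Fisher--Neyman factorisation
\[
  p(x \mid \theta) \;=\; g\big(T(x), \theta\big)\, h(x),
\]
% so any two inputs with the same value of T(x) are interchangeable for
% inference about theta: they sit in the same equivalence class.
```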

Vivek Hebbar (1mo)
Yup, seems correct.
Information Loss --> Basin flatness

Regarding your empirical findings which may run counter to the question

  1. Is manifold dimensionality actually a good predictor of which solution will be found?

I wonder if there's a connection to asymptotic equipartitioning - it may be that the 'modal' (most 'voluminous' few) solution basins are indeed higher-rank, but that they are in practice so comparatively few as to contribute negligible overall volume?

This is a fuzzy tentative connection made mostly on the basis of aesthetics rather than a deep technical connection I'm aware of.
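(For reference, the rough shape of the asymptotic equipartition property I'm gesturing at, stated for i.i.d. sequences; nothing here is specific to basins:)

```latex
% Asymptotic equipartition property (i.i.d. case): for X_1,...,X_n i.i.d. with
% entropy H(X), the typical set
\[
  A_\epsilon^{(n)} \;=\; \Big\{\, x^n \;:\; \big|\, -\tfrac{1}{n}\log_2 p(x^n) - H(X) \,\big| \le \epsilon \,\Big\}
\]
% contains roughly 2^{n H(X)} sequences, each of probability roughly 2^{-n H(X)},
% and carries total probability > 1 - epsilon for large n. The individually most
% probable sequences generally lie outside the typical set, but there are so few
% of them that their total mass is negligible; hence the analogy between
% "largest individual basins" and "where most of the volume is".
```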

Vivek Hebbar (1mo)
Yeah, this seems roughly correct, and similar to what I was thinking. There is probably even a direct connection to the "asymptotic equipartitioning" math, via manifold counts containing terms like "A choose B" from permutations of neurons.
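(Presumably the link is the standard Stirling-type estimate, under which "A choose B" terms are exponential in an entropy, the same quantity that controls typical-set sizes:)

```latex
% Stirling's approximation gives, for 0 < k < n,
\[
  \log_2 \binom{n}{k} \;=\; n\, H_2\!\left(\tfrac{k}{n}\right) + O(\log n),
  \qquad H_2(p) = -p \log_2 p - (1-p)\log_2(1-p),
\]
% i.e. binomial counts of neuron permutations grow exponentially in an entropy term.
```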
Information Loss --> Basin flatness

Interesting stuff! I'm still getting my head around it, but I think implicit in a lot of this is that loss is some quadratic function of 'behaviour' - is that right? If so, it could be worth spelling that out. Though maybe in a small neighbourhood of a local minimum this is approximately true anyway?

This also brings to mind the question of what happens when we're in a region with no local minimum (e.g. saddle points all the way down, or asymptoting to a lower loss, etc.)
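(Spelling out the 'approximately true in a small neighbourhood' guess: for a smooth loss the second-order Taylor expansion around a critical point is a quadratic form, and for MSE the loss is exactly quadratic in the behaviour vector. A sketch of what I mean:)

```latex
% Around a critical point theta* (where \nabla L(\theta^*) = 0), for smooth L:
\[
  L(\theta) \;\approx\; L(\theta^*) + \tfrac{1}{2}\, (\theta - \theta^*)^{\!\top} H\, (\theta - \theta^*),
  \qquad H = \nabla^2 L(\theta^*),
\]
% with flat basin directions corresponding to zero eigenvalues of H.
% For MSE the loss is exactly quadratic in the behaviour vector
% f(\theta) = (f(x_1;\theta), ..., f(x_n;\theta)):
\[
  L(\theta) \;=\; \tfrac{1}{n} \sum_{i=1}^{n} \big( f(x_i;\theta) - y_i \big)^2 .
\]
```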

Vivek Hebbar (1mo)
Yep, I am assuming MSE loss generally, but as you point out, any smooth and convex loss function will be locally approximately quadratic.

"Saddle points all the way down" isn't possible if a global min exists, since a saddle point implies the existence of an adjacent lower point.

As for asymptotes, this is indeed possible, especially in classification tasks. I have basically ignored this and stuck to regression here. I might return to the issue of classification / solutions at infinity in a later post, but for now I will say this: it doesn't seem that much different, especially when it comes to manifold dimension; an m-dimensional manifold in parameter space generally extends to infinity, and it corresponds to an (m-1)-dimensional manifold in angle space (you can think of it as a hypersphere of asymptote directions).

I would say the main things neglected in this post are:

  1. Manifold count (most important neglected thing)
  2. Basin width in non-infinite directions
  3. Distance from the origin

These apply to both regression and classification.
(A -> B) -> A

I think the gradient descent bit is spot on. That also looks like the flavour of natural selection, with non-infinitesimal (but really small) deltas. Natural selection consumes a proof that a particular A (mutation) produces B (fitness) to generate/propagate/multiply A.

I recently did some thinking about this and found an equivalence proof under certain conditions for the natural selection case and the gradient descent case.

In general, I think the type signature here can indeed be soft or fuzzy or lossy and you still get consequentialism, and the 'better... (read more)
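Re the equivalence-proof remark above: not the proof itself, but here is a minimal toy sketch of the flavour of the correspondence (my own illustration; the landscape and names below are made up). Selecting the fittest of many small random mutations moves you in roughly the direction a gradient step would, and the agreement tightens as the mutation scale shrinks.

```python
# Toy illustration (assumed setup, not from the linked write-up): best-of-many
# small mutations vs. a gradient step on the same smooth "fitness" landscape.
import numpy as np

rng = np.random.default_rng(0)

def loss(theta):
    # Arbitrary smooth 2-D loss (negative fitness); purely illustrative.
    x, y = theta
    return (x - 1.0) ** 2 + 0.5 * (y + 2.0) ** 2 + 0.3 * np.sin(3 * x) * np.cos(2 * y)

def grad(theta, eps=1e-5):
    # Central finite-difference gradient, to keep the sketch self-contained.
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        g[i] = (loss(theta + step) - loss(theta - step)) / (2 * eps)
    return g

def selection_direction(theta, sigma, n_offspring=2000):
    # "Natural selection" step: many small Gaussian mutations, keep the fittest.
    mutations = sigma * rng.standard_normal((n_offspring, theta.size))
    fitnesses = np.array([-loss(theta + m) for m in mutations])
    best = mutations[np.argmax(fitnesses)]
    return best / np.linalg.norm(best)

theta = np.array([0.3, 0.7])
gd_direction = -grad(theta)
gd_direction /= np.linalg.norm(gd_direction)

for sigma in (0.5, 0.1, 0.01):
    sel_direction = selection_direction(theta, sigma)
    cosine = float(gd_direction @ sel_direction)
    print(f"mutation scale {sigma}: cosine(selected direction, -gradient) = {cosine:.3f}")
```

Of course this only shows the small-mutation limit at a single point; the conditions under which the analogy holds more generally are the interesting part.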

Inner Alignment: Explain like I'm 12 Edition

This post is thoroughly excellent, a good summary and an important service!

  However, the big caveat here is that evolution does not implement Stochastic Gradient Descent.

I came here to say that in fact they are quite analogous after all