Regarding your empirical findings, which may run counter to the question:
- Is manifold dimensionality actually a good predictor of which solution will be found?
I wonder if there's a connection to the asymptotic equipartition property - it may be that the 'modal' (individually most 'voluminous') solution basins are indeed higher-rank, but that they are in practice so comparatively few that they contribute negligible overall volume?
This is a fuzzy tentative connection made mostly on the basis of aesthetics rather than a deep technical connection I'm aware of.
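To make the fuzzy connection slightly more concrete, here's a toy sketch (my own illustration with a made-up biased-coin example, nothing from the post): the single most probable sequence is 'modal', but sequences like it are so few that essentially all of the probability mass sits in the much larger typical set - the same shape as the intuition above about high-rank basins versus total volume.

```python
import math

p, n = 0.9, 50  # made-up biased coin and sequence length

def seq_log_prob(k_heads):
    # log-probability of one particular sequence containing k_heads heads
    return k_heads * math.log(p) + (n - k_heads) * math.log(1 - p)

# mass of the single most probable ("modal") sequence: all heads
modal_mass = math.exp(seq_log_prob(n))

# total mass of the far more numerous sequences with roughly n*p heads
typical_mass = sum(
    math.comb(n, k) * math.exp(seq_log_prob(k))
    for k in range(int(0.8 * n), int(0.96 * n))
)

print(f"mass of the modal sequence: {modal_mass:.4f}")  # a fraction of a percent
print(f"mass of the typical band:   {typical_mass:.4f}")  # the bulk of the total
```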
Interesting stuff! I'm still getting my head around it, but I think implicit in a lot of this is that loss is some quadratic function of 'behaviour' - is that right? If so, it could be worth spelling that out. Though maybe in a small neighbourhood of a local minimum this is approximately true anyway?
This also brings to mind the question of what happens when we're in a region with no local minimum (e.g. saddle points all the way down, or asymptoting to a lower loss, etc.)
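To spell out the assumption I'm gesturing at (this is just the standard second-order Taylor expansion, not something taken from the post): near a local minimum θ* the gradient term vanishes, so

$$L(\theta) \;\approx\; L(\theta^*) + \tfrac{1}{2}\,(\theta - \theta^*)^\top \nabla^2 L(\theta^*)\,(\theta - \theta^*),$$

i.e. the loss is locally quadratic in the parameters; and if behaviour varies roughly linearly with θ in that neighbourhood, a loss that is quadratic in behaviour is also approximately quadratic in θ.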
I think the gradient descent bit is spot on. That also looks like the flavour of natural selection, with non-infinitesimal (but really small) deltas. Natural selection consumes a proof that a particular change (mutation) produces an improvement (fitness) in order to generate/propagate/multiply that change.
I recently did some thinking about this and found a proof of equivalence, under certain conditions, between the natural-selection case and the gradient-descent case (toy sketch below).
In general, I think the type signature here can indeed be soft or fuzzy or lossy and you still get consequentialism, and the 'better...
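A minimal sketch of the kind of equivalence I mean (my own toy example; the fitness function, mutation size, and selection rule are all made up for illustration): with small mutations, keeping only the variants that a fitness comparison 'proves' are better moves you in roughly the same direction as a gradient-ascent step.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(theta):
    # toy smooth fitness landscape (made up for illustration)
    return -np.sum((theta - 1.0) ** 2)

def fitness_grad(theta):
    return -2.0 * (theta - 1.0)

def selection_step(theta, sigma=0.01, n_offspring=500):
    # generate small mutations, keep only those "proven" better by a
    # fitness comparison, and move to their average
    mutants = theta + sigma * rng.standard_normal((n_offspring, theta.size))
    keep = np.array([fitness(m) > fitness(theta) for m in mutants])
    return mutants[keep].mean(axis=0) if keep.any() else theta

theta = np.zeros(3)
selection_direction = selection_step(theta) - theta
gradient_direction = fitness_grad(theta)

# for small mutation scale the two directions should be nearly parallel
cos = (selection_direction @ gradient_direction) / (
    np.linalg.norm(selection_direction) * np.linalg.norm(gradient_direction)
)
print(f"cosine similarity between selection step and gradient: {cos:.3f}")
```

With the mutation scale small and enough offspring, the printed cosine similarity comes out close to 1 - that's the toy version of the claim.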
This post is thoroughly excellent, a good summary and an important service!
However, the big caveat here is that evolution does not implement Stochastic Gradient Descent.
I came here to say that in fact they are quite analogous after all.
Another aesthetic similarity my brain noted is between your concept of 'information loss' on inputs (for layers-which-discriminate vs. layers-which-don't) and the concept of sufficient statistics.
A sufficient statistic is one for which the posterior over y is independent of the data x, given the statistic ϕ(x):
$$P(y \mid x = x_0) = P(y \mid \phi(x) = \phi(x_0))$$
which has the same flavour as
$$f(x, \theta_a, \theta_b) = g(a(x), \theta_b)$$
In the respective cases, ϕ and a are 'sufficient', and each induces an equivalence class over the xs.
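As a concrete toy instance of that flavour (my own example, not from the post): for Bernoulli data with a Beta prior, the posterior over the bias y depends on the data x only through the count ϕ(x) = ∑x, so any two datasets with the same count fall in the same equivalence class and give the same posterior.

```python
def posterior_params(x, alpha=1.0, beta=1.0):
    # Beta-Bernoulli model: the posterior over the coin bias y depends on
    # the dataset x only through the sufficient statistic phi(x) = sum(x)
    phi = sum(x)
    return alpha + phi, beta + len(x) - phi

# two different datasets in the same equivalence class (same phi)...
x1 = [1, 1, 0, 0, 1]
x2 = [0, 1, 1, 1, 0]

# ...give identical posteriors: P(y | x = x1) == P(y | phi(x) = phi(x1))
print(posterior_params(x1))  # (4.0, 3.0)
print(posterior_params(x2))  # (4.0, 3.0)
```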