Vivek Hebbar

From this paper, "Theoretical work limited to ReLU-type activation functions, showed that in overparameterized networks, all global minima lie in a connected manifold (Freeman & Bruna, 2016; Nguyen, 2019)"

So for overparameterized nets, the answer is probably:

- There is only one solution manifold, so there *are no separate basins*. Every solution is connected.
- We can salvage the idea of "basin volume" as follows:
  - In the dimensions perpendicular to the manifold, calculate the basin cross-section using the Hessian.
  - In the dimensions parallel to the manifold, ask "how far can I move before it stops being the 'same function'?". If we define "sameness" as "same behavior on the validation set",^{[1]} then this means looking at the Jacobian of that behavior in the plane of the manifold.
  - Multiply the two hypervolumes to get the hypervolume of our "basin segment" (very roughly, the region of the basin which drains to our specific model).
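As a rough illustration of the perpendicular step, here is a minimal sketch that estimates the log-volume of the basin cross-section from Hessian eigenvalues under a quadratic approximation of the loss. The function name `log_cross_section_volume`, the loss threshold `eps`, and the eigenvalue cutoff `tol` are hypothetical choices for this example, not something taken from the paper or from the steps above.

```python
import math
import numpy as np

def log_cross_section_volume(hessian, eps=1e-2, tol=1e-8):
    """Log-volume of {d : 0.5 * d^T H d <= eps}, restricted to curved directions.

    Directions with eigenvalue <= tol are treated as lying along the
    solution manifold and are excluded (an assumption of this sketch).
    """
    eigvals = np.linalg.eigvalsh(hessian)      # eigenvalues of a symmetric matrix
    curved = eigvals[eigvals > tol]            # directions perpendicular to the manifold
    k = curved.size
    # Log-volume of the unit k-ball: pi^(k/2) / Gamma(k/2 + 1).
    log_unit_ball = 0.5 * k * math.log(math.pi) - math.lgamma(0.5 * k + 1)
    # The semi-axis along eigen-direction i is sqrt(2 * eps / lambda_i).
    log_axes = 0.5 * float(np.sum(np.log(2.0 * eps / curved)))
    return log_unit_ball + log_axes, k

H = np.diag([2.0, 2.0, 0.0])                   # toy Hessian with one flat direction
log_vol, n_curved = log_cross_section_volume(H, eps=1.0)
print(n_curved, math.exp(log_vol))
```

In this toy case the two curved directions each have semi-axis 1, so the cross-section is a unit disc. The flat direction is left out here and would be handled by the "parallel" step instead.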

^{[1]} There are other "sameness" measures which look at the *internals* of the model; I will be proposing one in an upcoming post.

The loss is defined over all dimensions of parameter space, so it is still a function of all 3 x's. You should think of it as . Its thickness in the direction is **infinite**, not zero.

Here's what a zero-determinant Hessian corresponds to:

The basin here is not lower dimensional; it is just infinite in some dimension. The simplest way to fix this is to replace the infinity with some large value. Luckily, there is a fairly principled way to do this:

- Regularization / weight decay provides actual curvature, which should be added to the loss; doing this is the same as adding the regularizer's Hessian (a multiple of the identity, for L2 weight decay) to the Hessian of the loss.
- The scale of the initialization distribution provides a natural scale for how much volume an infinite sweep should count as (very roughly, the volume only matters if it overlaps with the initialization distribution, and the distance of sweep for which this is true is on the order of σ, the standard deviation of the initialization).

So the is a fairly principled correction, and much better than just "throwing out" the other dimensions. "Throwing out" dimensions is unprincipled, dimensionally incorrect, numerically problematic, and should give worse results.
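A small numerical sketch of the regularization correction, assuming L2 weight decay of the form (lam/2)·||θ||², whose Hessian is lam·I. The toy Hessian, the value of `lam`, and the helper name `log_det` are illustrative assumptions.

```python
import numpy as np

def log_det(hessian):
    """Log-determinant of a matrix with positive determinant."""
    sign, logabsdet = np.linalg.slogdet(hessian)
    if sign <= 0:
        raise ValueError("determinant is zero or negative")
    return logabsdet

H = np.diag([4.0, 1.0, 0.0])          # toy Hessian with one perfectly flat direction
lam = 1e-3                            # hypothetical weight-decay coefficient

H_reg = H + lam * np.eye(H.shape[0])  # Hessian of loss + (lam/2) * ||theta||^2

det_raw = np.linalg.det(H)            # zero: the basin is infinite in one direction
log_det_reg = log_det(H_reg)          # finite once the curvature lam is added
print(det_raw, log_det_reg)
```

Under a quadratic approximation the basin volume scales like det(H)^(-1/2), so after the correction the flat direction contributes a large but finite factor of order lam^(-1/2) rather than infinity.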

I will split this into a math reply, and a reply about the big picture / info loss interpretation.

Math reply:

Thanks for fleshing out the calculus rigorously; admittedly, I had not done this. Rather, I simply assumed MSE loss and proceeded largely through visual intuition.

I agree that assuming MSE, and looking at a local minimum, you have

This is still false! *Edit: I am now confused, I don't know if it is false or not.*

You are conflating and . Adding disambiguation, we have:

So we see that the second term disappears if . But the critical point condition is . From chain rule, we have:

So it is possible to have a local minimum where , if is in the left null-space of . There is a nice qualitative interpretation as well, but I don't have energy/time to explain it.

However, if we are at a perfect-behavior global minimum of a regression task, then is definitely zero.
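The argument above can be written out as follows. This is a sketch under assumed notation, which may differ from the post's: $\theta$ are the parameters, $f(\theta)$ is the behavior vector, $y$ is the vector of regression targets, $L$ is the loss as a function of behavior, and $G_{ij} = \partial f_j / \partial \theta_i$ (so $G^\top$ is the Jacobian).

```latex
\begin{align}
\nabla_\theta L &= G \, \nabla_f L
  && \text{(chain rule)} \\
\nabla^2_\theta L &= G \left( \nabla^2_f L \right) G^\top
  + \sum_j \frac{\partial L}{\partial f_j} \, \nabla^2_\theta f_j
  && \text{(Hessian of the loss)}
\end{align}
```

The second term of the Hessian vanishes when $\nabla_f L = 0$, but a critical point only requires $G \, \nabla_f L = 0$. That weaker condition also holds when $\nabla_f L \neq 0$ lies in the null space of $G$ (equivalently, the left null space of the Jacobian $G^\top$). For MSE at a perfect-behavior global minimum, $\nabla_f L \propto f - y = 0$, so the second term is zero there.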

A few points about rank equality *at a perfect-behavior global min*:

- holds as long as is a diagonal matrix. It need not be a multiple of the identity.
- Hence, rank equality holds anytime the loss is a sum of functions s.t. each function only looks at a single component of the behavior.
- If the network output is 1d (as assumed in the post), this just means that the loss is a sum over losses on individual inputs.
- We can extend to larger outputs by having the behavior be the flattened concatenation of outputs. The rank equality condition is still satisfied for MSE, Binary Cross Entropy, and Cross Entropy over a probability vector. It is *not* satisfied if we consider the behavior to be raw logits (before the softmax) and softmax+CrossEntropy as the loss function. But we can easily fix that by considering probability (after softmax) as behavior instead of raw logits.
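A quick numerical check of these points, as a sketch. `G` here is a random stand-in for the matrix of behavior gradients, with shape parameters × outputs (an assumed convention). Only the term `G @ H_f @ G.T` of the loss Hessian is computed, which assumes the gradient of the loss with respect to behavior vanishes. A diagonal, positive `H_f` preserves the rank of `G`, while the Hessian of softmax + cross-entropy with respect to raw logits, diag(p) − p pᵀ, is singular and loses one rank.

```python
import numpy as np

rng = np.random.default_rng(0)
n_params, n_out = 6, 3
G = rng.normal(size=(n_params, n_out))    # stand-in for d(behavior)/d(params)

def hessian_rank(G, H_f):
    """Rank of G @ H_f @ G.T, the loss Hessian when dL/df = 0."""
    return int(np.linalg.matrix_rank(G @ H_f @ G.T))

# Case 1: diagonal, positive curvature per behavior component
# (not a multiple of the identity).
H_diag = np.diag([2.0, 0.5, 1.0])

# Case 2: softmax + cross-entropy, with raw logits treated as the behavior.
logits = np.array([1.0, 0.0, -1.0])
p = np.exp(logits) / np.exp(logits).sum()
H_logits = np.diag(p) - np.outer(p, p)    # annihilates the all-ones vector

rank_G = int(np.linalg.matrix_rank(G))
rank_diag = hessian_rank(G, H_diag)
rank_logits = hessian_rank(G, H_logits)
print(rank_G, rank_diag, rank_logits)
```

The lost rank in the second case corresponds to adding a constant to every logit, which leaves the softmax output unchanged.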

Thanks for this reply, it's quite helpful.

I feel it ought to be pointed out that what is referred to here as the key result is a standard fact in differential geometry called (something like) the submersion theorem, which in turn is essentially an application of the implicit function theorem.

Ah nice, didn't know what it was called / what field it's from. I should clarify that "key result" here just meant "key result of the math so far -- pay attention", not "key result of the whole post" or "profound/original".

The Jacobian matrix is what you call I think

Yeah, you're right. Previously I thought was the Jacobian, because I had the Jacobian transposed in my head. I only realized that has a standard name fairly late (as I was writing the post I think), and decided to keep the non-standard notation since I was used to it, and just add a footnote.

Then, yes, you could get onto studying in more detail the degeneracy when the Jacobian does not have full rank.

Yes; this is the whole point of the post. The math is just a preliminary to get there.

But in my opinion I think you would need to be careful when you get to claim 3. I think the connection between loss and behavior is not spelled out in enough detail: Behaviour can change while loss could remain constant, right?

Good catch -- it is technically possible at a local minimum, although probably extremely rare. At a global minimum of a regression task it is not possible, since there is only one behavior vector corresponding to zero loss. Note that behavior in this post was defined specifically on the training set. At *global* minima, "Rank(Hessian(Loss))=Rank(G)" should be true without exception.

And more generally, in exactly which directions do the implications go?

In "Flat basin ≈ Low-rank Hessian = Low-rank G ≈ High manifold dimension":

The first "≈" is a correlation. The second "≈" is the implication "High manifold dimension => Low-rank G". (Based on what you pointed out, this only works at global minima).

when you say things like "Low rank *indicates* information loss"

"Indicates" here should be taken as slightly softened from "implies", like "strongly suggests but can't be proven to imply". Can you think of plausible mechanisms for causing low rank which don't involve information loss?

I'm pretty sure my framework doesn't apply to grokking. I usually think about training as ending once we hit zero training loss, whereas grokking happens much later.

I'll reply to the rest of your comment later today when I have some time

About the contours: While the graphic shows a finite number of contours with some spacing, in reality there are infinite contour planes and they completely fill space (as densely as the reals, if we ignore float precision). So at literally every point in space there is a blue contour, and a red one which exactly coincides with it.

Yup, seems correct.

Yeah, this seems roughly correct, and similar to what I was thinking. There is probably even a direct connection to the "asymptotic equipartitioning" math, via manifold counts containing terms like "A choose B" from permutations of neurons.

What *is* your list of problems by urgency, btw? Would be curious to know.