Review

Behaviour Manifolds and the Hessian of the Total Loss - Notes and Criticism

3Vladimir Nesov

2Spencer Becker-Kahn

1Vivek Hebbar

5Vivek Hebbar

3Johannes Treutlein


I worry that using as the space of behaviors misses something important about the intuitive idea of robustness, making any conclusions about or or behavior manifolds harder to apply. A more natural space (to illustrate my point, not as something helpful for this post) would be , with a metric that cares about how outputs differ on inputs that fall within a particular base distribution , something like

The issue with is that models in a behavior manifold only need to agree on the training inputs, and always include all models with arbitrarily crazy behaviors at all inputs outside the dataset, even if we are talking about inputs very close to those in the dataset (which is what above is supposed to prevent). So the behavior manifolds are more like cylinders than balls, ignoring crucial dimensions. Since generalization does work (so learning tends to find very unusual points of them), it's generally unclear how a behavior manifold as a whole is going to be relevant to what's actually going on.

I agree that the space may well miss important concepts and perspectives. As I say, it is not my suggestion to look at it, but rather just something that was implicitly being done in another post. The space may well be a more natural one. (It's of course the space of functions , and so a space in which 'model space' naturally sits in some sense. )

The perspective and the computations that are presented here (which in my opinion are representative of the mathematical parts of the linked posts and of various other unnamed posts) do not use any significant facts about neural networks or their architecture.

You're correct that the written portion of the Information Loss --> Basin flatness post doesn't use any non-trivial facts about NNs. The purpose of the written portion was to explain some mathematical groundwork, which is then used for the non-trivial claim. (I did not know at the time that there was a standard name "Submersion theorem". I had also made formal mistakes, which I am glad you pointed out in your comments. The essence was mostly valid though.) The non-trivial claim occurs in the video section of the post, where a sort of degeneracy occurring in ReLU MLPs is examined. I now no longer believe that the precise form of my claim is relevant to practical networks. An approximate form (where low rank is replaced with something similar to low determinant) seems salvageable, though still of dubious value, since I think I have better framings now.

Secondly, the use of the submersion theorem here only makes sense when $N>kD$.

Agreed. I was addressing the overparameterized case, not the underparameterized one. In hindsight, I should have mentioned this at the very beginning of the post -- my bad.

(Sorry for the very late response)

In this note I will discuss some computations and observations that I have seen in other posts about "basin broadness/flatness". I am mostly working off the content of the posts Information Loss --> Basin flatness and Basin broadness depends on the size and number of orthogonal features. I will attempt to give one rigorous and unified narrative for core mathematical parts of these posts and I will also attempt to explain my reservations about some aspects of these approaches. This post started out as a series of comments that I had already made on the posts, but I felt it may be worthwhile for me to spell out my position and give my own explanations.

Work completed while author was a SERI MATS scholar under the mentorship of Evan Hubinger.

## Basic Notation and Terminology

We will imagine fixing some model architecture and thinking about the loss landscape from a purely mathematical perspective. We will not concern ourselves with the realities of training.

Let $\Theta$ denote the *parameter space* of a deep neural network model $f$. This means that each element $\theta\in\Theta$ is a complete set of weights and biases for the model. And suppose that when a set of parameters $\theta\in\Theta$ is fixed, the network maps from an input space $\mathbb{R}^n$ to an output space $O$. When it matters below, we will take $O=\mathbb{R}^k$, but for now let us leave it abstract. So we have a function

$$f:\mathbb{R}^n\times\Theta\to O,$$

such that for any $\theta\in\Theta$, the function $f(\cdot,\theta):\mathbb{R}^n\to O$ is a fixed input-output function implemented by the network.

Let $\mathcal{D}=\{(x_d,y_d)\}_{d=1}^{D}\subset\mathbb{R}^n\times O$ be a dataset of $D$ training examples. We can then define a function $F:\Theta\to O^D$ by

$$F(\theta)=(f(x_1,\theta),\dots,f(x_D,\theta)).$$

This takes as input a set of parameters $\theta$ and returns the *behaviour* of $f(\cdot,\theta)$ on the training data.

We will think of the *loss function* as $l:O^D\to\mathbb{R}$.

**Example.** We could have $\mathcal{D}=\{(x_d,y_d)\}_{d=1}^{D}$, $O=\mathbb{R}^k$, and

$$l(o_1,\dots,o_D)=\frac{1}{2}\sum_{d=1}^{D}\lVert o_d-y_d\rVert^2. \tag{*}$$

We also then define what we will call the *total loss* $L:\Theta\to\mathbb{R}$ by

$$L(\theta)=l(F(\theta))=l(f(x_1,\theta),\dots,f(x_D,\theta)).$$

This is just the usual thing: the total loss over the training data set for a given set of weights and biases. So the graph of $L$ is what one might call the 'loss landscape'.
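As a concrete illustration of this notation, here is a minimal sketch that builds $F$, $l$ and $L$ for a toy model. The model `f`, the dataset values and all function names are hypothetical choices made purely for illustration (with $n=1$, $k=1$, $N=2$, $D=3$); nothing in it is specific to neural networks.

```python
import math

# Hypothetical toy "network": n = 1 input, k = 1 output, N = 2 parameters.
def f(x, theta):
    """Input-output function f(., theta) for fixed parameters theta."""
    w, b = theta
    return math.tanh(w * x + b)

# Hypothetical dataset of D = 3 training examples (x_d, y_d).
dataset = [(-1.0, -0.5), (0.0, 0.0), (2.0, 0.9)]

def F(theta):
    """Behaviour of the model on the training inputs: a point of O^D."""
    return tuple(f(x, theta) for x, _ in dataset)

def l(outputs):
    """Example loss (*): half the sum of squared errors over the dataset."""
    return 0.5 * sum((o - y) ** 2 for o, (_, y) in zip(outputs, dataset))

def L(theta):
    """Total loss L = l o F, a function on parameter space."""
    return l(F(theta))

theta = (0.7, 0.1)
print(F(theta))  # behaviour on the training data
print(L(theta))  # total loss at theta
```

Note that `l` only ever sees the outputs on the training inputs, so `L` depends on `theta` only through `F(theta)`.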

## Behaviour Manifolds

By a *behaviour manifold* (see [Hebbar]), we mean a set $\Sigma\subset\Theta$ of the form

$$\Sigma=F^{-1}((o_1,\dots,o_D))=\{\theta\in\Theta:F(\theta)=(o_1,\dots,o_D)\},$$

where $(o_1,\dots,o_D)\in O^D$ is a tuple of possible outputs. The idea here is that for a fixed behaviour manifold $\Sigma$, all of the models given by parameter sets $\theta\in\Sigma$ have identical behaviour on the training data.

Assume that $\Theta$ is an appropriately smooth $N$-dimensional space and let us now assume that $O=\mathbb{R}^k$.

Suppose that $N>kD$. In this case, at a point $\theta\in\Theta$ at which the Jacobian matrix $J_F(\theta)$ has full rank, the map $F$ is a submersion. The *submersion theorem* (which, in this context, is little more than the implicit function theorem) tells us that given $o\in O^D$, if $F$ is a submersion in a neighbourhood of a point $\theta\in F^{-1}(o)$, then $F^{-1}(o)$ is a smooth $(N-kD)$-dimensional submanifold in a neighbourhood of $\theta$. So we conclude that in a neighbourhood of a point in parameter space at which the Jacobian of $F$ has full rank, the behaviour manifold is an $(N-kD)$-dimensional smooth submanifold.

## Reservations

Firstly, I want to emphasize that when the Jacobian of F does not have full rank, it is generally difficult to make conclusions about the geometry of the level set, i.e. about the set that is called the behaviour manifold in this setting.

**Examples.** The following simple examples are to emphasize that there is *not a* straightforward intuitive relationship that says "when the Jacobian has less than full rank, there are fewer directions in parameter space along which the behaviour changes and therefore the behaviour manifold is bigger than $(N-kD)$-dimensional":

**Remark.** We note further, just for the sake of intuition about these kinds of issues, that the geometry of the level set of a smooth function can in general be very bad: *every* closed subset is the zero set of some smooth function, i.e. given *any* closed set $C\subset\mathbb{R}^n$, there exists a smooth function $g:\mathbb{R}^n\to\mathbb{R}$ with $C=\{x\in\mathbb{R}^n:g(x)=0\}$. Knowing that a level set is closed is an extremely basic fact and yet without using specific information about the function you are looking at, you cannot conclude *anything* else.

Secondly, the use of the submersion theorem here only makes sense when $N>kD$. But this is not even commonly the case. It is common to have many more data points (the $D$) than parameters (the $N$), ultimately meaning that the dimension of $O^D$ is much, much larger than the dimension of the domain of $F$. This suggests a slightly different perspective, which I briefly outline in the next section.
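Returning to the first reservation, the point can be seen in very small cases. The three functions below are toy choices assumed here purely for illustration, each with $N=2$ and $kD=1$, so that the submersion theorem would give a $1$-dimensional level set wherever the Jacobian has full rank. In each case the Jacobian vanishes at the origin, and the level set through the origin behaves differently:

$$
\begin{aligned}
F_1(\theta_1,\theta_2)&=\theta_1^2+\theta_2^2, & F_1^{-1}(0)&=\{(0,0)\} && \text{(a single point, dimension } 0\text{)},\\
F_2(\theta_1,\theta_2)&=\theta_1\theta_2, & F_2^{-1}(0)&=\{\theta_1=0\}\cup\{\theta_2=0\} && \text{(two crossing lines, not a manifold at the origin)},\\
F_3(\theta_1,\theta_2)&=\theta_1^2, & F_3^{-1}(0)&=\{\theta_1=0\} && \text{(a line, still dimension } 1\text{)}.
\end{aligned}
$$

So a drop in rank can coincide with a level set that is smaller than $(N-kD)$-dimensional, one that is not a manifold at all, or one that has exactly the dimension the full-rank case would predict.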

## Behavioural Space

When the codomain is a higher-dimensional space than the domain, we more commonly picture the *image* of a function, as opposed to the graph, e.g. if I say to consider a smooth function $g:\mathbb{R}\to\mathbb{R}^2$, one more naturally pictures the curve $g(\mathbb{R})$ in the plane, as a kind-of 'copy' of the line $\mathbb{R}$, as opposed to the graph of $g$. So if one were to try to continue along these lines, one might instead imagine the *image* $F(\Theta)$ of parameter space *in* the *behaviour space* $O^D$. We think of each point of $O^D$ as a complete specification of possible outputs on the dataset. Then the image $F(\Theta)\subset O^D$ is (loosely speaking) an $N$-dimensional submanifold of this space which we should think of as having large codimension. And each point $F(\theta)$ on this submanifold is the outputs of an actual model with parameters $\theta$. In this setting, the points $\theta\in\Theta$ at which the Jacobian $J_F(\theta)$ has full rank map to points $F(\theta)\in F(\Theta)$ which have neighbourhoods in which $F(\Theta)$ is smooth and embedded.

## The Hessian of the Total Loss

A computation of the Hessian of $L$ appears in both Information Loss --> Basin flatness and Basin broadness depends on the size and number of orthogonal features, under slightly different assumptions. Let us carefully go over that computation here, in a slightly greater level of generality. We continue with $O=\mathbb{R}^k$, in which case $O^D=\mathbb{R}^{k\times D}$. The function we are going to differentiate is:

$$L(\theta)=l(F(\theta))=l(f(x_1,\theta),\dots,f(x_D,\theta)).$$

And since each $f(x_d,\theta)\in\mathbb{R}^k$ for $d=1,\dots,D$, we should think of $F(\theta)$ as a $k\times D$ matrix, the general $(p,d)$th entry of which is $f_p(x_d,\theta)$.

We want to differentiate twice with respect to θ. Firstly, we have

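$$\frac{\partial}{\partial\theta_i}L(\theta)=\sum_{p=1}^{k}\sum_{d=1}^{D}\nabla_{(p,d)}l(F(\theta))\cdot\frac{\partial f_p(x_d,\theta)}{\partial\theta_i}$$

for $i=1,\dots,N$.

Then for $j=1,\dots,N$ we differentiate again:

$$\frac{\partial^2}{\partial\theta_j\partial\theta_i}L(\theta)=\sum_{p,q=1}^{k}\sum_{d,d'=1}^{D}\nabla_{(q,d')}\nabla_{(p,d)}l(F(\theta))\,\frac{\partial f_q(x_{d'},\theta)}{\partial\theta_j}\frac{\partial f_p(x_d,\theta)}{\partial\theta_i}+\sum_{p=1}^{k}\sum_{d=1}^{D}\nabla_{(p,d)}l(F(\theta))\,\frac{\partial^2 f_p(x_d,\theta)}{\partial\theta_j\partial\theta_i}. \tag{1}$$

This is now an equation of $(N\times N)$ matrices.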
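Equation (1) can be sanity-checked numerically. The sketch below is a minimal check under stated assumptions: a hypothetical two-parameter model $f(x,\theta)=\theta_0\tanh(\theta_1 x)$ (so $N=2$, $k=1$), a made-up dataset with $D=3$, and the example loss (*), for which $\nabla_{(p,d)}l=o_d-y_d$ and the Hessian of $l$ is the identity. It compares the right-hand side of (1) with a central finite-difference approximation of the Hessian of $L$. All function names are illustrative.

```python
import math

# Hypothetical toy model with N = 2 parameters and k = 1 output:
#   f(x, theta) = theta[0] * tanh(theta[1] * x)
def f(x, th):
    return th[0] * math.tanh(th[1] * x)

def grad_f(x, th):
    """First derivatives of f with respect to theta."""
    t = math.tanh(th[1] * x)
    s = 1.0 - t * t  # sech^2(theta_1 * x)
    return [t, th[0] * x * s]

def hess_f(x, th):
    """Second derivatives of f with respect to theta (2 x 2)."""
    t = math.tanh(th[1] * x)
    s = 1.0 - t * t
    return [[0.0, x * s],
            [x * s, -2.0 * th[0] * x * x * t * s]]

data = [(-1.0, 0.3), (0.5, -0.2), (2.0, 1.0)]  # hypothetical (x_d, y_d)

def L(th):
    """Total loss for the example loss (*)."""
    return 0.5 * sum((f(x, th) - y) ** 2 for x, y in data)

def hessian_via_formula(th):
    """Right-hand side of (1) when the Hessian of l is the identity."""
    H = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in data:
        g, h, r = grad_f(x, th), hess_f(x, th), f(x, th) - y
        for i in range(2):
            for j in range(2):
                # first sum in (1): g_i g_j; second sum: residual * d^2 f
                H[i][j] += g[i] * g[j] + r * h[i][j]
    return H

def hessian_via_differences(th, eps=1e-4):
    """Central finite-difference approximation of the Hessian of L."""
    def shifted(i, a, j, b):
        v = list(th)
        v[i] += a
        v[j] += b
        return L(v)
    return [[(shifted(i, eps, j, eps) - shifted(i, eps, j, -eps)
              - shifted(i, -eps, j, eps) + shifted(i, -eps, j, -eps))
             / (4 * eps * eps) for j in range(2)] for i in range(2)]

theta = [0.8, -0.4]
print(hessian_via_formula(theta))
print(hessian_via_differences(theta))
```

The two printed matrices should agree up to finite-difference error. The `r * h[i][j]` contribution is the second sum in (1), which is weighted by the first derivatives of $l$.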

## At A Local Minimum of The Loss Function

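If $\theta$ is such that $F(\theta)$ is a local minimum for $l$ (which means that the parameters are such that the output of the network on the training data is a local minimum for the loss function), then the second term on the right-hand side of (1) vanishes (because the term includes the first derivatives of $l$, which are zero at a minimum). Therefore: if $F(\theta^*)$ is a local minimum for $l$ we have:

$$\frac{\partial^2}{\partial\theta_j\partial\theta_i}L(\theta^*)=\sum_{p,q=1}^{k}\sum_{d,d'=1}^{D}\nabla_{(q,d')}\nabla_{(p,d)}l(F(\theta^*))\,\frac{\partial f_q(x_{d'},\theta^*)}{\partial\theta_j}\frac{\partial f_p(x_d,\theta^*)}{\partial\theta_i}.$$

If, in addition, the Hessian of $l$ is equal to the identity matrix (by which we mean $\nabla_{(q,d')}\nabla_{(p,d)}l=\delta_{pq}\delta_{dd'}$, as is the case for the example loss function given above in (*)), then we would have:

$$\frac{\partial^2}{\partial\theta_j\partial\theta_i}L(\theta^*)=\sum_{p=1}^{k}\sum_{d=1}^{D}\frac{\partial f_p(x_d,\theta^*)}{\partial\theta_j}\frac{\partial f_p(x_d,\theta^*)}{\partial\theta_i}=\sum_{d=1}^{D}\frac{\partial f(x_d,\theta^*)}{\partial\theta_i}\cdot\frac{\partial f(x_d,\theta^*)}{\partial\theta_j}. \tag{2}$$

## Reservations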

In Basin broadness depends on the size and number of orthogonal features, the expression on the right-hand side of equation (2) above is referred to as an inner product of "the features over the training data set". I do not understand the use of the word 'features' here and in the remainder of their post. The phrase seems to imply that a function of the form

$$x_d\longmapsto\frac{\partial f(x_d,\theta^*)}{\partial\theta_j},$$

defined on the inputs of the training dataset, is what constitutes a feature. No further explanation is really given. It's completely plausible that I have missed something (and perhaps other readers do not or will not share my confusion) but I would like to see an attempt at a clear and detailed explanation of exactly how this notion is supposed to be the same notion of feature that (say) Anthropic use in their interpretability work (as was claimed to me).

## Criticism

I'd like to tentatively try to give some higher-level criticism of these kinds of approaches. This is a tricky thing to do, I admit; it's generally very hard to say that a certain approach is unlikely to yield results, but I will at least try to explain where my skepticism is coming from.

The perspective and the computations that are presented here (which in my opinion are representative of the mathematical parts of the linked posts and of various other unnamed posts) do not use any significant facts about neural networks or their architecture. In particular, in the mathematical framework that is set up, the function f is more or less just any smooth function. And the methods used are just a few lines of calculus and linear algebra applied to abstract smooth functions. If these are the principal ingredients, then I am naturally led to expect that the conclusions will be relatively straightforward facts that will hold for more or less any smooth function f.

Such facts may be useful as part of bigger arguments - of course many arguments in mathematics do yield truly significant results using only 'low-level' methods - but in my experience one is extremely unlikely to end up with significant results in this way without it ultimately being clear after the fact where the hard work has happened or what the significant original insight was.

So, naively, my expectation at the moment is that in order to arrive at better results about this sort of thing, arguments that start like these ones do must quickly bring to bear substantial mathematical facts *about* the network, e.g. random initialization, gradient descent, the structure of the network's layers, activations etc. One has to actually *use* something.

I feel (again, speaking naively) that after achieving more success with a mathematical argument along these lines, one's hands would look dirtier. In particular, for what it's worth, I do not expect my suggestion to look at the image of the parameter space in 'behaviour space' to lead (by itself) to any further non-trivial progress. (And I say 'naively' in the preceding sentences here because I do not claim myself to have produced any significant results of the form I am discussing).