Behaviour Manifolds and the Hessian of the Total Loss - Notes and Criticism

carboniferous_umbraculum

In this note I will discuss some computations and observations that I have seen in other posts about "basin broadness/flatness". I am mostly working off the content of the posts Information Loss --> Basin flatness and Basin broadness depends on the size and number of orthogonal features. I will attempt to give one rigorous and unified narrative for core mathematical parts of these posts and I will also attempt to explain my reservations about some aspects of these approaches. This post started out as a series of comments that I had already made on the posts, but I felt it may be worthwhile for me to spell out my position and give my own explanations.

Work completed while author was a SERI MATS scholar under the mentorship of Evan Hubinger.

Basic Notation and Terminology

We will imagine fixing some model architecture and thinking about the loss landscape from a purely mathematical perspective. We will not concern ourselves with the realities of training.

Let $Θ$ denote the parameter space of a deep neural network model $f$ . This means that each element $θ \in Θ$ is a complete set of weights and biases for the model. And suppose that when a set of parameters $θ \in Θ$ is fixed, the network maps from an input space $R^{n}$ to an output space $O$ . When it matters below, we will take $O = R^{k}$ , but for now let us leave it abstract. So we have a function

f : R^{n} \times Θ \to O,

such that for any $θ \in Θ$ , the function $f (\cdot, θ) : R^{n} \to O$ is a fixed input-output function implemented by the network.

Let $\mathcal{D} = \{(x^{d}, y^{d})\}_{d = 1}^{D} \subset R^{n} \times O$ be a dataset of $D$ training examples. We can then define a function $F : Θ \to O^{D}$ by

F (θ) = (f (x^{1}, θ), \dots, f (x^{D}, θ)) .

This takes as input a set of parameters $θ$ and returns the behaviour of $f (\cdot, θ)$ on the training data.

We will think of the loss function as $l : O^{D} \to R$ .

Example. We could have $\mathcal{D} = \{(x^{d}, y^{d})\}_{d = 1}^{D}$, $O = R^{k}$, and

l (o^{1}, \dots, o^{D}) = \frac{1}{2} \sum_{d = 1}^{D} \| o^{d} - y^{d} \|^{2} . \qquad (*)

We also then define what we will call the total loss

L : Θ \to R

L (θ) = l (F (θ)) = l (f (x^{1}, θ), \dots, f (x^{D}, θ)) .

This is just the usual thing: The total loss over the training data set for a given set of weights and biases. So the graph of $L$ is what one might call the 'loss landscape'.
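To make the notation concrete, here is a minimal sketch in plain Python. The one-hidden-unit model `f`, the three-point dataset `DATA`, and every name below are illustrative assumptions of mine rather than anything taken from the posts under discussion. The loss is the squared-error example (*).

```python
import math

# Hypothetical toy model with N = 2 parameters and k = 1 output:
# f(x, theta) = theta[1] * tanh(theta[0] * x).
def f(x, theta):
    return theta[1] * math.tanh(theta[0] * x)

# Illustrative dataset of D = 3 pairs (x^d, y^d).
DATA = [(-1.0, -0.5), (0.5, 0.2), (2.0, 0.9)]

def F(theta):
    """Behaviour of the model on the training inputs: a point of O^D."""
    return [f(x, theta) for x, _ in DATA]

def l(outputs):
    """Squared-error loss (*); it sees only the outputs, not the parameters."""
    return 0.5 * sum((o - y) ** 2 for o, (_, y) in zip(outputs, DATA))

def L(theta):
    """Total loss: the composition l(F(theta))."""
    return l(F(theta))

theta = [1.0, 1.0]
print(F(theta))  # the behaviour, a list of D outputs
print(L(theta))  # the total loss at theta
```

The point of separating `l` from `L` is that `l` lives on behaviour space, while `L` lives on parameter space and depends on the parameters only through `F`.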

Behaviour Manifolds

By a behaviour manifold (see [Hebbar]), we mean a set $Σ \subset Θ$ of the form

Σ = F^{- 1} ((o_{1}, \dots, o_{D})) = \{θ \in Θ : F (θ) = (o_{1}, \dots, o_{D})\}

where $(o_{1}, \dots, o_{D}) \in O^{D}$ is a tuple of possible outputs. The idea here is that for a fixed behaviour manifold $Σ$ , all of the models given by parameter sets $θ \in Σ$ have identical behaviour on the training data.

Assume that $Θ$ is an appropriately smooth $N$ -dimensional space and let us now assume that $O = R^{k}$ .

Suppose that $N > k D$ . In this case, at a point $θ \in Θ$ at which the Jacobian matrix $J F (θ)$ has full rank, the map $F$ is a submersion. The submersion theorem (which - in this context - is little more than the implicit function theorem) tells us that given $o \in O^{D}$ , if $F$ is a submersion in a neighbourhood of a point $θ \in F^{- 1} (o)$ , then $F^{- 1} (o)$ is a smooth $(N - k D)$ -dimensional submanifold in a neighbourhood of $θ$ . So we conclude that in a neighbourhood of a point in parameter space at which the Jacobian of $F$ has full rank, the behaviour manifold is an $(N - k D)$ -dimensional smooth submanifold.
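As a sanity check on the dimension count, here is a small worked example. The linear model and the choice $N = 2$, $k = 1$, $D = 1$ are my own illustrative assumptions.

```latex
% Illustrative assumption: a linear model with N = 2, k = 1, D = 1.
f(x, θ) = θ_1 x + θ_2, \qquad F(θ) = θ_1 x^{1} + θ_2 .

% The Jacobian is a 1 x 2 matrix of rank 1 = kD at every θ:
J F(θ) = \begin{pmatrix} x^{1} & 1 \end{pmatrix} .

% So each behaviour manifold is a line, of dimension N - kD = 1:
F^{-1}(o) = \{ θ \in R^{2} : θ_2 = o - θ_1 x^{1} \} .
```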

Reservations

Firstly, I want to emphasize that when the Jacobian of $F$ does not have full rank, it is generally difficult to make conclusions about the geometry of the level set, i.e. about the set that is called the behaviour manifold in this setting.

Examples. The following simple examples are to emphasize that there is not a straightforward intuitive relationship that says "when the Jacobian has less than full rank, there are fewer directions in parameter space along which the behaviour changes and therefore the behaviour manifold is bigger than $(N - k D)$ -dimensional":

Consider $g : R^{2} \to R$ given by $g (x, y) = x^{2} + y^{2}$. We have $\nabla g = (2 x, 2 y)$. This has rank 1 everywhere except the origin: At the point $(0, 0)$ it has less than full rank. And the level set through that point is just a single point, i.e. it is 0-dimensional.
Consider $h : R^{2} \to R$ given by $h (x, y) = x^{2}$. We have $\nabla h = (2 x, 0)$. Again, this has less than full rank at the point $(0, 0)$. And the level set through that point is the entire $y$-axis, i.e. it is 1-dimensional.
Consider $j : R^{2} \to R$ given by $j (x, y) = 1$. We of course have $\nabla j = (0, 0)$. This has less than full rank everywhere, and the only non-empty level set is the whole of $R^{2}$, i.e. it is 2-dimensional.

Remark. We note further, just for the sake of intuition about these kinds of issues, that the geometry of the level set of a smooth function can in general be very bad: Every closed subset is the zero set of some smooth function, i.e. given any closed set $C \subset R^{n}$, there exists a smooth function $g : R^{n} \to R$ with $C = \{x \in R^{n} : g (x) = 0\}$. Knowing that a level set is closed is an extremely basic fact and yet without using specific information about the function you are looking at, you cannot conclude anything else.

Secondly, the use of the submersion theorem here only makes sense when $N > k D$ . But this is not even commonly the case. It is common to have many more data points (the $D$ ) than parameters (the $N$ ), ultimately meaning that the dimension of $O^{D}$ is much, much larger than the dimension of the domain of $F$ . This suggests a slightly different perspective, which I briefly outline next.

Behavioural Space

When the codomain is a higher-dimensional space than the domain, we more commonly picture the image of a function, as opposed to the graph, e.g. if I say to consider a smooth function $g : R \to R^{2}$, one more naturally pictures the curve $g (R)$ in the plane, as a kind of 'copy' of the line $R$, as opposed to the graph of $g$. So if one were to try to continue along these lines, one might instead imagine the image $F (Θ)$ of parameter space in the behaviour space $O^{D}$. We think of each point of $O^{D}$ as a complete specification of possible outputs on the dataset. Then the image $F (Θ) \subset O^{D}$ is (loosely speaking) an $N$-dimensional submanifold of this space which we should think of as having large codimension. And each point $F (θ)$ on this submanifold is the tuple of outputs of an actual model with parameters $θ$. In this setting, a point $θ \in Θ$ at which the Jacobian $J F (θ)$ has full rank has a neighbourhood whose image under $F$ is a smooth embedded $N$-dimensional submanifold of $O^{D}$.
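Two toy cases may help with the picture. Both models and the inputs $x^{1} = 1$, $x^{2} = 2$ are illustrative assumptions of mine, with $N = 1$, $k = 1$, $D = 2$, so that the image sits inside the plane.

```latex
% Model A (assumed for illustration): f(x, θ) = θ x.
F(θ) = (θ, 2θ), \qquad J F(θ) = \begin{pmatrix} 1 \\ 2 \end{pmatrix} .
% Full rank at every θ; F(Θ) is a line through the origin in R^2.

% Model B (assumed for illustration): f(x, θ) = θ^2 x.
F(θ) = (θ^{2}, 2θ^{2}), \qquad J F(θ) = \begin{pmatrix} 2θ \\ 4θ \end{pmatrix} .
% The Jacobian vanishes at θ = 0, and F(Θ) is the closed ray \{(t, 2t) : t \ge 0\},
% which fails to be a 1-dimensional manifold (without boundary) at its endpoint F(0).
```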

The Hessian of the Total Loss

A computation of the Hessian of $L$ appears in both Information Loss --> Basin flatness and Basin broadness depends on the size and number of orthogonal features, under slightly different assumptions. Let us carefully go over that computation here, in a slightly greater level of generality. We continue with $O = R^{k}$ , in which case $O^{D} = R^{k \times D}$ . The function we are going to differentiate is:

L (θ) = l (F (θ)) = l (f (x^{1}, θ), \dots, f (x^{D}, θ)) .

And since each $f (x^{d}, θ) \in R^{k}$ for $d = 1, \dots, D$ , we should think of $F (θ)$ as a $k \times D$ matrix, the general $(p, d)^{t h}$ entry of which is $f^{p} (x^{d}, θ)$ .

We want to differentiate twice with respect to $θ$ . Firstly, we have

\frac{\partial}{\partial θ_{i}} L (θ) = \sum_{p = 1}^{k} \sum_{d = 1}^{D} \nabla^{(p, d)} l (F (θ)) \cdot \frac{\partial f^{p} (x^{d}, θ)}{\partial θ_{i}}

for $i = 1, \dots, N$ .

Then for $j = 1, \dots, N$ we differentiate again:

\frac{\partial^{2}}{\partial θ_{j} \partial θ_{i}} L (θ) = \sum_{p, q = 1}^{k} \sum_{d, d^{'} = 1}^{D} \nabla^{(q, d^{'})} \nabla^{(p, d)} l (F (θ)) \frac{\partial f^{q} (x^{d^{'}}, θ)}{\partial θ_{j}} \frac{\partial f^{p} (x^{d}, θ)}{\partial θ_{i}}

+ \sum_{p = 1}^{k} \sum_{d = 1}^{D} \nabla^{(p, d)} l (F (θ)) \frac{\partial^{2} f^{p} (x^{d}, θ)}{\partial θ_{j} \partial θ_{i}} . \qquad (1)
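Equation (1) is easy to check numerically. The following is a rough sketch in plain Python, under assumptions of my own: a toy model $f (x, θ) = θ_{2} \tanh (θ_{1} x)$ with $k = 1$ and $N = 2$, a made-up three-point dataset, and the squared-error loss (*), for which the Hessian of $l$ is the identity and $\nabla^{(d)} l$ is the residual $o^{d} - y^{d}$. It compares the two-term formula against a central finite-difference Hessian of $L$.

```python
import math

DATA = [(-1.0, -0.5), (0.5, 0.2), (2.0, 0.9)]  # illustrative (x^d, y^d) pairs

def f(x, th):
    return th[1] * math.tanh(th[0] * x)

def grad_f(x, th):
    """First derivatives of f with respect to (theta_1, theta_2)."""
    t = math.tanh(th[0] * x)
    s = 1.0 - t * t  # derivative of tanh
    return [th[1] * x * s, t]

def hess_f(x, th):
    """Second derivatives of f with respect to (theta_1, theta_2)."""
    t = math.tanh(th[0] * x)
    s = 1.0 - t * t
    d_aa = -2.0 * th[1] * x * x * t * s
    d_ab = x * s
    return [[d_aa, d_ab], [d_ab, 0.0]]

def L(th):
    return 0.5 * sum((f(x, th) - y) ** 2 for x, y in DATA)

def hessian_formula(th):
    """Right-hand side of (1), specialised to the squared-error loss."""
    H = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in DATA:
        g, h2, r = grad_f(x, th), hess_f(x, th), f(x, th) - y
        for i in range(2):
            for j in range(2):
                # first term: product of first derivatives of f
                # second term: residual times second derivative of f
                H[i][j] += g[j] * g[i] + r * h2[i][j]
    return H

def hessian_fd(th, h=1e-4):
    """Central finite-difference approximation to the Hessian of L."""
    def shifted(i, si, j, sj):
        t = list(th)
        t[i] += si * h
        t[j] += sj * h
        return L(t)
    return [[(shifted(i, 1, j, 1) - shifted(i, 1, j, -1)
              - shifted(i, -1, j, 1) + shifted(i, -1, j, -1)) / (4 * h * h)
             for j in range(2)] for i in range(2)]

theta = [0.7, 1.3]
exact = hessian_formula(theta)
approx = hessian_fd(theta)
max_gap = max(abs(exact[i][j] - approx[i][j]) for i in range(2) for j in range(2))
print(max_gap)  # should be small, at the level of finite-difference error
```

Note that nothing here uses the fact that `f` is a neural network: swapping in any other smooth `f` with correct derivatives would pass the same check.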

This is now an equation of $(N \times N)$ matrices.

At A Local Minimum of The Loss Function

If $θ^{*}$ is such that $F (θ^{*})$ is a local minimum for $l$ (which means that the parameters are such that the output of the network on the training data is a local minimum for the loss function), then the second term on the right-hand side of (1) vanishes, because that term contains the first derivatives of $l$, which are zero at a minimum. Therefore, if $F (θ^{*})$ is a local minimum for $l$, we have:

\frac{\partial^{2}}{\partial θ_{j} \partial θ_{i}} L (θ^{*}) = \sum_{p, q = 1}^{k} \sum_{d, d^{'} = 1}^{D} \nabla^{(q, d^{'})} \nabla^{(p, d)} l (F (θ^{*})) \frac{\partial f^{q} (x^{d^{'}}, θ^{*})}{\partial θ_{j}} \frac{\partial f^{p} (x^{d}, θ^{*})}{\partial θ_{i}} .

If, in addition, the Hessian of $l$ is equal to the identity matrix (by which we mean $\nabla^{(q, d^{'})} \nabla^{(p, d)} l = δ_{p q} δ_{d d^{'}}$ - as is the case for the example loss function given above in (*)), then we would have:

\frac{\partial^{2}}{\partial θ_{j} \partial θ_{i}} L (θ^{*}) = \sum_{p = 1}^{k} \sum_{d = 1}^{D} \frac{\partial f^{p} (x^{d}, θ^{*})}{\partial θ_{j}} \frac{\partial f^{p} (x^{d}, θ^{*})}{\partial θ_{i}}

= \sum_{d = 1}^{D} \frac{\partial f (x^{d}, θ^{*})}{\partial θ_{i}} \cdot \frac{\partial f (x^{d}, θ^{*})}{\partial θ_{j}} . \qquad (2)
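Equation (2) can be checked in the same spirit. In this sketch (again with a toy model, inputs, and names that are my own assumptions) the labels are generated by the model itself at a chosen `THETA_STAR`, so that $F (θ^{*})$ attains the minimum of the squared-error loss (*) exactly. The finite-difference Hessian of $L$ at that point is then compared with the sum of products of first derivatives of $f$.

```python
import math

THETA_STAR = [0.7, 1.3]   # hypothetical parameters used to generate the labels
XS = [-1.0, 0.5, 2.0]     # illustrative training inputs

def f(x, th):
    return th[1] * math.tanh(th[0] * x)

# Labels are fit exactly at THETA_STAR, so F(theta*) minimises l.
YS = [f(x, THETA_STAR) for x in XS]

def L(th):
    return 0.5 * sum((f(x, th) - y) ** 2 for x, y in zip(XS, YS))

def grad_f(x, th):
    """First derivatives of f with respect to (theta_1, theta_2)."""
    t = math.tanh(th[0] * x)
    return [th[1] * x * (1.0 - t * t), t]

def gram(th):
    """Right-hand side of (2): sum over d of outer products of grad f."""
    G = [[0.0, 0.0], [0.0, 0.0]]
    for x in XS:
        g = grad_f(x, th)
        for i in range(2):
            for j in range(2):
                G[i][j] += g[i] * g[j]
    return G

def hessian_fd(th, h=1e-4):
    """Central finite-difference approximation to the Hessian of L."""
    def shifted(i, si, j, sj):
        t = list(th)
        t[i] += si * h
        t[j] += sj * h
        return L(t)
    return [[(shifted(i, 1, j, 1) - shifted(i, 1, j, -1)
              - shifted(i, -1, j, 1) + shifted(i, -1, j, -1)) / (4 * h * h)
             for j in range(2)] for i in range(2)]

G = gram(THETA_STAR)
H = hessian_fd(THETA_STAR)
max_gap = max(abs(G[i][j] - H[i][j]) for i in range(2) for j in range(2))
print(max_gap)  # should be small, at the level of finite-difference error
```

Away from `THETA_STAR` the residuals are non-zero, the second term of (1) returns, and the two matrices will in general no longer agree.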

Reservations

In Basin broadness depends on the size and number of orthogonal features, the expression on the right-hand side of equation (2) above is referred to as an inner product of "the features over the training data set". I do not understand the use of the word 'features' here and in the remainder of their post. The phrase seems to imply that a function of the form

x^{d} ⟼ \frac{\partial f (x^{d}, θ^{*})}{\partial θ_{j}},

defined on the inputs of the training dataset, is what constitutes a feature. No further explanation is really given. It's completely plausible that I have missed something (and perhaps other readers do not or will not share my confusion) but I would like to see an attempt at a clear and detailed explanation of exactly how this notion is supposed to be the same notion of feature that (say) Anthropic use in their interpretability work (as was claimed to me).

Criticism

I'd like to tentatively try to give some higher-level criticism of these kinds of approaches. This is a tricky thing to do, I admit; it's generally very hard to say that a certain approach is unlikely to yield results, but I will at least try to explain where my skepticism is coming from.

The perspective and the computations that are presented here (which in my opinion are representative of the mathematical parts of the linked posts and of various other unnamed posts) do not use any significant facts about neural networks or their architecture. In particular, in the mathematical framework that is set up, the function $f$ is more or less just any smooth function. And the methods used are just a few lines of calculus and linear algebra applied to abstract smooth functions. If these are the principal ingredients, then I am naturally led to expect that the conclusions will be relatively straightforward facts that will hold for more or less any smooth function $f$ .

Such facts may be useful as part of bigger arguments - of course many arguments in mathematics do yield truly significant results using only 'low-level' methods - but in my experience one is extremely unlikely to end up with significant results in this way without it ultimately being clear after the fact where the hard work has happened or what the significant original insight was.

So, naively, my expectation at the moment is that in order to arrive at better results about this sort of thing, arguments that start like these ones do must quickly bring to bear substantial mathematical facts about the network, e.g. random initialization, gradient descent, the structure of the network's layers, activations etc. One has to actually use something. I feel (again, speaking naively) that after achieving more success with a mathematical argument along these lines, one's hands would look dirtier. In particular, for what it's worth, I do not expect my suggestion to look at the image of the parameter space in 'behaviour space' to lead (by itself) to any further non-trivial progress. (And I say 'naively' in the preceding sentences here because I do not claim myself to have produced any significant results of the form I am discussing).
