Basin broadness depends on the size and number of orthogonal features

CallumMcDougall; Avery; Lucius Bushnaq

TL;DR

For neural networks trained to perfect loss, the broadness of optima in parameter space is given by the number and norm of independent orthogonal features the neural network has. The inner product that defines this "feature norm" and independence/orthogonality is the product of Hilbert space.

Introduction - why do we care about broadness?

Recently, there's been some discussion of what determines the broadness of optima of neural networks in parameter space.

People care about this because the size of the optimum basin may influence how easy it is for gradient descent (or other local optimisation algorithms which you might use to train a neural network) to find the corresponding solution, since it's probably easier to stumble across a broader optimum's attractor.

People also care about this because there's some hypotheses floating around saying that the broadness of optima is correlated with their generality somehow. The broader the basin of your solution is in parameter space, the more likely it seems to be that the network will successfully generalise to new data in deployment. So understanding what kinds of circuit configurations are broad seems like it might help us grasp better how AIs learn things, and how an AGI might really work internally.

Meaning, understanding what kind of solutions are broad is probably relevant to start developing general predictive theories of AI and AI training dynamics, which we'd like to have so we can answer questions like "What goals is my AGI going to develop, given that it has architecture $A$ , and I'm training it with loss function $L$ on data set $D$ ?"

Measuring broadness

Warning: calculus and linear algebra incoming!

Say we have a loss function $L (f, x, y)$ for a neural network $f (x, Θ)$ , with some data points $x$ and labels $y (x)$ , and we've trained the network to perfect loss, setting the parameters to some set $^Θ$ .

$\begin{matrix} L & = \sum x l (f (x, Θ), y (x)) \frac{d L (^Θ)}{d Θ} & = 0 \end{matrix}$ ^[1]

Now, lets try to look at how much the loss changes when we make small changes to the parameters $^Θ$ . To do this, we're going to perform a polynomial expansion of $L$ up to quadratic order. This isn't going to describe the function behaviour perfectly, but within a small neighbourhood of $^Θ$ , it'll be mostly accurate.

$\begin{matrix} L (Θ) - L (^Θ) & = \frac{d L (^Θ)}{d Θ} (Θ -^Θ) + (Θ -^Θ)^{T} \frac{d^{2} L (^Θ)}{d^{2} Θ} (Θ -^Θ) + O ((Θ -^Θ)^{3}) = (Θ -^Θ)^{T} \frac{d^{2} L (^Θ)}{d^{2} Θ} (Θ -^Θ) + O ((Θ -^Θ)^{3}) =: δ Θ^{T} \frac{d^{2} L^Θ)}{d^{2} Θ} δ Θ + O (δ Θ^{3}) \end{matrix}$

The first order term vanished because $\frac{d L (^Θ)}{d Θ} = 0$ (since we're at an optimum). $δ Θ := (Θ -^Θ)$ is a vector with $# params$ entries, specifying how much we're perturbing each parameter in the network from its value at the optimum $^Θ$ . $\frac{d^{2} L (^Θ)}{d^{2} Θ}$ is going to be a matrix with $(# params, # params)$ entries.

So far so good. Now, if we wanted to work out how broad the basin is by looking at this matrix, how would we do it?

The matrix is filled with second derivatives. Or in other words, curvatures.

If you only had one parameter, this matrix would be a single number (the second derivative), telling you the curvature of the parabola you can fit to the optimum. And the curvature of a parabola determines its broadness, the closer the curvature is to zero, the broader the parabola, and with it the optimum, is.

If you have two parameters, you can imagine the optimum as a little valley in the loss function landscape. The valley has two curvatures that characterise it now, but those don't need to line up with our coordinate axes $x$ and $y$ . You could, for example, have one "principal direction" of curvature lie in the $(1, 1)$ direction of parameter space, and one in the $(1, - 1)$ direction.

To find these principal directions and their curvatures, you perform an eigendecomposition of the matrix. The two eigenvalues of the matrix will be the curvatures, and the two eigenvectors the principal directions in parameter space characterised by these curvatures. Multiplying the parabola widths we get from the two, we obtain the basin volume.

We can do this for the $n$ -dimensional case as well. If some eigenvalues of the matrix are zero, that means that at least within a small neighbourhood of the optimum, the loss doesn't drop at all if you walk along the corresponding principal direction. So each such eigenvalue effectively increases the dimensionality of the optimum, from a point to a line, a line to an area, etc.

Now that's all fine and good, but it's not very interpretable. What makes the eigenvalues of this matrix be small or large for any given network? What determines how many of them are zero? What does all this have to do with interpretability? Let's find out!

Broadness and feature space

Let's explicitly work out what that matrix, which is also called the Hessian of the network, will look like, by calculating its entries in the $j$ -th row and $k$ -th column, meaning the derivative of the loss function with respect to the $j$ -th and $k$ -th parameters of the network:

$\begin{matrix} \frac{d^{2} L (^Θ)}{d Θ_{j} d Θ_{k}} & = \sum x \frac{l (f (x,^Θ), y (x))}{d Θ_{j} d Θ_{k}} = \sum x \frac{d}{d Θ_{k}} (\frac{d l (f (x,^Θ), y (x))}{d f} \frac{d f (x,^Θ)}{d Θ_{j}}) = \sum x (\frac{d^{2} l (f (x,^Θ), y (x))}{d f^{2}} \frac{d f (x,^Θ)}{d Θ_{j}} \frac{d f (x,^Θ)}{d Θ_{k}} + \frac{d l (f (x,^Θ), y (x))}{d f} \frac{d^{2} f (x,^Θ)}{d Θ_{j} d Θ_{k}}) = \frac{d^{2} l (f (x,^Θ), y (x))}{d f^{2}} \sum x \frac{d f (x,^Θ)}{d Θ_{j}} \frac{d f (x,^Θ)}{d Θ_{k}} \end{matrix}$

The second term vanishes again, because the loss function is required to be convex at optima, so $\frac{d l (f (x,^Θ), y (x))}{d f} = 0$ .^[2]

Look at this expression for a bit. If you've done some quantum mechanics before, this might remind you of something.

Let's go through a small example. If you have a network that takes in a single floating point number $x_{1}$ from the input $x$ , and maps it to an output using features $x_{1}, cos (x_{1})$ and a constant, as in $f (x) = Θ_{0} + Θ_{1} x_{1} + Θ_{2} cos (x_{1})$ , the gradient of the output would look like:

$\begin{matrix} \frac{d f (x,^Θ)}{d Θ} = & ⎡ ⎢ ⎣ \begin{matrix} 1 x_{1} cos (x_{1}) \end{matrix} ⎤ ⎥ ⎦ \end{matrix}$

So, each entry in the gradient corresponds to one feature of the network.

The Hessian matrix for this network would be

$\begin{matrix} \frac{d^{2} L (^Θ)}{d Θ_{j} d Θ_{k}} & = \frac{d^{2} l (f (x,^Θ), y (x))}{d f^{2}} & ⎡ ⎢ ⎢ ⎣ \begin{matrix} \sum_{x} 1 & \sum_{x} x_{1} & \sum_{x} cos (x_{1}) \sum_{x} x_{1} & \sum_{x} x_{1}^{2} & \sum_{x} x_{1} cos (x_{1}) \sum_{x} cos (x_{1}) & \sum_{x} cos (x_{1}) x_{1} & \sum_{x} cos (x_{1})^{2} \end{matrix} ⎤ ⎥ ⎥ ⎦ \end{matrix}$

These are the $L_{2}$ inner products of the features over the training data set. They tell you how close to orthogonal the features are with respect to each other, and how big they are.

The idea of the $L_{2}$ inner product / norm and Hilbert space is that we treat the space of functions as a vector space, which can have basis vectors. The size and relative angles of these vectors are given by the $L_{2}$ inner product: you multiply the functions in question, and sum the result over the function domain.

This gives us a well defined notion of what constitutes "orthogonal" features: Two features are orthogonal if their $L_{2}$ norm is zero. Intuitively, this means they're separate, non-interacting components of the network mapping.

This generalises to arbitrary numbers of features in arbitrary numbers of layers, though features in earlier layers enter multiplied with derivatives of activation functions, since we're characterising their effect on the final layer.

Each constant feature is treated as multiplying a constant of $1$ . The Hessian of the network at the optimum contains the $L_{2}$ norm of the products of all the features in the network with each other.

So, back to our second order approximation of the loss function. When we find the eigendecomposition of our Hessian matrix, that corresponds to shifting into a basis in parameter space where all off-diagonals of the Hessian are zero. So, the eigenvectors of the Hessian correspond to linear combinations of network features that are orthogonal (in terms of $L_{2}$ norm) to each other.

Remember, the bigger the eigenvalues, which are the $L_{2}$ norms of these feature combinations, the bigger the curvature of the loss function is in that direction of parameter space, and the narrower the basin will be.

Eigenvalues of zero mean the network actually had less independent, orthogonal features inside it than it had parameters, giving you directions that are perfectly flat.

So: the broadness of the basin is determined by the number and size of independent features in the network. At least within second order polynomial approximation range.

That sure sounds like it's pointing out the precise mathematical notion of network "complexity", "fine-tuning" or "generality" that's relevant for the basin broadness.

It also neatly quantifies and extends the formalism found by Vivek in this post. If a set of $n$ features all depend on different independent parts of the input, they are going to have an orthogonal basis of size $n$ . If a set of $n$ features depends on the same parts of the input, they are a lot more likely to span a set of dimension less than $n$ in Hilbert space, meaning more zero eigenvalues. So, information loss about the input tends to lead to basin broadness.

Some potential implications

If we leave second order approximation range, features in different layers of the network no longer interact linearly, so you can't just straightforwardly treat them as one big set of Hilbert space basis vectors.

But features in the same layer can still be treated like this, and orthogonalised by taking the vector of neuron activations in the layer times its transpose, summed over the training dataset, to form a matrix of Hilbert norms like the one we got from the Hessian. Then, finding the eigenbasis of that matrix corresponds to finding a basis of linear combinations of features in the layer that are mutually orthogonal. Once the bases of two adjacent layers are found, the weight matrix connecting the two layers can also be transformed into the new basis.

This might give us a new perspective of the network. A basis in which each element in a layer might stand for one “independent” feature in that layer, instead of a neuron. Cancellations between the signals coming from different nodes could no longer occur. The connection weights might tell you exactly how much the network “cares” about the signals coming from each element.

This stands in contrast to the usual method (where we treat the neurons as the fundamental unit of the network, and use measures from network theory to try and measure clustering among neuron groups). As we've pointed out before, how big the weights between neurons are doesn't really indicate how much signal is flowing through their connection, making this a non-optimal strategy for interpreting your network.

This makes us suspect that the orthogonal feature basis of the layers might be close to the right framing to use, if we want to quantify how many "independent" calculations a neural network performs, and how it uses these calculations to make new calculations. In other words, this this formalisation might provide the basis for the fundamentally motivated modularity measure we're after.

How might we test this hypothesis?

In our previous post, we discussed the problem of polysemanticity: when a single neuron seems to represent multiple different concepts which aren't naturally bucketed for humans. This problem suggests that individual neurons might not be the best unit of analysis, and we could do better to use features as the basic unit with which to formalise general theories of neural networks. A natural question follows - if we try out graphing the "feature flows" in neural networks with this, do polysemantic neurons mostly disappear and every orthogonalised feature means one, understandable thing?

We plan to find out soon, by experimenting with this change of basis and related ones in small neural networks. If you're interested in helping out with these experiments, or you have any other suggestions for useful experiments to run to test out these ideas, please let us know by commenting below, or sending a private LW message.

^{^}
Note that we've written this with a sum (and have continued this convention for the rest of the post), but the formalisation would work equally well with an integral. Additionally, for the rest of the post we're assuming $f$ is a scalar to make the math a little less complicated, but the formalism should hold for other output types too.
^{^}
How this works out when you're not at a perfect optimum and this term stays around is something I'm unsure of, and its the reason we proposed experiment 8 in the experiment list on LW.
^{^}
$\frac{d L}{d Θ}$ is zero because we required a perfect loss score, and loss functions are required to be convex.

[-]carboniferous_umbraculum4y20

I wrote out the Hessian computation in a comment to one of Vivek's posts. I actually had a few concerns with his version and I could be wrong but I also think that there are some issues here. (My notation is slightly different because for me the sum over was included in the function I called " $l$ ", but it doesn't affect my main point).

I think the most concrete thing is that the function $f$ - i.e. the `input-output' function of a neural network - should in general have a vector output, but you write things like

\frac{d^{2} l}{d f^{2}}

without any further explanation or indices. In your main computation it seems like it's being treated as a scalar.

Since we never change the labels or the dataset, on can drop the explicit dependence on $y (x)$ from our notation for $l$ . Then if the network has $p$ neurons in the final layer, the codomain of the function $f$ is $R^{p}$ (unless I've misunderstood what you are doing?). So to my mind we have:

l (f (x)) = l (f^{1} (x), f^{2} (x), \dots, f^{p} (x)) .

Going through the computation in full using the chain rule (and a local minimum of the loss function $l$ ) one would get something like:

H e s s (L) (θ) = \sum x J f (θ)^{T} [H e s s (l) (f (θ))] J f (θ)

Vivek wanted to suppose that $H e s s (l)$ were equal to the identity matrix, or a multiple thereof, which is the case for mean squared loss. But without such an assumption, I don't think that the term

\sum x J f (θ)^{T} J f (θ)

appears (this is the matrix you describe as containing "the $L_{2}$ inner products of the features over the training data set.")

Another (probably more important but higher-level) issue is basically: What is your definition of 'feature'? I could say: Have you not essentially just defined `feature' to be something like `an entry of $J f (θ)$ '? Is the example not too contrived in that sense it clearly supposes that $f$ has a very special form (in particular it is linear in the $Θ$ variables so that the derivatives are not functions of $Θ$ .)

\frac{d^{2} l}{d f^{2}}

l (f (x)) = l (f^{1} (x), f^{2} (x), \dots, f^{p} (x)) .

H e s s (L) (θ) = \sum x J f (θ)^{T} [H e s s (l) (f (θ))] J f (θ)

\sum x J f (θ)^{T} J f (θ)

10