TLDR; This is the third main post of Distilling Singular Learning Theory, which is introduced in DSLT0. I explain that neural networks are singular models because of the symmetries in parameter space that produce the same function, and introduce a toy two layer ReLU neural network setup where these symmetries can be perfectly classified. I provide motivating examples of each kind of symmetry, with particular emphasis on the non-generic node-degeneracy and orientation-reversing symmetries that give rise to interesting phases to be studied in DSLT4.

As we discussed in DSLT2, singular models have the capacity to generalise well because the effective dimension of a singular model, as measured by the RLCT, can be less than half the dimension of parameter space. With this in mind, it should be no surprise that neural networks are indeed singular models, but up until this point we have not exactly explained what feature they possess that makes them singular. In this post, we will explain that in essence:

Neural networks are singular because there are often ways to vary their parameters without changing the function they compute.  

In the case where the model and truth are both defined by similar neural network architectures, this fact means that the set of true parameters $W_0$ is non-trivial (i.e. bigger than the regular case where it is a single point), and often possesses many symmetries. This directly implies that neural networks are singular models. 

The primary purpose of this post is to show with examples why neural networks are singular, and to classify the set of true parameters $W_0$ in the case where the model and truth are simple two layer feedforward ReLU networks. In doing so, we will lay the groundwork for understanding the phases present in the setup so that we can then study relevant phase transitions in DSLT4. Feel free to jump ahead to the slightly more exciting DSLT4 Phase Transitions in Neural Networks and refer back to this post as needed. 

Outline of Classification

To understand the different regions that minimise the free energy (and thus, as we'll see in DSLT4, the phases), one first needs to understand the singularities in the set of optimal parameters of $K(w)$.

In the realisable regression case with a model neural network $f(x,w)$ and true neural network defined by $f(x,w^{(0)})$ for some $w^{(0)} \in W$, the set of true parameters has the form [1]

$$W_0 = \{ w \in W \;:\; f(x,w) = f(x,w^{(0)}) \text{ for all } x \}.$$
Thus, classifying the true parameters is a matter of establishing which parameters $w$ yield functional equivalence between the model $f(x,w)$ and the truth $f(x,w^{(0)})$. The property of being singular is specific to a model class, regardless of the underlying truth. But classifying $W_0$ in the realisable case is a convenient way of studying which symmetries of functional equivalence exist for a particular model class. 

Neural networks have been shown to satisfy a number of different symmetries of functional equivalence across a range of activation functions and architectures, which we will elaborate on throughout the post. Unsurprisingly, the nonlinearity of the activation function plays a central role in governing these symmetries. In general, then, deep neural networks are highly singular.

In this post we are going to explore a full characterisation of the symmetries of $W_0$ when the model is a two layer feedforward ReLU neural network with $d$ hidden nodes, and the truth is the same architecture but with $m \le d$ nodes. Though you would never use such a basic model in real deep learning, the simplicity of this class of network allows us to study $W_0$ with full precision. We will see that:

  • If the model and truth have the same number of nodes, $d = m$: There are three forms of symmetry of $W_0$:
    • Scaling symmetry of the incoming and outgoing weights to any node.
    • Permutation symmetry of the hidden nodes in a layer.
    • Orientation-reversing symmetry of the weights, which occurs only when some subset of weights sums to zero (i.e. the weights "annihilate" one another).
  • If the model has more nodes than the truth, $d > m$: Without loss of generality, the first $m$ nodes of the model must exhibit the same symmetries as in the first case. Then each excess node $i \in \{m+1, \dots, d\}$ is either (see the numerical sketch after this list)
    • Degenerate, meaning its total weight (gradient) is 0 (thus the node is always constant).
    • Or it has the same activation boundary as another node already in the model, such that the weights sum to the total gradient in that region [2].
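To make the excess-node cases concrete, here is a small NumPy sketch (the variable names are my own; the architecture itself is defined properly in the next section) verifying that appending a degenerate node, or splitting a node into two nodes sharing an activation boundary, does not change the function computed:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def f(x, W, b, q, c):
    """f(x, w) = sum_i q_i * ReLU(<w_i, x> + b_i) + c"""
    return float(q @ relu(W @ x + b) + c)

rng = np.random.default_rng(0)
# A "true" network with m = 2 nodes.
W, b, q, c = rng.normal(size=(2, 2)), rng.normal(size=2), rng.normal(size=2), 0.3

# Case 1: append a degenerate node (outgoing weight 0), giving a model with d = 3 nodes.
W1, b1, q1 = np.vstack([W, rng.normal(size=2)]), np.append(b, 1.0), np.append(q, 0.0)

# Case 2: split node 0 into two nodes with the same activation boundary,
# whose outgoing weights sum to the original q_0.
W2, b2 = np.vstack([W, W[0]]), np.append(b, b[0])
q2 = np.append(q, 0.75 * q[0])
q2[0] = 0.25 * q[0]

for _ in range(5):
    x = rng.normal(size=2)
    assert np.isclose(f(x, W, b, q, c), f(x, W1, b1, q1, c))
    assert np.isclose(f(x, W, b, q, c), f(x, W2, b2, q2, c))
print("both 3-node models compute the same function as the 2-node truth")
```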

In [Carroll, Chapter 4], I give rigorous proofs that in both cases, $W_0$ is classified by these symmetries, and these symmetries alone. The purpose of this post is not to repeat these proofs, but to provide the intuition for each of these symmetries. I have included a sketch of the full proof in the appendix of this post if you are more mathematically inclined.

Two layer Feedforward ReLU Neural Networks

Literature abounds on what neural networks are, so I will merely give the definition of the class we are going to study here and some related terminology for the discussion.

Defining the Networks and Terminology

Let $W$ be a compact parameter space. We will let $[d] = \{1, \dots, d\}$ denote the set of hidden nodes in the first layer of our network, and $\langle \cdot, \cdot \rangle$ denote the standard dot product between two vectors. Also recall that $\mathrm{ReLU}(x) = \max(0, x)$.

We let $f(x,w)$ denote a two layer feedforward ReLU neural network with two inputs $x = (x_1, x_2) \in \mathbb{R}^2$ and one output $y \in \mathbb{R}$, defined by a parameter $w \in W$. The function is given by

$$f(x, w) = \sum_{i=1}^{d} q_i \,\mathrm{ReLU}\big( \langle w_i, x \rangle + b_i \big) + c,$$

where for each $i \in [d]$:

  • the first layer weights are $w_i \in \mathbb{R}^2$ and the biases are $b_i \in \mathbb{R}$,
  • the second layer weights are $q_i \in \mathbb{R}$ and the bias is $c \in \mathbb{R}$.
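As a concrete reference, here is a minimal NumPy sketch of this class of network (the function and variable names are my own, chosen to mirror the notation above):

```python
import numpy as np

def relu(z):
    """ReLU(z) = max(0, z), applied elementwise."""
    return np.maximum(0.0, z)

def f(x, W, b, q, c):
    """
    Two layer feedforward ReLU network with two inputs and one output:
        f(x, w) = sum_i q_i * ReLU(<w_i, x> + b_i) + c
    W : (d, 2) array of first layer weights w_i
    b : (d,)   array of first layer biases b_i
    q : (d,)   array of second layer weights q_i
    c : scalar second layer bias
    """
    preactivations = W @ x + b        # <w_i, x> + b_i for each node i
    return float(q @ relu(preactivations) + c)

# Example with d = 2 hidden nodes (arbitrary illustrative values).
W = np.array([[1.0, 0.0],
              [0.0, 1.0]])
b = np.array([0.0, -1.0])
q = np.array([1.0, -2.0])
c = 0.5

print(f(np.array([1.0, 2.0]), W, b, q, c))
```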

These functions are simply piecewise affine functions (i.e. piecewise hyperplanes), and as such they have (relatively) easy topology to study. Before we give an example, we will briefly mention some key terminology.

Let $f(x,w)$ be defined by a fixed $w \in W$. We say a particular node $i \in [d]$ is degenerate in $w$ if either of its weights is zero, so $w_i = (0,0)$ or $q_i = 0$ [3].

We say a non-degenerate node $i$ is activated in some linear domain [4] $U \subseteq \mathbb{R}^2$ when the ReLU is non-zero for all $x \in U$, that is,

$$\langle w_i, x \rangle + b_i > 0 \quad \text{for all } x \in U.$$

The activation boundary associated to node $i$ is thus the line

$$H_i = \{ x \in \mathbb{R}^2 \;:\; \langle w_i, x \rangle + b_i = 0 \}.$$

One of the key accounting tools in the symmetry classification is identifying the foldsets of $f(x,w)$ (in the terminology of [PL19]), which are the regions where $f(x,w)$ is non-differentiable in $x$, and noticing that these equate to the union of the non-degenerate activation boundaries $H_i$. Two functionally equivalent networks must then have the same foldsets since they define the same function, allowing us to compare the lines defined by each $H_i$.
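As a small illustration of these definitions, the following sketch (again with my own function names, reusing the NumPy network above) checks which nodes are activated at a given input and reads off a node's activation boundary:

```python
import numpy as np

def active_nodes(x, W, b):
    """Indices i with <w_i, x> + b_i > 0, i.e. nodes activated at x."""
    return np.where(W @ x + b > 0)[0]

def activation_boundary(i, W, b):
    """Coefficients (w_i1, w_i2, b_i) of the line <w_i, x> + b_i = 0
    for a non-degenerate node i."""
    if np.allclose(W[i], 0.0):
        raise ValueError(f"node {i} is degenerate: w_{i} = 0")
    return (*W[i], b[i])

W = np.array([[1.0, 0.0],
              [0.0, 1.0]])
b = np.array([0.0, -1.0])

print(active_nodes(np.array([1.0, 2.0]), W, b))   # both nodes are active at (1, 2)
print(activation_boundary(0, W, b))               # node 0's boundary is the line x_1 = 0
```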

Example - Feedforward ReLU Neural Networks are Piecewise Hyperplanes

Example 3.1: Consider the following two layer feedforward ReLU neural network: 

 defined by biases  and , second layer weights , and first layer weights 

Its graphical structure and activation boundaries in the $(x_1, x_2)$ plane can be seen below: 

The data of $f(x,w)$ defined above.

Conceptually, it's helpful to notice that when anchored on its corresponding activation boundary, each weight vector $w_i$ "points" into its region of activation.

The Symmetries of Two Layer Feedforward ReLU Neural Networks

In this section I am going to provide some motivating examples of each kind of symmetry exhibited in two layer feedforward ReLU neural networks. To prove that this is the full set of symmetries in generality requires a bit more work, which we relegate to the appendix.

Scaling Inner and Outer Weights of a Node

The scaling symmetry of ReLU networks offers us our first window into why these models are singular. The key property is to notice that for any $\alpha > 0$, the ReLU satisfies a scale invariance [5]

$$\mathrm{ReLU}(\alpha x) = \alpha\, \mathrm{ReLU}(x).$$
Say we had the simplest model possible with just one node: 

$$f(x, w) = q\,\mathrm{ReLU}\big(\langle w_1, x \rangle + b_1\big) + c.$$

Then for any $\alpha > 0$ we could define an alternative parameter $w'$ with 

$$w_1' = \alpha w_1, \qquad b_1' = \alpha b_1, \qquad q' = \tfrac{1}{\alpha} q, \qquad c' = c,$$

which gives functional equivalence because 

$$f(x, w') = \tfrac{1}{\alpha} q\,\mathrm{ReLU}\big(\alpha \langle w_1, x \rangle + \alpha b_1\big) + c = q\,\mathrm{ReLU}\big(\langle w_1, x \rangle + b_1\big) + c = f(x, w).$$
For a model with $d$ hidden nodes, the same scaling symmetry applies to each individual node $i \in [d]$ with a set of scaling factors $\alpha_1, \dots, \alpha_d > 0$.
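Here is a quick numerical sanity check of the node-wise scaling symmetry, again as a NumPy sketch with my own variable names:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def f(x, W, b, q, c):
    """f(x, w) = sum_i q_i * ReLU(<w_i, x> + b_i) + c"""
    return float(q @ relu(W @ x + b) + c)

rng = np.random.default_rng(0)
d = 3
W, b, q, c = rng.normal(size=(d, 2)), rng.normal(size=d), rng.normal(size=d), 0.7

# Rescale node i by alpha_i > 0: (w_i, b_i) -> (alpha_i w_i, alpha_i b_i), q_i -> q_i / alpha_i.
alpha = rng.uniform(0.5, 2.0, size=d)
W2, b2, q2 = alpha[:, None] * W, alpha * b, q / alpha

for _ in range(5):
    x = rng.normal(size=2)
    assert np.isclose(f(x, W, b, q, c), f(x, W2, b2, q2, c))
print("the rescaled network computes the same function on all sampled inputs")
```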

The fact that we can define such a $w'$ for any set of positive scalars means that the Fisher information matrix of these models is degenerate at all points $w \in W$. We prove this in generality in Appendix 1, but I'll spell it out explicitly for a simple example here. 

Example - Scaling Symmetry Induces a Degenerate Fisher Information Matrix

Example 3.2: It is worth taking a moment to recognise how this scaling symmetry affects the geometry of the loss landscape $K(w)$. The mental model to have here is that it results in valleys in $K(w)$, where the set of true parameters $W_0$ is like a river on the valley floor. To see this, say we defined a model with a two dimensional parameter $w$ and truth as: 

where  is some fixed constant. If  is uniform on  then it is easy to calculate that when  we have 

We can depict this valley and its effect on the posterior:

$K(w)$ is a valley
Fixing the constant, we see that $K(w)$ is a valley due to the scaling symmetry (left), thus there is no unique maximum a posteriori (right). Remember that, up to a scaling factor, $e^{-nK_n(w)}$ is the posterior when the prior $\varphi(w)$ is uniform, and $K_n(w) \approx K(w)$ for large $n$ since $K_n(w) \to K(w)$ as $n \to \infty$.

Looking at this $K(w)$, it's easy to intuit that the Fisher information matrix $I(w)$ is degenerate for all $w \in W$. But, for clarity, let me spell this out explicitly at the true parameters.

Remember that at true parameters the Fisher information matrix is just the Hessian, which in this case has the form

In particular, let  be a fixed true parameter parameterised by a fixed , so . Then the Fisher information matrix has the form

Setting  and  to be the rows of the matrix, there is clearly a linear dependence relation 

and since the choice of true parameter was arbitrary, this shows that all true parameters have degenerate Fisher information matrices and are thus singular.
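For a rough numerical companion to this calculation, the sketch below assumes a toy loss of the form $K(w_1, q_1) = (q_1 w_1 - c)^2$, which has the same valley structure as the example above (the exact constants of Example 3.2 may differ). Its Hessian at any point on the valley floor has linearly dependent rows and hence zero determinant:

```python
import numpy as np

# Assumed toy loss with a valley of true parameters:
# K(w1, q1) = (q1*w1 - c)^2, so the zero set {q1*w1 = c} is a curve, not a point.
c = 1.0

def hessian_K(w1, q1):
    """Exact Hessian of K(w1, q1) = (q1*w1 - c)^2 with respect to (w1, q1)."""
    return np.array([
        [2 * q1 ** 2,         4 * q1 * w1 - 2 * c],
        [4 * q1 * w1 - 2 * c, 2 * w1 ** 2],
    ])

for w1 in [0.5, 1.0, 2.0]:
    q1 = c / w1                              # a point on the valley floor
    H = hessian_K(w1, q1)
    # The rows R1, R2 satisfy w1*R1 - q1*R2 = 0, so the Hessian is degenerate.
    print(np.linalg.det(H), w1 * H[0] - q1 * H[1])
```

Running this prints a (numerically) zero determinant and a zero vector for every sampled point on the valley floor, mirroring the linear dependence of the rows claimed above.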

Permutation of Nodes

This one is easy to see. If we have a model with $d$ nodes,