TLDR; This is the third main post of Distilling Singular Learning Theory which is introduced in DSLT0. I explain that neural networks are singular models because of the symmetries in parameter space that produce the same function, and introduce a toy two layer ReLU neural network setup where these symmetries can be perfectly classified. I provide motivating examples of each kind of symmetry, with particular emphasis on the non-generic node-degeneracy and orientation-reversing symmetries that give rise to interesting phases to be studied in DSLT4.
As we discussed in DSLT2, singular models have the capacity to generalise well because the effective dimension of a singular model, as measured by the RLCT, can be less than half the dimension of parameter space. With this in mind, it should be no surprise that neural networks are indeed singular models, but up until this point we have not exactly explained what feature they possess that makes them singular. In this post, we will explain that in essence:
Neural networks are singular because there are often ways to vary their parameters without changing the function they compute.
In the case where the model and truth are both defined by similar neural network architectures, this fact means that the set of true parameters W0 is non-trivial (i.e. bigger than the regular case where it is a single point), and often possesses many symmetries. This directly implies that neural networks are singular models.
The primary purpose of this post is to show with examples why neural networks are singular, and classify the set of true parameters W0 in the case where the model and truth are simple two layer feedforward ReLU networks. In doing so, we will lay the groundwork for understanding the phases present in the setup so that we can then study relevant phase transitions in DSLT4. Feel free to jump ahead to the slightly more exciting DSLT4 Phase Transitions in Neural Networks and refer back to this post as needed.
Outline of Classification
To understand the different regions that minimise the free energy (and thus, as we'll see in DSLT4, the phases), one needs to first understand the singularities in the set of optimal parameters of K(w).
In the realisable regression case with a model neural network f(x,w) and true neural network defined by f0(x)=f(x,w(0)) for some w(0)∈W, the set of true parameters has the form [1]
$$W_0 = \{w \in W \mid f(x,w) = f_0(x)\}.$$
Thus, classifying the true parameters is a matter of establishing which parameters w∈W yield functional equivalence between the model and the truth, f(x,w)=f0(x). The property of being singular is specific to a model class f(x,w), regardless of the underlying truth, but classifying W0 in the realisable case is a convenient way of studying which symmetries of functional equivalence exist for a particular model class.
Neural networks have been shown to satisfy a number of different symmetries of functional equivalence across a range of activation functions and architectures, which we will elaborate on throughout the post. Unsurprisingly, the nonlinearity of the activation function plays a central role in governing these symmetries. In general, then, deep neural networks are highly singular.
In this post we are going to explore a full characterisation of the symmetries of W0 when the model is a two layer feedforward ReLU neural network with d hidden nodes, and the truth is the same architecture but with m≤d nodes. Though you would never use such a basic model in real deep learning, the simplicity of this class of network allows us to study W0 with full precision. We will see that:
If the model and truth have the same number of nodes, m=d: There are three forms of symmetry of W0:
Scaling symmetry of the incoming and outgoing weights to any node.
Permutation symmetry of the hidden nodes in a layer.
Orientation-reversing symmetry of the weights, which occurs only when some subset of weights sums to zero (i.e. the weights "annihilate" one another).
If the model has more nodes than the truth, m<d: Without loss of generality, the first m nodes of the model must have the same symmetries as in the first case. Then each excess node i∈{m+1,…,d} is either
Degenerate, meaning its total weight (gradient) is 0 (thus the node is always constant).
Or it shares its activation boundary with another node already in the model, such that their weights sum to the required total gradient in that region [2]. (A small numerical sketch of the excess-node case follows this list.)
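To make the m<d case concrete, here is a minimal numerical sketch (the particular weights and the helper name net are illustrative choices of mine, not from the post): a model with d=3 hidden nodes whose third node is degenerate (q3=0) computes exactly the same function as a truth with m=2 nodes, and so lies in W0 despite the extra node.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def net(x, W, b, q, c):
    # two layer ReLU network: c + sum_i q_i ReLU(<w_i, x> + b_i)
    return c + np.sum(q * relu(W @ x + b))

rng = np.random.default_rng(0)

# "Truth" with m = 2 hidden nodes (weights chosen arbitrarily for illustration).
Wt, bt = np.array([[1.0, 0.5], [-0.3, 1.2]]), np.array([0.1, -0.4])
qt, ct = np.array([2.0, -1.0]), 0.5

# Model with d = 3 nodes: the first two copy the truth, the third is degenerate
# (q_3 = 0), so it contributes nothing to the output for any input.
Wm = np.vstack([Wt, rng.normal(size=2)])
bm = np.append(bt, rng.normal())
qm, cm = np.append(qt, 0.0), ct

xs = rng.normal(size=(1000, 2))
print(max(abs(net(x, Wm, bm, qm, cm) - net(x, Wt, bt, qt, ct)) for x in xs))  # 0.0
```

The other excess-node possibility works similarly: placing the extra node's activation boundary on top of an existing one and splitting the outgoing weight between the two nodes also leaves the function unchanged.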
In [Carroll, Chapter 4], I give rigorous proofs that in both cases, W0 is classified by these symmetries, and these symmetries alone. The purpose of this post is not to repeat these proofs, but to provide the intuition for each of these symmetries. I have included a sketch of the full proof in the appendix of this post if you are more mathematically inclined.
Two layer Feedforward ReLU Neural Networks
Literature abounds on what neural networks are, so I will merely give the definition of the class we are going to study here and some related terminology for the discussion.
Defining the Networks and Terminology
Let W⊆R^{4d+1} be a compact parameter space. We will let [d]={1,…,d} denote the set of hidden nodes in the first layer of our network, and ⟨wi,x⟩ denote the standard dot product between two vectors. Also recall that
$$\mathrm{ReLU}(x) = \begin{cases} x & \text{if } x \geq 0 \\ 0 & \text{if } x < 0. \end{cases}$$
We let f:R^2×W→R denote a two layer feedforward ReLU neural network with two inputs x1,x2 and one output y, defined by a parameter w∈W. The function is given by
$$f(x,w) = c + \sum_{i=1}^{d} q_i \,\mathrm{ReLU}(\langle w_i, x\rangle + b_i)$$
where for each i∈[d]:
the first layer weights are wi∈R^2 and the biases are bi∈R,
the second layer weights are qi∈R and the bias is c∈R (a minimal code sketch of this model follows below).
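To make the definition concrete, here is a minimal numerical sketch of this model class in code (the array shapes and the helper name relu are my own choices, not notation from the post):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def f(x, W, b, q, c):
    """Two layer feedforward ReLU network: f(x,w) = c + sum_i q_i ReLU(<w_i, x> + b_i).

    x : input, shape (2,)
    W : first layer weights, shape (d, 2) -- row i is w_i
    b : first layer biases, shape (d,)
    q : second layer weights, shape (d,)
    c : second layer bias, scalar
    """
    return c + np.sum(q * relu(W @ x + b))

# The parameter w = (W, b, q, c) has 2d + d + d + 1 = 4d + 1 entries,
# matching the parameter space W as a subset of R^{4d+1} above.
```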
These functions are simply piecewise affine functions (i.e. piecewise hyperplanes), and as such they have (relatively) easy topology to study. Before we give an example, we will briefly mention some key terminology.
Let fw(x)=f(x,w) be defined by a fixed w∈W. We say a particular node i∈[d] is degenerate in fw if either of its weights is zero, that is, wi=0 or qi=0. [3]
We say a non-degenerate node i is activated in some linear domain [4] U⊆R^2 when the ReLU is non-zero for all x∈U, that is,
$$\langle w_i, x\rangle + b_i = w_{i,1}x_1 + w_{i,2}x_2 + b_i > 0.$$
The activation boundary associated to node i is thus the line
$$H_i = \{x \in \mathbb{R}^2 \mid \langle w_i, x\rangle + b_i = 0\}.$$
One of the key accounting tools in the symmetry classification is identifying the foldsets of fw (in the terminology of [PL19]), which are the regions where fw is non-differentiable in x, and noticing that these equate to the union of non-degenerate activation boundaries Hi. Two functionally equivalent networks must then have the same foldsets since they define the same function, allowing us to compare the lines defined by Hi.
Example - Feedforward ReLU Neural Networks are Piecewise Hyperplanes
Example 3.1: Consider the following two layer feedforward ReLU neural network:
$$f_w(x) = \mathrm{ReLU}(x_1 - 1) + \mathrm{ReLU}(x_2 - 1) + \mathrm{ReLU}(-x_1 - 1) + \mathrm{ReLU}(-x_2 - 1),$$
defined by biases bi=−1 and c=0, second layer weights qi=1, and first layer weights
$$w_1 = \begin{pmatrix}1\\0\end{pmatrix}, \quad w_2 = \begin{pmatrix}0\\1\end{pmatrix}, \quad w_3 = \begin{pmatrix}-1\\0\end{pmatrix}, \quad w_4 = \begin{pmatrix}0\\-1\end{pmatrix}.$$
Its graphical structure and activation boundaries in the (x1,x2) plane can be seen below:
The data of fw(x) above.
Conceptually, it's helpful to notice that when anchored on its corresponding activation boundary, each weight vector wi "points" into its region of activation.
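Here is a small sketch reproducing Example 3.1 numerically (the sample points below are arbitrary choices of mine): it evaluates fw at a few points and reports which nodes are active at each. Reading off the conditions ⟨wi,x⟩+bi>0 for these weights, the activation boundaries Hi are the four lines x1=1, x2=1, x1=−1 and x2=−1.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Example 3.1: d = 4 nodes, biases b_i = -1, c = 0, outer weights q_i = 1.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = -np.ones(4)
q = np.ones(4)
c = 0.0

def f(x):
    return c + np.sum(q * relu(W @ x + b))

def active_nodes(x):
    # Node i is activated at x when <w_i, x> + b_i > 0; its activation
    # boundary H_i is where this quantity is exactly zero.
    return np.where(W @ x + b > 0)[0] + 1  # 1-indexed node labels

print(f(np.array([0.0, 0.0])), active_nodes(np.array([0.0, 0.0])))  # 0.0, no nodes active
print(f(np.array([2.0, 0.0])), active_nodes(np.array([2.0, 0.0])))  # 1.0, node 1 active
print(f(np.array([2.0, 2.0])), active_nodes(np.array([2.0, 2.0])))  # 2.0, nodes 1 and 2 active
```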
The Symmetries of Two Layer Feedforward ReLU Neural Networks
In this section I am going to provide some motivating examples of each kind of symmetry exhibited in two layer feedforward ReLU neural networks. To prove that this is the full set of symmetries in generality requires a bit more work, which we relegate to the appendix.
Scaling Inner and Outer Weights of a Node
The scaling symmetry of ReLU networks offers us our first window into why these models are singular. The key property is to notice that for any α>0, the ReLU satisfies a scale invariance [5]
$$\frac{1}{\alpha}\mathrm{ReLU}(\alpha x) = \mathrm{ReLU}(x).$$
Say we had the simplest model possible with just one node:
$$f(x,w) = q_1\,\mathrm{ReLU}(\langle w_1, x\rangle + b_1) + c.$$
Then we could define an alternative parameter w′ with
$$q_1' = \frac{q_1}{\alpha}, \quad w_1' = \alpha w_1, \quad b_1' = \alpha b_1, \quad c' = c,$$
which gives functional equivalence because
$$\begin{aligned} f(x,w') &= q_1'\,\mathrm{ReLU}(\langle w_1', x\rangle + b_1') + c' \\ &= \tfrac{q_1}{\alpha}\,\mathrm{ReLU}(\langle \alpha w_1, x\rangle + \alpha b_1) + c \\ &= \tfrac{q_1}{\alpha}\,\mathrm{ReLU}(\alpha(\langle w_1, x\rangle + b_1)) + c \\ &= q_1\,\mathrm{ReLU}(\langle w_1, x\rangle + b_1) + c = f(x,w). \end{aligned}$$
For a model with d hidden nodes, the same scaling symmetry applies to each individual node i∈[d] with a set of scaling factors αi>0.
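A quick numerical check of this scaling symmetry (a sketch; the random weights, scaling factors αi, and sample points are arbitrary): rescaling every node as above leaves the function unchanged up to floating point error.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def f(x, W, b, q, c):
    return c + np.sum(q * relu(W @ x + b))

rng = np.random.default_rng(1)
d = 3
W, b, q, c = rng.normal(size=(d, 2)), rng.normal(size=d), rng.normal(size=d), rng.normal()

# Rescale node i by alpha_i > 0:  w_i -> a_i w_i,  b_i -> a_i b_i,  q_i -> q_i / a_i.
a = rng.uniform(0.1, 10.0, size=d)
W2, b2, q2, c2 = a[:, None] * W, a * b, q / a, c

xs = rng.normal(size=(1000, 2))
print(max(abs(f(x, W, b, q, c) - f(x, W2, b2, q2, c2)) for x in xs))
# prints something on the order of 1e-15: the two parameters define the same function
```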
The fact that we can define such a w′ for any set of positive scalars means that the Fisher information matrix of these models is degenerate at all points w∈W. We prove this in generality in Appendix 1, but I'll spell it out explicitly for a simple example here.
Example - Scaling Symmetry Induces a Degenerate Fisher Information Matrix
Example 3.2: It is worth taking a moment to recognise how this scaling symmetry affects the geometry of the loss landscape K(w). The mental model to have here is that it results in valleys in K(w), where the set of true parameters W0 is like a river on the valley floor. To see this, say we defined a model with parameter w=(w,q) and truth as:
$$f(x,w) = q\,\mathrm{ReLU}(wx), \qquad f_0(x) = \theta_0\,\mathrm{ReLU}(x),$$
where θ0>0 is some fixed constant. If the input distribution q(x) is uniform on [−√3,√3], then it is easy to calculate that when w,q≥0 we have
$$K(w) = (wq - \theta_0)^2, \quad\text{so}\quad W_0 = \{(w,q) \mid wq = \theta_0\}.$$
We can depict this valley and its effect on the posterior for θ0=15:
Setting θ0=15, we see that K(w) is a valley due to the scaling symmetry (left), thus there is no unique maximum a posteriori (right). Remember that, up to a scaling factor, e^{−nK_n(w)} is the posterior when the prior φ(w) is uniform, and e^{−nK_n(w)} ≈ e^{−nK(w)} for large n since E[K_n(w)]=K(w).
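For readers who want to reproduce something like the picture above, here is a minimal sketch (the grid ranges and the notional sample size n are my own choices): evaluating K(w,q)=(wq−θ0)² on a grid, every point of the hyperbola wq=θ0 attains the minimum, so e^{−nK} has a ridge of maxima rather than a single peak.

```python
import numpy as np

theta0 = 15.0    # true parameter value quoted in the text above
n = 100          # notional sample size for the e^{-nK} surface

# K(w, q) = (wq - theta0)^2 on a grid over the (w, q) plane.
w = np.linspace(0.1, 10.0, 400)
q = np.linspace(0.1, 10.0, 400)
Wg, Qg = np.meshgrid(w, q)
K = (Wg * Qg - theta0) ** 2
post = np.exp(-n * K)    # unnormalised posterior surface (uniform prior)

# Every point of the hyperbola wq = theta0 lies on the valley floor, so the
# posterior has a ridge of (near-)maximal values rather than a unique peak.
w_line = np.linspace(2.0, 8.0, 200)
q_line = theta0 / w_line
print(np.max((w_line * q_line - theta0) ** 2))   # ~0 along the whole curve
print(post.max())                                # close to 1 near the ridge
```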
Looking at this K(w), it's easy to intuit that the Fisher information matrix I(w) is degenerate for all w. But, for clarity, let me spell this out for the true parameters in the case where θ0=1, so K(w)=(wq−1)^2.
Remember that at true parameters the Fisher information matrix is just the Hessian, which in this case has the form
$$J(w) = \begin{pmatrix} 2q^2 & 4wq - 2 \\ 4wq - 2 & 2w^2 \end{pmatrix}.$$
In particular, let w^{(0)}∈W0 be a fixed true parameter parameterised by a fixed α>0, so w^{(0)}=(α,1/α). Then the Fisher information matrix has the form
$$I(w^{(0)}) = \begin{pmatrix} \frac{2}{\alpha^2} & 2 \\ 2 & 2\alpha^2 \end{pmatrix}.$$
Setting I1(w(0)) and I2(w(0)) to be the rows of the matrix, there is clearly a linear dependence relation
$$-\alpha^2 I_1(w^{(0)}) + I_2(w^{(0)}) = 0,$$
and since α is arbitrary, this shows that all true parameters have degenerate Fisher information matrices and are thus singular.
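The same computation can be verified symbolically (a sketch using sympy; the symbol names are mine): the Hessian of K(w,q)=(wq−1)² evaluated at (α,1/α) has determinant zero for every α>0, confirming the degeneracy along all of W0.

```python
import sympy as sp

w, q, a = sp.symbols('w q alpha', positive=True)

K = (w * q - 1) ** 2
H = sp.hessian(K, (w, q)).applyfunc(sp.expand)   # Hessian of K in (w, q)
print(H)                      # Matrix([[2*q**2, 4*q*w - 2], [4*q*w - 2, 2*w**2]])

# At a true parameter w^(0) = (alpha, 1/alpha), K = 0 and the Hessian
# coincides with the Fisher information matrix.
H0 = H.subs({w: a, q: 1 / a})
print(H0)                     # Matrix([[2/alpha**2, 2], [2, 2*alpha**2]])
print(sp.simplify(H0.det()))  # 0 -- degenerate for every alpha > 0
```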
Permutation of Nodes
This one is easy to see. If we have a model with d=2 nodes,