TLDR; This is the first post of Distilling Singular Learning Theory (DSLT), an introduction to which can be read at DSLT0. In this post I explain how singular models (like neural networks) differ from regular ones (like linear regression), give examples of singular loss landscapes, and then explain why the Real Log Canonical Threshold (aka the learning coefficient) is the correct measure of effective dimension in singular models.
When a model class is singular (like neural networks), the complexity of a parameter w in parameter space W⊂Rd needs a new interpretation. Instead of being defined by the total number of parameters d available to the model, the complexity (or effective dimensionality) of w is defined by a positive rational λ∈Q>0 called the Real Log Canonical Threshold (RLCT), also known as the learning coefficient. The geometry of the loss K(w) is fundamentally defined by the singularity structure of its minima, which λ measures. Moreover, in regular models like linear regression the RLCT is λ=d/2, but in singular models it satisfies λ≤d/2 in general. At its core, then, Sumio Watanabe's Singular Learning Theory (SLT) shows the following key insight:
The RLCT λ∈Q>0 is the correct measure of effective dimensionality of a model w∈W.
Watanabe shows that the RLCT λ has strong effects on the learning process: it is the correct generalisation of model complexity in the Bayesian Information Criterion for singular models, and therefore plays a central role in the asymptotic generalisation error, thereby inheriting the name "learning coefficient".
In this first post, after outlining the Bayesian setup of SLT, we will start by defining what a singular model is and explain what makes them fundamentally different to regular models. After examining different examples of singular K(w) loss landscapes, we will define the RLCT to be the scaling exponent of the volume integral of nearly true parameters, and conclude by summarising how this quantity correctly generalises dimensionality.
Preliminaries of SLT
The following section introduces some necessary technical terminology, so use it as a reference point, not necessarily something to cram into your head on a first read through. A more thorough setup can be found in [Car21, Chapter 2], which follows [Wat09] and [Wat18].
SLT is established in the Bayesian paradigm, where the Bayesian posterior on the parameter space W is the primary object of focus, containing information on which parameters w∈W correspond to "good" models.
Our statistical learning setup consists of the following data:
A dataset Dn={(X1,Y1),…,(Xn,Yn)}, where for i=1,…,n each Xi∈RN is an input and Yi∈RM is an output (so we are in the supervised learning setting).
We suppose the sequence in Dn is independent and identically distributed according to a true distribution q(y,x)=q(y|x)q(x). For our purposes, we assume the true distribution of inputs q(x) to be known, but the true distribution of outputs q(y|x) to be unknown.
We then choose a model class p(y|x,w) defined by parameters w in a compact parameter space W⊆Rd that contains the origin. We hope to find model parameters w that will adequately approximate the truth, or in other words, learn how to accurately predict an output given an input. For example, a model class could be a fixed neural network architecture with Gaussian noise, as below.
We can select a prior distribution φ(w) of our choosing[1] that is non-zero on W, so φ(w)>0.
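To make these ingredients concrete, here is a minimal sketch in Python/NumPy of one possible toy instantiation. The choice of model f(x,w)=tanh(wx), the uniform prior, and all variable names are illustrative assumptions of mine, not anything prescribed by SLT.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-parameter "network": f(x, w) = tanh(w * x), so N = M = 1.
def f(x, w):
    return np.tanh(w * x)

# True distribution q(y, x) = q(y|x) q(x): here q(x) is standard normal and
# q(y|x) is Gaussian around f(x, w0) for a "true" parameter w0 (realisable case).
w0 = 0.5
n = 1000
X = rng.normal(size=n)                 # X_i ~ q(x), assumed known
Y = f(X, w0) + rng.normal(size=n)      # Y_i ~ q(y|x), unknown to the learner

# Model class p(y|x, w) = N(y; f(x, w), 1) over the compact parameter space
# W = [-5, 5], with a uniform (hence strictly positive) prior phi on W.
W_lo, W_hi = -5.0, 5.0

def phi(w):
    return np.where((w >= W_lo) & (w <= W_hi), 1.0 / (W_hi - W_lo), 0.0)
```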
Using this data, the error of the model w on the dataset Dn is defined by the empirical negative log likelihood (NLL), Ln(w),
$$L_n(w) = -\frac{1}{n}\sum_{i=1}^n \log p(y_i \mid x_i, w),$$
where $e^{-nL_n(w)} = \prod_{i=1}^n p(y_i \mid x_i, w) = p(D_n \mid w)$ is the model likelihood. [2][3]
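Continuing the toy instantiation sketched above (the same illustrative assumptions apply), the empirical NLL and the identity e^{-n L_n(w)} = p(D_n|w) can be written as below. Note that the raw likelihood underflows to zero for even moderate n, which is one practical reason to work with L_n(w) rather than p(D_n|w) directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, w0 = 1000, 0.5
f = lambda x, w: np.tanh(w * x)
X = rng.normal(size=n)
Y = f(X, w0) + rng.normal(size=n)

def log_p(y, x, w):
    """log p(y|x, w) for the unit-variance Gaussian-noise model (M = 1)."""
    return -0.5 * (y - f(x, w)) ** 2 - 0.5 * np.log(2 * np.pi)

def L_n(w):
    """Empirical NLL: L_n(w) = -(1/n) * sum_i log p(y_i|x_i, w)."""
    return -np.mean(log_p(Y, X, w))

w = 0.4
print(L_n(w))               # the per-sample error of the model w on D_n
print(np.exp(-n * L_n(w)))  # p(D_n|w) = exp(-n L_n(w)); underflows to 0.0 here
```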
This gives rise to the Bayesian posterior of w defined by [4]
$$p(w \mid D_n) := \frac{1}{Z_n}\,\varphi(w)\, e^{-nL_n(w)}$$
where the partition function (or in Bayesian terms the evidence) is given by
$$Z_n = \int_W \varphi(w)\, e^{-nL_n(w)}\, dw.$$
The partition function Zn measures posterior density, and thus contains a lot of macroscopic data about a system. Inspired by its role in physics, and for theoretical ease, we consider the free energy
$$F_n = -\log Z_n.$$
Performing asymptotic analysis on Zn (and therefore Fn) is the main task of SLT. The learning goal is to find small regions of parameter space with high posterior density, and therefore low free energy.
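For the 1-D toy model above, Z_n and F_n = -log Z_n can be approximated numerically. The sketch below is my own illustration (arbitrary grid, uniform prior); it pulls the minimum of n L_n(w) out of the exponential so the integral does not underflow, and as a by-product evaluates the posterior density on the grid.

```python
import numpy as np

rng = np.random.default_rng(0)
n, w0 = 100, 0.5
f = lambda x, w: np.tanh(w * x)
X = rng.normal(size=n)
Y = f(X, w0) + rng.normal(size=n)

def L_n(w):
    """Empirical NLL of the unit-variance Gaussian-noise model."""
    return np.mean(0.5 * (Y - f(X, w)) ** 2 + 0.5 * np.log(2 * np.pi))

# Grid over the compact parameter space W = [-5, 5] with a uniform prior phi.
ws = np.linspace(-5.0, 5.0, 2001)
dw = ws[1] - ws[0]
phi = np.full_like(ws, 1.0 / 10.0)
nL = n * np.array([L_n(w) for w in ws])

# Z_n is the integral over W of phi(w) * exp(-n L_n(w)).  For numerical stability,
# factor out c = min_w n L_n(w):  Z_n = exp(-c) * integral of phi(w) exp(-(n L_n(w) - c)).
c = nL.min()
Z_shifted = np.sum(phi * np.exp(-(nL - c))) * dw   # equals Z_n * exp(c)
F_n = c - np.log(Z_shifted)                        # free energy F_n = -log Z_n
posterior = phi * np.exp(-(nL - c)) / Z_shifted    # p(w|D_n) evaluated on the grid

print(F_n)
print(ws[np.argmax(posterior)])   # the posterior concentrates near w0 = 0.5
```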
Though we never have access to the truth in the learning procedure, for theoretical purposes we nonetheless define the empirical entropy of the true distribution
$$S_n := -\frac{1}{n}\sum_{i=1}^n \log q(y_i \mid x_i).$$
Even though this quantity is always inaccessible in real settings, there is almost sure convergence Sn→S as n→∞ to a constant S that doesn't depend on n, [5]
$$S = \mathbb{E}_X\big[-\log q(y \mid x)\big] = -\iint_{\mathbb{R}^{N+M}} q(y, x) \log q(y \mid x)\, dx\, dy.$$
To study the posterior, we define the Kullback-Leibler divergence K(w) between the truth and the model,
$$K(w) := \iint_{\mathbb{R}^{N+M}} q(y \mid x)\, q(x) \log \frac{q(y \mid x)}{p(y \mid x, w)}\, dx\, dy,$$
which is the infinite-data limit of its empirical counterpart,
$$K_n(w) := \frac{1}{n}\sum_{i=1}^n \log \frac{q(y_i \mid x_i)}{p(y_i \mid x_i, w)} = L_n(w) - S_n.$$
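In the toy model, where (unrealistically) we simulated the truth ourselves and can therefore evaluate q(y|x), both S_n and K_n(w) = L_n(w) - S_n are directly computable. A sketch under the same illustrative assumptions as before:

```python
import numpy as np

rng = np.random.default_rng(0)
n, w0 = 10_000, 0.5
f = lambda x, w: np.tanh(w * x)
X = rng.normal(size=n)
noise = rng.normal(size=n)
Y = f(X, w0) + noise                     # so q(y|x) = N(y; f(x, w0), 1)

def L_n(w):
    """Empirical NLL of the unit-variance Gaussian-noise model."""
    return np.mean(0.5 * (Y - f(X, w)) ** 2 + 0.5 * np.log(2 * np.pi))

# Empirical entropy of the truth, S_n = -(1/n) sum_i log q(y_i|x_i);
# accessible here only because we know the true parameter and the noise.
S_n = np.mean(0.5 * noise ** 2 + 0.5 * np.log(2 * np.pi))

K_n = lambda w: L_n(w) - S_n
print(K_n(w0))    # exactly 0: the true parameter has zero empirical KL divergence
print(K_n(1.0))   # positive away from the truth
```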
The KL divergence is usually thought of as a "loss metric"[6] between the truth and the model since
K(w)≥0 for all w∈W; and
K(w)=0 if and only if p(y|x,w)=q(y|x) for all x∈RN and all y∈RM.
As such, I will colloquially refer to K(w) as the loss landscape. [7] The current state of results in SLT requires K(w) to be an analytic function, but it seems likely that the results can be generalised to non-analytic settings with suitable hypotheses and constraints.
To analyse where the loss K(w) is minimised, we are then led to define the set of true parameters,
$$W_0 = \{w \in W \mid K(w) = 0\} = \{w \in W \mid p(y \mid x, w) = q(y \mid x)\}.$$
We say that the true distribution q(y|x) is realisable by the model class p(y|x,w) if W0 is non-empty, that is, if there exists some w(0)∈W such that q(y|x)=p(y|x,w(0)) for all x,y. Despite being unrealistic in real-world applications, this is nonetheless an important starting point for the theory, which we will generalise to the set of optimal parameters in DSLT2.
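As a small preview of why the geometry of W0 matters (a toy example of my own choosing, purely for illustration): if the model were f(x,w)=w1w2x with unit-variance Gaussian noise and the truth were realised at f≡0, then

$$K(w) = \frac{1}{2}(w_1 w_2)^2\, \mathbb{E}_{q(x)}[x^2] \qquad \Longrightarrow \qquad W_0 = \{w \in W \mid w_1 w_2 = 0\},$$

so the set of true parameters is the union of the two coordinate axes rather than a single point. Sets of true parameters with this kind of non-trivial geometry are exactly what the rest of this post is about.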
We are going to restrict our attention to a particular kind of model: neural networks with Gaussian noise. We will formally define a neural network f(x,w) in a following chapter of this sequence, but for now it suffices to say that it is a function f:RN×W→RM with N inputs and M outputs defined by some parameter w∈W. Then our model density is going to be given by
$$p(y \mid x, w) = \frac{1}{(2\pi)^{M/2}} \exp\left(-\frac{1}{2}\lVert y - f(x, w)\rVert^2\right).$$
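It is worth spelling out (a one-line computation from the density above) that with this Gaussian-noise model the negative log likelihood is just the squared-error loss up to an additive constant,

$$-\log p(y \mid x, w) = \frac{1}{2}\lVert y - f(x, w)\rVert^2 + \frac{M}{2}\log(2\pi), \qquad \text{so} \qquad L_n(w) = \frac{1}{2n}\sum_{i=1}^n \lVert y_i - f(x_i, w)\rVert^2 + \frac{M}{2}\log(2\pi),$$

and so minimising L_n(w) is the familiar mean-squared-error regression objective.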
From here on in, we will assume we are working with a (model, truth, prior) triple (p(y|x,w), q(y|x), φ(w)).