
TLDR; This is the first post of Distilling Singular Learning Theory (DSLT), an introduction to which can be read at DSLT0. In this post I explain how singular models (like neural networks) differ from regular ones (like linear regression), give examples of singular loss landscapes, and then explain why the Real Log Canonical Threshold (aka the learning coefficient) is the correct measure of effective dimension in singular models.

When a model class is singular (like neural networks), the complexity of a parameter $w \in W$ in the parameter space $W \subseteq \mathbb{R}^d$ needs a new interpretation. Instead of being defined by the total number of parameters $d$ available to the model, the complexity (or effective dimensionality) of $w$ is defined by a positive rational number $\lambda > 0$ called the Real Log Canonical Threshold (RLCT), also known as the learning coefficient. The geometry of the loss $K(w)$ is fundamentally defined by the singularity structure of its minima, which $\lambda$ measures. Moreover, in regular models like linear regression the RLCT is $\lambda = \frac{d}{2}$, but in singular models it satisfies $\lambda \leq \frac{d}{2}$ in general. At its core, then, Sumio Watanabe's Singular Learning Theory (SLT) shows the following key insight:

The RLCT $\lambda$ is the correct measure of effective dimensionality of a model $p(y \mid x, w)$.

Watanabe shows that the RLCT $\lambda$ has strong effects on the learning process: it is the correct generalisation of model complexity in the Bayesian Information Criterion for singular models, and therefore plays a central role in the asymptotic generalisation error, thereby inheriting the name "learning coefficient". 

In this first post, after outlining the Bayesian setup of SLT, we will start by defining what singular models are and explaining what makes them fundamentally different from regular models. After examining different examples of singular loss landscapes $K(w)$, we will define the RLCT to be the scaling exponent of the volume integral of nearly true parameters, and conclude by summarising how this quantity correctly generalises dimensionality.

Preliminaries of SLT

The following section introduces some necessary technical terminology, so use it as a reference point, not necessarily something to cram into your head on a first read through. A more thorough setup can be found in [Car21, Chapter 2], which follows [Wat09] and [Wat18]. 

SLT is established in the Bayesian paradigm, where the Bayesian posterior on the parameter space $W$ is the primary object of focus, containing information on which parameters $w \in W$ correspond to "good" models.

Our statistical learning setup consists of the following data:

  • A dataset $D_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where for $i = 1, \ldots, n$ each $x_i \in \mathbb{R}^N$ is an input and $y_i \in \mathbb{R}^M$ is an output (so we are in the supervised learning setting).
  • We suppose the sequence in $D_n$ is independent and identically distributed according to a true distribution $q(y \mid x)q(x)$. For our purposes, we assume the true distribution of inputs $q(x)$ to be known, but the true distribution of outputs $q(y \mid x)$ to be unknown.
  • We then choose a model class $\{p(y \mid x, w)\}_{w \in W}$ defined by parameters $w$ in a compact parameter space $W \subseteq \mathbb{R}^d$ that contains the origin. We hope to find model parameters $w$ that will adequately approximate the truth, or in other words, learn how to accurately predict an output given an input. For example, a model class could be a fixed neural network architecture with Gaussian noise, as below.
  • We can select a prior distribution $\varphi(w)$ of our choosing[1] that is non-zero on $W$, so $\varphi(w) > 0$ for all $w \in W$.

Using this data, the error of the model $p(y \mid x, w)$ on the dataset $D_n$ is defined by the empirical negative log likelihood (NLL), 

$$L_n(w) = -\frac{1}{n}\sum_{i=1}^n \log p(y_i \mid x_i, w),$$

where $p(y \mid x, w)$ is the model likelihood.  [2]  [3]
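To make this concrete, here is a minimal numerical sketch (not from the original post): a hypothetical one-parameter model $f(x, w) = \tanh(wx)$ with unit Gaussian noise, toy data drawn from a truth of the same form, and the empirical NLL evaluated at a couple of parameter values. All names and numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy supervised dataset: inputs from a known q(x) = Uniform[-2, 2], outputs from a
# truth q(y|x) = N(y | tanh(0.5 x), 1) that the learner is not told.
n = 200
x = rng.uniform(-2.0, 2.0, size=n)
y = np.tanh(0.5 * x) + rng.normal(size=n)

def log_likelihood(y, x, w):
    """log p(y | x, w) for the model p(y|x,w) = N(y | tanh(w x), 1)."""
    mu = np.tanh(w * x)
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y - mu) ** 2

def empirical_nll(w):
    """L_n(w) = -(1/n) * sum_i log p(y_i | x_i, w)."""
    return -np.mean(log_likelihood(y, x, w))

print(empirical_nll(0.5))  # near the truth: close to the minimum of L_n
print(empirical_nll(3.0))  # far from the truth: noticeably larger
```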

This gives rise to the Bayesian posterior of $w$ defined by[4]

$$p(w \mid D_n) = \frac{1}{Z_n}\varphi(w)e^{-nL_n(w)},$$

where the partition function (or, in Bayesian terms, the evidence) is given by 

$$Z_n = \int_W \varphi(w)e^{-nL_n(w)}\,dw.$$
The partition function $Z_n$ measures the total (unnormalised) posterior density over $W$, and thus contains a lot of macroscopic data about the system. Inspired by its role in physics, and for theoretical ease, we consider the free energy 

$$F_n = -\log Z_n.$$
Performing asymptotic analysis on $Z_n$ (and therefore $F_n$) is the main task of SLT. The learning goal is to find small regions of parameter space with high posterior density, and therefore low free energy.
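In one dimension these quantities can be computed directly. Here is a sketch, continuing the hypothetical toy model above, that approximates $Z_n$ with a Riemann sum over a compact parameter space with a uniform prior and reads off $F_n = -\log Z_n$; the range $[-4, 4]$ and grid size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical toy model as above: f(x, w) = tanh(w x) with unit Gaussian noise.
n = 200
x = rng.uniform(-2.0, 2.0, size=n)
y = np.tanh(0.5 * x) + rng.normal(size=n)

def empirical_nll(w):
    return np.mean(0.5 * np.log(2 * np.pi) + 0.5 * (y - np.tanh(w * x)) ** 2)

# Compact 1D parameter space W = [-4, 4] with a uniform prior phi(w) = 1/8.
ws = np.linspace(-4.0, 4.0, 2001)
dw = ws[1] - ws[0]
log_phi = np.log(1.0 / 8.0)
nll = np.array([empirical_nll(w) for w in ws])

# Z_n = integral over W of phi(w) * exp(-n L_n(w)) dw, approximated by a Riemann sum.
# Shift by the largest exponent so the exponentials don't underflow.
log_integrand = log_phi - n * nll
shift = log_integrand.max()
log_Zn = shift + np.log(np.sum(np.exp(log_integrand - shift)) * dw)

Fn = -log_Zn  # free energy F_n = -log Z_n
print(f"F_n = {Fn:.1f}")
```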

Though we never have access to the truth in the learning procedure, for theoretical purposes we nonetheless define the empirical entropy of the true distribution,

$$S_n = -\frac{1}{n}\sum_{i=1}^n \log q(y_i \mid x_i).$$
Even though this quantity is always inaccessible in real settings, there is almost sure convergence $S_n \to S$ as $n \to \infty$ to a constant $S$ that doesn't depend on $w$.[5]

To study the posterior, we define the Kullback-Leibler divergence $K(w)$ between the truth and the model, 

$$K(w) = \iint q(y \mid x)\, q(x) \log\frac{q(y \mid x)}{p(y \mid x, w)}\, dx\, dy,$$

which is the infinite-data limit of its empirical counterpart,

$$K_n(w) = \frac{1}{n}\sum_{i=1}^n \log\frac{q(y_i \mid x_i)}{p(y_i \mid x_i, w)} = L_n(w) - S_n.$$
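To see these objects side by side, here is a sketch continuing the same hypothetical toy model, comparing $K_n(w) = L_n(w) - S_n$ with a Monte Carlo estimate of the population $K(w)$ (all choices illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical toy setup as in the earlier sketches.
n = 200
x = rng.uniform(-2.0, 2.0, size=n)
y = np.tanh(0.5 * x) + rng.normal(size=n)

def log_q(y, x):     # true density  q(y|x)   = N(y | tanh(0.5 x), 1)
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y - np.tanh(0.5 * x)) ** 2

def log_p(y, x, w):  # model density p(y|x,w) = N(y | tanh(w x), 1)
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y - np.tanh(w * x)) ** 2

S_n = -np.mean(log_q(y, x))                       # empirical entropy of the truth

def L_n(w):                                       # empirical NLL
    return -np.mean(log_p(y, x, w))

def K_n(w):                                       # empirical KL divergence
    return np.mean(log_q(y, x) - log_p(y, x, w))  # equals L_n(w) - S_n

# Population K(w): for unit-variance Gaussians the inner KL is half the squared
# difference of means, so K(w) = E_x[(tanh(0.5 x) - tanh(w x))^2] / 2.
x_mc = rng.uniform(-2.0, 2.0, size=200_000)       # Monte Carlo draws from q(x)
def K(w):
    return np.mean((np.tanh(0.5 * x_mc) - np.tanh(w * x_mc)) ** 2) / 2

for w in (0.5, 1.0, 3.0):
    print(f"w={w}: K_n={K_n(w):+.4f}  L_n-S_n={L_n(w) - S_n:+.4f}  K={K(w):.4f}")
```

Note that the finite-sample $K_n(w)$ fluctuates and can dip slightly below zero near a true parameter, whereas the population $K(w)$ is always non-negative, as listed below.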
The KL divergence is usually thought of as a "loss metric"[6] between the truth and the model since 

  • $K(w) \geq 0$ for all $w \in W$, and; 
  • $K(w) = 0$ if and only if $p(y \mid x, w) = q(y \mid x)$ for all $x \in \mathbb{R}^N$ and all $y \in \mathbb{R}^M$.

As such, I will colloquially refer to $K(w)$ as the loss landscape. [7] The current state of results in SLT requires $K(w)$ to be an analytic function, but it seems likely that the results can be generalised to non-analytic settings with suitable hypotheses and constraints. 

To analyse where the loss $K(w)$ is minimised, we are then led to defining the set of true parameters,

$$W_0 = \{w \in W \mid K(w) = 0\}.$$
We say that the true distribution $q(y \mid x)$ is realisable by the model class $\{p(y \mid x, w)\}_{w \in W}$ if $W_0$ is non-empty, that is, there exists some $w_0 \in W$ such that $p(y \mid x, w_0) = q(y \mid x)$ for all $x$ and all $y$. Despite being unrealistic in real-world applications, this is nonetheless an important starting point for the theory, which will then generalise to the set of optimal parameters in DSLT2. 
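To preview what $W_0$ can look like for a singular model, here is a small hypothetical example (my own, not the post's): the model $f(x, (w_1, w_2)) = w_1 w_2 x$ with unit Gaussian noise and a truth whose regression function is identically zero. The truth is realisable, but $W_0$ is the union of the two coordinate axes rather than an isolated point, with a crossing at the origin.

```python
import numpy as np

# Hypothetical example: model f(x, (w1, w2)) = w1 * w2 * x with unit Gaussian noise,
# truth q(y|x) = N(y | 0, 1), inputs x ~ Uniform[-2, 2] so E[x^2] = 4/3.
# The KL divergence reduces to K(w) = 0.5 * (w1 * w2)^2 * E[x^2], hence
# W_0 = { (w1, w2) : w1 * w2 = 0 }, the union of the two coordinate axes.
E_x2 = 4.0 / 3.0

def K(w1, w2):
    return 0.5 * (w1 * w2) ** 2 * E_x2

# Evaluate K on a grid over W = [-2, 2]^2 and locate the true parameters.
w1, w2 = np.meshgrid(np.linspace(-2, 2, 401), np.linspace(-2, 2, 401))
in_W0 = K(w1, w2) < 1e-12
print("fraction of grid points with K(w) = 0:", in_W0.mean())  # a thin cross, not a point
```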

We are going to restrict our attention to a particular kind of model: neural networks with Gaussian noise. We will formally define a neural network $f(x, w)$ in a later post of this sequence, but for now it suffices to say that it is a function $f : \mathbb{R}^N \times W \to \mathbb{R}^M$ with $N$ inputs and $M$ outputs, defined by a parameter $w \in W$. Then our model density is given by 

$$p(y \mid x, w) = \frac{1}{(2\pi)^{M/2}}\exp\left(-\frac{1}{2}\lVert y - f(x, w)\rVert^2\right).$$
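In code, assuming the Gaussian-noise form just stated, the model log-density might look like the following minimal sketch (the one-unit "network" at the end is purely illustrative):

```python
import numpy as np

def model_log_density(y, x, f, w):
    """log p(y | x, w) for the Gaussian-noise model
    p(y|x,w) = (2 pi)^(-M/2) * exp(-||y - f(x, w)||^2 / 2),
    where f : R^N x W -> R^M is any network function."""
    y = np.atleast_1d(y)
    M = y.shape[-1]
    resid = y - f(x, w)
    return -0.5 * M * np.log(2 * np.pi) - 0.5 * np.sum(resid ** 2, axis=-1)

# Purely illustrative "network": a single tanh unit with N = M = 1.
f = lambda x, w: np.tanh(w * np.atleast_1d(x))
print(model_log_density(y=0.3, x=1.0, f=f, w=0.5))
```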
From here on, we will assume we are working with a (model, truth, prior) triple $\big(p(y \mid x, w),\; q(y \mid x),\; \varphi(w)\big)$.