TLDR; This is the second main post of Distilling Singular Learning Theory, which is introduced in DSLT0. I synthesise how Watanabe's free energy formula explains why neural networks have the capacity to generalise well, since different regions of the loss landscape have different accuracy-complexity tradeoffs. I also provide some simple intuitive examples that visually demonstrate why true parameters (i.e. optimally accurate parameters) are preferred according to the RLCT as n→∞, and why non-true parameters can still be preferred at finite n if they have lower RLCTs, due to the accuracy-complexity tradeoff. (The RLCT is introduced and explained in DSLT1.)
It is an amazing fact that deep neural networks seem to have an inductive bias towards "simple" models, suggesting that they obey a kind of Occam's Razor:
Plurality should not be posited without necessity.
or in modern parlance,
If two models of the world are similarly accurate, the simpler explanation should be preferred.
This allows them to achieve exceptionally low generalisation error despite the prediction of classical statistics that they should overfit the data.
This fact has come to be known as the generalisation problem and has been discussed at length in Zhang et al. 2017 (and a 2021 supplement), and in Bengio et al., amongst countless others.
Remarkably, Singular Learning Theory can help explain why neural networks, which are singular models, have the capacity to generalise so well.
The degeneracy of the Fisher information matrix is actually a feature of singular models, not a bug. This is because different regions of parameter space can have different complexities as measured by the RLCT λ, unlike regular models where the complexity is fixed to the total number of parameters in the model d. This is the implicit content of Watanabe's profound free energy formula, called the Widely Applicable Bayesian Information Criterion (WBIC), which quantifies a precise asymptotic tradeoff between inaccuracy and complexity,
$$\mathrm{WBIC} = \underbrace{nL_n(w^{(0)})}_{\text{inaccuracy}} + \underbrace{\lambda}_{\text{complexity}}\log n,$$
giving a mathematically rigorous realisation of Occam's Razor, since λ ≤ d/2 in singular models.
In this post we will explore Watanabe's free energy formula and provide an intuitive example of why the RLCT matters so much. If you are new to statistical learning theory, I would recommend jumping straight to the examples and their related animations to gain the intuition first, and then return to the theory afterwards.
The four key points to take away are:
- As n→∞, true parameters with the best accuracy will always be preferred.
- As n→∞, if two true parameters are equally accurate but have different RLCTs, the parameter with the lower RLCT is preferred.
- For finite but large n, non-true parameters can be preferred by the posterior because of an accuracy-complexity tradeoff as measured by the WBIC.
- Parameters with low inaccuracy and small RLCT λ have low generalisation error (in a Bayesian sense), since the Bayes generalisation error Gn is the "derivative" of the free energy, so
$$G_n = L_n(w^{(0)}) + \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right).$$
Information Criteria Help Avoid Underfitting and Overfitting
In the last post, we derived the asymptotic free energy Fn as n→∞ for regular models, called the Bayesian Information Criterion (BIC):
$$\mathrm{BIC} = nL_n(w^{(0)}) + \frac{d}{2}\log n,$$
where n is the total number of datapoints in the dataset Dn, the optimal loss is Ln(w(0)) where w(0) ∈ W is a maximum likelihood estimate (i.e. w(0) = argmin_{w∈W} Ln(w)), and d is the total dimension of parameter space W ⊆ ℝ^d.
As a statistical practitioner, given some dataset Dn, your goal is to find a model that you hope will represent the truth, chosen from some list of candidates. You only have access to the truth via your (training) dataset Dn, but you also want to ensure that your model generalises to data beyond that dataset. You can use the BIC to compare candidates across a set of model classes, since it captures a precise asymptotic tradeoff between inaccuracy Ln(w(0)) and complexity d/2. Under this paradigm, we should choose the model that achieves the lowest BIC, as it is the best option for avoiding both underfitting and overfitting the data. Let's consider a simple example in action:
Example 1: Suppose we have n = 61 datapoints drawn from a quadratic with Gaussian noise, y = x² + ε where ε ∼ N(0, 0.15²) [1], and where x is drawn from the uniform distribution q(x) = ½·1(x ∈ [−1,1]). After looking at our scatterplot of data Dn, we could try models across the following set of model classes:
| Name | d | Model |
| --- | --- | --- |
| Linear | 2 | f₁(x,w) = w₁x + w₀ |
| Quadratic | 3 | f₂(x,w) = w₂x² + w₁x + w₀ |
| Cubic | 4 | f₃(x,w) = w₃x³ + w₂x² + w₁x + w₀ |
| Degree 15 | 16 | f₁₅(x,w) = w₁₅x¹⁵ + w₁₄x¹⁴ + ⋯ + w₁x + w₀ |
(The degree 15 model is an extreme case included just to illustrate the point.)
Within each model class, we can then perform ordinary least squares regression [2] to find the model fit w(0) with optimal loss Ln(w(0)) (which, in the regression case, is simply the mean-squared-error plus a constant[3]). With the optimal loss of each model class in hand, we can then compare the BIC over our set of candidates.
| Model | Lₙ(w⁽⁰⁾) | d | BIC |
| --- | --- | --- | --- |
| f₁(x,w) | 0.9668 | 1 | 63.09 |
| f₂(x,w) | 0.9283 | 2 | 62.80 |
| f₃(x,w) | 0.9280 | 3 | 64.83 |
| f₁₅(x,w) | 0.9243 | 15 | 87.22 |
As one expects, in a similar vein to the bias-variance tradeoff, there is a clear optimal (lowest) BIC. As the dimension increases, the accuracy gets better and better, but at the cost of the complexity of the model (and therefore, its generalisability[4]). The linear model is simple, but has high loss. The cubic has marginally lower loss, but at the expense of a complexity increase that isn't worth it. The degree 15 polynomial has the lowest loss of them all, but is penalised heavily for its complexity (as it should be - it is clearly overfitting). The Goldilocks choice is unsurprisingly the quadratic model, f2(x), because the tradeoff between accuracy and complexity is just right.
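To make this comparison concrete, here is a minimal sketch of the same kind of BIC calculation. It is not the code used to produce the table above: the random seed, the noise convention in the loss, and the parameter-count convention are assumptions, so the numbers will differ slightly, but the mechanics are the same: fit each class by least squares, convert the fit into Ln(w(0)), then add the (d/2) log n complexity penalty.

```python
import numpy as np

# Toy dataset (assumed): n = 61 inputs drawn uniformly from [-1, 1],
# targets y = x^2 + Gaussian noise with standard deviation 0.15.
rng = np.random.default_rng(0)
n, sigma = 61, 0.15
x = rng.uniform(-1.0, 1.0, n)
y = x**2 + rng.normal(0.0, sigma, n)

def optimal_loss(degree):
    """Fit a degree-`degree` polynomial by ordinary least squares and return
    L_n(w^(0)): the empirical negative log likelihood per datapoint under the
    Gaussian noise model (i.e. the mean-squared error plus a constant)."""
    coeffs = np.polyfit(x, y, degree)
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    return 0.5 * np.log(2 * np.pi * sigma**2) + mse / (2 * sigma**2)

for degree in (1, 2, 3, 15):
    d = degree + 1                       # number of parameters in this model class
    Ln = optimal_loss(degree)
    bic = n * Ln + 0.5 * d * np.log(n)   # BIC = n L_n(w^(0)) + (d/2) log n
    print(f"degree {degree:2d}:  L_n(w^(0)) = {Ln:.4f},  BIC = {bic:.2f}")
```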
Aside from the fact that the BIC simply does not hold in the singular case, this example also points to a limitation of regular models. Once you pick your model class to optimise, every point on the loss landscape has the same fixed model complexity. If your goal is to minimise the BIC, you only have one choice: find the single point that minimises the loss at the bottom of the well.
In fact, in our particular case we can calculate the KL divergence [5] for the linear model f1(x,w),
$$K(w_0, w_1) = \frac{1}{6}w_1^2 + \frac{1}{2}\left(w_0 - \frac{1}{3}\right)^2 + \frac{2}{45},$$
which we can see plotted below.
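As a quick sanity check of this closed form, the sketch below estimates K(w₀, w₁) by Monte Carlo, assuming the unit-variance convention K(w₀, w₁) = ½ E_x[(w₁x + w₀ − x²)²] with x uniform on [−1, 1] (the convention that reproduces the constants above); the sample size and test points are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 1_000_000)   # x ~ q(x), uniform on [-1, 1]

def K_monte_carlo(w0, w1):
    # K(w0, w1) = (1/2) E_x[(f_1(x, w) - x^2)^2] under the unit-variance convention
    return 0.5 * np.mean((w1 * x + w0 - x**2) ** 2)

def K_closed_form(w0, w1):
    return w1**2 / 6 + 0.5 * (w0 - 1/3) ** 2 + 2/45

for w0, w1 in [(0.0, 0.0), (1/3, 0.0), (0.5, -0.2)]:
    print(f"(w0, w1) = ({w0:.2f}, {w1:.2f}):  "
          f"MC = {K_monte_carlo(w0, w1):.5f},  closed form = {K_closed_form(w0, w1):.5f}")
```

Note that the minimum, at (w₀, w₁) = (1/3, 0), equals 2/45 > 0: the linear model class contains no true parameter, so its single optimum sits strictly above zero KL divergence.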
In singular models, this all changes: within the same model class, different models w in parameter space W have different effective dimensionalities as measured by the RLCT λ. The learning procedure does the work of the statistician for us, because the loss landscape contains information about both the accuracy and the complexity.
Watanabe's Free Energy Formula for Singular Models
Free Energy, Generalisation and Model Selection
Fundamentally, we care about the free energy Fn = −log Zn because it is a measure of posterior concentration, and as we showed with the BIC calculation in DSLT1, it tells us something about the information geometry of the posterior. In particular, given a compact neighbourhood W of parameter space, we can define its local free energy
$$F_n(W) = -\log\left(\int_W \varphi(w)\, e^{-nL_n(w)}\, dw\right),$$
providing a direct tool for comparing different regions of the posterior, and thus different models. Since there is a correspondence
$$\text{Small } F_n(W) \iff \text{Large posterior concentration } \int_W p(w \mid D_n)\, dw,$$
we say the posterior prefers a region W when it has low free energy relative to other regions of parameter space.
But this isn't the only reason to care about the free energy. In fact, it is explicitly related to generalisation, at least in the Bayesian sense.
In the frequentist framework that real-world deep learning takes place in (i.e. estimating a single parameter ŵ using SGD), we typically split our dataset Dn into a training set and a test set. We then say a model defined by ŵ generalises well when it has low loss on the test set - in other words, it performs well on data that it hasn't seen before.
In the Bayesian paradigm, generalisation can be formulated according to a number of different estimation methods and quantities, depending on how you extract information from the posterior (e.g. you could use a maximum likelihood estimate, a maximum a posteriori estimate, or an average over samples from the posterior, etc.). Let's focus for the moment on one that involves the Bayes predictive distribution, given by
$$p(y \mid x, D_n) = \mathbb{E}_w[p(y \mid x, w)] = \int_W p(y \mid x, w)\, p(w \mid D_n)\, dw,$$
which weights the probability of an output y given an input x according to the posterior measure. The Bayes generalisation loss is then given by
$$G_n = \mathbb{E}_X\big[-\log p(y \mid x, D_n)\big] = -\iint_{\mathbb{R}^{N+M}} q(y,x)\, \log p(y \mid x, D_n)\, dx\, dy.$$
Intuitively, it is the expected loss of the predictive distribution over all possible inputs x and outputs y. It can be shown with relative ease [6] that the Bayes generalisation loss is equal to the average increase in free energy,
$$G_n = \mathbb{E}_{X_{n+1}}[F_{n+1}] - F_n.$$
In an informal-yet-conceptually-correct way, we can treat this difference as being the "derivative with respect to n" [7].
It follows immediately that the generalisation loss of a region W is
$$G_n(W) = \mathbb{E}_{X_{n+1}}[F_{n+1}(W)] - F_n(W).$$
Ergo, to understand the information contained in the posterior, and which regions contain models with low generalisation error, we want to calculate the free energy. In even modestly simple settings this integral is intractable, which is why we need to calculate its asymptotic form.
The Free Energy Formula for Singular Models
This subsection is a little bit more technical. If this overwhelms you, I recommend skipping ahead to the next subsection where I interpret the free energy formula.
As I explained in DSLT1, finding the asymptotic form of the free energy Fn as n→∞ when I(w) is degenerate is hard, and depends on theorems from algebraic geometry and distribution theory. The formula has been refined over the course of many papers [8], adjusted and generalised under various hypotheses. Here we will focus on the form given in [Wat13], which also applies to the unrealisable case where the set of true parameters W0 may be empty. Thus we instead care about the set of optimal parameters
$$W_{\mathrm{opt}} = \left\{ w \in W \,\middle|\, L(w) = \min_{w' \in W} L(w') \right\},$$
and if W0 is non-empty, then W0=Wopt. As it stands, the free energy formula in the unrealisable case depends on a hypothesis of relatively finite variance [9].
Watanabe shows that the free energy of W asymptotically satisfies
$$F_n = nL_n(w^{(0)}) + \lambda \log n + U_n\sqrt{\frac{\lambda \log n}{2}} + O_p(1) \quad \text{as } n \to \infty,$$
where:
- n is the size of the dataset Dn.
- w(0) ∈ Wopt is a most singular optimal point of W (a notion we explain below), lying in the interior of W.
- λ ∈ ℚ_{>0} is the global RLCT, associated to the most singular point w(0).
- {Un} is a sequence of random variables that satisfies E[Un] = 0 and converges in law to a Gaussian random variable as n→∞.
- Op(1) denotes a sequence of random variables that is bounded in probability.
We will interpret the formula in a moment, but let me briefly clarify what it means to call w(0) a "most singular point", a notion made precise in [Lin11, Proposition 3.9]. (See the figure below, too.)
The gist is that every singularity w ∈ Wopt has an associated local RLCT λ_w, defined by considering small neighbourhoods around the point. The global RLCT of W is defined to be the minimum over these optimal points, λ = min_{w ∈ Wopt} λ_w, and an optimal point w(0) ∈ Wopt is a most singular point if λ_{w(0)} = λ [10]. Note also that the formula currently depends on an optimal parameter being in the interior of W, so you can think of a most singular point as being a local minimum of K(w).
The Widely Applicable Bayesian Information Criterion
In the asymptotic limit as n→∞, we can ignore the last two terms of the free energy formula and arrive at the Widely Applicable Bayesian Information Criterion (WBIC) across the full parameter space W,
$$\mathrm{WBIC} = nL_n(w^{(0)}) + \lambda \log n.$$
Notice how the WBIC formula is the same as the BIC except that complexity is measured by λ instead of d/2. But in DSLT1, we explained how:
- In regular models, λ = d/2.
- In singular models, λ ≤ d/2 in general.
Thus the WBIC is a generalisation of the BIC, containing the regular result as a special case.
Though the WBIC can be used to compare model classes as with the BIC, its real power is what it tells us about the information geometry of the posterior within the same class of models. To this end, we can calculate the local free energy of a compact neighbourhood W,
$$F_n(W) = nL_n(w^{(0)}_W) + \lambda_W \log n,$$
where w(0)_W ∈ Wopt is the most singular optimal point in W, with associated RLCT λ_W.
The Accuracy-Complexity Tradeoff
The WBIC shows that the free energy of a region W is composed of the accuracy and complexity (or, to the physicists, energy and entropy) of the most singular point in W:
$$F_n(W) = \underbrace{nL_n(w^{(0)}_W)}_{\text{inaccuracy}} + \underbrace{\lambda_W}_{\text{complexity}} \log n.$$
What makes this so profound is that in singular models, different regions W of parameter space can have different RLCTs λ_W, each with a different tradeoff between accuracy and complexity, unlike the regular model case where every region has a fixed complexity:
$$\text{Regular models:} \quad F_n(W) = n\big(\mathrm{Inaccuracy}(W)\big) + \log n \cdot \big(\text{Constant Complexity}\big)$$
$$\text{Singular models:} \quad F_n(W) = n\big(\mathrm{Inaccuracy}(W)\big) + \log n \cdot \big(\mathrm{Complexity}(W)\big)$$
So, the region in W that minimises the free energy has the best accuracy-complexity tradeoff. This is the sense in which singular models obey Occam's Razor: if two regions are equally accurate, then the simpler one is preferred.
Interpreting the terms in the free energy formula leads us to three main points:
- As n→∞, regions W containing optimal parameters w(0) ∈ Wopt will always be preferred, since the inaccuracy is the leading-order term.
- As n→∞, if there are multiple regions W1 and W2 with equally optimal accuracy, then they are preferred according to their respective RLCTs. Lower is better, so if λ1 < λ2 then W1 is preferred by the posterior.
- For finite but large n, regions that do not contain a globally optimal parameter can be preferred by the posterior, because they can have a better tradeoff between accuracy and complexity.
Why Singular Models (Can) Generalise Well
Armed with our free energy formula, we can now understand why singular models have the capacity to generalise well.
Recall that the Bayes generalisation error can be expressed as the difference Gn=EXn+1[Fn+1]−Fn, which can be interpreted as the "derivative" of the free energy with respect to n. Then since Fn=nLn(w(0))+λlogn, Watanabe is able to prove in [Wat18, Chapter 8] that, asymptotically, the Bayes generalisation error is
$$\mathbb{E}[G_n] = L(w^{(0)}) + \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right).$$
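To see heuristically where this comes from (a back-of-the-envelope differencing, not Watanabe's actual proof), substitute the free energy formula into Gn = E_{X_{n+1}}[F_{n+1}] − F_n, replacing the empirical loss with its expectation L(w(0)) and ignoring the fluctuation terms:
$$\mathbb{E}_{X_{n+1}}[F_{n+1}] - F_n \approx \big[(n+1)L(w^{(0)}) + \lambda \log(n+1)\big] - \big[nL(w^{(0)}) + \lambda \log n\big] = L(w^{(0)}) + \lambda \log\!\left(1 + \tfrac{1}{n}\right) = L(w^{(0)}) + \frac{\lambda}{n} + O\!\left(\tfrac{1}{n^{2}}\right).$$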
In fact, he goes one step further by considering two other forms of generalisation, the leave-one-out cross-validation loss CV and the WAIC (an empirically measurable form of the WBIC), and shows that asymptotically
$$\mathbb{E}[G_n] \approx \mathbb{E}[\mathrm{CV}] \approx \mathbb{E}[\mathrm{WAIC}] = L(w^{(0)}) + \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right).$$
Once again, we find that the RLCT λ plays a central role in the learning process. Most importantly, we have a correspondence:
$$\text{Optimal parameters } w^{(0)} \text{ with low RLCT } \lambda \iff \text{Low generalisation error}.$$
On top of this, we can carry out the same analysis on our local Fn(W) to find that the local generalisation loss of W is
$$\mathbb{E}[G_n(W)] = L(w^{(0)}_W) + \frac{\lambda_W}{n}.$$
Since the RLCT can differ from region to region, this tells us that:
In singular models, different regions W have different generalisation error.
All of this is to say: under any reasonable conception of Bayesian generalisation, the RLCT plays a central role in minimising generalisation loss, asymptotically. And since λ ≤ d/2 in singular models, the generalisation error of a singular model will always be better than that of a regular model with the same number of parameters. In DSLT3 we will show neural networks are singular, so:
This is why neural networks can generalise well!
...sort of. Don't forget, we are in the Bayesian paradigm here, and it is not a given that Stochastic Gradient Descent (SGD) finds the same regions of parameter space that the Bayesian posterior says are "good". We postulate that in some sense they are equivalent, and the work of Mingard et al. in Is SGD a Bayesian Sampler? Well, Almost. agrees with this postulate. But formalising this relationship remains a key open problem in conclusively applying SLT to modern deep learning with SGD and its variants.
From points to local neighbourhoods
For those with less background knowledge, there is an important conceptual shift we have made here that I want to elaborate on briefly.
In frequentist statistics we care about particular point estimates ŵ in parameter space W. But in Bayesian statistics, we care about measurable regions W of parameter space, and the probability that the posterior assigns to those regions. This is a powerful shift in perspective, and points towards why SLT is placed in a Bayesian paradigm: the observation that the geometry of K(w) contains a lot more information than simple point estimates do lends itself naturally to Bayesian statistics.
But in modern deep learning, we only ever have access to a point estimate ŵ at the end of training via SGD. Sampling from the Bayesian posterior for large neural networks would not only be silly, it would be essentially computationally impossible. So does this mean SLT has absolutely no applicability to modern deep learning?
Not at all. By studying the local geometry of the loss landscape K(w) - which is to say, arbitrarily small neighbourhoods W of the posterior - we are able to analyse the set of points that are arbitrarily close to the singularities of K(w).
What Watanabe shows is that the singularities contained in these small neighbourhoods affect the geometry of the other points in the neighbourhood. If W contains one true parameter w(0) ∈ W0, the other points in W may not be exact minima of K(w), but they are extremely close to being so. The same logic applies to the RLCT of a region: perhaps there is only one most singular point in W, but any nearby parameter within the small neighbourhood defines a model whose functional output is nearly identical to that of f(x, w(0)), which has the lower complexity.
In focusing only on points, we lose an extraordinary amount of information. So, we localise to neighbourhoods.
Intuitive Examples to Interpret the WBIC
It's time we looked at a specific example to build intuition about what the WBIC is telling us.
I have constructed this toy example specifically to illustrate the main points here, and a lot of the details about the sub-leading terms in the free energy expansion are glossed over, as are the random fluctuations that make Kn(w) different from K(w). But it is conceptually correct, and helps to illustrate the dominant features of the learning process as n→∞. Don't take it too literally, but do take it seriously.
We will start with some calculations, and then visualise what this means for the posterior as n→∞.
Example 1: True parameters are preferred according to their RLCT
Let's consider a one-parameter model (d = 1) with KL divergence defined by
$$K(w) = (w+1)^2(w-1)^4$$
on the region W = [−2, 2] with uniform prior φ(w) = ¼·1(w ∈ W). There are two singularities in the set of true parameters,
$$W_0 = \{-1, 1\},$$
which we will label w⁽⁰⁾₋₁ and w⁽⁰⁾₁ respectively. The Fisher information at true parameters is just the Hessian, which in the one-dimensional case is simply the second derivative,
$$I(w^{(0)}) = J(w^{(0)}) = \frac{d^2K}{dw^2}\bigg|_{w=w^{(0)}}.$$
An easy calculation shows
$$\frac{d^2K}{dw^2} = 2(w-1)^2(15w^2 + 10w - 1),$$
meaning w⁽⁰⁾₋₁ is a regular point and w⁽⁰⁾₁ is a singular point, since
$$\frac{d^2K}{dw^2}\bigg|_{w=-1} = 32 \neq 0, \qquad \frac{d^2K}{dw^2}\bigg|_{w=1} = 0.$$
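To see the posterior preference numerically, the sketch below computes the local free energies Fn(W) = −log ∫_W φ(w) e^{−nK(w)} dw over the two halves of W, using the population K(w) in place of the empirical Kn(w) (so random fluctuations are ignored). Near w = −1 we have K(w) ≈ 16(w+1)², giving local RLCT 1/2, while near w = 1 we have K(w) ≈ 4(w−1)⁴, giving local RLCT 1/4, so the right-hand region should win as n grows.

```python
import numpy as np
from scipy.integrate import quad

# Toy model from the example: K(w) = (w+1)^2 (w-1)^4 on W = [-2, 2], uniform prior 1/4.
# Near w = -1, K(w) ~ 16 (w+1)^2  => local RLCT 1/2 (regular point).
# Near w = +1, K(w) ~  4 (w-1)^4  => local RLCT 1/4 (singular point).
K = lambda w: (w + 1) ** 2 * (w - 1) ** 4
PRIOR = 0.25

def local_free_energy(n, a, b, peak):
    """F_n(W) = -log of the integral of phi(w) exp(-n K(w)) over W = [a, b],
    with the population K standing in for the empirical K_n."""
    integral, _ = quad(lambda w: PRIOR * np.exp(-n * K(w)), a, b,
                       points=[peak], limit=200)
    return -np.log(integral)

ns = np.array([10, 100, 1_000, 10_000, 100_000])
F_left = np.array([local_free_energy(n, -2.0, 0.0, -1.0) for n in ns])   # around w = -1
F_right = np.array([local_free_energy(n, 0.0, 2.0, 1.0) for n in ns])    # around w = +1

for n, fl, fr in zip(ns, F_left, F_right):
    print(f"n = {n:6d}:  F_n([-2,0]) = {fl:6.3f}   F_n([0,2]) = {fr:6.3f}")

# The slope of F_n(W) against log n estimates the local RLCT of each region.
print("estimated lambda (left): ", np.polyfit(np.log(ns), F_left, 1)[0])   # ~ 0.5
print("estimated lambda (right):", np.polyfit(np.log(ns), F_right, 1)[0])  # ~ 0.25
```

Both regions contain a true parameter, so both free energies grow like a constant plus λ log n; the lower RLCT of the singular point at w = 1 makes its region the preferred one as n→∞, exactly as the WBIC predicts.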