In 2018, Zhang et al. showed that deep neural networks can achieve perfect training loss on randomly labeled data.
This was a Big Deal.
It meant that existing generalization theory couldn't explain why deep neural networks generalize. That's because classical approaches to proving that a given model class (=neural network architecture) would generalize involved showing that it lacks the expressivity to fit noise. If a model class can fit noise arbitrarily well, the resulting bounds break.
So something needed to change.
Evidently, you can't prove tight generalization bounds for entire model classes, so theorists turned to studying generalization bounds for individual models within a model class. If you can empirically show that a model's performance doesn't change substantially when you perturb it (by adding noise to the inputs, weights, training samples, etc.), then you can theoretically prove that that model will generalize to new data.
As a result, the bounds have gotten tighter, but they're still not exactly flattering.
What's really needed is a secret third thing. It's not about either model classes or individual models but about model subclasses. While the model class as a whole may be too complex to obtain tight generalization bounds, individual subclasses can achieve an optimal trade-off between accuracy and complexity. For singular Bayesian learning machines, this trade-off happens automatically.
This more or less answers why models are able to generalize but not how they do it.
Singular learning theory (SLT) provides one possible path towards understanding the how of generalization. This approach is grounded in the geometry of the loss landscape, which in turn is grounded in the symmetries of the model class and data. If this direction pans out, then learning theory is poised for a revolution analogous to the transition from thermodynamics to statistical physics.
Introduction
The central aim of classical learning theory is to bound various kinds of error: in particular, the approximation error, generalization error, and optimization error.
In the previous post in this series, we looked at the approximation error, which measures the performance of a model class's hypothetically optimal model. If your model class isn't expressive enough to do a reasonable job of modeling the data, then it doesn't matter that your model generalizes to new data or that your learning process actually reaches that optimum in practice.
In this post, we'll look at bounding the generalization error, which measures how well a model fw∈F parametrized by w∈W transfers from a finite training set to additional samples drawn from the same distribution. We won't cover the related question of out-of-distribution generalization, which asks how a model will perform on data from a different distribution.
Last time, we started to see that approximation and generalization could not be separated in practice. Here, we'll see the same holds for generalization and optimization. SLT and many other strands of generalization theory point to a deep relation, which views generalization in nearly dynamical terms involving changes to probability distributions.
In the next post in this sequence, we'll examine the optimization error, which measures how close learning processes get to global optima. Coming at the connection with generalization from the other direction, we'll view optimization in quasistatic terms as a process of selecting among qualitatively distinct generalization strategies.
Outline
This is a self-contained introduction to generalization theory, as developed in three different schools of learning theory: classical learning theory, deep learning theory, and singular learning theory. There's a lot to cover:
Measuring "generalization" — First, we treat the question of how to quantify generalization, and how this differs between the paradigms of empirical risk minimization and Bayesian inference.
Bounding generalization — Next, we look at the different kinds of bounds learning theorists study. This will include classical approaches like uniform convergence and Probably Approximately Correct (PAC) learning, as well as more modern approaches like PAC Bayes.
Measuring "complexity" — Then, we'll cover the three different types of complexity measures that show up in these bounds: model-class complexity, single-model complexity, and model-subclass complexity. These roughly correspond to the different perspectives of classical learning theory, deep learning theory, and singular learning theory, respectively.
Thermodynamics of generalization — Finally, we'll examine the model-subclass complexity measures from a thermodynamic point of view. This raises the natural question: what would a statistical physics of generalization look like and what might we learn about how neural networks generalize?
Measuring "generalization"
The generalization error
To recap, classical learning theory is primarily grounded in the paradigm of Empirical Risk Minimization (ERM). ERM is a story about two kinds of risk.
The empirical risk Rn is what you might recognize as the "training loss",
$$R_n(f_w) := \frac{1}{n}\sum_{i=1}^n \ell(f_w(x_i), y_i),$$
where the average is taken over the n samples in your dataset Dn={(xi,yi)}ni=1.[1] Learning theorists maintain a sensible distinction between the individual-sample "loss" function ℓ and the average-loss "risk" Rn.
The population risk (or "true" risk) R is the expectation of the loss over the true distribution q(x,y) from which the dataset is drawn:
$$R(f_w) := \mathbb{E}_{q(x,y)}\left[\ell(f_w(x), y)\right] = \int \ell(f_w(x), y)\, q(x,y)\, dx\, dy.$$
In its broadest form, generalization is the question of how a model transfers from a finite dataset to new data. In practice, learning theorists study the generalization error (or generalization gap) Gn(fw),
$$G_n(f_w) := |g_n(f_w)| := |R(f_w) - R_n(f_w)|,$$
which addresses a more narrow question: how far apart are the empirical risk and population risk?
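To make these definitions concrete, here's a minimal Python sketch. It assumes a synthetic data-generating distribution q(x,y) and a fixed linear model, so the population risk can be approximated by a very large fresh sample; all names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_distribution(n):
    """Draw (x, y) pairs from a known synthetic q(x, y): y = 2x + Gaussian noise."""
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(scale=0.5, size=n)
    return x, y

def sq_loss(y_pred, y):
    return (y_pred - y) ** 2

def f_w(x, w=1.8):
    """A fixed model with slope w (not the optimum, to keep the gap visible)."""
    return w * x

# Empirical risk R_n: average loss on a finite training set of n samples.
x_train, y_train = true_distribution(n=50)
R_n = sq_loss(f_w(x_train), y_train).mean()

# Population risk R: approximated by a very large fresh sample from q(x, y).
x_big, y_big = true_distribution(n=1_000_000)
R = sq_loss(f_w(x_big), y_big).mean()

# Generalization error G_n = |R - R_n| for this particular model and dataset.
print(f"R_n = {R_n:.3f}, R = {R:.3f}, G_n = {abs(R - R_n):.3f}")
```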
Derived generalization metrics
The generalization error, Gn, is a function of both the learned model fw and dataset Dn, so by taking expectation values over data, weights, or both, we derive a family of related generalization metrics.
The relevant distributions include:
p(w|Dn): the "learning algorithm", some mapping from a dataset to weights. If we're using a deterministic algorithm, this becomes a delta function. In Bayesian learning, this is the posterior. For neural networks, think of the distribution of final weights over initializations and training schedules for a set of fixed optimizer hyperparameters.
q(Dn)=∏ni=1q(xi,yi): the true probability of the dataset. We're assuming samples are i.i.d.
p(w,Dn)=p(w|Dn)q(Dn): the joint distribution over learned weights and datasets.
From these distributions, we obtain:[2]
Ep(w,Dn)[Gn(fw)]: the average generalization error over learned weights and datasets.
Eq(Dn)[Gn(fw)]: the average generalization error over datasets for a fixed model fw.
Ep(w|Dn)[Gn(fw)]: the average generalization error over learned weights for a fixed dataset Dn.
Gn(Ep(w|Dn)[fw]): the generalization error for the average prediction made by the learned models for a fixed dataset.
And so on.
Bayesian generalization metrics
In a Bayesian setting, we replace the deterministic prediction ^y=fw(x) with a distribution p(y|x,w), which induces a likelihood p(Dn|w)=∏ni=1p(yi|xi,w)q(xi). Under certain assumptions, empirical risk minimization maps onto a corresponding Bayesian problem, in which the empirical risk is the negative log likelihood (up to a constant).[3]
The Bayesian learning algorithm is the posterior,
$$p(w|D_n) = \frac{p(D_n|w)\,\varphi(w)}{p(D_n)}.$$
But we're not just interested in updating our beliefs over weights — we also want to use our new beliefs to make predictions. Theoretically, the optimal way to do this is to average novel predictions over the ensemble of weights, weighting each model according to its posterior density. This yields the predictive distribution,
$$p(y|x,D_n) = \mathbb{E}_{p(w|D_n)}\left[p(y|x,w)\right] = \int p(y|x,w)\, p(w|D_n)\, dw.$$
In practice, evaluating these kinds of integrals is intractable. A more tractable alternative is Gibbs estimation, where we draw a particular choice of weights w′∼p(w|Dn) and make predictions using the corresponding model p(y|x,w′). This procedure is closer to the kind of paradigm we're in with neural networks: the model we train is a single draw from the distribution of final weights over different initializations and learning schedules.
Predicting according to either the predictive distribution or Gibbs estimation respectively yields two different kinds of generalization error:
The Bayes generalization error is the KL divergence between the true distribution and the predictive distribution:
$$G^B_n := D_{\mathrm{KL}}\left(q(y|x)\,\|\,p(y|x,D_n)\right) = \mathbb{E}_{q(y|x)}\left[\log\frac{q(y|x)}{\mathbb{E}_{p(w|D_n)}\left[p(y|x,w)\right]}\right].$$
The Gibbs generalization error is the posterior-averaged KL divergence between the true distribution and the model p(y|x,w):
$$G^G_n := \mathbb{E}_{p(w|D_n)}\left[D_{\mathrm{KL}}\left(q(y|x)\,\|\,p(y|x,w)\right)\right] = \mathbb{E}_{p(w|D_n)}\left[\mathbb{E}_{q(y|x)}\left[\log\frac{q(y|x)}{p(y|x,w)}\right]\right].$$
Making predictions for a given model and sampling according to the posterior are non-commutative — it's not possible to move the expectation value in or out of the above logarithm without changing the result. Thus, we get two different approaches to prediction and two different kinds of generalization error.
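The following sketch illustrates the difference, assuming we already have samples from the posterior of a toy one-parameter Gaussian model (the "posterior" here is just a stand-in normal distribution). The gap between the log of the averaged density and the average of the log densities is exactly the non-commutativity described above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend we already have posterior samples w' ~ p(w | D_n) for a one-parameter
# model p(y | x, w) = N(y; w * x, sigma^2). The "posterior" below is a stand-in.
posterior_samples = rng.normal(loc=2.0, scale=0.3, size=5000)
sigma = 0.5

def model_density(y, x, w):
    """Gaussian likelihood p(y | x, w)."""
    return np.exp(-0.5 * ((y - w * x) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x_new, y_new = 1.0, 2.1

# Predictive distribution: average the density over the posterior ensemble,
# then take the log (log of an average).
log_p_bayes = np.log(model_density(y_new, x_new, posterior_samples).mean())

# Gibbs estimation: predict with a single posterior draw; averaging over draws
# gives the posterior mean of the log density (average of a log).
w_draw = posterior_samples[0]
log_p_gibbs_single = np.log(model_density(y_new, x_new, w_draw))
log_p_gibbs_avg = np.log(model_density(y_new, x_new, posterior_samples)).mean()

# By Jensen's inequality, log-of-average >= average-of-log.
print(log_p_bayes, log_p_gibbs_single, log_p_gibbs_avg)
```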
Bounding generalization
When deriving generalization bounds, we have to decide not only which generalization metric to bound but also how to bound it. There are three main decisions:
"Global" or "local"? Do you want the results to hold over the entire set of weights or for a particular choice of weights?
Dataset-dependent or -independent? Do you want the results to hold for your particular set of samples or across all possible choices of datasets?
Uniform, average, or probabilistic? Do you want the bound to hold always, on average, or with some minimum probability?
Global vs. Local
The first and most important question is whether to bound generalization error over the entire model class or only for a particular model. Classical learning theory takes a global approach. Modern deep learning theory takes a more local approach. Singular learning theory suggests a third mesoscopic approach that obtains generalization bounds for model subclasses.
Global bounds. The global approach is a worst-case approach. The upper bound has to hold for all models, so it's only as tight as your worst-generalizing model. Finding these worst cases typically involves asking how well the model class can fit random noise. A model trained on purely random noise can't generalize at all because there's no signal, so fitting random noise perfectly means worst-case generalization error is terrible. This turns generalization into a question of expressivity, which is operationalized by complexity metrics like the VC dimension, the Rademacher complexity, and covering numbers.
Because neural networks often can fit noise perfectly, they're highly complex according to these measures, and the resulting bounds become vacuous: they trivially predict that test error will be less than 100%.
The reason we observe good generalization in practice is that not all models within a class are equally complex. For example, the success of weight-pruning shows that many weights can safely be set to zero and eliminated. Thus, average generalization can be very different from worst-case generalization.
Local bounds. Instead of studying how well a model class can express noise, you can study how robust a specific model is to noise — noise applied to the inputs (margin theory), noise applied to the weights (minimum sharpness/flatness, compression theory), or noise applied to the training samples (algorithmic stability & robustness). To obtain a generalization bound, it's possible to exchange these kinds of empirical robustness-to-noise for predicted robustness-to-new-data.
Though these results have led to real improvements, the bounds remain far too loose to explain large real-world models. Even for MNIST, the best upper bounds for % test error are an order of magnitude larger than what we actually observe in state-of-the-art models. Will progress eventually catch up, or do these limitations reflect a more fundamental problem with this approach?
Mesoscopic bounds. Instead of studying how a single model reacts to noise, you can study how the distribution of learned models reacts to noise. With a change of perspective, this turns studying generalization into a question of comparing model subclasses.
This point of view is already implicit in the areas of deep learning theory that obtain generalization bounds involving weight norm, compression, minimum sharpness/flatness, and various information-theoretic quantities. It is made explicit in singular learning theory, which proposes a set of canonical observables for studying model complexity and generalization, such as the learning coefficient, the singular fluctuation, and multiplicity. For idealized Bayesian learners, SLT predicts the values (not just upper bounds) of both Bayes and Gibbs generalization error in the large-data limit.
Dataset-dependence
A second question is whether to bound generalization error for the given dataset or across the distribution of possible datasets. Practically, we only have a single dataset and want to understand how models will generalize for that particular dataset. Theoretically, it's usually easier to marginalize out the dependence on a specific dataset.
Marginalizing the dataset is often justified on the grounds that we're more interested in how the model performs in the limit of large amounts of data. In these limits, the influence of any single data point vanishes. However, this assumption may not hold in situations where the distributions have fatter tails like RL.
Flavors of generalization bounds
The last question is mostly aesthetic: how do we dress up our bounds? Since generalization is made up of both data and weights, this breaks down into two subquestions: how to dress up the bound over data and how to dress up the bound over weights?
Uniform bounds. The strongest kind of bounds are ones that hold uniformly over the model class, that is, bounds involving universal quantifiers:
$$G_n(f_w) < \epsilon, \quad \forall w \in W.$$
This is defined relative to weights because we're rarely interested in bounds that hold uniformly over datasets. Except for trivial distributions, there are always pathological draws of data that will exhibit poor generalization.
Probabilistic bounds. Weaker (in the sense that they don't always hold) but often tighter (the upper bound is lower) are bounds that hold "in probability":
$$P\left(G_n(f_w) < \epsilon\right) > 1 - \delta.$$
Though one could formulate a probabilistic bound relative to the distribution over weights, in practice, these bounds are always defined relative to the distribution over data.
Expectation bounds. Weaker yet tighter still are bounds that hold "in expectation." These are bounds that hold after taking an expectation value over weights/datasets:
$$\mathbb{E}\left[G_n(f_w)\right] < \epsilon.$$
Unlike uniform bounds, which are usually reserved for weights, and probabilistic bounds, which are usually reserved for data, expectation bounds treat weights and data on more equal footing.
Combining bounds. Given these choices, it's possible to mix and match. Because these operations are non-commutative, you can swap the order to double the bounds. Vary the choice of underlying probability distribution, and the horrible profusion of different generalization bounds grows further. To simplify matters somewhat, it's often possible to transform bounds in one format to another.
Named families of bounds
Many of the particular combinations of choices for how and what to bound have dedicated names. Unfortunately, these names are neither exhaustive nor exclusive nor consistent.
A few of the more important ones:
Uniform convergence: bound Gn in probability over draws of a dataset and uniformly over all possible choices of model w∈W and for all possible choices of data-generating distribution q(x,y).
Probably Approximately Correct (PAC): bound Gn in probability over draws of the dataset Dn for a particular data-generating distribution and uniformly over the model class.
Mean-Approximately Correct (MAC): bound Gn in expectation over the joint distribution. In other words, bound Ep(w,Dn)[Gn].
PAC Bayes: bound Ep(w|Dn)[Gn] in probability over draws of the dataset.
Single draw: bound Gn in probability over p(w,Dn).
Mean-hypothesis bounds: bound Gn(Ep(w|Dn)[fw]).
In principle, the qualitative shape of the bound is orthogonal to the question of which specific quantities show up in the bounds. In practice, the distinction often gets muddled. For example, "PAC Bayes" refers both to the general idea of a PAC-style bound on Ep(w|Dn)[Gn] and to the specific idea of bounds that depend on a KL divergence between the posterior and prior. So it goes.
Uniform convergence
Uniform convergence is the most classical of the classical approaches to generalization theory. The term comes from analysis, where we say that a sequence of functions fn converges uniformly to a limiting function f if for every ϵ>0 there exists a natural number N such that for all n>N and for all inputs x,
$$|f_n(x) - f(x)| < \epsilon.$$
In learning theory, we're interested in the empirical risk converging uniformly to the true risk, which means that the difference between the empirical risk and true risk becomes arbitrarily small for every model provided you have enough data (Hellström et al. 2023, Shalev-Shwartz et al. 2010):
$$\sup_{q\in\mathcal{Q}}\,\mathbb{E}_{q(D_n)}\left[\sup_{w\in W} G_n(f_w)\right] \xrightarrow{\;n\to\infty\;} 0.$$
Confusingly, the expectation value over data can be replaced by a probabilistic bound. The important bit of uniform convergence is that it is a form of "worst-case" analysis: you obtain a bound that holds uniformly both across all possible models in the model class and across all possible choices of data distribution q∈Q.
There are two problems with uniform convergence from the perspective of deep learning:
Generalization is not uniform across the model class. Given two models with the same performance on the training dataset, one may nevertheless generalize better than the other on new data.
Generalization is not uniform across data-generating distributions. The fact that neural networks can generalize well when trained on real-world datasets and generalize poorly when trained on random data suggests our bounds should be sensitive to the choice of data distribution.
Probably Approximately Correct (PAC)
The PAC-learning framework (Valiant 1984) addresses the second weakness of uniform convergence bounds: PAC-style bounds are defined relative to some fixed data distribution. The bounds still hold uniformly over the model class but are now defined in probability over data:
$$P_{q(D_n)}\left[\sup_{w\in W} G_n(f_w) < \epsilon\right] \geq 1 - \delta.$$
Your model class will "probably" (with probability at least 1−δ) be "approximately correct" (with generalization error less than ϵ).
In particular, learning theorists then try to relate ϵ and δ to each other and to other quantities like the number of parameters d, the number of samples n, the weight norm |w|, etc. Then, you can invert the relation to figure out how large you should make your model and how many samples you need to obtain a desired error with a desired probability.
For example, if the model class F is finite, then it satisfies the PAC bound so long as the sample size obeys
$$n \geq \frac{1}{2\epsilon^2}\left(\log 2|\mathcal{F}| + \log\frac{1}{\delta}\right).$$
The minimum n that satisfies a bound of this kind is called the sample complexity.
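As a worked example, here's the arithmetic for the finite-class bound above, using the bound exactly as stated and purely illustrative numbers (a class of one million models, 5% tolerance, 99% confidence):

```python
import math

def pac_sample_complexity(card_F, eps, delta):
    """Smallest n satisfying n >= (1 / (2 eps^2)) * (log(2 |F|) + log(1 / delta))."""
    return math.ceil((math.log(2 * card_F) + math.log(1 / delta)) / (2 * eps ** 2))

# e.g. a finite class of one million models, 5% tolerance, 99% confidence:
print(pac_sample_complexity(card_F=10**6, eps=0.05, delta=0.01))  # 3823 samples
```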
Historically, much of the work in PAC-style generalization theory has looked like trying to replace the log2|F| term above with a suitable finite complexity measure when the model class becomes infinite (see comments by Telgarsky 2021).
Mean Approximately Correct (MAC)
PAC-learning is more relaxed than uniform convergence, but it's still a worst-case approach where the bounds have to hold across the entire model class. The relaxation is in the data distribution. Instead, we can consider an average-case bound that holds in expectation over learned models and different draws of the data,
$$\mathbb{E}_{p(w,D_n)}\left[G_n(f_w)\right] < \epsilon.$$
A practical example is the following bound,
$$\left|\mathbb{E}_{p(w,D_n)}\left[g_n(f_w)\right]\right| \leq \sqrt{\frac{I(w\,\|\,D_n)}{2n}},$$
in terms of the mutual information between weights and data
$$I(w\,\|\,D_n) = D_{\mathrm{KL}}\left(p(w,D_n)\,\|\,p(w)\,q(D_n)\right).$$
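To get a feel for the numbers, here's the bound evaluated at hypothetical values: if training leaks about 100 nats of information about the dataset into the weights, and n = 60,000, the expected generalization gap is bounded by roughly 0.03 (in units of the loss). The figures are purely illustrative.

```python
import math

def mac_bound(mutual_info_nats, n):
    """Upper bound on |E[g_n]| from the mutual-information bound above."""
    return math.sqrt(mutual_info_nats / (2 * n))

# e.g. ~100 nats of information about the dataset leaked into the weights:
print(mac_bound(mutual_info_nats=100, n=60_000))  # ~ 0.029
```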
PAC Bayes
PAC Bayes (McAllester 1999, Shawe-Taylor and Williamson, 1997) keeps the expectation value over weights but swaps the expectation value over data with a probabilistic bound over data:
$$P_{q(D_n)}\left[\mathbb{E}_{p(w|D_n)}\left[G_n(f_w)\right] < \epsilon\right] \geq 1 - \delta.$$
It's called "PAC Bayes" because the distribution over weights is usually taken to be a Bayesian posterior. Even when the distribution isn't a posterior, the Bayesian posterior is still often the learning algorithm that minimizes the resulting bound.
In the real world, the bounds end up looking something like (Dziugaite et al., 2021):
$$\mathbb{E}_{p(w|D_n)}\left[R(f_w) - \frac{1}{\beta}R_n(f_w)\right] \leq \frac{1}{2n\beta(1-\beta)}\left(D_{\mathrm{KL}}\left(p(w|D_n)\,\|\,\varphi(w)\right) + \log\frac{1}{\delta}\right),$$
where φ is the prior over w and β controls the tradeoff between accuracy and complexity.
Data-dependent priors. When dealing with PAC-Bayes bounds involving a KL divergence between the prior and posterior, a common dirty trick of the trade is to adopt "data-dependent priors." If φ(w) is allowed to depend on data, then you can obtain smaller KL divergences on the right-hand side, thus tightening the bound.
One such approach (Ambroladze et al., 2006; Dziugaite et al., 2021) involves splitting the dataset Dn into two subsets, Bm and Pn−m. The learning procedure remains the same as before (i.e., you condition the posterior on the entire dataset), but now you evaluate the training loss inside the bound solely on Pn−m. You can then let the prior depend on Bm without violating the manipulations used to obtain this result.
Yes, this is a bit nasty, but we'll see later that this is getting at a deeper and (possibly) theoretically justified point: the idea that we should view generalization in terms of how a probability distribution changes in response to additional data.
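As a sketch of how a bound like the one above gets evaluated in practice, suppose (purely for illustration) that the posterior and prior are diagonal Gaussians, centered at the learned weights and at the initialization respectively, so the KL term has a closed form. The dimensions, variances, and hyperparameters below are made up, and this is not the construction used in the cited papers.

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag var_q) || N(mu_p, diag var_p) ), in nats."""
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def pac_bayes_rhs(kl, n, beta, delta):
    """Right-hand side of the bound above: (KL + log 1/delta) / (2 n beta (1 - beta))."""
    return (kl + np.log(1.0 / delta)) / (2 * n * beta * (1 - beta))

rng = np.random.default_rng(2)
d = 10_000                                            # number of weights (illustrative)
w_init = rng.normal(scale=0.1, size=d)                # "prior" mean: the initialization
w_learned = w_init + rng.normal(scale=0.01, size=d)   # "posterior" mean: learned weights

kl = kl_diag_gaussians(w_learned, np.full(d, 1e-3), w_init, np.full(d, 1e-2))
print(pac_bayes_rhs(kl, n=60_000, beta=0.5, delta=0.05))
```

A data-dependent prior tightens this by centering φ somewhere informed by part of the data (rather than at the raw initialization), which shrinks the KL term.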
Measuring "complexity"
Look at the preceding examples of generalization bounds, and you'll notice the upper bounds all involve some kind of complexity measure: e.g., the number of models for a finite model class |F|, the mutual information between weights and data I(w∥Dn), and the Kullback-Leibler divergence between the prior and posterior, DKL(p∥φ).
This is true for more or less all generalization bounds. By moving terms around (this requires some care when there are expectations or probabilities involved), we can re-express generalization bounds as,
$$L(w) \leq B\left(L_n(w), \Lambda_n(w)\right),$$
where Λn(w) is some notion of complexity, and B is a monotonically increasing function in both arguments. These expressions formalize Occam's razor: achieving high performance on novel data (low L(w)) requires trading off high accuracy on the training data (low Ln(w)) against simplicity (low Λn(w)).
In this section, we'll skip over the details of how various bounds are derived[4] and instead directly examine the notions of complexity that show up in the final relations.
Flavors of complexity. At a high level, there are three main flavors of complexity measures. These map onto the distinction between global, local, and mesoscopic generalization bounds discussed earlier. These are, respectively:
Model-class complexities: Λn(w)=Λn(W) independent of the particular choice of w∈W.
Single-model complexities: associated to a fixed choice of weights w.
Model-subclass complexities: Λn(w)=Λn(Wα), where Wα⊂W.
Model-class complexity
Most notions of model-class (as well as single-model) complexity measure how well a model class can express noise. If your model class can fit more noise, then it's more complex.
Vapnik-Chervonenkis (VC) dimension
We'll start with the best-known model-class complexity measure: the VC dimension. This requires us to go back to binary classification.
We start by defining the growth function, which is the maximum number of different ways a model class can classify an arbitrary set of inputs (x1,…,xm)∈Xm:
$$g_{\mathcal{F}}(m) := \max_{(x_1,\ldots,x_m)\in X^m}\left|\left\{\left(f_w(x_1),\ldots,f_w(x_m)\right) : w\in W\right\}\right| \leq 2^m.$$
If the growth function saturates its upper bound, then we say that the model class "shatters" that set of samples. This means the model class always contains a model that can perfectly classify the given inputs under any possible relabeling.
So a straight line can shatter a set of three non-collinear points in the 2D plane, but not four. A parabola can shatter four points but not five. And so on.
Then, the VC dimension is the size of the largest set of points that can be shattered by a given model class:
$$d_{\mathrm{VC}} := \max\left\{m\in\mathbb{N} : g_{\mathcal{F}}(m) = 2^m\right\}.$$
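To make "shattering" concrete, here's a brute-force check for linear classifiers in the plane (a sketch using scipy): enumerate every labeling of a point set and test each one for strict linear separability with a small feasibility LP. Three non-collinear points are shattered; four points (here the XOR configuration) are not.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """Feasibility LP: find (w1, w2, b) with y_i * (w . x_i + b) >= 1 for all i."""
    A_ub = np.array([-y * np.array([x[0], x[1], 1.0]) for x, y in zip(points, labels)])
    b_ub = -np.ones(len(points))
    res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 3)
    return res.status == 0  # status 0 means a feasible separator was found

def shattered_by_lines(points):
    """True if every +/-1 labeling of the points is linearly separable."""
    return all(linearly_separable(points, labels)
               for labels in itertools.product([-1, 1], repeat=len(points)))

print(shattered_by_lines([(0, 0), (1, 0), (0, 1)]))          # True:  3 points
print(shattered_by_lines([(0, 0), (1, 1), (0, 1), (1, 0)]))  # False: the XOR labeling fails
```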
In terms of the VC dimension dVC we end up satisfying the PAC bound if our sample size is
$$n \geq \frac{c}{\epsilon}\left(d_{\mathrm{VC}}\log\frac{1}{\epsilon} + \log\frac{1}{\delta}\right),$$
where c is a constant. We replaced the term depending on the size of F in the original PAC bound with a term that depends on the VC dimension of a potentially infinite model class.
The VC dimension is straightforward to generalize to more realistic settings (e.g., the Natarajan dimension for multi-class classification and Pollard's pseudo-dimension for regression). It's also possible to derive tighter bounds by including information about the shape of the data distribution as in the fat-shattering dimension.
The problem with applying these variants to neural networks is that we are typically working in or near the over-parametrized regime, where the number of parameters is larger than the number of samples. In this regime, as Zhang et al. showed, we can typically shatter the entire training set, so dVC is on the order of n and the bound is not satisfied for reasonable tolerances.[5]
Rademacher complexity
The basic idea behind the Rademacher complexity, like the VC dimension, is to measure how well a function class can fit random noise. First, let us define the empirical Rademacher complexity of a model class W with respect to a dataset Dn:
$$\mathrm{Rad}_{D_n}(W) := \mathbb{E}_{p(\sigma_1\ldots\sigma_n)}\left[\sup_{w\in W}\frac{1}{n}\sum_{i=1}^n \sigma_i f_w(x_i)\right].$$
The idea is to choose a set of random Rademacher variables, {σi}ni=1, that take on the values −1 and 1 with equal probability. We apply these random sign flips to each of our individual predictions and take the average. Then, we maximize the average flipped prediction over models. Finally, we average over different sign flips.
Intuitively, imagine separating your dataset into a train and test set. The Rademacher complexity measures the maximum discrepancy over the model class between your predictions for the two sets.
From this, the Rademacher complexity of W with respect to a probability distribution q(Dn) is the expectation of the empirical Rademacher complexity taken over draws of the dataset:
$$\mathrm{Rad}_n(W) := \mathbb{E}_{q(D_n)}\left[\mathrm{Rad}_{D_n}(W)\right].$$
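For most model classes the inner supremum is intractable, but for a norm-bounded linear class it has a closed form via Cauchy-Schwarz, which makes a Monte Carlo estimate of the empirical Rademacher complexity easy to sketch. The class and data below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def empirical_rademacher_linear(X, B, num_draws=10_000):
    """Monte Carlo estimate of Rad_{D_n} for the class {x -> w.x : ||w||_2 <= B}.

    For this class the inner sup has a closed form (Cauchy-Schwarz):
    sup_w (1/n) sum_i sigma_i w.x_i = (B/n) * || sum_i sigma_i x_i ||_2.
    """
    n = X.shape[0]
    sigmas = rng.choice([-1.0, 1.0], size=(num_draws, n))  # random sign flips
    sups = B / n * np.linalg.norm(sigmas @ X, axis=1)
    return sups.mean()

X = rng.normal(size=(100, 5))                 # a dataset of n = 100 inputs in R^5
print(empirical_rademacher_linear(X, B=1.0))  # shrinks roughly like 1/sqrt(n)
```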
In terms of the Rademacher complexity, we obtain the PAC bound whenever
$$\epsilon \geq 2\,\mathrm{Rad}_n(W) + 3c\sqrt{\frac{\log(2/\delta)}{2n}},$$
where c is a constant that depends on the range of functions in F.
Since the Rademacher complexity is uniform across W, it runs into the same issues as the VC dimension. Another problem is that it is not necessarily easy to calculate. In practice, learning theorists often resort to further upper-bounding the Rademacher complexity in terms of covering numbers.
Covering numbers
The covering number is the number of spherical balls of a given size needed to cover a given space (with overlaps allowed). The idea is that we're "coarse-graining" the model class with a finite collection of representative elements.
In terms of Nr(W), the covering number of W for balls of radius r=ϵ/2c, we satisfy the PAC bound if
$$n \geq \frac{2c}{\epsilon^2}\left(\log 2N_r(W) + \log\frac{1}{\delta}\right),$$
where c is some constant.
By default, Nr(W) will scale exponentially with model dimension d, so the resulting bounds don't become much tighter than VC. However, conceptually, this is getting closer to the idea of model-subclass complexity. The differences are that covering numbers still treat each subclass as identical, and we're not yet coarse-graining in any principled way.
Single-model complexity
Instead of constraining entire model classes, it's much more tractable to study the complexity of individual models. Most notions of single-model complexity are based on the idea of complexity as sensitivity to noise: noise applied to the inputs (margin theory), noise applied to the weights (minimum sharpness/flatness), and noise applied to the training samples (algorithmic stability & robustness).
Margins
Margin theory studies the eponymous margin, which is the minimum distance from a sample to the decision boundary of a classifier. The larger the margin, the more robust the classifier is to small perturbations in the input data, thus the better we expect it to generalize to novel data.
This dates back to the good old days of support vector machines (SVMs), for which you can explicitly calculate margins and use them to design more robust classifiers (see Vapnik, 1995). With DNNs, calculating margins is intractable.
The more fundamental problem is that the assumption that "large margins = better generalization" doesn't hold for realistic models. Which of the following two decision boundaries would you say generalizes better?
Margin theory certainly seems to have some explanatory power, especially in settings with simpler and more structured data (e.g., Mohamadi et al. 2023). The basic idea of studying sensitivity to input noise remains a good one. It's just that the specific focus on margins appears to be overemphasized.
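For a linear classifier the margin is easy to compute directly: the distance from a point x to the decision boundary {x : w·x + b = 0} is |w·x + b| / ‖w‖. A minimal sketch on synthetic data (everything below is illustrative):

```python
import numpy as np

def margin(w, b, X, y):
    """Minimum signed distance from the samples to the decision boundary
    {x : w.x + b = 0} of a linear classifier; negative if any sample is misclassified."""
    distances = y * (X @ w + b) / np.linalg.norm(w)
    return distances.min()

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=[-2, 0], scale=0.5, size=(50, 2)),
               rng.normal(loc=[+2, 0], scale=0.5, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)
print(margin(w=np.array([1.0, 0.0]), b=0.0, X=X, y=y))
```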
Minimum flatness/sharpness
Where margin theory hinges on the intuition that robustness to input perturbations should lead to generalization, "minimum flatness" hinges on the intuition that robustness to weight perturbations should lead to generalization.
If the curvature (as measured by the Hessian) of the loss landscape is low, then a small change to the weights will mean a small change in the loss. So we might expect that the loss landscape shouldn't change much when we perturb it by introducing novel samples (= asking the model to generalize). Conversely, if the curvature is high, the model is more complex and likely to generalize poorly.
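One crude way to operationalize this intuition (not the specific measure used in the papers cited below) is to report the average increase in loss under random Gaussian perturbations of the weights. Here's a sketch on a toy logistic-regression landscape; the data, scale ρ, and "learned" weights are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(5)

# A toy loss landscape: logistic regression on synthetic data.
X = rng.normal(size=(200, 10))
y = (X @ rng.normal(size=10) > 0).astype(float)

def loss(w):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def expected_sharpness(w, rho=0.05, num_samples=500):
    """Average increase in loss under Gaussian weight perturbations of scale rho.
    Lower values = flatter; note this naive version is not reparametrization-invariant."""
    base = loss(w)
    bumps = [loss(w + rho * rng.normal(size=w.shape)) - base for _ in range(num_samples)]
    return np.mean(bumps)

w_hat = np.zeros(10)   # stand-in for learned weights; in practice, the trained model
print(expected_sharpness(w_hat))
```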
As is true for many ideas in deep learning, this dates back to a paper by Hochreiter and Schmidhuber written in the '90s. They offer intuition built on the minimum description length (MDL) principle from algorithmic complexity theory. In the language of MDL, fewer bits are needed to specify the location of "flat" minima, and simpler functions should generalize better.
Keskar et al. (2017) demonstrate that minibatch SGD converges to flatter minima and that this correlates strongly with the generalization error. However, Neyshabur et al. (2017a) and Dinh et al. (2017) point out that this naive approach is flawed: sharpness can be arbitrarily scaled by reparametrizations, and models trained on random labels routinely have flatter minima than models trained on true labels. To remedy this, Kwon et al. (2021) propose adaptive sharpness, which is invariant to reparametrizations and seems to correlate well with generalization error.
As it turns out, adaptive sharpness isn't enough to save sharpness theory. Andriushchenko et al. (2023) find that counter to classical intuitions, sharpness — and even the smarter reparametrization-invariant notions — correlate poorly or even negatively with generalization in transformers on larger datasets. In their words, "sharpness is not a good indicator of generalization in the modern setting."
Algorithmic stability & robustness
Yet another notion of sensitivity to noise is sensitivity of the final model to changes in the training set. This is the idea behind Bousquet and Elisseeff's (2000) algorithmic stability: if changing a training sample doesn't change a model's performance, the empirical risk should have low variance.
A learning algorithm is said to have uniform stability β if, for every dataset Dn, every index i, and every sample (x,y),
$$\left|\ell\!\left(f_{w(D_n)}(x),\, y\right) - \ell\!\left(f_{w(D_n^{\setminus i})}(x),\, y\right)\right| \leq \beta,$$
where w(Dn) denotes the weights learned from Dn and Dn∖i denotes Dn with the i-th sample removed. In other words, β upper-bounds how much the loss at any sample can change when a single training sample is omitted during training.
This leads to the following PAC-style bound for the specific weights obtained by the learning algorithm (Bousquet and Elisseeff 2002):
$$G_n(f_w) \leq 2\beta + \left(4n\beta + 1\right)\sqrt{\frac{\log\frac{1}{\delta}}{2n}}.$$
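The formal constant β involves a supremum over all datasets and samples, but the idea is easy to probe empirically with a leave-one-out proxy: retrain without each sample and record how much the loss at that sample changes. Here's a sketch for ridge regression, where retraining is a cheap closed-form solve; the data and λ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def empirical_stability(X, y, lam=1.0):
    """Max change in loss at each held-out point when that point is omitted
    from training: a leave-one-out proxy for the stability constant beta."""
    n = X.shape[0]
    w_full = fit_ridge(X, y, lam)
    changes = []
    for i in range(n):
        mask = np.arange(n) != i
        w_loo = fit_ridge(X[mask], y[mask], lam)
        full_loss = (X[i] @ w_full - y[i]) ** 2
        loo_loss = (X[i] @ w_loo - y[i]) ** 2
        changes.append(abs(full_loss - loo_loss))
    return max(changes)

X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
print(empirical_stability(X, y, lam=1.0))   # larger lam -> more stable (smaller proxy beta)
```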
This avenue of research has strong parallels to work on influence functions (e.g., Grosse et al. 2023) and to work on differential privacy starting with Dwork et al. (2006), which is about making sure that you can't tell whether a sample belongs to the training set. It's also been extended to algorithmic robustness by Xu and Mannor (2012), which looks at sensitivity to changing the entire dataset rather than individual examples.
This is getting closer to the perspective we already glimpsed with the PAC-Bayes data-dependent priors, where model complexity was related to how the model changes in response to additional data. It's a perspective we'll see again in SLT.
Model-subclass complexity
So far, we've seen generalization as a question about expressivity of a model class (how well it can fit random noise), and we've seen generalization as a question about robustness of specific models (how well they can "resist" small amounts of noise). Another possibility is to view generalization as a question about the robustness of the distribution of learned models p(w|Dn).
In this section, we'll explore a variety of complexity measures that involve examining how a distribution of learning machines changes, particularly in response to new data. We'll see in the next section that this turns studying generalization into a problem of comparing subclasses of weights.
Information-theoretic metrics
One of the more principled approaches to studying generalization in modern deep learning theory comes from a strand of information-theoretic complexity measures that developed in parallel to (and in isolation from) modern PAC-Bayes theory. The complexity measures that show up in these bounds are information-theoretic metrics in terms of the various distributions over weights and data. The associated bounds are in expectation over weights — not uniform.
We already saw one example in terms of the mutual information between weights and data, I(w∥Dn), which is the Kullback-Leibler divergence between the joint distribution and the product of the marginal distributions,
I(w∥Dn)=DKL(p(w,Dn)∥p(w)q(Dn)).
This may be the most intuitive notion of complexity so far: models are more complex if they contain more information about the dataset.
We saw another example in the KL divergence between the posterior and prior DKL(p∥φ) that showed up in the PAC-Bayes example. Models are "simpler" if their weights are "closer" to their initial values.
Unfortunately, these bounds run into problems when applying them to real-world systems because these information-theoretic metrics are typically intractable to calculate or even estimate.
Compressibility
Compression-based bounds build on the same idea as minimum flatness/sharpness: robustness to changes in the weights. If the model is highly compressible and you can discard many weights without impacting performance, the model's effective dimensionality is lower than its apparent dimensionality. It's simpler than it looks and should therefore generalize.
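Here's a minimal sketch of that intuition using magnitude pruning on a toy over-parametrized linear model, where only a handful of the 100 fitted weights matter, so test error should stay roughly flat as most weights are zeroed out. The setup is illustrative, not the construction used in the papers below.

```python
import numpy as np

rng = np.random.default_rng(7)

# A toy over-parametrized linear model: only the first 5 of 100 inputs matter.
X = rng.normal(size=(500, 100))
w_true = np.zeros(100)
w_true[:5] = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=500)

X_test = rng.normal(size=(500, 100))
y_test = X_test @ w_true + 0.1 * rng.normal(size=500)

w_fit = np.linalg.lstsq(X, y, rcond=None)[0]   # dense fit: all 100 weights non-zero

def test_mse(w):
    return np.mean((X_test @ w - y_test) ** 2)

# Magnitude pruning: zero out all but the k largest-magnitude weights.
for k in (100, 50, 20, 10, 5):
    keep = np.abs(w_fit) >= np.sort(np.abs(w_fit))[-k]
    w_pruned = np.where(keep, w_fit, 0.0)
    print(k, round(test_mse(w_pruned), 4))
```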
Arora et al. (2018) provide the seminal treatment. They convert a given model into a set of simpler, compressed models, which lets them apply the tooling of the typical model-class-complexity bounds to this family. The framework doesn't directly predict the generalization of the original model but of the constructed subclass.
Zhou et al. (2019) combine this idea with the idea of data-dependent priors that shows up in PAC-Bayes theory. The posterior is chosen to be a Gaussian centered at the learned weights and the prior is chosen to be the set of compressed weights (obtained by pruning). Similar to the approach taken by Dziugaite and Roy (2017), they add in a dash of minimum sharpness by overlaying Gaussian noise over the non-zero weights. The resulting bounds are among the first to be non-vacuous for neural networks, but they're still far from tight.
The main limitation of this approach seems to be that these subclasses are currently constructed in a rather ad-hoc way. For example, singular posteriors are not asymptotically Gaussian, so modeling posteriors as Gaussian is unfounded. In addition, neural networks have many more kinds of degeneracy than just weights that can be pruned (see for example Hanin and Rolnick 2019). What this means is that current bounds likely severely underestimate how compressible neural networks really are.
One contender for a more natural notion of compressibility is SLT's learning coefficient. Though the exact link with ideas such as the Minimum Description Length (MDL) principle and compressibility is an open question, it's clear that the learning coefficient captures a much richer set of degeneracies than, for example, the number of weights you can prune.
The free energy / stochastic complexity
Given a likelihood and prior, we can define the Bayesian free energy (or stochastic complexity) Fn as the negative logarithm of the marginal likelihood (or model evidence):
$$F_n := -\log p(D_n) = -\log \int p(D_n|w)\,\varphi(w)\, dw.$$
Summary
In 2018, Zhang et al. showed that deep neural networks can achieve perfect training loss on randomly labeled data.
This was a Big Deal.
It meant that existing generalization theory couldn't explain why deep neural networks generalize. That's because classical approaches to proving that a given model class (=neural network architecture) would generalize involved showing that it lacks the expressivity to fit noise. If a model class can fit noise arbitrarily well, the resulting bounds break.
So something needed to change.
Evidently, you can't prove tight generalization bounds for entire model classes, so theorists turned to studying generalization bounds for individual models within a model class. If you can empirically show that a model's performance doesn't change substantially when you perturb it (by adding noise to the inputs, weights, training samples, etc.), then you can theoretically prove that that model will generalize to new data.
As a result, the bounds have gotten tighter, but they're still not exactly flattering.
What's really needed is a secret third thing. It's not about either model classes or individual models but about model subclasses. While the model class as a whole may be too complex to obtain tight generalization bounds, individual subclasses can achieve an optimal trade-off between accuracy and complexity. For singular Bayesian learning machines, this trade-off happens automatically.
This more or less answers why models are able to generalize but not how they do it.
Singular learning theory (SLT) provides one possible path towards understanding the how of generalization. This approach is grounded in the geometry of the loss landscape, which in turn is grounded in the symmetries of the model class and data. If this direction pans out, then learning theory is posed for a revolution analogous to the transition between thermodynamics and statistical physics.
Introduction
The central aim of classical learning theory is to bound various kinds of error: in particular, the approximation error, generalization error, and optimization error.
In the previous post in this series, we looked at the approximation error, which measures the performance of a model class's hypothetically optimal model. If your model class isn't expressive enough to do a reasonable job of modeling the data, then it doesn't matter that your model generalizes to new data or that your learning process actually reaches that optimum in practice.
In this post, we'll look at bounding the generalization error, which measures how well a model fw∈F parametrized by w∈W transfers from a finite training set to additional samples drawn from the same distribution. We won't cover the related question of out-of-distribution generalization, which asks how a model will perform on data from a different distribution.
Last time, we started to see that approximation and generalization could not be separated in practice. Here, we'll see the same holds for generalization and optimization. SLT and many other strands of generalization theory point to a deep relation, which views generalization in nearly dynamical terms involving changes to probability distributions.
In the next post in this sequence, we'll examine the optimization error, which measures how close learning processes get to global optima. Coming at the connection with generalization from the other direction, we'll view optimization in quasistatic terms as a process of selecting among qualitatively distinct generalization strategies.
Outline
This is a self-contained introduction to generalization theory, as developed in three different schools of learning theory: classical learning theory, deep learning theory, and singular learning theory. There's a lot to cover:
Measuring "generalization"
The generalization error
To recap, classical learning theory is primarily grounded in the paradigm of Empirical Risk Minimization (ERM). ERM is a story about two kinds of risk.
The empirical risk Rn is what you might recognize as the "training loss",
Rn(fw):=n∑i=1ℓ(fw(xi),yi),where the average is taken over the n samples in your dataset Dn={(xi,yi)}ni=1.[1] Learning theorists maintain a sensible distinction between the individual-sample "loss" function ℓ and the average-loss "risk" Rn.
The population risk (or "true" risk) R is the expectation of the loss over the true distribution q(x,y) from which the dataset is drawn:
R(fw):=Eq(x,y)[ℓ(fw(x),y)]=∫ℓ(fw(x),y) q(x,y) dxdy.In its broadest form, generalization is the question of how a model transfers from a finite dataset to new data. In practice, learning theorists study the generalization error (or generalization gap) Gn(fw),
Gn(fw):=|gn(fw)|:=|R(fw)−Rn(fw)|,which addresses a more narrow question: how far apart are the empirical risk and population risk?
Derived generalization metrics
The generalization error, Gn, is a function of both the learned model fw and dataset Dn, so by taking expectation values over data, weights, or both, we derive a family of related generalization metrics.
The relevant distributions include:
From these distributions, we obtain:[2]
And so on.
Bayesian generalization metrics
In a Bayesian setting, we replace the deterministic prediction ^y=fw(x) with a distribution p(y|x,w), which induces a likelihood p(Dn|w)=∏ni=1p(yi|xi,w)q(xi). Under certain assumptions, empirical risk minimization maps onto a corresponding Bayesian problem, in which the empirical risk is the negative log likelihood (up to a constant).[3]
The Bayesian learning algorithm is the posterior,
p(w|Dn)=p(Dn|w)φ(w)p(Dn).But we're not just interested in updating our beliefs over weights — we also want to use our new beliefs to make predictions. Theoretically, the optimal way to do this is to average novel predictions over the ensemble of weights, weighing each machine according to their posterior density. This yields the predictive distribution,
p(y|x,Dn)=Ep(w|Dn)[p(y|x,w)]=∫p(y|x,w)p(w|Dn)dw.In practice, evaluating these kinds of integrals is intractable. A more tractable alternative is Gibbs estimation, where we draw a particular choice of weights w′∼p(w|Dn) and make predictions using the corresponding model p(y|x,w′). This procedure is closer to the kind of paradigm we're in with neural networks: the model we train is a single draw from the distribution of final weights over different initializations and learning schedules.
Predicting according to either the predictive distribution or Gibbs estimation respectively yields two different kinds of generalization error:
- The Bayes generalization error is the KL divergence between the true distribution and the predictive distribution:
GBn:=DKL(q(y|x)∥p(y|x,Dn)),=Eq(y|x)[logq(y|x)Ep(w|Dn)[p(y|x,w)]].- The Gibbs generalization error is the posterior-averaged KL divergence between the true distribution and the model p(y|x,w):
GGn:=Ep(w|Dn)[DKL(q(y|x)∥p(y|x,w))],=Ep(w|Dn)[Eq(y|x)[logq(y|x)p(y|x,w)]].Making predictions for a given model and sampling according to the posterior are non-commutative — it's not possible to move the expectation value in or out of the above logarithm without changing the result. Thus, we get two different approaches to prediction and two different kinds of generalization error.
Bounding generalization
When deriving generalization bounds, we have to decide not only which generalization metric to bound but also how to bound it. There are three main decisions:
Global vs. Local
The first and most important question is whether to bound generalization error over the entire model class or only for a particular model. Classical learning theory takes a global approach. Modern deep learning theory takes a more local approach. Singular learning theory suggests a third mesoscopic approach that obtains generalization bounds for model subclasses.
Global bounds. The global approach is a worst-case approach. The upper bound has to hold for all models, so it's only as tight as your worst-generalizing model. Finding these worst cases typically involves asking how well the model class can fit random noise. A model trained on purely random noise can't generalize at all because there's no signal, so fitting random noise perfectly means worst-case generalization error is terrible. This turns generalization into a question of expressivity, which is operationalized by complexity metrics like the VC dimension, the Rademacher complexity, and covering numbers.
Because neural networks often can fit noise perfectly, they're highly complex according to these measures, and the resulting bounds become vacuous: they trivially predict that test error will be less than 100%.
The reason we observe good generalization in practice is that not all models within a class are equally complex. For example, the success of weight-pruning shows that many weights can safely be set to zero and eliminated. Thus, average generalization can be very different from worst-case generalization.
Local bounds. Instead of studying how well a model class can express noise, you can study how robust a specific model is to noise — noise applied to the inputs (margin theory), noise applied to the weights (minimum sharpness/flatness, compression theory), or noise applied to the training samples (algorithmic stability & robustness). To obtain a generalization bound, it's possible to exchange these kinds of empirical robustness-to-noise for predicted robustness-to-new-data.
Though these results have led to real improvements, the bounds remain far too loose to explain large real-world models. Even for MNIST, the best upper bounds for % test error are an order of magnitude larger than what we actually observe in state-of-the-art models. Will progress eventually catch up, or do these limitations reflect a more fundamental problem with this approach?
Mesoscopic bounds. Instead of studying how much a single model react in response to noise, you can study how the distribution of learned models react in response to noise. With a change of perspective, this turns studying generalization into a question of comparing model subclasses.
This point of view is already implicit in the areas of deep learning theory that obtain generalization bounds involving weight norm, compression, minimum sharpness/flatness, and various information-theoretic quantities. It is made explicit in singular learning theory, which proposes a set of canonical observables for studying model complexity and generalization, such as the learning coefficient, the singular fluctuation, and multiplicity. For idealized Bayesian learners, SLT predicts the values (not just upper bounds) of both Bayes and Gibbs generalization error in the large-data limit.
Dataset-dependence
A second question is whether to bound generalization error for the given dataset or across the distribution of possible datasets. Practically, we only have a single dataset and want to understand how models will generalize for that particular dataset. Theoretically, it's usually easier to marginalize out the dependence on a specific dataset.
Marginalizing the dataset is often justified on the grounds that we're more interested in how the model performs in the limit of large amounts of data. In these limits, the influence of any single data point vanishes. However, this assumption may not hold in situations where the distributions have fatter tails like RL.
Flavors of generalization bounds
The last question is mostly aesthetic: how do we dress up our bounds? Since generalization is made up of both data and weights, this breaks down into two subquestions: how to dress up the bound over data and how to dress up the bound over weights?
Uniform bounds. The strongest kind of bounds are ones that holds uniformly over the model class, that is, bounds involving universal quantifiers:
Gn(fw)<ϵ,∀w∈WThis is defined relative to weights because we're rarely interested in bounds that hold uniformly over datasets. Except for trivial distributions, there are always pathological draws of data that will exhibit poor generalization.
Probabilistic bounds. Weaker (in the sense that they don't always hold) but often tighter (the upper bound is lower) are bounds that hold "in probability":
P(Gn(fw)<ϵ)>1−δ.Though one could formulate a probabilistic bound relative to the distribution over weights, in practice, these bounds are always defined relative to the distribution over data.
Expectation bounds. Weaker yet tighter still are bounds that hold "in expectation." These are bounds that hold after taking an expectation value over weights/datasets:
E[Gn(fw)]<ϵ.Unlike uniform bounds that are usually reserved to weights and probabilistic bounds that are usually reserved to data, expectation bounds treat weights and data on more equal footing.
Combining bounds. Given these choices, it's possible to mix and match. Because these operations are non-commutative, you can swap the order to double the bounds. Vary the choice of underlying probability distribution, and the horrible profusion of different generalization bounds grows further. To simplify matters somewhat, it's often possible to transform bounds in one format to another.
Named families of bounds
Many of the particular combinations of choices for how and what to bound have dedicated names. Unfortunately, these names are neither exhaustive nor exclusive nor consistent.
A few of the more important ones:
In principle, the qualitative shape of the bound is orthogonal to the question of which specific quantities show up in the bounds. In practice, the distinction often gets muddled. For example, "PAC Bayes" refers both to the general idea of a PAC-style bound on Ep(w|Dn)[Gn] and to the specific idea of bounds that depend on a KL divergence between the posterior and prior. So it goes.
Uniform convergence
Uniform convergence is the most classical of the classical approaches to generalization theory. The term comes from analysis, where we say that a sequence of functions fn converges uniformly to a limiting function f if for every ϵ>0 there exists a natural number N such that for all n>N and for all inputs x,
|fn(x)−f(x)|<ϵ.In learning theory, we're interested in the empirical risk converging uniformly to the true risk, which means that the difference between the empirical risk and true risk becomes arbitrarily small for every model provided you have enough data (Hellström et al. 2023, Shalev-Shwartz et al. 2010):
supq∈QEq(Dn)[supw∈WGn(fw)]n→∞⟶0.Confusingly, the expectation value over data can be replaced by a probabilistic bound. The important bit of uniform convergence is that it is a form of "worst-case" analysis: you obtain a bound that holds uniformly both across all possible models in the model class and across all possible choices of data distribution q∈Q.
There are two problems with uniform convergence from the perspective of deep learning:
Probably Approximately Correct (PAC)
The PAC-learning framework (Valiant 1984) addresses the second weakness of uniform convergence bounds: PAC-style bounds are defined relative to some fixed data distribution. The bounds still hold uniformly over the model class but are now defined in probability over data:
Pq(Dn)[supw∈WGn(fw)<ϵ]≥1−δ.Your model class will "probably" (with probability at least 1−δ) be "approximately correct" (with generalization error less than ϵ).
In particular, learning theorists then try to relate ϵ and δ to each other and other hyperparameters like the number of parameters d, the number of samples n, the weight norm |w|, etc. Then, you can invert the relation to figure out how large you should make your model and how many samples you need to obtain a desired error with a desired probability.
For example, if the model class F is finite, then, it satisfies the PAC bound so long as the sample size obeys
n≥12ϵ2(log2|F|+log1δ).The minimum n that satisfies a bound of this kind is called the sample complexity.
Historically, much of the work in PAC-style generalization theory has looked like trying to replace the log2|F| term above with a suitable finite complexity measure when the model class becomes infinite (see comments by Telgarsky 2021).
Mean Approximately Correct (MAC)
PAC-learning is more relaxed than uniform convergence, but it's still a worst-case approach where the bounds have to hold across the entire model class. The relaxation is in the data distribution. Instead, we can consider an average-case bound that holds on expectation over learned models and different draws of the data,
Ep(w,Dn)[Gn(fw)]<ϵ.A practical example is the following bound,
∣∣Ep(w,Dn)[gn(fw)]∣∣≤√I(w∥Dn)2n,in terms of the mutual information between weights and data
I(w∥Dn)=DKL(p(w,Dn)∥p(w)q(Dn)).PAC Bayes
PAC Bayes (McAllester 1999, Shawe-Taylor and Williamson, 1997) keeps the expectation value over weights but swaps the expectation value over data with a probabilistic bound over data:
Pq(Dn)[Ep(w|Dn)[Gn(fw)]<ϵ]<1−δ.It's called "PAC Bayes" because the distribution over data is usually taken to be a Bayesian posterior. Even when the distribution isn't a posterior, the Bayesian posterior is still often the learning algorithm that minimizes the resulting bound.
In the real world, the bounds end up looking something like (Dziugaite et al., 2021):
Ep(w|Dn)[R(fw)−1βRn(fw)]≤12nβ(1−β)(DKL(p(w|Dn)∥φ(w))+log1δ),where φ is the prior over w and β controls the tradeoff between accuracy and complexity.
Data-dependent priors. When dealing with PAC-Bayes bounds involving a KL divergence between the prior and posterior, a common dirty trick of the trade is to adopt "data-dependent priors." If φ(w) is allowed to depend on data, then you can obtain smaller KL divergences on the right-hand side, thus tightening the bound.
One such approach (Ambroladze et al., 2006; Dziugaite et al., 2021) involves splitting the dataset Dn into two subsets, Bm and Pn−m. The learning procedure remains the same as before (i.e., you condition the posterior on the entire dataset), but now you evaluate the training loss inside the bound solely on Pn−m. You can then let the prior depend on Bm without violating the manipulations used to obtain this result.
Yes, this is a bit nasty, but we'll see later that this is getting at a deeper and (possibly) theoretically justified point: the idea that we should view generalization in terms of how a probability distribution changes in response to additional data.
Measuring "complexity"
Look at the preceding examples of generalization bounds, and you'll notice the upper bounds all involve some kind of complexity measure: e.g., the number of models for a finite model class |F|, the mutual information between weights and data I(w∥Dn), and the Kullback-Leibler divergence between the prior and posterior, DKL(p∥φ).
This is true for more or less all generalization bounds. By moving terms around (this requires some care when there are expectations or probabilities involved), we can re-express generalization bounds as,
L(w)≤B(Ln(w),Λn(w)),where Λn(w) is some notion of complexity, and B is a monotonically increasing function in both arguments. These expressions formalize Occam's razor: achieving high performance on novel data (low L(w)) requires trading off high accuracy on the training data (low Ln(w)) against simplicity (low Λn(w)).
In this section, we'll skip over the details of how various bounds are derived[4] and instead directly examine the notions of complexity that show up in the final relations.
Flavors of complexity. At a high-level, there are three main flavors of complexity measures. These map onto the distinction between global, local, and mesoscopic generalization bounds discussed earlier. These are, respectively:
Model-class complexity
Most notions of model-class (as well as single-model) complexity measure how well a model class can express noise. If your model class can fit more noise, then it's more complex.
Vapnik-Chervonenkis (VC) dimension
We'll start with the best-known model-class complexity measure: the VC dimension. This requires us to go back to binary classification.
We start by defining the growth function, which is the maximum number of different ways a model class can classify an arbitrary set of inputs (x1,…,xm)∈Xm:
gF(m):=max(x1,…,xm)∈Xm|{(fw(x1),…,fw(xm)):w∈W}|≤2m.If the growth function saturates its upper bound, then we say that the model class "shatters" that set of samples. This means the model class always contains a model that can perfectly classify the given inputs under any possible relabeling.
So a straight line can shatter a set of three non-collinear points on a 2D, but not four. A parabola can shatter four points but not five. And so on.
Then, the VC dimension is the largest set of points that can be shattered by a given model class.
$$d_{\mathrm{VC}}:=\max\{m\in\mathbb{N}:g_{\mathcal{F}}(m)=2^m\}.$$
In terms of the VC dimension dVC, we end up satisfying the PAC bound if our sample size is
$$n\;\ge\;\frac{c}{\epsilon}\left(d_{\mathrm{VC}}\log\frac{1}{\epsilon}+\log\frac{1}{\delta}\right),$$
where c is a constant. We replaced the term depending on the size of F in the original PAC bound with a term that depends on the VC dimension of a potentially infinite model class.
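To make shattering concrete, here's a brute-force check (my own sketch, using a small feasibility LP) that a line in the plane shatters three non-collinear points but not four:

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """Feasibility LP: does some (w, b) satisfy y_i (w . x_i + b) >= 1 for all i?"""
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    # Constraints: -y_i * (w . x_i + b) <= -1, with variables (w_1, w_2, b).
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    b_ub = -np.ones(len(X))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0  # feasible => separable

def shattered_by_lines(points):
    """A set is shattered iff every +/-1 labeling is linearly separable."""
    return all(linearly_separable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))

print(shattered_by_lines([(0, 0), (1, 0), (0, 1)]))          # True: 3 non-collinear points
print(shattered_by_lines([(0, 0), (1, 0), (0, 1), (1, 1)]))  # False: the XOR labeling fails
```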
The VC dimension is straightforward to generalize to more realistic settings (e.g., the Natarajan dimension for multi-class classification and Pollard's pseudo-dimension for regression). It's also possible to derive tighter bounds by including information about the shape of the data distribution as in the fat-shattering dimension.
The problem with applying these variants to neural networks is that we are typically working in or near the over-parametrized regime, where the number of parameters is larger than the number of samples. In this regime, as Zhang et al. showed, we can typically shatter the entire training set, so dVC is on the order of n and the bound is not satisfied for reasonable tolerances.[5]
Rademacher complexity
The basic idea behind the Rademacher complexity, like the VC dimension, is to measure how well a function class can fit random noise. First, let us define the empirical Rademacher complexity of a model class W with respect to a dataset Dn:
$$\mathrm{Rad}_{D_n}(W):=\mathbb{E}_{p(\sigma_1,\ldots,\sigma_n)}\left[\sup_{w\in W}\frac{1}{n}\sum_{i=1}^{n}\sigma_i f_w(x_i)\right].$$
The idea is to choose a set of random Rademacher variables {σi}ni=1 that take on the values −1 and 1 with equal probability. We apply these random sign flips to each of our individual predictions and take the average. Then, we maximize the average flipped prediction over models. Finally, we average over different sign flips.
Intuitively, imagine separating your dataset into a train and test set. The Rademacher complexity measures the maximum discrepancy over the model class between your predictions for the two sets.
From this, the Rademacher complexity of W with respect to a probability distribution, q(Dn) is the expectation of the empirical Rademacher complexity taken over draws of the dataset:
$$\mathrm{Rad}_{n}(W):=\mathbb{E}_{q(D_n)}\big[\mathrm{Rad}_{D_n}(W)\big].$$
In terms of the Rademacher complexity, we obtain the PAC bound when
$$\epsilon\;\le\;2\,\mathrm{Rad}_n(W)+3c\sqrt{\frac{\log(2/\delta)}{2n}},$$
where c is a constant that depends on the range of functions in F.
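For a finite model class you can estimate the empirical Rademacher complexity directly by Monte Carlo over sign flips; the following sketch (with a toy class of random linear classifiers, all choices my own) just illustrates the definition:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(predictions, num_draws=2000):
    """Monte Carlo estimate of the empirical Rademacher complexity of a *finite*
    model class, given `predictions` of shape (num_models, n): for each random
    sign vector sigma, take the max over models of mean(sigma * predictions)."""
    num_models, n = predictions.shape
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)
        total += np.max(predictions @ sigma) / n
    return total / num_draws

# A "model class" of 50 random linear classifiers evaluated on 200 inputs:
X = rng.normal(size=(200, 10))
W = rng.normal(size=(50, 10))
preds = np.sign(W @ X.T)            # shape (50, 200), values in {-1, +1}
print(empirical_rademacher(preds))  # grows if you add more (or wilder) models
```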
Since the Rademacher complexity is uniform across W, it runs into the same issues as the VC dimension. Another problem is that it is not necessarily easy to calculate. In practice, learning theorists often resort to further upper-bounding the Rademacher complexity in terms of covering numbers.
Covering numbers
The covering number is the minimum number of balls of a given radius needed to cover a given space (with overlaps allowed). The idea is that we're "coarse-graining" the model class with a finite collection of representative elements.
In terms of Nr(W), the covering number of W for balls of radius r = ε/(2c), we satisfy the PAC bound if
$$n\;\ge\;\frac{2c}{\epsilon^2}\left(\log\big(2N_r(W)\big)+\log\frac{1}{\delta}\right),$$
where c is some constant.
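As a rough illustration of what computing Nr(W) involves in practice, here's a greedy sketch that upper-bounds the covering number of a finite cloud of sampled weight vectors (the sampling and radii are arbitrary choices of mine):

```python
import numpy as np

def greedy_cover_size(points, radius):
    """Greedy upper bound on the covering number of a finite point cloud:
    repeatedly pick an uncovered point as a ball center and discard everything
    within `radius` of it. (A true covering number minimizes over all possible
    centers; greedy just produces *a* cover, hence an upper bound.)"""
    remaining = np.asarray(points, dtype=float)
    count = 0
    while len(remaining):
        center = remaining[0]
        dists = np.linalg.norm(remaining - center, axis=1)
        remaining = remaining[dists > radius]
        count += 1
    return count

# Coarse-grain 5000 random weight vectors in 10 dimensions at two scales:
rng = np.random.default_rng(0)
W = rng.normal(size=(5000, 10))
print(greedy_cover_size(W, radius=3.0), greedy_cover_size(W, radius=1.0))
```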
By default, Nr(W) will scale exponentially with model dimension d, so the resulting bounds don't end up much tighter than the VC bounds. However, conceptually, this is getting closer to the idea of model-subclass complexity. The differences are that covering numbers still treat each subclass as identical, and we're not yet coarse-graining in any principled way.
Single-model complexity
Instead of constraining entire model classes, it's much more tractable to study the complexity of individual models. Most notions of single-model complexity are based on the idea of complexity as sensitivity to noise: noise applied to the inputs (margin theory), noise applied to the weights (minimum sharpness/flatness), and noise applied to the training samples (algorithmic stability & robustness).
Margins
Margin theory studies the eponymous margin, which is the minimum distance from a sample to the decision boundary of a classifier. The larger the margin, the more robust the classifier is to small perturbations in the input data, thus the better we expect it to generalize to novel data.
This dates back to the good old days of support vector machines (SVMs), for which you can explicitly calculate margins and use them to design more robust classifiers (see Vapnik, 1995). With DNNs, calculating margins is intractable.
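For linear classifiers the margin has a closed form, which is part of why SVM-era theory leaned on it so heavily. A minimal sketch (the toy points and weights below are mine):

```python
import numpy as np

def geometric_margins(w, b, X, y):
    """Signed distance of each sample to the decision boundary of the linear
    classifier sign(w . x + b); the classifier's margin is the minimum."""
    w, X, y = np.asarray(w, float), np.asarray(X, float), np.asarray(y, float)
    return y * (X @ w + b) / np.linalg.norm(w)

# Two hand-picked 2D points on either side of the line x_1 + x_2 = 0:
X = np.array([[1.0, 1.0], [-2.0, -1.0]])
y = np.array([1.0, -1.0])
margins = geometric_margins(w=[1.0, 1.0], b=0.0, X=X, y=y)
print(margins, margins.min())  # the smallest margin is the one margin theory cares about
```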
The more fundamental problem is that the assumption that "large margins = better generalization" doesn't hold for realistic models. Which of the following two decision boundaries would you say generalizes better?
Margin theory certainly seems to have some explanatory power, especially in settings with simpler and more structured data (e.g., Mohamadi et al. 2023). The basic idea of studying sensitivity to input noise remains a good one. It's just that the specific focus on margins appears to be overemphasized.
Minimum flatness/sharpness
Where margin theory hinges on the intuition that robustness to input perturbations should lead to generalization, "minimum flatness" hinges on the intuition that robustness to weight perturbations should lead to generalization.
If the curvature of the loss landscape (as measured by the Hessian) is low, then a small change to the weights means only a small change in the loss. So we might expect the loss landscape not to change much when we perturb it by introducing novel samples (i.e., when we ask the model to generalize). Conversely, if the curvature is high, the model is more complex and likely to generalize poorly.
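A crude way to operationalize this intuition is to measure how much the loss rises under small random weight perturbations. The sketch below (my own, with made-up toy losses) does exactly that; note that this naive measure already suffers from the reparametrization problem discussed next:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturbation_sharpness(loss_fn, w, rho=0.05, num_samples=100):
    """Crude flatness probe: average increase in the loss when the weights are
    jittered by isotropic Gaussian noise of scale `rho`. Not invariant to
    reparametrizations of the weights."""
    base = loss_fn(w)
    increases = [loss_fn(w + rho * rng.normal(size=w.shape)) - base
                 for _ in range(num_samples)]
    return np.mean(increases)

# A flat valley vs. a sharp one, both with minimum value 0 at the origin:
flat = lambda w: 0.1 * np.sum(w ** 2)
sharp = lambda w: 10.0 * np.sum(w ** 2)
w_star = np.zeros(20)
print(perturbation_sharpness(flat, w_star), perturbation_sharpness(sharp, w_star))
```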
As is true for many ideas in deep learning, this dates back to a paper by Hochreiter and Schmidhuber written in the '90s. They offer intuition built on the minimum description length (MDL) principle from algorithmic complexity theory: in the language of MDL, fewer bits are needed to specify the location of "flat" minima, and simpler functions should generalize better.

Keskar et al. (2017) demonstrate that minibatch SGD converges to flatter minima and that this correlates strongly with the generalization error. However, Neyshabur et al. (2017a) and Dinh et al. (2017) point out that this naive approach is flawed: sharpness can be arbitrarily scaled by reparametrizations, and models trained on random labels routinely have flatter minima than models trained on true labels. To remedy this, Kwon et al. (2021) propose adaptive sharpness, which is invariant to reparametrizations and seems to correlate well with generalization error.
As it turns out, adaptive sharpness isn't enough to save sharpness theory. Andriushchenko et al. (2023) find that counter to classical intuitions, sharpness — and even the smarter reparametrization-invariant notions — correlate poorly or even negatively with generalization in transformers on larger datasets. In their words, "sharpness is not a good indicator of generalization in the modern setting."
There's a more fundamental problem: neural networks are singular, so modeling the loss landscapes locally with paraboloids is invalid. Studying sensitivity to changes in weights requires more advanced tooling from algebraic geometry.
Algorithmic stability and robustness
Yet another notion of sensitivity to noise is sensitivity of the final model to changes in the training set. This is the idea behind Bousquet and Elisseeff's (2000) algorithmic stability: if changing a training sample doesn't change a model's performance, the empirical risk should have low variance.
A learning algorithm is said to have uniform stability β if
$$\max_i\Big\{\Big|\,\ell\big(f_{w(D_n)}(x_i),y_i\big)-\ell\big(f_{w(D_n^{\setminus i})}(x_i),y_i\big)\,\Big|\Big\}\;\le\;\beta.$$
It's an upper bound on how much the loss on any one training sample (xi, yi) can change when that sample is omitted during training.
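Uniform stability itself involves a supremum over all datasets, which you can't compute, but you can probe its empirical cousin by leave-one-out retraining. A sketch (the logistic-regression learner and the loss below are my own toy choices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def loo_stability(X, y, fit_fn, loss_fn):
    """Empirical leave-one-out stability: the largest change in the loss on
    sample i when sample i is removed from the training set. (Uniform stability
    is a sup over *all* datasets; this only probes the one we have.)"""
    full_model = fit_fn(X, y)
    deltas = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        loo_model = fit_fn(X[mask], y[mask])
        deltas.append(abs(loss_fn(full_model, X[i], y[i]) -
                          loss_fn(loo_model, X[i], y[i])))
    return max(deltas)

fit = lambda X, y: LogisticRegression(C=1.0).fit(X, y)
nll = lambda m, x, y: -np.log(m.predict_proba(x.reshape(1, -1))[0, int(y)])

X = rng.normal(size=(100, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)
print(loo_stability(X, y, fit, nll))  # stronger regularization (smaller C) shrinks this
```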
Uniform stability leads to the following PAC-style bound for the specific weights obtained by the learning algorithm (Bousquet and Elisseeff, 2002):
$$G_n(f_w)\;\le\;2\beta+(4n\beta+1)\sqrt{\frac{\log\frac{1}{\delta}}{2n}}.$$
This avenue of research has strong parallels to work on influence functions (e.g., Grosse et al., 2023) and to work on differential privacy starting with Dwork et al. (2006), which is about making sure that you can't tell whether a sample belongs to the training set. It's also been extended to algorithmic robustness by Xu and Mannor (2012), which looks at sensitivity to changing the entire dataset rather than individual examples.
This is getting closer to the perspective we already glimpsed with the PAC-Bayes data-dependent priors, where model complexity was related to how the model changes in response to additional data. It's a perspective we'll see again in SLT.
Model-subclass complexity
So far, we've seen generalization as a question about expressivity of a model class (how well it can fit random noise), and we've seen generalization as a question about robustness of specific models (how well they can "resist" small amounts of noise). Another possibility is to view generalization as a question about the robustness of the distribution of learned models p(w|Dn).
In this section, we'll explore a variety of complexity measures that involve examining how a distribution of learning machines changes, particularly in response to new data. We'll see in the next section that this turns studying generalization into a problem of comparing subclasses of weights.
Information-theoretic metrics
One of the more principled approaches to studying generalization in modern deep learning theory comes from a strand of information-theoretic complexity measures that developed in parallel to (and in isolation from) modern PAC-Bayes theory. The complexity measures that show up in these bounds are information-theoretic metrics in terms of the various distributions over weights and data. The associated bounds are in expectation over weights — not uniform.
We already saw one example in terms of the mutual information between weights and data, I(w∥Dn), which is the Kullback-Leibler divergence between the joint distribution and the product of the marginal distributions,
$$I(w\|D_n)=D_{\mathrm{KL}}\big(p(w,D_n)\,\|\,p(w)\,q(D_n)\big).$$
This may be the most intuitive notion of complexity so far: models are more complex if they contain more information about the dataset.
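For a deterministic learner, I(w‖Dn) collapses to the entropy of the learned weights over redraws of the dataset, which you can estimate in a toy setting. The coin-flip setup and grid "learner" below are entirely my own illustration:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def mutual_information_toy(n=20, num_datasets=50_000, grid=10):
    """I(w ; D_n) for a *deterministic* learner equals H(w), the entropy of the
    learned weights over redraws of the dataset. Toy setup: datasets are n coin
    flips with bias 0.7, and the 'learner' rounds the empirical mean to a grid."""
    counts = Counter()
    for _ in range(num_datasets):
        data = rng.random(n) < 0.7
        w = round(grid * data.mean()) / grid   # the learned "weight"
        counts[w] += 1
    probs = np.array(list(counts.values())) / num_datasets
    return -np.sum(probs * np.log(probs))      # entropy in nats

print(mutual_information_toy())  # a finer grid (a more expressive learner) extracts more bits
```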
We saw another example in the KL divergence between the posterior and prior DKL(p∥φ) that showed up in the PAC-Bayes example. Models are "simpler" if their weights are "closer" to their initial values.
Many of the bounds involving these KL divergences can be generalized in terms of other divergences, such as the broader family of f-divergences, Rényi divergences, (conditional) maximal leakages, Wasserstein metrics, etc. There are many more tricks to squeeze out a few extra bits of generalization here and there.
Unfortunately, these bounds run into problems when applying them to real-world systems because these information-theoretic metrics are typically intractable to calculate or even estimate.
Compressibility
Compression-based bounds build on the same idea behind minimum flatness/sharpness of robustness to changes in weights. If the model is highly compressible and you can discard many weights without impacting performance, the model's effective dimensionality is lower than its apparent dimensionality. It's simpler than it looks and should therefore generalize.
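A back-of-the-envelope version of this idea: prune the smallest-magnitude weights of a trained model and watch how little the accuracy moves. The sketch below (a toy logistic regression in which only a few features matter) is just an illustration, not anyone's published procedure:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Train a linear model on data where only 5 of 50 features matter.
X = rng.normal(size=(2000, 50))
y = (X[:, :5].sum(axis=1) > 0).astype(int)
model = LogisticRegression(max_iter=1000).fit(X, y)

def accuracy_after_pruning(model, X, y, keep_fraction):
    """Zero out all but the largest-magnitude weights and re-measure accuracy.
    If accuracy barely moves, the model is more compressible than it looks."""
    w = model.coef_.ravel().copy()
    k = int(keep_fraction * len(w))
    w[np.argsort(np.abs(w))[:-k]] = 0.0       # keep only the top-k weights
    preds = (X @ w + model.intercept_ > 0).astype(int)
    return (preds == y).mean()

for frac in [1.0, 0.5, 0.2, 0.1]:
    print(frac, accuracy_after_pruning(model, X, y, frac))
```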
Arora et al. (2018) provide the seminal treatment. They convert a given model into a set of simpler, compressed models, which lets them apply the tooling of the typical model-class-complexity bounds to this family. The framework doesn't directly predict the generalization of the original model but of the constructed subclass.
Zhou et al. (2019) combine this idea with the idea of data-dependent priors that shows up in PAC-Bayes theory. The posterior is chosen to be a Gaussian centered at the learned weights and the prior is chosen to be the set of compressed weights (obtained by pruning). Similar to the approach taken by Dziugaite and Roy (2017), they add in a dash of minimum sharpness by overlaying Gaussian noise over the non-zero weights. The resulting bounds are among the first to be non-vacuous for neural networks, but they're still far from tight.
The main limitation of this approach seems to be that these subclasses are currently constructed in a rather ad-hoc way. For example, singular posteriors are not asymptotically Gaussian, so modeling posteriors as Gaussian is unfounded. In addition, neural networks have many more kinds of degeneracy than just weights that can be pruned (see for example Hanin and Rolnick 2019). What this means is that current bounds likely severely underestimate how compressible neural networks really are.
One contender for a more natural notion of compressibility is SLT's learning coefficient. Though the exact link with ideas such as the Minimum Description Length (MDL) principle and compressibility is an open question, it's clear that the learning coefficient captures a much richer set of degeneracies than, for example, the number of weights you can prune.
The free energy / stochastic complexity
Given a likelihood and prior, we can define the Bayesian free energy (or stochastic complexity) Fn as the negative logarithm of the marginal likelihood (or model evidence):
$$F_n=-\log p(D_n)=-\log\int_W p(D_n|w)\,\varphi(w)\,dw=-\log\int_W e^{-nL_n(w)}\,\varphi(w)\,dw,$$
where L_n(w) =