Thanks for the very comprehensive review on generalisation!
As a historical note / broader context, the worry about model class over-expressivity has been there in the early days of Machine Learning. There was a mistrust of large blackbox models like random forest and SVM and their unusually low test or even cross-validation loss, citing ability of the models to fit noise. Breiman frank commentary back in 2001, "Statistical Modelling: The Two Cultures", touch on this among other worries about ML models. The success of ML has turn this worry into the generalisation puzzle. Zhang et. al. 2017 being a call to arms when DL greatly exacerbated the scale and urgency of this problem.
I agree with the post that approximation-generalisation-optimisation is very much entangled in this puzzle. The question should by why a model + training algorithm combo is likely to select generalising solutions among many other solutions (i.e. those with good training fit) and not just whether the best fit model generalise, whether particular optimisation algorithm can find it or how simple the model class is.
SLT gave the answer that models with very singular solutions + bayesian training will generalise very well. This still leaves open the questions of why DL models on "natural" datasets has very singular solutions and how much does the bayesian results tell us about DL models with SGD-type training.
Naive optimism: hopefully progress towards a strong resolution to the generalisation puzzle give us understanding enough to gain control on what kind of solutions are learned. And one day we can ask for more than generalisation, like "generalise and be safe".
As a historical note / broader context, the worry about model class over-expressivity has been there in the early days of Machine Learning. There was a mistrust of large blackbox models like random forest and SVM and their unusually low test or even cross-validation loss, citing ability of the models to fit noise. Breiman frank commentary back in 2001, "Statistical Modelling: The Two Cultures", touch on this among other worries about ML models. The success of ML has turn this worry into the generalisation puzzle. Zhang et. al. 2017 being a call to arms when DL greatly exacerbated the scale and urgency of this problem.
Yeah it surprises me that Zhang et al. (2018) has had the impact it did when, like you point out, the ideas have been around for so long. Deep learning theorists like Telgarsky point to it as a clear turning point.
Naive optimism: hopefully progress towards a strong resolution to the generalisation puzzle give us understanding enough to gain control on what kind of solutions are learned. And one day we can ask for more than generalisation, like "generalise and be safe".
This I can stand behind.
Repeating a question I asked Jesse earlier, since others might be interested in the answer: how come we tend to hear more about PAC bounds than MAC bounds?
I think this mostly has to do with the fact that learning theory grew up in/next to computer science where the focus is usually worst-case performance (esp. in algorithmic complexity theory). This naturally led to the mindset of uniform bounds. That and there's a bit of historical contingency: people started doing it this way, and early approaches have a habit of sticking.
In 2018, Zhang et al. showed that deep neural networks can achieve perfect training loss on randomly labeled data.
This was a Big Deal.
It meant that existing generalization theory couldn't explain why deep neural networks generalize. That's because classical approaches to proving that a given model class (=neural network architecture) would generalize involved showing that it lacks the expressivity to fit noise. If a model class can fit noise arbitrarily well, the resulting bounds break.
So something needed to change.
Evidently, you can't prove tight generalization bounds for entire model classes, so theorists turned to studying generalization bounds for individual models within a model class. If you can empirically show that a model's performance doesn't change substantially when you perturb it (by adding noise to the inputs, weights, training samples, etc.), then you can theoretically prove that that model will generalize to new data.
As a result, the bounds have gotten tighter, but they're still not exactly flattering.
What's really needed is a secret third thing. It's not about either model classes or individual models but about model subclasses. While the model class as a whole may be too complex to obtain tight generalization bounds, individual subclasses can achieve an optimal trade-off between accuracy and complexity. For singular Bayesian learning machines, this trade-off happens automatically.
This more or less answers why models are able to generalize but not how they do it.
Singular learning theory (SLT) provides one possible path towards understanding the how of generalization. This approach is grounded in the geometry of the loss landscape, which in turn is grounded in the symmetries of the model class and data. If this direction pans out, then learning theory is posed for a revolution analogous to the transition between thermodynamics and statistical physics.
The central aim of classical learning theory is to bound various kinds of error: in particular, the approximation error, generalization error, and optimization error.
In the previous post in this series, we looked at the approximation error, which measures the performance of a model class's hypothetically optimal model. If your model class isn't expressive enough to do a reasonable job of modeling the data, then it doesn't matter that your model generalizes to new data or that your learning process actually reaches that optimum in practice.
In this post, we'll look at bounding the generalization error, which measures how well a model parametrized by transfers from a finite training set to additional samples drawn from the same distribution. We won't cover the related question of out-of-distribution generalization, which asks how a model will perform on data from a different distribution.
Last time, we started to see that approximation and generalization could not be separated in practice. Here, we'll see the same holds for generalization and optimization. SLT and many other strands of generalization theory point to a deep relation, which views generalization in nearly dynamical terms involving changes to probability distributions.
In the next post in this sequence, we'll examine the optimization error, which measures how close learning processes get to global optima. Coming at the connection with generalization from the other direction, we'll view optimization in quasistatic terms as a process of selecting among qualitatively distinct generalization strategies.
This is mainly based on lecture notes by Telgarsky (2021), a recent monograph by Hellström et al. (2023), and Watanabe (2009, 2018).
This is a self-contained introduction to generalization theory, as developed in three different schools of learning theory: classical learning theory, deep learning theory, and singular learning theory. There's a lot to cover:
To recap, classical learning theory is primarily grounded in the paradigm of Empirical Risk Minimization (ERM). ERM is a story about two kinds of risk.
The empirical risk is what you might recognize as the "training loss",
where the average is taken over the samples in your dataset .[1] Learning theorists maintain a sensible distinction between the individual-sample "loss" function and the average-loss "risk" .
The population risk (or "true" risk) is the expectation of the loss over the true distribution from which the dataset is drawn:
In its broadest form, generalization is the question of how a model transfers from a finite dataset to new data. In practice, learning theorists study the generalization error (or generalization gap) ,
which addresses a more narrow question: how far apart are the empirical risk and population risk?
The generalization error, , is a function of both the learned model and dataset , so by taking expectation values over data, weights, or both, we derive a family of related generalization metrics.
The relevant distributions include:
From these distributions, we obtain:[2]
And so on.
In a Bayesian setting, we replace the deterministic prediction with a distribution , which induces a likelihood . Under certain assumptions, empirical risk minimization maps onto a corresponding Bayesian problem, in which the empirical risk is the negative log likelihood (up to a constant).[3]
The Bayesian learning algorithm is the posterior,
But we're not just interested in updating our beliefs over weights — we also want to use our new beliefs to make predictions. Theoretically, the optimal way to do this is to average novel predictions over the ensemble of weights, weighing each machine according to their posterior density. This yields the predictive distribution,
In practice, evaluating these kinds of integrals is intractable. A more tractable alternative is Gibbs estimation, where we draw a particular choice of weights and make predictions using the corresponding model . This procedure is closer to the kind of paradigm we're in with neural networks: the model we train is a single draw from the distribution of final weights over different initializations and learning schedules.
Predicting according to either the predictive distribution or Gibbs estimation respectively yields two different kinds of generalization error:
Making predictions for a given model and sampling according to the posterior are non-commutative — it's not possible to move the expectation value in or out of the above logarithm without changing the result. Thus, we get two different approaches to prediction and two different kinds of generalization error.
When deriving generalization bounds, we have to decide not only which generalization metric to bound but also how to bound it. There are three main decisions:
The first and most important question is whether to bound generalization error over the entire model class or only for a particular model. Classical learning theory takes a global approach. Modern deep learning theory takes a more local approach. Singular learning theory suggests a third mesoscopic approach that obtains generalization bounds for model subclasses.
Global bounds. The global approach is a worst-case approach. The upper bound has to hold for all models, so it's only as tight as your worst-generalizing model. Finding these worst cases typically involves asking how well the model class can fit random noise. A model trained on purely random noise can't generalize at all because there's no signal, so fitting random noise perfectly means worst-case generalization error is terrible. This turns generalization into a question of expressivity, which is operationalized by complexity metrics like the VC dimension, the Rademacher complexity, and covering numbers.
Because neural networks often can fit noise perfectly, they're highly complex according to these measures, and the resulting bounds become vacuous: they trivially predict that test error will be less than 100%.
The reason we observe good generalization in practice is that not all models within a class are equally complex. For example, the success of weight-pruning shows that many weights can safely be set to zero and eliminated. Thus, average generalization can be very different from worst-case generalization.
Local bounds. Instead of studying how well a model class can express noise, you can study how robust a specific model is to noise — noise applied to the inputs (margin theory), noise applied to the weights (minimum sharpness/flatness, compression theory), or noise applied to the training samples (algorithmic stability & robustness). To obtain a generalization bound, it's possible to exchange these kinds of empirical robustness-to-noise for predicted robustness-to-new-data.
Though these results have led to real improvements, the bounds remain far too loose to explain large real-world models. Even for MNIST, the best upper bounds for % test error are an order of magnitude larger than what we actually observe in state-of-the-art models. Will progress eventually catch up, or do these limitations reflect a more fundamental problem with this approach?
Mesoscopic bounds. Instead of studying how much a single model react in response to noise, you can study how the distribution of learned models react in response to noise. With a change of perspective, this turns studying generalization into a question of comparing model subclasses.
This point of view is already implicit in the areas of deep learning theory that obtain generalization bounds involving weight norm, compression, minimum sharpness/flatness, and various information-theoretic quantities. It is made explicit in singular learning theory, which proposes a set of canonical observables for studying model complexity and generalization, such as the learning coefficient, the singular fluctuation, and multiplicity. For idealized Bayesian learners, SLT predicts the values (not just upper bounds) of both Bayes and Gibbs generalization error in the large-data limit.
A second question is whether to bound generalization error for the given dataset or across the distribution of possible datasets. Practically, we only have a single dataset and want to understand how models will generalize for that particular dataset. Theoretically, it's usually easier to marginalize out the dependence on a specific dataset.
Marginalizing the dataset is often justified on the grounds that we're more interested in how the model performs in the limit of large amounts of data. In these limits, the influence of any single data point vanishes. However, this assumption may not hold in situations where the distributions have fatter tails like RL.
The last question is mostly aesthetic: how do we dress up our bounds? Since generalization is made up of both data and weights, this breaks down into two subquestions: how to dress up the bound over data and how to dress up the bound over weights?
Uniform bounds. The strongest kind of bounds are ones that holds uniformly over the model class, that is, bounds involving universal quantifiers:
This is defined relative to weights because we're rarely interested in bounds that hold uniformly over datasets. Except for trivial distributions, there are always pathological draws of data that will exhibit poor generalization.
Probabilistic bounds. Weaker (in the sense that they don't always hold) but often tighter (the upper bound is lower) are bounds that hold "in probability":
Though one could formulate a probabilistic bound relative to the distribution over weights, in practice, these bounds are always defined relative to the distribution over data.
Expectation bounds. Weaker yet tighter still are bounds that hold "in expectation." These are bounds that hold after taking an expectation value over weights/datasets:
Unlike uniform bounds that are usually reserved to weights and probabilistic bounds that are usually reserved to data, expectation bounds treat weights and data on more equal footing.
Combining bounds. Given these choices, it's possible to mix and match. Because these operations are non-commutative, you can swap the order to double the bounds. Vary the choice of underlying probability distribution, and the horrible profusion of different generalization bounds grows further. To simplify matters somewhat, it's often possible to transform bounds in one format to another.
Many of the particular combinations of choices for how and what to bound have dedicated names. Unfortunately, these names are neither exhaustive nor exclusive nor consistent.
A few of the more important ones:
In principle, the qualitative shape of the bound is orthogonal to the question of which specific quantities show up in the bounds. In practice, the distinction often gets muddled. For example, "PAC Bayes" refers both to the general idea of a PAC-style bound on and to the specific idea of bounds that depend on a KL divergence between the posterior and prior. So it goes.
Uniform convergence is the most classical of the classical approaches to generalization theory. The term comes from analysis, where we say that a sequence of functions converges uniformly to a limiting function if for every there exists a natural number such that for all and for all inputs ,
In learning theory, we're interested in the empirical risk converging uniformly to the true risk, which means that the difference between the empirical risk and true risk becomes arbitrarily small for every model provided you have enough data (Hellström et al. 2023, Shalev-Shwartz et al. 2010):
Confusingly, the expectation value over data can be replaced by a probabilistic bound. The important bit of uniform convergence is that it is a form of "worst-case" analysis: you obtain a bound that holds uniformly both across all possible models in the model class and across all possible choices of data distribution .
There are two problems with uniform convergence from the perspective of deep learning:
The PAC-learning framework (Valiant 1984) addresses the second weakness of uniform convergence bounds: PAC-style bounds are defined relative to some fixed data distribution. The bounds still hold uniformly over the model class but are now defined in probability over data:
Your model class will "probably" (with probability at least ) be "approximately correct" (with generalization error less than ).
In particular, learning theorists then try to relate and to each other and other hyperparameters like the number of parameters , the number of samples , the weight norm , etc. Then, you can invert the relation to figure out how large you should make your model and how many samples you need to obtain a desired error with a desired probability.
For example, if the model class is finite, then, it satisfies the PAC bound so long as the sample size obeys
The minimum that satisfies a bound of this kind is called the sample complexity.
Historically, much of the work in PAC-style generalization theory has looked like trying to replace the term above with a suitable finite complexity measure when the model class becomes infinite (see comments by Telgarsky 2021).
PAC-learning is more relaxed than uniform convergence, but it's still a worst-case approach where the bounds have to hold across the entire model class. The relaxation is in the data distribution. Instead, we can consider an average-case bound that holds on expectation over learned models and different draws of the data,
A practical example is the following bound,
in terms of the mutual information between weights and data
PAC Bayes (McAllester 1999, Shawe-Taylor and Williamson, 1997) keeps the expectation value over weights but swaps the expectation value over data with a probabilistic bound over data:
It's called "PAC Bayes" because the distribution over data is usually taken to be a Bayesian posterior. Even when the distribution isn't a posterior, the Bayesian posterior is still often the learning algorithm that minimizes the resulting bound.
In the real world, the bounds end up looking something like (Dziugaite et al., 2021):
where is the prior over and controls the tradeoff between accuracy and complexity.
Data-dependent priors. When dealing with PAC-Bayes bounds involving a KL divergence between the prior and posterior, a common dirty trick of the trade is to adopt "data-dependent priors." If is allowed to depend on data, then you can obtain smaller KL divergences on the right-hand side, thus tightening the bound.
One such approach (Ambroladze et al., 2006; Dziugaite et al., 2021) involves splitting the dataset into two subsets, and . The learning procedure remains the same as before (i.e., you condition the posterior on the entire dataset), but now you evaluate the training loss inside the bound solely on . You can then let the prior depend on without violating the manipulations used to obtain this result.
Yes, this is a bit nasty, but we'll see later that this is getting at a deeper and (possibly) theoretically justified point: the idea that we should view generalization in terms of how a probability distribution changes in response to additional data.
Look at the preceding examples of generalization bounds, and you'll notice the upper bounds all involve some kind of complexity measure: e.g., the number of models for a finite model class , the mutual information between weights and data , and the Kullback-Leibler divergence between the prior and posterior, .
This is true for more or less all generalization bounds. By moving terms around (this requires some care when there are expectations or probabilities involved), we can re-express generalization bounds as,
where is some notion of complexity, and is a monotonically increasing function in both arguments. These expressions formalize Occam's razor: achieving high performance on novel data (low ) requires trading off high accuracy on the training data (low ) against simplicity (low ).
In this section, we'll skip over the details of how various bounds are derived[4] and instead directly examine the notions of complexity that show up in the final relations.
Flavors of complexity. At a high-level, there are three main flavors of complexity measures. These map onto the distinction between global, local, and mesoscopic generalization bounds discussed earlier. These are, respectively:
Most notions of model-class (as well as single-model) complexity measure how well a model class can express noise. If your model class can fit more noise, then it's more complex.
We'll start with the best-known model-class complexity measure: the VC dimension. This requires us to go back to binary classification.
We start by defining the growth function, which is the maximum number of different ways a model class can classify an arbitrary set of inputs :
If the growth function saturates its upper bound, then we say that the model class "shatters" that set of samples. This means the model class always contains a model that can perfectly classify the given inputs under any possible relabeling.
So a straight line can shatter a set of three non-collinear points on a 2D, but not four. A parabola can shatter four points but not five. And so on.
Then, the VC dimension is the largest set of points that can be shattered by a given model class.
In terms of the VC dimension we end up satisfying the PAC bound if our sample size is
where