This post is a follow-up to Understanding “Deep Double Descent”.
I was talking to Rohin at NeurIPS about my post on double descent, and he asked the very reasonable question of why exactly I think double descent is so important. I realized that I hadn't fully explained that in my previous post, so the goal of this post is to further address the question of why you should care about double descent from an AI safety standpoint. This post assumes you've read my Understanding “Deep Double Descent” post, so you should read that first before reading this if you haven't already.
Specifically, I think double descent demonstrates the in my opinion very important yet counterintuitive result that larger models can actually be simpler than smaller models. On its face, this sounds somewhat crazy—how can a model with more parameters be simpler? But in fact I think this is just a very straightforward consequence of double descent: in the double descent paradigm, larger models with zero training error generalize better than smaller models with zero training error because they do better on SGD's inductive biases. And if you buy that SGD's inductive biases are approximately simplicity, that means that larger models with zero training error are simpler than smaller models with zero training error.
Obviously, larger models do have more parameters than smaller ones, so if that's your measure of simplicity, larger models will always be more complicated, but for other measures of simplicity that's not necessarily the case. For example, it could hypothetically be the case that larger models have lower Kolmogorov complexity. Though I don't actually think that's true in the case of K-complexity, I think that's only for the boring reason that model weights have a lot of noise. If you had a way of somehow only counting the “essential complexity,” I suspect larger models would actually have lower K-complexity.
Really, what I'm trying to do here is dispel what I see as the myth that as ML models get more powerful simplicity will stop mattering for them. In a Bayesian setting, it is a fact that the impact of your prior on your posterior (for those regions where your prior is non-zero) becomes negligible as you update on more and more data. I have sometimes heard it claimed that as a consequence of this result, as we move to doing machine learning with ever larger datasets and ever bigger models, the impact of our training processes' inductive biases will become negligible. However, I think that's quite wrong, and I think double descent does a good job of showing why, because all of the performance gains you get past the interpolation threshold are coming from your implicit prior. Thus, if you suspect modern ML to mostly be in that regime, what will matter in terms of which techniques beat out other techniques is how good they are at compressing their data into the “actually simplest” model that fits it.
Furthermore, even just from the simple Bayesian perspective, I suspect you can still get double descent. For example, suppose your training process looks like the following: you have some hypothesis class that keeps getting larger as you train and at each time step you select the best a posteriori hypothesis. I think that this setup will naturally yield a double descent for noisy data: first you get a “likelihood descent” as you get hypotheses with greater and greater likelihood, but then you start overfitting to noise in your data as you get close to the interpolation threshold. Past the interpolation threshold, however, you get a second “prior descent” where you're selecting hypotheses with greater and greater prior probability rather than greater and greater likelihood. I think this is a good model for how modern machine learning works and what double descent is doing.
All of this is only for models with zero training error, however—before you reach zero training error larger models can certainly have more essential complexity than smaller ones. That being said, if you don't do very many steps of training then your inductive biases will also matter a lot because you haven't updated that much on your data yet. In the double descent framework, the only region where your inductive biases don't matter very much is right on the interpolation threshold—before the interpolation threshold or past it they should still be quite relevant.
Why does any of this matter from a safety perspective, though? Ever since I read Belkin et al. I've had double descent as part of my talk version of “Risks from Learned Optimization” because I think it addresses a pretty important part of the story for mesa-optimization. That is, mesa-optimizers are simple, compressed policies—but as ML moves to larger and larger models, why should that matter? The answer, I think, is that larger models can generalize better not just by fitting the data better, but also by being simpler.
Negating the impact of the prior not having support over some hypotheses requires realizability (see Embedded World-Models). ↩︎
Note that double descent happens even without explicit regularization, so the prior we're talking about here is the implicit one imposed by the architecture you've chosen and the fact that you're training it via SGD. ↩︎
Which is exactly what you should expect if you think Occam's razor is the right prior: if two hypotheses have the same likelihood but one generalizes better, according to Occam's razor it must be because it's simpler. ↩︎
One caveat worth noting about double descent – it only appears if you train far longer than necessary, i.e. "train forever".
If you regularize with early stopping (stop when the performance on some validation set stops improving), the effect is not present. Since we use early stopping in all realistic settings, performance always improves monotonically with more data / bigger models.
To rephrase, analyzing the weird point where models reach zero training loss will produce confusing results. The early stopping point exhibits no such weird non-monotonic behavior.
Evan's response (copied from a direct message, before I was approved to post here):
The idea I'm objecting to is that there's a sharp change from one regime (larger family of models) to the other (better inductive bias). I'd say that both factors smoothly improve performance over the full range of model sizes. I don't fully understand this yet, and I think it would be interesting to understand how bigger models and better inductive bias (from SGD + early stopping) come together to produce this smooth improvement in performance.
One man's modus ponens is another man's modus tollens: while you seem to go from "double descent is real" to "larger models are simpler", I go from "larger models are more complex" to "something crazy is happening with double descent".
Possible crux: I think the empirical evidence justifies "double descent is a real effect that occurs in some situations", but not "double descent is clearly real and happens in the vast majority of realistic settings".
Planned summary for the Alignment newsletter:
Note that, in your example, if we do see double descent, it's because the best hypothesis was previously not in the class of hypotheses we were considering. Bayesian methods tend to do badly when the hypothesis class is misspecified.
As a counterpoint though, you could see double descent even if your hypothesis class always contains the truth, because the "best" hypothesis need not be the truth. It could be that posterior(truth) < posterior(memorization hypothesis) < posterior(almost-right hypothesis that predicts the noise "by luck").
Overall, my guess is that while you could engineer this if you tried, but it wouldn't happen "naturally" in synthetic examples (though it might happen for datasets like MNIST, because maybe there's some property of those datasets that causes double descent to happen).
That first stage is not just a "likelihood descent", it is a "likelihood + prior descent", since you are choosing hypotheses based on the posterior, not based on the likelihood. And with Bayesian methods it's quite possible that you never get to perfect likelihood, because the prior contains enough information to weed out the memorization hypotheses.
Yep, that's exactly my model.
If "best" here means test error, then presumably the truth should generalize at least as well as any other hypothesis.
True for the Bayesian case, though unclear in the ML case—I think it's quite plausible that current ML underweights the implicit prior of SGD relative to the maximizing the likelihood of the data (EDIT: which is another reason that better future ML might care more about inductive biases).
Sorry, "best" meant "the one that was chosen", i.e. highest posterior, which need not be the truth. I agree that the truth generalizes at least as well as any other hypothesis.
I agree it's unclear for the ML case, just because double descent happens and I have no idea why and "the prior doesn't start affecting things until after interpolation" does explain that even though it itself needs explaining.
I think we would benefit from tabooing the word "simple". It seems to me that when people use the word "simple" in the context of ML, they are usually referring to either smoothness/Lipschitzness or minimum description length. But it's easy to see that these metrics don't always coincide. A random walk is smooth, but its minimum description length is long. A tall square wave is not smooth, but its description length is short. L2 regularization makes a model smoother without reducing its description length. Quantization reduces a model's description length without making it smoother. I'm actually not aware of any argument that smoothness and description length are or should be related--it seems like this might be an unexamined premise.
Based on your paper, the argument for mesa-optimizers seems to be about description length. But if SGD's inductive biases target smoothness, it's not clear why we should expect SGD to discover mesa-optimizers. Perhaps you think smooth functions tend to be more compressible than functions which aren't smooth. I don't think that's enough. Imagine a Venn diagram where compressible functions are a big circle. Mesa-optimizers are a subset, and the compressible functions discovered by SGD are another subset. The question is whether these two subsets are overlapping. Pointing out that they're both compressible is not a strong argument for overlap: "all cats are mammals, and all dogs are mammals, so therefore if you see a cat, it's also likely to be a dog".
When I read your paper, I get a sense that an optimizers outperform by allowing one to collapse a lot of redundant functionality into a single general method. It seems like maybe it's the act of compression that gets you an agent, not the property of being compressible. If our model is a smooth function which could in principle be compressed using a single general method, I'm not seeing why the reapplication of that general method in a very novel context is something we should expect to happen.
BTW I actually do think minimum description length is something we'll have to contend with long term. It's just too useful as an inductive bias. (Eliminating redundancies in your cognition seems like a basic thing an AGI will need to do to stay competitive.) But I'm unconvinced SGD possesses the minimum description length inductive bias. Especially if e.g. the flat minima story is the one that's true (as opposed to e.g. the lottery ticket story).
Also, I'm less confident that what I wrote above applies to RNNs.
Does anyone know if double decent happens when you look at the posterior predictive rather than just the output of SGD? I wouldn't be too surprised if it does, but before we start talking about the bayesian perspective, I'd like to see evidence that this isn't just an artifact of using optimization instead of integration.
I'm confused by this. As I see it, the consequence of that result is that as you move to larger datasets, holding model size fixed, the impact of inductive biases will decrease. This is consistent with double descent, where as you have larger and larger datasets, you get into the underparameterized regime, which follows the normal story of more data = better.
As you increase the size of your models, holding dataset size fixed, the impact of inductive biases should increase, since you need more information to pick out the right model, but the data provides exactly the same amount of information. And that's consistent with what happens in the overparameterized regime.
On its face, this statement seems implausible to me. Are you saying, for instance, that the model of a dog that a human has in their head should be simpler than the model of a dog that an image classifier has?
I just edited the last sentence to be clearer in terms of what I actually mean by it.
What double descent definitely says is that for a fixed dataset, larger models with zero training error are simpler than smaller models with zero training error. I think it does say somewhat more than that also, which is that larger models do have a real tendency towards being better at finding simpler models in general. That being said, the dataset on which the concept of a dog in your head was trained on is presumably way larger than that of any ML model, so even if your brain is really good at implementing Occam's razor and finding simple models, your model is still probably going to be more complicated.