Financial status: This is independent research. I welcome financial support to make further posts like this possible.
Epistemic status: The thread of research I’m reviewing here has been contentious in the past so I expect to make updates based on the comments.
This is my attempt to understand the lottery ticket hypothesis.
I review the original paper, as well as two follow-up papers and one related paper.
I also go over four previous LessWrong posts on this subject.
The lottery ticket hypothesis is a hypothesis about how neural network training works. It was proposed in 2018 by Jonathan Frankle, a PhD student at MIT, and Michael Carbin, a professor at MIT. It suggests that, in a certain sense, much of the action in neural network training is during initialization, not during optimization.
Research that sheds light on neural network training is relevant to alignment because neural network architectures may eventually become large enough to express dangerous patterns of cognition, and it seems unlikely that these patterns of cognition can be detected by input/output evaluations alone. Our only choices therefore seem to be to (1) abandon the contemporary machine learning paradigm and seek a new paradigm, or (2) augment the contemporary machine learning paradigm with some non-input/output method sufficient to avoid deploying dangerous patterns of cognition. Insight into what makes contemporary machine learning effective is relevant both to determining whether course (1) or (2) is more promising, and to executing course (2) if that turns out to be the better course.
The lottery ticket hypothesis
The lottery ticket hypothesis (or LTH), as originally articulated by Frankle and Carbin, says:
A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations.
A "subnetwork" means that you take the original neural network and clamp some of the weights to zero, so a network with weights has subnetworks. The lottery ticket conjecture, which is an extension of the lottery ticket hypothesis but which Frankle and Carbin are careful to point out is not tested directly in their original paper says:
SGD seeks out and trains a subset of well-initialized weights. Dense, randomly-initialized networks are easier to train than the sparse networks that result from pruning because there are more possible subnetworks from which training might recover a winning ticket.
The lottery ticket conjecture is much more interesting from the perspective of understanding neural network training, and seems to be what is referred to as the "Lottery Ticket Hypothesis" both in the literature and on this site, so I will follow that convention in the remainder of this post, and will treat the original lottery ticket hypothesis as an implication of it.
In neural network training, in the absence of the lottery ticket hypothesis, we think of the role of SGD as pushing the weights of the overall network towards a local minimum of a loss function, which measures the performance of the network on a training set:
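In equations, this classical view is just the familiar gradient descent update, where θ_t is the vector of network weights at step t, η is the learning rate, and L is the training loss:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$$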
On this view, the point in the loss landscape at which we begin optimizing — that is, the way we initialize the weights at the beginning of training — is not really where the main action is. Initialization might well be important, but the main point of initialization is to start in some not-completely-crazy part of the loss landscape, after which the optimization algorithm does the real work.
The lottery ticket hypothesis says that we should instead view a neural network as an ensemble of a huge number of sparse subnetworks, and that there is some as-yet poorly understood property of the initial weights of each of these subnetworks that determines how quickly it will learn during training and how well it will generalize at the end of training. Among this huge ensemble of subnetworks, some have this "trainability" property by virtue of how they happened to be initialized. What the optimization algorithm is implicitly doing, then, is (1) identifying which subnetworks have this property, (2) training and upweighting them, and (3) downweighting the other subnetworks that do not have this property.
Now this "lottery ticket view" of what is happening during neural network training does not exactly overturn the classical "whole network optimization view". The two views are of course compatible. But the lottery ticket hypothesis does make predictions, such as that it might be possible to give our entire network the as-yet poorly understood initialization property and improve training performance.
Now it’s not that the winning "lottery ticket" is already trained; rather, it has some property that causes it to be efficiently trainable. There is some follow-up work concerning untrained subnetworks, but I do not believe it suggests that neural network training consists of simply picking a subnetwork that already solves the task at hand. I discuss that work below under "supermasks" and "weight-agnostic networks".
Also, if some network does contain a subnetwork that would perform well at a task on its own, that does not mean that there is a neuron within the network that expresses the output of this particular subnetwork. The output of any one neuron will be a combination of the outputs of all the subnetworks that do not mask that neuron out, which will generally include exponentially many subnetworks.
So how did Frankle and Carbin actually investigate this? They used the following procedure:
1. Train a dense neural network on a computer vision task
2. After training, pick a subnetwork that discards the bottom X% of weights ranked by absolute magnitude, for some values of X from 5% to 95%
3. Reset the remaining weights to the values they had when the original dense network was initialized
4. Train this reduced subnetwork on the same computer vision task
5. Compare this to training the same reduced subnetwork initialized with freshly randomized weights
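Here is a minimal sketch of this procedure in PyTorch on a toy task. The tiny two-layer network and the synthetic data are my own stand-ins, not the architectures or datasets Frankle and Carbin actually used, and for simplicity it treats every parameter tensor (including biases) the same way:

```python
# Minimal sketch of the lottery ticket procedure (steps 1-5 above) on a toy task.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 20)
y = (X[:, 0] > 0).long()  # toy binary classification problem


def make_model():
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))


def apply_masks(model, masks):
    # Clamp pruned weights (mask == 0) to zero.
    with torch.no_grad():
        for p, m in zip(model.parameters(), masks):
            p.mul_(m)


def train(model, masks, steps=200):
    apply_masks(model, masks)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
        apply_masks(model, masks)  # keep pruned weights at zero throughout
    return model


# Step 1: train the dense network, remembering its weights at initialization.
dense = make_model()
init_state = {k: v.clone() for k, v in dense.state_dict().items()}
dense_masks = [torch.ones_like(p) for p in dense.parameters()]
train(dense, dense_masks)

# Step 2: keep only the largest-magnitude weights of the trained dense net.
prune_fraction = 0.8
masks = []
for p in dense.parameters():
    k = max(int(p.numel() * prune_fraction), 1)
    threshold = p.abs().flatten().kthvalue(k).values
    masks.append((p.abs() > threshold).float())

# Steps 3-4: reset surviving weights to their initial values, then retrain.
ticket = make_model()
ticket.load_state_dict(init_state)
train(ticket, masks)

# Step 5: control experiment - same mask, but freshly randomized weights.
control = make_model()
train(control, masks)
```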
It turns out that the subnetworks produced by step 4 train faster and ultimately generalize better than the subnetworks produced by step 5. On this basis the authors conjecture that there was some special property of the initialization of this particular subnetwork, and that due to this property it trained efficiently relative to its peers, and that it was thus implicitly upweighted by the optimization algorithm.
In many of the experiments in the paper, the authors actually iterate steps 2-4 several times, pruning the weights gradually over several re-training phases rather than all at once after training the dense network just once.
When running experiments with larger architectures (VGG-19 and ResNet), the authors find:
We continue to find winning tickets for all of these architectures; however, our method for finding them, iterative pruning, is sensitive to the particular learning rate used.
In some follow-up work, Frankle and Carbin also found it necessary to use "late resetting", which means resetting the weights of the subnetwork not to their original values from initialization but to their values from 1% - 7% of the way through training the dense network.
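In terms of the sketch above, late resetting just means saving a snapshot of the dense network a few percent of the way through its training run, and rewinding the ticket to that snapshot rather than to its weights at iteration zero. A rough continuation of the earlier sketch:

```python
# Continuing the sketch above: save a snapshot of the dense network early in
# training, and rewind the winning ticket to that snapshot ("late resetting")
# instead of to its weights at initialization.
dense = make_model()
dense_masks = [torch.ones_like(p) for p in dense.parameters()]
train(dense, dense_masks, steps=10)   # e.g. a few percent of total training
snapshot_state = {k: v.clone() for k, v in dense.state_dict().items()}
train(dense, dense_masks, steps=190)  # finish training the dense network

# ... build `masks` from the trained dense network exactly as before, then:
ticket = make_model()
ticket.load_state_dict(snapshot_state)  # late reset, not reset-to-init
train(ticket, masks)
```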
Deconstructing lottery tickets: signs, zeros, and the supermask
The suggestion that there might be some property of the initialization of a neural network that causes it to learn quickly and generalize well has prompted follow-up work trying to uncover the exact nature of that property. Zhou et al. from Uber AI investigated the lottery ticket hypothesis and found the following.
First, among the weights that the sparse subnetwork is reset to in step 3, only the sign matters. If you replace all the positive weights with 1 and all the negative weights with -1 then the subnetwork still trains efficiently and generalizes well, but if you randomize all the weights then this property disappears.
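As a concrete illustration of the sign-only finding, the variant described above amounts to something like the following sketch, where the constant magnitude of 1.0 is my stand-in for whatever fixed value one chooses:

```python
import torch

def sign_only_reinit(init_weights, mask, constant=1.0):
    # Surviving weights keep only the sign of their original initial value,
    # all with the same fixed magnitude; pruned weights stay at zero.
    return torch.sign(init_weights) * constant * mask
```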
Second, if instead of clamping the pruned weights to zero you clamp them to their initial value from the dense network, then the good performance disappears. They hypothesize that clamping small-magnitude weights to zero is actually acting as a form of training since perhaps those weights were heading towards zero anyway.
Third, and most fascinatingly, they can find "supermasks" such that merely applying the mask to an untrained dense network already produces better-than-random results. It is not that the supermask identifies a subnetwork that already solves the task. The untrained subnetwork identified by a supermask actually performs very poorly by the standards of any trained supervised learning system: 20% error on MNIST compared to 12% achieved by linear classification and 0.18% achieved by trained convnets. But 20% error is much better than the 90% error you would expect from chance predictions. The authors suggest that we think of masking as a form of training.
One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers
In this paper, Morcos et al show that a "winning lottery ticket" subnetwork found by training a dense network on one dataset with one optimization algorithm still retains its attractive properties of efficient training and good generalization when the subnetwork is later trained on a different dataset or optimized by a different optimizer.
Weight-agnostic neural networks
This paper from Google Brain finds neural network architectures that perform well no matter which weights are inserted into them. They demonstrate networks solving reinforcement learning problems when all of their weights are set to 1, and then still solving the same problem when all of their weights are set to 2, 10, etc. The paper uses a completely different method to find architectures from the one used by Frankle and Carbin so this paper is a bit outside the lottery ticket literature, but it provides further evidence that weight training may not be the entirety of "where the action is at" in neural network training.
Daniel on the lottery ticket hypothesis and scaling
Daniel Kokotajlo asks whether the lottery ticket hypothesis, if true, would suggest that machine learning will continue to solve more problems as we apply more computing power to it. The reason is that more computing power means we can train networks with more parameters, which means we are searching over more "lottery tickets", which might mean we are able to solve increasingly difficult problems where lottery tickets are harder and harder to come by.
Evan on the lottery ticket hypothesis and deep double descent
Evan Hubinger speculates about whether the lottery ticket hypothesis might explain the deep double descent phenomenon:
My guess as to how double descent works if the Lottery Tickets Hypothesis is true is that in the interpolation regime SGD gets to just focus on the winning tickets and ignore the others—since it doesn't have to use the full model capacity—whereas on the interpolation threshold SGD is forced to make use of the full network (to get the full model capacity), not just the winning tickets, which hurts generalization. [...] That's just speculation on my part, however
John on the lottery ticket hypothesis and parameter tangent spaces
John Wentworth proposes an update to the lottery ticket hypothesis informed by recent results showing that the weights of large neural networks actually don’t change very much over the course of training on practical machine learning problems:
At initialization, we randomly choose θ0, and that determines the parameter tangent space - that’s our set of "lottery tickets". The SGD training process then solves the equations - it picks out the lottery tickets which perfectly match the data. In practice, there will be many such lottery tickets - many solutions to the equations - because modern nets are extremely overparameterized. SGD effectively picks one of them at random
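For reference, my understanding is that the "parameter tangent space" here refers to the first-order expansion of the network around its initialization, the picture from the neural tangent kernel literature, though this gloss is mine rather than John's:

$$f(x;\theta) \approx f(x;\theta_0) + \nabla_\theta f(x;\theta_0) \cdot (\theta - \theta_0)$$

If training never moves θ far from θ0, the network effectively behaves like this linear-in-parameters model throughout training.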
I don’t yet understand this proposal. In what way do we decompose this parameter tangent space into "lottery tickets"? Are the lottery tickets the cross product of subnetworks and points in the parameter tangent space? The subnetworks alone? If the latter then how does this differ from the original lottery ticket hypothesis?
John quotes the following synopsis of the lottery ticket hypothesis:
When the network is randomly initialized, there is a sub-network that is already decent at the task. Then, when training happens, that sub-network is reinforced and all other sub-networks are dampened so as to not interfere.
The "supermask" results above do suggest that this synopsis is accurate, so far as I can tell, but it’s important to realize that "decent" might mean "better than random but worse than linear regression", and the already-present subnetwork does not just get reinforced during training, it also gets trained to a very significant extent. There is a thread between Daniel Kokotajlo and Daniel Filan about this synopsis that references several papers I haven’t reviewed yet. They seem to agree that this synopsis is at least not implied by the experiments in the original lottery ticket hypothesis paper, which I agree is true.
Abram on the lottery ticket hypothesis and deception
Abram points out that the lottery ticket hypothesis being true could be disheartening news from the perspective of safety:
My Contentious Position for this subsection: Some versions of the lottery ticket hypothesis seem to imply that deceptive circuits are already present at the beginning of training.
Daniel provides the following helpful summary of Abram’s argument in a comment:
[the parameter tangent space version of the lottery ticket hypothesis] seems to be saying that the training process basically just throws away all the tickets that score less than perfectly, and randomly selects one of the rest. This means that tickets which are deceptive agents and whatnot are in there from the beginning, and if they score well, then they have as much chance of being selected at the end as anything else that scores well. And since we should expect deceptive agents that score well to outnumber aligned agents that score well... we should expect deception.
I previously attempted to summarize this post by Abram.
It’s exciting to see these insights being developed within the mainstream machine learning literature. It’s also exciting to see their safety implications beginning to be fleshed out here. I hope this post helps by summarizing some of the experimental results that have led to these hypotheses.
We would expect 90% error from chance predictions since MNIST is a handwritten digit recognition dataset with 10 possible labels. ↩︎