Updating the Lottery Ticket Hypothesis

by johnswentworth2 min read18th Apr 202124 comments

40

Lottery Ticket HypothesisMachine LearningAI
Frontpage

Epistemic status: not confident enough to bet against someone who’s likely to understand this stuff.

The lottery ticket hypothesis of neural network learning (as aptly described by Daniel Kokotajlo) roughly says:

When the network is randomly initialized, there is a sub-network that is already decent at the task. Then, when training happens, that sub-network is reinforced and all other sub-networks are dampened so as to not interfere.

This is a very simple, intuitive, and useful picture to have in mind, and the original paper presents interesting evidence for at least some form of the hypothesis. Unfortunately, the strongest forms of the hypothesis do not seem plausible - e.g. I doubt that today’s neural networks already contain dog-recognizing subcircuits at initialization. Modern neural networks are big, but not that big. (See this comment for some clarification of this claim.)

Meanwhile, a cluster of research has shown that large neural networks approximate certain Bayesian models, involving phrases like “neural tangent kernel (NTK)” or “Gaussian process (GP)”. Mingard et al. show that these models explain the large majority of the good performance we see from large neural networks in practice. This view also implies a version of the lottery ticket hypothesis, but it has different implications for what the “lottery tickets” are. They’re not subcircuits of the initial net, but rather subcircuits of the parameter tangent space of the initial net.

This post will sketch out what that means.

Let’s start with the jargon: what’s the “parameter tangent space” of a neural net? Think of the network as a function  with two kinds of inputs: parameters , and data inputs . During training, we try to adjust the parameters so that the function sends each data input  to the corresponding data output  - i.e. find  for which , for all . Each data point gives an equation which  must satisfy, in order for that data input to be exactly mapped to its target output. If our initial parameters  happen to be close enough to a solution to those equations, then we can (approximately) solve this using a linear approximation: we look for  such that 

The right-hand-side of that equation is essentially the parameter tangent space. More precisely, (what I'm calling) the parameter tangent space at  is the the set of functions  of the form

… for some .

In other words: the parameter tangent space is the set of functions which can be written as linear approximations (with respect to the parameters) of the network.

The main empirical finding which led to the NTK/GP/Mingard et al picture of neural nets is that, in practice, that linear approximation works quite well. As neural networks get large, their parameters change by only a very small amount during training, so the overall  found during training is actually nearly a solution to the linearly-approximated equations.

Major upshot of all this: the space-of-models “searched over” during training is approximately just the parameter tangent space.

At initialization, we randomly choose , and that determines the parameter tangent space - that’s our set of “lottery tickets”. The SGD training process then solves the equations - it picks out the lottery tickets which perfectly match the data. In practice, there will be many such lottery tickets - many solutions to the equations - because modern nets are extremely overparameterized. SGD effectively picks one of them at random (that’s one of the main results of the Mingard et al work).

Summary:

  • The “parameter tangent space” of a network is the set of functions which can be written as linear approximations (with respect to the parameters) of the network.
  • The parameter tangent space at the network’s randomly-chosen initial parameters is roughly the set of “lottery tickets”.
  • SGD (effectively) throws out any lottery tickets which don’t perfectly match the data, then randomly picks one of the remaining tickets.

Of course this brushes some things under the rug - e.g. different "lottery tickets" don't have exactly the same weight, and different architectures may have different type signatures. But if you find the original lottery ticket hypothesis to be a useful mental model, than I expect this to generally be an upgrade to that mental model. It maintains most of the conceptual functionality, but is probably more realistic.

Thankyou to Evan, Ajeya, Rohin, Edouard, and TurnTrout for a discussion which led to this post.

40

24 comments, sorted by Highlighting new comments since Today at 3:02 AM
New Comment

Thanks! I'm afraid I don't understand the math yet but I'll keep trying. In the meantime:

I doubt that today’s neural networks already contain dog-recognizing subcircuits at initialization. Modern neural networks are big, but not that big.

Can you say more about why? It's not obvious to me that they are not big enough. Would you agree they probably contain edge detectors, circle detectors, etc. at initialization? Also, it seems that some subnetworks/tickets are already decent at the task at initialization, see e.g. this paper. Is that not "dog-recognizing subcircuits at initialization?" Or something similar?

The problem is what we mean by e.g. "dog recognizing subcircuit". The simplest version would be something like "at initialization, there's already one neuron which lights up in response to dogs" or something like that. (And that's basically the version which would be needed in order for a gradient descent process to actually pick out that lottery ticket.) That's the version which I'd call implausible: function space is superexponentially large, circuit space is smaller but still superexponential, so no neural network is ever going to be large enough to have neurons which light up to match most functions/circuits. I would argue that dog-detectors are a lot more special than random circuits even a priori, but not so much more special that the space-of-functions-that-special is less than exponentially large. (For very small circuits like edge detectors, it's more plausible that some neurons implement that function right from the start.)

The thing in the paper you linked is doing something different from that. At initialization, the neurons in the subcircuits they're finding would not light up in recognition of a dog, because they're still connected to a bunch of other stuff that's not in the subcircuit - the subcircuit only detects dogs once the other stuff is disconnected. And, IIUC, SGD should not reliably "find" those tickets: because no neurons in the subcircuit are significantly correlated with dogs, SGD doesn't have any reason to upweight them for dog-recognition. So what's going on in that paper is different from what's going on in normal SGD-trained nets (or at least not the full story). 

I was in a discussion yesterday that made it seem pretty plausible that you're wrong -- this paper suggests that the over-parameterization needed to ensure that some circuit is (approximately) present at the beginning is not that large.

function space is superexponentially large, circuit space is smaller but still superexponential, so no neural network is ever going to be large enough to have neurons which light up to match most functions/circuits.

I haven't actually read the paper I'm referencing, but my understanding is that this argument doesn't work out because the number of possible circuits of size N is balanced by the high number of subgraphs in a graph of size M (where M is only logarithmically larger than N).

That being said, I don't know whether "present at the beginning" is the same as "easily found by gradient descent".

See my comment here. The paper you link (like others I've seen in this vein) requires pruning, which will change the functional behavior of the nodes themselves. As I currently understand it, it is consistent with not having any neuron which lights up to match most functions/circuits.

Ah, I should have read comments more carefully.

I agree with your comment that the claim "I doubt that today’s neural networks already contain dog-recognizing subcircuits at initialization" is ambiguous -- "contains" can mean "under pruning".

Obviously, this is an important distinction: if the would-be dog-recognizing circuit behaves extremely differently due to intersection with a lot of other circuits, it could be much harder to find. But why is "a single neuron lighting up" where you draw the line?

It seems clear that at least some relaxation of that requirement is tenable. For example, if no one neuron lights up in the correct pattern, but there's a linear combination of neurons (before the output layer) which does, then it seems we're good to go: GD could find that pretty easily.

I guess this is where the tangent space model comes in; if in practice (for large networks) we stay in the tangent space, then a linear combination of neurons is basically exactly as much as we can relax your requirement.

But without the tangent-space hypothesis, it's unclear where to draw the line, and your claim that an existing neuron already behaving in the desired way is "what would be necessary for the lottery ticket intuition" isn't clear to me. (Is there a more obvious argument for this, according to you?)

Yeah, I agree that something more general than one neuron but less general than (or at least different from) pruning might be appropriate. I'm not particularly worried about where that line "should" be drawn a priori, because the tangent space indeed seems like the right place to draw the line empirically.

Wait... so:

  1. The tangent-space hypothesis implies something close to "gd finds a solution if and only if there's already a dog detecting neuron" (for large networks, that is) -- specifically it seems to imply something pretty close to "there's already a feature", where "feature" means a linear combination of existing neurons within a single layer
  2. gd in fact trains NNs to recognize dogs
  3. Therefore, we're still in the territory of "there's already a dog detector"

...yeah?

Not quite. The linear expansion isn't just over the parameters associated with one layer, it's over all the parameters in the whole net.

One important thing to note here is that the LTH paper doesn't demonstrate that SGD "finds" a ticket: just that the subnetwork you get by training and pruning could be trained alone in isolation to higher accuracy. That doesn't mean that the weights in the original training are the same when the network is trained in isolation!

So in particular I basically disagree with the opening summary of the content of the "lottery ticket hypothesis". I think a better summary is found in the abstract of the original paper:

dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that—when trained in isolation—reach test accuracy comparable to the original network in a similar number of iterations

Yup, I agree that that quote says something which is probably true, given current evidence. I don't think "picking a winning lottery ticket" is a good way analogy for what that implies, though; see this comment.

Yup, I agree that that quote says something which is probably true, given current evidence.

I don't know what the referent of 'that quote' is. If you mean the passage I quoted from the original lottery ticket hypothesis ("LTH") paper, then I highly recommend reading a follow-up paper which describes how and why it's wrong for large networks. The abstract of the paper I'm citing here:

We study whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise (e.g., random data order and augmentation). We find that standard vision models become stable to SGD noise in this way early in training. From then on, the outcome of optimization is determined to a linearly connected region. We use this technique to study iterative magnitude pruning (IMP), the procedure used by work on the lottery ticket hypothesis to identify subnetworks that could have trained in isolation to full accuracy. We find that these subnetworks only reach full accuracy when they are stable to SGD noise, which either occurs at initialization for small-scale settings (MNIST) or early in training for large-scale settings (ResNet-50 and Inception-v3 on ImageNet).


I don't think "picking a winning lottery ticket" is a good way analogy for what that implies

Again, assuming "that" refers to the claim in the original LTH paper, I also don't think it's a good analogy. But by default I think that claim is what "the lottery ticket hypothesis" refers to, given that it's a widely cited paper that has spawned a large number of follow-up works.

Oh that's definitely interesting. Thanks for the link.

IIUC, here's a simple way to test this hypothesis: initialize a random neural network, and then find the minimal loss point in the tangent space. Since the tangent space is linear, this is easy to do (i.e. doesn't require heuristic gradient descent): for square loss it's just solving a large linear system once, for many other losses it should amount to convex optimization for which we have provable efficient algorithms. And, I guess it's underdetermined so you add some regularization. Is the result about as good as normal gradient descent in the actual parameter space? I'm guessing some of the linked papers might have done something like this?

This basically matches my current understanding. (Though I'm not strongly confident in my current understanding.) I believe the GP results are basically equivalent to this, but I haven't read up on the topic enough to be sure.

Unfortunately, the strongest forms of the hypothesis do not seem plausible - e.g. I doubt that today’s neural networks already contain dog-recognizing subcircuits at initialization.

I think there are papers showing exactly this, like Deconstructing Lottery Tickets and What is the Best Multi-Stage Architecture for Object Recognition?. Another paper, describing the second paper:

We also compare to random, untrained weights because Jarrett et al. (2009) showed — quite strikingly — that the combination of random convolutional filters, rectification, pooling, and local normalization can work almost as well as learned features. They reported this result on relatively small networks of two or three learned layers and on the smaller Caltech-101 dataset (Fei-Fei et al., 2004). It is natural to ask whether or not the nearly optimal performance of random filters they report carries over to a deeper network trained on a larger dataset.

(My interpretation of their results is 'yeah actually randomly initialized convs do pretty well on imagenet'; I remember coming across a paper that answer that question more exactly and getting a clearer 'yes' answer but I can't find it at the moment; I remember them freezing a conv architecture and then only training the fully connected net at the end.)

Why do you doubt this? Are you seeing a bunch of evidence that I'm not? Or are you imagining new architectures that people haven't done these tests for yet / have done these tests and the new architectures fail?

[Maybe your standards are higher than mine--in the DLT paper, they're able to get 65% performance on CIFAR-10 by just optimizing a binary mask on the randomly initialized parameters, which is ok but not good.]

In hindsight, I probably should have explained this more carefully. "Today’s neural networks already contain dog-recognizing subcircuits at initialization" was not an accurate summary of exactly what I think is implausible.

Here's a more careful version of the claim:

  • I do not find it plausible that a random network contains a neuron which acts as a reliable dog-detector. This is the sense in which it's not plausible that networks contain dog-recognizing subcircuits at initialization. But this is what would be necessary for the "lottery ticket" intuition (i.e. training just picks out some pre-existing useful functionality) to work.
  • The original lottery ticket paper (and subsequent work) found subcircuits which detect dogs (or something comparable) after disconnecting them from the rest of the network (i.e. after pruning). The "lottery ticket" story is not actually a good intuition for what's going on there; the pruning step itself does a whole bunch of optimization work, and "it's just upweighting a pre-existing function" is not an accurate intuition for that optimization.

(The paper on randomized filters is orthogonal to this - it sounds like they're finding that simple features like edge detection do show up in randomly initialized nets, which is totally plausible. But they still need optimization for the deeper/higher-level parts of the net; they're just using random initialization for the first few layers, assuming I'm understanding it correctly.)

If there were a robust dog-detection neuron at initialization, and SGD just learned to use that neuron's output, then the "lottery ticket" intuition would be a very good description of what's going on. But pruning is a very different operation, one which fundamentally changes what a circuit is computing. The first paper you link uses the phrase "masking is training", which seems like a good way to think about it. A masking operation isn't just "picking a winning lottery ticket", it's changing the functional behavior of the nodes.

But this is what would be necessary for the "lottery ticket" intuition (i.e. training just picks out some pre-existing useful functionality) to work.

I don't think I agree, because of the many-to-many relationship between neurons and subcircuits.  Or, like, I think the standard of 'reliability' for this is very low. I don't have a great explanation / picture for this intuition, and so probably I should refine the picture to make sure it's real before leaning on it too much?

To be clear, I think I agree with your refinement as a more detailed picture of what's going on; I guess I just think you're overselling how wrong the naive version is?

Plausible.

Here's intuition pump to consider: suppose our net is a complete multigraph: not only is there an edge between every pair of nodes, there's multiple edges with base-2-exponentially-spaced weights, so we can always pick out a subset of them to get any total weight we please between the two nodes. Clearly, masking can turn this into any circuit we please (with the same number of nodes). But it seems wrong to say that this initial circuit has anything useful in it at all.

That seems right, but also reminds me of the point that you need to randomly initialize your neural nets for gradient descent to work (because otherwise the gradients everywhere are the same). Like, in the randomly initialized net, each edge is going to be part of many subcircuits, both good and bad, and the gradient is basically "what's your relative contribution to good subcircuits vs. bad subcircuits?"

The main empirical finding which led to the NTK/GP/Mingard et al picture of neural nets is that, in practice, that linear approximation works quite well. As neural networks get large, their parameters change by only a very small amount during training, so the overall  found during training is actually nearly a solution to the linearly-approximated equations.

Trying to check if I'm understanding correctly: does that mean that despite SGD doing a lot of successive changes that use the gradient at the successive parameter values, these "even out" such that they end up equivalent to a single update from the initial parameters?

Sort of. They end up equivalent to a single Newton step, not a single gradient step (or at least that's what this model says). In general, a set of linear equations is not solved by one gradient step, but is solved by one Newton step. It generally takes many gradient steps to solve a set of linear equations.

(Caveat to this: if you directly attempt a Newton step on this sort of system, you'll probably get an error, because the system is underdetermined. Actually making Newton steps work for NN training would probably be a huge pain in the ass, since the underdetermination would cause numerical issues.)

By Newton step, do you mean one step of Newton's method?