Updating the Lottery Ticket Hypothesis

[-]Vanessa Kosoy5y150

IIUC, here's a simple way to test this hypothesis: initialize a random neural network, and then find the minimal loss point in the tangent space. Since the tangent space is linear, this is easy to do (i.e. doesn't require heuristic gradient descent): for square loss it's just solving a large linear system once, for many other losses it should amount to convex optimization for which we have provable efficient algorithms. And, I guess it's underdetermined so you add some regularization. Is the result about as good as normal gradient descent in the actual parameter space? I'm guessing some of the linked papers might have done something like this?

[-]johnswentworth5y30

This basically matches my current understanding. (Though I'm not strongly confident in my current understanding.) I believe the GP results are basically equivalent to this, but I haven't read up on the topic enough to be sure.

[-]Daniel Kokotajlo5y140

Thanks! I'm afraid I don't understand the math yet but I'll keep trying. In the meantime:

I doubt that today’s neural networks already contain dog-recognizing subcircuits at initialization. Modern neural networks are big, but not that big.

Can you say more about why? It's not obvious to me that they are not big enough. Would you agree they probably contain edge detectors, circle detectors, etc. at initialization? Also, it seems that some subnetworks/tickets are already decent at the task at initialization, see e.g. this paper. Is that not "dog-recognizing subcircuits at initialization?" Or something similar?

[-]johnswentworth5y40

The problem is what we mean by e.g. "dog recognizing subcircuit". The simplest version would be something like "at initialization, there's already one neuron which lights up in response to dogs" or something like that. (And that's basically the version which would be needed in order for a gradient descent process to actually pick out that lottery ticket.) That's the version which I'd call implausible: function space is superexponentially large, circuit space is smaller but still superexponential, so no neural network is ever going to be large enough to have neurons which light up to match most functions/circuits. I would argue that dog-detectors are a lot more special than random circuits even a priori, but not so much more special that the space-of-functions-that-special is less than exponentially large. (For very small circuits like edge detectors, it's more plausible that some neurons implement that function right from the start.)

The thing in the paper you linked is doing something different from that. At initialization, the neurons in the subcircuits they're finding would not light up in recognition of a dog, because they're still connected to a bunch of other stuff that's not in the subcircuit - the subcircuit only detects dogs once the other stuff is disconnected. And, IIUC, SGD should not reliably "find" those tickets: because no neurons in the subcircuit are significantly correlated with dogs, SGD doesn't have any reason to upweight them for dog-recognition. So what's going on in that paper is different from what's going on in normal SGD-trained nets (or at least not the full story).

[-]abramdemski5y70

I was in a discussion yesterday that made it seem pretty plausible that you're wrong -- this paper suggests that the over-parameterization needed to ensure that some circuit is (approximately) present at the beginning is not that large.

function space is superexponentially large, circuit space is smaller but still superexponential, so no neural network is ever going to be large enough to have neurons which light up to match most functions/circuits.

I haven't actually read the paper I'm referencing, but my understanding is that this argument doesn't work out because the number of possible circuits of size N is balanced by the high number of subgraphs in a graph of size M (where M is only logarithmically larger than N).

That being said, I don't know whether "present at the beginning" is the same as "easily found by gradient descent".

[-]johnswentworth5y40

See my comment here. The paper you link (like others I've seen in this vein) requires pruning, which will change the functional behavior of the nodes themselves. As I currently understand it, it is consistent with not having any neuron which lights up to match most functions/circuits.

[-]abramdemski5y60

Ah, I should have read comments more carefully.

I agree with your comment that the claim "I doubt that today’s neural networks already contain dog-recognizing subcircuits at initialization" is ambiguous -- "contains" can mean "under pruning".

Obviously, this is an important distinction: if the would-be dog-recognizing circuit behaves extremely differently due to intersection with a lot of other circuits, it could be much harder to find. But why is "a single neuron lighting up" where you draw the line?

It seems clear that at least some relaxation of that requirement is tenable. For example, if no one neuron lights up in the correct pattern, but there's a linear combination of neurons (before the output layer) which does, then it seems we're good to go: GD could find that pretty easily.

I guess this is where the tangent space model comes in; if in practice (for large networks) we stay in the tangent space, then a linear combination of neurons is basically exactly as much as we can relax your requirement.

But without the tangent-space hypothesis, it's unclear where to draw the line, and your claim that an existing neuron already behaving in the desired way is "what would be necessary for the lottery ticket intuition" isn't clear to me. (Is there a more obvious argument for this, according to you?)

[-]johnswentworth5y40

Yeah, I agree that something more general than one neuron but less general than (or at least different from) pruning might be appropriate. I'm not particularly worried about where that line "should" be drawn a priori, because the tangent space indeed seems like the right place to draw the line empirically.

[-]abramdemski5y60

Wait... so:

The tangent-space hypothesis implies something close to "gd finds a solution if and only if there's already a dog detecting neuron" (for large networks, that is) -- specifically it seems to imply something pretty close to "there's already a feature", where "feature" means a linear combination of existing neurons within a single layer
gd in fact trains NNs to recognize dogs
Therefore, we're still in the territory of "there's already a dog detector"

...yeah?

[-]johnswentworth5y20

Not quite. The linear expansion isn't just over the parameters associated with one layer, it's over all the parameters in the whole net.

[-]DanielFilan5y40

One important thing to note here is that the LTH paper doesn't demonstrate that SGD "finds" a ticket: just that the subnetwork you get by training and pruning could be trained alone in isolation to higher accuracy. That doesn't mean that the weights in the original training are the same when the network is trained in isolation!

[-]DanielFilan5y40

So in particular I basically disagree with the opening summary of the content of the "lottery ticket hypothesis". I think a better summary is found in the abstract of the original paper:

dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that—when trained in isolation—reach test accuracy comparable to the original network in a similar number of iterations

[-]johnswentworth5y20

Yup, I agree that that quote says something which is probably true, given current evidence. I don't think "picking a winning lottery ticket" is a good way analogy for what that implies, though; see this comment.

[-]DanielFilan5y60

Yup, I agree that that quote says something which is probably true, given current evidence.

I don't know what the referent of 'that quote' is. If you mean the passage I quoted from the original lottery ticket hypothesis ("LTH") paper, then I highly recommend reading a follow-up paper which describes how and why it's wrong for large networks. The abstract of the paper I'm citing here:

We study whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise (e.g., random data order and augmentation). We find that standard vision models become stable to SGD noise in this way early in training. From then on, the outcome of optimization is determined to a linearly connected region. We use this technique to study iterative magnitude pruning (IMP), the procedure used by work on the lottery ticket hypothesis to identify subnetworks that could have trained in isolation to full accuracy. We find that these subnetworks only reach full accuracy when they are stable to SGD noise, which either occurs at initialization for small-scale settings (MNIST) or early in training for large-scale settings (ResNet-50 and Inception-v3 on ImageNet).

I don't think "picking a winning lottery ticket" is a good way analogy for what that implies

Again, assuming "that" refers to the claim in the original LTH paper, I also don't think it's a good analogy. But by default I think that claim is what "the lottery ticket hypothesis" refers to, given that it's a widely cited paper that has spawned a large number of follow-up works.

[-]johnswentworth5y20

Oh that's definitely interesting. Thanks for the link.

[-]Vaniver5y60

Unfortunately, the strongest forms of the hypothesis do not seem plausible - e.g. I doubt that today’s neural networks already contain dog-recognizing subcircuits at initialization.

I think there are papers showing exactly this, like Deconstructing Lottery Tickets and What is the Best Multi-Stage Architecture for Object Recognition?. Another paper, describing the second paper:

We also compare to random, untrained weights because Jarrett et al. (2009) showed — quite strikingly — that the combination of random convolutional filters, rectification, pooling, and local normalization can work almost as well as learned features. They reported this result on relatively small networks of two or three learned layers and on the smaller Caltech-101 dataset (Fei-Fei et al., 2004). It is natural to ask whether or not the nearly optimal performance of random filters they report carries over to a deeper network trained on a larger dataset.

(My interpretation of their results is 'yeah actually randomly initialized convs do pretty well on imagenet'; I remember coming across a paper that answer that question more exactly and getting a clearer 'yes' answer but I can't find it at the moment; I remember them freezing a conv architecture and then only training the fully connected net at the end.)

Why do you doubt this? Are you seeing a bunch of evidence that I'm not? Or are you imagining new architectures that people haven't done these tests for yet / have done these tests and the new architectures fail?

[Maybe your standards are higher than mine--in the DLT paper, they're able to get 65% performance on CIFAR-10 by just optimizing a binary mask on the randomly initialized parameters, which is ok but not good.]

[-]johnswentworth5y110

In hindsight, I probably should have explained this more carefully. "Today’s neural networks already contain dog-recognizing subcircuits at initialization" was not an accurate summary of exactly what I think is implausible.

Here's a more careful version of the claim:

I do not find it plausible that a random network contains a neuron which acts as a reliable dog-detector. This is the sense in which it's not plausible that networks contain dog-recognizing subcircuits at initialization. But this is what would be necessary for the "lottery ticket" intuition (i.e. training just picks out some pre-existing useful functionality) to work.
The original lottery ticket paper (and subsequent work) found subcircuits which detect dogs (or something comparable) after disconnecting them from the rest of the network (i.e. after pruning). The "lottery ticket" story is not actually a good intuition for what's going on there; the pruning step itself does a whole bunch of optimization work, and "it's just upweighting a pre-existing function" is not an accurate intuition for that optimization.

(The paper on randomized filters is orthogonal to this - it sounds like they're finding that simple features like edge detection do show up in randomly initialized nets, which is totally plausible. But they still need optimization for the deeper/higher-level parts of the net; they're just using random initialization for the first few layers, assuming I'm understanding it correctly.)

If there were a robust dog-detection neuron at initialization, and SGD just learned to use that neuron's output, then the "lottery ticket" intuition would be a very good description of what's going on. But pruning is a very different operation, one which fundamentally changes what a circuit is computing. The first paper you link uses the phrase "masking is training", which seems like a good way to think about it. A masking operation isn't just "picking a winning lottery ticket", it's changing the functional behavior of the nodes.

[-]Vaniver5y30

But this is what would be necessary for the "lottery ticket" intuition (i.e. training just picks out some pre-existing useful functionality) to work.

I don't think I agree, because of the many-to-many relationship between neurons and subcircuits. Or, like, I think the standard of 'reliability' for this is very low. I don't have a great explanation / picture for this intuition, and so probably I should refine the picture to make sure it's real before leaning on it too much?

To be clear, I think I agree with your refinement as a more detailed picture of what's going on; I guess I just think you're overselling how wrong the naive version is?

[-]johnswentworth5y20

Plausible.

Here's intuition pump to consider: suppose our net is a complete multigraph: not only is there an edge between every pair of nodes, there's multiple edges with base-2-exponentially-spaced weights, so we can always pick out a subset of them to get any total weight we please between the two nodes. Clearly, masking can turn this into any circuit we please (with the same number of nodes). But it seems wrong to say that this initial circuit has anything useful in it at all.

[-]Vaniver5y30

That seems right, but also reminds me of the point that you need to randomly initialize your neural nets for gradient descent to work (because otherwise the gradients everywhere are the same). Like, in the randomly initialized net, each edge is going to be part of many subcircuits, both good and bad, and the gradient is basically "what's your relative contribution to good subcircuits vs. bad subcircuits?"

[-]adamShimi5y50

The main empirical finding which led to the NTK/GP/Mingard et al picture of neural nets is that, in practice, that linear approximation works quite well. As neural networks get large, their parameters change by only a very small amount during training, so the overall found during training is actually nearly a solution to the linearly-approximated equations.

Trying to check if I'm understanding correctly: does that mean that despite SGD doing a lot of successive changes that use the gradient at the successive parameter values, these "even out" such that they end up equivalent to a single update from the initial parameters?

[-]johnswentworth5y50

Sort of. They end up equivalent to a single Newton step, not a single gradient step (or at least that's what this model says). In general, a set of linear equations is not solved by one gradient step, but is solved by one Newton step. It generally takes many gradient steps to solve a set of linear equations.

(Caveat to this: if you directly attempt a Newton step on this sort of system, you'll probably get an error, because the system is underdetermined. Actually making Newton steps work for NN training would probably be a huge pain in the ass, since the underdetermination would cause numerical issues.)

[-]adamShimi5y40

By Newton step, do you mean one step of Newton's method?

[-]johnswentworth5y20

Yes.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

40

Updating the Lottery Ticket Hypothesis

40