interstice

NTK/GP Models of Neural Nets Can't Learn Features

There's an important distinction^{[1]} to be made between these two claims:

A) Every function with large volume in parameter-space is simple

B) Every simple function has a large volume in parameter space

For a method of inference to qualify as a 'simplicity prior', you want both claims to hold. This is what lets us derive bounds like 'Solomonoff induction matches the performance of any computable predictor', since all of the simple, computable predictors have relatively large volume in the Solomonoff measure, so they'll be picked out after boundedly many mistakes. In particular, you want there to be an implication like, if a function has complexity , it will have parameter-volume at least .

Now, the Mingard results, at least the ones that have mathematical proof, rely on the Levin bound. This only shows (A), which is the direction that is much easier to prove -- it automatically holds for any mapping from parameter-space to functions with bounded complexity. They also have some empirical results that show there is substantial 'clustering', that is, there are *some* simple functions that have large volumes. But this still doesn't show that all of them do, and indeed is compatible with the learnable function class being extremely limited. For instance, this could easily be the case even if NTK/GP was only able to learn linear functions. In reality the NTK/GP is capable of approximating arbitrary functions on finite-dimensional inputs but, as I argued in another comment, this is not the right notion of 'universality' for classification problems. I strongly suspect^{[2]} that the NTK/GP can be shown to not be 'universally data-efficient' as I outlined there, but as far as I'm aware no one's looked into the issue formally yet. Empirically, I think the results we have so far suggest that the NTK/GP is a decent first-order approximation for simple tasks that tends to perform worse on the more difficult problems that require non-trivial feature learning/efficiency.

I actually posted basically the same thing underneath another one of your comments a few weeks ago, but maybe you didn't see it because it was only posted on LW, not the alignment forum ↩︎

Basically, because in the NTK/GP limit the functions for all the neurons in a given layer are sampled from a single computable distribution, so I think you can show that the embedding is 'effectively finite' in some sense(although note it

*is*a universal approximator for fixed input dimension) ↩︎

NTK/GP Models of Neural Nets Can't Learn Features

Hmm, so regarding the linear combinations, it's true that there are *some* linear combinations that will change by in the large-width limit -- just use the vector of partial derivatives of the output at some particular input, this sum will change by the amount that the output function moves during the regression. Indeed, I suspect(but don't have a proof) that these particular combinations will span the space of linear combinations that change non-trivially during training. I would dispute "we expect most linear combinations to change" though -- the CLT argument implies that we should expect almost all combinations to *not* appreciably change. Not sure what effect this would have on the PCA and still think it's plausible that it doesn't change at all(actually, I think Greg Yang states that it doesn't change in section 9 of his paper, haven't read that part super carefully though)

And the tangent kernel not changing does not imply that transfer learning won’t work

So I think I was a bit careless in saying that the NTK can't do transfer learning at all -- a more exact statement might be "the NTK does the minimal amount of transfer learning possible". What I mean by this is, *any* learning algorithm can do transfer learning if the task we are 'transferring' to is sufficiently similar to the original task -- for instance, if it's just the exact same task but with a different data sample. I claim that the 'transfer learning' the NTK does is of this sort. As you say, since the tangent kernel doesn't change at all, the net effect is to move where the network starts in the tangent space. Disregarding convergence speed, the impact this has on generalization is determined by the values set by the old function on axes of the NTK outside of the span of the partial derivatives at the new function's data points. This means that, for the NTK to transfer anything from one task to another, it's not enough for the tasks to both feature, for instance, eyes. It's that the eyes have to correlate with the output in the *exact same way* in both tasks. Indeed, the transfer learning could actually hurt the generalization. Nor is its effect invariant under simple transformations like flipping the sign of the target function(this would change beneficial transfer to harmful). By default, for functions that aren't simple multiples, I expect the linear correlation between values on different axes to be about 0, even if the functions share many meaningful features. So while the NTK can do 'transfer learning' in a sense, it's about as weak as possible, and I strongly doubt that this sort of transfer is sufficient to explain transfer learning's successes in practice(but don't have empirical proof).

I do think the empirical results pretty strongly suggest that the NTK/GP model captures everything important about neural nets, at least in terms of their performance on the original task.

It's true that NTK/GP perform pretty closely to finite nets on the tasks we've tried them on so far, but those tasks are pretty simple and we already had decent non-NN solutions. Generally the pattern is '"GP matches NNs on really simple tasks, NTK on somewhat harder ones". I think the data we have is consistent with this breaking down as we move to the harder problems that have no good non-NN solutions. I would be very interested in seeing an experiment with NTK on, say, ImageNet for this reason, but as far as I know no one's done so because of the prohibitive computational cost.

I only found one directly-relevant study, which is on way too small and simple a system for me to draw much of a conclusion from it, but it does seem to have worked.

Thanks for the link -- will read this tomorrow.

BTW, thanks for humoring me throughout this thread. This is really useful, and my understanding is updating considerably.

And thank you for engaging in detail -- I have also found this very helpful in forcing me to clarify(partially to myself) what my actual beliefs are.

NTK/GP Models of Neural Nets Can't Learn Features

I don't think taking linear combinations will help, because adding terms to the linear combination will also increase the magnitude of the original activation vector -- e.g. if you add together units, the magnitude of the sum of their original activations will with high probability be , dwarfing the O(1) change due to change in the activations. But regardless, it can't help with transfer learning at all, since the tangent kernel(which determines learning in this regime) doesn't change by definition.

What empirical results do you think are being contradicted? As far as I can tell, the empirical results we have are 'NTK/GP have similar performance to neural nets on some, but not all, tasks'. I don't think transfer/feature learning is addressed at all. You might say these results are suggestive evidence that NTK/GP captures everything important about neural nets, but this is precisely what is being disputed with the transfer learning arguments.

I can imagine doing an experiment where we find the 'empirical tangent kernel' of some finite neural net at initialization, solve the linear system, and then analyze the activations of the resulting network. But it's worth noting that this is not what is usually meant by 'NTK' -- that usually includes taking the infinite-width limit at the same time. And to the extent that we expect the activations to change at all, we no longer have reason to think that this linear system is a good approximation of SGD. That's what the above mathematical results mean -- the same mathematical analysis that implies that network training is like solving a linear system, *also* implies that the activations don't change at all.

NTK/GP Models of Neural Nets Can't Learn Features

The result that NTK does not learn features in the large N limit is not in dispute at all -- it's right there on page 15 of the original NTK paper, and indeed holds after arbitrarily many steps of backprop. I don't think that there's really much room for loopholes in the math here. See Greg Yang's paper for a lengthy proof that this holds for all architectures. Also worth noting that when people 'take the NTK limit' they often don't initialize an actual net at all, they instead use analytical expressions for what the inner product of the gradients would be at N=infinity to compute the kernel directly.

NTK/GP Models of Neural Nets Can't Learn Features

The asymmetry between the output function and the intermediate neuron functions comes from backprop -- from the fact that the gradients are backprop-ed through weight matrices with entries of magnitude O(). So the gradient of the output w.r.t itself is obviously 1, then the gradient of the output w.r.t each neuron in the preceding layer is O(), since you're just multiplying by a vector with those entries. Then by induction all other preceding layers' gradients are the sum of N random things of size O(1/N), and so are of size O() again. So taking a step of backprop will change the output function by O(1) but the intermediate functions by O(), vanishing in the large-width limit.

(This is kind of an oversimplification since it is possible to have changing intermediate functions while doing backprop, as mentioned in the linked paper. But this is the essence of why it's possible in *some* limits to move around using backprop without changing the intermediate neurons)

Recent Progress in the Theory of Neural Networks

Not sure if I agree regarding the real-world usefulness. For the non-IID case, PAC-Bayes bounds fail, and to re-instate them you'd need assumptions about how quickly the distribution changes, but then it's plausible that you could get high probability bounds based on the most recent performance. For small datasets, the PAC-Bayes bounds suffer because they scale as . (I may edit the post to be clearer about this)

Agreed that analyzing how the bounds change under different conditions could be insightful though. Ultimately I suspect that effective bounds will require powerful ways to extract 'the signal from the noise', and examining the signal will likely be useful for understanding if a model has truly learned what it is supposed to.

Understanding “Deep Double Descent”

The neural tangent kernel guys have a paper where they give a heuristic argument explaining the double descent curve(in number of parameters) using the NTK.

Understanding “Deep Double Descent”

Nice survey. The result about double descent even occurring in dataset size is especially surprising.

Regarding the 'sharp minima can generalize' paper, they show that there exist sharp minima with good generalization, not flat minima with poor generalization, so they don't rule out flatness as an explanation for the success of SGD. The sharp minima they construct with this property are also rather unnatural: essentially they multiply the weights of layer 1 by a constant and divide the weights of layer 2 by the same constant. The piecewise linearity of ReLU means the output function is unchanged. For large , the network is now highly sensitive to perturbations in layer 2. These solutions don't seem like they would be found by SGD, so it could still be that, for solutions found by SGD, flatness and generalization are correlated.

Self-supervised learning & manipulative predictions

No worries, I also missed the earlier posts when I wrote mine. There's lots of stuff on this website.

I endorse your rephrasing of example 1. I think my position is that it's just not that hard to create a "self-consistent probability distribution". For example, say you trained an RNN to predict sequences, like in this post. Despite being very simple, it already implicitly represents a probability distribution over sequences. If you train it with back-propagation on a confusing article involving pyrite, then its weights will be updated to try to model the article better. However, if "pyrite" itself was easy to predict, then the weights that lead to it outputting "pyrite" will *not* be updated. The same thing holds for modern Transformer networks, which predict the next token based only on what it has seen so far. (Here is a paper with a recent example using GPT-2. Note the degeneracy of maximum likelihood sampling, but how this becomes less of a problem when just sampling from the implied distribution)

I agree that this sort of manipulative prediction could be a problem in principle, but it does not seem to occur in recent ML systems. (Although, there are some things which are somewhat like this; the earlier paper I linked and mode collapse do involve neglecting high-entropy components of the distribution. However, the most straightforward generation and training schemes do not incentivize this)

For example 2, the point about gradient descent is this: while it might be the case that outputting "Help I'm stuck in a GPU Factory000" would ultimately result in a higher accuracy, the way the gradient is propagated would not encourage the agent to behave manipulatively. This is because, *locally*, "Help I'm stuck in a GPU Factory" decreases accuracy, so that behavior(or policies leading to it) will be dis-incentivized by gradient descent. It may be the case that this will result in easier predictions later, but the structure of the reward function does not lead to any optimization pressure towards such manipulative strategies. Learning taking place over high-level abstractions doesn't change anything, because any high-level abstractions leading to locally bad behavior will likewise be dis-incentivized by gradient descent

For reasons elaborated upon in this post and its comments, I'm kinda skeptical of these results. Basically the claims made are

(A) That the parameter->function map is "biased towards simple functions". It's important to distinguish simple --> large volume and large volume --> simple. Simple --> large volume is the property that Solomonoff induction has and what makes it universal, but large volume-->simple is what is proven in these papers(plus some empirical evidence of unclear import)

(B) SGD being equivalent to random selection. The evidence is empirical performance of Gaussian processes being similar to neural nets on simple tasks. But this may break down on more difficult problems(link is about the NTK, not GP, but they tend to perform similarly, indeed NTK usually performs better than GP)

Overall I think it's likely we'll need to actually analyze SGD in a non-kernel limit to get a satisfactory understanding of "what's really going on" with neural nets.