The LCA paper (to be summarized in AN #98) presents a method for understanding the contribution of specific updates to specific parameters to the overall loss. The basic idea is to decompose the overall change in training loss across training iterations:

$$L(\theta_T) - L(\theta_0) = \sum_{t=0}^{T-1} \left[ L(\theta_{t+1}) - L(\theta_t) \right]$$

And then to decompose each per-iteration change in loss across specific parameters:

$$L(\theta_{t+1}) - L(\theta_t) = \int_{\vec{\theta}_t}^{\vec{\theta}_{t+1}} \nabla L(\vec{\theta}) \cdot d\vec{\theta}$$

I've added vector arrows to emphasize that $\vec{\theta}$ is a vector and that we are taking a dot product. This is a path integral, but since gradients form a conservative field, we can choose any arbitrary path. We'll be choosing the linear path throughout. We can rewrite the integral as the dot product of the change in parameters and the average gradient:

$$L(\theta_{t+1}) - L(\theta_t) = (\vec{\theta}_{t+1} - \vec{\theta}_t) \cdot \overline{\nabla L}.$$

(This is pretty standard, but I've included a derivation at the end.)

Since this is a dot product, it decomposes into a sum over the individual parameters:

$$L(\theta_{t+1}) - L(\theta_t) = \sum_i \left(\theta_{t+1}^{(i)} - \theta_t^{(i)}\right) \overline{\nabla L}^{(i)}$$

So, for an individual parameter $i$ and an individual training step $t$, we can define its contribution to the change in loss as

$$A_t^{(i)} = \left(\theta_{t+1}^{(i)} - \theta_t^{(i)}\right) \overline{\nabla L}^{(i)}.$$

So based on this, I'm going to define my own version of LCA, called LCA-simple. Suppose the gradient computed at training iteration $t$ is $g_t$ (which is a vector). LCA-simple uses the approximation $\overline{\nabla L} \approx g_t$, giving $A_t^{(i)} = \left(\theta_{t+1}^{(i)} - \theta_t^{(i)}\right) g_t^{(i)}$. But the SGD update is given by $\theta_{t+1} = \theta_t - \eta g_t$ (where $\eta$ is the learning rate), which implies that $A_t^{(i)} = -\eta \left(g_t^{(i)}\right)^2$, which is always negative, i.e. it predicts that every parameter always learns in every iteration. This isn't surprising -- we decomposed the improvement in training into the movement of parameters along the gradient direction, but moving along the gradient direction is exactly what we do to train!
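
Here is a minimal sketch of this argument in code (my own illustration, not from the paper; the function name, learning rate, and toy quadratic loss are arbitrary choices):

```python
import numpy as np

def lca_simple_step(theta, grad, lr):
    """One plain-SGD step plus the LCA-simple allocation for each parameter.

    LCA-simple approximates the average gradient over the step with the
    single gradient computed at the start of the step.
    """
    theta_next = theta - lr * grad            # SGD update
    allocation = (theta_next - theta) * grad  # A_t^(i) = Δθ^(i) * g_t^(i)
    return theta_next, allocation             # equals -lr * grad**2 elementwise

# Toy quadratic loss L(θ) = 0.5 * ||θ||², so ∇L(θ) = θ.
theta = np.array([1.0, -2.0, 0.5])
theta_next, alloc = lca_simple_step(theta, grad=theta, lr=0.1)
print(alloc)               # [-0.1   -0.4   -0.025]
print(np.all(alloc <= 0))  # True: LCA-simple says every parameter "learns" every step
```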

Yet, the experiments in the paper sometimes show positive LCAs. What's up with that? There are a few differences between LCA-simple and the actual method used in the paper:

1. The training method is sometimes Adam or Momentum-SGD, instead of regular SGD.

2. LCA-simple approximates the average gradient with the training gradient, which is only calculated on a minibatch of data. LCA uses the loss on the full training dataset.

3. LCA-simple uses a point estimate of the gradient and assumes it is the average, which is like a first-order / linear Taylor approximation (which gets worse the larger your learning rate / step size is). LCA proper uses multiple gradient estimates between $\theta_t$ and $\theta_{t+1}$ to reduce the approximation error.

I think those are the only differences (though it's always hard to tell if there's some unmentioned detail that creates another difference), which means that whenever the paper says "these parameters had positive LCA", that effect can be attributed to some combination of the above 3 factors.
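
To see how the first difference alone can produce positive LCA, consider a toy 1-D sketch (again my own, not from the paper): with momentum, leftover velocity can push a parameter against its current gradient, so the allocation for that step is positive even if the average gradient is computed exactly.

```python
# Toy 1-D quadratic loss L(θ) = 0.5 * θ², so ∇L(θ) = θ and the exact
# average gradient over a step from a to b is (a + b) / 2.
loss = lambda th: 0.5 * th**2

mu, lr = 0.9, 0.5      # momentum coefficient and learning rate (arbitrary)
theta, v = -0.1, 1.0   # parameter just past the optimum, large leftover velocity

grad = theta                         # gradient at the current point
v = mu * v + grad                    # momentum buffer update
theta_next = theta - lr * v          # the velocity carries θ further from 0

avg_grad = (theta + theta_next) / 2  # exact average gradient for a quadratic
allocation = (theta_next - theta) * avg_grad

print(allocation)                      # ≈ 0.12 > 0: positive LCA for this parameter
print(loss(theta_next) - loss(theta))  # ≈ 0.12: matches the true increase in loss
```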

----

Derivation of turning the path integral into a dot product with an average:

$$\int_{\vec{\theta}_t}^{\vec{\theta}_{t+1}} \nabla L(\vec{\theta}) \cdot d\vec{\theta} = \int_0^1 \nabla L(\vec{\theta}(s)) \cdot (\vec{\theta}_{t+1} - \vec{\theta}_t) \, ds,$$

where $\vec{\theta}(s) = \vec{\theta}_t + s(\vec{\theta}_{t+1} - \vec{\theta}_t)$ parameterizes the linear path, so that $d\vec{\theta} = (\vec{\theta}_{t+1} - \vec{\theta}_t) \, ds$. Pulling the constant vector out of the integral gives

$$\int_0^1 \nabla L(\vec{\theta}(s)) \cdot (\vec{\theta}_{t+1} - \vec{\theta}_t) \, ds = (\vec{\theta}_{t+1} - \vec{\theta}_t) \cdot \overline{\nabla L},$$

where the average is defined as $\overline{\nabla L} = \int_0^1 \nabla L\big(\vec{\theta}_t + s(\vec{\theta}_{t+1} - \vec{\theta}_t)\big) \, ds$.
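
As a quick numerical check of this identity (and of the multi-point averaging from difference 3 above; the loss, endpoints, and number of sample points here are arbitrary choices of mine):

```python
import numpy as np

# A non-quadratic toy loss, so the average gradient has no obvious closed form.
loss = lambda th: np.sum(th**4) / 4
grad = lambda th: th**3

theta_t = np.array([1.0, -2.0])
theta_next = np.array([0.8, -1.5])
delta = theta_next - theta_t

# Approximate the average gradient by sampling points along the linear path.
# (The paper uses a more careful scheme; this is just the idea.)
avg_grad = np.mean([grad(theta_t + s * delta) for s in np.linspace(0, 1, 100)], axis=0)

print(delta @ avg_grad)                  # ≈ -2.88: sum of per-parameter allocations
print(loss(theta_next) - loss(theta_t))  # ≈ -2.88: the true change in loss
```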
