Epistemic status: Confused, but trying to explain a concept that I previously thought I understood. I suspect much of what I wrote below is false.

Without proper care, the gradients in a very deep neural network tend to become either extremely large or extremely small. If the gradient is too large, the network parameters will be thrown completely off, possibly causing them to become NaN. If it is too small, the network will stop training entirely. This is called the vanishing and exploding gradients problem.

When I first learned about the vanishing gradients problem, I ended up getting a vague sense of why it occurs. In my head I visualized the sigmoid function.

I then imagined this being applied element-wise to an affine transformation. If we just look at one element, we can imagine it as the result of a dot product between some parameters and the previous layer's activations, with that number plugged in on the x-axis. On the far left and on the far right, the derivative of this function is very small. This means that if we take the partial derivative with respect to some parameter, it will end up being extremely (perhaps vanishingly) small.
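For concreteness, here is a tiny NumPy sketch of the picture I had in mind (the input values are arbitrary): the sigmoid derivative $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ peaks at $0.25$ at $x = 0$ and is nearly zero in both tails.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # sigma'(x) = sigma(x) * (1 - sigma(x))

# The derivative peaks at 0.25 near x = 0 and is vanishingly small in the tails.
for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"x = {x:6.1f}   sigma'(x) = {sigmoid_grad(x):.6f}")
```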

Now, I know the way that I was visualizing this was very wrong. There are a few mistakes I made:

1. This picture doesn't tell me anything about why the gradient "vanishes." It's just showing me a picture of where the gradients get small. Gradients also get small when they reach a local minimum. Does this mean that vanishing gradients are sometimes good?

2. I knew that gradient vanishing had something to do with the depth of a network, but I didn't see how the network being deep affected why the gradients got small. I had a rudimentary sense that each layer of sigmoid compounds the problem until there's no gradient left, but this was never presented to me in a precise way, so I just ignored it.

I now think I understand the problem a bit better, but maybe not a whole lot better.

(Note: I have gathered evidence that the vanishing gradient problem is not linked to sigmoids and put it in this comment. I will be glad to see evidence which proves I'm wrong on this one, but I currently believe this is evidence that machine learning professors are teaching it incorrectly).

First, the basics. Rather than describing the problem in full generality, I'll walk through a brief example. In particular, I'll show how a forward pass in a simple recurrent neural network enables a feedback effect to occur. We can then immediately see how gradient vanishing becomes a problem within this framework (no sigmoids necessary).

Imagine that there is some sequence of vectors defined via the following recursive definition,

$$h^{(t)} = W h^{(t-1)}.$$

This sequence of vectors can be identified as the sequence of hidden states of the network. Let $W$ admit an orthogonal eigendecomposition $W = Q \Lambda Q^\top$. We can then represent this repeated application of the weight matrix as

$$h^{(t)} = W^t h^{(0)} = Q \Lambda^t Q^\top h^{(0)},$$

where $\Lambda$ is a diagonal matrix containing the eigenvalues of $W$, and $Q$ is an orthogonal matrix. If we consider the eigenvalues, which are the diagonal entries of $\Lambda$, we can tell that the ones with magnitude less than one will decay exponentially towards zero, and the ones with magnitude greater than one will blow up exponentially towards infinity as $t$ grows.

Since $Q$ is orthogonal, the transformation $Q^\top h^{(0)}$ can be thought of as a rotation of the vector $h^{(0)}$, where each coordinate of the result is the projection of $h^{(0)}$ onto an eigenvector of $W$. Therefore, when $t$ is very large, as in the case of an unrolled recurrent network, this matrix computation ends up dominated by the components of $h^{(0)}$ that point in the same direction as the eigenvectors whose eigenvalues explode.
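To see this numerically, here is a small NumPy sketch (the matrix is made up for illustration: a symmetric $W$ with eigenvalues $1.1$ and $0.9$). The norm of $h^{(t)}$ grows roughly like $1.1^t$, because the component along the decaying eigenvector quickly becomes irrelevant.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up symmetric weight matrix with an orthogonal eigendecomposition
# W = Q diag(1.1, 0.9) Q^T, so its eigenvalues straddle 1.
Q, _ = np.linalg.qr(rng.normal(size=(2, 2)))  # random orthogonal matrix
W = Q @ np.diag([1.1, 0.9]) @ Q.T

h0 = np.ones(2)
for t in (10, 50, 100):
    h_t = np.linalg.matrix_power(W, t) @ h0   # h^(t) = W^t h^(0)
    print(t, np.linalg.norm(h_t))             # grows roughly like 1.1^t
```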

This is a problem because if an input vector ends up pointing in the direction of one of these exploding eigenvectors, the loss function may be very high, and the surrounding loss surface very steep. It turns out that in these regions, stochastic gradient descent may massively overshoot. If SGD overshoots, then we end up reversing all of the descent progress we had previously made towards a local minimum.

As Goodfellow et al. note, this error is relatively easy to avoid in the case of non-recurrent neural networks, because in that case the weights aren't shared between layers. However, in the case of vanilla recurrent neural networks, this problem is almost unavoidable. Bengio et al. showed that even when a simple network has a depth of just 10, this problem shows up with near certainty.

One way to mitigate the problem is to simply clip the gradients so that they can't reverse all of the descent progress made so far. This treats the symptom of exploding gradients, but doesn't fix the problem entirely, since the issue of blown-up or vanishing eigenvalues remains.
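For concreteness, here is a minimal sketch of gradient-norm clipping in NumPy (the gradient values and the threshold are made up for illustration; frameworks ship their own versions of this, e.g. PyTorch's torch.nn.utils.clip_grad_norm_).

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the gradient so its norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])            # an "exploding" gradient with norm 50
print(clip_by_norm(g, max_norm=5.0))   # same direction, norm scaled down to 5
```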

Therefore, in order to fix this problem, we need to fundamentally re-design the way that the gradients are backpropagated through time, motivating echo state networks, leaky units, skip connections, and LSTMs. I plan to one day go into all of these, but I first need to build up my skills in matrix calculus, which are currently quite poor.

Therefore, I intend to make the next post (and maybe a few more) about matrix calculus. Then perhaps I can revisit this topic and gain a deeper understanding.


This may be an idiosyncratic error of mine. See page 105 in these lecture notes to see where I first saw the problem of vanishing gradients described.

See section 10.7 in the Deep Learning Book for a fuller discussion of vanishing and exploding gradients.


Yay for learning matrix calculus! I’m eager to read and learn. Personally I’ve done very well in the class where we learned it, but I’d say I didn’t get it at a deep / useful level.

Great! I'll do my best to keep the post as informative as possible, and I'll try to get into it on a deep level.

If you're looking to improve your matrix calculus skills, I specifically recommend practicing tensor index notation and the Einstein summation convention. It will make neural networks much more pleasant, especially recurrent nets. (This may have been obvious already, but it's sometimes tough to tell what's useful when learning a subject.)

I think the problem with vanishing gradients is usually linked to repeated applications of the sigmoid activation function. The gradient in backpropagation is calculated from the chain rule, where each factor d\sigma/dz in the "chain" will always be less than one, and close to zero for large or small inputs. So for feed-forward networks, the problem is a little different from recurrent networks, which you describe.

The usual mitigation is to use ReLU activations, L2 regularization, and/or batch normalization.

A minor point: the gradient doesn't necessarily tend towards zero as you get closer to a local minimum; that depends on the higher order derivatives. Imagine a local minimum at the bottom of a funnel or spike, for instance - or a very spiky fractal-like landscape. On the other hand, a local minimum in a region with a small gradient is a desirable property, since it means small perturbations in the input data don't change the output much. But this point will be difficult to reach, since learning depends on the gradient...

(Thanks for the interesting analysis, I'm happy to discuss this but probably won't drop by regularly to check comments - feel free to email me at ketil at malde point org)

I think the problem with vanishing gradients is usually linked to repeated applications of the sigmoid activation function.

That's what I used to think too. :)

If you look at the post above, I even linked to the reason why I thought that. In particular, the vanishing gradient problem was taught as intrinsically related to the sigmoid function on page 105 of these lecture notes, which is where I initially learned about the problem.

However, I no longer think gradient vanishing is fundamentally linked to sigmoids or tanh activations.

I think that there is probably some confusion in terminology, and some people use the words differently than others. If we look in the Deep Learning Book, there are two sections that talk about the problem, namely section 8.2.5 and section 10.7, neither of which brings up sigmoids as being related (though they do bring up deep weight-sharing networks). Goodfellow et al. cite Sepp Hochreiter's 1991 thesis as the original document describing the issue, but unfortunately it's in German, so I cannot comment on whether it links the issue to sigmoids.

Currently, when I Ctrl-F "sigmoid" on the Wikipedia page for vanishing gradients, there are no mentions. There is a single subheader which states, "Rectifiers such as ReLU suffer less from the vanishing gradient problem, because they only saturate in one direction." However, the citation for this statement comes from this paper which mentions vanishing gradients only once and explicitly states,

We can see the model as an exponential number of linear models that share parameters (Nair and Hinton, 2010). Because of this linearity, gradients flow well on the active paths of neurons (there is no gradient vanishing effect due to activation non-linearities of sigmoid or tanh units)

(Note: I misread the quote above -- I'm still confused).

I think this is quite strong evidence that I was not taught the correct usage of vanishing gradients.

The usual mitigation is to use ReLU activations, L2 regularization, and/or batch normalization.

Interesting you say that. I actually wrote a post on rethinking batch normalization, and I no longer think it's justified to say that batch normalization simply mitigates vanishing gradients. The exact way that batch normalization works is a bit different, and it would be inaccurate to describe it as an explicit strategy to reduce vanishing gradients (although it may help; funnily enough, the original batch normalization paper says that with batchnorm they were able to train with sigmoids more easily).

A minor point: the gradient doesn't necessarily tend towards zero as you get closer to a local minimum, that depends on the higher order derivatives.

True. I had a sort of smooth loss function in my head.

I think this is quite strong evidence that I was not taught the correct usage of vanishing gradients.

I'm very confused. The way I'm reading the quote you provided, it says ReLU works better because it doesn't have the gradient vanishing effect that sigmoid and tanh have.

Interesting. I just re-read it and you are completely right. Well I wonder how that interacts with what I said above.

That proof of the instability of RNNs is very nice.

The version of the vanishing gradient problem I learned is simply that if you're updating weights proportional to the gradient, then if your average weight somehow ends up as 0.98, as you increase the number of layers your gradient, and therefore your update size, will shrink kind of like (0.98)^n, which is not the behavior you want it to have.

That proof of the instability of RNNs is very nice.

Great, thanks. It is adapted from Goodfellow et al.'s discussion of the topic, which I cite in the post.

The version of the vanishing gradient problem I learned is simply that if you're updating weights proportional to the gradient, then if your average weight somehow ends up as 0.98, as you increase the number of layers your gradient, and therefore your update size, will shrink kind of like (0.98)^n, which is not the behavior you want it to have.

That makes sense. However, Goodfellow et al. argue that this isn't a big issue for non-RNNs. Their discussion is a bit confusing to me so I'll just leave it below,

This problem is particular to recurrent networks. In the scalar case, imagine multiplying a weight $w$ by itself many times. The product $w^t$ will either vanish or explode depending on the magnitude of $w$. However, if we make a non-recurrent network that has a different weight $w^{(t)}$ at each time step, the situation is different. If the initial state is given by 1, then the state at time $t$ is given by $\prod_t w^{(t)}$. Suppose that the $w^{(t)}$ values are generated randomly, independently from one another, with zero mean and variance $v$. The variance of the product is $O(v^n)$. To obtain some desired variance $v^*$ we may choose the individual weights with variance $v = \sqrt[n]{v^*}$. Very deep feedforward networks with carefully chosen scaling can thus avoid the vanishing and exploding gradient problem, as argued by Sussillo (2014).
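To sanity-check the scalar argument in that quote, here is a tiny Python sketch (the depth and target variance are arbitrary choices of mine): a single shared weight raised to the $n$-th power vanishes or explodes, while independent zero-mean per-layer weights with variance $v = \sqrt[n]{v^*}$ give the product exactly the target variance $v^*$.

```python
n = 50            # depth of the unrolled network
v_star = 0.25     # desired variance of the product (arbitrary target)

# Recurrent case: one shared weight w multiplied by itself n times.
for w in (0.9, 1.1):
    print(f"shared weight {w}: w**{n} = {w ** n:.3e}")  # vanishes / explodes

# Non-recurrent case: a different zero-mean weight at each step, each with
# variance v. The product of n independent zero-mean weights has variance
# v**n, so choosing v = v_star**(1/n) hits the target exactly.
v = v_star ** (1.0 / n)
print(f"per-layer variance v = {v:.4f}, product variance v**n = {v ** n:.4f}")
```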