In machine learning, the vanishing gradient problem is encountered when training neural networks with gradient-based learning methods and backpropagation. In such methods, during each training iteration, each of the neural network's weights receives an update proportional to the partial derivative of the error function with respect to the current weight.[1] The problem is that as the network depth or sequence length increases, the gradient magnitude is typically expected to decrease (or grow uncontrollably), slowing the training process.[1] In the worst case, this may completely stop the neural network from further training.[1] As one example of the cause, traditional activation functions such as the hyperbolic tangent function have gradients in the range (0, 1], and backpropagation computes gradients by the chain rule. This has the effect of multiplying $n$ of these small numbers to compute the gradients of the early layers in an $n$-layer network, meaning that the gradient (error signal) decreases exponentially with $n$ while the early layers train very slowly.
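As a rough numerical illustration of the exponential decay described above, the sketch below multiplies several tanh derivatives together, as the chain rule would for a chain of scalar tanh units. The pre-activation value 1.5 and the function names are illustrative assumptions, not taken from the cited sources.

```python
import math

def tanh_grad(z):
    """Derivative of tanh at z; always in (0, 1]."""
    return 1.0 - math.tanh(z) ** 2

def chained_gradient(n_layers, z=1.5):
    """Product of n_layers tanh derivatives, all evaluated at the same
    (assumed) pre-activation z. Stands in for the chain-rule factor that
    reaches the earliest layer of an n-layer scalar chain."""
    grad = 1.0
    for _ in range(n_layers):
        grad *= tanh_grad(z)
    return grad

for n in (1, 5, 10, 20):
    print(n, chained_gradient(n))
```

Each extra layer multiplies the factor by another number below 1, so the printed values shrink geometrically with depth.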
Backpropagation allowed researchers to train supervised deep artificial neural networks from scratch, initially with little success. Hochreiter's Diplom thesis of 1991 formally identified the reason for this failure as the "vanishing gradient problem",[2] [3] which affects not only many-layered feedforward networks[4] but also recurrent networks.[5] The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time step of an input sequence processed by the network. (The combination of unfolding and backpropagation is termed backpropagation through time.)
When activation functions are used whose derivatives can take on larger values, one risks encountering the related exploding gradient problem.
This section is based on the paper "On the difficulty of training Recurrent Neural Networks" by Pascanu, Mikolov, and Bengio.
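A generic recurrent network has hidden states $h_1, h_2, \dots$, inputs $u_1, u_2, \dots$, and outputs $x_1, x_2, \dots$. Let it be parametrized by $\theta$, so that the system evolves as

$$(h_t, x_t) = F(h_{t-1}, u_t, \theta).$$

Often, the output $x_t$ is a function of $h_t$, say $x_t = G(h_t)$. The vanishing gradient problem already presents itself clearly when $x_t = h_t$, so the notation is simplified to that special case:

$$x_t = F(x_{t-1}, u_t, \theta).$$

Taking the differential and unrolling the recursion gives

$$dx_t = \nabla_\theta F(x_{t-1}, u_t, \theta)\,d\theta + \nabla_x F(x_{t-1}, u_t, \theta)\,dx_{t-1} = \left(\nabla_\theta F(x_{t-1}, u_t, \theta) + \nabla_x F(x_{t-1}, u_t, \theta)\,\nabla_\theta F(x_{t-2}, u_{t-1}, \theta) + \cdots\right)d\theta.$$

Training the network requires a loss function to be minimized. Let it be $L(x_T, u_1, \dots, u_T)$. Its differential is $dL = \nabla_x L(x_T, u_1, \dots, u_T)\,dx_T$, so gradient descent updates the parameters by

$$\Delta\theta = -\eta\left[\nabla_x L(x_T, u_1, \dots, u_T)\left(\nabla_\theta F(x_{T-1}, u_T, \theta) + \nabla_x F(x_{T-1}, u_T, \theta)\,\nabla_\theta F(x_{T-2}, u_{T-1}, \theta) + \cdots\right)\right]^{\mathsf T},$$

where $\eta$ is the learning rate.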
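The vanishing/exploding gradient problem appears because there are repeated multiplications of the form

$$\nabla_x F(x_{t-1}, u_t, \theta)\,\nabla_x F(x_{t-2}, u_{t-1}, \theta)\,\nabla_x F(x_{t-3}, u_{t-2}, \theta)\cdots$$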
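For a concrete example, consider a typical recurrent network defined by

$$x_t = F(x_{t-1}, u_t, \theta) = W_{rec}\,\sigma(x_{t-1}) + W_{in}\,u_t + b,$$

where $\theta = (W_{rec}, W_{in})$ is the network parameter, $\sigma$ is the sigmoid activation function, applied to each vector coordinate separately, and $b$ is the bias vector.

Then $\nabla_x F(x_{t-1}, u_t, \theta) = W_{rec}\,\mathrm{diag}(\sigma'(x_{t-1}))$, and so

$$\nabla_x F(x_{t-1}, u_t, \theta)\cdots\nabla_x F(x_{t-k}, u_{t-k+1}, \theta) = W_{rec}\,\mathrm{diag}(\sigma'(x_{t-1}))\cdots W_{rec}\,\mathrm{diag}(\sigma'(x_{t-k})).$$

Since $|\sigma'| \leq 1$, the operator norm of this product is bounded above by $\|W_{rec}\|^k$. So if $\|W_{rec}\| \leq \gamma$ for some $\gamma < 1$, then for large $k$ the product has operator norm at most $\gamma^k \to 0$. This is the prototypical vanishing gradient problem.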
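The bound can be checked numerically. The sketch below multiplies successive Jacobians $W_{rec}\,\mathrm{diag}(\sigma'(x))$ along a trajectory and records the Frobenius norm of the product, which is an upper bound on its operator norm. The 2×2 weight matrix, the constant input, the choice of tanh as the sigmoid-shaped activation, and the simplifications $W_{in} = I$, $b = 0$ are all assumptions made for illustration.

```python
import math

W_REC = [[0.5, 0.3],
         [-0.2, 0.6]]  # hypothetical weights; Frobenius norm is about 0.86

def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][m] * b[m][j] for m in range(2)) for j in range(2)]
            for i in range(2)]

def frobenius(a):
    """Frobenius norm, an upper bound on the operator norm."""
    return math.sqrt(sum(v * v for row in a for v in row))

def jacobian(x):
    """W_rec @ diag(sigma'(x)) with sigma = tanh, so |sigma'| <= 1."""
    d = [1.0 - math.tanh(v) ** 2 for v in x]
    return [[W_REC[i][j] * d[j] for j in range(2)] for i in range(2)]

def step(x, u):
    """x_t = W_rec sigma(x_{t-1}) + u_t (W_in = I and b = 0 assumed)."""
    s = [math.tanh(v) for v in x]
    return [sum(W_REC[i][j] * s[j] for j in range(2)) + u[i] for i in range(2)]

def jacobian_product_norms(steps, x0=(0.1, -0.2), u=(0.05, 0.05)):
    """Frobenius norms of dx_k/dx_0 for k = 1..steps."""
    x, prod, norms = list(x0), [[1.0, 0.0], [0.0, 1.0]], []
    for _ in range(steps):
        prod = matmul(jacobian(x), prod)  # newest Jacobian multiplies on the left
        x = step(x, u)
        norms.append(frobenius(prod))
    return norms

norms = jacobian_product_norms(30)
print(norms[0], norms[9], norms[29])
```

Because the Frobenius norm is submultiplicative, each recorded norm is bounded by the $k$-th power of the norm of `W_REC`, so the sequence is forced toward zero.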
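The effect of a vanishing gradient is that the network cannot learn long-range effects. Recall the gradient of the loss (with $t = T$):

$$\nabla_\theta L = \nabla_x L(x_T, u_1, \dots, u_T)\left(\nabla_\theta F(x_{t-1}, u_t, \theta) + \nabla_x F(x_{t-1}, u_t, \theta)\,\nabla_\theta F(x_{t-2}, u_{t-1}, \theta) + \cdots\right).$$

The components of $\nabla_\theta F(x, u, \theta)$ are just components of $\sigma(x)$ and $u$, so if $u_t, u_{t-1}, \dots$ are bounded, then $\|\nabla_\theta F(x_{t-k-1}, u_{t-k}, \theta)\|$ is also bounded by some $M > 0$, and so the terms in $\nabla_\theta L$ decay as $M\gamma^k$. This means that, effectively, $\nabla_\theta L$ is affected only by the first $O(\gamma^{-1})$ terms in the sum.

If $\gamma \geq 1$, the bound above no longer forces the terms to decay, and the analysis does not directly apply.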
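Following (Doya, 1993),[6] consider this one-neuron recurrent network with sigmoid activation:

$$x_{t+1} = (1 - \epsilon)\,x_t + \epsilon\,\sigma(w x_t + b) + \epsilon\,w' u_t.$$

At the small $\epsilon$ limit, the dynamics of the network become

$$\frac{dx}{dt} = -x(t) + \sigma(w x(t) + b) + w' u(t).$$

Consider first the autonomous case, with $u = 0$. Set $w = 5.0$ and vary $b$ in $[-3, -2]$. As $b$ decreases, the system has one stable point, then two stable points and one unstable point, and finally one stable point again. Explicitly, the fixed points lie on the curve

$$(x, b) = \left(x, \ln\left(\frac{x}{1-x}\right) - 5x\right).$$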
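The fixed-point curve can be explored numerically. The sketch below, which assumes the logistic sigmoid and $w = 5$, counts how many points $x$ in $(0, 1)$ satisfy $b = \ln(x/(1-x)) - 5x$ for a given $b$ by looking for sign changes on a grid. The function names and grid size are illustrative choices.

```python
import math

def bias_for_fixed_point(x, w=5.0):
    """Bias b at which x is a fixed point, i.e. x = sigma(w*x + b)
    for the logistic sigma: b = ln(x / (1 - x)) - w*x."""
    return math.log(x / (1.0 - x)) - w * x

def count_fixed_points(b, w=5.0, grid=10000):
    """Count sign changes of bias_for_fixed_point(x) - b over a grid on (0, 1)."""
    count, prev = 0, None
    for i in range(1, grid):
        g = bias_for_fixed_point(i / grid, w) - b
        if prev is not None and (g > 0) != (prev > 0):
            count += 1
        prev = g
    return count

for b in (-2.0, -2.5, -3.0):
    print(b, count_fixed_points(b))
```

With these settings the count changes from one to three and back to one as $b$ moves through the interval, matching the description of one stable point, then two stable points plus one unstable point, then one stable point.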
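Now consider $\frac{\Delta x(T)}{\Delta x(0)}$ and $\frac{\Delta x(T)}{\Delta b}$, where $T$ is large enough that the system has settled into one of the stable points.

If $(x(0), b)$ puts the system very close to an unstable point, then a tiny variation in $x(0)$ or $b$ would make $x(T)$ move from one stable point to the other. This makes $\frac{\Delta x(T)}{\Delta x(0)}$ and $\frac{\Delta x(T)}{\Delta b}$ both very large, a case of the exploding gradient.

If $(x(0), b)$ puts the system far from an unstable point, then a small variation in $x(0)$ would have no effect on $x(T)$, making $\frac{\Delta x(T)}{\Delta x(0)} = 0$, a case of the vanishing gradient.

Note that in this case,

$$\frac{\Delta x(T)}{\Delta b} \approx \frac{\partial x(T)}{\partial b} = \left(\frac{1}{x(T)(1 - x(T))} - 5\right)^{-1},$$

which neither decays to zero nor blows up to infinity as $T$ grows.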
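The two cases can be reproduced with a small simulation. The sketch below assumes the autonomous continuous-time dynamics $dx/dt = -x + \sigma(wx + b)$ with the logistic sigmoid and $w = 5$ (consistent with the fixed-point curve given earlier), integrates them with the forward Euler method, and estimates $\Delta x(T)/\Delta x(0)$ by a central finite difference. The step size, horizon, and perturbation size are arbitrary illustrative choices.

```python
import math

def sigma(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def settle(x0, b, w=5.0, dt=0.05, steps=4000):
    """Euler-integrate dx/dt = -x + sigma(w*x + b) with u = 0 and
    return x(T) for T = dt * steps."""
    x = x0
    for _ in range(steps):
        x += dt * (-x + sigma(w * x + b))
    return x

def sensitivity(x0, b, delta=1e-3):
    """Central finite-difference estimate of Delta x(T) / Delta x(0)."""
    return (settle(x0 + delta, b) - settle(x0 - delta, b)) / (2 * delta)

near = sensitivity(0.5, -2.5)  # x(0) sits on the unstable point
far = sensitivity(0.9, -2.5)   # x(0) sits well inside one basin
print(near, far)
```

Starting on the unstable point, the two perturbed runs end in different stable points, so the ratio is large; starting inside a basin, both runs end at the same stable point and the ratio is essentially zero.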
For the general case, the intuition still holds (see Figures 3, 4, and 5 of the paper by Pascanu, Mikolov, and Bengio).
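Continue using the above one-neuron network, fixing $w = 5$, $x(0) = 0.5$, $u(t) = 0$, and consider a loss function defined by $L(x(T)) = (0.855 - x(T))^2$. This produces a rather pathological loss landscape: as $b$ approaches $-2.5$ from above, the loss approaches zero, but as soon as $b$ crosses $-2.5$, the attractor basin changes and the loss jumps to about 0.50.

Consequently, attempting to train $b$ by gradient descent would hit a wall in the loss landscape and cause an exploding gradient.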
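The discontinuity in this loss landscape can be seen by evaluating the loss on either side of $b = -2.5$. The sketch below assumes the autonomous dynamics $dx/dt = -x + \sigma(5x + b)$ with the logistic sigmoid, integrated by forward Euler; the step size and horizon are illustrative choices.

```python
import math

def sigma(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def final_state(b, x0=0.5, w=5.0, dt=0.05, steps=8000):
    """Euler-integrate dx/dt = -x + sigma(w*x + b) from x(0) = x0."""
    x = x0
    for _ in range(steps):
        x += dt * (-x + sigma(w * x + b))
    return x

def loss(b):
    """L(x(T)) = (0.855 - x(T))^2 for the settled state x(T)."""
    return (0.855 - final_state(b)) ** 2

for b in (-2.4, -2.49, -2.51, -2.6):
    print(b, round(loss(b), 4))
```

Just above $-2.5$ the state settles near 0.855 and the loss is close to zero; just below, it settles near the other stable point and the loss is close to 0.5, so a finite-difference slope across the boundary is very large.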
To overcome this problem, several methods were proposed.
Batch normalization is a standard method for solving both the exploding and the vanishing gradient problems.[8] [9]
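A minimal sketch of the normalization step follows, for a single feature over one mini-batch. The input values are made up to sit in the saturated region of a logistic sigmoid, and the scale `gamma` and shift `beta` are left at their neutral values; a real layer would learn them and track running statistics, which this sketch omits.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch to zero mean and unit
    variance, then apply the scale gamma and shift beta."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

def mean_sigmoid_grad(values):
    """Average logistic-sigmoid derivative over the batch."""
    s = [1.0 / (1.0 + math.exp(-v)) for v in values]
    return sum(p * (1.0 - p) for p in s) / len(s)

pre_activations = [8.0, 9.5, 10.0, 11.0, 12.5]  # made-up, saturating inputs
normalized = batch_norm(pre_activations)
print(mean_sigmoid_grad(pre_activations))
print(mean_sigmoid_grad(normalized))
```

After normalization the pre-activations sit near zero, where the sigmoid derivative is far from its saturated tails, so the local gradient factor is much larger.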
One approach is Jürgen Schmidhuber's multi-level hierarchy of networks (1992), pre-trained one level at a time through unsupervised learning and then fine-tuned through backpropagation.[10] Here each level learns a compressed representation of the observations that is fed to the next level.
Similar ideas have been used in feed-forward neural networks for unsupervised pre-training to structure a neural network, making it first learn generally useful feature detectors. Then the network is trained further by supervised backpropagation to classify labeled data. The deep belief network model by Hinton et al. (2006) involves learning the distribution of a high-level representation using successive layers of binary or real-valued latent variables. It uses a restricted Boltzmann machine to model each new layer of higher-level features. Each new layer guarantees an increase in the lower bound of the log likelihood of the data, thus improving the model, if trained properly. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top-level feature activations.[11] Hinton reports that his models are effective feature extractors over high-dimensional, structured data.[12]
See main article: Long short-term memory.
Another technique particularly used for recurrent neural networks is the long short-term memory (LSTM) network of 1997 by Hochreiter & Schmidhuber.[13] In 2009, deep multidimensional LSTM networks demonstrated the power of deep learning with many nonlinear layers, by winning three ICDAR 2009 competitions in connected handwriting recognition, without any prior knowledge about the three different languages to be learned.[14] [15]
Hardware advances have meant that from 1991 to 2015, computer power (especially as delivered by GPUs) has increased around a million-fold, making standard backpropagation feasible for networks several layers deeper than when the vanishing gradient problem was recognized. Schmidhuber notes that this "is basically what is winning many of the image recognition competitions now", but that it "does not really overcome the problem in a fundamental way"[16] since the original models tackling the vanishing gradient problem by Hinton and others were trained on a Xeon processor, not GPUs.
One of the newest and most effective ways to resolve the vanishing gradient problem is with residual neural networks,[17] or ResNets (not to be confused with recurrent neural networks). ResNets refer to neural networks where skip connections or residual connections are part of the network architecture. These skip connections allow gradient information to pass through the layers, by creating "highways" of information, where the output of a previous layer/activation is added to the output of a deeper layer. This allows information from the earlier parts of the network to be passed to the deeper parts of the network, helping maintain signal propagation even in deeper networks. Skip connections are a critical component of what allowed successful training of deeper neural networks.
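The effect of a skip connection on the backward signal can be illustrated with a toy scalar stack. In the sketch below, each plain layer contributes a factor $f'(x)$ to the input gradient, while each residual layer $x + f(x)$ contributes $1 + f'(x)$. The layer $f(x) = \tanh(0.5x)$, the depth, and the input value are assumptions chosen only for illustration.

```python
import math

def layer(x, w=0.5):
    """A toy scalar layer f(x) = tanh(w * x); w is a made-up weight."""
    return math.tanh(w * x)

def layer_grad(x, w=0.5):
    """Derivative f'(x) = w * (1 - tanh(w * x)^2)."""
    return w * (1.0 - math.tanh(w * x) ** 2)

def input_gradient(depth, x=1.0, residual=False):
    """d(output)/d(input) for `depth` stacked toy layers, with or
    without an identity skip connection around each layer."""
    grad = 1.0
    for _ in range(depth):
        if residual:
            grad *= 1.0 + layer_grad(x)   # d/dx [x + f(x)]
            x = x + layer(x)
        else:
            grad *= layer_grad(x)         # d/dx f(x)
            x = layer(x)
    return grad

print(input_gradient(30), input_gradient(30, residual=True))
```

In this toy setting the identity path keeps every factor at or above 1, so the product cannot shrink toward zero the way the plain stack's product does.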
ResNets yielded lower training error (and test error) than their shallower counterparts simply by reintroducing outputs from shallower layers in the network to compensate for the vanishing data. Note that ResNets are an ensemble of relatively shallow nets and do not resolve the vanishing gradient problem by preserving gradient flow throughout the entire depth of the network – rather, they avoid the problem simply by constructing ensembles of many short networks together. (Ensemble by Construction[18])
Rectifiers such as ReLU suffer less from the vanishing gradient problem, because they only saturate in one direction.[19]
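The one-sided saturation can be seen by comparing derivatives directly. The sketch below evaluates the derivative of the logistic sigmoid and of ReLU at a few sample points; the sample points are arbitrary.

```python
import math

def sigmoid_grad(z):
    """Derivative of the logistic sigmoid; small for large |z|."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of max(0, z); the value at z == 0 is set to 0 by convention."""
    return 1.0 if z > 0 else 0.0

for z in (-6.0, -1.0, 1.0, 6.0):
    print(z, sigmoid_grad(z), relu_grad(z))
```

The sigmoid derivative is small at both ends, whereas the ReLU derivative stays at 1 for every positive input and is zero only on the negative side.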
Weight initialization is another approach that has been proposed to reduce the vanishing gradient problem in deep networks.
Kumar suggested that the distribution of initial weights should vary according to the activation function used, and proposed initializing the weights in networks with the logistic activation function using a Gaussian distribution with zero mean and a standard deviation of 3.6/sqrt(N), where N is the number of neurons in a layer.[20]
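A minimal sketch of this initialization rule follows. It draws zero-mean Gaussian weights with standard deviation 3.6/sqrt(N); the square layer shape, the fixed seed, and the function names are assumptions of the sketch rather than part of the cited proposal.

```python
import math
import random

def logistic_init_std(n):
    """Standard deviation 3.6 / sqrt(N) for a layer with N neurons."""
    return 3.6 / math.sqrt(n)

def init_square_layer(n, seed=0):
    """n x n weight matrix with zero-mean Gaussian entries.
    The square shape and the fixed seed are assumptions of this sketch."""
    rng = random.Random(seed)
    std = logistic_init_std(n)
    return [[rng.gauss(0.0, std) for _ in range(n)] for _ in range(n)]

weights = init_square_layer(100)
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
std = math.sqrt(sum((w - mean) ** 2 for w in flat) / len(flat))
print(round(logistic_init_std(100), 3), round(mean, 3), round(std, 3))
```

The printed sample mean and standard deviation give a quick check that the drawn weights follow the intended distribution.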
Recently, Yilmaz and Poli[21] performed a theoretical analysis of how gradients are affected by the mean of the initial weights in deep neural networks using the logistic activation function, and found that gradients do not vanish if the mean of the initial weights is set according to the formula max(-1, -8/N). This simple strategy allows networks with 10 or 15 hidden layers to be trained very efficiently and effectively using standard backpropagation.
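As a small worked example, the formula can be evaluated for a few layer sizes. The function name is hypothetical, and N is taken to be the same quantity as in the preceding initialization rule, which is an assumption of this sketch.

```python
def initial_weight_mean(n):
    """Mean of the initial weights given by max(-1, -8/N)."""
    return max(-1.0, -8.0 / n)

for n in (4, 8, 16, 64):
    print(n, initial_weight_mean(n))
```

For N up to 8 the mean is clipped at -1, and for larger N it moves toward zero in proportion to 1/N.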
Behnke relied only on the sign of the gradient (Rprop) when training his Neural Abstraction Pyramid[22] to solve problems like image reconstruction and face localization.
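The idea of using only the sign of the gradient can be sketched for a single scalar weight. The following is a simplified Rprop-style update, not Behnke's implementation: the step size grows while the gradient sign stays the same and shrinks when it flips. The growth and shrink factors 1.2 and 0.5 are commonly used defaults, and the toy objective is made up to have a very small gradient magnitude.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def rprop_minimize(grad_fn, w, steps=100, step0=0.1,
                   inc=1.2, dec=0.5, step_min=1e-6, step_max=50.0):
    """Simplified Rprop for one scalar weight: only the sign of the
    gradient is used, and the step size adapts to sign agreement."""
    step, prev = step0, 0
    for _ in range(steps):
        s = sign(grad_fn(w))
        if s * prev > 0:
            step = min(step * inc, step_max)   # same direction: speed up
        elif s * prev < 0:
            step = max(step * dec, step_min)   # overshoot: slow down
        w -= s * step
        prev = s
    return w

def tiny_grad(w):
    """Gradient of the toy objective f(w) = 1e-8 * (w - 3)^2."""
    return 2e-8 * (w - 3.0)

print(rprop_minimize(tiny_grad, w=0.0))
```

Because the update ignores the gradient's magnitude, the tiny scale of `tiny_grad` does not slow progress toward the minimum at 3.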
Neural networks can also be optimized by using a universal search algorithm on the space of the neural network's weights, e.g., random guessing or, more systematically, a genetic algorithm. This approach is not based on gradients and so avoids the vanishing gradient problem.[23]
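A minimal sketch of the random-guess variant follows. It samples weight vectors uniformly from a box and keeps the best one, never computing a gradient. The quadratic toy loss, the sampling range, and the trial count are made-up illustrative choices; a real network's loss would take their place.

```python
import random

def random_guess_search(loss_fn, dim, trials=2000, scale=5.0, seed=0):
    """Sample weight vectors uniformly from [-scale, scale]^dim and
    return the best vector found together with its loss."""
    rng = random.Random(seed)
    best_w, best_loss = None, float("inf")
    for _ in range(trials):
        w = [rng.uniform(-scale, scale) for _ in range(dim)]
        current = loss_fn(w)
        if current < best_loss:
            best_w, best_loss = w, current
    return best_w, best_loss

def toy_loss(w):
    """Made-up quadratic objective with its minimum at w = (1, -2)."""
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

best_w, best_loss = random_guess_search(toy_loss, dim=2)
print(best_w, best_loss)
```

The cost of such a search grows quickly with the number of weights, which is the usual trade-off for not relying on gradient information.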