Residual neural network explained

A residual neural network (also referred to as a residual network or ResNet)[1] is a deep learning architecture in which the weight layers learn residual functions with reference to the layer inputs. It was developed in 2015 for image recognition and won that year's ImageNet Large Scale Visual Recognition Challenge (ILSVRC).[2] [3]

As a point of terminology, "residual connection" or "skip connection" refers to the specific architectural motif

x \mapsto f(x) + x

where f is an arbitrary neural network module. Residual connections had been used before ResNet, for example in the LSTM network[4] and the highway network.[5] However, the publication of ResNet made the motif widely popular, and it now appears in neural networks that are otherwise unrelated to ResNet.

The residual connection stabilizes the training and convergence of deep neural networks with hundreds of layers, and is a common motif in deep neural networks, such as Transformer models (e.g., BERT and GPT models such as ChatGPT), the AlphaGo Zero system, the AlphaStar system, and the AlphaFold system.

Mathematics

Residual connection

In a multi-layer neural network model, consider a subnetwork with a certain number of stacked layers (e.g., 2 or 3). Denote the underlying function performed by this subnetwork as H(x), where x is the input to the subnetwork. Residual learning re-parameterizes this subnetwork and lets the parameter layers represent a "residual function" F(x):=H(x)-x. The output y of this subnetwork is then represented as:

y = F(x) + x

The operation of "+ x" is implemented via a "skip connection" that performs an identity mapping to connect the input of the subnetwork with its output. This connection is referred to as a "residual connection" in later work. The function F(x) is often represented by matrix multiplication interlaced with activation functions and normalization operations (e.g., batch normalization or layer normalization). As a whole, one of these subnetworks is referred to as a "residual block". A deep residual network is constructed by simply stacking these blocks together.
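
As a concrete illustration, the following is a minimal sketch of a residual block in PyTorch. The class name ResidualBlock and the choice of two linear layers for F are illustrative assumptions, not details from the original paper.

    import torch
    from torch import nn

    class ResidualBlock(nn.Module):
        """Computes y = F(x) + x for an arbitrary residual function F."""
        def __init__(self, dim: int):
            super().__init__()
            # One possible choice of F: two linear layers with a non-linearity.
            self.residual_fn = nn.Sequential(
                nn.Linear(dim, dim),
                nn.ReLU(),
                nn.Linear(dim, dim),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # The skip connection adds the unchanged input back onto F(x).
            return self.residual_fn(x) + x

    x = torch.randn(4, 64)
    y = ResidualBlock(64)(x)  # same shape as x: (4, 64)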

Importantly, the underlying principle of residual blocks is also the principle of the original LSTM cell, a recurrent neural network that predicts an output at time t+1 as

y_{t+1} = F(x_t) + x_t

which becomes y = F(x) + x during backpropagation through time.[6]

Projection connection

If the function F is of type F: \R^n \to \R^m where n \neq m, then F(x) + x is undefined. To handle this special case, a projection connection is used:

y = F(x) + P(x)

where P is typically a linear projection, defined by P(x) = Mx, where M is an m \times n matrix. The matrix is trained by backpropagation, as is any other parameter of the model.
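
A minimal sketch of such a projection connection in PyTorch, assuming P is implemented as a bias-free nn.Linear layer; the class and attribute names are illustrative.

    import torch
    from torch import nn

    class ProjectionResidualBlock(nn.Module):
        """Computes y = F(x) + P(x) when F changes the dimensionality."""
        def __init__(self, in_dim: int, out_dim: int):
            super().__init__()
            self.residual_fn = nn.Linear(in_dim, out_dim)             # F: R^n -> R^m
            self.projection = nn.Linear(in_dim, out_dim, bias=False)  # P(x) = Mx, trainable

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.residual_fn(x) + self.projection(x)

    x = torch.randn(4, 64)
    y = ProjectionResidualBlock(64, 128)(x)  # shape (4, 128); a plain "+ x" would fail here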

Signal propagation

The introduction of identity mappings facilitates signal propagation in both forward and backward paths, as described below.[7]

Forward propagation

If the output of the \ell-th residual block is the input to the (\ell+1)-th residual block (assuming no activation function between blocks), then the (\ell+1)-th input is:

x_{\ell+1} = F(x_\ell) + x_\ell

Applying this formulation recursively, e.g.,

x_{\ell+2} = F(x_{\ell+1}) + x_{\ell+1} = F(x_{\ell+1}) + F(x_\ell) + x_\ell

yields the general relationship:

x_L = x_\ell + \sum_{i=\ell}^{L-1} F(x_i)

where L is the index of a later residual block and \ell is the index of some earlier block. This formulation suggests that there is always a signal that is directly sent from a shallower block \ell to a deeper block L.
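
This telescoping identity can be checked numerically. The sketch below stacks a few residual blocks with an arbitrary choice of F (torch.tanh is used purely for illustration):

    import torch

    def F(x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(x)  # stand-in for one residual branch

    x0 = torch.randn(8)
    xs = [x0]                          # xs[i] is the input to block i
    L = 5
    for _ in range(L):
        xs.append(F(xs[-1]) + xs[-1])  # x_{i+1} = F(x_i) + x_i

    # x_L equals the shallow input plus the sum of all residual branches along the way.
    reconstruction = xs[0] + sum(F(xs[i]) for i in range(L))
    print(torch.allclose(xs[L], reconstruction))  # True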

Backward propagation

The residual learning formulation provides the added benefit of mitigating the vanishing gradient problem to some extent. However, it is crucial to acknowledge that the vanishing gradient issue is not the root cause of the degradation problem, since it is already addressed through the use of normalization layers. To observe the effect of residual blocks on backpropagation, consider the partial derivative of a loss function \mathcal{L} with respect to some residual block input x_\ell. Using the equation above from forward propagation for a later residual block L > \ell:

\frac{\partial \mathcal{L}}{\partial x_\ell} = \frac{\partial \mathcal{L}}{\partial x_L}\frac{\partial x_L}{\partial x_\ell} = \frac{\partial \mathcal{L}}{\partial x_L}\left(1 + \frac{\partial}{\partial x_\ell}\sum_{i=\ell}^{L-1} F(x_i)\right) = \frac{\partial \mathcal{L}}{\partial x_L} + \frac{\partial \mathcal{L}}{\partial x_L}\frac{\partial}{\partial x_\ell}\sum_{i=\ell}^{L-1} F(x_i)

This formulation suggests that the gradient computation of a shallower layer, \frac{\partial \mathcal{L}}{\partial x_\ell}, always has the later term \frac{\partial \mathcal{L}}{\partial x_L} directly added to it. Even if the gradients of the F(x_i) terms are small, the total gradient \frac{\partial \mathcal{L}}{\partial x_\ell} resists vanishing thanks to the added term \frac{\partial \mathcal{L}}{\partial x_L}.
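
The effect can be observed with automatic differentiation. In the sketch below the residual branch is deliberately scaled to be tiny, yet the gradient at the shallow end stays near 1 because of the identity path; the scaling factor and depth are arbitrary choices for illustration.

    import torch

    x = torch.randn(8, requires_grad=True)
    h = x
    for _ in range(10):
        h = 1e-3 * torch.tanh(h) + h  # residual blocks with a very weak branch F

    h.sum().backward()
    print(x.grad)  # every entry is close to 1: the gradient does not vanish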

Variants of residual blocks

Basic block

A Basic Block is the simplest building block studied in the original ResNet. This block consists of two sequential 3x3 convolutional layers and a residual connection. The input and output dimensions of both layers are equal.
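
A minimal PyTorch sketch of a Basic Block, assuming equal input and output channels, stride 1, and the batch-normalization/ReLU placement of the original (post-activation) design:

    import torch
    from torch import nn

    class BasicBlock(nn.Module):
        """Two 3x3 convolutions plus an identity skip connection."""
        def __init__(self, channels: int):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return self.relu(out + x)  # add the skip, then apply the final non-linearity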

Bottleneck block

A Bottleneck Block consists of three sequential convolutional layers and a residual connection. The first layer in this block is a 1x1 convolution for dimension reduction (e.g., to 1/4 of the input dimension); the second layer performs a 3x3 convolution; the last layer is another 1x1 convolution for dimension restoration. The ResNet-50, ResNet-101, and ResNet-152 models in the original paper are all based on Bottleneck Blocks.
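
A corresponding sketch of a Bottleneck Block, again assuming stride 1 and an identity skip; the 4x reduction of the inner width follows the description above:

    import torch
    from torch import nn

    class BottleneckBlock(nn.Module):
        """1x1 reduction -> 3x3 convolution -> 1x1 restoration, plus the skip."""
        def __init__(self, channels: int, reduction: int = 4):
            super().__init__()
            inner = channels // reduction
            self.branch = nn.Sequential(
                nn.Conv2d(channels, inner, kernel_size=1, bias=False),
                nn.BatchNorm2d(inner), nn.ReLU(inplace=True),
                nn.Conv2d(inner, inner, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(inner), nn.ReLU(inplace=True),
                nn.Conv2d(inner, channels, kernel_size=1, bias=False),
                nn.BatchNorm2d(channels),
            )
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.relu(self.branch(x) + x)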

Pre-activation block

The Pre-activation Residual Block applies the activation functions (e.g., non-linearity and normalization) before applying the residual function F. Formally, the computation of a Pre-activation Residual Block can be written as:

x_{\ell+1} = F(\phi(x_\ell)) + x_\ell

where \phi can be any non-linearity activation (e.g., ReLU) or normalization (e.g., LayerNorm) operation. This design reduces the number of non-identity mappings between Residual Blocks. This design was used to train models with 200 to over 1000 layers.
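
A sketch of a Pre-activation Residual Block in PyTorch, choosing \phi to be batch normalization followed by ReLU (other choices are possible):

    import torch
    from torch import nn

    class PreActBlock(nn.Module):
        """Computes x_{l+1} = F(phi(x_l)) + x_l with a pure identity skip path."""
        def __init__(self, channels: int):
            super().__init__()
            self.phi1 = nn.Sequential(nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
            self.phi2 = nn.Sequential(nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Normalization and activation happen before the convolutions,
            # so nothing is applied after the addition.
            return self.conv2(self.phi2(self.conv1(self.phi1(x)))) + x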

Since GPT-2, the Transformer Blocks have been dominantly implemented as Pre-activation Blocks. This is often referred to as "pre-normalization" in the literature of Transformer models.[8]

Applications

All Transformer architectures include residual connections. Indeed, very deep Transformer models cannot be successfully trained without Residual Connections.[9]

The original Residual Network paper made no claim of being inspired by biological systems. However, later research has related ResNet to biologically plausible algorithms.[10] [11]

A study published in Science in 2023[12] disclosed the complete connectome of an insect brain (of a fruit fly larva). This study discovered "multilayer shortcuts" that resemble the skip connections in artificial neural networks, including ResNets.

History

Previous work

In 1961, Frank Rosenblatt described a three-layer multilayer perceptron (MLP) model with skip connections.[13] The model was referred to as a "cross-coupled system", and the skip connections were forms of cross-coupled connections.

During the late 1980s, "skip-layer" connections were sometimes used in neural networks.[14] [15] A 1988 paper[16] trained a fully connected feedforward network where each layer residually connects to all subsequent layers, like the later DenseNet (2016).

Degradation problem

Sepp Hochreiter discovered the vanishing gradient problem in 1991[17] and argued that it explained why the then-prevalent forms of recurrent neural networks did not work for long sequences. He and Schmidhuber later designed the long short-term memory (LSTM, 1997)[18] to solve this problem, which has a "cell state" c_t that can function as a generalized residual connection. The highway network (2015)[19] applied the idea of an LSTM unfolded in time to feedforward neural networks.

During the early days of deep learning, there were attempts to train increasingly deep models. Notable examples included AlexNet (2012), which had 8 layers, and VGG-19 (2014), which had 19 layers. However, stacking too many layers led to a steep reduction in training accuracy,[20] known as the "degradation" problem. In theory, adding additional layers to deepen a network should not result in a higher training loss, but this is exactly what happened with VGGNet. If the extra layers could be set as identity mappings, the deeper network would represent the same function as its shallower counterpart; this is the main idea behind residual learning, described in the mathematics section above. It is hypothesized that the optimizer is not able to approach identity mappings for the parameterized layers.

In 2014, the state of the art was training "very deep" neural networks with 20 to 30 layers. The ResNet team attempted to train deeper ones by empirically testing various tricks for training deeper networks until they discovered the deep residual network architecture.[21]

Subsequent work

DenseNet (2016)[22] connects the output of each layer to the input of each subsequent layer:

x_{\ell+1} = F(x_1, x_2, \dots, x_{\ell-1}, x_\ell)
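
A minimal sketch of this dense connectivity pattern, using concatenation of all earlier feature tensors as the input of each layer; the dimensions and the linear layers are illustrative choices, not the convolutional design of the DenseNet paper.

    import torch
    from torch import nn

    class DenseLayer(nn.Module):
        """Receives the concatenation of every earlier output."""
        def __init__(self, in_dim: int, growth: int):
            super().__init__()
            self.fn = nn.Sequential(nn.Linear(in_dim, growth), nn.ReLU())

        def forward(self, features: list) -> torch.Tensor:
            return self.fn(torch.cat(features, dim=-1))

    dim, growth = 16, 8
    layers = [DenseLayer(dim + i * growth, growth) for i in range(3)]
    features = [torch.randn(4, dim)]
    for layer in layers:
        features.append(layer(features))  # x_{l+1} = F(x_1, ..., x_l)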

Stochastic depth[23] is a training procedure made practical by residual network architectures: it randomly drops a subset of layers during training and lets the signal propagate through the identity skip connections. Also known as "DropPath", it is an effective regularization method for training large and deep models, such as the Vision Transformer (ViT).
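
A sketch of stochastic depth around a single residual branch; the drop probability and the rescaling of surviving branches during training (so that the expected output matches evaluation) are common conventions rather than details taken from the paper.

    import torch
    from torch import nn

    class StochasticDepthResidual(nn.Module):
        """Wraps a residual branch and randomly skips it during training."""
        def __init__(self, branch: nn.Module, drop_prob: float = 0.2):
            super().__init__()
            self.branch = branch
            self.drop_prob = drop_prob

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            if self.training and torch.rand(1).item() < self.drop_prob:
                return x  # the block collapses to the identity for this step
            # Rescale the surviving branch so the expected output is unchanged.
            scale = 1.0 / (1.0 - self.drop_prob) if self.training else 1.0
            return x + scale * self.branch(x)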

References

  1. He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (10 December 2015). "Deep Residual Learning for Image Recognition". arXiv:1512.03385.
  2. "ILSVRC2015 Results". image-net.org.
  3. Deng, Jia; Dong, Wei; Socher, Richard; Li, Li-Jia; Li, Kai; Fei-Fei, Li (2009). "ImageNet: A large-scale hierarchical image database". CVPR.
  4. Hochreiter, Sepp; Schmidhuber, Jürgen (1997). "Long short-term memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. PMID 9377276.
  5. Srivastava, Rupesh Kumar; Greff, Klaus; Schmidhuber, Jürgen (3 May 2015). "Highway Networks". arXiv:1505.00387 [cs.LG].
  6. Szegedy, Christian; Ioffe, Sergey; Vanhoucke, Vincent; Alemi, Alex (2016). "Inception-v4, Inception-ResNet and the impact of residual connections on learning". arXiv:1602.07261 [cs.CV].
  7. He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). "Identity Mappings in Deep Residual Networks". arXiv:1603.05027 [cs.CV].
  8. Radford, Alec; Wu, Jeffrey; Child, Rewon; Luan, David; Amodei, Dario; Sutskever, Ilya (14 February 2019). "Language models are unsupervised multitask learners". Archived at https://web.archive.org/web/20210206183945/https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
  9. Dong, Yihe; Cordonnier, Jean-Baptiste; Loukas, Andreas (2021). "Attention is not all you need: pure attention loses rank doubly exponentially with depth". arXiv:2103.03404 [cs.LG].
  10. Liao, Qianli; Poggio, Tomaso (2016). "Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex". arXiv:1604.03640.
  11. Xiao, Will; Chen, Honglin; Liao, Qianli; Poggio, Tomaso (2018). "Biologically-Plausible Learning Algorithms Can Scale to Large Datasets". arXiv:1811.03567.
  12. Winding, Michael; Pedigo, Benjamin; Barnes, Christopher; et al. (10 March 2023). "The connectome of an insect brain". Science. 379 (6636): eadd9330. doi:10.1126/science.add9330. PMID 36893230.
  13. Rosenblatt, Frank (1961). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms.
  14. Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (October 1986). "Learning representations by back-propagating errors". Nature. 323 (6088): 533–536. doi:10.1038/323533a0.
  15. Venables, W. N.; Ripley, Brian D. (1994). Modern Applied Statistics with S-Plus. Springer. pp. 261–262. ISBN 9783540943501.
  16. Lang, K. J. (1988). "Learning to tell two spirals apart". Proceedings of the 1988 Connectionist Models Summer School. pp. 52–59.
  17. Hochreiter, Sepp (1991). Untersuchungen zu dynamischen neuronalen Netzen (diploma thesis). Technical University Munich, Institute of Computer Science; advisor: J. Schmidhuber.
  18. Gers, Felix A.; Schmidhuber, Jürgen; Cummins, Fred (2000). "Learning to Forget: Continual Prediction with LSTM". Neural Computation. 12 (10): 2451–2471. doi:10.1162/089976600300015015.
  19. Srivastava, Rupesh Kumar; Greff, Klaus; Schmidhuber, Jürgen (22 July 2015). "Training Very Deep Networks". arXiv:1507.06228 [cs.LG].
  20. He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification". arXiv:1502.01852 [cs.CV].
  21. Linn, Allison (10 December 2015). "Microsoft researchers win ImageNet computer vision challenge". The AI Blog.
  22. Huang, Gao; Liu, Zhuang; van der Maaten, Laurens; Weinberger, Kilian (2016). "Densely Connected Convolutional Networks". arXiv:1608.06993.
  23. Huang, Gao; Sun, Yu; Liu, Zhuang; Weinberger, Kilian (2016). "Deep Networks with Stochastic Depth". arXiv:1603.09382.