History of artificial neural networks explained

Artificial neural networks (ANNs) are models created using machine learning to perform a number of tasks. Their design was inspired by biological neural circuitry.[1] While some of the computational implementations of ANNs relate to earlier discoveries in mathematics, the first implementation of ANNs was by psychologist Frank Rosenblatt, who developed the perceptron.[1] Little research was conducted on ANNs in the 1970s and 1980s, with the AAAI calling that period an "AI winter".

Later, advances in hardware and the development of the backpropagation algorithm, as well as recurrent neural networks and convolutional neural networks, renewed interest in ANNs. The 2010s saw the development of a deep neural network (a neural network with many layers) called AlexNet.[2] It greatly outperformed other image recognition models, and is thought to have launched the ongoing AI spring, further increasing interest in ANNs.[3] The transformer architecture was first described in 2017 as a method to teach ANNs grammatical dependencies in language,[4] and is the predominant architecture used by large language models such as GPT-4. Diffusion models were first described in 2015, and began to be used by image generation models such as DALL-E in the 2020s.

Perceptrons and other early neural networks

See main article: Perceptron. The simplest feedforward network consists of a single weight layer without activation functions. It would be just a linear map, and training it would be linear regression. Linear regression by the least squares method was used by Legendre (1805) and Gauss (1795) for the prediction of planetary movement.[5] [6] [7] [8]
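In modern notation (a minimal sketch rather than the historical formulation), such a single-layer network computes a linear map, and training it by least squares amounts to

\[
\hat{y} = \mathbf{w}^\top \mathbf{x} + b, \qquad
(\mathbf{w}^*, b^*) = \arg\min_{\mathbf{w},\, b} \sum_{i=1}^{N} \bigl( y_i - \mathbf{w}^\top \mathbf{x}_i - b \bigr)^2 ,
\]

which is the problem Legendre and Gauss solved for planetary prediction.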

Warren McCulloch and Walter Pitts[9] (1943) considered a non-learning computational model for neural networks.[10] This model paved the way for research to split into two approaches. One approach focused on biological processes while the other focused on the application of neural networks to artificial intelligence. This led to later work on nerve networks and their link to finite automata.[11]

In the late 1940s, D. O. Hebb[12] created a learning hypothesis based on the mechanism of neural plasticity that became known as Hebbian learning. Hebbian learning is unsupervised learning. This evolved into models for long-term potentiation. Researchers started applying these ideas to computational models in 1948 with Turing's B-type machines. Farley and Clark[13] (1954) first used computational machines, then called "calculators", to simulate a Hebbian network. Other neural network computational machines were created by Rochester, Holland, Habit and Duda (1956).[14] Rosenblatt[1] (1958) created the perceptron, an algorithm for pattern recognition. With mathematical notation, Rosenblatt described circuitry not in the basic perceptron, such as the exclusive-or circuit, which could not be processed by neural networks at the time. In 1959, a biological model proposed by Nobel laureates Hubel and Wiesel was based on their discovery of two types of cells in the primary visual cortex: simple cells and complex cells.[15]

Some say that research stagnated following the publication of Minsky and Papert's Perceptrons (1969).[16]

Frank Rosenblatt (1958)[17] proposed the perceptron, a multilayer perceptron (MLP) with three layers: an input layer, a hidden layer with randomized weights that did not learn, and an output layer. His 1962 book introduced variants and computer experiments, including a version with four-layer perceptrons where the last two layers have learned weights (and thus a proper multilayer perceptron).[18] Some consider that the 1962 book developed and explored all of the basic ingredients of the deep learning systems of today.[19]

The group method of data handling, a method to train arbitrarily deep neural networks, was published by Alexey Ivakhnenko and Lapa in 1967. They regarded it as a form of polynomial regression,[20] or a generalization of Rosenblatt's perceptron.[21] A 1971 paper described a deep network with eight layers trained by this method.[22]

The first deep learning multilayer perceptron trained by stochastic gradient descent[23] was published in 1967 by Shun'ichi Amari.[24] In computer experiments conducted by Amari's student Saito, a five-layer MLP with two modifiable layers learned internal representations to classify non-linearly separable pattern classes.[25] Subsequent developments in hardware and hyperparameter tuning have made end-to-end stochastic gradient descent the currently dominant training technique.

Backpropagation

See main article: Backpropagation.

Backpropagation is an efficient application of the chain rule, derived by Gottfried Wilhelm Leibniz in 1673,[26] to networks of differentiable nodes. The terminology "back-propagating errors" was introduced in 1962 by Rosenblatt,[27] but he did not know how to implement it, although Henry J. Kelley had a continuous precursor of backpropagation in 1960 in the context of control theory.[28] The modern form of backpropagation was developed multiple times in the early 1970s. The earliest published instance was Seppo Linnainmaa's master's thesis (1970).[29] [30] Paul Werbos developed it independently in 1971,[31] but had difficulty publishing it until 1982.[32] In 1986, David E. Rumelhart et al. popularized backpropagation.[33]
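As a minimal illustration of the idea (notation assumed here, not taken from the cited sources), consider a two-layer network with output \(\hat{y} = f(w_2 h)\), hidden activation \(h = g(w_1 x)\) and loss \(L(\hat{y})\). The chain rule gives

\[
\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial \hat{y}} \, f'(w_2 h) \, h, \qquad
\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}} \, f'(w_2 h) \, w_2 \, g'(w_1 x) \, x ,
\]

and backpropagation computes such gradients efficiently by reusing the shared factors, working backwards from the output layer.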

Recurrent network architectures

See main article: Recurrent neural network.

One origin of RNNs was statistical mechanics. The Ising model was developed by Wilhelm Lenz and Ernst Ising in the 1920s[34] as a simple statistical mechanical model of magnets at equilibrium. Glauber in 1963 studied the Ising model evolving in time, as a process towards equilibrium (Glauber dynamics), adding in the component of time.[35] Shun'ichi Amari in 1972 proposed to modify the weights of an Ising model by the Hebbian learning rule as a model of associative memory, adding in the component of learning.[36] This was popularized as the Hopfield network (1982).[37]
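In the standard modern formulation of such an associative memory (a sketch in the usual Hopfield-network notation, not Amari's original), a network of N binary units \(s_i \in \{-1, +1\}\) stores P patterns \(\xi^{\mu}\) via a Hebbian rule and updates its states by

\[
w_{ij} = \frac{1}{N} \sum_{\mu=1}^{P} \xi_i^{\mu} \xi_j^{\mu}, \qquad
s_i \leftarrow \operatorname{sign}\Bigl( \sum_{j} w_{ij} s_j \Bigr),
\]

so that the stored patterns become attractors of the dynamics.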

Another origin of RNNs was neuroscience. The word "recurrent" is used to describe loop-like structures in anatomy. In 1901, Cajal observed "recurrent semicircles" in the cerebellar cortex.[38] In 1933, Lorente de Nó discovered "recurrent, reciprocal connections" by Golgi's method, and proposed that excitatory loops explain certain aspects of the vestibulo-ocular reflex.[39] [40] Hebb considered the "reverberating circuit" as an explanation for short-term memory.[41] The McCulloch and Pitts paper (1943) considered neural networks that contain cycles, and noted that the current activity of such networks can be affected by activity indefinitely far in the past.[42]

Two early influential works were the Jordan network (1986) and the Elman network (1990), which applied RNNs to study cognitive psychology.

LSTM

Sepp Hochreiter's diploma thesis (1991)[44] proposed the neural history compressor, and identified and analyzed the vanishing gradient problem.[45] In 1993, a neural history compressor system solved a "Very Deep Learning" task that required more than 1000 subsequent layers in an RNN unfolded in time.[46] [47] Hochreiter proposed recurrent residual connections to solve the vanishing gradient problem. This led to the long short-term memory (LSTM), published in 1995. LSTM can learn "very deep learning" tasks[48] with long credit assignment paths that require memories of events that happened thousands of discrete time steps before. That LSTM was not yet the modern architecture, which also requires a "forget gate", introduced in 1999;[49] that version became the standard RNN architecture.

Long short-term memory (LSTM) networks were invented by Hochreiter and Schmidhuber in 1995 and set accuracy records in multiple application domains.[50] LSTM became the default choice of RNN architecture.

Around 2006, LSTM started to revolutionize speech recognition, outperforming traditional models in certain speech applications.[51] [52] LSTM also improved large-vocabulary speech recognition[53] [54] and text-to-speech synthesis,[55] and was used in Google voice search and dictation on Android devices.[56]

LSTM broke records for improved machine translation,[57] language modeling[58] and multilingual language processing.[59] LSTM combined with convolutional neural networks (CNNs) improved automatic image captioning.[60]

Convolutional neural networks (CNNs)

See main article: Convolutional neural network.

The origin of the CNN architecture is the "neocognitron"[61] introduced by Kunihiko Fukushima in 1980.[62] [63] It was inspired by work of Hubel and Wiesel in the 1950s and 1960s, which showed that cat visual cortices contain neurons that individually respond to small regions of the visual field. The neocognitron introduced the two basic types of layers in CNNs: convolutional layers and downsampling layers. A convolutional layer contains units whose receptive fields cover a patch of the previous layer. The weight vector (the set of adaptive parameters) of such a unit is often called a filter. Units can share filters. Downsampling layers contain units whose receptive fields cover patches of previous convolutional layers. Such a unit typically computes the average of the activations of the units in its patch. This downsampling helps to correctly classify objects in visual scenes even when the objects are shifted.
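The following NumPy sketch illustrates the two layer types (an illustration only, not Fukushima's implementation): one shared filter slid across the input, followed by 2x2 average downsampling.

```python
import numpy as np

def convolve2d(image, filt):
    """Valid 2-D convolution with a single shared filter (weight sharing)."""
    H, W = image.shape
    h, w = filt.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * filt)
    return out

def average_downsample(feature_map, size=2):
    """Downsampling unit: average the activations over non-overlapping patches."""
    H, W = feature_map.shape
    H, W = H - H % size, W - W % size
    fm = feature_map[:H, :W]
    return fm.reshape(H // size, size, W // size, size).mean(axis=(1, 3))

image = np.random.rand(8, 8)           # toy input "retina"
filt = np.array([[1.0, -1.0],          # one shared, edge-like filter
                 [1.0, -1.0]])
features = convolve2d(image, filt)     # convolutional layer
pooled = average_downsample(features)  # downsampling layer
```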

In 1969, Kunihiko Fukushima also introduced the ReLU (rectified linear unit) activation function.[64] [65] The rectifier has become the most popular activation function for CNNs and deep neural networks in general.[66]
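The rectifier simply clips negative inputs to zero,

\[
\operatorname{ReLU}(x) = \max(0, x),
\]

passing positive activations through unchanged.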

The time delay neural network (TDNN) was introduced in 1987 by Alex Waibel and was one of the first CNNs, as it achieved shift invariance.[67] It did so by utilizing weight sharing in combination with backpropagation training.[68] Thus, while also using a pyramidal structure as in the neocognitron, it performed a global optimization of the weights instead of a local one.[67]

In 1988, Wei Zhang et al. applied backpropagation to a CNN (a simplified Neocognitron with convolutional interconnections between the image feature layers and the last fully connected layer) for alphabet recognition. They also proposed an implementation of the CNN with an optical computing system.[69] [70]

In 1989, Yann LeCun et al. trained a CNN with the purpose of recognizing handwritten ZIP codes on mail. While the algorithm worked, training required 3 days.[71] Learning was fully automatic, performed better than manual coefficient design, and was suited to a broader range of image recognition problems and image types. Subsequently, Wei Zhang et al. modified their model by removing the last fully connected layer and applied it to medical image object segmentation in 1991[72] and breast cancer detection in mammograms in 1994.[73]

In 1990, Yamaguchi et al. introduced max-pooling, a fixed filtering operation that calculates and propagates the maximum value of a given region. They combined TDNNs with max-pooling to realize a speaker-independent isolated word recognition system.[74] In a variant of the neocognitron called the cresceptron, instead of using Fukushima's spatial averaging, J. Weng et al. also used max-pooling, where a downsampling unit computes the maximum of the activations of the units in its patch.[75] [76] [77] [78] Max-pooling is often used in modern CNNs.[79]
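A minimal NumPy comparison of the two downsampling rules discussed above (illustrative values, not code from the cited systems):

```python
import numpy as np

patch = np.array([[0.1, 0.9],
                  [0.4, 0.2]])   # activations in one 2x2 receptive field

avg_pooled = patch.mean()        # spatial averaging (neocognitron style) -> 0.4
max_pooled = patch.max()         # max-pooling (Yamaguchi, cresceptron)   -> 0.9
```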

LeNet-5, a 7-level CNN developed by Yann LeCun et al. in 1998[80] to classify digits, was applied by several banks to recognize handwritten numbers on checks digitized in 32x32 pixel images. The ability to process higher-resolution images requires larger CNNs with more layers, so this technique is constrained by the availability of computing resources.

In 2010, backpropagation training through max-pooling was accelerated by GPUs and shown to perform better than other pooling variants.[81] Behnke (2003) relied only on the sign of the gradient (Rprop)[82] for problems such as image reconstruction and face localization. Rprop is a first-order optimization algorithm created by Martin Riedmiller and Heinrich Braun in 1992.[83]

Deep learning

The deep learning revolution started around CNN- and GPU-based computer vision.

Although CNNs trained by backpropagation had been around for decades, and GPU implementations of NNs, including CNNs, for years,[84] faster implementations of CNNs on GPUs were needed to progress on computer vision. Later, as deep learning became widespread, specialized hardware and algorithm optimizations were developed specifically for deep learning.[85]

A key driver of the deep learning revolution was hardware advances, especially GPUs. Some early work dated back to 2004. In 2009, Raina, Madhavan, and Andrew Ng reported a 100M-parameter deep belief network trained on 30 Nvidia GeForce GTX 280 GPUs, an early demonstration of GPU-based deep learning. They reported up to 70 times faster training.[86]

In 2011, a CNN named DanNet[87] [88] by Dan Ciresan, Ueli Meier, Jonathan Masci, Luca Maria Gambardella, and Jürgen Schmidhuber achieved for the first time superhuman performance in a visual pattern recognition contest, outperforming traditional methods by a factor of 3.[89] It then won more contests.[90] [91] They also showed how max-pooling CNNs on GPUs improved performance significantly.[92]

Many discoveries were empirical and focused on engineering. For example, in 2011, Xavier Glorot, Antoine Bordes and Yoshua Bengio found that the ReLU worked better than the activation functions widely used before 2011.

In October 2012, AlexNet by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton[93] won the large-scale ImageNet competition by a significant margin over shallow machine learning methods. Further incremental improvements included the VGG-16 network by Karen Simonyan and Andrew Zisserman[94] and Google's Inceptionv3.[95]

The success in image classification was then extended to the more challenging task of generating descriptions (captions) for images, often as a combination of CNNs and LSTMs.[96] [97] [98]

In 2014, the state of the art was training "very deep neural networks" with 20 to 30 layers. Stacking too many layers led to a steep reduction in training accuracy,[99] known as the "degradation" problem.[100] In 2015, two techniques were developed concurrently to train very deep networks: the highway network[101] and the residual neural network (ResNet).[102] The ResNet research team empirically tested various tricks for training deeper networks until they discovered the deep residual network architecture.[103]
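In a residual block, as formulated in the ResNet paper, the stacked layers learn only a correction to the identity:

\[
\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x},
\]

so if the optimal mapping is close to the identity, the layers need only drive \(\mathcal{F}\) towards zero, which eases the optimization of very deep networks.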

Generative adversarial networks

See main article: Generative adversarial network.

In 1991, Juergen Schmidhuber published "artificial curiosity": two neural networks that contest with each other in a zero-sum game.[104] The first network is a generative model that models a probability distribution over output patterns. The second network learns by gradient descent to predict the reactions of the environment to these patterns. GANs can be regarded as a case where the environmental reaction is 1 or 0 depending on whether the first network's output is in a given set.[105] The approach was extended to "predictability minimization" to create disentangled representations of input patterns.[106] [107]

Other people had similar ideas but did not develop them similarly. An idea involving adversarial networks was published in a 2010 blog post by Olli Niemitalo.[108] This idea was never implemented and did not involve stochasticity in the generator and thus was not a generative model. It is now known as a conditional GAN or cGAN.[109] An idea similar to GANs was used to model animal behavior by Li, Gauci and Gross in 2013.[110]

Another inspiration for GANs was noise-contrastive estimation,[111] which uses the same loss function as GANs and which Goodfellow studied during his PhD in 2010–2014.

The generative adversarial network (GAN) by Ian Goodfellow et al. (2014)[112] became state of the art in generative modeling during the 2014–2018 period. Excellent image quality was achieved by Nvidia's StyleGAN (2018),[113] based on the Progressive GAN by Tero Karras et al.,[114] in which the GAN generator is grown from small to large scale in a pyramidal fashion. Image generation by GANs reached popular success and provoked discussions concerning deepfakes.[115] Diffusion models (2015)[116] have since eclipsed GANs in generative modeling, with systems such as DALL·E 2 (2022) and Stable Diffusion (2022).
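In the formulation of Goodfellow et al. (2014), the generator G and discriminator D play a two-player minimax game over

\[
\min_{G} \max_{D} \;
\mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}\!\left[ \log D(\mathbf{x}) \right]
+ \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}\!\left[ \log\bigl( 1 - D(G(\mathbf{z})) \bigr) \right],
\]

with D trying to distinguish real data from samples G(z) generated from noise z.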

Attention mechanism and Transformer

See main article: Attention (machine learning) and Transformer (deep learning architecture). Human selective attention had been studied in neuroscience and cognitive psychology.[117] Selective attention in audition was studied in the cocktail party effect (Colin Cherry, 1953).[118] Donald Broadbent (1958) proposed the filter model of attention.[119] Selective attention in vision was studied in the 1960s with George Sperling's partial report paradigm. It was also noticed that saccade control is modulated by cognitive processes, in that the eye moves preferentially towards areas of high salience. As the fovea of the eye is small, the eye cannot sharply resolve the entire visual field at once. The use of saccade control allows the eye to quickly scan important features of a scene.[120]

This research inspired algorithms, such as a variant of the neocognitron.[121] [122] Conversely, developments in neural networks inspired circuit models of biological visual attention.[123]

A key aspect of attention mechanism is the use of multiplicative operations, which had been studied under the names of higher-order neural networks,[124] multiplication units,[125] sigma-pi units,[126] fast weight controllers,[127] and hyper-networks.[128]

Recurrent attention

During the deep learning era, the attention mechanism was developed to solve similar problems in encoding-decoding.[129]

The idea of encoder-decoder sequence transduction had been developed in the early 2010s. The papers most commonly cited as the originators of seq2seq are two papers from 2014.[130] [131] A seq2seq architecture employs two RNNs, typically LSTMs, an "encoder" and a "decoder", for sequence transduction, such as machine translation. They became state of the art in machine translation, and were instrumental in the development of the attention mechanism and the Transformer.

An image captioning model was proposed in 2015, citing inspiration from the seq2seq model; it encoded an input image into a fixed-length vector.[132] Xu et al. (2015),[133] citing Bahdanau et al. (2014),[134] applied the attention mechanism as used in the seq2seq model to image captioning.

Transformer

One problem with seq2seq models was their use of recurrent neural networks, which are not parallelizable because both the encoder and the decoder process the sequence token by token. The decomposable attention model attempted to solve this problem by processing the input sequence in parallel before computing a "soft alignment matrix" ("alignment" is the terminology used by Bahdanau et al. 2014).

The idea of using the attention mechanism for self-attention, instead of in an encoder-decoder (cross-attention), was also proposed during this period, for example in differentiable neural computers and neural Turing machines.[135] It was termed intra-attention[136] in a model where an LSTM is augmented with a memory network as it encodes an input sequence.

These strands of development were combined in the Transformer architecture, published in Attention Is All You Need (2017). Subsequently, attention mechanisms were extended within the framework of the Transformer architecture.

Seq2seq models with attention still suffered from the same issue with recurrent networks, which is that they are hard to parallelize, which prevented them from being accelerated on GPUs. In 2016, decomposable attention applied the attention mechanism to a feedforward network, which is easy to parallelize.[137] One of its authors, Jakob Uszkoreit, suspected that attention without recurrence is sufficient for language translation, hence the title "attention is all you need".[138]

In 2017, the original (100M-parameter) encoder-decoder transformer model was proposed in the "Attention is all you need" paper. At the time, the focus of the research was on improving seq2seq for machine translation, by removing its recurrence to process all tokens in parallel, while preserving its dot-product attention mechanism to keep its text processing performance.[139] Its parallelizability was an important factor in its widespread use in large neural networks.[140]
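The dot-product attention retained by the Transformer is, in the paper's scaled form,

\[
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left( \frac{Q K^\top}{\sqrt{d_k}} \right) V ,
\]

where the query, key and value matrices Q, K, V cover all tokens of the sequence at once, which is what makes the computation parallelizable.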

Unsupervised and self-supervised learning

Self-organizing maps

See main article: Self-organizing map.

Self-organizing maps (SOMs) were described by Teuvo Kohonen in 1982.[141] [142] SOMs are neurophysiologically inspired[143] artificial neural networks that learn low-dimensional representations of high-dimensional data while preserving the topological structure of the data. They are trained using competitive learning.
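In Kohonen's formulation, each input x is compared with all weight vectors, the best-matching unit c is selected, and the weights of c and its map neighbours are moved towards the input:

\[
\mathbf{w}_i(t+1) = \mathbf{w}_i(t) + \alpha(t)\, h_{ci}(t)\, \bigl( \mathbf{x}(t) - \mathbf{w}_i(t) \bigr),
\]

where \(\alpha(t)\) is a decreasing learning rate and \(h_{ci}(t)\) is a neighbourhood function centred on the winning unit.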

SOMs create internal representations reminiscent of the cortical homunculus, a distorted representation of the human body, based on a neurological "map" of the areas and proportions of the human brain dedicated to processing sensory functions, for different parts of the body.

Boltzmann machines

During 1985–1995, inspired by statistical mechanics, several architectures and methods were developed by Terry Sejnowski, Peter Dayan, Geoffrey Hinton, and others, including the Boltzmann machine,[144] the restricted Boltzmann machine,[145] the Helmholtz machine,[146] and the wake-sleep algorithm.[147] These were designed for unsupervised learning of deep generative models. However, they were more computationally expensive than backpropagation. The Boltzmann machine learning algorithm, published in 1985, was briefly popular before being eclipsed by the backpropagation algorithm in 1986 (p. 112 [148]).

Geoffrey Hinton et al. (2006) proposed learning a high-level internal representation using successive layers of binary or real-valued latent variables with a restricted Boltzmann machine[149] to model each layer. This RBM is a generative stochastic feedforward neural network that can learn a probability distribution over its set of inputs. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top level feature activations.[150] [151]
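A restricted Boltzmann machine assigns an energy to each joint configuration of visible units v and hidden units h (standard formulation),

\[
E(\mathbf{v}, \mathbf{h}) = -\mathbf{a}^\top \mathbf{v} - \mathbf{b}^\top \mathbf{h} - \mathbf{v}^\top W \mathbf{h},
\qquad
p(\mathbf{v}, \mathbf{h}) \propto e^{-E(\mathbf{v}, \mathbf{h})},
\]

and because there are no connections within a layer, the conditional distributions \(p(\mathbf{h} \mid \mathbf{v})\) and \(p(\mathbf{v} \mid \mathbf{h})\) factorize, which makes layer-by-layer training tractable.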

Deep learning

In 2012, Andrew Ng and Jeff Dean created an FNN that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images taken from YouTube videos.[152]

Other aspects

Knowledge distillation

Knowledge distillation or model distillation is the process of transferring knowledge from a large model to a smaller one. The idea of using the output of one neural network to train another neural network was studied in the teacher-student network configuration.[153] In 1992, several papers studied the statistical mechanics of the teacher-student configuration, where both networks are committee machines[154] [155] or both are parity machines.[156]
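In modern practice, such teacher-student transfer is often implemented with softened teacher outputs. A minimal NumPy sketch of this common formulation follows (an illustration with made-up logits and temperature, not taken from the 1992 works cited above):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Softmax with temperature T; larger T gives a softer distribution."""
    z = logits / T
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits for one example (teacher = large model, student = small model).
teacher_logits = np.array([4.0, 1.0, 0.2])
student_logits = np.array([2.5, 0.5, 0.1])
T = 2.0                      # temperature softens the teacher's targets

p_teacher = softmax(teacher_logits, T)   # soft targets from the teacher
p_student = softmax(student_logits, T)   # student's current predictions

# Distillation loss: cross-entropy between the teacher's soft targets and the
# student's predictions; training would minimize this w.r.t. the student's weights.
distill_loss = -np.sum(p_teacher * np.log(p_student))
```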

Another early example of network distillation was also published in 1992, in the field of recurrent neural networks (RNNs). The problem was sequence prediction. It was solved by two RNNs. One of them (the "automatizer") predicted the sequence, and the other (the "chunker") predicted the errors of the automatizer. Simultaneously, the automatizer was trained to predict the internal states of the chunker. Once the automatizer managed to predict the chunker's internal states well, it would start fixing the errors, and soon the chunker became obsolete, leaving just one RNN in the end.[157]

A related methodology was model compression or pruning, where a trained network is reduced in size. It was inspired by neurobiological studies showing that the human brain is resistant to damage, and was studied in the 1980s, via methods such as Biased Weight Decay[158] and Optimal Brain Damage.[159]

Hardware-based designs

The development of metal–oxide–semiconductor (MOS) very-large-scale integration (VLSI), combining millions or billions of MOS transistors onto a single chip in the form of complementary MOS (CMOS) technology, enabled the development of practical artificial neural networks in the 1980s.[160]

Computational devices were created in CMOS, for both biophysical simulation and neuromorphic computing inspired by the structure and function of the human brain. Nanodevices[161] for very large scale principal components analyses and convolution may create a new class of neural computing because they are fundamentally analog rather than digital (even though the first implementations may use digital devices).[162]

Notes and References

  1. Rosenblatt. F.. 1958. The Perceptron: A Probabilistic Model For Information Storage And Organization In The Brain. Psychological Review. 65. 6. 386–408. 10.1.1.588.3775. 10.1037/h0042519. 13602029. 12781225 .
  2. Krizhevsky. Alex. Sutskever. Ilya. Hinton. Geoffrey E.. 2017-05-24. ImageNet classification with deep convolutional neural networks. Communications of the ACM. 60. 6. 84–90. 10.1145/3065386. 195908774. 0001-0782. free.
  3. Web site: The data that transformed AI research—and possibly the world. Dave. Gershgorn. Quartz. 26 July 2017 .
  4. Vaswani . Ashish . Shazeer . Noam . Parmar . Niki . Uszkoreit . Jakob . Jones . Llion . Gomez . Aidan N . Kaiser . Łukasz . Polosukhin . Illia . 2017 . Attention is All you Need . Advances in Neural Information Processing Systems . Curran Associates, Inc. . 30.
  5. Merriman, Mansfield. A List of Writings Relating to the Method of Least Squares: With Historical and Critical Notes. Vol. 4. Academy, 1877.
  6. Stigler . Stephen M. . 1981 . Gauss and the Invention of Least Squares . Ann. Stat. . 9 . 3 . 465–474 . 10.1214/aos/1176345451 . free.
  7. Book: Bretscher, Otto . Linear Algebra With Applications . Prentice Hall . 1995 . 3rd . Upper Saddle River, NJ.
  8. Book: Stigler, Stephen M. . Stephen Stigler . The History of Statistics: The Measurement of Uncertainty before 1900 . Harvard . 1986 . 0-674-40340-1 . Cambridge . registration.
  9. McCulloch. Warren. Walter Pitts. A Logical Calculus of Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics. 1943. 5. 4. 115–133. 10.1007/BF02478259. .
  10. News: Representation of Events in Nerve Nets and Finite Automata. Kleene. S.C.. Annals of Mathematics Studies. 17 June 2017. Princeton University Press. 1956. 34. 3–41.
  11. News: Representation of Events in Nerve Nets and Finite Automata. Kleene. S.C.. Annals of Mathematics Studies. 2017-06-17. Princeton University Press. 1956. 34. 3–41.
  12. Book: Hebb, Donald. The Organization of Behavior. Wiley. 1949. 978-1-135-63190-1. New York.
  13. Farley. B.G.. W.A. Clark. 1954. Simulation of Self-Organizing Systems by Digital Computer. IRE Transactions on Information Theory. 4. 4. 76–84. 10.1109/TIT.1954.1057468.
  14. Rochester. N.. J.H. Holland. L.H. Habit. W.L. Duda. 1956. Tests on a cell assembly theory of the action of the brain, using a large digital computer. IRE Transactions on Information Theory. 2. 3. 80–93. 10.1109/TIT.1956.1056810.
  15. Book: Brain and visual perception: the story of a 25-year collaboration. David H. Hubel and Torsten N. Wiesel. Oxford University Press US. 2005. 978-0-19-517618-6. 106.
  16. Book: Perceptrons: An Introduction to Computational Geometry. Minsky. Marvin. Papert. Seymour. MIT Press. 1969. 978-0-262-63022-1.
  17. Rosenblatt . F. . 1958 . The perceptron: A probabilistic model for information storage and organization in the brain. . Psychological Review . en . 65 . 6 . 386–408 . 10.1037/h0042519 . 13602029 . 1939-1471.
  18. Book: Rosenblatt, Frank . Frank Rosenblatt . Principles of Neurodynamics . Spartan, New York . 1962.
  19. Book: Tappert . Charles C. . 2019 International Conference on Computational Science and Computational Intelligence (CSCI) . IEEE . 2019 . 978-1-7281-5584-5 . 343–348 . Who Is the Father of Deep Learning? . 10.1109/CSCI49370.2019.00067 . 31 May 2021 . https://ieeexplore.ieee.org/document/9070967 . 216043128.
  20. Book: Ivakhnenko . A. G. . Cybernetics and Forecasting Techniques . Lapa . V. G. . American Elsevier Publishing Co. . 1967 . 978-0-444-00020-0.
  21. Ivakhnenko . A.G. . March 1970 . Heuristic self-organization in problems of engineering cybernetics . Automatica . en . 6 . 2 . 207–219 . 10.1016/0005-1098(70)90092-0.
  22. Ivakhnenko . Alexey . 1971 . Polynomial theory of complex systems . live . IEEE Transactions on Systems, Man, and Cybernetics . SMC-1 . 4 . 364–378 . 10.1109/TSMC.1971.4308320 . https://web.archive.org/web/20170829230621/http://www.gmdh.net/articles/history/polynomial.pdf . 2017-08-29 . 2019-11-05.
  23. Robbins . H. . Herbert Robbins . Monro . S. . 1951 . A Stochastic Approximation Method . The Annals of Mathematical Statistics . 22 . 3 . 400 . 10.1214/aoms/1177729586 . free.
  24. Amari . Shun'ichi . Shun'ichi Amari . 1967 . A theory of adaptive pattern classifier . IEEE Transactions . EC . 16 . 279–307.
  25. 2212.11279 . cs.NE . Jürgen . Schmidhuber . Jürgen Schmidhuber . Annotated History of Modern AI and Deep Learning . 2022.
  26. Book: Leibniz, Gottfried Wilhelm Freiherr von . The Early Mathematical Manuscripts of Leibniz: Translated from the Latin Texts Published by Carl Immanuel Gerhardt with Critical and Historical Notes (Leibniz published the chain rule in a 1676 memoir) . 1920 . Open court publishing Company . 9780598818461 . en.
  27. Book: Rosenblatt, Frank . Frank Rosenblatt . Principles of Neurodynamics . Spartan, New York . 1962.
  28. Kelley . Henry J. . Henry J. Kelley . 1960 . Gradient theory of optimal flight paths . ARS Journal . 30 . 10 . 947–954 . 10.2514/8.5282.
  29. Seppo . Linnainmaa . Seppo Linnainmaa . 1970 . Masters . The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors . fi . University of Helsinki . 6–7.
  30. Linnainmaa . Seppo . Seppo Linnainmaa . 1976 . Taylor expansion of the accumulated rounding error . BIT Numerical Mathematics . 16 . 2 . 146–160 . 10.1007/bf01931367 . 122357351.
  31. Book: Talking Nets: An Oral History of Neural Networks . 2000 . The MIT Press . 978-0-262-26715-1 . Anderson . James A. . en . 10.7551/mitpress/6626.003.0016 . Rosenfeld . Edward.
  32. Book: Werbos, Paul . Paul Werbos . System modeling and optimization . Springer . 1982 . 762–770 . Applications of advances in nonlinear sensitivity analysis . 2 July 2017 . http://werbos.com/Neural/SensitivityIFIPSeptember1981.pdf . https://web.archive.org/web/20160414055503/http://werbos.com/Neural/SensitivityIFIPSeptember1981.pdf . 14 April 2016 . live.
  33. Rumelhart . David E. . Hinton . Geoffrey E. . Williams . Ronald J. . October 1986 . Learning representations by back-propagating errors . Nature . en . 323 . 6088 . 533–536 . 10.1038/323533a0 . 1986Natur.323..533R . 1476-4687.
  34. Brush . Stephen G. . 1967 . History of the Lenz-Ising Model . Reviews of Modern Physics . 39 . 4 . 883–893 . 1967RvMP...39..883B . 10.1103/RevModPhys.39.883.
  35. Glauber . Roy J. . February 1963 . Roy J. Glauber "Time-Dependent Statistics of the Ising Model" . Journal of Mathematical Physics . 4 . 2 . 294–307 . 10.1063/1.1703954 . 2021-03-21.
  36. Amari . S.-I. . November 1972 . Learning Patterns and Pattern Sequences by Self-Organizing Nets of Threshold Elements . IEEE Transactions on Computers . C-21 . 11 . 1197–1206 . 10.1109/T-C.1972.223477 . 0018-9340.
  37. Hopfield . J. J. . 1982 . Neural networks and physical systems with emergent collective computational abilities . Proceedings of the National Academy of Sciences . 79 . 8 . 2554–2558 . 1982PNAS...79.2554H . 10.1073/pnas.79.8.2554 . 346238 . 6953413 . free.
  38. Espinosa-Sanchez . Juan Manuel . Gomez-Marin . Alex . de Castro . Fernando . 2023-07-05 . The Importance of Cajal's and Lorente de Nó's Neuroscience to the Birth of Cybernetics . The Neuroscientist . en . 10.1177/10738584231179932 . 1073-8584 . 37403768 . 10261/348372. free .
  39. de NÓ . R. Lorente . 1933-08-01 . Vestibulo-Ocular Reflex Arc . Archives of Neurology and Psychiatry . 30 . 2 . 245 . 10.1001/archneurpsyc.1933.02240140009001 . 0096-6754.
  40. Larriva-Sahd . Jorge A. . 2014-12-03 . Some predictions of Rafael Lorente de Nó 80 years later . Frontiers in Neuroanatomy . 8 . 147 . 10.3389/fnana.2014.00147 . 1662-5129 . 4253658 . 25520630 . free.
  41. Web site: reverberating circuit . 2024-07-27 . Oxford Reference.
  42. McCulloch . Warren S. . Pitts . Walter . December 1943 . A logical calculus of the ideas immanent in nervous activity . The Bulletin of Mathematical Biophysics . 5 . 4 . 115–133 . 10.1007/BF02478259 . 0007-4985.
  43. Book: Schmidhuber, Jürgen . [ftp://ftp.idsia.ch/pub/juergen/habilitation.pdf Habilitation thesis: System modeling and optimization ]. 1993. Page 150 ff demonstrates credit assignment across the equivalent of 1,200 layers in an unfolded RNN.
  44. S. Hochreiter., "Untersuchungen zu dynamischen neuronalen Netzen". . Diploma thesis. Institut f. Informatik, Technische Univ. Munich. Advisor: J. Schmidhuber, 1991.
  45. Book: Hochreiter, S. . A Field Guide to Dynamical Recurrent Networks . 15 January 2001 . John Wiley & Sons . 978-0-7803-5369-5 . Kolen . John F. . Gradient flow in recurrent nets: the difficulty of learning long-term dependencies . etal . Kremer . Stefan C. . .
  46. Schmidhuber . Jürgen . 1992 . [ftp://ftp.idsia.ch/pub/juergen/chunker.pdf Learning complex, extended sequences using the principle of history compression (based on TR FKI-148, 1991) ]. Neural Computation . 4 . 2 . 234–242 . 10.1162/neco.1992.4.2.234 . 18271205.
  47. Book: Schmidhuber, Jürgen . [ftp://ftp.idsia.ch/pub/juergen/habilitation.pdf Habilitation thesis: System modeling and optimization ]. 1993. Page 150 ff demonstrates credit assignment across the equivalent of 1,200 layers in an unfolded RNN.
  48. Schmidhuber . J. . 2015 . Deep Learning in Neural Networks: An Overview . Neural Networks . 61 . 85–117 . 1404.7828 . 10.1016/j.neunet.2014.09.003 . 25462637 . 11715509.
  49. Book: Gers . Felix . 9th International Conference on Artificial Neural Networks: ICANN '99 . Schmidhuber . Jürgen . Cummins . Fred . 1999 . 0-85296-721-7 . 1999 . 850–855 . Learning to forget: Continual prediction with LSTM . 10.1049/cp:19991218.
  50. Hochreiter . Sepp . Sepp Hochreiter . Schmidhuber . Jürgen . 1997-11-01 . Long Short-Term Memory . Neural Computation . 9 . 8 . 1735–1780 . 10.1162/neco.1997.9.8.1735 . 9377276 . 1915014.
  51. Graves . Alex . Schmidhuber . Jürgen . 2005-07-01 . Framewise phoneme classification with bidirectional LSTM and other neural network architectures . Neural Networks . IJCNN 2005 . 18 . 5 . 602–610 . 10.1.1.331.5800 . 10.1016/j.neunet.2005.06.042 . 16112549 . 1856462.
  52. Fernández . Santiago . Graves . Alex . Schmidhuber . Jürgen . 2007 . An Application of Recurrent Neural Networks to Discriminative Keyword Spotting . ICANN'07 . Berlin, Heidelberg . Springer-Verlag . 220–229 . 978-3-540-74693-5 . Proceedings of the 17th International Conference on Artificial Neural Networks.
  53. Web site: Sak . Haşim . Senior . Andrew . Beaufays . Françoise . 2014 . Long Short-Term Memory recurrent neural network architectures for large scale acoustic modeling . Google Research.
  54. 1410.4281 . cs.CL . Xiangang . Li . Xihong . Wu . Constructing Long Short-Term Memory based Deep Recurrent Neural Networks for Large Vocabulary Speech Recognition . 2014-10-15.
  55. Fan . Bo . Wang . Lijuan . Soong . Frank K. . Xie . Lei . 2015 . Photo-Real Talking Head with Deep Bidirectional LSTM . 4884–8 . 10.1109/ICASSP.2015.7178899 . 978-1-4673-6997-8 . Proceedings of ICASSP 2015 IEEE International Conference on Acoustics, Speech and Signal Processing.
  56. Web site: Sak . Haşim . Senior . Andrew . Rao . Kanishka . Beaufays . Françoise . Schalkwyk . Johan . September 2015 . Google voice search: faster and more accurate .
  57. Sutskever . Ilya . Vinyals . Oriol . Le . Quoc V. . 2014 . Sequence to Sequence Learning with Neural Networks . Electronic Proceedings of the Neural Information Processing Systems Conference . 27 . 5346 . 1409.3215 . 2014arXiv1409.3215S.
  58. 1602.02410 . cs.CL . Rafal . Jozefowicz . Oriol . Vinyals . Exploring the Limits of Language Modeling . 2016-02-07 . Schuster . Mike . Shazeer . Noam . Wu . Yonghui.
  59. 1512.00103 . cs.CL . Dan . Gillick . Cliff . Brunk . Multilingual Language Processing From Bytes . 2015-11-30 . Vinyals . Oriol . Subramanya . Amarnag.
  60. 1411.4555 . cs.CV . Oriol . Vinyals . Alexander . Toshev . Show and Tell: A Neural Image Caption Generator . 2014-11-17 . Bengio . Samy . Erhan . Dumitru.
  61. Fukushima . K. . 2007 . Neocognitron . Scholarpedia . 2 . 1 . 1717 . 10.4249/scholarpedia.1717 . 2007SchpJ...2.1717F . free.
  62. Fukushima . Kunihiko . Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position . Biological Cybernetics . 1980 . 36 . 4 . 193–202 . 16 November 2013 . 10.1007/BF00344251 . 7370364 . 206775608.
  63. Yann . LeCun . Yoshua . Bengio . Geoffrey . Hinton . Deep learning . Nature . 521 . 7553 . 2015 . 436–444 . 10.1038/nature14539 . 26017442 . 2015Natur.521..436L . 3074096.
  64. K. . Fukushima . Visual feature extraction by a multilayered network of analog threshold elements . IEEE Transactions on Systems Science and Cybernetics . 5 . 4 . 1969 . 322–333 . 10.1109/TSSC.1969.300225.
  65. 2212.11279 . cs.NE . Juergen . Schmidhuber . Juergen Schmidhuber . Annotated History of Modern AI and Deep Learning . 2022.
  66. Ramachandran . Prajit . Barret . Zoph . Quoc . V. Le . October 16, 2017 . Searching for Activation Functions . 1710.05941 . cs.NE.
  67. Phoneme Recognition Using Time-Delay Neural Networks . Waibel . Alex . December 1987 . Tokyo, Japan . Meeting of the Institute of Electrical, Information and Communication Engineers (IEICE).
  68. Alexander Waibel.
  69. Zhang . Wei . 1988 . Shift-invariant pattern recognition neural network and its optical architecture . Proceedings of Annual Conference of the Japan Society of Applied Physics.
  70. Zhang . Wei . 1990 . Parallel distributed processing model with local space-invariant interconnections and its optical architecture . Applied Optics . 29 . 32 . 4790–7 . 10.1364/AO.29.004790 . 20577468 . 1990ApOpt..29.4790Z.
  71. LeCun et al., "Backpropagation Applied to Handwritten Zip Code Recognition," Neural Computation, 1, pp. 541–551, 1989.
  72. Zhang . Wei . 1991 . Image processing of human corneal endothelium based on a learning network . Applied Optics . 30 . 29 . 4211–7 . 10.1364/AO.30.004211 . 20706526 . 1991ApOpt..30.4211Z.
  73. Zhang . Wei . 1994 . Computerized detection of clustered microcalcifications in digital mammograms using a shift-invariant artificial neural network . Medical Physics . 21 . 4 . 517–24 . 10.1118/1.597177 . 8058017 . 1994MedPh..21..517Z.
  74. A Neural Network for Speaker-Independent Isolated Word Recognition . Yamaguchi . Kouichi . Sakamoto . Kenji . Akabane . Toshio . Fujimoto . Yoshiji . November 1990 . Kobe, Japan . First International Conference on Spoken Language Processing (ICSLP 90) . 2019-09-04 . 2021-03-07 . https://web.archive.org/web/20210307233750/https://www.isca-speech.org/archive/icslp_1990/i90_1077.html . dead .
  75. J. Weng, N. Ahuja and T. S. Huang, "Cresceptron: a self-organizing neural network which grows adaptively," Proc. International Joint Conference on Neural Networks, Baltimore, Maryland, vol I, pp. 576–581, June, 1992.
  76. J. Weng, N. Ahuja and T. S. Huang, "Learning recognition and segmentation of 3-D objects from 2-D images," Proc. 4th International Conf. Computer Vision, Berlin, Germany, pp. 121–128, May, 1993.
  77. J. Weng, N. Ahuja and T. S. Huang, "Learning recognition and segmentation using the Cresceptron," International Journal of Computer Vision, vol. 25, no. 2, pp. 105–139, Nov. 1997.
  78. Book: J . Weng . N . Ahuja . TS . Huang . 1993 (4th) International Conference on Computer Vision . Learning recognition and segmentation of 3-D objects from 2-D images . 8619176 . 1993 . 121–128 . 10.1109/ICCV.1993.378228 . 0-8186-3870-2.
  79. Schmidhuber . Jürgen . Deep Learning . Scholarpedia . 2015 . 10 . 11 . 1527–54 . 16764513 . 10.1162/neco.2006.18.7.1527 . 10.1.1.76.1541 . 2309950.
  80. LeCun . Yann . Léon Bottou . Yoshua Bengio . Patrick Haffner . Gradient-based learning applied to document recognition . Proceedings of the IEEE . 1998 . 86 . 11 . 2278–2324 . October 7, 2016 . 10.1109/5.726791 . 10.1.1.32.9552. 14542261 .
  81. Dominik Scherer, Andreas C. Müller, and Sven Behnke: "Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition," In 20th International Conference Artificial Neural Networks (ICANN), pp. 92–101, 2010. .
  82. Book: Hierarchical Neural Networks for Image Interpretation.. Sven Behnke. Springer. 2003. Lecture Notes in Computer Science. 2766.
  83. Martin Riedmiller und Heinrich Braun: Rprop – A Fast Adaptive Learning Algorithm. Proceedings of the International Symposium on Computer and Information Science VII, 1992
  84. Oh . K.-S. . Jung . K. . 2004 . GPU implementation of neural networks . Pattern Recognition . 37 . 6 . 1311–1314 . 2004PatRe..37.1311O . 10.1016/j.patcog.2004.01.013.
  85. 1703.09039 . cs.CV . Vivienne . Sze . Yu-Hsin . Chen . Vivienne Sze . Efficient Processing of Deep Neural Networks: A Tutorial and Survey . Yang . Tien-Ju . Emer . Joel . 2017.
  86. Book: Raina . Rajat . Madhavan . Anand . Ng . Andrew Y. . Large-scale deep unsupervised learning using graphics processors . 2009-06-14 . Proceedings of the 26th Annual International Conference on Machine Learning . https://doi.org/10.1145/1553374.1553486 . ICML '09 . New York, NY, USA . Association for Computing Machinery . 873–880 . 10.1145/1553374.1553486 . 978-1-60558-516-1.
  87. Cireşan . Dan Claudiu . Meier . Ueli . Gambardella . Luca Maria . Schmidhuber . Jürgen . 21 September 2010 . Deep, Big, Simple Neural Nets for Handwritten Digit Recognition . Neural Computation . 22 . 12 . 3207–3220 . 1003.0358 . 10.1162/neco_a_00052 . 0899-7667 . 20858131 . 1918673.
  88. Ciresan . D. C. . Meier . U. . Masci . J. . Gambardella . L.M. . Schmidhuber . J. . 2011 . Flexible, High Performance Convolutional Neural Networks for Image Classification . live . International Joint Conference on Artificial Intelligence . 10.5591/978-1-57735-516-8/ijcai11-210 . https://web.archive.org/web/20140929094040/http://ijcai.org/papers11/Papers/IJCAI11-210.pdf . 2014-09-29 . 2017-06-13.
  89. Schmidhuber . J. . 2015 . Deep Learning in Neural Networks: An Overview . Neural Networks . 61 . 85–117 . 1404.7828 . 10.1016/j.neunet.2014.09.003 . 25462637 . 11715509.
  90. Book: Ciresan . Dan . Advances in Neural Information Processing Systems 25 . Giusti . Alessandro . Gambardella . Luca M. . Schmidhuber . Jürgen . 2012 . Curran Associates, Inc. . Pereira . F. . 2843–2851 . 2017-06-13 . Burges . C. J. C. . Bottou . L. . Weinberger . K. Q. . https://web.archive.org/web/20170809081713/http://papers.nips.cc/paper/4741-deep-neural-networks-segment-neuronal-membranes-in-electron-microscopy-images.pdf . 2017-08-09 . live.
  91. Book: Ciresan . D. . Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013 . Giusti . A. . Gambardella . L.M. . Schmidhuber . J. . 2013 . 978-3-642-38708-1 . Lecture Notes in Computer Science . 7908 . 411–418 . Mitosis Detection in Breast Cancer Histology Images with Deep Neural Networks . 10.1007/978-3-642-40763-5_51 . 24579167 . Pt 2.
  92. Book: Ciresan . D. . 2012 IEEE Conference on Computer Vision and Pattern Recognition . Meier . U. . Schmidhuber . J. . 2012 . 978-1-4673-1228-8 . 3642–3649 . Multi-column deep neural networks for image classification . 10.1109/cvpr.2012.6248110 . 1202.2745 . 2161592.
  93. Krizhevsky . Alex . Sutskever . Ilya . Hinton . Geoffrey . 2012 . ImageNet Classification with Deep Convolutional Neural Networks . live . NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nevada . https://web.archive.org/web/20170110123024/http://www.cs.toronto.edu/~kriz/imagenet_classification_with_deep_convolutional.pdf . 2017-01-10 . 2017-05-24.
  94. 1409.1556 . cs.CV . Karen . Simonyan . Zisserman . Andrew . Very Deep Convolution Networks for Large Scale Image Recognition . 2014.
  95. Szegedy . Christian . 2015 . Going deeper with convolutions . Cvpr2015. 1409.4842 .
  96. 1411.4555 . cs.CV . Oriol . Vinyals . Alexander . Toshev . Show and Tell: A Neural Image Caption Generator . Bengio . Samy . Erhan . Dumitru . 2014. .
  97. 1411.4952 . cs.CV . Hao . Fang . Saurabh . Gupta . From Captions to Visual Concepts and Back . Iandola . Forrest . Srivastava . Rupesh . Deng . Li . Dollár . Piotr . Gao . Jianfeng . He . Xiaodong . Mitchell . Margaret . Platt . John C . Lawrence Zitnick . C . Zweig . Geoffrey . 2014. .
  98. 1411.2539 . cs.LG . Ryan . Kiros . Ruslan . Salakhutdinov . Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models . Zemel . Richard S . 2014. .
  99. 1502.01852 . cs.CV . Kaiming . He . Xiangyu . Zhang . Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification . Ren . Shaoqing . Sun . Jian . 2016.
  100. He . Kaiming . Zhang . Xiangyu . Ren . Shaoqing . Sun . Jian . 10 Dec 2015 . Deep Residual Learning for Image Recognition . 1512.03385.
  101. 1505.00387 . cs.LG . Rupesh Kumar . Srivastava . Klaus . Greff . Highway Networks . 2 May 2015 . Schmidhuber . Jürgen.
  102. He . Kaiming . Zhang . Xiangyu . Ren . Shaoqing . Sun . Jian . 2016 . Deep Residual Learning for Image Recognition . Las Vegas, NV, USA . IEEE . 770–778 . 1512.03385 . 10.1109/CVPR.2016.90 . 978-1-4673-8851-1 . 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  103. Web site: Linn . Allison . 2015-12-10 . Microsoft researchers win ImageNet computer vision challenge . 2024-06-29 . The AI Blog . en-US.
  104. Schmidhuber . Jürgen . Juergen Schmidhuber . 1991 . A possibility for implementing curiosity and boredom in model-building neural controllers . MIT Press/Bradford Books . 222–227 . Proc. SAB'1991.
  105. Schmidhuber . Jürgen . Juergen Schmidhuber . 2020 . Generative Adversarial Networks are Special Cases of Artificial Curiosity (1990) and also Closely Related to Predictability Minimization (1991) . Neural Networks . en . 127 . 58–66 . 1906.04493 . 10.1016/j.neunet.2020.04.008 . 32334341 . 216056336.
  106. Schmidhuber . Jürgen . Juergen Schmidhuber . November 1992 . Learning Factorial Codes by Predictability Minimization . Neural Computation . en . 4 . 6 . 863–879 . 10.1162/neco.1992.4.6.863 . 42023620.
  107. Schmidhuber . Jürgen . Eldracher . Martin . Foltin . Bernhard . 1996 . Semilinear predictability minimzation produces well-known feature detectors . Neural Computation . en . 8 . 4 . 773–786 . 10.1162/neco.1996.8.4.773 . 16154391.
  108. Web site: Niemitalo . Olli . February 24, 2010 . A method for training artificial neural networks to generate missing data within a variable context . live . https://web.archive.org/web/20120312111546/http://yehar.com/blog/?p=167 . March 12, 2012 . February 22, 2019 . Internet Archive (Wayback Machine).
  109. Web site: 2019 . GANs were invented in 2010? . 2019-05-28 . reddit r/MachineLearning . en-US.
  110. Li . Wei . Gauci . Melvin . Gross . Roderich . July 6, 2013 . Proceeding of the fifteenth annual conference on Genetic and evolutionary computation conference - GECCO '13 . Amsterdam, the Netherlands . ACM . 223–230 . 10.1145/2463372.2465801 . 9781450319638 . A Coevolutionary Approach to Learn Animal Behavior Through Controlled Interaction . Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation (GECCO 2013).
  111. Gutmann . Michael . Hyvärinen . Aapo . Noise-Contrastive Estimation . International Conference on AI and Statistics.
  112. Goodfellow . Ian . Pouget-Abadie . Jean . Mirza . Mehdi . Xu . Bing . Warde-Farley . David . Ozair . Sherjil . Courville . Aaron . Bengio . Yoshua . 2014 . Generative Adversarial Networks . Proceedings of the International Conference on Neural Information Processing Systems (NIPS 2014) . 2672–2680 . https://web.archive.org/web/20191122034612/http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf . 22 November 2019 . 20 August 2019 . live.
  113. Web site: December 14, 2018 . GAN 2.0: NVIDIA's Hyperrealistic Face Generator . October 3, 2019 . SyncedReview.com.
  114. 1710.10196 . cs.NE . T. . Karras . T. . Aila . Progressive Growing of GANs for Improved Quality, Stability, and Variation . 26 February 2018 . Laine . S. . Lehtinen . J..
  115. Web site: Prepare, Don't Panic: Synthetic Media and Deepfakes . live . https://web.archive.org/web/20201202231744/https://lab.witness.org/projects/synthetic-media-and-deep-fakes/ . 2 December 2020 . 25 November 2020 . witness.org.
  116. Sohl-Dickstein . Jascha . Weiss . Eric . Maheswaranathan . Niru . Ganguli . Surya . 2015-06-01 . Deep Unsupervised Learning using Nonequilibrium Thermodynamics . Proceedings of the 32nd International Conference on Machine Learning . en . PMLR . 37 . 2256–2265. 1503.03585 .
  117. Book: Kramer . Arthur F. . Attention: From Theory to Practice . Wiegmann . Douglas A. . Kirlik . Alex . 2006-12-28 . Oxford University Press . 978-0-19-530572-2 . 1 Attention: From History to Application . 10.1093/acprof:oso/9780195305722.003.0001.
  118. Cherry EC . 1953 . Some Experiments on the Recognition of Speech, with One and with Two Ears . The Journal of the Acoustical Society of America . 25 . 5 . 975–79 . 1953ASAJ...25..975C . 10.1121/1.1907229 . 0001-4966 . free . 11858/00-001M-0000-002A-F750-3.
  119. Book: Broadbent, D . Donald Broadbent . Perception and Communication . Pergamon Press . 1958 . London.
  120. Kowler . Eileen . Anderson . Eric . Dosher . Barbara . Blaser . Erik . 1995-07-01 . The role of attention in the programming of saccades . Vision Research . 35 . 13 . 1897–1916 . 10.1016/0042-6989(94)00279-U . 7660596 . 0042-6989.
  121. Fukushima . Kunihiko . 1987-12-01 . Neural network model for selective attention in visual pattern recognition and associative recall . Applied Optics . en . 26 . 23 . 4985–4992 . 10.1364/AO.26.004985 . 20523477 . 1987ApOpt..26.4985F . 0003-6935.
  122. Ba . Jimmy . Multiple Object Recognition with Visual Attention . 2015-04-23 . Mnih . Volodymyr . Kavukcuoglu . Koray. cs.LG . 1412.7755 .
  123. Soydaner . Derya . August 2022 . Attention mechanism in neural networks: where it comes and where it goes . Neural Computing and Applications . en . 34 . 16 . 13371–13385 . 10.1007/s00521-022-07366-3 . 0941-0643.
  124. Giles . C. Lee . Maxwell . Tom . 1987-12-01 . Learning, invariance, and generalization in high-order neural networks . Applied Optics . en . 26 . 23 . 4972–4978 . 10.1364/AO.26.004972 . 20523475 . 0003-6935.
  125. Feldman . J. A. . Ballard . D. H. . 1982-07-01 . Connectionist models and their properties . Cognitive Science . 6 . 3 . 205–254 . 10.1016/S0364-0213(82)80001-3 . 0364-0213.
  126. Book: Rumelhart . David E. . Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations, Chapter 2 . Mcclelland . James L. . Group . PDP Research . 1987-07-29 . Bradford Books . 978-0-262-68053-0 . Cambridge, Mass . en.
  127. Schmidhuber . Jürgen . January 1992 . Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks . Neural Computation . en . 4 . 1 . 131–139 . 10.1162/neco.1992.4.1.131 . 0899-7667.
  128. Ha . David . HyperNetworks . 2016-12-01 . 1609.09106 . Dai . Andrew . Le . Quoc V.. cs.LG .
  129. Niu . Zhaoyang . Zhong . Guoqiang . Yu . Hui . 2021-09-10 . A review on the attention mechanism of deep learning . Neurocomputing . 452 . 48–62 . 10.1016/j.neucom.2021.03.091 . 0925-2312.
  130. Cho . Kyunghyun . van Merrienboer . Bart . Gulcehre . Caglar . Bahdanau . Dzmitry . Bougares . Fethi . Schwenk . Holger . Bengio . Yoshua . 2014-06-03 . Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation . 1406.1078 .
  131. 1409.3215 . cs.CL . Ilya . Sutskever . Oriol . Vinyals . Sequence to sequence learning with neural networks . 14 Dec 2014 . Le . Quoc Viet.
  132. Vinyals . Oriol . Toshev . Alexander . Bengio . Samy . Erhan . Dumitru . 2015 . Show and Tell: A Neural Image Caption Generator . 3156–3164. 1411.4555 .
  133. Xu . Kelvin . Ba . Jimmy . Kiros . Ryan . Cho . Kyunghyun . Courville . Aaron . Salakhudinov . Ruslan . Zemel . Rich . Bengio . Yoshua . 2015-06-01 . Show, Attend and Tell: Neural Image Caption Generation with Visual Attention . Proceedings of the 32nd International Conference on Machine Learning . en . PMLR . 2048–2057.
  134. Bahdanau . Dzmitry . Neural Machine Translation by Jointly Learning to Align and Translate . 2016-05-19 . Cho . Kyunghyun . Bengio . Yoshua. cs.CL . 1409.0473 .
  135. Graves . Alex . Neural Turing Machines . 2014-12-10 . Wayne . Greg . Danihelka . Ivo. cs.NE . 1410.5401 .
  136. Cheng . Jianpeng . Long Short-Term Memory-Networks for Machine Reading . 2016-09-20 . 1601.06733 . Dong . Li . Lapata . Mirella. cs.CL .
  137. 1606.01933 . cs.CL . Ankur P. . Parikh . Oscar . Täckström . A Decomposable Attention Model for Natural Language Inference . 2016-09-25 . Das . Dipanjan . Uszkoreit . Jakob.
  138. Levy . Steven . 8 Google Employees Invented Modern AI. Here's the Inside Story . live . https://web.archive.org/web/20240320101528/https://www.wired.com/story/eight-google-employees-invented-modern-ai-transformers-paper/ . 20 March 2024 . 2024-08-06 . Wired . en-US . 1059-1028.
  139. Vaswani . Ashish . Ashish Vaswani . Shazeer . Noam . Parmar . Niki . Uszkoreit . Jakob . Jones . Llion . Gomez . Aidan N . Aidan Gomez . Kaiser . Łukasz . Polosukhin . Illia . 2017 . Attention is All you Need . Advances in Neural Information Processing Systems . Curran Associates, Inc. . 30.
  140. Peng . Bo . RWKV: Reinventing RNNs for the Transformer Era . 2023-12-10 . 2305.13048 . Alcaide . Eric . Anthony . Quentin . Albalak . Alon . Arcadinho . Samuel . Biderman . Stella . Cao . Huanqi . Cheng . Xin . Chung . Michael. cs.CL .
  141. Kohonen . Teuvo . Honkela . Timo . 2007 . Kohonen Network . Scholarpedia . 2 . 1 . 1568 . 2007SchpJ...2.1568K . 10.4249/scholarpedia.1568 . free.
  142. Kohonen . Teuvo . 1982 . Self-Organized Formation of Topologically Correct Feature Maps . Biological Cybernetics . 43 . 59–69 . 10.1007/bf00337288 . 206775459 . 1.
  143. Von der Malsburg . C . 1973 . Self-organization of orientation sensitive cells in the striate cortex . Kybernetik . 14 . 2 . 85–100 . 10.1007/bf00288907 . 4786750 . 3351573.
  144. Ackley . David H. . Hinton . Geoffrey E. . Sejnowski . Terrence J. . 1985-01-01 . A learning algorithm for boltzmann machines . Cognitive Science . 9 . 1 . 147–169 . 10.1016/S0364-0213(85)80012-4 . 2024-08-07 . 0364-0213.
  145. Book: Smolensky, Paul . Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations . Connectionism . MIT Press . 1986 . 0-262-68053-X . Rumelhart . David E. . 194–281 . Chapter 6: Information Processing in Dynamical Systems: Foundations of Harmony Theory . McLelland . James L. . https://stanford.edu/~jlmcc/papers/PDP/Volume%201/Chap6_PDP86.pdf.
  146. Peter . Dayan . Peter Dayan . Hinton . Geoffrey E. . Geoffrey Hinton . Neal . Radford M. . Radford M. Neal . Zemel . Richard S. . Richard Zemel . 1995 . The Helmholtz machine. . Neural Computation . 7 . 5 . 889–904 . 10.1162/neco.1995.7.5.889 . 7584891 . 1890561 . free . 21.11116/0000-0002-D6D3-E.
  147. Hinton . Geoffrey E. . Geoffrey Hinton . Dayan . Peter . Peter Dayan . Frey . Brendan J. . Brendan Frey . Neal . Radford . 1995-05-26 . The wake-sleep algorithm for unsupervised neural networks . Science . 268 . 5214 . 1158–1161 . 1995Sci...268.1158H . 10.1126/science.7761831 . 7761831 . 871473.
  148. Book: Sejnowski, Terrence J. . The deep learning revolution . 2018 . The MIT Press . 978-0-262-03803-4 . Cambridge, Massachusetts.
  149. Book: Smolensky . P. . Paul Smolensky . Parallel Distributed Processing: Explorations in the Microstructure of Cognition . 1986 . 9780262680530 . D. E. Rumelhart . 1 . 194–281 . Information processing in dynamical systems: Foundations of harmony theory. . J. L. McClelland . PDP Research Group . http://portal.acm.org/citation.cfm?id=104290.
  150. Hinton . G. E. . Geoffrey Hinton . Osindero . S. . Teh . Y. . 2006 . A fast learning algorithm for deep belief nets . . 18 . 7 . 1527–1554 . 10.1.1.76.1541 . 10.1162/neco.2006.18.7.1527 . 16764513 . 2309950.
  151. Hinton . Geoffrey . 2009-05-31 . Deep belief networks . Scholarpedia . 4 . 5 . 5947 . 2009SchpJ...4.5947H . 10.4249/scholarpedia.5947 . 1941-6016 . free.
  152. 1112.6209 . cs.LG . Andrew . Ng . Jeff . Dean . Building High-level Features Using Large Scale Unsupervised Learning . 2012.
  153. Watkin . Timothy L. H. . Rau . Albrecht . Biehl . Michael . 1993-04-01 . The statistical mechanics of learning a rule . Reviews of Modern Physics . 65 . 2 . 499–556 . 10.1103/RevModPhys.65.499. 1993RvMP...65..499W .
  154. Schwarze . H . Hertz . J . 1992-10-15 . Generalization in a Large Committee Machine . Europhysics Letters (EPL) . 20 . 4 . 375–380 . 10.1209/0295-5075/20/4/015 . 1992EL.....20..375S . 0295-5075.
  155. Mato . G . Parga . N . 1992-10-07 . Generalization properties of multilayered neural networks . Journal of Physics A: Mathematical and General . 25 . 19 . 5047–5054 . 10.1088/0305-4470/25/19/017 . 1992JPhA...25.5047M . 0305-4470.
  156. Hansel . D . Mato . G . Meunier . C . 1992-11-01 . Memorization Without Generalization in a Multilayered Neural Network . Europhysics Letters (EPL) . 20 . 5 . 471–476 . 10.1209/0295-5075/20/5/015 . 1992EL.....20..471H . 0295-5075.
  157. Schmidhuber . Jürgen . 1992 . [ftp://ftp.idsia.ch/pub/juergen/chunker.pdf Learning complex, extended sequences using the principle of history compression ]. Neural Computation . 4 . 2 . 234–242 . 10.1162/neco.1992.4.2.234 . 18271205 .
  158. Hanson . Stephen . Pratt . Lorien . 1988 . Comparing Biases for Minimal Network Construction with Back-Propagation . Advances in Neural Information Processing Systems . Morgan-Kaufmann . 1.
  159. LeCun . Yann . Denker . John . Solla . Sara . 1989 . Optimal Brain Damage . Advances in Neural Information Processing Systems . Morgan-Kaufmann . 2.
  160. Book: Analog VLSI Implementation of Neural Systems. 8 May 1989. Kluwer Academic Publishers. 978-1-4613-1639-8. Mead. Carver A.. Carver Mead. Ismail. Mohammed. The Kluwer International Series in Engineering and Computer Science. 80. Norwell, MA. 10.1007/978-1-4613-1639-8.
  161. Yang. J. J.. Pickett. M. D.. Li. X. M.. Ohlberg. D. A. A.. Stewart. D. R.. Williams. R. S.. 2008. Memristive switching mechanism for metal/oxide/metal nanodevices. Nat. Nanotechnol.. 3. 7. 429–433. 10.1038/nnano.2008.160. 18654568.
  162. Strukov. D. B.. Snider. G. S.. Stewart. D. R.. Williams. R. S.. 2008. The missing memristor found. Nature. 453. 7191. 80–83. 2008Natur.453...80S. 10.1038/nature06932. 18451858. 4367148.