Attention (machine learning) explained

Attention is a machine learning method that determines the importance of each component in a sequence relative to the other components in that sequence. In natural language processing, importance is represented by "soft" weights assigned to each word in a sentence. More generally, attention encodes vector representations called token embeddings across a fixed-width context window that can range from tens to millions of tokens in size.

Calculation of these weights can occur in parallel when using the Transformer attention model, or sequentially, as required by the original RNN design. Unlike "hard" weights, which are mutable during training and frozen afterwards, "soft" weights change with each input throughout all phases of operation.

Inspired by ideas about attention in humans, the attention mechanism was developed to address the weaknesses of leveraging information from the hidden layers of recurrent neural networks. Recurrent neural networks favor more recent information contained in words at the end of a sentence, while information earlier in the sentence tends to be attenuated. Attention gives the calculation of a token's hidden representation equal, direct access to any part of a sentence, rather than access only through the previous hidden state.

Earlier uses attached this mechanism to a serial recurrent neural network's language translation system, but later uses in transformer large language models remove the recurrent neural network and rely instead on the faster parallel attention scheme.

History

See also: Timeline of machine learning. Academic reviews of the history of the attention mechanism are provided in Niu et al.^[1] and Soydaner.^[2]

Predecessors

Selective attention in humans had been well studied in neuroscience and cognitive psychology.^[3] In 1953, Colin Cherry studied selective attention in the context of audition, known as the cocktail party effect.^[4]

In 1958, Donald Broadbent proposed the filter model of attention.^[5] Selective attention of vision was studied in the 1960s by George Sperling's partial report paradigm. It was also noticed that saccade control is modulated by cognitive processes, insofar as the eye moves preferentially towards areas of high salience. As the fovea of the eye is small, the eye cannot sharply resolve the entire visual field at once. The use of saccade control allows the eye to quickly scan important features of a scene.^[6]

These research developments inspired algorithms such as the Neocognitron and its variants.^[7] Meanwhile, developments in neural networks had inspired circuit models of biological visual attention. One well-cited network from 1998, for example, was inspired by the low-level primate visual system. It produced saliency maps of images using handcrafted (not learned) features, which were then used to guide a second neural network in processing patches of the image in order of decreasing saliency.^[8]

A key aspect of the attention mechanism can be written (schematically) as $\sum_i \langle(\text{query})_i, (\text{key})_i\rangle (\text{value})_i$, where the angled brackets denote the dot product. This shows that it involves a multiplicative operation. Multiplicative operations within neural networks had been studied under the names of higher-order neural networks,^[9] multiplication units,^[10] sigma-pi units, fast weight controllers, and hyper-networks.

Recurrent attention

During the deep learning era, the attention mechanism was developed to solve similar problems in encoding-decoding.

In machine translation, the seq2seq model, as it was proposed in 2014,^[11] would encode an input text into a fixed-length vector, which would then be decoded into an output text. If the input text is long, the fixed-length vector would be unable to carry enough information for accurate decoding. An attention mechanism was proposed to solve this problem.

An image captioning model proposed in 2015, citing inspiration from the seq2seq model,^[12] would encode an input image into a fixed-length vector. Xu et al. (2015),^[13] citing Bahdanau et al. (2014), applied the attention mechanism as used in the seq2seq model to image captioning.

Transformer

One problem with seq2seq models was their use of recurrent neural networks, which are not parallelizable as both the encoder and the decoder must process the sequence token-by-token. Decomposable attention attempted to solve this problem by processing the input sequence in parallel, before computing a "soft alignment matrix" (alignment is the terminology used by Bahdanau et al) in order to allow for parallel processing.

The idea of using the attention mechanism for self-attention, instead of in an encoder-decoder (cross-attention), was also proposed during this period, such as in differentiable neural computers and neural Turing machines. It was termed intra-attention, where an LSTM is augmented with a memory network as it encodes an input sequence.

These strands of development were brought together in 2017 with the Transformer architecture, published in the Attention Is All You Need paper.

Core calculations

The attention network was designed to identify highly correlated patterns amongst words in a given sentence, assuming that it has learned word correlation patterns from the training data. This correlation is captured as neuronal weights learned during training with backpropagation.

This attention scheme has been compared to the query-key analogy of relational databases. The comparison suggests an asymmetric role for these 2 vectors, where one item of interest (the query "that") is matched against all possible items (the keys list of each word in the sentence). However, parallel calculation matches every word of the sentence against all the words, including itself; therefore the roles of these vectors are symmetric. Possibly because the simplistic database analogy is flawed, much effort has gone into understanding attention further by studying their roles in focused settings, such as in-context learning, masked language tasks, stripped down transformers, bigram statistics, N-gram statistics, pairwise convolutions, and arithmetic factoring.

A Detailed Walk-through

As hand-crafting weights defeats the purpose of machine learning, the model must compute the attention weights on its own. Taking an analogy from the language of database queries, we make the model construct a triple of vectors: key, query, and value. The rough idea is that we have a "database" in the form of a list of key-value pairs. The decoder sends in a query, and obtains a reply in the form of a weighted sum of the values, where the weight is proportional to how closely the query resembles each key.

The decoder first processes the "⟨start⟩" input partially, to obtain an intermediate vector $h^d_0$, the 0th hidden vector of the decoder. Then, the intermediate vector is transformed by a linear map $W^Q$ into a query vector $q_0 = h^d_0 W^Q$. Meanwhile, the hidden vectors outputted by the encoder are transformed by another linear map $W^K$ into key vectors $k_0 = h_0 W^K, k_1 = h_1 W^K, \dots$. The linear maps are useful for providing the model with enough freedom to find the best way to represent the data.

Now, the query and keys are compared by taking dot products: $q_0 k_0^T, q_0 k_1^T, \dots$. Ideally, the model should have learned to compute the keys and values, such that $q_0 k_0^T$ is large, $q_0 k_1^T$ is small, and the rest are very small. This can be interpreted as saying that the attention weight should be mostly applied to the 0th hidden vector of the encoder, a little to the 1st, and essentially none to the rest.

In order to make a properly weighted sum, we need to transform this list of dot products into a probability distribution over $0, 1, \dots$. This can be accomplished by the softmax function, thus giving us the attention weights:

$(w_{00}, w_{01}, \dots) = \mathrm{softmax}(q_0 k_0^T, q_0 k_1^T, \dots)$

This is then used to compute the context vector:

$c_0 = w_{00} v_0 + w_{01} v_1 + \cdots$

where $v_0 = h_0 W^V, v_1 = h_1 W^V, \dots$ are the value vectors, linearly transformed by another matrix to provide the model with freedom to find the best way to represent values. Without the matrices $W^Q, W^K, W^V$, the model would be forced to use the same hidden vector for both key and value, which might not be appropriate, as these two tasks are not the same.

This is the dot-attention mechanism. The particular version described in this section is "decoder cross-attention", as the output context vector is used by the decoder, and the input keys and values come from the encoder, but the query comes from the decoder, thus "cross-attention".

More succinctly, we can write it as $c_0 = \mathrm{Attention}(h_0^d W^Q, H W^K, H W^V) = \mathrm{softmax}((h_0^d W^Q) \; (H W^K)^T) (H W^V)$ where the matrix $H$ is the matrix whose rows are $h_0, h_1, \dots$. Note that the querying vector, $h^d_0$, is not necessarily the same as the key-value vector $h_0$. In fact, it is theoretically possible for query, key, and value vectors to all be different, though that is rarely done in practice.
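The walk-through above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the sizes, the random seed, and the random matrices standing in for the learned maps W^Q, W^K, W^V are all made up for the example, and no training takes place.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 5, 8, 4, 6    # illustrative sizes (assumptions)

H = rng.normal(size=(n, d_model))    # encoder hidden vectors h_0, h_1, ... as rows
h_d0 = rng.normal(size=(d_model,))   # decoder hidden vector h^d_0

# Random stand-ins for the learned linear maps W^Q, W^K, W^V.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

q0 = h_d0 @ W_Q          # query vector q_0
K = H @ W_K              # rows are the key vectors k_0, k_1, ...
V = H @ W_V              # rows are the value vectors v_0, v_1, ...

scores = K @ q0          # dot products q_0 k_i^T, one per encoder position
w = softmax(scores)      # attention weights (w_00, w_01, ...)
c0 = w @ V               # context vector c_0 = sum_i w_0i v_i
```

The weights `w` are non-negative and sum to 1, so `c0` is a weighted average of the value vectors.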

Language Translation

Tasks dealing with language can be cast as a problem of translating general sequences, called seq2seq. One way to build such a machine in 2014 was to graft an attention unit onto the recurrent Encoder-Decoder (diagram below). With the advent of Transformers in 2017, the serial recurrent network was replaced by parallel attention modules augmented by other features like positional encoding, skip connections, and fully connected networks.

An Example

In the simplest case, the attention unit consists of dot products of the recurrent encoder states and does not need training. In practice, the attention unit consists of 3 trained, fully-connected neural network layers called query, key, and value.

Legend
Label	Description
100	Max. sentence length
300	Embedding size (word dimension)
500	Length of hidden vector
9k, 10k	Dictionary size of input & output languages respectively.
x, Y	9k and 10k 1-hot dictionary vectors. x → x implemented as a lookup table rather than vector multiplication. Y is the 1-hot maximizer of the linear Decoder layer D; that is, it takes the argmax of D's linear layer output.
x	300-long word embedding vector. The vectors are usually pre-calculated from other projects such as GloVe or Word2Vec.
h	500-long encoder hidden vector. At each point in time, this vector summarizes all the preceding words before it. The final h can be viewed as a "sentence" vector, or a thought vector as Hinton calls it.
s	500-long decoder hidden state vector.
E	500 neuron recurrent neural network encoder. 500 outputs. Input count is 800: 300 from the source embedding plus 500 from recurrent connections. The encoder feeds directly into the decoder only to initialize it, but not thereafter; hence, that direct connection is shown very faintly.
D	2-layer decoder. The recurrent layer has 500 neurons and the fully-connected linear layer has 10k neurons (the size of the target vocabulary). The linear layer alone has 5 million (500 × 10k) weights – ~10 times more weights than the recurrent layer.
score	100-long alignment score
w	100-long attention weight vector. These are "soft" weights, which change during the forward pass, in contrast to "hard" neuronal weights, which change during the learning phase.
A	Attention module – this can be a dot product of recurrent states, or the query-key-value fully-connected layers. The output is a 100-long vector w.
H	500×100. 100 hidden vectors h concatenated into a matrix
c	500-long context vector = H * w. c is a linear combination of h vectors weighted by w.
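As a rough illustration of the untrained, dot-product form of the attention module A, the sketch below uses the sizes from the legend (500-long hidden vectors, 100 positions). The random data, and the choice to score each encoder state by its dot product with the current decoder state s, are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, max_len = 500, 100               # sizes taken from the legend

H = rng.normal(size=(hidden, max_len))   # 100 encoder hidden vectors h as columns
s = rng.normal(size=(hidden,))           # current decoder hidden state

score = H.T @ s                          # 100-long alignment score (dot products)
score = score - score.max()              # shift for numerical stability
w = np.exp(score) / np.exp(score).sum()  # 100-long vector of soft attention weights
c = H @ w                                # 500-long context vector c = H * w
```

Because `w` changes whenever `s` or `H` changes, it is recomputed on every forward pass, which is what the legend means by "soft" weights.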

Alignment

In translating between languages, alignment is the process of matching words from the source sentence to words of the translated sentence. In the I love you example above, the second word love is aligned with the third word aime. Stacking soft row vectors together for je, t', and aime yields an alignment matrix:

	I	love	you
je	0.94	0.02	0.04
t'	0.11	0.01	0.88
aime	0.03	0.95	0.02

Sometimes, alignment can be multiple-to-multiple. For example, the English phrase look it up corresponds to cherchez-le. Thus, "soft" attention weights work better than "hard" attention weights (setting one attention weight to 1, and the others to 0), as we would like the model to make a context vector consisting of a weighted sum of the hidden vectors, rather than "the best one", as there may not be a best hidden vector.

This view of the attention weights addresses some of the neural network explainability problem. Networks that perform verbatim translation without regard to word order would show the highest scores along the (dominant) diagonal of the matrix. The off-diagonal dominance shows that the attention mechanism is more nuanced. On the first pass through the decoder, 94% of the attention weight is on the first English word I, so the network offers the word je. On the second pass of the decoder, 88% of the attention weight is on the third English word you, so it offers t'. On the last pass, 95% of the attention weight is on the second English word love, so it offers aime.
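The reading of the alignment matrix described above can be checked with a short NumPy sketch. The numbers are copied from the table; the variable names are illustrative only.

```python
import numpy as np

source = ["I", "love", "you"]     # English words (columns)
target = ["je", "t'", "aime"]     # French words (rows)

W = np.array([
    [0.94, 0.02, 0.04],
    [0.11, 0.01, 0.88],
    [0.03, 0.95, 0.02],
])

# Each row is a distribution over the source words.
row_sums = W.sum(axis=1)

# Hard alignment: the source word with the largest weight in each row.
aligned = {t: source[j] for t, j in zip(target, W.argmax(axis=1))}
print(aligned)  # {'je': 'I', "t'": 'you', 'aime': 'love'}
```

Taking the argmax discards the smaller weights; the soft weights keep them, which matters when no single source word is the best match.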

seq2seq machine translation

Consider the seq2seq English-to-French translation task. To be concrete, let us consider the translation of "the zone of international control ⟨end⟩", which should translate to "la zone de contrôle international ⟨end⟩". Here, we use the special ⟨end⟩ token as a control character to delimit the end of input for both the encoder and the decoder.

An input sequence of text $x_0, x_1, \dots$ is processed by a neural network (which can be an LSTM, a Transformer encoder, or some other network) into a sequence of real-valued vectors $h_0, h_1, \dots$, where $h$ stands for "hidden vector".

After the encoder has finished processing, the decoder starts operating over the hidden vectors, to produce an output sequence $y_0, y_1, \dots$, autoregressively. That is, it always takes as input both the hidden vectors produced by the encoder, and what the decoder itself has produced before, to produce the next output word:

($h_0, h_1, \dots$, "⟨start⟩") → "la"
($h_0, h_1, \dots$, "⟨start⟩ la") → "la zone"
($h_0, h_1, \dots$, "⟨start⟩ la zone") → "la zone de"
...
($h_0, h_1, \dots$, "⟨start⟩ la zone de contrôle international") → "la zone de contrôle international ⟨end⟩"

Here, we use the special ⟨start⟩ token as a control character to delimit the start of input for the decoder. The decoding terminates as soon as "⟨end⟩" appears in the decoder output.
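The control flow of autoregressive decoding can be sketched as a simple loop. Everything model-related here is a stand-in: `next_token` is a hypothetical stub that replays a fixed translation instead of running a trained decoder, and the spellings of the start and end tokens are assumptions.

```python
START, END = "⟨start⟩", "⟨end⟩"   # hypothetical spellings of the control tokens

# Stand-in for a trained decoder: it simply replays a fixed translation.
REFERENCE = ["la", "zone", "de", "contrôle", "international", END]

def next_token(hidden_vectors, produced):
    """Toy decoder step: return the next word given what has been produced so far.

    A real decoder would attend over hidden_vectors; this stub ignores them.
    """
    return REFERENCE[len(produced) - 1]   # produced[0] is the start token

def greedy_decode(hidden_vectors, max_steps=20):
    produced = [START]
    for _ in range(max_steps):
        token = next_token(hidden_vectors, produced)
        produced.append(token)
        if token == END:                  # stop once the end token appears
            break
    return produced[1:]                   # drop the start token

output = greedy_decode(hidden_vectors=None)
print(" ".join(output))
```

The `max_steps` bound is a common safeguard, added here as an assumption, so that the loop ends even if the end token is never produced.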

Variants

Many variants of attention implement soft weights, such as

fast weight programmers, or fast weight controllers (1992). A "slow" neural network outputs the "fast" weights of another neural network through outer products. The slow network learns by gradient descent. It was later renamed as "linearized self-attention".
Bahdanau-style attention, also referred to as additive attention,
Luong-style attention, which is known as multiplicative attention,
highly parallelizable self-attention introduced in 2016 as decomposable attention and successfully used in transformers a year later,
positional attention and factorized positional attention.

For convolutional neural networks, attention mechanisms can be distinguished by the dimension on which they operate, namely: spatial attention, channel attention, or combinations.

These variants recombine the encoder-side inputs to redistribute those effects to each target output. Often, a correlation-style matrix of dot products provides the re-weighting coefficients. In the figures below, W is the matrix of context attention weights, similar to the formula in the Core calculations section above.

Legend
Label	Description
Variables X, H, S, T	Upper case variables represent the entire sentence, and not just the current word. For example, H is a matrix of the encoder hidden state—one word per column.
S, T	S, decoder hidden state; T, target word embedding. In the Pytorch Tutorial variant training phase, T alternates between 2 sources depending on the level of teacher forcing used. T could be the embedding of the network's output word; i.e. embedding(argmax(FC output)). Alternatively with teacher forcing, T could be the embedding of the known correct word which can occur with a constant forcing probability, say 1/2.
X, H	H, encoder hidden state; X, input word embeddings.
W	Attention coefficients
Qw, Kw, Vw, FC	Weight matrices for query, key, value respectively. FC is a fully-connected weight matrix.
⊕, ⊗	⊕, vector concatenation; ⊗, matrix multiplication.
corr	Column-wise softmax(matrix of all combinations of dot products). The dot products are x_i * x_j in variant 3, h_i * s_j in variant 1, column_i(Kw * H) * column_j(Qw * S) in variant 2, and column_i(Kw * X) * column_j(Qw * X) in variant 4. Variant 5 uses a fully-connected layer to determine the coefficients. If the variant is QKV, then the dot products are normalized by $\sqrt{d}$, where $d$ is the height of the QKV matrices.

Self-attention

Self-attention is essentially the same as cross-attention, except that query, key, and value vectors all come from the same model. Both encoder and decoder can use self-attention, but with subtle differences.

For encoder self-attention, we can start with a simple encoder without self-attention, such as an "embedding layer", which simply converts each input word into a vector by a fixed lookup table. This gives a sequence of hidden vectors $h_0, h_1, \dots$. These can then be applied to a dot-product attention mechanism, to obtain

$\begin{aligned} h_0' &= \mathrm{Attention}(h_0 W^Q, H W^K, H W^V) \\ h_1' &= \mathrm{Attention}(h_1 W^Q, H W^K, H W^V) \\ &\cdots \end{aligned}$

or more succinctly, $H' = \mathrm{Attention}(H W^Q, H W^K, H W^V)$. This can be applied repeatedly, to obtain a multilayered encoder. This is the "encoder self-attention", sometimes called the "all-to-all attention", as the vector at every position can attend to every other.

For decoder self-attention, all-to-all attention is inappropriate, because during the autoregressive decoding process, the decoder cannot attend to future outputs that have yet to be decoded. This can be solved by forcing the attention weights $w_{ij} = 0$ for all $i < j$, called "causal masking". This attention mechanism is the "causally masked self-attention".
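One common way to force the weights to zero for i < j is to set the corresponding scores to negative infinity before the softmax. The sketch below does this in NumPy; the sizes and the random matrices standing in for the learned maps are assumptions, and the scores are left unscaled to match the formulas in this section.

```python
import numpy as np

def causal_self_attention(H, W_Q, W_K, W_V):
    """Self-attention in which position i may only attend to positions j <= i."""
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V
    scores = Q @ K.T                                     # scores[i, j] = q_i k_j^T
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True where i < j
    scores = np.where(future, -np.inf, scores)           # masked entries get weight 0
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 4, 3                                              # illustrative sizes
H = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
H_new, weights = causal_self_attention(H, W_Q, W_K, W_V)
```

The resulting `weights` matrix is lower triangular with rows summing to 1, so the first position attends only to itself.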

Mathematical representation

Standard Scaled Dot-Product Attention

For matrices $\mathbf{Q} \in \mathbb{R}^{m \times d_k}$, $\mathbf{K} \in \mathbb{R}^{n \times d_k}$ and $\mathbf{V} \in \mathbb{R}^{n \times d_v}$, the scaled dot-product, or QKV attention, is defined as:

$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V} \in \mathbb{R}^{m \times d_v}$

where ${}^T$ denotes transpose and the softmax function is applied independently to every row of its argument. The matrix $\mathbf{Q}$ contains $m$ queries, while matrices $\mathbf{K}, \mathbf{V}$ jointly contain an unordered set of $n$ key-value pairs. Value vectors in matrix $\mathbf{V}$ are weighted using the weights resulting from the softmax operation, so that the rows of the $m$-by-$d_v$ output matrix are confined to the convex hull of the points in $\mathbb{R}^{d_v}$ given by the rows of $\mathbf{V}$.
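A minimal NumPy sketch of scaled dot-product attention follows, with m queries and n key-value pairs. The sizes and random inputs are assumptions for illustration. The final check tests only a necessary consequence of the convex-hull property (each output coordinate stays within the range of that coordinate in V), not the property itself.

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with softmax applied to each row."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # shape (m, n)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                    # shape (m, d_v)

rng = np.random.default_rng(0)
m, n, d_k, d_v = 3, 5, 4, 2                               # illustrative sizes
Q = rng.normal(size=(m, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

out = attention(Q, K, V)

# Each output coordinate lies between the smallest and largest value of that
# coordinate among the rows of V.
inside = np.all((out >= V.min(axis=0) - 1e-12) & (out <= V.max(axis=0) + 1e-12))
```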

To understand the permutation invariance and permutation equivariance properties of QKV attention,^[14] let $\mathbf{A} \in \mathbb{R}^{m \times m}$ and $\mathbf{B} \in \mathbb{R}^{n \times n}$ be permutation matrices; and $\mathbf{D} \in \mathbb{R}^{m \times n}$ an arbitrary matrix. The softmax function is permutation equivariant in the sense that:

$\text{softmax}(\mathbf{A}\mathbf{D}\mathbf{B}) = \mathbf{A}\,\text{softmax}(\mathbf{D})\,\mathbf{B}$

By noting that the transpose of a permutation matrix is also its inverse, it follows that:

$\text{Attention}(\mathbf{A}\mathbf{Q}, \mathbf{B}\mathbf{K}, \mathbf{B}\mathbf{V}) = \mathbf{A}\,\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V})$

which shows that QKV attention is equivariant with respect to re-ordering the queries (rows of $\mathbf{Q}$); and invariant to re-ordering of the key-value pairs in $\mathbf{K}, \mathbf{V}$. These properties are inherited when applying linear transforms to the inputs and outputs of QKV attention blocks. For example, a simple self-attention function defined as:

$\mathbf{X} \mapsto \text{Attention}(\mathbf{X}\mathbf{T}_q, \mathbf{X}\mathbf{T}_k, \mathbf{X}\mathbf{T}_v)$

is permutation equivariant with respect to re-ordering the rows of the input matrix $\mathbf{X}$ in a non-trivial way, because every row of the output is a function of all the rows of the input. Similar properties hold for multi-head attention, which is defined below.
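The equivariance and invariance claims can be checked numerically. The sketch below builds random permutation matrices by shuffling the rows of an identity matrix; the sizes, seed, and random inputs are assumptions made for the example.

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with softmax applied to each row."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(1)
m, n, d_k, d_v = 3, 5, 4, 2
Q = rng.normal(size=(m, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

A = np.eye(m)[rng.permutation(m)]   # m x m permutation matrix (re-orders queries)
B = np.eye(n)[rng.permutation(n)]   # n x n permutation matrix (re-orders key-value pairs)

lhs = attention(A @ Q, B @ K, B @ V)
rhs = A @ attention(Q, K, V)
print(np.allclose(lhs, rhs))  # True
```

Re-ordering the key-value pairs with `B` leaves the result unchanged, while re-ordering the queries with `A` re-orders the output rows in the same way.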

Multi-Head Attention

Multi-head attention is defined as $\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Concat}(\text{head}_1, ..., \text{head}_h)\mathbf{W}^O$ where each head is computed with QKV attention as: $\text{head}_i = \text{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)$ and $\mathbf{W}_i^Q, \mathbf{W}_i^K, \mathbf{W}_i^V$, and $\mathbf{W}^O$ are parameter matrices.

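The permutation properties of QKV attention apply here also. For permutation matrices $\mathbf{A}, \mathbf{B}$:

$\text{MultiHead}(\mathbf{A}\mathbf{Q}, \mathbf{B}\mathbf{K}, \mathbf{B}\mathbf{V}) = \mathbf{A}\,\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V})$

from which we also see that multi-head self-attention:

$\mathbf{X} \mapsto \text{MultiHead}(\mathbf{X}\mathbf{T}_q, \mathbf{X}\mathbf{T}_k, \mathbf{X}\mathbf{T}_v)$

is equivariant with respect to re-ordering of the rows of input matrix $\mathbf{X}$.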
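Multi-head attention can be sketched by running QKV attention once per head and concatenating the results. The number of heads, the per-head width (model width divided by the number of heads, a common but assumed choice), and the random parameter matrices are all illustrative.

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with softmax applied to each row."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head(Q, K, V, W_Q, W_K, W_V, W_O):
    """Concat(head_1, ..., head_h) W^O, with head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)."""
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
m, n, d_model, h = 3, 5, 8, 2
d_head = d_model // h                          # per-head width (assumed choice)

Q = rng.normal(size=(m, d_model))
K = rng.normal(size=(n, d_model))
V = rng.normal(size=(n, d_model))
W_Q = rng.normal(size=(h, d_model, d_head))    # one projection matrix per head
W_K = rng.normal(size=(h, d_model, d_head))
W_V = rng.normal(size=(h, d_model, d_head))
W_O = rng.normal(size=(h * d_head, d_model))   # output projection W^O

out = multi_head(Q, K, V, W_Q, W_K, W_V, W_O)  # shape (m, d_model)
```

Each head sees the same inputs through different projections, so the heads can weight the key-value pairs differently before `W_O` mixes their outputs.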

Bahdanau (Additive) Attention

$\text{Attention}(Q, K, V) = \text{softmax}(e)V$ where $e = \tanh(W_Q Q + W_K K)$ and $W_Q$ and $W_K$ are learnable weight matrices.

Luong Attention (General)

$\text{Attention}(Q, K, V) = \text{softmax}(QW_aK^T)V$ where $W_a$ is a learnable weight matrix.
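The general Luong form differs from plain dot-product attention only in the matrix placed between the queries and the keys, which also allows the two to have different widths. A minimal sketch, assuming a row-wise softmax and using a random matrix as a stand-in for the learned W_a:

```python
import numpy as np

def luong_general_attention(Q, K, V, W_a):
    """Return softmax(Q W_a K^T) V and the weights, with a row-wise softmax."""
    scores = Q @ W_a @ K.T                                # shape (m, n)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
m, n, d_q, d_key, d_v = 2, 4, 3, 5, 6    # illustrative sizes (assumptions)
Q = rng.normal(size=(m, d_q))
K = rng.normal(size=(n, d_key))
V = rng.normal(size=(n, d_v))
W_a = rng.normal(size=(d_q, d_key))      # random stand-in for the learned matrix

context, weights = luong_general_attention(Q, K, V, W_a)
```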

External links

Dan Jurafsky and James H. Martin (2022) Speech and Language Processing (3rd ed. draft, January 2022), ch. 10.4 Attention and ch. 9.7 Self-Attention Networks: Transformers
Alex Graves (4 May 2020), Attention and Memory in Deep Learning (video lecture), DeepMind / UCL, via YouTube

Notes and References

1. Niu, Zhaoyang; Zhong, Guoqiang; Yu, Hui (2021-09-10). "A review on the attention mechanism of deep learning". Neurocomputing. 452: 48–62. doi:10.1016/j.neucom.2021.03.091. ISSN 0925-2312.
2. Soydaner, Derya (August 2022). "Attention mechanism in neural networks: where it comes and where it goes". Neural Computing and Applications. 34 (16): 13371–13385. doi:10.1007/s00521-022-07366-3. ISSN 0941-0643.
3. Kramer, Arthur F.; Wiegmann, Douglas A.; Kirlik, Alex (2006-12-28). "1 Attention: From History to Application". Attention: From Theory to Practice. Oxford University Press. doi:10.1093/acprof:oso/9780195305722.003.0001. ISBN 978-0-19-530572-2.
4. Cherry, E. C. (1953). "Some Experiments on the Recognition of Speech, with One and with Two Ears". The Journal of the Acoustical Society of America. 25 (5): 975–79. Bibcode:1953ASAJ...25..975C. doi:10.1121/1.1907229. hdl:11858/00-001M-0000-002A-F750-3. ISSN 0001-4966.
5. Broadbent, D. (1958). Perception and Communication. London: Pergamon Press.
6. Kowler, Eileen; Anderson, Eric; Dosher, Barbara; Blaser, Erik (1995-07-01). "The role of attention in the programming of saccades". Vision Research. 35 (13): 1897–1916. doi:10.1016/0042-6989(94)00279-U. ISSN 0042-6989.
7. Fukushima, Kunihiko (1987-12-01). "Neural network model for selective attention in visual pattern recognition and associative recall". Applied Optics. 26 (23): 4985. doi:10.1364/AO.26.004985. ISSN 0003-6935.
8. Itti, L.; Koch, C.; Niebur, E. (November 1998). "A model of saliency-based visual attention for rapid scene analysis". IEEE Transactions on Pattern Analysis and Machine Intelligence. 20 (11): 1254–1259. doi:10.1109/34.730558.
9. Giles, C. Lee; Maxwell, Tom (1987-12-01). "Learning, invariance, and generalization in high-order neural networks". Applied Optics. 26 (23): 4972. doi:10.1364/AO.26.004972. ISSN 0003-6935.
10. Feldman, J. A.; Ballard, D. H. (1982-07-01). "Connectionist models and their properties". Cognitive Science. 6 (3): 205–254. doi:10.1016/S0364-0213(82)80001-3. ISSN 0364-0213.
11. Sutskever, Ilya; Vinyals, Oriol; Le, Quoc Viet (2014). "Sequence to sequence learning with neural networks". arXiv:1409.3215 [cs.CL].
12. Vinyals, Oriol; Toshev, Alexander; Bengio, Samy; Erhan, Dumitru (2015). "Show and Tell: A Neural Image Caption Generator". pp. 3156–3164.
13. Xu, Kelvin; Ba, Jimmy; Kiros, Ryan; Cho, Kyunghyun; Courville, Aaron; Salakhudinov, Ruslan; Zemel, Rich; Bengio, Yoshua (2015-06-01). "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention". Proceedings of the 32nd International Conference on Machine Learning. PMLR. pp. 2048–2057.
14. Lee, Juho; Lee, Yoonho; Kim, Jungtaek; Kosiorek, Adam R; Choi, Seungjin; Teh, Yee Whye. "Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks". arXiv. 13 August 2024.