GloVe explained

GloVe, coined from Global Vectors, is a model for distributed word representation. The model is an unsupervised learning algorithm for obtaining vector representations for words. This is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity.[1] Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. As a log-bilinear regression model for unsupervised learning of word representations, it combines the features of two model families, namely global matrix factorization and local context window methods.

It was developed as an open-source project at Stanford[2] and was launched in 2014. It was designed as a competitor to word2vec, and the original paper noted multiple improvements of GloVe over word2vec. Both approaches have since been superseded: contextual models such as ELMo and the Transformer-based BERT, which stack deep neural-network layers (self-attention layers, in BERT's case) on top of a word-embedding layer similar to word2vec, have come to be regarded as the state of the art in NLP.[3]

Definition

"You shall know a word by the company it keeps" (Firth, J. R. 1957:11)[4]
The idea of GloVe is to construct, for each word i, two vectors w_i, \tilde w_i, such that the relative positions of the vectors capture part of the statistical regularities of the word i. The statistical regularity is defined as the co-occurrence probabilities: words that resemble each other in meaning should also resemble each other in co-occurrence probabilities.

Word counting

Let the vocabulary be V, the set of all possible words (also known as "tokens"). Punctuation is either ignored or treated as part of the vocabulary, and similarly for capitalization and other typographical details.

If two words occur close to each other, then we say that they occur in the context of each other. For example, if the context length is 3, then we say that in the following sentence

GloVe_1, coined_2 from_3 Global_4 Vectors_5, is_6 a_7 model_8 for_9 distributed_10 word_11 representation_12

the word "model" (position 8) is in the context of "word" (position 11) but not in the context of "representation" (position 12).

A word is not in the context of itself, so "model" at position 8 is not in its own context, although if the same word occurs again within the context window, that other occurrence does count.

Let X_{ij} be the number of times that the word j appears in the context of the word i over the entire corpus. For example, if the corpus is just "I don't think that that is a problem.", we have

X_{\text{that},\text{that}} = 2

since the first "that" appears in the second one's context, and vice versa.

Let X_i = \sum_j X_{ij} be the number of words in the context of all instances of word i. By counting, we have

X_i = 2 \times (\text{context length}) \times \#(\text{occurrences of word } i)

(except for words occurring right at the start or end of the corpus).
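
The counting step can be made concrete with a short sketch. The following Python snippet implements the definitions above on the toy corpus and context length used in this section (the function name cooccurrence_counts and the dictionary-of-dictionaries layout are illustrative choices, not the GloVe reference implementation):

    from collections import defaultdict

    def cooccurrence_counts(tokens, context_length=3):
        """Count X[i][j]: how often word j appears within context_length
        positions of an occurrence of word i (symmetric window, self excluded)."""
        X = defaultdict(lambda: defaultdict(int))
        for pos, word in enumerate(tokens):
            lo = max(0, pos - context_length)
            hi = min(len(tokens), pos + context_length + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:                 # a word is not in its own context
                    X[word][tokens[ctx_pos]] += 1  # a repeated word does count
        return X

    tokens = "i don't think that that is a problem".split()
    X = cooccurrence_counts(tokens, context_length=3)
    print(X["that"]["that"])                     # 2: each "that" is in the other's context
    X_i = {w: sum(ctx.values()) for w, ctx in X.items()}
    print(X_i["that"])                           # 12 = 2 x 3 x 2, matching the formula above
    print(X_i["i"])                              # 3: truncated window at the corpus start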

Probabilistic modelling

Let P_{ik} := P(k \mid i) := \frac{X_{ik}}{X_i} be the co-occurrence probability. That is, if one samples a random occurrence of the word i in the entire document, and a random word within its context, that word is k with probability P_{ik}. Note that P_{ik} \neq P_{ki} in general. For example, in a typical modern English corpus, P_{\text{ado},\text{much}} is close to one, but P_{\text{much},\text{ado}} is close to zero. This is because the word "ado" is almost only used in the context of the archaic phrase "much ado about", but the word "much" occurs in all kinds of contexts.

For example, in a 6 billion token corpus, we have

Probability and ratio (Table 1 of the original paper[1]):

                                  k = solid     k = gas       k = water     k = fashion
    P(k | ice)                    1.9 x 10^-4   6.6 x 10^-5   3.0 x 10^-3   1.7 x 10^-5
    P(k | steam)                  2.2 x 10^-5   7.8 x 10^-4   2.2 x 10^-3   1.8 x 10^-5
    P(k | ice) / P(k | steam)     8.9           8.5 x 10^-2   1.36          0.96

Inspecting the table, we see that the words "ice" and "steam" are indistinguishable along the "water" dimension (it often co-occurs with both) and the "fashion" dimension (it rarely co-occurs with either), but are distinguishable along the "solid" dimension (which co-occurs more with "ice") and the "gas" dimension (which co-occurs more with "steam").
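
The same probabilities and ratios can be computed directly from the counts. The snippet below continues the toy Python sketch above (it reuses the X built there; with so tiny a corpus the numbers are of course not meaningful, the point is only the mechanics):

    def cooccurrence_probability(X, i, k):
        """P(k | i) = X[i][k] / X_i: the chance that a random context word of i is k."""
        X_i = sum(X[i].values())
        return X[i][k] / X_i if X_i else 0.0

    # Asymmetry: P(that | problem) differs from P(problem | that).
    print(cooccurrence_probability(X, "problem", "that"))  # 1/3: one of 3 context words
    print(cooccurrence_probability(X, "that", "problem"))  # 1/12: one of 12 context words

Ratios such as P(k | ice) / P(k | steam) in the table are simply quotients of these values, computed on a large corpus.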

The idea is to learn two vectors w_i, \tilde w_i for each word i, such that we have a multinomial logistic regression:

w_i^T \tilde w_j + b_i + \tilde b_j \approx \ln P_{ij}

and the terms b_i, \tilde b_j are unimportant parameters.

This means that if the words i, j have similar co-occurrence probabilities, (P_{ik})_k \approx (P_{jk})_k, then their vectors should also be similar: w_i \approx w_j.
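
One way to see this, sketched informally here (the assumption that the differences of the context vectors \tilde w_k span the embedding space is an illustrative one, not an argument from the original paper): subtracting the model equation for j from the one for i gives

(w_i - w_j)^T \tilde w_k \approx (\ln P_{ik} - \ln P_{jk}) + (b_j - b_i) \approx b_j - b_i \quad \text{for every } k,

so (w_i - w_j)^T (\tilde w_k - \tilde w_{k'}) \approx 0 for all pairs k, k'. If the differences of the context vectors span the space, the only vector orthogonal to all of them is (approximately) zero, hence w_i \approx w_j.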

Logistic regression

Naively, the logistic regression can be run by minimizing the squared loss:

L = \sum_{i,j} (w_i^T \tilde w_j + b_i + \tilde b_j - \ln P_{ij})^2

However, this would be noisy for rare co-occurrences. To fix the issue, the squared loss is weighted so that the loss is slowly ramped up as the absolute number of co-occurrences X_{ij} increases:

L = \sum_{i,j} f(X_{ij}) \, (w_i^T \tilde w_j + b_i + \tilde b_j - \ln P_{ij})^2

where

f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}

and x_{\max}, \alpha are hyperparameters; the original paper found x_{\max} = 100 and \alpha = 3/4 to work well in practice.
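
Any gradient-based optimizer can minimize this weighted loss. Below is a minimal NumPy sketch of the idea (the dense co-occurrence matrix, plain full-batch gradient descent, the dimension d, and the learning rate are illustrative assumptions; the released GloVe implementation instead runs AdaGrad over the nonzero entries of X and fits \ln X_{ij} directly, absorbing \ln X_i into the bias b_i):

    import numpy as np

    def glove_train(X, d=50, x_max=100.0, alpha=0.75, lr=0.05, epochs=100, seed=0):
        """Minimize sum over nonzero X_ij of f(X_ij) * (w_i.wt_j + b_i + bt_j - ln P_ij)^2
        by gradient descent. X is a dense (V, V) array of co-occurrence counts."""
        rng = np.random.default_rng(seed)
        V = X.shape[0]
        w, wt = 0.1 * rng.standard_normal((V, d)), 0.1 * rng.standard_normal((V, d))
        b, bt = np.zeros(V), np.zeros(V)

        rows, cols = np.nonzero(X)                    # only observed co-occurrences
        x = X[rows, cols].astype(float)
        f = np.minimum((x / x_max) ** alpha, 1.0)     # the weighting function f(X_ij)
        target = np.log(x / X.sum(axis=1)[rows])      # ln P_ij = ln(X_ij / X_i)

        for _ in range(epochs):
            pred = np.einsum("nd,nd->n", w[rows], wt[cols]) + b[rows] + bt[cols]
            err = f * (pred - target)                 # weighted residual per (i, j) pair
            gw, gwt = 2 * err[:, None] * wt[cols], 2 * err[:, None] * w[rows]
            np.add.at(w,  rows, -lr * gw)
            np.add.at(wt, cols, -lr * gwt)
            np.add.at(b,  rows, -lr * 2 * err)
            np.add.at(bt, cols, -lr * 2 * err)
        return w, wt, b, bt

The two vector sets play symmetric roles, and in the original paper the final representation of a word is taken to be the sum w_i + \tilde w_i.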

Notes and References

  1. Pennington, Jeffrey; Socher, Richard; Manning, Christopher (October 2014). "GloVe: Global Vectors for Word Representation". In Moschitti, Alessandro; Pang, Bo; Daelemans, Walter (eds.). Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics. pp. 1532–1543. doi:10.3115/v1/D14-1162.
  2. "GloVe: Global Vectors for Word Representation" (PDF). https://www.aclweb.org/anthology/D14-1162
  3. Von der Mosel, Julian; Trautsch, Alexander; Herbold, Steffen (2022). "On the validity of pre-trained transformers for natural language processing in the software engineering domain". IEEE Transactions on Software Engineering. 49 (4): 1487–1507. arXiv:2109.04738. doi:10.1109/TSE.2022.3178469. ISSN 1939-3520. S2CID 237485425.
  4. Firth, J. R. (1957). Studies in Linguistic Analysis. Wiley-Blackwell.