A word n-gram language model is a purely statistical model of language. It has been superseded by recurrent neural network–based models, which have in turn been superseded by large language models.[1] It is based on the assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words. If only one previous word is considered, it is called a bigram model; if two words, a trigram model; if n − 1 words, an n-gram model. Special tokens, ⟨s⟩ and ⟨/s⟩, are introduced to denote the start and end of a sentence.
To prevent unseen words from being assigned zero probability, each word's probability is set slightly lower than its relative frequency in the corpus would suggest. Various methods were used to calculate it, from simple "add-one" smoothing (assigning a count of 1 to unseen n-grams, as an uninformative prior) to more sophisticated models, such as Good–Turing discounting or back-off models.
See also: Bag-of-words model.
A special case, where n = 1, is called a unigram model. The probability of each word in a sequence is independent of the probabilities of the other words in the sequence, and each word's probability in the sequence is simply its probability in the document as a whole.
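Under this independence assumption, the probability of a whole sequence factorizes into the product of the individual word probabilities; for a three-word sequence,

P_\text{uni}(w_1 w_2 w_3) = P(w_1)\,P(w_2)\,P(w_3).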
The model consists of units, each treated as a one-state finite automaton.[2] Words with their probabilities in a document can be illustrated as follows.
Word | Its probability in doc
---|---
a | 0.1
world | 0.2
likes | 0.05
we | 0.05
share | 0.3
... | ...
The total probability mass distributed across the document's vocabulary is 1.
The probability generated for a specific query is calculated as

P(\text{query}) = \prod_{\text{term in query}} P(\text{term}).
Unigram models of different documents assign different probabilities to the words they contain. The probability distributions from different documents are used to generate hit probabilities for each query, and documents can then be ranked for a query according to these probabilities. Example unigram models of two documents:
Word | Its probability in Doc1 | Its probability in Doc2
---|---|---
a | 0.1 | 0.3
world | 0.2 | 0.1
likes | 0.05 | 0.03
we | 0.05 | 0.02
share | 0.3 | 0.2
... | ... | ...
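A minimal sketch of this ranking procedure, using the illustrative values from the tables above; the query and the small floor probability for unseen terms (a stand-in for proper smoothing) are my own choices:

```python
# Unigram query-likelihood ranking: score each document by the product of
# the probabilities its unigram model assigns to the query terms.
doc1 = {"a": 0.1, "world": 0.2, "likes": 0.05, "we": 0.05, "share": 0.3}
doc2 = {"a": 0.3, "world": 0.1, "likes": 0.03, "we": 0.02, "share": 0.2}

def query_likelihood(model, query, unseen_prob=1e-6):
    """Multiply the unigram probabilities of the query terms.
    Terms absent from the model get a small floor probability."""
    score = 1.0
    for term in query.split():
        score *= model.get(term, unseen_prob)
    return score

query = "we share a world"                     # hypothetical query
scores = {"Doc1": query_likelihood(doc1, query),
          "Doc2": query_likelihood(doc2, query)}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```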
In a bigram word (n = 2) language model, the probability of the sentence I saw the red house is approximated as

P(\text{I, saw, the, red, house}) \approx P(\text{I}\mid\langle s\rangle)\, P(\text{saw}\mid\text{I})\, P(\text{the}\mid\text{saw})\, P(\text{red}\mid\text{the})\, P(\text{house}\mid\text{red})\, P(\langle/s\rangle\mid\text{house}).
In a trigram (n = 3) language model, the approximation is

P(\text{I, saw, the, red, house}) \approx P(\text{I}\mid\langle s\rangle,\langle s\rangle)\, P(\text{saw}\mid\langle s\rangle,\text{I})\, P(\text{the}\mid\text{I},\text{saw})\, P(\text{red}\mid\text{saw},\text{the})\, P(\text{house}\mid\text{the},\text{red})\, P(\langle/s\rangle\mid\text{red},\text{house}).
Note that the context of the first n – 1 n-grams is filled with start-of-sentence markers, typically denoted <s>.
Additionally, without an end-of-sentence marker, the probability of an ungrammatical sequence *I saw the would always be higher than that of the longer sentence I saw the red house, since every additional conditional factor can only lower the product.
The approximation method calculates the probability $P(w_1,\ldots,w_m)$ of observing the sentence $w_1,\ldots,w_m$:

P(w_1,\ldots,w_m) = \prod_{i=1}^{m} P(w_i \mid w_1,\ldots,w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)},\ldots,w_{i-1}).
It is assumed that the probability of observing the ith word $w_i$ (in the context window consisting of the preceding i − 1 words) can be approximated by the probability of observing it in the shortened context window consisting of the preceding n − 1 words (nth-order Markov property). To clarify, for n = 3 and i = 2 we have

P(w_i \mid w_{i-(n-1)},\ldots,w_{i-1}) = P(w_2 \mid w_1).
The conditional probability can be calculated from n-gram model frequency counts:

P(w_i \mid w_{i-(n-1)},\ldots,w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1},w_i)}{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1})}.
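A minimal sketch of this count-based (maximum-likelihood) estimate for a bigram model; the toy corpus is invented for illustration:

```python
from collections import Counter

# Toy corpus; <s> and </s> mark sentence boundaries as described above.
sentences = [["<s>", "i", "saw", "the", "red", "house", "</s>"],
             ["<s>", "i", "saw", "the", "dog", "</s>"]]

context_counts = Counter()   # how often a word appears as a bigram context
bigram_counts = Counter()
for sent in sentences:
    for i in range(len(sent) - 1):
        context_counts[sent[i]] += 1
        bigram_counts[(sent[i], sent[i + 1])] += 1

def bigram_prob(w_prev, w):
    """P(w | w_prev) = count(w_prev, w) / count(w_prev)."""
    return bigram_counts[(w_prev, w)] / context_counts[w_prev]

print(bigram_prob("saw", "the"))   # 1.0: "the" always follows "saw" here
print(bigram_prob("the", "red"))   # 0.5
```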
See main article: Statistical machine translation.
An issue when using n-gram language models is out-of-vocabulary (OOV) words: words encountered in computational linguistics and natural language processing when the input includes words that were not present in the system's dictionary or database during its preparation. By default, when a language model is estimated, the entire observed vocabulary is used. In some cases, however, it may be necessary to estimate the language model with a specific, fixed vocabulary. In such a scenario, the n-grams in the corpus that contain an out-of-vocabulary word are ignored, and the n-gram probabilities are smoothed over all the words in the vocabulary, even those that were not observed.[3]
Nonetheless, it is essential in some cases to explicitly model the probability of out-of-vocabulary words by introducing a special token (e.g. <unk>) into the vocabulary, so that out-of-vocabulary words in the corpus are replaced with this token before n-gram counts are accumulated.
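A sketch of this preprocessing step, assuming a fixed vocabulary; the vocabulary and sentence here are illustrative:

```python
# Map words outside a fixed vocabulary to a special <unk> token before
# counting n-grams, so all unknown words share one probability estimate.
vocabulary = {"<s>", "</s>", "i", "saw", "the", "red", "house"}

def replace_oov(tokens, vocab, unk="<unk>"):
    """Replace tokens not in vocab with the unk placeholder."""
    return [tok if tok in vocab else unk for tok in tokens]

print(replace_oov(["<s>", "i", "saw", "the", "green", "house", "</s>"],
                  vocabulary))
# ['<s>', 'i', 'saw', 'the', '<unk>', 'house', '</s>']
```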
See main article: Approximate string matching.
n-grams were also used for approximate matching. If we convert strings (with only letters in the English alphabet) into character 3-grams, we get a $26^3$-dimensional vector space (one dimension for each possible sequence of three letters).
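A small sketch of this character-3-gram representation, here comparing two strings by the cosine similarity of their sparse 3-gram count vectors; cosine similarity is one common choice, not the only one:

```python
from collections import Counter
from math import sqrt

def char_ngrams(s, n=3):
    """Counts of all overlapping character n-grams in s."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine(char_ngrams("ceramic"), char_ngrams("ceramics")))  # close to 1
print(cosine(char_ngrams("ceramic"), char_ngrams("holidays")))  # 0.0, no shared 3-grams
```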
It is also possible to take a more principled approach to the statistics of n-grams, modeling the similarity of two strings as the likelihood that they came from the same source, framed directly as a problem in Bayesian inference.
n-gram-based searching was also used for plagiarism detection.
To choose a value for n in an n-gram model, it is necessary to find the right trade-off between the stability of the estimate and its appropriateness. This means that a trigram model (i.e. triplets of words) is a common choice with large training corpora (millions of words), whereas a bigram model is often used with smaller ones.
There is also a problem of balancing weight between infrequent n-grams (for example, if a proper name appeared in the training data) and frequent n-grams. In addition, items not seen in the training data will be given a probability of 0.0 without smoothing. For unseen but plausible data from a sample, one can introduce pseudocounts. Pseudocounts are generally motivated on Bayesian grounds.
In practice it was necessary to smooth the probability distributions by also assigning non-zero probabilities to unseen words or n-grams. The reason is that models derived directly from the n-gram frequency counts have severe problems when confronted with any n-grams that have not explicitly been seen before – the zero-frequency problem. Various smoothing methods were used, from simple "add-one" (Laplace) smoothing (assign a count of 1 to unseen n-grams; see Rule of succession) to more sophisticated models, such as Good–Turing discounting or back-off models. Some of these methods are equivalent to assigning a prior distribution to the probabilities of the n-grams and using Bayesian inference to compute the resulting posterior n-gram probabilities. However, the more sophisticated smoothing models were typically not derived in this fashion, but instead through independent considerations.
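A minimal sketch of add-one (Laplace) smoothing for bigram probabilities; the counts and vocabulary below are toy values, and |V| denotes the vocabulary size:

```python
from collections import Counter

# Add-one (Laplace) smoothing adds a pseudocount of 1 to every possible
# bigram, so unseen bigrams get a small non-zero probability instead of 0.
bigram_counts = Counter({("saw", "the"): 2, ("the", "red"): 1, ("the", "dog"): 1})
context_counts = Counter({"saw": 2, "the": 2})
vocab = {"<s>", "</s>", "i", "saw", "the", "red", "dog", "house"}

def laplace_bigram_prob(w_prev, w):
    """P(w | w_prev) = (count(w_prev, w) + 1) / (count(w_prev) + |V|)."""
    return (bigram_counts[(w_prev, w)] + 1) / (context_counts[w_prev] + len(vocab))

print(laplace_bigram_prob("the", "red"))    # seen bigram: (1 + 1) / (2 + 8) = 0.2
print(laplace_bigram_prob("the", "house"))  # unseen bigram: (0 + 1) / (2 + 8) = 0.1
```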
The skip-gram language model is an attempt to overcome the data sparsity problem that the preceding model (i.e. the word n-gram language model) faced. Words represented in an embedding vector were no longer required to be consecutive, but could leave gaps that are skipped over.[5]
Formally, a k-skip-n-gram is a length-n subsequence whose components occur at a distance of at most k from each other.
For example, in the input text:
the rain in Spain falls mainly on the plain
the set of 1-skip-2-grams includes all the bigrams (2-grams), and in addition the subsequences
the in, rain Spain, in falls, Spain mainly, falls on, mainly the, and on plain.
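A short sketch of extracting k-skip-n-grams as defined above (the helper name is mine); with k = 1 and n = 2 on the example sentence it yields the ordinary bigrams plus the skipped pairs listed above:

```python
from itertools import combinations

def skip_grams(tokens, n=2, k=1):
    """All k-skip-n-grams: length-n subsequences whose consecutive
    components skip at most k tokens in the original sequence."""
    result = []
    for idx in combinations(range(len(tokens)), n):
        if all(idx[j + 1] - idx[j] <= k + 1 for j in range(n - 1)):
            result.append(tuple(tokens[i] for i in idx))
    return result

sentence = "the rain in Spain falls mainly on the plain".split()
print(skip_grams(sentence, n=2, k=1))
# Includes ordinary bigrams such as ('the', 'rain') as well as
# skipped pairs such as ('the', 'in') and ('rain', 'Spain').
```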
In the skip-gram model, semantic relations between words are represented by linear combinations, capturing a form of compositionality. For example, in some such models, if v is the function that maps a word w to its n-dimensional vector representation, then

v(\text{king}) - v(\text{male}) + v(\text{female}) \approx v(\text{queen})
where ≈ is made precise by stipulating that its right-hand side must be the nearest neighbor of the value of the left-hand side.[6] [7]
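A toy illustration of that nearest-neighbour reading, with invented two-dimensional vectors standing in for learned embeddings:

```python
import numpy as np

# Invented 2-d vectors standing in for learned word embeddings.
v = {"king":   np.array([0.9, 0.8]),
     "queen":  np.array([0.9, 0.2]),
     "male":   np.array([0.1, 0.7]),
     "female": np.array([0.1, 0.1]),
     "house":  np.array([0.5, 0.5])}

target = v["king"] - v["male"] + v["female"]   # should land near v["queen"]

def nearest(vec, vocab, exclude=()):
    """Word whose vector is closest (Euclidean) to vec."""
    return min((w for w in vocab if w not in exclude),
               key=lambda w: np.linalg.norm(vocab[w] - vec))

print(nearest(target, v, exclude={"king", "male", "female"}))  # 'queen'
```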
Syntactic n-grams are n-grams defined by paths in syntactic dependency or constituent trees rather than the linear structure of the text.[8] [9] [10] For example, the sentence "economic news has little effect on financial markets" can be transformed to syntactic n-grams following the tree structure of its dependency relations: news-economic, effect-little, effect-on-markets-financial.
Syntactic n-grams are intended to reflect syntactic structure more faithfully than linear n-grams, and they have many of the same applications, especially as features in a vector space model. For certain tasks, syntactic n-grams give better results than standard n-grams, for example in authorship attribution.[11]
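A sketch of extracting continuous syntactic n-grams as downward paths in a dependency tree; the head-to-dependents structure below is an assumed parse of the example sentence, for illustration only:

```python
# Dependency structure of "economic news has little effect on financial markets",
# written as head -> dependents (an assumed parse, for illustration only).
tree = {"has": ["news", "effect"],
        "news": ["economic"],
        "effect": ["little", "on"],
        "on": ["markets"],
        "markets": ["financial"]}

def syntactic_ngrams(tree, n):
    """All length-n downward paths in the dependency tree, joined with '-'."""
    nodes = set(tree) | {d for deps in tree.values() for d in deps}
    paths = []

    def walk(node, path):
        path = path + [node]
        if len(path) == n:
            paths.append("-".join(path))
        else:
            for child in tree.get(node, []):
                walk(child, path)

    for start in nodes:          # any node may begin a path
        walk(start, [])
    return paths

print(syntactic_ngrams(tree, 2))  # includes 'news-economic' and 'effect-little'
print(syntactic_ngrams(tree, 4))  # includes 'effect-on-markets-financial'
```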
Another type of syntactic n-gram is the part-of-speech n-gram, defined as a fixed-length contiguous overlapping subsequence extracted from the part-of-speech sequence of a text. Part-of-speech n-grams have several applications, most commonly in information retrieval.[12]
n-grams find use in several areas of computer science, computational linguistics, and applied mathematics.
They have been used to: