Kneser–Ney smoothing, also known as Kneser-Essen-Ney smoothing, is a method primarily used to calculate the probability distribution of n-grams in a document based on their histories.[1] It is widely considered the most effective method of smoothing due to its use of absolute discounting: a fixed value is subtracted from the count of each observed n-gram, and the probability mass freed up this way is redistributed to lower-order models. This approach has been considered equally effective for both higher and lower order n-grams. The method was proposed in a 1994 paper by Reinhard Kneser, Ute Essen and Hermann Ney.[2]
A common example that illustrates the concept behind this method is the frequency of the bigram "San Francisco". If it appears several times in a training corpus, the frequency of the unigram "Francisco" will also be high. Relying on only the unigram frequency to predict the frequencies of n-grams leads to skewed results;[3] however, Kneser–Ney smoothing corrects this by considering the frequency of the unigram in relation to possible words preceding it.
Let $c(w,w')$ denote the number of times the word $w$ is followed by the word $w'$ in the corpus.
The equation for bigram probabilities is as follows:
\[
p_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - \delta, 0)}{\sum_{w'} c(w_{i-1}, w')} + \lambda_{w_{i-1}} \, p_{KN}(w_i)
\]
where the unigram probability $p_{KN}(w_i)$ depends on how likely it is to see the word $w_i$ in an unfamiliar context. It is estimated as the number of distinct words that precede $w_i$ anywhere in the corpus, divided by the number of distinct bigrams in the corpus:
\[
p_{KN}(w_i) = \frac{|\{w' : 0 < c(w', w_i)\}|}{|\{(w', w'') : 0 < c(w', w'')\}|}
\]
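To make the counting concrete, here is a minimal Python sketch of this continuation probability; the function name, the toy corpus, and the use of collections.Counter are illustrative assumptions, not part of the method's original description:

```python
from collections import Counter

def continuation_probability(word, bigram_counts):
    # |{w' : 0 < c(w', word)}| : number of distinct words seen before `word`
    distinct_left = sum(1 for (w1, w2) in bigram_counts if w2 == word)
    # |{(w', w'') : 0 < c(w', w'')}| : number of distinct bigram types overall
    return distinct_left / len(bigram_counts)

tokens = "the cat sat on the mat and the cat ran".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
# "cat" is preceded only by "the", and there are 8 distinct bigram types
print(continuation_probability("cat", bigram_counts))  # 0.125
```

Note how this estimate rewards words that follow many different contexts: a word like "Francisco", which almost exclusively follows "San", would score low despite its high raw frequency.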
Note that $p_{KN}$ is a proper distribution, as the values defined in the above way are non-negative and sum to one.
The parameter $\delta$ is a constant which denotes the discount value subtracted from the count of each n-gram; it usually lies between 0 and 1.
The value of the normalizing constant $\lambda_{w_{i-1}}$ is chosen to make the conditional probabilities $p_{KN}(w_i \mid w_{i-1})$ sum to one over all $w_i$. Observe that (provided $\delta < 1$) for each $w_i$ that occurs at least once in the context of $w_{i-1}$ in the corpus, the probability is discounted by exactly the same constant amount ${\delta}/\left(\sum_{w'} c(w_{i-1}, w')\right)$, so the total discount depends linearly on the number of distinct words $w_i$ that can occur after $w_{i-1}$. This total discount is a budget that can be spread over all $p_{KN}(w_i \mid w_{i-1})$ proportionally to $p_{KN}(w_i)$. As the values of $p_{KN}(w_i)$ sum to one, $\lambda_{w_{i-1}}$ can simply be defined as this total discount:
\[
\lambda_{w_{i-1}} = \frac{\delta}{\sum_{w'} c(w_{i-1}, w')} \, |\{w' : 0 < c(w_{i-1}, w')\}|
\]
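Putting the discounted term, the normalizing constant, and the continuation probability together, the bigram model can be sketched as follows, reusing continuation_probability and bigram_counts from the previous example; the discount value 0.75 is a common illustrative choice, not one prescribed by the formula:

```python
def kneser_ney_bigram(word, prev, bigram_counts, delta=0.75):
    # Denominator: sum over w' of c(prev, w'); assumes prev was observed at
    # least once as a first word, otherwise this would divide by zero
    prev_total = sum(c for (w1, _w2), c in bigram_counts.items() if w1 == prev)
    # Discounted bigram term: max(c(prev, word) - delta, 0) / prev_total
    discounted = max(bigram_counts.get((prev, word), 0) - delta, 0) / prev_total
    # lambda_prev: the total discount, spread via the continuation distribution
    distinct_followers = sum(1 for (w1, _w2) in bigram_counts if w1 == prev)
    lam = (delta / prev_total) * distinct_followers
    return discounted + lam * continuation_probability(word, bigram_counts)

print(kneser_ney_bigram("cat", "the", bigram_counts))  # ~0.479
```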
This equation can be extended to n-grams. Let $w_{i-n+1}^{i-1}$ denote the $n-1$ words before $w_i$:
\[
p_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max(c(w_{i-n+1}^{i-1}, w_i) - \delta, 0)}{\sum_{w'} c(w_{i-n+1}^{i-1}, w')} + \delta \, \frac{|\{w' : 0 < c(w_{i-n+1}^{i-1}, w')\}|}{\sum_{w'} c(w_{i-n+1}^{i-1}, w')} \, p_{KN}(w_i \mid w_{i-n+2}^{i-1})
\]
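A recursive sketch of this extension, under the assumption that counts[k] maps k-gram tuples to their corpus counts, could look like the following; the fallback for entirely unseen contexts is an illustrative choice that the displayed formula does not address:

```python
def kneser_ney(word, context, counts, delta=0.75):
    """p_KN(word | context) for a context tuple of length n-1."""
    if not context:
        # Base case: the continuation unigram probability
        bigrams = counts[2]
        return sum(1 for b in bigrams if b[1] == word) / len(bigrams)
    ngrams = counts[len(context) + 1]
    # Sum over w' of c(context, w')
    total = sum(c for g, c in ngrams.items() if g[:-1] == context)
    if total == 0:
        # Context never observed: fall back to the shorter context
        return kneser_ney(word, context[1:], counts, delta)
    discounted = max(ngrams.get(context + (word,), 0) - delta, 0) / total
    distinct = sum(1 for g in ngrams if g[:-1] == context)
    lam = (delta / total) * distinct
    return discounted + lam * kneser_ney(word, context[1:], counts, delta)
```

This follows the displayed recursion literally, using raw counts at every order above the unigram base case; fuller treatments also replace the raw counts at the intermediate orders with continuation counts.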
This model uses the concept of absolute-discounting interpolation, which incorporates information from higher- and lower-order language models. The addition of the lower-order term gives more weight to the overall probability when the count for the higher-order n-gram is zero.[6] Similarly, the weight of the lower-order model decreases when the count of the n-gram is non-zero.
Modifications of this method also exist; Chen and Goodman's 1998 paper lists and benchmarks several of them.[7] This approach was once used for Google Translate under a MapReduce implementation.[8] KenLM is a performant open-source implementation, with a focus on computational efficiency and scaling to multi-core systems.[9]
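For reference, querying a trained model through KenLM's Python bindings looks roughly like this; it assumes a model file lm.arpa has already been built, for example with KenLM's lmplz tool:

```python
import kenlm

model = kenlm.Model('lm.arpa')  # assumed pre-built ARPA or binary model file
# Total log10 probability of the sentence, with begin/end-of-sentence markers
print(model.score('the cat sat on the mat', bos=True, eos=True))
```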