Sliding window based part-of-speech tagging is used to part-of-speech tag a text.
A high percentage of words in a natural language are words which, out of context, can be assigned more than one part of speech. The percentage of these ambiguous words is typically around 30%, although it depends greatly on the language. Solving this problem is very important in many areas of natural language processing. For example, in machine translation, changing the part of speech of a word can dramatically change its translation.
Sliding window based part-of-speech taggers are programs which assign a single part of speech to a given lexical form of a word by looking at a fixed-size "window" of words around the word to be disambiguated.
The two main advantages of this approach are:
Let $\Gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_{|\Gamma|}\}$ be the set of grammatical tags of the application, that is, the set of all possible tags which may be assigned to a word, and let $W = \{w_1, w_2, \ldots\}$ be the vocabulary of the application. Let $T : W \rightarrow P(\Gamma)$ be a function for morphological analysis which assigns each word $w$ its set of possible tags, $T(w) \subseteq \Gamma$. Let $\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_{|\Sigma|}\}$ be the set of word classes, which in general will be a partition of $W$, with the restriction that all the words $w$ in a given class $\sigma \in \Sigma$ receive the same set of tags, that is, all the words in a word class $\sigma$ belong to the same ambiguity class.

Normally, $\Sigma$ is constructed so that each high-frequency word forms its own word class, while low-frequency words are grouped into word classes corresponding to their ambiguity classes, that is, to the set of tags they can receive.
With these definitions it is possible to state the problem in the following way: given a text $w[1] w[2] \ldots w[L] \in W^*$, each word $w[t]$ is assigned its word class $\sigma[t] \in \Sigma$ in order to obtain an ambiguously tagged text $\sigma[1] \sigma[2] \ldots \sigma[L] \in \Sigma^*$. The job of the tagger is to produce a correctly tagged text $\gamma[1] \gamma[2] \ldots \gamma[L]$, with $\gamma[t] \in T(\sigma[t])$ for every position $t$.
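As a minimal illustration of these definitions and of the problem statement, the sketch below uses a hypothetical toy tagset and lexicon; the tag names, words and frequency threshold are illustrative assumptions, not a real linguistic resource:

```python
# Minimal sketch of the definitions above. The tagset, the lexicon and the
# notion of "high-frequency word" are illustrative assumptions only.

GAMMA = {"DET", "NOUN", "VERB", "PREP"}      # grammatical tags of the application

LEXICON = {                                  # T : W -> P(Gamma), here a full-form lexicon
    "the":  {"DET"},
    "he":   {"NOUN"},                        # simplification: pronouns folded into NOUN
    "man":  {"NOUN", "VERB"},                # ambiguous out of context
    "runs": {"NOUN", "VERB"},                # ambiguous out of context
    "from": {"PREP"},
}

HIGH_FREQUENCY = {"the", "from"}             # words frequent enough to get their own class

def T(word):
    """Morphological analysis: the set of possible tags of a word."""
    return LEXICON[word]

def word_class(word):
    """sigma(w): a singleton class for frequent words, otherwise the ambiguity class."""
    return word if word in HIGH_FREQUENCY else frozenset(T(word))

def ambiguously_tag(text):
    """Map a text w[1]..w[L] to its word-class sequence sigma[1]..sigma[L]."""
    return [word_class(w) for w in text]

# Word classes for "he runs from the man": the tagger must later pick one tag
# from T(sigma[t]) at every position.
print(ambiguously_tag(["he", "runs", "from", "the", "man"]))
```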
A statistical tagger looks for the most probable tag sequence for an ambiguously tagged text $\sigma[1] \sigma[2] \ldots \sigma[L]$:

$$\gamma^*[1] \ldots \gamma^*[L] = \operatorname*{arg\,max}_{\gamma[t] \in T(\sigma[t])} p(\gamma[1] \ldots \gamma[L] \mid \sigma[1] \ldots \sigma[L])$$
Using Bayes' formula, this is converted into:

$$\gamma^*[1] \ldots \gamma^*[L] = \operatorname*{arg\,max}_{\gamma[t] \in T(\sigma[t])} p(\gamma[1] \ldots \gamma[L])\, p(\sigma[1] \ldots \sigma[L] \mid \gamma[1] \ldots \gamma[L])$$
where $p(\gamma[1] \gamma[2] \ldots \gamma[L])$ is the probability of the tag sequence (the syntactic probability) and $p(\sigma[1] \ldots \sigma[L] \mid \gamma[1] \ldots \gamma[L])$ is the probability that this tag sequence corresponds to the text $\sigma[1] \ldots \sigma[L]$ (the lexical probability).
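The Bayes step, spelled out (a standard rewriting, not specific to this tagger):

$$p(\gamma[1] \ldots \gamma[L] \mid \sigma[1] \ldots \sigma[L]) = \frac{p(\gamma[1] \ldots \gamma[L])\, p(\sigma[1] \ldots \sigma[L] \mid \gamma[1] \ldots \gamma[L])}{p(\sigma[1] \ldots \sigma[L])}$$

Since the denominator $p(\sigma[1] \ldots \sigma[L])$ does not depend on the choice of tags, it can be dropped from the maximisation.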
In a Markov model, these probabilities are approximated as products. The syntactic probabilities are modelled by a first order Markov process:
$$p(\gamma[1] \gamma[2] \ldots \gamma[L]) = \prod_{t=1}^{t=L} p(\gamma[t+1] \mid \gamma[t])$$

where $\gamma[0]$ and $\gamma[L+1]$ are delimiter tags.
Lexical probabilities are independent of context:
$$p(\sigma[1] \sigma[2] \ldots \sigma[L] \mid \gamma[1] \gamma[2] \ldots \gamma[L]) = \prod_{t=1}^{t=L} p(\sigma[t] \mid \gamma[t])$$
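Under these two approximations, the most probable tagging of a short sentence can be found by exhaustive search over the allowed tag sequences. The sketch below illustrates this; the transition and lexical probability tables are hypothetical placeholders, not values estimated from any corpus:

```python
from itertools import product

# Hypothetical model parameters; in practice these would be estimated from data.
TRANS = {                      # p(gamma[t+1] | gamma[t]); "#" is the delimiter tag
    ("#", "NOUN"): 0.6, ("#", "DET"): 0.4,
    ("NOUN", "VERB"): 0.7, ("NOUN", "#"): 0.3,
    ("DET", "NOUN"): 1.0,
    ("VERB", "NOUN"): 0.5, ("VERB", "#"): 0.5,
}
EMIT = {                       # p(sigma[t] | gamma[t]) for word classes sigma
    (frozenset({"NOUN"}), "NOUN"): 0.2,
    (frozenset({"NOUN", "VERB"}), "NOUN"): 0.1,
    (frozenset({"NOUN", "VERB"}), "VERB"): 0.3,
}

def score(tags, classes):
    """p(gamma[1..L]) * p(sigma[1..L] | gamma[1..L]) under the Markov model."""
    padded = ["#"] + list(tags) + ["#"]          # delimiter tags gamma[0], gamma[L+1]
    syntactic = 1.0
    for prev, cur in zip(padded, padded[1:]):
        syntactic *= TRANS.get((prev, cur), 0.0)
    lexical = 1.0
    for sigma, gamma in zip(classes, tags):
        lexical *= EMIT.get((sigma, gamma), 0.0)
    return syntactic * lexical

def tag(classes, candidates):
    """Brute-force arg max over all tag sequences with gamma[t] in candidates[t]."""
    return max(product(*candidates), key=lambda tags: score(tags, classes))

classes = [frozenset({"NOUN"}), frozenset({"NOUN", "VERB"})]  # e.g. "he runs"
candidates = [list(c) for c in classes]                       # gamma[t] ranges over T(sigma[t])
print(tag(classes, candidates))                               # -> ('NOUN', 'VERB')
```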
One form of tagging approximates the first probability formula directly, using a product of probabilities that each depend only on a narrow local context:

$$p(\gamma[1] \ldots \gamma[L] \mid \sigma[1] \ldots \sigma[L]) \simeq \prod_{t=1}^{t=L} p(\gamma[t] \mid C_{(-)}[t]\, \sigma[t]\, C_{(+)}[t])$$
where $C_{(-)}[t] = \sigma[t-N_{(-)}]\, \sigma[t-N_{(-)}+1] \ldots \sigma[t-1]$ is the left context of $N_{(-)}$ word classes and $C_{(+)}[t] = \sigma[t+1]\, \sigma[t+2] \ldots \sigma[t+N_{(+)}]$ is the right context of $N_{(+)}$ word classes.
In this way the sliding window algorithm only has to take into account a context of size $N_{(-)} + N_{(+)} + 1$. For most applications $N_{(-)} = N_{(+)} = 1$, that is, the tag of a word is decided by looking at one word class to its left and one to its right.
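A minimal sketch of the disambiguation step itself, assuming $N_{(-)} = N_{(+)} = 1$ and a hypothetical table of probabilities $p(\gamma \mid C_{(-)}\, \sigma\, C_{(+)})$; in a real tagger this table would be estimated from a corpus:

```python
# Sliding-window disambiguation with N(-) = N(+) = 1. Word classes are
# represented here simply by their ambiguity classes (frozensets of tags);
# the probability table P is a hypothetical placeholder, not estimated values.

BOUNDARY = frozenset({"#"})                   # delimiter class at sentence edges
NOUN_OR_VERB = frozenset({"NOUN", "VERB"})

# P[(left, sigma, right)][gamma] ~ p(gamma | C(-) sigma C(+))
P = {
    (frozenset({"NOUN"}), NOUN_OR_VERB, frozenset({"PREP"})): {"VERB": 0.9, "NOUN": 0.1},
    (frozenset({"DET"}), NOUN_OR_VERB, BOUNDARY):             {"NOUN": 0.8, "VERB": 0.2},
}

def tag_text(classes):
    """Assign to each position the most probable tag given a window of size 3."""
    padded = [BOUNDARY] + list(classes) + [BOUNDARY]
    tags = []
    for t in range(1, len(padded) - 1):
        sigma = padded[t]
        if len(sigma) == 1:                   # unambiguous word class: take its only tag
            tags.append(next(iter(sigma)))
            continue
        context = (padded[t - 1], sigma, padded[t + 1])
        dist = P[context]                     # unseen contexts are not handled in this sketch
        tags.append(max(dist, key=dist.get))
    return tags

# "he runs from" -> word classes {NOUN}, {NOUN, VERB}, {PREP}
print(tag_text([frozenset({"NOUN"}), NOUN_OR_VERB, frozenset({"PREP"})]))
# -> ['NOUN', 'VERB', 'PREP']
```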