In statistics, a maximum-entropy Markov model (MEMM), or conditional Markov model (CMM), is a graphical model for sequence labeling that combines features of hidden Markov models (HMMs) and maximum entropy (MaxEnt) models. An MEMM is a discriminative model that extends a standard maximum entropy classifier by assuming that the unknown values to be learnt are connected in a Markov chain rather than being conditionally independent of each other. MEMMs find applications in natural language processing, specifically in part-of-speech tagging[1] and information extraction.[2]
Suppose we have a sequence of observations $O_1, \dots, O_n$ that we seek to tag with labels $S_1, \dots, S_n$ so as to maximize the conditional probability $P(S_1, \dots, S_n \mid O_1, \dots, O_n)$. In an MEMM, this probability is factored into Markov transition probabilities, where the probability of transitioning to a particular label depends only on the observation at that position and the previous position's label:

$$P(S_1, \dots, S_n \mid O_1, \dots, O_n) = \prod_{t=1}^{n} P(S_t \mid S_{t-1}, O_t).$$

Each of these transition probabilities comes from the same general distribution $P(s \mid s', o)$. For each possible value $s'$ of the previous label, the probability of a certain label $s$ is modeled in the same way as a maximum entropy classifier:

$$P(s \mid s', o) = P_{s'}(s \mid o) = \frac{1}{Z(o, s')} \exp\left( \sum_a \lambda_a f_a(o, s) \right).$$

Here, the $f_a(o, s)$ are real-valued or categorical feature functions, and $Z(o, s')$ is a normalization term ensuring that the distribution sums to one. This form corresponds to the maximum entropy probability distribution satisfying the constraint that the empirical expectation of each feature equals its expectation under the model:

$$\operatorname{E}_e\!\left[f_a(o, s)\right] = \operatorname{E}_p\!\left[f_a(o, s)\right] \quad \text{for all } a.$$

The parameters $\lambda_a$ can be estimated using generalized iterative scaling.

The optimal state sequence $S_1, \dots, S_n$ can be found using a Viterbi algorithm very similar to the one used for HMMs. The dynamic program uses the forward probability:

$$\alpha_{t+1}(s) = \sum_{s'} \alpha_t(s')\, P_{s'}(s \mid o_{t+1}).$$
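As a concrete illustration, the following is a minimal Python sketch of the max-product (Viterbi) form of this recursion, in which the sum over $s'$ is replaced by a maximization and backpointers recover the best label sequence. The helper names `log_p_init` and `log_p_trans` are assumptions of the sketch, standing in for already-trained maxent distributions that return log-probabilities.

```python
import numpy as np

def memm_viterbi(observations, states, log_p_init, log_p_trans):
    """Viterbi decoding for an MEMM: the max-product analogue of the forward
    recursion above, with the sum over previous states replaced by a max.

    log_p_init(s, o)          -> log P(s | o) for the first position (assumed helper).
    log_p_trans(s_prev, s, o) -> log P_{s_prev}(s | o), the maxent transition score.
    Both are assumed to return already-normalised log-probabilities.
    """
    n = len(observations)
    # delta[t, j]: best log-probability of any label sequence ending in states[j] at t
    delta = np.full((n, len(states)), -np.inf)
    backptr = np.zeros((n, len(states)), dtype=int)

    for j, s in enumerate(states):
        delta[0, j] = log_p_init(s, observations[0])

    for t in range(1, n):
        for j, s in enumerate(states):
            scores = [delta[t - 1, i] + log_p_trans(s_prev, s, observations[t])
                      for i, s_prev in enumerate(states)]
            backptr[t, j] = int(np.argmax(scores))
            delta[t, j] = scores[backptr[t, j]]

    # Trace the best path back from the highest-scoring final state.
    best = [int(np.argmax(delta[n - 1]))]
    for t in range(n - 1, 0, -1):
        best.append(backptr[t, best[-1]])
    return [states[j] for j in reversed(best)]
```

Working in log space keeps the product of many transition probabilities from underflowing on long sequences.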
An advantage of MEMMs over HMMs for sequence tagging is that they offer increased freedom in choosing features to represent observations. In sequence tagging situations, it is useful to use domain knowledge to design special-purpose features. In the original paper introducing MEMMs, the authors write that "when trying to extract previously unseen company names from a newswire article, the identity of a word alone is not very predictive; however, knowing that the word is capitalized, that it is a noun, that it is used in an appositive, and that it appears near the top of the article would all be quite predictive (in conjunction with the context provided by the state-transition structure)."[2] Useful features for sequence tagging, such as these, are often not independent of one another. Maximum entropy models do not assume independence between features, whereas the generative observation models used in HMMs do.[2] MEMMs therefore allow the user to specify many correlated but informative features.
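As a small, hypothetical sketch of such features, the function below computes overlapping word-level cues of the kind described in the quotation above; in the maxent transition model each cue is implicitly conjoined with a candidate tag $s$ to give a feature $f_a(o, s)$. The function name and the particular cues are illustrative, not taken from the cited papers.

```python
def word_cues(words, t):
    """Overlapping, non-independent observation cues for position t.
    'Capitalized' and 'all caps' clearly correlate; a maxent model does not
    require such cues to be independent, unlike an HMM's generative
    observation model."""
    word = words[t]
    prev_word = words[t - 1] if t > 0 else "<START>"
    return {
        f"word={word.lower()}": 1.0,
        "capitalized": float(word[:1].isupper()),
        "all_caps": float(word.isupper()),
        "has_digit": float(any(c.isdigit() for c in word)),
        f"prev_word={prev_word.lower()}": 1.0,
    }
```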
Another advantage of MEMMs over both HMMs and conditional random fields (CRFs) is that training can be considerably more efficient. In HMMs and CRFs, one needs to use some version of the forward–backward algorithm as an inner loop in training. In an MEMM, by contrast, the parameters of the maximum-entropy distributions used for the transition probabilities can be estimated for each transition distribution in isolation.
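To make this concrete, the hypothetical sketch below fits one multinomial logistic regression (the usual parameterization of a maximum entropy classifier) per previous tag, each on its own slice of the training data and with no forward–backward pass; scikit-learn is used purely for convenience and is not part of the original formulation.

```python
from collections import defaultdict

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_memm_transitions(tagged_sentences, obs_features):
    """Fit one maxent classifier P_{s'}(s | o) per previous tag s'.

    tagged_sentences: iterable of sentences, each a list of (word, tag) pairs.
    obs_features(words, t): observation features at position t (e.g. word_cues above).
    Each classifier sees only the positions whose previous tag is s', so training
    needs no forward-backward pass over whole sequences.
    """
    by_prev = defaultdict(lambda: ([], []))  # s' -> (feature dicts, current tags)
    for sent in tagged_sentences:
        words = [w for w, _ in sent]
        prev = "<START>"
        for t, (_, tag) in enumerate(sent):
            feats, tags = by_prev[prev]
            feats.append(obs_features(words, t))
            tags.append(tag)
            prev = tag

    models = {}
    for prev, (feats, tags) in by_prev.items():
        # One independent maximum-likelihood fit per previous tag; this sketch
        # ignores degenerate groups in which only a single current tag occurs.
        clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit(feats, tags)
        models[prev] = clf
    return models
```

Decoding with these per-tag classifiers then proceeds with the Viterbi-style recursion sketched earlier.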
A drawback of MEMMs is that they potentially suffer from the "label bias problem," where states with low-entropy transition distributions "effectively ignore their observations." Conditional random fields were designed to overcome this weakness,[5] which had already been recognised in the context of neural network-based Markov models in the early 1990s.[5][6] Another source of label bias is that training is always done with respect to known previous tags, so the model struggles at test time when there is uncertainty in the previous tag.