In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate.[1] The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.[2] [3] [4]
If the likelihood function is differentiable, the derivative test for finding maxima can be applied. In some cases, the first-order conditions of the likelihood function can be solved analytically; for instance, the ordinary least squares estimator for a linear regression model maximizes the likelihood when the random errors are assumed to have normal distributions with the same variance.[5]
From the perspective of Bayesian inference, MLE is generally equivalent to maximum a posteriori (MAP) estimation with uniform prior distributions (or a normal prior distribution with a standard deviation of infinity). In frequentist inference, MLE is a special case of an extremum estimator, with the objective function being the likelihood.
We model a set of observations as a random sample from an unknown joint probability distribution which is expressed in terms of a set of parameters. The goal of maximum likelihood estimation is to determine the parameters for which the observed data have the highest joint probability. We write the parameters governing the joint distribution as a vector
\theta=\left[\theta1,\theta2,\ldots,\thetak\right]T
\{f( ⋅ ;\theta)\mid\theta\in\Theta\} ,
\Theta
y=(y1,y2,\ldots,yn)
l{L}n(\theta)=l{L}n(\theta;y)=fn(y;\theta) ,
fn(y;\theta)
fn(y;\theta)=
n | |
\prod | |
k=1 |
univar(y | |
f | |
k; |
\theta)~.
The goal of maximum likelihood estimation is to find the values of the model parameters that maximize the likelihood function over the parameter space,[6] that is
\hat{\theta}=\underset{\theta\in\Theta}{\operatorname{arg max}}l{L}n(\theta;y)~.
~\hat{\theta}=\hat{\theta}n(y)\in\Theta~
l{L}n
\hat{\theta}n:Rn\to\Theta
\Theta
\Theta
In practice, it is often convenient to work with the natural logarithm of the likelihood function, called the log-likelihood:
\ell(\theta;y)=lnl{L}n(\theta;y)~.
\ell(\theta;y)
\theta
l{L}n~.
\ell(\theta;y)
\Theta,
\partial\ell | |
\partial\theta1 |
=0,
\partial\ell | |
\partial\theta2 |
=0, \ldots,
\partial\ell | |
\partial\thetak |
=0~,
\widehat{\theta},
\widehat{\theta}
H\left(\widehat{\theta}\right)=\begin{bmatrix}\left.
\partial2\ell | ||||||||
|
\right|\theta=\widehat{\theta
is negative semi-definite at
\widehat{\theta}
While the domain of the likelihood function—the parameter space—is generally a finite-dimensional subset of Euclidean space, additional restrictions sometimes need to be incorporated into the estimation process. The parameter space can be expressed as
\Theta=\left\{\theta:\theta\inRk, h(\theta)=0\right\}~,
where
h(\theta)=\left[h1(\theta),h2(\theta),\ldots,hr(\theta)\right]
Rk
Rr~.
\theta
\Theta
~h(\theta)=0~.
Theoretically, the most natural approach to this constrained optimization problem is the method of substitution, that is "filling out" the restrictions
h1,h2,\ldots,hr
h1,h2,\ldots,hr,hr+1,\ldots,hk
h\ast=\left[h1,h2,\ldots,hk\right]
Rk
\phii=hi(\theta1,\theta2,\ldots,\thetak)~.
\Sigma
\Sigma=\GammaT\Gamma ,
\Gamma
\GammaT
In practice, restrictions are usually imposed using the method of Lagrange which, given the constraints as defined above, leads to the restricted likelihood equations
\partial\ell | |
\partial\theta |
-
\partialh(\theta)T | |
\partial\theta |
λ=0
h(\theta)=0 ,
where
~λ=\left[λ1,λ2,\ldots,λr\right]T~
\partialh(\theta)T | |
\partial\theta |
Nonparametric maximum likelihood estimation can be performed using the empirical likelihood.
\widehat{\ell}(\theta;x)
\widehat{\ell}(\theta;x)= | 1n |
\sum |
n | |
i=1 |
lnf(xi\mid\theta),
\ell(\theta)=\operatorname{E}[lnf(xi\mid\theta)]
Maximum-likelihood estimators have no optimum properties for finite samples, in the sense that (when evaluated on finite samples) other estimators may have greater concentration around the true parameter-value. However, like other estimation methods, maximum likelihood estimation possesses a number of attractive limiting properties: As the sample size increases to infinity, sequences of maximum likelihood estimators have these properties:
\hat{\theta}
\theta
g(\theta)
\theta
\alpha=g(\theta)
\hat{\alpha}=g(\hat{\theta})
g
g
Under the conditions outlined below, the maximum likelihood estimator is consistent. The consistency means that if the data were generated by
f( ⋅ ;\theta0)
\widehat{\theta}
\widehat{\theta}mle \xrightarrow{p
Under slightly stronger conditions, the estimator converges almost surely (or strongly):
\widehat{\theta}mle \xrightarrow{a.s.
In practical applications, data is never generated by
f( ⋅ ;\theta0)
f( ⋅ ;\theta0)
To establish consistency, the following conditions are sufficient.[16]
The dominance condition can be employed in the case of i.i.d. observations. In the non-i.i.d. case, the uniform convergence in probability can be checked by showing that the sequence
\widehat{\ell}(\theta\midx)
If one wants to demonstrate that the ML estimator
\widehat{\theta}
\sup\theta\in\Theta\left\| \widehat{\ell}(\theta\midx)-\ell(\theta) \right\| \xrightarrow{a.s.
Additionally, if (as assumed above) the data were generated by
f( ⋅ ;\theta0)
\sqrt{n}\left(\widehat{\theta}mle-
-1 | |
\theta | |
0\right) \xrightarrow{d} l{N}\left(0,I |
\right)
The maximum likelihood estimator selects the parameter value which gives the observed data the largest possible probability (or probability density, in the continuous case). If the parameter consists of a number of components, then we define their separate maximum likelihood estimators, as the corresponding component of the MLE of the complete parameter. Consistent with this, if
\widehat{\theta}
\theta
g(\theta)
\theta
\alpha=g(\theta)
\widehat{\alpha}=g(\widehat{\theta}).
It maximizes the so-called profile likelihood:
\bar{L}(\alpha)=\sup\theta:L(\theta).
The MLE is also equivariant with respect to certain transformations of the data. If
y=g(x)
g
fY(y)=
fX(x) | |
|g'(x)| |
and hence the likelihood functions for
X
Y
For example, the MLE parameters of the log-normal distribution are the same as those of the normal distribution fitted to the logarithm of the data.
As assumed above, if the data were generated by
~f( ⋅ ;\theta0)~,
\sqrt{n}\left(\widehat{\theta}mle-\theta0\right) \xrightarrow{d} l{N}\left(0, l{I}-1\right)~,
~l{I}~
l{I}jk=\operatorname{E}l[ -{
| |||||||||||
\partial\thetaj\partial\thetak |
} r]~.
In particular, it means that the bias of the maximum likelihood estimator is equal to zero up to the order .
However, when we consider the higher-order terms in the expansion of the distribution of this estimator, it turns out that has bias of order . This bias is equal to (componentwise)[19]
bh \equiv \operatorname{E}l[ \left(\widehat\thetamle-\theta0\right)h r] =
1 | |
n |
m | |
\sum | |
i,j,k=1 |
l{I}h l{I}j\left(
1 | |
2 |
Ki + Jj,i\right)
where
l{I}j
l{I}-1
1 | |
2 |
Ki + Jj,i = \operatorname{E}l[
12 | |||||||||||||
|
+
| |||||||
\partial\thetaj |
| |||||||||||
\partial\thetai\partial\thetak |
r]~.
Using these formulae it is possible to estimate the second-order bias of the maximum likelihood estimator, and correct for that bias by subtracting it:
* | |
\widehat{\theta} | |
mle |
=\widehat{\theta}mle-\widehat{b}~.
This bias-corrected estimator is (at least within the curved exponential family), meaning that it has minimal mean squared error among all second-order bias-corrected estimators, up to the terms of the order . It is possible to continue this process, that is to derive the third-order bias-correction term, and so on. However, the maximum likelihood estimator is not third-order efficient.[20]
A maximum likelihood estimator coincides with the most probable Bayesian estimator given a uniform prior distribution on the parameters. Indeed, the maximum a posteriori estimate is the parameter that maximizes the probability of given the data, given by Bayes' theorem:
\operatorname{P}(\theta\midx1,x2,\ldots,xn)=
f(x1,x2,\ldots,xn\mid\theta)\operatorname{P | |
(\theta)}{\operatorname{P}(x |
1,x2,\ldots,xn)}
where
\operatorname{P}(\theta)
\operatorname{P}(x1,x2,\ldots,xn)
f(x1,x2,\ldots,xn\mid\theta)\operatorname{P}(\theta)
\operatorname{P}(\theta)
f(x1,x2,\ldots,xn\mid\theta)
\operatorname{P}(\theta)
In many practical applications in machine learning, maximum-likelihood estimation is used as the model for parameter estimation.
The Bayesian Decision theory is about designing a classifier that minimizes total expected risk, especially, when the costs (the loss function) associated with different decisions are equal, the classifier is minimizing the error over the whole distribution.[21]
Thus, the Bayes Decision Rule is stated as
"decide
w1
~\operatorname{P}(w1|x) > \operatorname{P}(w2|x)~;~
w2
w1,w2
w=\underset{w}{\operatorname{arg max}}
infty | |
\int | |
-infty |
\operatorname{P}(error\midx)\operatorname{P}(x)\operatorname{d}x~
\operatorname{P}(error\midx)=\operatorname{P}(w1\midx)~
w2
\operatorname{P}(error\midx)=\operatorname{P}(w2\midx)
w1 .
By applying Bayes' theorem
\operatorname{P}(wi\midx)=
\operatorname{P | |
(x |
\midwi)\operatorname{P}(wi)}{\operatorname{P}(x)}
hBayes=\underset{w}{\operatorname{arg max}}l[\operatorname{P}(x\midw)\operatorname{P}(w)r] ,
hBayes
\operatorname{P}(w)
Finding
\hat\theta
\hat\theta
Q\hat
P | |
\theta0 |
\theta
\hat\theta
P | |
\theta0 |
Proof. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
For simplicity of notation, let's assume that P=Q. Let there be n i.i.d data samples y=(y1,y2,\ldots,yn) y\sim
\hat\theta P\theta \begin{align} \hat\theta&=\underset{\theta}{\operatorname{argmax}}
(y)=\underset{\theta}{\operatorname{argmax}}P\theta(y)=\underset{\theta}{\operatorname{argmax}}P(y\mid\theta)\\ &=\underset{\theta}{\operatorname{argmax}}
P(yi\mid\theta)=\underset{\theta}{\operatorname{argmax}}
logP(yi\mid\theta)\\ &=\underset{\theta}{\operatorname{argmax}}\left(
logP(yi\mid\theta)-
logP(yi\mid\theta0)\right)=\underset{\theta}{\operatorname{argmax}}
\left(logP(yi\mid\theta)-logP(yi\mid\theta0)\right)\\ &=\underset{\theta}{\operatorname{argmax}}
=\underset{\theta}{\operatorname{argmin}}
log
=\underset{\theta}{\operatorname{argmin}}
log
\\ &=\underset{\theta}{\operatorname{argmin}}
h\theta(yi) \underset{n\toinfty}{\longrightarrow} \underset{\theta}{\operatorname{argmin}}E[h\theta(y)]\\ &=\underset{\theta}{\operatorname{argmin}}\int
(y)h\theta(y)dy=\underset{\theta}{\operatorname{argmin}}\int
(y)log
dy\\ &=\underset{\theta}{\operatorname{argmin}}DKL(P
\parallelP\theta)\end{align} Where h\theta(x)=log
\hat\theta Since cross entropy is just Shannon's entropy plus KL divergence, and since the entropy of
|
See main article: German tank problem. Consider a case where n tickets numbered from 1 to n are placed in a box and one is selected at random (see uniform distribution); thus, the sample size is 1. If n is unknown, then the maximum likelihood estimator
\widehat{n}
\widehat{n}
Suppose one wishes to determine just how biased an unfair coin is. Call the probability of tossing a 'head' p. The goal then becomes to determine p.
Suppose the coin is tossed 80 times: i.e. the sample might be something like x1 = H, x2 = T, ..., x80 = T, and the count of the number of heads "H" is observed.
The probability of tossing tails is 1 − p (so here p is θ above). Suppose the outcome is 49 heads and 31 tails, and suppose the coin was taken from a box containing three coins: one which gives heads with probability p = , one which gives heads with probability p = and another which gives heads with probability p = . The coins have lost their labels, so which one it was is unknown. Using maximum likelihood estimation, the coin that has the largest likelihood can be found, given the data that were observed. By using the probability mass function of the binomial distribution with sample size equal to 80, number successes equal to 49 but for different values of p (the "probability of success"), the likelihood function (defined below) takes one of three values:
\begin{align} \operatorname{P}l[ H=49\midp=\tfrac{1}{3} r]&=\binom{80}{49}(\tfrac{1}{3})49(1-\tfrac{1}{3})31 ≈ 0.000,\\[6pt] \operatorname{P}l[ H=49\midp=\tfrac{1}{2} r]&=\binom{80}{49}(\tfrac{1}{2})49(1-\tfrac{1}{2})31 ≈ 0.012,\\[6pt] \operatorname{P}l[ H=49\midp=\tfrac{2}{3} r]&=\binom{80}{49}(\tfrac{2}{3})49(1-\tfrac{2}{3})31 ≈ 0.054~. \end{align}
The likelihood is maximized when = , and so this is the maximum likelihood estimate for .
Now suppose that there was only one coin but its could have been any value The likelihood function to be maximised is
L(p)=fD(H=49\midp)=\binom{80}{49}p49(1-p)31~,
and the maximisation is over all possible values
One way to maximize this function is by differentiating with respect to and setting to zero:
\begin{align} 0&=
\partial | |
\partialp |
\left(\binom{80}{49}p49(1-p)31\right)~,\\[8pt] 0&=49p48(1-p)31-31p49(1-p)30\\[8pt] &=p48(1-p)30\left[49(1-p)-31p\right]\\[8pt] &=p48(1-p)30\left[49-80p\right]~. \end{align}
This is a product of three terms. The first term is 0 when = 0. The second is 0 when = 1. The third is zero when = . The solution that maximizes the likelihood is clearly = (since = 0 and = 1 result in a likelihood of 0). Thus the maximum likelihood estimator for is .
This result is easily generalized by substituting a letter such as in the place of 49 to represent the observed number of 'successes' of our Bernoulli trials, and a letter such as in the place of 80 to represent the number of Bernoulli trials. Exactly the same calculation yields which is the maximum likelihood estimator for any sequence of Bernoulli trials resulting in 'successes'.
l{N}(\mu,\sigma2)
f(x\mid\mu,\sigma2)=
1 | |
\sqrt{2\pi\sigma2 |
} \exp\left(-
(x-\mu)2 | |
2\sigma2 |
\right),
the corresponding probability density function for a sample of independent identically distributed normal random variables (the likelihood) is
f(x1,\ldots,xn\mid\mu,\sigma2)=
n | |
\prod | |
i=1 |
f(xi\mid\mu,\sigma2)=\left(
1 | |
2\pi\sigma2 |
\right)n/2\exp\left(-
| |||||||||||||||
2\sigma2 |
\right).
This family of distributions has two parameters: ; so we maximize the likelihood,
l{L}(\mu,\sigma2)=f(x1,\ldots,xn\mid\mu,\sigma2)
Since the logarithm function itself is a continuous strictly increasing function over the range of the likelihood, the values which maximize the likelihood will also maximize its logarithm (the log-likelihood itself is not necessarily strictly increasing). The log-likelihood can be written as follows:
logl(l{L}(\mu,\sigma2)r)=-
n | |
2 |
log(2\pi\sigma2) -
1 | |
2\sigma2 |
n | |
\sum | |
i=1 |
2 | |
(x | |
i-\mu) |
(Note: the log-likelihood is closely related to information entropy and Fisher information.)
We now compute the derivatives of this log-likelihood as follows.
\begin{align} 0&=
\partial | |
\partial\mu |
logl(l{L}(\mu,\sigma2)r)= 0-
-2n(\bar{x | |
-\mu) }{2\sigma |
2}. \end{align}
\bar{x}
\widehat\mu=\bar{x}=
n | |
\sum | |
i=1 |
xi | |
n |
.
This is indeed the maximum of the function, since it is the only turning point in and the second derivative is strictly less than zero. Its expected value is equal to the parameter of the given distribution,
\operatorname{E}l[ \widehat\mu r]=\mu,
which means that the maximum likelihood estimator
\widehat\mu
Similarly we differentiate the log-likelihood with respect to and equate to zero:
\begin{align} 0&=
\partial | |
\partial\sigma |
logl(l{L}(\mu,\sigma2)r)=-
n | |
\sigma |
+
1 | |
\sigma3 |
n | |
\sum | |
i=1 |
2. \end{align} | |
(x | |
i-\mu) |
which is solved by
\widehat\sigma2=
1 | |
n |
2. | |
\sum | |
i-\mu) |
Inserting the estimate
\mu=\widehat\mu
\widehat\sigma2=
1 | |
n |
n | |
\sum | |
i=1 |
(xi-\bar{x})2=
1 | |
n |
n | |
\sum | |
i=1 |
2 | ||
x | - | |
i |
1 | |
n2 |
n | |
\sum | |
j=1 |
xixj.
To calculate its expected value, it is convenient to rewrite the expression in terms of zero-mean random variables (statistical error)
\deltai\equiv\mu-xi
\widehat\sigma2=
1 | |
n |
n | |
\sum | |
i=1 |
(\mu-
2 | ||
\delta | - | |
i) |
1 | |
n2 |
n | |
\sum | |
j=1 |
(\mu-\deltai)(\mu-\deltaj).
Simplifying the expression above, utilizing the facts that
\operatorname{E}l[ \deltai r]=0
2 r] | |
\operatorname{E}l[ \delta | |
i |
=\sigma2
\operatorname{E}l[ \widehat\sigma2 r]=
n-1 | |
n |
\sigma2.
This means that the estimator
\widehat\sigma2
\sigma2
\widehat\sigma
\sigma
\widehat\sigma2
\widehat\sigma
Formally we say that the maximum likelihood estimator for
\theta=(\mu,\sigma2)
\widehat{\theta}=\left(\widehat{\mu},\widehat{\sigma}2\right).
In this case the MLEs could be obtained individually. In general this may not be the case, and the MLEs would have to be obtained simultaneously.
The normal log-likelihood at its maximum takes a particularly simple form:
logl(l{L}(\widehat\mu,\widehat\sigma)r)=
-n | |
2 |
l(log(2\pi\widehat\sigma2)+1r)
This maximum log-likelihood can be shown to be the same for more general least squares, even for non-linear least squares. This is often used in determining likelihood-based approximate confidence intervals and confidence regions, which are generally more accurate than those using the asymptotic normality discussed above.
It may be the case that variables are correlated, that is, not independent. Two random variables
y1
y2
f(y1,y2)=f(y1)f(y2)
Suppose one constructs an order-n Gaussian vector out of random variables
(y1,\ldots,yn)
(\mu1,\ldots,\mun)
\Sigma
f(y1,\ldots,y
|
In the bivariate case, the joint probability density function is given by:
f(y1,y2)=
1 | |
2\pi\sigma1\sigma2\sqrt{1-\rho2 |
In this and other cases where a joint density function exists, the likelihood function is defined as above, in the section "principles," using this density.
X1, X2,\ldots, Xm
n
x1+x2+ … +xm=n
pi
p1+p2+ … +pm=1
Xi
x1, x2,\ldots,xm
f(x1,x2,\ldots,xm\midp1,p2,\ldots,p
|
\prod
xi | |
p | |
i |
=\binom{n}{x1,x2,\ldots,xm}
x1 | |
p | |
1 |
x2 | |
p | |
2 |
…
xm | |
p | |
m |
Each box taken separately against all the other boxes is a binomial and this is an extension thereof.
The log-likelihood of this is:
\ell(p1,p2,\ldots,pm)=log
m | |
n!-\sum | |
i=1 |
logxi!+\sum
m | |
i=1 |
xilogpi
The constraint has to be taken into account and use the Lagrange multipliers:
L(p1,p2,\ldots,pm,λ)=\ell(p1,p2,\ldots,pm)+λ\left(1-\sum
m | |
i=1 |
pi\right)
By posing all the derivatives to be 0, the most natural estimate is derived
\hat{p} | ||||
|
Maximizing log likelihood, with and without constraints, can be an unsolvable problem in closed form, then we have to use iterative procedures.
Except for special cases, the likelihood equations
\partial\ell(\theta;y) | |
\partial\theta |
=0
cannot be solved explicitly for an estimator
\widehat{\theta}=\widehat{\theta}(y)
\theta
\widehat{\theta}1
\left\{\widehat{\theta}r\right\}
\widehat{\theta}r+1=\widehat{\theta}r+ηrdr\left(\widehat{\theta}\right)
where the vector
dr\left(\widehat{\theta}\right)
ηr
(Note: here it is a maximization problem, so the sign before gradient is flipped)
ηr\in\R+
dr\left(\widehat{\theta}\right)=\nabla\ell\left(\widehat{\theta}r;y\right)
Gradient descent method requires to calculate the gradient at the rth iteration, but no need to calculate the inverse of second-order derivative, i.e., the Hessian matrix. Therefore, it is computationally faster than Newton-Raphson method.
ηr=1
dr\left(\widehat{\theta}\right)=
-1 | |
-H | |
r\left(\widehat{\theta}\right) |
sr\left(\widehat{\theta}\right)
where
sr(\widehat{\theta})
-1 | |
H | |
r |
\left(\widehat{\theta}\right)
dr\left(\widehat{\theta}\right)=-\left[
1 | |
n |
n | |
\sum | |
t=1 |
\partial\ell(\theta;y) | |
\partial\theta |
\left(
\partial\ell(\theta;y) | |
\partial\theta |
\right)T\right]-1sr\left(\widehat{\theta}\right)
Other quasi-Newton methods use more elaborate secant updates to give approximation of Hessian matrix.
DFP formula finds a solution that is symmetric, positive-definite and closest to the current approximate value of second-order derivative:
Hk+1= \left(I-\gammakyk
T\right) | |
s | |
k |
Hk\left(I-\gammaksk
T\right) | |
y | |
k |
+\gammakyk
T, | |
y | |
k |
where
yk=\nabla\ell(xk+sk)-\nabla\ell(xk),
\gammak=
1 | |||||||||
|
,
sk=xk+1-xk.
BFGS also gives a solution that is symmetric and positive-definite:
Bk+1=Bk+
| |||||||||||||
|
-
| |||||||||||||||||||
|
,
where
yk=\nabla\ell(xk+sk)-\nabla\ell(xk),
sk=xk+1-xk.
BFGS method is not guaranteed to converge unless the function has a quadratic Taylor expansion near an optimum. However, BFGS can have acceptable performance even for non-smooth optimization instances
Another popular method is to replace the Hessian with the Fisher information matrix,
l{I}(\theta)=\operatorname{E}\left[Hr\left(\widehat{\theta}\right)\right]
Although popular, quasi-Newton methods may converge to a stationary point that is not necessarily a local or global maximum,[32] but rather a local minimum or a saddle point. Therefore, it is important to assess the validity of the obtained solution to the likelihood equations, by verifying that the Hessian, evaluated at the solution, is both negative definite and well-conditioned.[33]
Early users of maximum likelihood include Carl Friedrich Gauss, Pierre-Simon Laplace, Thorvald N. Thiele, and Francis Ysidro Edgeworth.[34] [35] It was Ronald Fisher however, between 1912 and 1922, who singlehandedly created the modern version of the method.[36] [37]
Maximum-likelihood estimation finally transcended heuristic justification in a proof published by Samuel S. Wilks in 1938, now called Wilks' theorem.[38] The theorem shows that the error in the logarithm of likelihood values for estimates from multiple independent observations is asymptotically χ 2-distributed, which enables convenient determination of a confidence region around any estimate of the parameters. The only difficult part of Wilks' proof depends on the expected value of the Fisher information matrix, which is provided by a theorem proven by Fisher.[39] Wilks continued to improve on the generality of the theorem throughout his life, with his most general proof published in 1962.[40]
Reviews of the development of maximum likelihood estimation have been provided by a number of authors.[41] [42] [43] [44] [45] [46] [47] [48]