In estimation theory and statistics, the Cramér–Rao bound (CRB) is a lower bound on the variance of unbiased estimators of a deterministic (fixed, though unknown) parameter. The result is named in honor of Harald Cramér and C. R. Rao,[1] [2] [3] but has also been derived independently by Maurice Fréchet,[4] Georges Darmois,[5] and by Alexander Aitken and Harold Silverstone.[6] [7] It is also known as the Fréchet–Cramér–Rao or Fréchet–Darmois–Cramér–Rao lower bound. It states that the precision of any unbiased estimator (the reciprocal of its variance) is at most the Fisher information; equivalently, the reciprocal of the Fisher information is a lower bound on its variance.
An unbiased estimator that achieves this bound is said to be (fully) efficient. Such a solution achieves the lowest possible mean squared error among all unbiased methods, and is, therefore, the minimum variance unbiased (MVU) estimator. However, in some cases, no unbiased technique exists which achieves the bound. This may occur either if for any unbiased estimator, there exists another with a strictly smaller variance, or if an MVU estimator exists, but its variance is strictly greater than the inverse of the Fisher information.
The Cramér–Rao bound can also be used to bound the variance of biased estimators of given bias. In some cases, a biased approach can result in both a variance and a mean squared error that are below the unbiased Cramér–Rao lower bound; see estimator bias.
An improvement on the Cramér–Rao lower bound was proposed by A. Bhattacharyya in a series of works; it is known as the Bhattacharyya bound.[8] [9] [10] [11]
The Cramér–Rao bound is stated in this section for several increasingly general cases, beginning with the case in which the parameter is a scalar and its estimator is unbiased. All versions of the bound require certain regularity conditions, which hold for most well-behaved distributions. These conditions are listed later in this section.
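Suppose $\theta$ is an unknown deterministic parameter that is to be estimated from $n$ independent observations (measurements) of $x$, each drawn from a distribution with probability density function $f(x;\theta)$. The variance of any unbiased estimator $\hat{\theta}$ of $\theta$ is then bounded by the reciprocal of the Fisher information $I(\theta)$:
$$\operatorname{var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$$
where the Fisher information $I(\theta)$ is defined by
$$I(\theta) = n \operatorname{E}_{X;\theta}\left[\left(\frac{\partial \ell(X;\theta)}{\partial\theta}\right)^2\right]$$
and $\ell(x;\theta) = \log(f(x;\theta))$ is the natural logarithm of the likelihood function for a single sample $x$, and $\operatorname{E}_{X;\theta}$ denotes the expected value with respect to the density $f(x;\theta)$ of $X$. Unless otherwise indicated, expectations in what follows are taken with respect to $X$.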
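If $\ell(x;\theta)$ is twice differentiable and certain regularity conditions hold, then the Fisher information can also be written as
$$I(\theta) = -n \operatorname{E}_{X;\theta}\left[\frac{\partial^2 \ell(X;\theta)}{\partial\theta^2}\right]$$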
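As a quick sanity check that the two expressions for the Fisher information agree, the following sketch evaluates both exactly for Bernoulli observations. The Bernoulli model, the function names, and the values of `p` and `n` are illustrative assumptions, not part of the statement above.

```python
from fractions import Fraction

def fisher_info_score_form(p):
    """E[(d/dp log f(X;p))^2] for one Bernoulli(p) observation."""
    # Score: d/dp [x log p + (1 - x) log(1 - p)] = x/p - (1 - x)/(1 - p)
    score = {0: -1 / (1 - p), 1: 1 / p}
    pmf = {0: 1 - p, 1: p}
    return sum(pmf[x] * score[x] ** 2 for x in (0, 1))

def fisher_info_curvature_form(p):
    """-E[d^2/dp^2 log f(X;p)] for one Bernoulli(p) observation."""
    # Second derivative of the log-likelihood: -x/p^2 - (1 - x)/(1 - p)^2
    second = {0: -1 / (1 - p) ** 2, 1: -1 / p ** 2}
    pmf = {0: 1 - p, 1: p}
    return -sum(pmf[x] * second[x] for x in (0, 1))

p = Fraction(3, 10)   # illustrative parameter value
n = 20                # illustrative number of independent observations
info_a = n * fisher_info_score_form(p)
info_b = n * fisher_info_curvature_form(p)
crb = 1 / info_a      # lower bound on the variance of unbiased estimators of p
print(info_a, info_b, crb)
```

In this model both forms reduce to $n/(p(1-p))$, so the bound is $p(1-p)/n$, which is also the variance of the sample mean.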
The efficiency of an unbiased estimator $\hat{\theta}$ measures how close this estimator's variance comes to this lower bound; estimator efficiency is defined as
$$e(\hat{\theta}) = \frac{I(\theta)^{-1}}{\operatorname{var}(\hat{\theta})}$$
or the minimum possible variance for an unbiased estimator divided by its actual variance. The Cramér–Rao lower bound thus gives
$$e(\hat{\theta}) \le 1$$
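To make the efficiency ratio concrete, here is a minimal sketch that assumes the normal-mean model with known variance, for which the Fisher information of $n$ observations is $n/\sigma^2$. The helper `efficiency` and the two estimators compared (the mean of all observations versus the mean of only half of them) are illustrative choices.

```python
from fractions import Fraction

def efficiency(fisher_info, variance):
    """Efficiency e = (1 / I(theta)) / var(theta_hat) of an unbiased estimator."""
    return (1 / fisher_info) / variance

n = 10                       # illustrative sample size
sigma2 = Fraction(4)         # illustrative known variance
info = n / sigma2            # Fisher information for the mean of n normal observations

var_full_mean = sigma2 / n          # sample mean of all n observations
var_half_mean = sigma2 / (n // 2)   # sample mean of only the first n/2 observations

e_full = efficiency(info, var_full_mean)
e_half = efficiency(info, var_half_mean)
print(e_full, e_half)
```

Discarding half of the data halves the efficiency, while the full sample mean is fully efficient in this model.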
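A more general form of the bound can be obtained by considering a biased estimator $T(X)$, whose expectation is not $\theta$ but a function of this parameter, say, $\psi(\theta)$. Hence $E\{T(X)\} - \theta = \psi(\theta) - \theta$ is not generally equal to 0. In this case, the bound is given by
$$\operatorname{var}(T) \geq \frac{[\psi'(\theta)]^2}{I(\theta)}$$
where $\psi'(\theta)$ is the derivative of $\psi(\theta)$ with respect to $\theta$, and $I(\theta)$ is the Fisher information defined above.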
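Apart from being a bound on estimators of functions of the parameter, this approach can be used to derive a bound on the variance of biased estimators with a given bias, as follows.[14] Consider an estimator $\hat{\theta}$ with bias $b(\theta) = E\{\hat{\theta}\} - \theta$, and let $\psi(\theta) = b(\theta) + \theta$. By the result above, any estimator whose expectation is $\psi(\theta)$ has variance greater than or equal to $(\psi'(\theta))^2/I(\theta)$. Thus, any estimator $\hat{\theta}$ whose bias is given by a function $b(\theta)$ satisfies
$$\operatorname{var}\left(\hat{\theta}\right) \geq \frac{[1+b'(\theta)]^2}{I(\theta)}.$$
The unbiased version of the bound is a special case of this result, with $b(\theta)=0$.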
It is trivial to achieve a small variance: an "estimator" that is constant has a variance of zero. But from the above equation, we find that the mean squared error of a biased estimator is bounded by
$$\operatorname{E}\left((\hat{\theta}-\theta)^2\right) \geq \frac{[1+b'(\theta)]^2}{I(\theta)} + b(\theta)^2,$$
using the standard decomposition of the MSE. Note, however, that if $1+b'(\theta)<1$ this bound might be less than the unbiased Cramér–Rao bound $1/I(\theta)$. For instance, in the example of estimating variance below, $1+b'(\theta) = \frac{n}{n+2} < 1$.
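The bias-adjusted bounds can be evaluated exactly for the variance-estimation case just mentioned. The sketch below assumes $\theta = \sigma^2$ with known mean, Fisher information $n/(2\theta^2)$, and the linear bias $b(\theta) = -2\theta/(n+2)$ of the estimator that divides by $n+2$; the function names and numeric values are illustrative.

```python
from fractions import Fraction

def biased_variance_bound(info, bias_derivative):
    """Lower bound [1 + b'(theta)]^2 / I(theta) on the variance."""
    return (1 + bias_derivative) ** 2 / info

def biased_mse_bound(info, bias, bias_derivative):
    """Lower bound on the MSE: variance bound plus squared bias."""
    return biased_variance_bound(info, bias_derivative) + bias ** 2

n = 10
theta = Fraction(1)                       # true variance sigma^2 (illustrative)
info = Fraction(n) / (2 * theta ** 2)     # assumed Fisher information of the sample
bias = -2 * theta / (n + 2)               # b(theta) for the divide-by-(n+2) estimator
bias_derivative = Fraction(-2, n + 2)     # b'(theta)

unbiased_bound = 1 / info
variance_bound = biased_variance_bound(info, bias_derivative)
mse_bound = biased_mse_bound(info, bias, bias_derivative)
print(unbiased_bound, variance_bound, mse_bound)
```

Under these assumptions the MSE bound for the biased estimator, $2\theta^2/(n+2)$, falls below the unbiased bound $2\theta^2/n$.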
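Extending the Cramér–Rao bound to multiple parameters, define a parameter column vector
$$\boldsymbol{\theta} = \left[\theta_1, \theta_2, \dots, \theta_d\right]^T \in \mathbb{R}^d$$
with probability density function $f(x;\boldsymbol{\theta})$ which satisfies the two regularity conditions below.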
The Fisher information matrix is a $d \times d$ matrix with element $I_{m,k}$ defined as
$$I_{m,k} = \operatorname{E}\left[\frac{\partial}{\partial\theta_m}\log f\left(x;\boldsymbol{\theta}\right)\frac{\partial}{\partial\theta_k}\log f\left(x;\boldsymbol{\theta}\right)\right] = -\operatorname{E}\left[\frac{\partial^2}{\partial\theta_m\,\partial\theta_k}\log f\left(x;\boldsymbol{\theta}\right)\right].$$
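Let $\boldsymbol{T}(X)$ be an estimator of any vector function of parameters, $\boldsymbol{T}(X) = (T_1(X), \ldots, T_d(X))^T$, and denote its expectation vector $\operatorname{E}[\boldsymbol{T}(X)]$ by $\boldsymbol{\psi}(\boldsymbol{\theta})$. The Cramér–Rao bound then states that the covariance matrix of $\boldsymbol{T}(X)$ satisfies
$$I\left(\boldsymbol{\theta}\right) \geq \phi(\theta)^T \operatorname{cov}_{\boldsymbol{\theta}}\left(\boldsymbol{T}(X)\right)^{-1} \phi(\theta),$$
$$\operatorname{cov}_{\boldsymbol{\theta}}\left(\boldsymbol{T}(X)\right) \geq \phi(\theta)\, I\left(\boldsymbol{\theta}\right)^{-1} \phi(\theta)^T$$
where the matrix inequality $A \ge B$ is understood to mean that the matrix $A-B$ is positive semidefinite, and $\phi(\theta) := \partial\boldsymbol{\psi}(\boldsymbol{\theta})/\partial\boldsymbol{\theta}$ is the Jacobian matrix whose $ij$ element is given by $\partial\psi_i(\boldsymbol{\theta})/\partial\theta_j$.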
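If $\boldsymbol{T}(X)$ is an unbiased estimator of $\boldsymbol{\theta}$ (i.e., $\boldsymbol{\psi}\left(\boldsymbol{\theta}\right) = \boldsymbol{\theta}$), then the Cramér–Rao bound reduces to
$$\operatorname{cov}_{\boldsymbol{\theta}}\left(\boldsymbol{T}(X)\right) \geq I\left(\boldsymbol{\theta}\right)^{-1}.$$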
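If it is inconvenient to compute the inverse of the Fisher information matrix, then one can simply take the reciprocal of the corresponding diagonal element to find a (possibly loose) lower bound.[16]
$$\operatorname{var}_{\boldsymbol{\theta}}\left(T_m(X)\right) = \left[\operatorname{cov}_{\boldsymbol{\theta}}\left(\boldsymbol{T}(X)\right)\right]_{mm} \geq \left[I\left(\boldsymbol{\theta}\right)^{-1}\right]_{mm} \geq \left(\left[I\left(\boldsymbol{\theta}\right)\right]_{mm}\right)^{-1}.$$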
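The gap between the two diagonal bounds can be seen on a small example. The sketch below uses a hypothetical $2 \times 2$ Fisher information matrix (the numbers are made up for illustration) and a hand-written inverse so that no external library is needed.

```python
from fractions import Fraction

def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))

# Hypothetical Fisher information matrix (symmetric, positive definite).
fisher = ((Fraction(4), Fraction(2)),
          (Fraction(2), Fraction(3)))

fisher_inv = inverse_2x2(fisher)

for m in range(2):
    tight = fisher_inv[m][m]        # [I^{-1}]_{mm}, the Cramér–Rao bound
    loose = 1 / fisher[m][m]        # ([I]_{mm})^{-1}, the cheaper bound
    print(m, tight, loose, tight >= loose)
```

The two bounds coincide when the off-diagonal entries are zero; otherwise the reciprocal of the diagonal element is strictly smaller, so it remains valid but less informative.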
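The bound relies on two weak regularity conditions on the probability density function, $f(x;\theta)$, and the estimator $T(X)$. First, the Fisher information is always defined; equivalently, for all $x$ such that $f(x;\theta)>0$, the derivative $\frac{\partial}{\partial\theta}\log f(x;\theta)$ exists and is finite. Second, the operations of integration with respect to $x$ and differentiation with respect to $\theta$ can be interchanged in the expectation of $T$; that is,
$$\frac{\partial}{\partial\theta}\left[\int T(x) f(x;\theta)\,dx\right] = \int T(x)\left[\frac{\partial}{\partial\theta} f(x;\theta)\right]dx$$
whenever the right-hand side is finite. This condition can often be confirmed by using the fact that integration and differentiation can be swapped when either of the following holds: the function $f(x;\theta)$ has bounded support in $x$, and the bounds do not depend on $\theta$; or the function $f(x;\theta)$ has infinite support, is continuously differentiable, and the integral converges uniformly for all $\theta$.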
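A standard illustration of why the interchange condition matters, not taken from the text above, is the uniform distribution on $(0, \theta)$, whose support depends on $\theta$. The sketch assumes the formal value $I(\theta) = n/\theta^2$ obtained by ignoring the regularity conditions, and compares the resulting "bound" with the exact variance of the unbiased estimator $\frac{n+1}{n}\max_i X_i$. Function names and values are illustrative.

```python
from fractions import Fraction

def naive_crb_uniform(theta, n):
    """Formal bound 1/I(theta) with I(theta) = n / theta^2, ignoring regularity."""
    return theta ** 2 / n

def var_scaled_max(theta, n):
    """Exact variance of the unbiased estimator (n + 1)/n * max(X_1, ..., X_n)."""
    return theta ** 2 / (n * (n + 2))

theta = Fraction(2)   # illustrative parameter value
for n in (1, 5, 50):
    print(n, naive_crb_uniform(theta, n), var_scaled_max(theta, n))
```

The estimator's variance lies below the formal bound for every $n$, which is possible only because the conditions required by the Cramér–Rao bound do not hold for this family.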
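The following proof is based on the cited source.[17]
For the general scalar case:
Assume that $T=t(X)$ is an estimator with expectation $\psi(\theta)$ (based on the observations $X$), i.e. that $\operatorname{E}(T)=\psi(\theta)$. The goal is to prove that, for all $\theta$,
$$\operatorname{var}(t(X)) \geq \frac{[\psi^\prime(\theta)]^2}{I(\theta)}.$$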
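Let $X$ be a random variable with probability density function $f(x;\theta)$. Here $T=t(X)$ is a statistic, which is used as an estimator for $\psi(\theta)$. Define $V$ as the score:
$$V = \frac{\partial}{\partial\theta}\ln f(X;\theta) = \frac{1}{f(X;\theta)}\frac{\partial}{\partial\theta} f(X;\theta)$$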
where the chain rule is used in the final equality above. Then the expectation of $V$, written $\operatorname{E}(V)$, is zero. This is because
$$\operatorname{E}(V) = \int f(x;\theta)\left[\frac{1}{f(x;\theta)}\frac{\partial}{\partial\theta} f(x;\theta)\right]dx = \frac{\partial}{\partial\theta}\int f(x;\theta)\,dx = 0$$
where the integral and partial derivative have been interchanged (justified by the second regularity condition).
If we consider the covariance $\operatorname{cov}(V,T)$ of $V$ and $T$, we have $\operatorname{cov}(V,T)=\operatorname{E}(VT)$, because $\operatorname{E}(V)=0$. Expanding this expression we have
$$\begin{aligned} \operatorname{cov}(V,T) &= \operatorname{E}\left(T \cdot \left[\frac{1}{f(X;\theta)}\frac{\partial}{\partial\theta} f(X;\theta)\right]\right)\\[6pt] &= \int t(x)\left[\frac{1}{f(x;\theta)}\frac{\partial}{\partial\theta} f(x;\theta)\right] f(x;\theta)\,dx\\[6pt] &= \frac{\partial}{\partial\theta}\left[\int t(x) f(x;\theta)\,dx\right] = \frac{\partial}{\partial\theta}\operatorname{E}(T) = \psi^\prime(\theta) \end{aligned}$$
again because the integration and differentiation operations commute (second condition).
The Cauchy–Schwarz inequality shows that
$$\sqrt{\operatorname{var}(T)\operatorname{var}(V)} \geq \left|\operatorname{cov}(V,T)\right| = \left|\psi^\prime(\theta)\right|$$
therefore
$$\operatorname{var}(T) \geq \frac{[\psi^\prime(\theta)]^2}{\operatorname{var}(V)} = \frac{[\psi^\prime(\theta)]^2}{I(\theta)}$$
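The three steps of the proof ($\operatorname{E}(V)=0$, $\operatorname{cov}(V,T)=\psi'(\theta)$, and the Cauchy–Schwarz step) can be checked exactly on a small discrete model. The sketch below assumes a Binomial(3, p) observation and the nonlinear statistic $t(x) = x^2$, for which $\psi(p) = 3p + 6p^2$; the model, the statistic, and all names are illustrative choices, and sums over the support replace the integrals.

```python
from fractions import Fraction
from math import comb

N_TRIALS = 3                     # Binomial(3, p): a small illustrative model
p = Fraction(1, 4)               # illustrative parameter value
support = range(N_TRIALS + 1)

def pmf(x):
    return comb(N_TRIALS, x) * p ** x * (1 - p) ** (N_TRIALS - x)

def score(x):
    # d/dp log f(x; p) = x/p - (N_TRIALS - x)/(1 - p)
    return x / p - (N_TRIALS - x) / (1 - p)

def t(x):
    return x ** 2                # a deliberately nonlinear statistic

def expect(g):
    return sum(pmf(x) * g(x) for x in support)

mean_v = expect(score)                               # E(V), expected to be 0
mean_t = expect(t)                                   # psi(p)
cov_vt = expect(lambda x: score(x) * t(x))           # E(VT), equal to cov(V, T)
var_v = expect(lambda x: score(x) ** 2)              # var(V), the Fisher information
var_t = expect(lambda x: (t(x) - mean_t) ** 2)

psi_prime = 3 + 12 * p           # derivative of psi(p) = 3p + 6p^2
bound = psi_prime ** 2 / var_v
print(mean_v, cov_vt, psi_prime, var_t, bound)
```

Because $t$ is not an affine function of the score here, the inequality is strict: the variance of $T$ exceeds the bound.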
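For the case of a d-variate normal distribution
$$\boldsymbol{x} \sim \mathcal{N}_d\left(\boldsymbol{\mu}(\boldsymbol{\theta}), \boldsymbol{C}(\boldsymbol{\theta})\right)$$
the Fisher information matrix has elements
$$I_{m,k} = \frac{\partial\boldsymbol{\mu}^T}{\partial\theta_m}\boldsymbol{C}^{-1}\frac{\partial\boldsymbol{\mu}}{\partial\theta_k} + \frac{1}{2}\operatorname{tr}\left(\boldsymbol{C}^{-1}\frac{\partial\boldsymbol{C}}{\partial\theta_m}\boldsymbol{C}^{-1}\frac{\partial\boldsymbol{C}}{\partial\theta_k}\right)$$
where "tr" is the trace.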
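For example, let $w[j]$, $j = 1, \dots, n$, be a sample of $n$ independent observations with unknown mean $\theta$ and known variance $\sigma^2$, so that the vector of observations satisfies
$$\boldsymbol{w} \sim \mathcal{N}_n\left(\theta\boldsymbol{1}, \sigma^2\boldsymbol{I}\right).$$
Then the Fisher information is a scalar given by
$$I(\theta) = \left(\frac{\partial\boldsymbol{\mu}(\theta)}{\partial\theta}\right)^T \boldsymbol{C}^{-1}\left(\frac{\partial\boldsymbol{\mu}(\theta)}{\partial\theta}\right) = \sum_{i=1}^{n}\frac{1}{\sigma^2} = \frac{n}{\sigma^2},$$
and so the Cramér–Rao bound is
$$\operatorname{var}\left(\hat\theta\right) \geq \frac{\sigma^2}{n}.$$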
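The bound $\sigma^2/n$ for the normal mean can be compared with a simulation. The following is a minimal Monte Carlo sketch using only the standard library; the parameter values, the trial count, the seed, and the function name are illustrative assumptions, and the printed empirical value will only approximate the bound.

```python
import random

def simulate_sample_mean_variance(theta, sigma, n, trials, seed=0):
    """Empirical variance of the sample mean over repeated simulated samples."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        sample_mean = sum(rng.gauss(theta, sigma) for _ in range(n)) / n
        means.append(sample_mean)
    grand_mean = sum(means) / trials
    return sum((m - grand_mean) ** 2 for m in means) / trials

theta, sigma, n = 1.5, 2.0, 10        # illustrative values
crb = sigma ** 2 / n                  # Cramér–Rao bound sigma^2 / n
empirical = simulate_sample_mean_variance(theta, sigma, n, trials=20000)
print(f"CRB = {crb:.4f}, empirical variance of the sample mean = {empirical:.4f}")
```

The sample mean is unbiased with variance exactly $\sigma^2/n$, so the simulated variance should sit close to the bound up to Monte Carlo error.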
Suppose X is a normally distributed random variable with known mean $\mu$ and unknown variance $\sigma^2$. Consider the following statistic:
$$T = \frac{\sum_{i=1}^{n}(X_i-\mu)^2}{n}.$$
Then T is unbiased for $\sigma^2$, as $E(T)=\sigma^2$. What is the variance of T?
$$\operatorname{var}(T) = \operatorname{var}\left(\frac{\sum_{i=1}^{n}(X_i-\mu)^2}{n}\right) = \frac{\sum_{i=1}^{n}\operatorname{var}\left((X_i-\mu)^2\right)}{n^2} = \frac{n\operatorname{var}\left((X-\mu)^2\right)}{n^2} = \frac{1}{n}\left[\operatorname{E}\left\{(X-\mu)^4\right\} - \left(\operatorname{E}\{(X-\mu)^2\}\right)^2\right]$$
(the last equality follows directly from the definition of variance). The first term is the fourth moment about the mean and has value $3(\sigma^2)^2$; the second is the square of the variance, or $(\sigma^2)^2$. Thus
$$\operatorname{var}(T) = \frac{2(\sigma^2)^2}{n}.$$
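Now, what is the Fisher information in the sample? Recall that the score $V$ is defined as
$$V = \frac{\partial}{\partial\sigma^2}\log\left[L(\sigma^2,X)\right]$$
where $L$ is the likelihood function. Thus in this case,
$$\log\left[L(\sigma^2,X)\right] = \log\left[\frac{1}{\sqrt{2\pi\sigma^2}}e^{-(X-\mu)^2/(2\sigma^2)}\right] = -\log\left(\sqrt{2\pi\sigma^2}\right) - \frac{(X-\mu)^2}{2\sigma^2}$$
$$V = \frac{\partial}{\partial\sigma^2}\log\left[L(\sigma^2,X)\right] = \frac{\partial}{\partial\sigma^2}\left[-\log\left(\sqrt{2\pi\sigma^2}\right) - \frac{(X-\mu)^2}{2\sigma^2}\right] = -\frac{1}{2\sigma^2} + \frac{(X-\mu)^2}{2(\sigma^2)^2}$$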
where the second equality is from elementary calculus. Thus, the information in a single observation is just minus the expectation of the derivative of $V$, or
$$I = -\operatorname{E}\left(\frac{\partial V}{\partial\sigma^2}\right) = -\operatorname{E}\left(-\frac{(X-\mu)^2}{(\sigma^2)^3} + \frac{1}{2(\sigma^2)^2}\right) = \frac{\sigma^2}{(\sigma^2)^3} - \frac{1}{2(\sigma^2)^2} = \frac{1}{2(\sigma^2)^2}.$$
Thus the information in a sample of $n$ independent observations is just $n$ times this, or
$$\frac{n}{2(\sigma^2)^2}.$$
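The Cramér–Rao bound states that
$$\operatorname{var}(T) \geq \frac{1}{nI} = \frac{2(\sigma^2)^2}{n},$$
where $nI$ is the information in the full sample of $n$ observations.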
In this case, the inequality is saturated (equality is achieved), showing that the estimator is efficient.
However, we can achieve a lower mean squared error using a biased estimator. The estimator
$$T = \frac{\sum_{i=1}^{n}(X_i-\mu)^2}{n+2}$$
obviously has a smaller variance, which is in fact
$$\operatorname{var}(T) = \frac{2n(\sigma^2)^2}{(n+2)^2}.$$
Its bias is
$$\operatorname{E}(T) - \sigma^2 = \left(\frac{n}{n+2} - 1\right)\sigma^2 = -\frac{2\sigma^2}{n+2}$$
so its mean squared error is
$$\operatorname{MSE}(T) = \left(\frac{2n}{(n+2)^2} + \frac{4}{(n+2)^2}\right)(\sigma^2)^2 = \frac{2(\sigma^2)^2}{n+2}$$
which is less than what unbiased estimators can achieve according to the Cramér–Rao bound.
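The comparison between the estimators that divide by $n$ and by $n+2$ can also be checked by simulation. This is a minimal Monte Carlo sketch using only the standard library; the values of `n` and `sigma2`, the trial count, the seed, and the function name are illustrative assumptions, and the results are approximate.

```python
import random

def simulate_mse(n, sigma2, trials, seed=0):
    """Monte Carlo MSE of sum((X_i - mu)^2) / c for c = n and c = n + 2."""
    rng = random.Random(seed)
    mu, sigma = 0.0, sigma2 ** 0.5     # the mean is known; its value is arbitrary
    err_n = err_n2 = 0.0
    for _ in range(trials):
        ss = sum((rng.gauss(mu, sigma) - mu) ** 2 for _ in range(n))
        err_n += (ss / n - sigma2) ** 2
        err_n2 += (ss / (n + 2) - sigma2) ** 2
    return err_n / trials, err_n2 / trials

n, sigma2 = 10, 1.0
mse_unbiased, mse_biased = simulate_mse(n, sigma2, trials=40000)
crb = 2 * sigma2 ** 2 / n                      # bound for unbiased estimators
theory_biased = 2 * sigma2 ** 2 / (n + 2)      # MSE of the divide-by-(n+2) estimator
print(mse_unbiased, mse_biased, crb, theory_biased)
```

Both estimators are evaluated on the same simulated samples, so the difference between their estimated errors is less noisy than either estimate on its own.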
When the mean is not known, the minimum mean squared error estimate of the variance of a sample from a Gaussian distribution is achieved by dividing by $n+1$, rather than $n-1$ or $n+2$.
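The choice of divisor in the unknown-mean case can be verified with exact arithmetic. The sketch below relies on the standard fact that, for normal data, $S/\sigma^2$ with $S = \sum_i (X_i - \bar{X})^2$ follows a chi-square distribution with $n-1$ degrees of freedom; the function name and the value of `n` are illustrative.

```python
from fractions import Fraction

def mse_factor(n, c):
    """MSE of sum((X_i - Xbar)^2) / c in units of sigma^4, for normal data."""
    # With S = sum((X_i - Xbar)^2), S / sigma^2 is chi-square with n - 1 degrees
    # of freedom, so E[S] = (n - 1) sigma^2 and var(S) = 2 (n - 1) sigma^4.
    variance = Fraction(2 * (n - 1), c ** 2)
    bias = Fraction(n - 1, c) - 1
    return variance + bias ** 2

n = 10
factors = {c: mse_factor(n, c) for c in (n - 1, n, n + 1, n + 2)}
best_divisor = min(factors, key=factors.get)
print(factors, best_divisor)
```

Among the divisors compared, $n+1$ gives the smallest mean squared error, $2\sigma^4/(n+1)$, while $n-1$ gives the unbiased estimator.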