In decision theory and estimation theory, Stein's example (also known as Stein's phenomenon or Stein's paradox) is the observation that when three or more parameters are estimated simultaneously, there exist combined estimators more accurate on average (that is, having lower expected mean squared error) than any method that handles the parameters separately. It is named after Charles Stein of Stanford University, who discovered the phenomenon in 1955.
An intuitive explanation is that optimizing for the mean-squared error of a combined estimator is not the same as optimizing for the errors of separate estimators of the individual parameters. In practical terms, if the combined error is in fact of interest, then a combined estimator should be used, even if the underlying parameters are independent. If one is instead interested in estimating an individual parameter, then using a combined estimator does not help and is in fact worse.
The following is the simplest form of the paradox, the special case in which the number of observations is equal to the number of parameters to be estimated. Let $\boldsymbol{\theta}$ be a vector consisting of $n \geq 3$ unknown parameters. To estimate these parameters, a single measurement $X_i$ is performed for each parameter $\theta_i$, resulting in a vector $X$ of length $n$. Suppose the measurements are known to be independent, Gaussian random variables, with mean $\boldsymbol{\theta}$ and variance 1, i.e., $X \sim \mathcal{N}(\boldsymbol{\theta}, I_n)$.
Under these conditions, it is intuitive and common to use each measurement as an estimate of its corresponding parameter. This so-called "ordinary" decision rule can be written as $\hat{\boldsymbol{\theta}} = X$, which is the maximum likelihood estimator (MLE). The quality of such an estimator is measured by its risk function. A commonly used risk function is the mean squared error, defined as $\operatorname{E}\left[\|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\|^2\right]$. Surprisingly, it turns out that the "ordinary" estimator is suboptimal in terms of mean squared error whenever $n \geq 3$: in the setting discussed here, there exist alternative estimators that always achieve lower mean squared error, no matter what the value of $\boldsymbol{\theta}$ is. For a given $\boldsymbol{\theta}$ one could obviously define a perfect "estimator" that always returns $\boldsymbol{\theta}$ itself, but this estimator would be bad for other values of $\boldsymbol{\theta}$.
The estimators of Stein's paradox are, for a given $\boldsymbol{\theta}$, better than the "ordinary" decision rule $X$ for some $X$ but necessarily worse for others; it is only on average that they are better. More precisely, an estimator $\hat{\boldsymbol{\theta}}_1$ is said to dominate another estimator $\hat{\boldsymbol{\theta}}_2$ if, for all values of $\boldsymbol{\theta}$, the risk of $\hat{\boldsymbol{\theta}}_1$ is lower than, or equal to, the risk of $\hat{\boldsymbol{\theta}}_2$, and if the inequality is strict for some $\boldsymbol{\theta}$. An estimator is said to be admissible if no other estimator dominates it; otherwise it is inadmissible. Thus, Stein's example can be stated simply as follows: the ordinary decision rule for estimating the mean of a multivariate Gaussian distribution is inadmissible under mean squared error risk.
Many simple, practical estimators achieve better performance than the "ordinary" decision rule. The best-known example is the James–Stein estimator, which shrinks $X$ towards a particular point (such as the origin) by an amount inversely proportional to the distance of $X$ from that point. A sketch of a proof of this result is given below. An alternative proof is due to Larry Brown: he proved that the ordinary estimator for an $n$-dimensional multivariate normal mean vector is admissible if and only if the $n$-dimensional Brownian motion is recurrent. Since Brownian motion is not recurrent for $n \geq 3$, the MLE is not admissible for $n \geq 3$.
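The improvement is easy to see numerically. The following is a minimal Monte Carlo sketch in Python (assuming NumPy; the dimension, true mean vector, trial count, and seed are arbitrary choices for illustration, and the shrinkage rule used is the basic, non-positive-part James–Stein estimator): for any fixed $\boldsymbol{\theta}$ the estimated total mean squared error of the shrunken estimate comes out below that of the ordinary estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 100_000          # dimension (n >= 3) and number of repetitions
theta = rng.normal(size=n)       # an arbitrary fixed "true" mean vector

# Draw X ~ N(theta, I_n) independently for each trial (one row per trial).
X = theta + rng.standard_normal((trials, n))

# "Ordinary" decision rule: estimate theta by X itself.
risk_mle = np.mean(np.sum((X - theta) ** 2, axis=1))

# Basic James-Stein rule: shrink X towards the origin by (n - 2) / |X|^2.
shrink = 1.0 - (n - 2) / np.sum(X ** 2, axis=1, keepdims=True)
risk_js = np.mean(np.sum((shrink * X - theta) ** 2, axis=1))

print(f"ordinary estimator risk ~ {risk_mle:.3f} (theory: {n})")
print(f"James-Stein risk        ~ {risk_js:.3f} (smaller for every theta)")
```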
For any particular value of $\boldsymbol{\theta}$ the new estimator will improve at least one of the individual mean squared errors $\operatorname{E}[(\theta_i - \hat{\theta}_i)^2]$. This is not hard to see: for instance, if every $\theta_i$ lies between $-1$ and $1$, and $\sigma = 1$, then an estimator that shrinks $X$ towards 0, namely $\operatorname{sign}(X_i)\max(|X_i| - 0.5, 0)$, a soft-threshold estimator with threshold $0.5$, will have a lower mean squared error than $X$ itself. But there are other values of $\boldsymbol{\theta}$ for which this estimator is worse than $X$ itself. The trick of the Stein estimator, and others that yield the Stein paradox, is that they adjust the shift in such a way that there is always (for any $\boldsymbol{\theta}$ vector) at least one $X_i$ whose estimate $\hat{\theta}_i$ is made worse in mean squared error, but whose worsening is compensated for by the improvement in other components. The trouble is that, without knowing $\boldsymbol{\theta}$, you do not know which of the $n$ mean squared errors are improved, so you cannot use the Stein estimator only for those parameters.
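This componentwise trade-off can be illustrated with a small Python sketch (assuming NumPy; the two values of $\theta_i$, the threshold of 0.5, the trial count, and the seed are arbitrary illustration choices): the soft-threshold rule beats the raw measurement for a coordinate with a small true mean but loses for a coordinate with a large one.

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 200_000

def soft_threshold(x, t=0.5):
    """Shrink each value towards 0 by t, clipping at 0 (soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

for theta_i in (0.5, 3.0):       # one small and one large true mean
    X_i = theta_i + rng.standard_normal(trials)
    mse_raw = np.mean((X_i - theta_i) ** 2)                      # about 1
    mse_shrunk = np.mean((soft_threshold(X_i) - theta_i) ** 2)
    print(f"theta_i = {theta_i}: raw MSE ~ {mse_raw:.3f}, "
          f"soft-threshold MSE ~ {mse_shrunk:.3f}")
```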
An example of the above setting occurs in channel estimation in telecommunications, where many different factors affect overall channel performance.
Stein's example is surprising, since the "ordinary" decision rule is intuitive and commonly used. In fact, numerous methods for estimator construction, including maximum likelihood estimation, best linear unbiased estimation, least squares estimation and optimal equivariant estimation, all result in the "ordinary" estimator. Yet, as discussed above, this estimator is suboptimal.
To demonstrate the unintuitive nature of Stein's example, consider the following real-world example. Suppose we are to estimate three unrelated parameters, such as the US wheat yield for 1993, the number of spectators at the Wimbledon tennis tournament in 2001, and the weight of a randomly chosen candy bar from the supermarket. Suppose we have independent Gaussian measurements of each of these quantities. Stein's example now tells us that we can get a better estimate (on average) for the vector of three parameters by simultaneously using the three unrelated measurements.
At first sight it appears that somehow we get a better estimator for US wheat yield by measuring some other unrelated statistics such as the number of spectators at Wimbledon and the weight of a candy bar. However, we have not obtained a better estimator for US wheat yield by itself, but we have produced an estimator for the vector of the means of all three random variables, which has a reduced total risk. This occurs because the cost of a bad estimate in one component of the vector is compensated by a better estimate in another component. Also, a specific set of the three estimated mean values obtained with the new estimator will not necessarily be better than the ordinary set (the measured values). It is only on average that the new estimator is better.
The risk function of the decision rule $d(x) = x$ is

$$R(\theta, d) = \operatorname{E}_\theta\left[\|\boldsymbol{\theta} - X\|^2\right] = \int (\boldsymbol{\theta} - x)^T(\boldsymbol{\theta} - x)\left(\frac{1}{2\pi}\right)^{n/2} e^{-(1/2)(\boldsymbol{\theta} - x)^T(\boldsymbol{\theta} - x)}\, dx = n.$$
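A quick Monte Carlo check of this identity (a sketch assuming NumPy; dimension, mean vector, and trial count are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 5, 200_000
theta = np.linspace(-2.0, 2.0, n)        # any fixed mean vector works

X = theta + rng.standard_normal((trials, n))
print(np.mean(np.sum((theta - X) ** 2, axis=1)))   # ~ n = 5, whatever theta is
```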
Now consider the decision rule

$$d'(x) = x - \frac{\alpha}{\|x\|^2}\, x,$$

where $\alpha = n - 2$. We will show that $d'$ is a better decision rule than $d$. Its risk function is

$$R(\theta, d') = \operatorname{E}_\theta\left[\left\|\boldsymbol{\theta} - X + \frac{\alpha}{\|X\|^2}\, X\right\|^2\right]$$

$$= \operatorname{E}_\theta\left[\|\boldsymbol{\theta} - X\|^2 + 2(\boldsymbol{\theta} - X)^T \frac{\alpha}{\|X\|^2}\, X + \frac{\alpha^2}{\|X\|^4}\,\|X\|^2\right]$$

$$= \operatorname{E}_\theta\left[\|\boldsymbol{\theta} - X\|^2\right] + 2\alpha\,\operatorname{E}_\theta\left[\frac{(\boldsymbol{\theta} - X)^T X}{\|X\|^2}\right] + \alpha^2\,\operatorname{E}_\theta\left[\frac{1}{\|X\|^2}\right].$$
This is a quadratic function of $\alpha$. We may simplify the middle term by considering a general "well-behaved" function $h : x \mapsto h(x) \in \mathbb{R}$ and using integration by parts. For $1 \leq i \leq n$ and any continuously differentiable $h$ growing sufficiently slowly for large $x_i$, we have:

$$\operatorname{E}_\theta[(\theta_i - X_i)\, h(X) \mid X_j = x_j\ (j \neq i)] = \int (\theta_i - x_i)\, h(x)\left(\frac{1}{2\pi}\right)^{n/2} e^{-(1/2)(\boldsymbol{\theta} - x)^T(\boldsymbol{\theta} - x)}\, dx_i$$

$$= \left[h(x)\left(\frac{1}{2\pi}\right)^{n/2} e^{-(1/2)(\boldsymbol{\theta} - x)^T(\boldsymbol{\theta} - x)}\right]_{x_i = -\infty}^{\infty} - \int \frac{\partial h}{\partial x_i}(x)\left(\frac{1}{2\pi}\right)^{n/2} e^{-(1/2)(\boldsymbol{\theta} - x)^T(\boldsymbol{\theta} - x)}\, dx_i$$

$$= -\operatorname{E}_\theta\left[\frac{\partial h}{\partial x_i}(X)\ \Big|\ X_j = x_j\ (j \neq i)\right].$$
Therefore,

$$\operatorname{E}_\theta[(\theta_i - X_i)\, h(X)] = -\operatorname{E}_\theta\left[\frac{\partial h}{\partial x_i}(X)\right].$$

(This result is known as Stein's lemma.)
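The lemma is easy to verify by simulation. A minimal sketch assuming NumPy (the dimension, mean vector, seed, and the smooth test function $h(x) = \sin(x_1)$ are arbitrary illustration choices), comparing the two sides of the identity for the coordinate $i = 1$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 4, 2_000_000
theta = np.array([0.3, -1.0, 2.0, 0.5])

X = theta + rng.standard_normal((trials, n))

# Test function h(x) = sin(x_1); its partial derivative in x_1 is cos(x_1).
lhs = np.mean((theta[0] - X[:, 0]) * np.sin(X[:, 0]))
rhs = -np.mean(np.cos(X[:, 0]))
print(lhs, rhs)    # the two Monte Carlo estimates agree up to sampling error
```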
Now, we choose

$$h(x) = \frac{x_i}{\|x\|^2}.$$
If $h$ met the "well-behaved" condition above (it does not, but this can be remedied, as discussed below), we would have

$$\frac{\partial h}{\partial x_i} = \frac{1}{\|x\|^2} - \frac{2 x_i^2}{\|x\|^4}$$

and so

$$\operatorname{E}_\theta\left[\frac{(\boldsymbol{\theta} - X)^T X}{\|X\|^2}\right] = \sum_{i=1}^n \operatorname{E}_\theta\left[(\theta_i - X_i)\,\frac{X_i}{\|X\|^2}\right] = -\sum_{i=1}^n \operatorname{E}_\theta\left[\frac{1}{\|X\|^2} - \frac{2 X_i^2}{\|X\|^4}\right] = -(n - 2)\operatorname{E}_\theta\left[\frac{1}{\|X\|^2}\right].$$
Then returning to the risk function of $d'$:

$$R(\theta, d') = n - 2\alpha(n - 2)\operatorname{E}_\theta\left[\frac{1}{\|X\|^2}\right] + \alpha^2\operatorname{E}_\theta\left[\frac{1}{\|X\|^2}\right].$$
This quadratic in $\alpha$ is minimized at $\alpha = n - 2$, giving

$$R(\theta, d') = R(\theta, d) - (n - 2)^2\operatorname{E}_\theta\left[\frac{1}{\|X\|^2}\right],$$

which of course satisfies $R(\theta, d') < R(\theta, d)$, making $d$ an inadmissible decision rule.
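The closed-form risk can be checked against a direct simulation of $d'$; a minimal sketch assuming NumPy (dimension, mean vector, trial count, and seed are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 6, 500_000
theta = np.full(n, 1.5)
alpha = n - 2

X = theta + rng.standard_normal((trials, n))
sq_norm = np.sum(X ** 2, axis=1)

# Direct Monte Carlo risk of d'(x) = x - (alpha / |x|^2) x.
d_prime = X - (alpha / sq_norm)[:, None] * X
risk_mc = np.mean(np.sum((d_prime - theta) ** 2, axis=1))

# Closed form from the proof: R(theta, d') = n - (n - 2)^2 E[1 / |X|^2].
risk_formula = n - (n - 2) ** 2 * np.mean(1.0 / sq_norm)

print(risk_mc, risk_formula)    # both agree and fall below R(theta, d) = n
```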
It remains to justify the use of

$$h(X) = \frac{X}{\|X\|^2}.$$

This function is not continuously differentiable, since it is singular at $x = 0$. However, the function

$$h(X) = \frac{X}{\varepsilon + \|X\|^2}$$

is continuously differentiable, and after following the algebra through and letting $\varepsilon \to 0$, one obtains the James–Stein estimator.
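A small numerical illustration of the limiting argument (a sketch assuming NumPy; all concrete values are arbitrary illustration choices): the risk of the regularized rule $x - \alpha x / (\varepsilon + \|x\|^2)$ approaches the risk of $d'$ as $\varepsilon$ shrinks.

```python
import numpy as np

rng = np.random.default_rng(5)
n, trials = 6, 500_000
theta = np.full(n, 1.5)
alpha = n - 2

X = theta + rng.standard_normal((trials, n))
sq_norm = np.sum(X ** 2, axis=1)

# Risk of the regularized rule x - alpha*x/(eps + |x|^2) for shrinking eps;
# eps = 0 recovers d' itself.
for eps in (10.0, 1.0, 0.1, 0.0):
    est = X - (alpha / (eps + sq_norm))[:, None] * X
    risk = np.mean(np.sum((est - theta) ** 2, axis=1))
    print(f"eps = {eps:>4}: risk ~ {risk:.3f}")
```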