In estimation theory and decision theory, a Bayes estimator or a Bayes action is an estimator or decision rule that minimizes the posterior expected value of a loss function (i.e., the posterior expected loss). Equivalently, it maximizes the posterior expectation of a utility function. An alternative way of formulating an estimator within Bayesian statistics is maximum a posteriori estimation.
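Suppose an unknown parameter $\theta$ is known to have a prior distribution $\pi$. Let $\widehat{\theta}=\widehat{\theta}(x)$ be an estimator of $\theta$ (based on some measurements $x$), and let $L(\theta,\widehat{\theta})$ be a loss function, such as squared error. The Bayes risk of $\widehat{\theta}$ is defined as $E_\pi(L(\theta,\widehat{\theta}))$, where the expectation is taken over the probability distribution of $\theta$: this defines the risk function as a function of $\widehat{\theta}$. An estimator $\widehat{\theta}$ is said to be a Bayes estimator if it minimizes the Bayes risk among all estimators. Equivalently, the estimator which minimizes the posterior expected loss $E(L(\theta,\widehat{\theta})|x)$ for each $x$ also minimizes the Bayes risk and therefore is a Bayes estimator.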
If the prior is improper then an estimator which minimizes the posterior expected loss for each $x$ is called a generalized Bayes estimator.
See main article: Minimum mean square error. The most common risk function used for Bayesian estimation is the mean square error (MSE), also called squared error risk. The MSE is defined by
$$\mathrm{MSE}=E\left[(\widehat{\theta}(x)-\theta)^2\right],$$
where the expectation is taken over the joint distribution of $\theta$ and $x$.
Using the MSE as risk, the Bayes estimate of the unknown parameter is simply the mean of the posterior distribution,[3]
$$\widehat{\theta}(x)=E[\theta|x]=\int\theta\,p(\theta|x)\,d\theta.$$
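The claim that the posterior mean minimizes the posterior expected squared loss can be checked numerically. The following is a minimal sketch, not a definitive implementation: it assumes a hypothetical setup (prior N(0, 2²), likelihood N(θ, 1), observation x = 1.5), approximates the posterior on a uniform grid, and searches the grid by brute force. All function names are illustrative.

```python
import math

def normal_pdf(z, mean, sd):
    """Density of N(mean, sd^2) at z."""
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def posterior_on_grid(x, prior_pdf, likelihood, grid):
    """Normalized posterior weights p(theta | x) on a uniform grid."""
    unnorm = [likelihood(x, t) * prior_pdf(t) for t in grid]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def expected_loss(estimate, grid, weights, loss):
    """Posterior expected loss of a candidate estimate."""
    return sum(w * loss(t, estimate) for t, w in zip(grid, weights))

def squared_loss(theta, estimate):
    return (theta - estimate) ** 2

# Assumed example: theta ~ N(0, 2^2), x | theta ~ N(theta, 1), observed x = 1.5
grid = [-10 + 0.02 * i for i in range(1001)]
weights = posterior_on_grid(
    1.5,
    prior_pdf=lambda t: normal_pdf(t, 0.0, 2.0),
    likelihood=lambda x, t: normal_pdf(x, t, 1.0),
    grid=grid,
)

posterior_mean = sum(t * w for t, w in zip(grid, weights))
# Brute-force minimizer of the posterior expected squared loss over the grid
best = min(grid, key=lambda a: expected_loss(a, grid, weights, squared_loss))
```

In this setup the grid minimizer `best` should agree with `posterior_mean` up to the grid spacing, and both should be close to the closed-form value 1.2 given by the normal conjugate formula.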
See main article: Conjugate prior. If there is no inherent reason to prefer one prior probability distribution over another, a conjugate prior is sometimes chosen for simplicity. A conjugate prior is defined as a prior distribution belonging to some parametric family, for which the resulting posterior distribution also belongs to the same family. This is an important property, since the Bayes estimator, as well as its statistical properties (variance, confidence interval, etc.), can all be derived from the posterior distribution.
Conjugate priors are especially useful for sequential estimation, where the posterior of the current measurement is used as the prior in the next measurement. In sequential estimation, unless a conjugate prior is used, the posterior distribution typically becomes more complex with each added measurement, and the Bayes estimator cannot usually be calculated without resorting to numerical methods.
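As an illustration of sequential estimation with a conjugate prior, the sketch below assumes a normal prior and normal measurements with known noise variance. Each posterior is again normal and is fed back as the prior for the next measurement. The function names are illustrative and not taken from any library.

```python
def update_normal(prior_mean, prior_var, x, noise_var):
    """One conjugate update: prior N(prior_mean, prior_var), x | theta ~ N(theta, noise_var)."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
    post_mean = post_var * (prior_mean / prior_var + x / noise_var)
    return post_mean, post_var

def sequential_estimate(prior_mean, prior_var, xs, noise_var):
    """Process measurements one at a time, reusing each posterior as the next prior."""
    mean, var = prior_mean, prior_var
    for x in xs:
        mean, var = update_normal(mean, var, x, noise_var)
    return mean, var

def batch_estimate(prior_mean, prior_var, xs, noise_var):
    """Process all measurements at once via their average, which has variance noise_var / n."""
    n = len(xs)
    return update_normal(prior_mean, prior_var, sum(xs) / n, noise_var / n)

measurements = [2.1, 1.7, 2.4, 1.9]  # hypothetical data
seq = sequential_estimate(0.0, 4.0, measurements, 1.0)
batch = batch_estimate(0.0, 4.0, measurements, 1.0)
```

Because the posterior stays in the normal family, the sequential and batch results coincide up to rounding, and the state carried between measurements is just two numbers.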
Following are some examples of conjugate priors.
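If $x|\theta$ is normal, $x|\theta\sim N(\theta,\sigma^2)$, and the prior is normal, $\theta\sim N(\mu,\tau^2)$, then the posterior is also normal and the Bayes estimator under MSE is given by
$$\widehat{\theta}(x)=\frac{\sigma^2}{\sigma^2+\tau^2}\mu+\frac{\tau^2}{\sigma^2+\tau^2}x.$$
If $x_1,...,x_n$ are iid Poisson random variables $x_i|\theta\sim P(\theta)$, and if the prior is Gamma distributed $\theta\sim G(a,b)$, then the posterior is also Gamma distributed, and the Bayes estimator under MSE is given by
$$\widehat{\theta}(X)=\frac{n\overline{X}+a}{n+b}.$$
If $x_1,...,x_n$ are iid uniformly distributed $x_i|\theta\sim U(0,\theta)$, and if the prior is Pareto distributed $\theta\sim Pa(\theta_0,a)$, then the posterior is also Pareto distributed, and the Bayes estimator under MSE is given by
$$\widehat{\theta}(X)=\frac{(a+n)\max{(\theta_0,x_1,...,x_n)}}{a+n-1}.$$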
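The three conjugate examples can be written as short functions. This is a sketch under stated assumptions: the Gamma prior is taken in the shape/rate parameterization (shape a, rate b), and the uniform/Pareto estimate is taken to be the mean of a Pareto posterior with scale max(θ0, x1, ..., xn) and shape a + n. Function names are illustrative.

```python
def bayes_normal(x, sigma2, mu, tau2):
    """Normal likelihood N(theta, sigma2) with normal prior N(mu, tau2)."""
    return (sigma2 * mu + tau2 * x) / (sigma2 + tau2)

def bayes_poisson_gamma(xs, a, b):
    """Poisson observations with a Gamma(shape=a, rate=b) prior (assumed parameterization)."""
    return (sum(xs) + a) / (len(xs) + b)

def bayes_uniform_pareto(xs, theta0, a):
    """U(0, theta) observations with a Pareto(scale=theta0, shape=a) prior.

    Assumes the estimate is the mean of the Pareto posterior, which
    requires a + n > 1.
    """
    n = len(xs)
    return (a + n) * max(theta0, *xs) / (a + n - 1)

normal_est = bayes_normal(2.0, sigma2=1.0, mu=0.0, tau2=1.0)
poisson_est = bayes_poisson_gamma([3, 5, 4], a=2, b=1)
uniform_est = bayes_uniform_pareto([0.5, 2.0, 1.2], theta0=1.0, a=2)
```

With equal prior and noise variances, the normal estimate sits halfway between the prior mean and the observation, which is a quick sanity check on the weights.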
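Risk functions are chosen depending on how one measures the distance between the estimate and the unknown parameter. The MSE is the most common risk function in use, primarily due to its simplicity. However, alternative risk functions are also occasionally used. The following are several examples of such alternatives. We denote the posterior generalized distribution function by $F$.
A "linear" loss function with $a>0$, which yields the posterior median as the Bayes estimate:
$$L(\theta,\widehat{\theta})=a|\theta-\widehat{\theta}|$$
$$F(\widehat{\theta}(x)|X)=\tfrac{1}{2}.$$
Another "linear" loss function, which assigns different weights $a,b>0$ to overestimation and underestimation. It yields a quantile of the posterior distribution and generalizes the previous loss function:
$$L(\theta,\widehat{\theta})=\begin{cases} a|\theta-\widehat{\theta}|,&\text{for }\theta-\widehat{\theta}\ge 0\\ b|\theta-\widehat{\theta}|,&\text{for }\theta-\widehat{\theta}<0 \end{cases}$$
$$F(\widehat{\theta}(x)|X)=\frac{a}{a+b}.$$
The following loss function, with $K>0$ and $L>0$, is trickier: it yields either the posterior mode or a point close to it, depending on the curvature and properties of the posterior distribution. Small values of $K$ are recommended, in order to use the mode as an approximation:
$$L(\theta,\widehat{\theta})=\begin{cases} 0,&\text{for }|\theta-\widehat{\theta}|<K\\ L,&\text{for }|\theta-\widehat{\theta}|\ge K. \end{cases}$$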
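The link between the linear loss functions and posterior quantiles can be checked numerically. The sketch below assumes the posterior is represented by equally weighted samples (here hypothetical draws from N(1, 2²)) and minimizes the average loss by brute force over the samples themselves. Names are illustrative.

```python
import random

def asymmetric_loss(theta, estimate, a, b):
    """a*|theta - estimate| if theta >= estimate, otherwise b*|theta - estimate|."""
    diff = theta - estimate
    return a * diff if diff >= 0 else -b * diff

def bayes_estimate(samples, a, b):
    """Sample point that minimizes the average asymmetric linear loss."""
    def risk(estimate):
        return sum(asymmetric_loss(t, estimate, a, b) for t in samples) / len(samples)
    return min(samples, key=risk)

rng = random.Random(0)
# Hypothetical posterior samples, sorted so that positions correspond to quantiles
samples = sorted(rng.gauss(1.0, 2.0) for _ in range(1001))

median_est = bayes_estimate(samples, a=1.0, b=1.0)  # symmetric case
q75_est = bayes_estimate(samples, a=3.0, b=1.0)     # target quantile a/(a+b) = 0.75
```

With 1001 sorted samples, the symmetric case should land at the middle sample (position 500) and the asymmetric case near position 750, matching the quantile level a/(a+b).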
Other loss functions can be conceived, although the mean squared error is the most widely used and validated. Other loss functions are used in statistics, particularly in robust statistics.
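The prior distribution $p$ has thus far been assumed to be a true probability distribution, in that
$$\int p(\theta)d\theta=1.$$
However, occasionally this can be a restrictive requirement. For example, there is no distribution on the whole real line under which every real number is equally likely, even though such a "distribution" seems a natural choice for a non-informative prior. One can still define the function $p(\theta)=1$, but this is not a proper probability distribution since it has infinite mass,
$$\int{p(\theta)d\theta}=\infty.$$
Such measures $p(\theta)$, which are not probability distributions, are referred to as improper priors.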
The use of an improper prior means that the Bayes risk is undefined (since the prior is not a probability distribution and we cannot take an expectation under it). As a consequence, it is no longer meaningful to speak of a Bayes estimator that minimizes the Bayes risk. Nevertheless, in many cases, one can define the posterior distribution
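$$p(\theta|x)=\frac{p(x|\theta)p(\theta)}{\int p(x|\theta)p(\theta)d\theta}.$$
This is a definition, and not an application of Bayes' theorem, since Bayes' theorem can only be applied when all distributions are proper. However, it is not uncommon for the resulting "posterior" to be a valid probability distribution. In this case, the posterior expected loss
$$\int{L(\theta,a)p(\theta|x)d\theta}$$
is typically well-defined and finite. Recall that, for a proper prior, the Bayes estimator minimizes the posterior expected loss. When the prior is improper, an estimator which minimizes the posterior expected loss is referred to as a generalized Bayes estimator.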
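A typical example is estimation of a location parameter with a loss function of the type $L(a-\theta)$. Here $\theta$ is a location parameter, i.e., $p(x|\theta)=f(x-\theta)$.
It is common to use the improper prior $p(\theta)=1$ in this case, especially when no other more subjective information is available. This yields
$$p(\theta|x)=\frac{p(x|\theta)p(\theta)}{p(x)}=\frac{f(x-\theta)}{p(x)},$$
so the posterior expected loss is
$$E[L(a-\theta)|x]=\int{L(a-\theta)p(\theta|x)d\theta}=\frac{1}{p(x)}\int L(a-\theta)f(x-\theta)d\theta.$$
The generalized Bayes estimator is the value $a(x)$ that minimizes this expression for a given $x$. This is equivalent to minimizing
$$\int L(a-\theta)f(x-\theta)d\theta$$
for a given $x$. Call this expression (1).
In this case it can be shown that the generalized Bayes estimator has the form $x+a_0$, for some constant $a_0$. To see this, let $a_0$ be the value minimizing (1) when $x=0$. Then, given a different value $x_1$, we must minimize
$$\int L(a-\theta)f(x_1-\theta)d\theta=\int L(a-x_1-\theta')f(-\theta')d\theta',$$
where $\theta'=\theta-x_1$. This is identical to (1) with $x=0$, except that $a$ has been replaced by $a-x_1$. Thus, the minimum is attained at $a-x_1=a_0$, so that the optimal estimator has the form
$$a(x)=a_0+x.$$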
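The shift property of the generalized Bayes estimator for a location parameter can be illustrated numerically. The sketch below is an illustration under assumptions, not a definitive implementation: it uses a hypothetical skewed noise density f(e) = e·exp(−e) for e > 0, squared loss, the improper prior p(θ) = 1, and a brute-force grid search over a. For squared loss the minimizer is x minus the mean of the noise, which is 2 for this density.

```python
import math

def generalized_bayes(x, f, loss, step=0.05, half_width=15.0):
    """Grid-minimize the integral of loss(a - theta) * f(x - theta) over theta, as a function of a."""
    n = int(round(2 * half_width / step))
    thetas = [x - half_width + step * i for i in range(n + 1)]
    weights = [f(x - t) for t in thetas]  # improper prior p(theta) = 1

    def objective(a):
        return sum(w * loss(a - t) for t, w in zip(thetas, weights))

    # Candidate values of a are taken from the same grid as theta
    return min(thetas, key=objective)

def noise_density(e):
    """Assumed noise density: f(e) = e * exp(-e) for e > 0, zero otherwise."""
    return e * math.exp(-e) if e > 0 else 0.0

def squared(d):
    return d * d

# The offset a(x) - x should not depend on x
offsets = [generalized_bayes(x, noise_density, squared) - x for x in (0.0, 1.0, 4.0)]
```

In this sketch the three offsets agree (up to rounding) with the constant a0 = −2, consistent with the form a(x) = a0 + x.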
See main article: Empirical Bayes method. A Bayes estimator derived through the empirical Bayes method is called an empirical Bayes estimator. Empirical Bayes methods enable the use of auxiliary empirical data, from observations of related parameters, in the development of a Bayes estimator. This is done under the assumption that the estimated parameters are obtained from a common prior. For example, if independent observations of different parameters are performed, then the estimation performance of a particular parameter can sometimes be improved by using data from other observations.
There are both parametric and non-parametric approaches to empirical Bayes estimation.[4]
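The following is a simple example of parametric empirical Bayes estimation. Given past observations $x_1,\ldots,x_n$ having conditional distribution $f(x_i|\theta_i)$, one is interested in estimating $\theta_{n+1}$ based on $x_{n+1}$. Assume that the $\theta_i$'s have a common prior $\pi$ which depends on unknown parameters. For example, suppose that $\pi$ is normal with unknown mean $\mu_\pi$ and variance $\sigma_\pi^2.$ We can then use the past observations to determine the mean and variance of $\pi$ in the following way.
First, we estimate the mean $\mu_m$ and variance $\sigma_m^2$ of the marginal distribution of $x_1,\ldots,x_n$ using the maximum likelihood approach:
$$\widehat{\mu}_m=\frac{1}{n}\sum{x_i},$$
$$\widehat{\sigma}_m^2=\frac{1}{n}\sum{(x_i-\widehat{\mu}_m)^2}.$$
Next, we use the law of total expectation to compute $\mu_m$ and the law of total variance to compute $\sigma_m^2$, such that
$$\mu_m=E_\pi[\mu_f(\theta)],$$
$$\sigma_m^2=E_\pi[\sigma_f^2(\theta)]+E_\pi[(\mu_f(\theta)-\mu_m)^2],$$
where $\mu_f(\theta)$ and $\sigma_f(\theta)$ are the moments of the conditional distribution $f(x_i|\theta_i)$, which are assumed to be known. In particular, suppose that $\mu_f(\theta)=\theta$ and that $\sigma_f^2(\theta)=K$; we then have
$$\mu_\pi=\mu_m,$$
$$\sigma_\pi^2=\sigma_m^2-\sigma_f^2=\sigma_m^2-K.$$
Finally, we obtain the estimated moments of the prior,
$$\widehat{\mu}_\pi=\widehat{\mu}_m,$$
$$\widehat{\sigma}_\pi^2=\widehat{\sigma}_m^2-K.$$
For example, if $x_i|\theta_i\sim N(\theta_i,1)$, and if we assume a normal prior (which is a conjugate prior in this case), we conclude that $\theta_{n+1}\sim N(\widehat{\mu}_\pi,\widehat{\sigma}_\pi^2)$, from which the Bayes estimator of $\theta_{n+1}$ based on $x_{n+1}$ can be calculated.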
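The parametric empirical Bayes recipe above can be sketched in a few lines. The sketch assumes a normal prior and a normal likelihood with known variance K, and plugs the estimated prior moments into the normal conjugate formula for the posterior mean. Clipping the estimated prior variance at zero is an added safeguard that is not part of the description above; the function name and data are hypothetical.

```python
def empirical_bayes_normal(past_xs, x_new, K=1.0):
    """Estimate the prior moments from past data, then estimate theta_{n+1} from x_new."""
    n = len(past_xs)
    # Maximum likelihood estimates of the marginal mean and variance
    mu_m = sum(past_xs) / n
    var_m = sum((x - mu_m) ** 2 for x in past_xs) / n
    # Moment matching: mu_pi = mu_m and sigma_pi^2 = sigma_m^2 - K
    mu_pi = mu_m
    var_pi = max(var_m - K, 0.0)  # clipping at zero is an assumption of this sketch
    # Posterior mean for prior N(mu_pi, var_pi) and likelihood N(theta, K)
    estimate = (K * mu_pi + var_pi * x_new) / (K + var_pi)
    return mu_pi, var_pi, estimate

# Hypothetical past observations and a new observation
mu_pi, var_pi, theta_hat = empirical_bayes_normal([1.0, 3.0, 5.0, 7.0], x_new=9.0)
```

Here the marginal variance is 5, so the estimated prior variance is 4 and the new observation 9 is shrunk towards the estimated prior mean 4, giving 8.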
See also: Admissible decision rule. Bayes rules having finite Bayes risk are typically admissible. The following are some specific examples of admissibility theorems.
By contrast, generalized Bayes rules often have undefined Bayes risk in the case of improper priors. These rules are often inadmissible and the verification of their admissibility can be difficult. For example, the generalized Bayes estimator of a location parameter θ based on Gaussian samples (described in the "Generalized Bayes estimator" section above) is inadmissible for $p>2$, where $p$ is the dimension of θ; this is known as Stein's phenomenon.
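Let θ be an unknown random variable, and suppose that $x_1,x_2,\ldots$ are iid samples with density $f(x_i|\theta)$. Let $\delta_n=\delta_n(x_1,\ldots,x_n)$ be a sequence of Bayes estimators of θ based on an increasing number of measurements. We are interested in analyzing the asymptotic performance of this sequence of estimators, i.e., the performance of $\delta_n$ for large n.
To this end, it is customary to regard θ as a deterministic parameter whose true value is $\theta_0$. Under suitable regularity conditions, for large samples the posterior density of θ is approximately normal, so the effect of the prior on the posterior becomes negligible. Moreover, if $\delta_n$ is the Bayes estimator under MSE risk, then it is asymptotically unbiased and it converges in distribution to the normal distribution:
$$\sqrt{n}(\delta_n-\theta_0)\to N\left(0,\frac{1}{I(\theta_0)}\right),$$
where I(θ0) is the Fisher information of θ0. It follows that the Bayes estimator δn under MSE is asymptotically efficient.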
Another estimator which is asymptotically normal and efficient is the maximum likelihood estimator (MLE). The relations between the maximum likelihood and Bayes estimators can be shown in the following simple example.
Consider the estimator of θ based on binomial sample x~b(θ,n) where θ denotes the probability for success. Assuming θ is distributed according to the conjugate prior, which in this case is the Beta distribution B(a,b), the posterior distribution is known to be B(a+x,b+n-x). Thus, the Bayes estimator under MSE is
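$$\delta_n(x)=E[\theta|x]=\frac{a+x}{a+b+n}.$$
The MLE in this case is x/n, and so we get
$$\delta_n(x)=\frac{a+b}{a+b+n}E[\theta]+\frac{n}{a+b+n}\delta_{MLE},$$
where E[θ]=a/(a+b) is the prior mean. The last equation implies that, for n → ∞, the Bayes estimator (in the described problem) is close to the MLE.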
On the other hand, when n is small, the prior information is still relevant to the decision problem and affects the estimate. To see the relative weight of the prior information, assume that a=b; in this case each measurement brings in 1 new bit of information; the formula above shows that the prior information has the same weight as a+b bits of the new information. In applications, one often knows very little about fine details of the prior distribution; in particular, there is no reason to assume that it coincides with B(a,b) exactly. In such a case, one possible interpretation of this calculation is: "there is a non-pathological prior distribution with the mean value 0.5 and the standard deviation d which gives the weight of prior information equal to 1/(4d²)−1 bits of new information."
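The decomposition of the Beta-binomial Bayes estimator into a weighted average of the prior mean and the MLE can be checked directly. This is a small sketch with illustrative function names and hypothetical numbers.

```python
def bayes_beta_binomial(x, n, a, b):
    """Posterior mean of theta for x successes in n trials with a Beta(a, b) prior."""
    return (a + x) / (a + b + n)

def as_weighted_average(x, n, a, b):
    """Same estimate, written as a weighted average of the prior mean and the MLE x/n."""
    prior_mean = a / (a + b)
    mle = x / n
    w_prior = (a + b) / (a + b + n)  # the prior counts as a + b pseudo-observations
    return w_prior * prior_mean + (1 - w_prior) * mle

small_n = bayes_beta_binomial(7, 10, a=2, b=2)        # prior still matters
large_n = bayes_beta_binomial(7000, 10000, a=2, b=2)  # close to the MLE 0.7
check = as_weighted_average(7, 10, a=2, b=2)
```

With n = 10 the estimate is 9/14, visibly pulled from the MLE 0.7 towards the prior mean 0.5, while with n = 10000 the same prior changes the estimate by less than 0.001.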
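Another example of the same phenomenon is the case when the prior estimate and a measurement are normally distributed. If the prior is centered at B with deviation Σ, and the measurement is centered at b with deviation σ, then the posterior is centered at
$$\frac{\alpha}{\alpha+\beta}B+\frac{\beta}{\alpha+\beta}b,$$
where the weights in this weighted average are α=σ² and β=Σ². In other words, the prior is combined with the measurement in exactly the same way as if it were an extra measurement to take into account.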
For example, if Σ=σ/2, then the deviation of 4 measurements combined matches the deviation of the prior (assuming that errors of measurements are independent). And the weights α,β in the formula for posterior match this: the weight of the prior is 4 times the weight of the measurement. Combining this prior with n measurements with average v results in the posterior centered at
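$$\frac{4}{4+n}B+\frac{n}{4+n}v.$$
In particular, the prior plays the same role as 4 measurements made in advance. In general, the prior has the weight of (σ/Σ)² measurements.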
Compare to the example of binomial distribution: there the prior has the weight of (σ/Σ)²−1 measurements. One can see that the exact weight does depend on the details of the distribution, but when σ≫Σ, the difference becomes small.
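The interpretation of a normal prior as a number of measurements made in advance can be sketched as follows. The function names and numbers are hypothetical, and the second function is the usual precision-weighted form, included only as a cross-check.

```python
def posterior_center(B, Sigma, measurements, sigma):
    """Posterior center when the prior N(B, Sigma^2) is treated as (sigma/Sigma)^2 measurements."""
    n = len(measurements)
    v = sum(measurements) / n
    prior_weight = (sigma / Sigma) ** 2
    return (prior_weight * B + n * v) / (prior_weight + n)

def posterior_center_precision(B, Sigma, measurements, sigma):
    """Same quantity computed by precision weighting (cross-check)."""
    prior_precision = 1.0 / Sigma ** 2
    data_precision = len(measurements) / sigma ** 2
    weighted_sum = B * prior_precision + sum(measurements) / sigma ** 2
    return weighted_sum / (prior_precision + data_precision)

# Sigma = sigma / 2, so the prior counts as 4 measurements
center = posterior_center(10.0, 0.5, [12.0, 14.0], 1.0)
center_check = posterior_center_precision(10.0, 0.5, [12.0, 14.0], 1.0)
```

With two measurements averaging 13 and a prior centered at 10 that counts as four measurements, the posterior center is (4·10 + 2·13)/6 = 11.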
The Internet Movie Database uses a formula for calculating and comparing the ratings of films by its users, including their Top Rated 250 Titles which is claimed to give "a true Bayesian estimate".[7] The following Bayesian formula was initially used to calculate a weighted average score for the Top 250, though the formula has since changed:
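$$W=\frac{Rv+Cm}{v+m}$$
where $W$ is the weighted rating, $R$ is the average rating for the film (a number from 1 to 10), $v$ is the number of ratings for the film, $m$ is the weight given to the prior estimate (expressed as a number of ratings), and $C$ is the mean rating across all films.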
Note that W is just the weighted arithmetic mean of R and C with weight vector (v, m). As the number of ratings v grows beyond m, the film's own average carries more weight than the mean rating across all films (C), and the weighted Bayesian rating (W) approaches the straight average (R). Conversely, the closer v is to zero, the closer W is to C. In simpler terms, the fewer ratings a film has, the more its weighted rating is pulled towards the average across all films, while a film with many ratings has a weighted rating close to its own arithmetic average.
IMDb's approach ensures that a film with only a few ratings, all at 10, would not rank above "The Godfather", for example, with a 9.2 average from over 500,000 ratings.
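The weighted rating is straightforward to compute. In the sketch below the values of m and C are hypothetical placeholders chosen only for illustration; they are not figures published by IMDb.

```python
def weighted_rating(R, v, m, C):
    """Weighted rating W = (R*v + C*m) / (v + m)."""
    return (R * v + C * m) / (v + m)

# Hypothetical parameters: prior weight m and overall mean rating C
M_PRIOR = 25000
C_MEAN = 7.0

few_perfect_votes = weighted_rating(R=10.0, v=5, m=M_PRIOR, C=C_MEAN)
many_votes = weighted_rating(R=9.2, v=500000, m=M_PRIOR, C=C_MEAN)
```

Under these assumed values, the film with five perfect ratings stays very close to the overall mean C, while the film with 500,000 ratings keeps a weighted rating close to its own average of 9.2, so the second ranks higher.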