In statistics and signal processing, a minimum mean square error (MMSE) estimator is an estimation method which minimizes the mean square error (MSE), which is a common measure of estimator quality, of the fitted values of a dependent variable. In the Bayesian setting, the term MMSE more specifically refers to estimation with a quadratic loss function. In such a case, the MMSE estimator is given by the posterior mean of the parameter to be estimated. Since the posterior mean is cumbersome to calculate, the form of the MMSE estimator is usually constrained to be within a certain class of functions. Linear MMSE estimators are a popular choice since they are easy to use, easy to calculate, and very versatile. They have given rise to many popular estimators such as the Wiener–Kolmogorov filter and the Kalman filter.
The term MMSE more specifically refers to estimation in a Bayesian setting with a quadratic cost function. The basic idea behind the Bayesian approach to estimation stems from practical situations where we often have some prior information about the parameter to be estimated. For instance, we may have prior information about the range that the parameter can assume; or we may have an old estimate of the parameter that we want to modify when a new observation is made available; or the statistics of an actual random signal such as speech. This is in contrast to non-Bayesian approaches such as the minimum-variance unbiased estimator (MVUE), where absolutely nothing is assumed to be known about the parameter in advance and which do not account for such situations. In the Bayesian approach, such prior information is captured by the prior probability density function of the parameters; and based directly on Bayes' theorem, it allows us to make better posterior estimates as more observations become available. Thus, unlike the non-Bayesian approach, where parameters of interest are assumed to be deterministic but unknown constants, the Bayesian estimator seeks to estimate a parameter that is itself a random variable. Furthermore, Bayesian estimation can also deal with situations where the sequence of observations is not necessarily independent. Thus Bayesian estimation provides yet another alternative to the MVUE. This is useful when the MVUE does not exist or cannot be found.
Let x be an n \times 1 hidden random vector variable, and let y be an m \times 1 known random vector variable (the measurement or observation), both of them not necessarily of the same dimension. An estimator \hat{x}(y) of x is any function of the measurement y. The estimation error vector is given by e=\hat{x}-x and its mean squared error (MSE) is given by the trace of the error covariance matrix

\operatorname{MSE}=\operatorname{tr}\left\{\operatorname{E}\{(\hat{x}-x)(\hat{x}-x)^T\}\right\}=\operatorname{E}\{(\hat{x}-x)^T(\hat{x}-x)\},

where the expectation \operatorname{E} is taken over x conditioned on y. When x is a scalar variable, the MSE expression simplifies to \operatorname{E}\left\{(\hat{x}-x)^2\right\}. Note that the MSE can equivalently be defined in other ways, since

\operatorname{tr}\left\{\operatorname{E}\{ee^T\}\right\}=\operatorname{E}\left\{\operatorname{tr}\{ee^T\}\right\}=\operatorname{E}\{e^Te\}=\sum_{i=1}^n\operatorname{E}\{e_i^2\}.
The MMSE estimator is then defined as the estimator achieving minimal MSE:
\hat{x}_{\operatorname{MMSE}}(y)=\operatorname{argmin}_{\hat{x}}\operatorname{MSE}.

When the means and variances are finite, the MMSE estimator is uniquely defined and is given by

\hat{x}_{\operatorname{MMSE}}(y)=\operatorname{E}\{x\mid y\}.

In other words, the MMSE estimator is the conditional expectation of x given the known observed value of the measurements. Also, since \hat{x}_{\operatorname{MMSE}} is the posterior mean, the error covariance matrix C_e=\operatorname{E}\{(\hat{x}-x)(\hat{x}-x)^T\} is equal to the posterior covariance matrix C_{X|Y},

C_e=C_{X|Y}.

The MMSE estimator is unbiased: \operatorname{E}\{\hat{x}_{\operatorname{MMSE}}(y)\}=\operatorname{E}\{x\}. It is also asymptotically unbiased and converges in distribution to the normal distribution,

\sqrt{n}(\hat{x}_{\operatorname{MMSE}}-x)\xrightarrow{d}\mathcal{N}\left(0,I^{-1}(x)\right),

where I(x) is the Fisher information of x. Thus, the MMSE estimator is asymptotically efficient.
For a fixed estimator of the form \hat{x}=g(y), the orthogonality principle states that \hat{x} is an optimal estimator, i.e. \hat{x}_{\operatorname{MMSE}}=g^*(y), if and only if

\operatorname{E}\{(\hat{x}_{\operatorname{MMSE}}-x)g(y)\}=0

for all g(y) in the closed, linear subspace \mathcal{V}=\{g(y)\mid g:\mathbb{R}^m\to\mathbb{R},\ \operatorname{E}\{g(y)^2\}<+\infty\} of the measurements. For random vectors, since the MSE for estimation of a random vector is the sum of the MSEs of the coordinates, finding the MMSE estimator of a random vector decomposes into finding the MMSE estimators of its coordinates separately:

\operatorname{E}\{(g_i^*(y)-x_i)g_j(y)\}=0,

for all i and j. More succinctly put, the cross-correlation between the minimum estimation error \hat{x}_{\operatorname{MMSE}}-x and the estimator \hat{x} should be zero,

\operatorname{E}\{(\hat{x}_{\operatorname{MMSE}}-x)\hat{x}^T\}=0.

If x and y are jointly Gaussian, then the MMSE estimator is linear, i.e., it has the form Wy+b for a matrix W and a constant b. This can be shown directly using Bayes' theorem. As a consequence, to find the MMSE estimator, it is sufficient to find the linear MMSE estimator.
In many cases, it is not possible to determine the analytical expression of the MMSE estimator. Two basic numerical approaches to obtain the MMSE estimate depend on either finding the conditional expectation \operatorname{E}\{x\mid y\} or finding the minima of the MSE.
One possibility is to abandon the full optimality requirements and seek a technique minimizing the MSE within a particular class of estimators, such as the class of linear estimators. Thus, we postulate that the conditional expectation of x given y is a simple linear function of y, \operatorname{E}\{x\mid y\}=Wy+b, where the measurement y is a random vector, W is a matrix and b is a vector. This can be seen as a first-order Taylor approximation of \operatorname{E}\{x\mid y\}. The linear MMSE estimator is the estimator achieving minimum MSE among all estimators of such form. That is, it solves the following optimization problem:

\min_{W,b}\operatorname{MSE}\quad\text{s.t.}\quad\hat{x}=Wy+b.
One advantage of such a linear MMSE estimator is that it is not necessary to explicitly calculate the posterior probability density function of x. Such a linear estimator depends only on the first two moments of x and y. So although it may be convenient to assume that x and y are jointly Gaussian, it is not necessary to make this assumption, so long as the assumed distribution has well-defined first and second moments.
The expression for the optimal b and W is given by

b=\bar{x}-W\bar{y},

W=C_{XY}C_Y^{-1},

where \bar{x}=\operatorname{E}\{x\}, \bar{y}=\operatorname{E}\{y\}, C_{XY} is the cross-covariance matrix between x and y, and C_Y is the auto-covariance matrix of y.
Thus, the expression for the linear MMSE estimator, its mean, and its auto-covariance is given by

\hat{x}=C_{XY}C_Y^{-1}(y-\bar{y})+\bar{x},

\operatorname{E}\{\hat{x}\}=\bar{x},

C_{\hat{X}}=C_{XY}C_Y^{-1}C_{YX},

where C_{YX} is the cross-covariance matrix between y and x.
Lastly, the error covariance and minimum mean square error achievable by such an estimator are

C_e=C_X-C_{\hat{X}}=C_X-C_{XY}C_Y^{-1}C_{YX},

\operatorname{LMMSE}=\operatorname{tr}\{C_e\}.
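As a concrete illustration of these formulas, the following minimal sketch (assuming NumPy; the moment values and the observation below are hypothetical choices made only for demonstration) computes the linear MMSE estimate, its error covariance, and the LMMSE directly from the first two moments.

```python
import numpy as np

# Illustrative first and second moments (values chosen only for demonstration).
x_bar = np.array([1.0, 0.0])                 # E{x}
y_bar = np.array([0.5, 0.5, 0.0])            # E{y}
C_X  = np.array([[2.0, 0.3],
                 [0.3, 1.0]])                # auto-covariance of x
C_Y  = np.array([[1.5, 0.2, 0.1],
                 [0.2, 1.2, 0.0],
                 [0.1, 0.0, 1.0]])           # auto-covariance of y
C_XY = np.array([[0.8, 0.1, 0.0],
                 [0.0, 0.4, 0.2]])           # cross-covariance between x and y

def lmmse_estimate(y, x_bar, y_bar, C_XY, C_Y):
    """Linear MMSE estimate: x_hat = x_bar + C_XY C_Y^{-1} (y - y_bar)."""
    W = C_XY @ np.linalg.inv(C_Y)            # W = C_XY C_Y^{-1}
    return x_bar + W @ (y - y_bar), W

y_obs = np.array([0.7, 0.9, -0.1])           # a hypothetical observation
x_hat, W = lmmse_estimate(y_obs, x_bar, y_bar, C_XY, C_Y)

# Error covariance and minimum mean square error: C_e = C_X - C_XY C_Y^{-1} C_YX.
C_e = C_X - W @ C_XY.T                       # using C_YX = C_XY^T
lmmse = np.trace(C_e)
print(x_hat, lmmse)
```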
Let us have the optimal linear MMSE estimator given as \hat{x}=Wy+b, where we are required to find the expressions for W and b. It is required that the MMSE estimator be unbiased. This means \operatorname{E}\{\hat{x}\}=\operatorname{E}\{x\}. Plugging the expression for \hat{x} into the above, we get

b=\bar{x}-W\bar{y},

where \bar{x}=\operatorname{E}\{x\} and \bar{y}=\operatorname{E}\{y\}. Thus we can re-write the estimator as

\hat{x}=W(y-\bar{y})+\bar{x}

and the expression for estimation error becomes

\hat{x}-x=W(y-\bar{y})-(x-\bar{x}).

From the orthogonality principle, we can have \operatorname{E}\{(\hat{x}-x)(y-\bar{y})^T\}=0, where we take g(y)=y-\bar{y}. Here the left-hand-side term is

\begin{align}
\operatorname{E}\{(\hat{x}-x)(y-\bar{y})^T\}&=\operatorname{E}\{(W(y-\bar{y})-(x-\bar{x}))(y-\bar{y})^T\}\\
&=W\operatorname{E}\{(y-\bar{y})(y-\bar{y})^T\}-\operatorname{E}\{(x-\bar{x})(y-\bar{y})^T\}\\
&=WC_Y-C_{XY}.
\end{align}
When equated to zero, we obtain the desired expression for W as

W=C_{XY}C_Y^{-1}.

Here C_{XY} is the cross-covariance matrix between X and Y, and C_Y is the auto-covariance matrix of Y. Since C_{XY}^T=C_{YX}, the expression can also be re-written in terms of C_{YX} as

W^T=C_Y^{-1}C_{YX}.

Thus the full expression for the linear MMSE estimator is

\hat{x}=C_{XY}C_Y^{-1}(y-\bar{y})+\bar{x}.
Since the estimate \hat{x} is itself a random variable with \operatorname{E}\{\hat{x}\}=\bar{x}, we can also obtain its auto-covariance as

\begin{align}
C_{\hat{X}}&=\operatorname{E}\{(\hat{x}-\bar{x})(\hat{x}-\bar{x})^T\}\\
&=W\operatorname{E}\{(y-\bar{y})(y-\bar{y})^T\}W^T\\
&=WC_YW^T.
\end{align}

Putting in the expressions for W and W^T, we get

C_{\hat{X}}=C_{XY}C_Y^{-1}C_{YX}.
Lastly, the covariance of the linear MMSE estimation error will then be given by

\begin{align}
C_e&=\operatorname{E}\{(\hat{x}-x)(\hat{x}-x)^T\}\\
&=\operatorname{E}\{(\hat{x}-x)(W(y-\bar{y})-(x-\bar{x}))^T\}\\
&=\underbrace{\operatorname{E}\{(\hat{x}-x)(y-\bar{y})^T\}}_{0}W^T-\operatorname{E}\{(\hat{x}-x)(x-\bar{x})^T\}\\
&=-\operatorname{E}\{(W(y-\bar{y})-(x-\bar{x}))(x-\bar{x})^T\}\\
&=\operatorname{E}\{(x-\bar{x})(x-\bar{x})^T\}-W\operatorname{E}\{(y-\bar{y})(x-\bar{x})^T\}\\
&=C_X-WC_{YX}.
\end{align}

The first term in the third line is zero due to the orthogonality principle. Since W=C_{XY}C_Y^{-1}, we can re-write C_e in terms of covariance matrices as

C_e=C_X-C_{XY}C_Y^{-1}C_{YX}.

This we can recognize to be the same as C_e=C_X-C_{\hat{X}}. Thus the minimum mean square error achievable by such a linear estimator is

\operatorname{LMMSE}=\operatorname{tr}\{C_e\}.
For the special case when both x and y are scalars, the above relations simplify to

\hat{x}=\frac{\sigma_{XY}}{\sigma_Y^2}(y-\bar{y})+\bar{x}=\rho\frac{\sigma_X}{\sigma_Y}(y-\bar{y})+\bar{x},

\sigma_e^2=\sigma_X^2-\frac{\sigma_{XY}^2}{\sigma_Y^2}=(1-\rho^2)\sigma_X^2,

where \rho=\frac{\sigma_{XY}}{\sigma_X\sigma_Y} is the Pearson correlation coefficient between x and y.

The above two equations allow us to interpret the correlation coefficient either as the normalized slope of linear regression,

\left(\frac{\hat{x}-\bar{x}}{\sigma_X}\right)=\rho\left(\frac{y-\bar{y}}{\sigma_Y}\right),

or as the square root of the ratio of two variances,

\rho^2=\frac{\sigma_X^2-\sigma_e^2}{\sigma_X^2}=\frac{\sigma_{\hat{X}}^2}{\sigma_X^2}.

When \rho=0, we have \hat{x}=\bar{x} and \sigma_e^2=\sigma_X^2, so the observation provides no improvement over the prior mean of x; whereas when \rho=\pm1, we have \hat{x}=\frac{\sigma_{XY}}{\sigma_Y^2}(y-\bar{y})+\bar{x} and \sigma_e^2=0, so x is completely determined by y.
Standard methods like Gaussian elimination can be used to solve the matrix equation for W. A more numerically stable method is provided by the QR decomposition. Since the matrix C_Y is symmetric positive definite, W can also be solved efficiently with the Cholesky decomposition, while for large sparse systems the conjugate gradient method is more effective. Levinson recursion is a fast method when C_Y is also a Toeplitz matrix; this can happen when y is a wide-sense stationary process. In such stationary cases, these estimators are also referred to as Wiener–Kolmogorov filters.
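A minimal sketch of this computation (assuming NumPy/SciPy and a generic symmetric positive definite C_Y with illustrative values) solves C_Y W^T = C_{YX} with a Cholesky factorization rather than forming C_Y^{-1} explicitly.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def lmmse_weights(C_Y, C_YX):
    """Solve C_Y W^T = C_YX via Cholesky factorization (C_Y must be SPD)."""
    c, low = cho_factor(C_Y)         # C_Y = L L^T
    W_T = cho_solve((c, low), C_YX)  # forward/back substitution, no explicit inverse
    return W_T.T                     # W = C_XY C_Y^{-1}

# Illustrative SPD covariance (values chosen only for demonstration).
C_Y  = np.array([[4.0, 1.0, 0.5],
                 [1.0, 3.0, 0.2],
                 [0.5, 0.2, 2.0]])
C_YX = np.array([[1.0],
                 [0.5],
                 [0.2]])             # cross-covariance column for a scalar x

W = lmmse_weights(C_Y, C_YX)
print(W)
```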
Let us further model the underlying process of observation as a linear process: y=Ax+z, where A is a known matrix and z is a random noise vector with mean \operatorname{E}\{z\}=0 and cross-covariance C_{XZ}=0. Here the required mean and covariance matrices are

\operatorname{E}\{y\}=A\bar{x},

C_Y=AC_XA^T+C_Z,

C_{XY}=C_XA^T.
Thus the expression for the linear MMSE estimator matrix W further modifies to

W=C_XA^T(AC_XA^T+C_Z)^{-1}.

Putting everything into the expression for \hat{x}, we get

\hat{x}=C_XA^T(AC_XA^T+C_Z)^{-1}(y-A\bar{x})+\bar{x}.

Lastly, the error covariance is

C_e=C_X-C_{\hat{X}}=C_X-C_XA^T(AC_XA^T+C_Z)^{-1}AC_X.
The significant difference between the estimation problem treated above and those of least squares and the Gauss–Markov estimate is that the number of observations m (i.e. the dimension of y) need not be at least as large as the number of unknowns n (i.e. the dimension of x). The estimate for the linear observation process exists so long as the m \times m matrix (AC_XA^T+C_Z)^{-1} exists; this is the case for any m if, for instance, C_Z is positive definite. Physically the reason for this property is that since x is now a random variable, it is possible to form a meaningful estimate (namely its mean) even with no measurements. Every new measurement simply provides additional information which may modify our original estimate. Another feature of this estimate is that for m < n, there need be no measurement error. Thus, we may have C_Z=0, because as long as AC_XA^T is positive definite, the estimate still exists. Lastly, this technique can handle cases in which the noise is correlated.
An alternative form of expression can be obtained by using the matrix identity

C_XA^T(AC_XA^T+C_Z)^{-1}=(A^TC_Z^{-1}A+C_X^{-1})^{-1}A^TC_Z^{-1},

which can be established by post-multiplying by (AC_XA^T+C_Z) and pre-multiplying by (A^TC_Z^{-1}A+C_X^{-1}), to obtain
W=(A^TC_Z^{-1}A+C_X^{-1})^{-1}A^TC_Z^{-1},

C_e=(A^TC_Z^{-1}A+C_X^{-1})^{-1}.
Since W can now be written in terms of C_e as W=C_eA^TC_Z^{-1}, we have a simplified expression for \hat{x} as

\hat{x}=C_eA^TC_Z^{-1}(y-A\bar{x})+\bar{x}.
In this form the above expression can be easily compared with ridge regression, weighted least squares and the Gauss–Markov estimate. In particular, when C_X^{-1}=0, corresponding to infinite variance of the a priori information concerning x, the result W=(A^TC_Z^{-1}A)^{-1}A^TC_Z^{-1} is identical to the weighted linear least squares estimate with C_Z^{-1} as the weight matrix. Moreover, if the components of z are uncorrelated and have equal variance such that C_Z=\sigma^2I, where I is an identity matrix, then W=(A^TA)^{-1}A^T is identical to the ordinary least squares estimate. When a priori information is available as C_X^{-1}=\lambda I and the components of z are uncorrelated with unit variance, we have W=(A^TA+\lambda I)^{-1}A^T, which is identical to the ridge regression solution.
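The following sketch (NumPy; the matrix A and the covariance values are randomly generated assumptions made purely for illustration) numerically checks that the two expressions for W coincide, and that the ridge form is recovered when C_X^{-1}=\lambda I and C_Z=I.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 5
A = rng.normal(size=(m, n))                       # known observation matrix (illustrative)
C_X = np.diag([2.0, 1.0, 0.5])                    # prior covariance of x
C_Z = 0.2 * np.eye(m)                             # observation noise covariance

# Form 1: W = C_X A^T (A C_X A^T + C_Z)^{-1}
W1 = C_X @ A.T @ np.linalg.inv(A @ C_X @ A.T + C_Z)

# Form 2: W = (A^T C_Z^{-1} A + C_X^{-1})^{-1} A^T C_Z^{-1}
CZ_inv = np.linalg.inv(C_Z)
W2 = np.linalg.inv(A.T @ CZ_inv @ A + np.linalg.inv(C_X)) @ A.T @ CZ_inv
assert np.allclose(W1, W2)                        # both forms give the same gain

# Ridge-regression limit: C_X^{-1} = lambda*I and C_Z = I.
lam = 0.5
C_X2, C_Z2 = np.eye(n) / lam, np.eye(m)
W_ridge = np.linalg.inv(A.T @ A + lam * np.eye(n)) @ A.T          # ridge solution
W_form1 = C_X2 @ A.T @ np.linalg.inv(A @ C_X2 @ A.T + C_Z2)       # MMSE gain, Form 1
assert np.allclose(W_ridge, W_form1)
```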
In many real-time applications, observational data is not available in a single batch. Instead the observations are made in a sequence. One possible approach is to use the sequential observations to update an old estimate as additional data becomes available, leading to finer estimates. One crucial difference between batch estimation and sequential estimation is that sequential estimation requires an additional Markov assumption.
In the Bayesian framework, such recursive estimation is easily facilitated using Bayes' rule. Given k observations, y_1,\ldots,y_k, Bayes' rule gives us the posterior density of x_k as

\begin{align}
p(x_k|y_1,\ldots,y_k)&\propto p(y_k|x_k,y_1,\ldots,y_{k-1})\,p(x_k|y_1,\ldots,y_{k-1})\\
&=p(y_k|x_k)\,p(x_k|y_1,\ldots,y_{k-1}).
\end{align}

Here p(x_k|y_1,\ldots,y_k) is the posterior density, p(y_k|x_k) is the likelihood function, and p(x_k|y_1,\ldots,y_{k-1}) is the prior density of the k-th time step. We have assumed the conditional independence of y_k from the previous observations y_1,\ldots,y_{k-1} given x_k:

p(y_k|x_k,y_1,\ldots,y_{k-1})=p(y_k|x_k).

This is the Markov assumption.

The MMSE estimate \hat{x}_k given the k-th observation is then the mean of the posterior density p(x_k|y_1,\ldots,y_k). With the lack of dynamical information on how the state x changes with time, we make a further stationarity assumption about the prior:

p(x_k|y_1,\ldots,y_{k-1})=p(x_{k-1}|y_1,\ldots,y_{k-1}).

Thus, the prior density for the k-th time step is the posterior density of the (k-1)-th time step. This structure allows us to formulate a recursive approach to estimation.
In the context of the linear MMSE estimator, the formula for the estimate will have the same form as before:

\hat{x}=C_{XY}C_Y^{-1}(y-\bar{y})+\bar{x}.

However, the mean and covariance matrices of X and Y will need to be replaced by those of the prior density p(x_k|y_1,\ldots,y_{k-1}) and the likelihood p(y_k|x_k), respectively.
For the prior density p(x_k|y_1,\ldots,y_{k-1}), its mean is given by the previous MMSE estimate,

\bar{x}_k=\operatorname{E}[x_k|y_1,\ldots,y_{k-1}]=\operatorname{E}[x_{k-1}|y_1,\ldots,y_{k-1}]=\hat{x}_{k-1},

and its covariance matrix is given by the previous error covariance matrix,

C_{X_k|Y_1,\ldots,Y_{k-1}}=C_{X_{k-1}|Y_1,\ldots,Y_{k-1}}=C_{e_{k-1}},

as per the properties of MMSE estimators and the stationarity assumption.
Similarly, for the linear observation process, the mean of the likelihood p(y_k|x_k) is given by \bar{y}_k=A\bar{x}_k=A\hat{x}_{k-1} and the covariance matrix is as before,

C_{Y_k|X_k}=AC_{X_k|Y_1,\ldots,Y_{k-1}}A^T+C_Z=AC_{e_{k-1}}A^T+C_Z.
The difference between the predicted value of Y_k, as given by \bar{y}_k=A\hat{x}_{k-1}, and its observed value y_k gives the prediction error \tilde{y}_k=y_k-\bar{y}_k, which is also referred to as the innovation or residual. It is more convenient to represent the linear MMSE in terms of the prediction error, whose mean and covariance are \operatorname{E}[\tilde{y}_k]=0 and C_{\tilde{Y}_k}=C_{Y_k|X_k}.
Hence, in the estimate update formula, we should replace \bar{x} and C_X by \hat{x}_{k-1} and C_{e_{k-1}}, respectively. Also, we should replace \bar{y} and C_Y by \bar{y}_k and C_{\tilde{Y}_k}. Lastly, we replace C_{XY} by

C_{X_kY_k|Y_1,\ldots,Y_{k-1}}=C_{e_{k-1}\tilde{Y}_k}=C_{e_{k-1}}A^T.
Thus, we have the new estimate as a new observation y_k arrives,

\begin{align}
\hat{x}_k&=\hat{x}_{k-1}+C_{e_{k-1}\tilde{Y}_k}C_{\tilde{Y}_k}^{-1}(y_k-\bar{y}_k)\\
&=\hat{x}_{k-1}+C_{e_{k-1}}A^T(AC_{e_{k-1}}A^T+C_Z)^{-1}(y_k-A\hat{x}_{k-1}),
\end{align}

and the new error covariance,

C_{e_k}=C_{e_{k-1}}-C_{e_{k-1}}A^T(AC_{e_{k-1}}A^T+C_Z)^{-1}AC_{e_{k-1}}.
From the point of view of linear algebra, for sequential estimation, if we have an estimate \hat{x}_1 based on measurements generating space Y_1, then after receiving another set of measurements, we should subtract out from these measurements that part that could be anticipated from the result of the first measurements. In other words, the updating must be based on that part of the new data which is orthogonal to the old data.
The repeated use of the above two equations as more observations become available leads to recursive estimation techniques. The expressions can be more compactly written as

W_k=C_{e_{k-1}}A^T(AC_{e_{k-1}}A^T+C_Z)^{-1},

\hat{x}_k=\hat{x}_{k-1}+W_k(y_k-A\hat{x}_{k-1}),

C_{e_k}=(I-W_kA)C_{e_{k-1}}.
The matrix W_k is often referred to as the Kalman gain factor. An alternative formulation of the above recursion is

C_{e_k}^{-1}=C_{e_{k-1}}^{-1}+A^TC_Z^{-1}A,

W_k=C_{e_k}A^TC_Z^{-1},

\hat{x}_k=\hat{x}_{k-1}+W_k(y_k-A\hat{x}_{k-1}).
The repetition of these three steps as more data becomes available leads to an iterative estimation algorithm. The generalization of this idea to non-stationary cases gives rise to the Kalman filter. The three update steps outlined above indeed form the update step of the Kalman filter.
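A minimal sketch of these three update steps (NumPy; the static-state model, its dimensions, and the simulated data below are assumptions made only to exercise the recursion) applies the gain, estimate, and covariance updates repeatedly as observations arrive.

```python
import numpy as np

def sequential_lmmse(y_seq, A, C_Z, x0, C0):
    """Recursive linear MMSE update for a static state x observed as y_k = A x + z_k."""
    x_hat, C_e = x0.copy(), C0.copy()
    for y_k in y_seq:
        S = A @ C_e @ A.T + C_Z                    # innovation covariance
        W = C_e @ A.T @ np.linalg.inv(S)           # gain: W_k = C_e A^T S^{-1}
        x_hat = x_hat + W @ (y_k - A @ x_hat)      # estimate update
        C_e = (np.eye(len(x_hat)) - W @ A) @ C_e   # error covariance update
    return x_hat, C_e

# Illustrative model (all values chosen only for demonstration).
rng = np.random.default_rng(1)
n, m = 2, 3
A = rng.normal(size=(m, n))
x_true = np.array([1.0, -0.5])
C_Z = 0.1 * np.eye(m)
y_seq = [A @ x_true + rng.multivariate_normal(np.zeros(m), C_Z) for _ in range(50)]

x_hat, C_e = sequential_lmmse(y_seq, A, C_Z, x0=np.zeros(n), C0=np.eye(n))
print(x_hat, np.trace(C_e))
```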
As an important special case, an easy-to-use recursive expression can be derived when at each k-th time instant the underlying linear observation process yields a scalar such that y_k=a_k^Tx_k+z_k, where a_k is an n \times 1 known column vector whose values can change with time, x_k is an n \times 1 random column vector to be estimated, and z_k is a scalar noise term with variance \sigma_k^2. After the (k+1)-th observation, direct use of the above recursive equations gives the expression for the estimate \hat{x}_{k+1} as

\hat{x}_{k+1}=\hat{x}_k+w_{k+1}(y_{k+1}-a_{k+1}^T\hat{x}_k),

where y_{k+1} is the new scalar observation and the gain factor w_{k+1} is an n \times 1 column vector given by

w_{k+1}=\frac{C_{e_k}a_{k+1}}{\sigma_{k+1}^2+a_{k+1}^TC_{e_k}a_{k+1}}.

The n \times n error covariance matrix C_{e_{k+1}} is given by

C_{e_{k+1}}=(I-w_{k+1}a_{k+1}^T)C_{e_k}.

Here, no matrix inversion is required. Also, the gain factor, w_{k+1}, depends on our confidence in the new data sample, as measured by the noise variance, versus that in the previous data. The initial values of \hat{x} and C_e are taken to be the mean and covariance of the a priori probability density function of x.
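A sketch of this scalar-observation recursion (NumPy; the data-generating model and its values are assumptions made only to exercise the update) shows that each step needs only vector operations and a scalar division.

```python
import numpy as np

def scalar_update(x_hat, C_e, a, y, sigma2):
    """One recursive LMMSE step for a scalar observation y = a^T x + z, var(z) = sigma2."""
    denom = sigma2 + a @ C_e @ a                        # scalar: sigma^2 + a^T C_e a
    w = (C_e @ a) / denom                               # gain vector w_{k+1}
    x_hat = x_hat + w * (y - a @ x_hat)                 # estimate update
    C_e = (np.eye(len(x_hat)) - np.outer(w, a)) @ C_e   # covariance update
    return x_hat, C_e

rng = np.random.default_rng(2)
n = 3
x_true = rng.normal(size=n)
x_hat, C_e = np.zeros(n), np.eye(n)                     # prior mean and covariance
for _ in range(200):
    a = rng.normal(size=n)                              # known regressor a_k
    sigma2 = 0.05
    y = a @ x_true + rng.normal(scale=np.sqrt(sigma2))
    x_hat, C_e = scalar_update(x_hat, C_e, a, y, sigma2)
print(x_hat, x_true)
```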
Alternative approaches: This important special case has also given rise to many other iterative methods (or adaptive filters), such as the least mean squares (LMS) filter and the recursive least squares filter, that directly solve the original MSE optimization problem using stochastic gradient descent. However, since the estimation error e cannot be directly observed, these methods try to minimize the mean squared prediction error \operatorname{E}\{\tilde{y}^T\tilde{y}\}. For instance, in the case of scalar observations, we have the gradient \nabla_{\hat{x}}\operatorname{E}\{\tilde{y}^2\}=-2\operatorname{E}\{\tilde{y}a\}. Thus, the update equation for the least mean squares filter is given by

\hat{x}_{k+1}=\hat{x}_k+\eta_k\operatorname{E}\{\tilde{y}_ka_k\},

where \eta_k is the scalar step size and the expectation is approximated by the instantaneous value \operatorname{E}\{a_k\tilde{y}_k\}\approx a_k\tilde{y}_k. As we can see, these methods bypass the need for covariance matrices.
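For comparison, here is a minimal LMS-style sketch under the same assumed scalar-observation model (the step size, model, and data are illustrative choices); the expectation in the gradient is replaced by its instantaneous value, so no covariance matrix is maintained.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
x_true = rng.normal(size=n)
x_hat = np.zeros(n)
eta = 0.05                                    # fixed step size (an illustrative choice)

for _ in range(2000):
    a = rng.normal(size=n)                    # known regressor a_k
    y = a @ x_true + rng.normal(scale=0.1)    # scalar observation with noise
    y_tilde = y - a @ x_hat                   # prediction error (innovation)
    x_hat = x_hat + eta * y_tilde * a         # LMS update: stochastic gradient step
print(x_hat, x_true)
```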
In many practical applications, the observation noise is uncorrelated. That is, C_Z is a diagonal matrix. In such cases, it is advantageous to consider the components of y as independent scalar measurements, rather than as a vector measurement. This allows us to reduce the computation time by processing the m \times 1 measurement vector as m scalar measurements. The use of the scalar update formula avoids matrix inversion in the implementation of the covariance update equations, thus improving the numerical robustness against roundoff errors. The update can be implemented iteratively as:

w_{k+1}^{(\ell)}=\frac{C_{e_{k+1}}^{(\ell-1)}(A_{k+1}^{(\ell)})^T}{C_{Z_{k+1}}^{(\ell)}+A_{k+1}^{(\ell)}C_{e_{k+1}}^{(\ell-1)}(A_{k+1}^{(\ell)})^T},

C_{e_{k+1}}^{(\ell)}=(I-w_{k+1}^{(\ell)}A_{k+1}^{(\ell)})C_{e_{k+1}}^{(\ell-1)},

\hat{x}_{k+1}^{(\ell)}=\hat{x}_{k+1}^{(\ell-1)}+w_{k+1}^{(\ell)}(y_{k+1}^{(\ell)}-A_{k+1}^{(\ell)}\hat{x}_{k+1}^{(\ell-1)}),
where \ell=1,2,\ldots,m, with the initial values C_{e_{k+1}}^{(0)}=C_{e_k} and \hat{x}_{k+1}^{(0)}=\hat{x}_k. Here C_{Z_{k+1}}^{(\ell)} is the \ell-th diagonal element of the m \times m diagonal matrix C_{Z_{k+1}}, A_{k+1}^{(\ell)} is the \ell-th row of the m \times n matrix A_{k+1}, and y_{k+1}^{(\ell)} is the \ell-th component of y_{k+1}. The final values are C_{e_{k+1}}^{(m)}=C_{e_{k+1}} and \hat{x}_{k+1}^{(m)}=\hat{x}_{k+1}.
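A sketch of this component-wise processing (NumPy; the model dimensions and values are assumptions for illustration) applies the scalar update once per row of A_{k+1}, so no matrix inverse is ever formed.

```python
import numpy as np

def measurement_update_diagonal(x_hat, C_e, A, y, cz_diag):
    """Process an m x 1 measurement y = A x + z, with diagonal noise covariance
    diag(cz_diag), as m successive scalar updates."""
    for a_row, y_l, cz_l in zip(A, y, cz_diag):
        denom = cz_l + a_row @ C_e @ a_row          # scalar innovation variance
        w = (C_e @ a_row) / denom                   # gain for this component
        x_hat = x_hat + w * (y_l - a_row @ x_hat)   # estimate update
        C_e = (np.eye(len(x_hat)) - np.outer(w, a_row)) @ C_e
    return x_hat, C_e

# Illustrative use on one measurement vector.
rng = np.random.default_rng(4)
n, m = 2, 4
A = rng.normal(size=(m, n))
cz_diag = np.full(m, 0.1)
x_true = np.array([0.8, -1.2])
y = A @ x_true + rng.normal(scale=np.sqrt(cz_diag))
x_hat, C_e = measurement_update_diagonal(np.zeros(n), np.eye(n), A, y, cz_diag)
print(x_hat)
```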
We shall take a linear prediction problem as an example. Let a linear combination of observed scalar random variables z_1, z_2 and z_3 be used to estimate another future scalar random variable z_4 such that \hat{z}_4=\sum_{i=1}^3w_iz_i. If the random variables z=[z_1,z_2,z_3,z_4]^T are real Gaussian random variables with zero mean and covariance matrix given by

\operatorname{cov}(Z)=\operatorname{E}[zz^T]=\left[\begin{array}{cccc}1&2&3&4\\2&5&8&9\\3&8&6&10\\4&9&10&15\end{array}\right],

then our task is to find the coefficients w_i such that they will yield an optimal linear estimate \hat{z}_4.
In terms of the terminology developed in the previous sections, for this problem we have the observation vector y=[z_1,z_2,z_3]^T, the estimator matrix W=[w_1,w_2,w_3] as a row vector, and the estimated variable x=z_4 as a scalar quantity. The autocorrelation matrix C_Y is defined as

C_Y=\left[\begin{array}{ccc}E[z_1z_1]&E[z_2z_1]&E[z_3z_1]\\E[z_1z_2]&E[z_2z_2]&E[z_3z_2]\\E[z_1z_3]&E[z_2z_3]&E[z_3z_3]\end{array}\right]=\left[\begin{array}{ccc}1&2&3\\2&5&8\\3&8&6\end{array}\right].

The cross-correlation matrix C_{YX} is defined as

C_{YX}=\left[\begin{array}{c}E[z_4z_1]\\E[z_4z_2]\\E[z_4z_3]\end{array}\right]=\left[\begin{array}{c}4\\9\\10\end{array}\right].
We now solve the equation C_YW^T=C_{YX} by inverting C_Y and pre-multiplying to get

C_Y^{-1}C_{YX}=\left[\begin{array}{ccc}4.85&-1.71&-0.142\\-1.71&0.428&0.2857\\-0.142&0.2857&-0.1429\end{array}\right]\left[\begin{array}{c}4\\9\\10\end{array}\right]=\left[\begin{array}{c}2.57\\-0.142\\0.5714\end{array}\right]=W^T.

So we have w_1=2.57, w_2=-0.142, and w_3=0.5714 as the optimal coefficients for \hat{z}_4. Computing the minimum mean square error then gives

\left\Vert e\right\Vert^2=\operatorname{E}[z_4z_4]-WC_{YX}=15-WC_{YX}=0.2857.

Note that it is not necessary to obtain an explicit matrix inverse of C_Y to compute the value of W; the matrix equation can instead be solved by well-known methods such as the Gauss elimination method.
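This worked example is easy to reproduce numerically; the sketch below (NumPy) solves C_Y W^T = C_{YX} directly with a linear solver and recovers the same coefficients and minimum MSE.

```python
import numpy as np

C_Y  = np.array([[1.0, 2.0, 3.0],
                 [2.0, 5.0, 8.0],
                 [3.0, 8.0, 6.0]])
C_YX = np.array([4.0, 9.0, 10.0])

W = np.linalg.solve(C_Y, C_YX)        # solves C_Y W^T = C_YX without forming C_Y^{-1}
mmse = 15.0 - W @ C_YX                # E[z4 z4] - W C_YX

print(W)      # approximately [ 2.5714, -0.1429,  0.5714]
print(mmse)   # approximately 0.2857
```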
Consider a vector y formed by taking N observations of a fixed but unknown scalar parameter x disturbed by white Gaussian noise. We can describe the process by a linear equation y=1x+z, where 1=[1,1,\ldots,1]^T. Depending on the context it will be clear whether 1 represents a scalar or a vector. Suppose that we know [-x_0,x_0] to be the range within which the value of x is going to fall. We can model our uncertainty of x by an a priori uniform distribution over the interval [-x_0,x_0], and thus x will have a variance of \sigma_X^2=x_0^2/3. Let the noise vector z be normally distributed as N(0,\sigma_Z^2I), where I is an identity matrix. Also, x and z are independent and C_{XZ}=0. It is easy to see that

\begin{align}
&\operatorname{E}\{y\}=0,\\
&C_Y=\operatorname{E}\{yy^T\}=\sigma_X^211^T+\sigma_Z^2I,\\
&C_{XY}=\operatorname{E}\{xy^T\}=\sigma_X^21^T.
\end{align}
Thus, the linear MMSE estimator is given by

\begin{align}
\hat{x}&=C_{XY}C_Y^{-1}y\\
&=\sigma_X^21^T(\sigma_X^211^T+\sigma_Z^2I)^{-1}y.
\end{align}

We can simplify the expression by using the alternative form for W as

\begin{align}
\hat{x}&=\left(1^T\frac{1}{\sigma_Z^2}I1+\frac{1}{\sigma_X^2}\right)^{-1}1^T\frac{1}{\sigma_Z^2}Iy\\
&=\frac{1}{\sigma_Z^2}\left(\frac{N}{\sigma_Z^2}+\frac{1}{\sigma_X^2}\right)^{-1}1^Ty\\
&=\frac{\sigma_X^2}{\sigma_X^2+\sigma_Z^2/N}\bar{y},
\end{align}

where for y=[y_1,y_2,\ldots,y_N]^T, \bar{y}=\frac{1^Ty}{N}=\frac{\sum_{i=1}^Ny_i}{N}.
Similarly, the variance of the estimator is

\sigma_{\hat{X}}^2=C_{XY}C_Y^{-1}C_{YX}=\frac{\sigma_X^2}{\sigma_X^2+\sigma_Z^2/N}\sigma_X^2.

Thus the MMSE of this linear estimator is

\operatorname{LMMSE}=\sigma_X^2-\sigma_{\hat{X}}^2=\frac{\sigma_Z^2}{\sigma_X^2+\sigma_Z^2/N}\cdot\frac{\sigma_X^2}{N}.

For very large N, we see that the MMSE estimator of a scalar with a uniform a priori distribution can be approximated by the arithmetic average of all the observed data,

\hat{x}=\frac{1}{N}\sum_{i=1}^Ny_i,

while the variance will be unaffected by the data, \sigma_{\hat{X}}^2=\sigma_X^2, and the LMMSE of the estimate will tend to zero.
However, the estimator is suboptimal since it is constrained to be linear. Had the random variable x also been Gaussian, then the estimator would have been optimal.
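A quick simulation of this example (NumPy; the values of x_0, \sigma_Z, and N below are illustrative choices) compares the shrunken LMMSE estimate with the plain sample average.

```python
import numpy as np

rng = np.random.default_rng(5)
x0, sigma_z, N = 2.0, 1.0, 25
sigma_x2 = x0**2 / 3.0                      # variance of uniform prior on [-x0, x0]

x = rng.uniform(-x0, x0)                    # the unknown scalar
y = x + rng.normal(scale=sigma_z, size=N)   # y = 1*x + z

y_bar = y.mean()
x_hat = sigma_x2 / (sigma_x2 + sigma_z**2 / N) * y_bar   # linear MMSE estimate
lmmse = (sigma_z**2 / (sigma_x2 + sigma_z**2 / N)) * (sigma_x2 / N)

print(x, x_hat, y_bar, lmmse)
```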
Consider a variation of the above example: Two candidates are standing for an election. Let the fraction of votes that a candidate will receive on election day be x\in[0,1]. Thus the fraction of votes the other candidate will receive will be 1-x. We shall take x as a random variable with a uniform prior distribution over [0,1], so that its mean is \bar{x}=1/2 and its variance is \sigma_X^2=1/12. Two independent polls are conducted: the first pollster reports a fraction y_1 with an error z_1 of zero mean and variance \sigma_{Z_1}^2, and the second reports y_2 with an error z_2 of zero mean and variance \sigma_{Z_2}^2. Note that, except for the mean and variance of the errors, the error distributions are unspecified. How should the two polls be combined to obtain the voting prediction for the given candidate?

As with the previous example, we have

\begin{align}
y_1&=x+z_1,\\
y_2&=x+z_2.
\end{align}

Here, both \operatorname{E}\{y_1\}=\operatorname{E}\{y_2\}=\bar{x}=1/2. Thus, we can obtain the LMMSE estimate as the linear combination of y_1 and y_2 as

\hat{x}=w_1(y_1-\bar{x})+w_2(y_2-\bar{x})+\bar{x},
where the weights are given by

\begin{align}
w_1&=\frac{1/\sigma_{Z_1}^2}{1/\sigma_{Z_1}^2+1/\sigma_{Z_2}^2+1/\sigma_X^2},\\
w_2&=\frac{1/\sigma_{Z_2}^2}{1/\sigma_{Z_1}^2+1/\sigma_{Z_2}^2+1/\sigma_X^2}.
\end{align}

Here, since the denominator term is constant, the poll with the lower error is given a higher weight in order to predict the election outcome. Lastly, the variance of \hat{x} is given by

\sigma_{\hat{X}}^2=\frac{1/\sigma_{Z_1}^2+1/\sigma_{Z_2}^2}{1/\sigma_{Z_1}^2+1/\sigma_{Z_2}^2+1/\sigma_X^2}\sigma_X^2,

which makes \sigma_{\hat{X}}^2 smaller than \sigma_X^2. Thus, the LMMSE is given by

\operatorname{LMMSE}=\sigma_X^2-\sigma_{\hat{X}}^2=\frac{1}{1/\sigma_{Z_1}^2+1/\sigma_{Z_2}^2+1/\sigma_X^2}.

In general, if we have N pollsters, then

\hat{x}=\sum_{i=1}^Nw_i(y_i-\bar{x})+\bar{x},

where the weight for the i-th pollster is

w_i=\frac{1/\sigma_{Z_i}^2}{\sum_{j=1}^N1/\sigma_{Z_j}^2+1/\sigma_X^2},

and the LMMSE is

\operatorname{LMMSE}=\frac{1}{\sum_{j=1}^N1/\sigma_{Z_j}^2+1/\sigma_X^2}.
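A short sketch of this precision-weighted combination (NumPy; the poll values and error variances are illustrative assumptions) computes the combined prediction and its LMMSE.

```python
import numpy as np

x_bar, sigma_x2 = 0.5, 1.0 / 12.0           # uniform prior on [0, 1]
y = np.array([0.53, 0.48])                  # two poll results (illustrative values)
sigma_z2 = np.array([0.01, 0.02])           # declared error variances of the polls

precision = 1.0 / sigma_x2 + np.sum(1.0 / sigma_z2)
w = (1.0 / sigma_z2) / precision            # weights: inversely proportional to error variance
x_hat = x_bar + np.sum(w * (y - x_bar))     # combined prediction
lmmse = 1.0 / precision                     # minimum mean square error

print(x_hat, lmmse)
```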
Suppose that a musician is playing an instrument and that the sound is received by two microphones, each of them located at a different place. Let the attenuation of sound due to distance at each microphone be a_1 and a_2, which are assumed to be known constants. Similarly, let the noise at each microphone be z_1 and z_2, each with zero mean and variances \sigma_{Z_1}^2 and \sigma_{Z_2}^2, respectively. Let x denote the sound produced by the musician, a random variable with zero mean and variance \sigma_X^2. How should the recorded music from these two microphones be combined, after being synced with each other?

We can model the sound received by each microphone as

\begin{align}
y_1&=a_1x+z_1,\\
y_2&=a_2x+z_2.
\end{align}

Here both \operatorname{E}\{y_1\}=\operatorname{E}\{y_2\}=0. Thus, we can combine the two sounds as

y=w_1y_1+w_2y_2,

where the i-th weight is given as

w_i=\frac{a_i/\sigma_{Z_i}^2}{\sum_ja_j^2/\sigma_{Z_j}^2+1/\sigma_X^2}.
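A final sketch (NumPy; the attenuations, noise variances, and simulated source signal are assumptions made only for illustration) forms this weighted combination sample by sample and reports its empirical MSE.

```python
import numpy as np

rng = np.random.default_rng(6)
a = np.array([0.9, 0.6])                    # known attenuations at the two microphones
sigma_z2 = np.array([0.02, 0.05])           # noise variances at the two microphones
sigma_x2 = 1.0                              # variance of the (zero-mean) source signal

T = 1000
x = rng.normal(scale=np.sqrt(sigma_x2), size=T)                   # source signal
y1 = a[0] * x + rng.normal(scale=np.sqrt(sigma_z2[0]), size=T)
y2 = a[1] * x + rng.normal(scale=np.sqrt(sigma_z2[1]), size=T)

w = (a / sigma_z2) / (np.sum(a**2 / sigma_z2) + 1.0 / sigma_x2)   # LMMSE weights
x_hat = w[0] * y1 + w[1] * y2                                     # combined estimate

print(np.mean((x_hat - x)**2))              # empirical MSE of the combination
```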