In statistics, sometimes the covariance matrix of a multivariate random variable is not known but has to be estimated. Estimation of covariance matrices then deals with the question of how to approximate the actual covariance matrix on the basis of a sample from the multivariate distribution. Simple cases, where observations are complete, can be dealt with by using the sample covariance matrix. The sample covariance matrix (SCM) is an unbiased and efficient estimator of the covariance matrix if the space of covariance matrices is viewed as an extrinsic convex cone in Rp×p; however, measured using the intrinsic geometry of positive-definite matrices, the SCM is a biased and inefficient estimator.[1] In addition, if the random variable has a normal distribution, the sample covariance matrix has a Wishart distribution and a slightly differently scaled version of it is the maximum likelihood estimate. Cases involving missing data, heteroscedasticity, or autocorrelated residuals require deeper considerations. Another issue is the robustness to outliers, to which sample covariance matrices are highly sensitive.[2] [3] [4]
Statistical analyses of multivariate data often involve exploratory studies of the way in which the variables change in relation to one another and this may be followed up by explicit statistical models involving the covariance matrix of the variables. Thus the estimation of covariance matrices directly from observational data plays two roles:
Estimates of covariance matrices are required at the initial stages of principal component analysis and factor analysis, and are also involved in versions of regression analysis that treat the dependent variables in a data-set, jointly with the independent variable as the outcome of a random sample.
Given a sample consisting of n independent observations x1,..., xn of a p-dimensional random vector X ∈ Rp×1 (a p×1 column-vector), an unbiased estimator of the (p×p) covariance matrix
\operatorname{\Sigma}=\operatorname{E}\left[\left(X-\operatorname{E}[X]\right)\left(X-\operatorname{E}[X]\right)T\right]
is the sample covariance matrix
Q={1\over
n | |
{n-1}}\sum | |
i=1 |
(xi-\overline{x})(x
T, | |
i-\overline{x}) |
where
xi
\overline{x}={1\over
n | |
{n}}\sum | |
i=1 |
xi
is the sample mean.This is true regardless of the distribution of the random variable X, provided of course that the theoretical means and covariances exist. The reason for the factor n − 1 rather than n is essentially the same as the reason for the same factor appearing in unbiased estimates of sample variances and sample covariances, which relates to the fact that the mean is not known and is replaced by the sample mean (see Bessel's correction).
In cases where the distribution of the random variable X is known to be within a certain family of distributions, other estimates may be derived on the basis of that assumption. A well-known instance is when the random variable X is normally distributed: in this case the maximum likelihood estimator of the covariance matrix is slightly different from the unbiased estimate, and is given by
Qn |
={1\over
n | |
n}\sum | |
i=1 |
(xi-\overline{x})(x
T. | |
i-\overline{x}) |
A derivation of this result is given below. Clearly, the difference between the unbiased estimator and the maximum likelihood estimator diminishes for large n.
In the general case, the unbiased estimate of the covariance matrix provides an acceptable estimate when the data vectors in the observed data set are all complete: that is they contain no missing elements. One approach to estimating the covariance matrix is to treat the estimation of each variance or pairwise covariance separately, and to use all the observations for which both variables have valid values. Assuming the missing data are missing at random this results in an estimate for the covariance matrix which is unbiased. However, for many applications this estimate may not be acceptable because the estimated covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimated correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix.
When estimating the cross-covariance of a pair of signals that are wide-sense stationary, missing samples do not need be random (e.g., sub-sampling by an arbitrary factor is valid).
See main article: Multivariate normal distribution. A random vector X ∈ Rp (a p×1 "column vector") has a multivariate normal distribution with a nonsingular covariance matrix Σ precisely if Σ ∈ Rp × p is a positive-definite matrix and the probability density function of X is
| ||||
f(x)=(2\pi) |
| ||||
\det(\Sigma) |
\exp\left(-{1\over2}(x-\mu)T\Sigma-1(x-\mu)\right)
where μ ∈ Rp×1 is the expected value of X. The covariance matrix Σ is the multidimensional analog of what in one dimension would be the variance, and
| ||||
(2\pi) |
| ||||
\det(\Sigma) |
normalizes the density
f(x)
Suppose now that X1, ..., Xn are independent and identically distributed samples from the distribution above. Based on the observed values x1, ..., xn of this sample, we wish to estimate Σ.
The likelihood function is:
| ||||
l{L}(\mu,\Sigma)=(2\pi) |
n | |
\prod | |
i=1 |
| |||||
\det(\Sigma) | \exp\left(- |
1 | |
2 |
T | |
(x | |
i-\mu) |
\Sigma-1(xi-\mu)\right)
It is fairly readily shown that the maximum-likelihood estimate of the mean vector μ is the "sample mean" vector:
\overline{x}= | x1+ … +xn |
n |
.
See the section on estimation in the article on the normal distribution for details; the process here is similar.
Since the estimate
\bar{x}
l{L}(\overline{x},\Sigma)\propto
| ||||
\det(\Sigma) |
\exp\left(-{1\over2}
n | |
\sum | |
i=1 |
T | |
(x | |
i-\overline{x}) |
\Sigma-1(xi-\overline{x})\right),
and then seek the value of Σ that maximizes the likelihood of the data (in practice it is easier to work with log
l{L}
T | |
(x | |
i-\overline{x}) |
\Sigma-1(xi-\overline{x})
\begin{align} l{L}(\overline{x},\Sigma)&\propto
| ||||
\det(\Sigma) |
\exp\left(-{1\over2}
n | |
\sum | |
i=1 |
\left(\left(xi-\overline{x}\right)T\Sigma-1\left(xi-\overline{x}\right)\right)\right)
| ||||
\\ &=\det(\Sigma) |
\exp\left(-{1\over2}
n | |
\sum | |
i=1 |
\operatorname{tr}\left(\left(xi-\overline{x}\right)\left(xi-\overline{x}\right)T\Sigma-1\right)\right)
| ||||
\\ &=\det(\Sigma) |
\exp\left(-{1\over2}\operatorname{tr}\left(
n | |
\sum | |
i=1 |
\left(xi-\overline{x}\right)\left(xi-\overline{x}\right)T\Sigma-1\right)
| ||||
\right)\\ &=\det(\Sigma) |
\exp\left(-{1\over2}\operatorname{tr}\left(S\Sigma-1\right)\right) \end{align}
where
n | |
S=\sum | |
i=1 |
(xi-\overline{x})
T | |
(x | |
i-\overline{x}) |
\inRp x .
S
p
It follows from the spectral theorem of linear algebra that a positive-definite symmetric matrix S has a unique positive-definite symmetric square root S1/2. We can again use the "cyclic property" of the trace to write
| ||||
\det(\Sigma) |
\exp\left(-{1\over2}\operatorname{tr}\left(
| ||||
S |
\Sigma-1
| ||||
S |
\right)\right).
Let B = S1/2 Σ −1 S1/2. Then the expression above becomes
| ||||
\det(S) |
| ||||
\det(B) |
\exp\left(-{1\over2}\operatorname{tr}(B)\right).
The positive-definite matrix B can be diagonalized, and then the problem of finding the value of B that maximizes
| ||||
\det(B) |
\exp\left(-{1\over2}\operatorname{tr}(B)\right)
Since the trace of a square matrix equals the sum of eigenvalues ("trace and eigenvalues"), the equation reduces to the problem of finding the eigenvalues λ1, ..., λp that maximize
n | |
\prod | |
i=1 |
| ||||
λ | ||||
i |
\exp\left(-
λi | |
2 |
\right).
This is just a calculus problem and we get λi = n for all i. Thus, assume Q is the matrix of eigen vectors, then
B=Q(nIp)Q-1=nIp
i.e., n times the p×p identity matrix.
Finally we get
| ||||
\Sigma=S |
B-1
| ||||
S |
| |||||
=S | \left( |
1 | |
n |
| ||||
I | ||||
p\right)S |
=
S | |
n, |
i.e., the p×p "sample covariance matrix"
{S\overn}={1\over
n | |
n}\sum | |
i=1 |
(Xi-\overline{X})(X
T | |
i-\overline{X}) |
is the maximum-likelihood estimator of the "population covariance matrix" Σ. At this point we are using a capital X rather than a lower-case x because we are thinking of it "as an estimator rather than as an estimate", i.e., as something random whose probability distribution we could profit by knowing. The random matrix S can be shown to have a Wishart distribution with n − 1 degrees of freedom.[5] That is:
n | |
\sum | |
i=1 |
(Xi-\overline{X})(X
T | |
i-\overline{X}) |
\simWp(\Sigma,n-1).
An alternative derivation of the maximum likelihood estimator can be performed via matrix calculus formulae (see also differential of a determinant and differential of the inverse matrix). It also verifies the aforementioned fact about the maximum likelihood estimate of the mean. Re-write the likelihood in the log form using the trace trick:
lnl{L}(\mu,\Sigma)=\operatorname{constant}-{n\over2}ln\det(\Sigma)-{1\over2}\operatorname{tr}\left[\Sigma-1
n | |
\sum | |
i=1 |
(xi-\mu)
T | |
(x | |
i-\mu) |
\right].
The differential of this log-likelihood is
dlnl{L}(\mu,\Sigma)=-
n | |
2 |
\operatorname{tr}\left[\Sigma-1\left\{d\Sigma\right\}\right]-{1\over2}\operatorname{tr}\left[-\Sigma-1\{d\Sigma\}\Sigma-1
n | |
\sum | |
i=1 |
(xi-\mu)(x
T | |
i-\mu) |
-2\Sigma-1
n | |
\sum | |
i=1 |
(xi-\mu)\{d\mu\}T\right].
It naturally breaks down into the part related to the estimation of the mean, and to the part related to the estimation of the variance. The first order condition for maximum,
dlnl{L}(\mu,\Sigma)=0
d\mu
d\Sigma
\Sigma
n | |
\sum | |
i=1 |
(xi-\mu)=0,
which leads to the maximum likelihood estimator
\widehat\mu=\barX={1\overn}
n | |
\sum | |
i=1 |
Xi.
This lets us simplify
n | |
\sum | |
i=1 |
(xi-\mu)(x
T | |
i-\mu) |
=
n | |
\sum | |
i=1 |
(xi-\barx)(xi-\barx)T=S
as defined above. Then the terms involving
d\Sigma
dlnL
-{1\over2}\operatorname{tr}\left(\Sigma-1\left\{d\Sigma\right\}\left[nIp-\Sigma-1S\right]\right).
The first order condition
dlnl{L}(\mu,\Sigma)=0
\Sigma
n
\widehat\Sigma={1\overn}S,
which of course coincides with the canonical derivation given earlier.
Dwyer[6] points out that decomposition into two terms such as appears above is "unnecessary" and derives the estimator in two lines of working. Note that it may be not trivial to show that such derived estimator is the unique global maximizer for likelihood function.
Given a sample of n independent observations x1,..., xn of a p-dimensional zero-mean Gaussian random variable X with covariance R, the maximum likelihood estimator of R is given by
\hat{R
The parameter
R
E[\hat{R
\hat{R
R
ER[\hat{R
where
\expR(\hat{R
-1 | |
\exp | |
R |
(\hat{R
are the exponential map and inverse exponential map, respectively, "exp" and "log" denote the ordinary matrix exponential and matrix logarithm, and E[·] is the ordinary expectation operator defined on a vector space, in this case the tangent space of the manifold.[1]
The intrinsic bias vector field of the SCM estimator
\hat{R
B(\hat{R
The intrinsic estimator bias is then given by
\expRB(\hat{R
For complex Gaussian random variables, this bias vector field can be shown[1] to equal
B(\hat{R
where
\beta(p,n)=
1 | |
p |
\left(plogn+p-\psi(n-p+1)+(n-p+1)\psi(n-p+2)+\psi(n+1)-(n+1)\psi(n+2)\right)
and ψ(·) is the digamma function. The intrinsic bias of the sample covariance matrix equals
\expRB(\hat{R
and the SCM is asymptotically unbiased as n → ∞.
Similarly, the intrinsic inefficiency of the sample covariance matrix depends upon the Riemannian curvature of the space of positive-definite matrices.
If the sample size n is small and the number of considered variables p is large, the above empirical estimators of covariance and correlation are very unstable. Specifically, it is possible to furnish estimators that improve considerably upon the maximum likelihood estimate in terms of mean squared error. Moreover, for n < p (the number of observations is less than the number of random variables) the empirical estimate of the covariance matrix becomes singular, i.e. it cannot be inverted to compute the precision matrix.
As an alternative, many methods have been suggested to improve the estimation of the covariance matrix. All of these approaches rely on the concept of shrinkage. This is implicit in Bayesian methods and in penalized maximum likelihood methods and explicit in the Stein-type shrinkage approach.
A simple version of a shrinkage estimator of the covariance matrix is represented by the Ledoit-Wolf shrinkage estimator.[7] [8] [9] [10] One considers a convex combination of the empirical estimator (
A
B
\delta
\deltaA+(1-\delta)B
Various shrinkage targets have been proposed:
The shrinkage estimator can be generalized to a multi-target shrinkage estimator that utilizes several targets simultaneously.[11] Software for computing a covariance shrinkage estimator is available in R (packages corpcor and ShrinkCovMat), in Python (scikit-learn library http://scikit-learn.org/stable/modules/covariance.html), and in MATLAB.[12]