The sample mean (sample average) or empirical mean (empirical average), and the sample covariance or empirical covariance are statistics computed from a sample of data on one or more random variables.
The sample mean is the average value (or mean value) of a sample of numbers taken from a larger population of numbers, where "population" indicates not number of people but the entirety of relevant data, whether collected or not. A sample of 40 companies' sales from the Fortune 500 might be used for convenience instead of looking at the population, all 500 companies' sales. The sample mean is used as an estimator for the population mean, the average value in the entire population, where the estimate is more likely to be close to the population mean if the sample is large and representative. The reliability of the sample mean is estimated using the standard error, which in turn is calculated using the variance of the sample. If the sample is random, the standard error falls with the size of the sample and the sample mean's distribution approaches the normal distribution as the sample size increases.
The term "sample mean" can also be used to refer to a vector of average values when the statistician is looking at the values of several variables in the sample, e.g. the sales, profits, and employees of a sample of Fortune 500 companies. In this case, there is not just a sample variance for each variable but a sample variance-covariance matrix (or simply covariance matrix) showing also the relationship between each pair of variables. This would be a 3×3 matrix when 3 variables are being considered. The sample covariance is useful in judging the reliability of the sample means as estimators and is also useful as an estimate of the population covariance matrix.
Due to their ease of calculation and other desirable characteristics, the sample mean and sample covariance are widely used in statistics to represent the location and dispersion of the distribution of values in the sample, and to estimate the values for the population.
The sample mean is the average of the values of a variable in a sample, which is the sum of those values divided by the number of values. Using mathematical notation, if a sample of N observations on variable X is taken from the population, the sample mean is:
$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i.$$
Under this definition, if the sample (1, 4, 1) is taken from the population (1,1,3,4,0,2,1,0), then the sample mean is
$$\bar{x} = (1+4+1)/3 = 2,$$
as compared to the population mean of
$$\mu = (1+1+3+4+0+2+1+0)/8 = 12/8 = 1.5.$$
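The arithmetic in this example can be checked with a short Python sketch that uses only the standard library. The variable names are illustrative and do not come from the original text.

```python
from statistics import fmean

population = [1, 1, 3, 4, 0, 2, 1, 0]
sample = [1, 4, 1]

# Sample mean: sum of the sampled values divided by the number of values.
sample_mean = sum(sample) / len(sample)

# Population mean, computed the same way over all eight values.
population_mean = fmean(population)

print(sample_mean, population_mean)  # prints 2.0 1.5
```

The sample mean (2) overshoots the population mean (1.5) here because this small sample happens to include the largest value in the population.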
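If the statistician is interested in K variables rather than one, each observation having a value for each of those K variables, the overall sample mean consists of K sample means for individual variables. Let $x_{ij}$ be the $i$th observation ($i=1,\ldots,N$) on the $j$th variable ($j=1,\ldots,K$), and let $\mathbf{x}_i$ be the $K\times 1$ column vector holding the $i$th observation of all K variables.
The sample mean vector $\bar{\mathbf{x}}$ is a column vector whose $j$th element $\bar{x}_j$ is the average of the N observations of the $j$th variable:
$$\bar{x}_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij}, \quad j=1,\ldots,K.$$
Thus, the sample mean vector contains the average of the observations for each variable, and is written
$$\bar{\mathbf{x}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i = \begin{bmatrix}\bar{x}_1 \\ \vdots \\ \bar{x}_K\end{bmatrix}.$$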
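As a minimal sketch of the sample mean vector, the following Python code averages each variable separately. The data values and the function name `sample_mean_vector` are made up for illustration, and observations are stored as rows purely for convenience.

```python
# Rows are observations (N = 4), columns are variables (K = 3).
# The numbers are hypothetical.
observations = [
    [2.0, 10.0, 1.0],
    [4.0, 14.0, 3.0],
    [6.0, 12.0, 2.0],
    [8.0, 16.0, 6.0],
]


def sample_mean_vector(rows):
    """Return the list of per-variable sample means."""
    n = len(rows)
    k = len(rows[0])
    return [sum(row[j] for row in rows) / n for j in range(k)]


mean_vector = sample_mean_vector(observations)
print(mean_vector)  # prints [5.0, 13.0, 3.0]
```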
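The sample covariance matrix is a K-by-K matrix $Q=\left[q_{jk}\right]$ with entries
$$q_{jk} = \frac{1}{N-1}\sum_{i=1}^{N}\left(x_{ij}-\bar{x}_j\right)\left(x_{ik}-\bar{x}_k\right),$$
where $q_{jk}$ is an estimate of the covariance between the $j$th and the $k$th variables of the population underlying the data. In terms of the observation vectors, the sample covariance is
$$Q = \frac{1}{N-1}\sum_{i=1}^{N}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^{\mathrm{T}}.$$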
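The entry-by-entry definition of the sample covariance matrix can be sketched in plain Python as follows. The data and the function name `sample_covariance` are hypothetical, and the diagonal is compared against `statistics.variance`, which also divides by N - 1.

```python
from statistics import variance


def sample_covariance(rows):
    """Return the K x K sample covariance matrix (denominator N - 1)."""
    n = len(rows)
    k = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(k)]
    return [
        [
            sum((row[j] - means[j]) * (row[l] - means[l]) for row in rows) / (n - 1)
            for l in range(k)
        ]
        for j in range(k)
    ]


# Hypothetical data: N = 4 observations (rows) on K = 2 variables (columns).
data = [[2.0, 1.0], [4.0, 3.0], [6.0, 2.0], [8.0, 6.0]]
Q = sample_covariance(data)

# Each diagonal entry should agree with the ordinary sample variance.
diagonal_matches = all(
    abs(Q[j][j] - variance([row[j] for row in data])) < 1e-12 for j in range(2)
)
print(Q, diagonal_matches)
```

The diagonal entries are the sample variances of the individual variables, and the matrix is symmetric because swapping j and k leaves each product unchanged.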
Alternatively, arranging the observation vectors as the columns of a matrix, so that
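$$F=\begin{bmatrix}\mathbf{x}_1 & \mathbf{x}_2 & \dots & \mathbf{x}_N\end{bmatrix}$$
is a matrix of K rows and N columns, the sample covariance matrix can be computed as
$$Q = \frac{1}{N-1}\left(F-\bar{\mathbf{x}}\,\mathbf{1}_N^{\mathrm{T}}\right)\left(F-\bar{\mathbf{x}}\,\mathbf{1}_N^{\mathrm{T}}\right)^{\mathrm{T}},$$
where $\mathbf{1}_N$ is an N-by-1 vector of ones. If the observations are arranged as rows instead of columns, so that $\bar{\mathbf{x}}$ is now a 1×K row vector and $M=F^{\mathrm{T}}$ is an N×K matrix whose column j holds the N observations on variable j, then
$$Q = \frac{1}{N-1}\left(M-\mathbf{1}_N\bar{\mathbf{x}}\right)^{\mathrm{T}}\left(M-\mathbf{1}_N\bar{\mathbf{x}}\right).$$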
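The row-arranged matrix form can be sketched without any external library by centering the N×K data matrix and multiplying its transpose by itself. This is an illustrative sketch only; the function name `covariance_from_rows` and the data are assumptions, and a numerical library would normally be used for the matrix product.

```python
def covariance_from_rows(M):
    """Q = (M - 1_N x_bar)^T (M - 1_N x_bar) / (N - 1) for an N x K list of rows."""
    n, k = len(M), len(M[0])
    # 1 x K row vector of column means.
    x_bar = [sum(column) / n for column in zip(*M)]
    # Centered matrix M - 1_N x_bar: subtract the mean row from every row.
    centered = [[row[j] - x_bar[j] for j in range(k)] for row in M]
    # Columns of the centered matrix (the rows of its transpose).
    columns = list(zip(*centered))
    # Entry (j, l) of the product is the dot product of columns j and l.
    return [
        [
            sum(a * b for a, b in zip(columns[j], columns[l])) / (n - 1)
            for l in range(k)
        ]
        for j in range(k)
    ]


# Hypothetical data: N = 4 observations (rows) on K = 2 variables (columns).
M = [[2.0, 1.0], [4.0, 3.0], [6.0, 2.0], [8.0, 6.0]]
Q = covariance_from_rows(M)
print(Q)
```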
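Like covariance matrices of random vectors, sample covariance matrices are positive semi-definite. To see this, note that for any matrix $A$ the matrix $A^{\mathrm{T}}A$ is positive semi-definite. Furthermore, the sample covariance matrix is positive definite if and only if the rank of the vectors $\mathbf{x}_i-\bar{\mathbf{x}}$ is K.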
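The sample covariance matrix is an unbiased estimator of the covariance matrix of the random vector $X$, whose $j$th element ($j=1,\ldots,K$) is one of the random variables. The sample covariance matrix has $N-1$ in the denominator rather than $N$ because of a variant of Bessel's correction: the sample covariance relies on the difference between each observation and the sample mean, but the sample mean is slightly correlated with each observation since it is defined in terms of all observations. If the population mean $\operatorname{E}(X)$ is known, the analogous unbiased estimate
$$q_{jk}=\frac{1}{N}\sum_{i=1}^{N}\left(x_{ij}-\operatorname{E}(X_j)\right)\left(x_{ik}-\operatorname{E}(X_k)\right),$$
using the population mean, has $N$ in the denominator.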
The maximum likelihood estimate of the covariance
$$q_{jk}=\frac{1}{N}\sum_{i=1}^{N}\left(x_{ij}-\bar{x}_j\right)\left(x_{ik}-\bar{x}_k\right)$$
for the Gaussian distribution case has N in the denominator as well. The ratio of 1/N to 1/(N - 1) approaches 1 for large N, so the maximum likelihood estimate approximately equals the unbiased estimate when the sample is large.
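The relationship between the two denominators can be illustrated for a single variable with Python's standard library, where `statistics.variance` divides by N - 1 and `statistics.pvariance` divides by N. The sample values below are made up.

```python
from statistics import pvariance, variance

values = [2.0, 4.0, 6.0, 8.0]  # hypothetical sample, N = 4
n = len(values)

unbiased = variance(values)  # sum of squared deviations divided by N - 1
mle = pvariance(values)      # same sum divided by N

# The ratio of the two estimates is (N - 1) / N.
ratio = mle / unbiased

# (N - 1) / N for growing N, showing the two estimates converging.
ratios_for_large_n = [(size - 1) / size for size in (10, 100, 1000)]
print(ratio, ratios_for_large_n)
```

For this four-point sample the maximum likelihood estimate is three quarters of the unbiased one, while at N = 1000 the two differ by only 0.1%.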
For each random variable, the sample mean is a good estimator of the population mean, where a "good" estimator is defined as being efficient and unbiased. Of course the estimator will likely not be the true value of the population mean since different samples drawn from the same distribution will give different sample means and hence different estimates of the true mean. Thus the sample mean is a random variable, not a constant, and consequently has its own distribution. For a random sample of N observations on the jth random variable, the sample mean's distribution itself has mean equal to the population mean
$\operatorname{E}(X_j)$ and variance equal to $\sigma_j^2/N$, where $\sigma_j^2$ is the population variance.
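The arithmetic mean of a population, or population mean, is often denoted μ.[2] The sample mean $\bar{x}$ (the arithmetic mean of a sample of values drawn from the population) is an unbiased estimator of the population mean: for a random sample of n independent observations, its expected value is
$$\operatorname{E}(\bar{x})=\mu$$
and the variance of the sample mean is
$$\operatorname{var}(\bar{x})=\frac{\sigma^2}{n}.$$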
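Both results can be verified exactly on a small case by listing every possible sample. The sketch below assumes independent draws with replacement from the eight-value population used earlier, with each ordered sample equally likely; the sample size n = 2 is chosen only to keep the enumeration small.

```python
from fractions import Fraction
from itertools import product

population = [1, 1, 3, 4, 0, 2, 1, 0]
n = 2  # sample size, kept small so that all samples can be listed

mu = Fraction(sum(population), len(population))
sigma2 = sum((Fraction(v) - mu) ** 2 for v in population) / len(population)

# Every equally likely ordered sample of size n, drawn with replacement.
sample_means = [Fraction(sum(s), n) for s in product(population, repeat=n)]

mean_of_means = sum(sample_means) / len(sample_means)
var_of_means = sum((m - mu) ** 2 for m in sample_means) / len(sample_means)

print(mean_of_means == mu)            # expected value of the sample mean
print(var_of_means == sigma2 / n)     # variance of the sample mean
```

Exact fractions are used instead of floats so that the two equalities hold without rounding error.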
If the samples are not independent, but correlated, then special care has to be taken in order to avoid the problem of pseudoreplication.
If the population is normally distributed, then the sample mean is normally distributed as follows:
$$\bar{x}\sim N\left\{\mu,\frac{\sigma^2}{n}\right\}.$$
If the population is not normally distributed, the sample mean is nonetheless approximately normally distributed if n is large and $\sigma^2/n < +\infty$. This is a consequence of the central limit theorem.
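A rough simulation can illustrate the approximation. The sketch below draws repeated samples from an exponential distribution with rate 1 (a skewed, non-normal population with mean 1 and standard deviation 1) and counts how often the sample mean lands within 1.96 standard errors of the population mean. The distribution, sample size, seed, and replicate count are all arbitrary choices made for illustration.

```python
import random
from math import sqrt

random.seed(0)

n = 50             # observations per sample
replicates = 5000  # number of simulated samples
mu, sigma = 1.0, 1.0  # mean and standard deviation of Exponential(1)

standard_error = sigma / sqrt(n)

inside = 0
for _ in range(replicates):
    x_bar = sum(random.expovariate(1.0) for _ in range(n)) / n
    if abs(x_bar - mu) <= 1.96 * standard_error:
        inside += 1

# If the sample mean were exactly normal, this would be close to 0.95.
coverage = inside / replicates
print(coverage)
```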
See main article: Weighted mean.
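In a weighted sample, each vector $\mathbf{x}_i$ (each set of single observations on each of the K random variables) is assigned a weight $w_i\geq 0$. Without loss of generality, assume that the weights are normalized:
$$\sum_{i=1}^{N} w_i = 1.$$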
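(If they are not, divide the weights by their sum.) Then the weighted mean vector $\bar{\mathbf{x}}$ is given by
$$\bar{\mathbf{x}}=\sum_{i=1}^{N} w_i\mathbf{x}_i,$$
and the elements $q_{jk}$ of the weighted covariance matrix $Q$ are
$$q_{jk}=\frac{1}{1-\sum_{i=1}^{N}w_i^2}\sum_{i=1}^{N}w_i\left(x_{ij}-\bar{x}_j\right)\left(x_{ik}-\bar{x}_k\right).$$
If all weights are the same, $w_i=1/N$, the weighted mean and covariance reduce to the unweighted sample mean and covariance defined above, since the leading factor times $w_i$ becomes $\frac{1}{1-1/N}\cdot\frac{1}{N}=\frac{1}{N-1}$.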
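A minimal Python sketch of the weighted mean vector and weighted covariance matrix follows. It assumes the covariance is scaled by the factor 1 / (1 - Σ wᵢ²) applied to normalized weights; the function name and data are hypothetical. The check at the end confirms that equal weights reproduce the ordinary N - 1 sample covariance.

```python
def weighted_mean_and_cov(rows, weights):
    """Weighted mean vector and covariance matrix for an N x K list of rows."""
    total = sum(weights)
    w = [wi / total for wi in weights]  # normalize so the weights sum to 1
    k = len(rows[0])
    mean = [sum(wi * row[j] for wi, row in zip(w, rows)) for j in range(k)]
    # Assumed scaling factor; undefined if one weight carries all the mass.
    scale = 1.0 / (1.0 - sum(wi * wi for wi in w))
    cov = [
        [
            scale * sum(
                wi * (row[j] - mean[j]) * (row[l] - mean[l])
                for wi, row in zip(w, rows)
            )
            for l in range(k)
        ]
        for j in range(k)
    ]
    return mean, cov


# Hypothetical data: N = 4 observations (rows) on K = 2 variables (columns).
data = [[2.0, 1.0], [4.0, 3.0], [6.0, 2.0], [8.0, 6.0]]

# Equal (unnormalized) weights should give the unweighted results.
equal_mean, equal_cov = weighted_mean_and_cov(data, [1.0, 1.0, 1.0, 1.0])

# Unequal weights pull the mean toward the heavily weighted observations.
skewed_mean, skewed_cov = weighted_mean_and_cov(data, [1.0, 1.0, 1.0, 5.0])
print(equal_mean, equal_cov, skewed_mean)
```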
The sample mean and sample covariance are not robust statistics, meaning that they are sensitive to outliers. As robustness is often a desired trait, particularly in real-world applications, robust alternatives may prove desirable, notably quantile-based statistics such as the sample median for location,[4] and interquartile range (IQR) for dispersion. Other alternatives include trimming and Winsorising, as in the trimmed mean and the Winsorized mean.
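The sensitivity to outliers can be seen in a small Python example that replaces one value in a made-up data set with an extreme one. The helper `trimmed_mean` is an illustrative implementation that drops a fixed number of values from each end, which is only one of several trimming conventions.

```python
from statistics import mean, median

clean = [10.0, 11.0, 12.0, 13.0, 14.0]
contaminated = clean[:-1] + [1000.0]  # replace the last value with an outlier


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])


mean_clean = mean(clean)
mean_contaminated = mean(contaminated)       # dragged far upward by the outlier
median_contaminated = median(contaminated)   # unchanged by the outlier
trimmed_contaminated = trimmed_mean(contaminated, 1)

print(mean_clean, mean_contaminated, median_contaminated, trimmed_contaminated)
```

A single outlier moves the mean from 12 to roughly 209, while the median and the trimmed mean both stay at 12.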