Law of total variance

In probability theory, the law of total variance,[1] also known as the variance decomposition formula, the conditional variance formula, the law of iterated variances, or Eve's law,[2] states that if X and Y are random variables on the same probability space, and the variance of Y is finite, then

\operatorname{Var}(Y) = \operatorname{E}[\operatorname{Var}(Y \mid X)] + \operatorname{Var}(\operatorname{E}[Y \mid X]).

In language perhaps better known to statisticians than to probability theorists, the two terms are the "unexplained" and the "explained" components of the variance respectively (cf. fraction of variance unexplained, explained variation). In actuarial science, specifically credibility theory, the first component is called the expected value of the process variance (EVPV) and the second is called the variance of the hypothetical means (VHM).[3] These two components are also the source of the term "Eve's law", from the initials EV VE for "expectation of variance" and "variance of expectation".
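As a quick numerical illustration of the identity, the following sketch (plain Python, with a small hypothetical joint distribution of (X, Y) chosen only for demonstration) computes both sides of the decomposition exactly:

    # Hypothetical joint distribution P(X = x, Y = y); the probabilities sum to 1.
    joint = {
        (0, 1): 0.10, (0, 2): 0.20, (0, 3): 0.10,
        (1, 1): 0.05, (1, 2): 0.15, (1, 3): 0.40,
    }

    def p_x(x):
        """Marginal probability P(X = x)."""
        return sum(p for (xx, _), p in joint.items() if xx == x)

    def cond_moments(x):
        """E[Y | X = x] and Var(Y | X = x)."""
        px = p_x(x)
        mean = sum(y * p for (xx, y), p in joint.items() if xx == x) / px
        var = sum((y - mean) ** 2 * p for (xx, y), p in joint.items() if xx == x) / px
        return mean, var

    xs = sorted({x for x, _ in joint})
    e_y = sum(y * p for (_, y), p in joint.items())
    var_y = sum((y - e_y) ** 2 * p for (_, y), p in joint.items())

    # E[Var(Y | X)]: within-group variances averaged over the distribution of X.
    unexplained = sum(p_x(x) * cond_moments(x)[1] for x in xs)
    # Var(E[Y | X]): variance of the group means over the distribution of X.
    explained = sum(p_x(x) * (cond_moments(x)[0] - e_y) ** 2 for x in xs)

    print(var_y, unexplained + explained)  # the two numbers agree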

Explanation

To understand the formula above, we need to comprehend the random variables \operatorname{E}[Y\mid X] and \operatorname{Var}(Y\mid X). These variables depend on the value of X: for a given x, \operatorname{E}[Y\mid X=x] and \operatorname{Var}(Y\mid X=x) are constant numbers. Essentially, we use the possible values of X to group the outcomes and then compute the expected values and variances for each group.

The "unexplained" component \operatorname{E}[\operatorname{Var}(Y\mid X)] is simply the average of the variances of Y within each group. The "explained" component \operatorname{Var}(\operatorname{E}[Y\mid X]) is the variance of the expected values, i.e., it represents the part of the variance that is explained by the variation of the average value of Y across the groups.

For an illustration, consider the example of a dog show (a selected excerpt of Analysis of variance#Example). Let the random variable Y correspond to the dog weight and X correspond to the breed. In this situation, it is reasonable to expect that the breed explains a major portion of the variance in weight, since there is a big variance in the breeds' average weights. Of course, there is still some variance in weight for each breed, which is taken into account in the "unexplained" term.

Note that the "explained" term actually means "explained by the averages." If the variances for each fixed X (e.g., for each breed in the example above) are very distinct, those variances are still combined in the "unexplained" term.
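To make the grouping idea concrete, here is a minimal sketch in Python; the breeds and weights are made up for illustration and are not taken from the referenced ANOVA example. It computes the within-group ("unexplained") and between-group ("explained") parts of the variance directly from raw data:

    import statistics

    # Hypothetical weights (kg) grouped by breed; the numbers are illustrative only.
    weights = {
        "Chihuahua": [2.0, 2.5, 3.0],
        "Beagle": [9.0, 10.0, 11.0],
        "Mastiff": [70.0, 75.0, 80.0],
    }

    all_w = [w for group in weights.values() for w in group]
    n = len(all_w)
    grand_mean = statistics.fmean(all_w)

    # "Unexplained": within-breed (population) variances, weighted by group size.
    unexplained = sum(len(g) / n * statistics.pvariance(g) for g in weights.values())
    # "Explained": variance of the breed means, weighted by group size.
    explained = sum(len(g) / n * (statistics.fmean(g) - grand_mean) ** 2
                    for g in weights.values())

    print(statistics.pvariance(all_w), unexplained + explained)  # equal up to rounding

For these made-up numbers, almost all of the total variance comes from the "explained" term, because the breed averages differ far more than the weights within any single breed.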

Examples

Example 1

Five graduate students take an exam that is graded from 0 to 100. Let Y denote the student's grade and X indicate whether the student is international or domestic. The data is summarized as follows:

Student   Y     X
1         20    International
2         30    International
3         100   International
4         40    Domestic
5         60    Domestic

Among international students, the mean is \operatorname{E}[Y\mid X=\text{International}]=50 and the variance is \operatorname{Var}(Y\mid X=\text{International})=\frac{3800}{3}=1266.\overline{6}.

Among domestic students, the mean is \operatorname{E}[Y\mid X=\text{Domestic}]=50 and the variance is \operatorname{Var}(Y\mid X=\text{Domestic})=100.

X               P(X)   E[Y | X]   Var(Y | X)
International   3/5    50         3800/3 ≈ 1266.67
Domestic        2/5    50         100

The part of the variance of Y "unexplained" by X is the mean of the variances for each group. In this case, it is

\left(\frac{3}{5}\right)\left(\frac{3800}{3}\right)+\left(\frac{2}{5}\right)(100)=800.

The part of the variance of Y "explained" by X is the variance of the means of Y inside each group defined by the values of X. In this case, it is zero, since the mean is the same for each group. So the total variance is

\operatorname{Var}(Y)=\operatorname{E}[\operatorname{Var}(Y\mid X)]+\operatorname{Var}(\operatorname{E}[Y\mid X])=800+0=800.
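The calculation can be reproduced directly from the five grades in the table; a short sketch in Python (standard library only):

    import statistics

    # Grades from the table, grouped by student type.
    grades = {"International": [20, 30, 100], "Domestic": [40, 60]}

    all_grades = [g for group in grades.values() for g in group]
    n = len(all_grades)

    # E[Var(Y | X)]: equivalently (3/5)(3800/3) + (2/5)(100).
    unexplained = sum((x - statistics.fmean(g)) ** 2
                      for g in grades.values() for x in g) / n
    # Var(E[Y | X]): variance of each student's group mean.
    explained = statistics.pvariance(
        [statistics.fmean(g) for g in grades.values() for _ in g])

    print(unexplained)                       # 800.0
    print(explained)                         # 0.0
    print(statistics.pvariance(all_grades))  # 800.0

Population variances (dividing by the group size rather than by the group size minus one) are used throughout, matching the convention of the calculation above.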

Example 2

Suppose X is a coin flip with the probability of heads being h. Suppose that when X = heads then Y is drawn from a normal distribution with mean \mu_h and standard deviation \sigma_h, and that when X = tails then Y is drawn from a normal distribution with mean \mu_t and standard deviation \sigma_t. Then the first, "unexplained" term on the right-hand side of the above formula is the weighted average of the variances, h\sigma_h^2+(1-h)\sigma_t^2, and the second, "explained" term is the variance of the distribution that gives \mu_h with probability h and gives \mu_t with probability 1-h.
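A Monte Carlo check of this decomposition, with arbitrarily chosen values of h, the means, and the standard deviations (a sketch, not part of the original example):

    import random

    random.seed(0)
    h, mu_h, sigma_h, mu_t, sigma_t = 0.3, 10.0, 2.0, -5.0, 4.0

    N = 500_000
    ys = []
    for _ in range(N):
        if random.random() < h:                 # X = heads
            ys.append(random.gauss(mu_h, sigma_h))
        else:                                   # X = tails
            ys.append(random.gauss(mu_t, sigma_t))

    mean_y = sum(ys) / N
    var_y = sum((y - mean_y) ** 2 for y in ys) / N

    # Exact right-hand side of the law of total variance:
    unexplained = h * sigma_h**2 + (1 - h) * sigma_t**2                  # E[Var(Y|X)]
    mu_bar = h * mu_h + (1 - h) * mu_t
    explained = h * (mu_h - mu_bar)**2 + (1 - h) * (mu_t - mu_bar)**2    # Var(E[Y|X])

    print(var_y)                    # Monte Carlo estimate of Var(Y)
    print(unexplained + explained)  # exact value it converges to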

Formulation

There is a general variance decomposition formula for c \geq 2 components (see below).[4] For example, with two conditioning random variables:

\operatorname{Var}[Y] = \operatorname{E}\left[\operatorname{Var}\left(Y \mid X_1, X_2\right)\right] + \operatorname{E}\left[\operatorname{Var}(\operatorname{E}\left[Y \mid X_1, X_2\right] \mid X_1)\right] + \operatorname{Var}(\operatorname{E}\left[Y \mid X_1\right]),

which follows from the law of total conditional variance:

\operatorname{Var}(Y \mid X_1) = \operatorname{E}\left[\operatorname{Var}(Y \mid X_1, X_2) \mid X_1\right] + \operatorname{Var}\left(\operatorname{E}\left[Y \mid X_1, X_2 \right] \mid X_1\right).
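As a sanity check on the two-variable formula, the sketch below (a small hypothetical discrete joint distribution of (X_1, X_2, Y), written in plain Python) computes the three terms exactly and compares their sum with \operatorname{Var}(Y):

    from collections import defaultdict

    # Hypothetical joint probabilities P(X1 = a, X2 = b, Y = y); they sum to 1.
    joint = {
        (0, 0, 1): 0.10, (0, 0, 4): 0.10,
        (0, 1, 2): 0.15, (0, 1, 6): 0.15,
        (1, 0, 3): 0.20, (1, 0, 7): 0.10,
        (1, 1, 5): 0.10, (1, 1, 9): 0.10,
    }

    def moments(weighted):
        """Mean and variance of a finite distribution given as [(value, weight), ...]."""
        total = sum(w for _, w in weighted)
        mean = sum(v * w for v, w in weighted) / total
        var = sum((v - mean) ** 2 * w for v, w in weighted) / total
        return mean, var

    _, var_y = moments([(y, p) for (_, _, y), p in joint.items()])

    # Conditional moments of Y within each (X1, X2) cell.
    cells = defaultdict(list)
    for (x1, x2, y), p in joint.items():
        cells[(x1, x2)].append((y, p))
    p_cell = {c: sum(w for _, w in vals) for c, vals in cells.items()}
    cond = {c: moments(vals) for c, vals in cells.items()}   # (E[Y|c], Var(Y|c))

    # Term 1: E[Var(Y | X1, X2)]
    term1 = sum(p_cell[c] * cond[c][1] for c in cells)

    # Term 2: E[Var(E[Y | X1, X2] | X1)]
    term2 = 0.0
    for x1 in {c[0] for c in cells}:
        sub = [(cond[c][0], p_cell[c]) for c in cells if c[0] == x1]
        term2 += sum(w for _, w in sub) * moments(sub)[1]

    # Term 3: Var(E[Y | X1])
    means_x1 = []
    for x1 in {c[0] for c in cells}:
        sub = [(y, p) for (a, _, y), p in joint.items() if a == x1]
        means_x1.append((moments(sub)[0], sum(w for _, w in sub)))
    term3 = moments(means_x1)[1]

    print(var_y, term1 + term2 + term3)  # the two values agree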

\operatorname{E}(Y\mid X) is a random variable in its own right, whose value depends on the value of X. Notice that the conditional expected value of Y given the event X=x is a function of x (this is where adherence to the conventional and rigidly case-sensitive notation of probability theory becomes important!). If we write \operatorname{E}(Y\mid X=x)=g(x), then the random variable \operatorname{E}(Y\mid X) is just g(X). Similar comments apply to the conditional variance.

One special case (similar to the law of total expectation) states that if A_1,\ldots,A_n is a partition of the whole outcome space, that is, these events are mutually exclusive and exhaustive, then

\begin{align}\operatorname{Var}(X) = {} & \sum_{i=1}^n \operatorname{Var}(X\mid A_i) \Pr(A_i) + \sum_{i=1}^n \operatorname{E}[X\mid A_i]^2 (1-\Pr(A_i))\Pr(A_i) \\
& {} - 2\sum_{i=2}^n \sum_{j=1}^{i-1} \operatorname{E}[X \mid A_i] \Pr(A_i)\operatorname{E}[X\mid A_j] \Pr(A_j).\end{align}

In this formula, the first component is the expectation of the conditional variance; the other two components are the variance of the conditional expectation.
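A quick numerical check of this special case, using a hypothetical three-event partition and conditional distributions chosen arbitrarily for the sketch:

    # Hypothetical partition A_1, A_2, A_3 and conditional distributions of X.
    pr = [0.2, 0.5, 0.3]
    cond = [([1, 2], [0.5, 0.5]),     # X | A_1
            ([0, 4], [0.25, 0.75]),   # X | A_2
            ([10], [1.0])]            # X | A_3

    def mean_var(vals, probs):
        m = sum(v * p for v, p in zip(vals, probs))
        return m, sum((v - m) ** 2 * p for v, p in zip(vals, probs))

    means = [mean_var(v, p)[0] for v, p in cond]
    varis = [mean_var(v, p)[1] for v, p in cond]

    # Left-hand side: Var(X) computed from the marginal distribution of X.
    marginal = {}
    for pa, (vals, probs) in zip(pr, cond):
        for v, p in zip(vals, probs):
            marginal[v] = marginal.get(v, 0.0) + pa * p
    var_x = mean_var(list(marginal), list(marginal.values()))[1]

    # Right-hand side: the three sums of the partition formula.
    n = len(pr)
    rhs = sum(varis[i] * pr[i] for i in range(n))
    rhs += sum(means[i] ** 2 * (1 - pr[i]) * pr[i] for i in range(n))
    rhs -= 2 * sum(means[i] * pr[i] * means[j] * pr[j]
                   for i in range(n) for j in range(i))

    print(var_x, rhs)  # equal up to floating-point rounding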

Proof

Finite Case

Let (x_1,y_1),\ldots,(x_n,y_n) be observed values of (X,Y), with repetitions.

Set \bar{y}=\operatorname{E}[Y] and, for each possible value x of X, set \bar{y}_{x}=\operatorname{E}[Y\mid X=x].

Note that

(y_i-\bar{y})^2=\left(y_i-\bar{y}_{x_i}+\bar{y}_{x_i}-\bar{y}\right)^2=(y_i-\bar{y}_{x_i})^2+(\bar{y}_{x_i}-\bar{y})^2+2(y_i-\bar{y}_{x_i})(\bar{y}_{x_i}-\bar{y}).

Summing these for 1\leq i\leq n, the last term becomes

\sum_{i=1}^n 2(y_i-\bar{y}_{x_i})(\bar{y}_{x_i}-\bar{y})=2\sum_x\left(\sum_{\{1\leq i\leq n \,\mid\, x_i=x\}}(y_i-\bar{y}_{x})\right)(\bar{y}_{x}-\bar{y})=2\sum_x 0\cdot(\bar{y}_{x}-\bar{y})=0.

Hence,

\operatorname{Var}(Y)=\frac{1}{n}\sum_{i=1}^n (y_i-\bar{y})^2=\frac{1}{n}\sum_{i=1}^n (y_i-\bar{y}_{x_i})^2+\frac{1}{n}\sum_{i=1}^n (\bar{y}_{x_i}-\bar{y})^2=\operatorname{E}[\operatorname{Var}(Y\mid X)]+\operatorname{Var}(\operatorname{E}[Y\mid X]).

General Case

The law of total variance can be proved using the law of total expectation.[5] First,

\operatorname{Var}(Y) = \operatorname{E}\left[Y^2\right] - \operatorname{E}[Y]^2

from the definition of variance. Again, from the definition of variance, and applying the law of total expectation, we have

\operatorname{E}\left[Y^2\right] = \operatorname{E}\left[\operatorname{E}[Y^2\mid X]\right] = \operatorname{E}\left[\operatorname{Var}(Y \mid X) + \operatorname{E}[Y \mid X]^2\right].

Now we rewrite the conditional second moment of Y in terms of its variance and first moment, and apply the law of total expectation on the right-hand side:

\operatorname{E}\left[Y^2\right] - \operatorname{E}[Y]^2 = \operatorname{E}\left[\operatorname{Var}(Y \mid X) + \operatorname{E}[Y \mid X]^2\right] - \operatorname{E}[\operatorname{E}[Y \mid X]]^2.

Since the expectation of a sum is the sum of expectations, the terms can now be regrouped:

= \left(\operatorname{E}[\operatorname{Var}(Y \mid X)]\right) + \left(\operatorname{E}\left[\operatorname{E}[Y \mid X]^2\right] - \operatorname{E}[\operatorname{E}[Y \mid X]]^2\right).

Finally, we recognize the terms in the second set of parentheses as the variance of the conditional expectation \operatorname{E}[Y\mid X]:

= \operatorname{E}[\operatorname{Var}(Y \mid X)] + \operatorname{Var}(\operatorname{E}[Y \mid X]).

General variance decomposition applicable to dynamic systems

The following formula shows how to apply the general, measure-theoretic variance decomposition formula to stochastic dynamic systems. Let Y(t) be the value of a system variable at time t. Suppose we have the internal histories (natural filtrations) H_{1t},H_{2t},\ldots,H_{c-1,t}, each one corresponding to the history (trajectory) of a different collection of system variables. The collections need not be disjoint. The variance of Y(t) can be decomposed, for all times t, into c \geq 2 components as follows:

\begin{align}\operatorname{Var}[Y(t)] = {} & \operatorname{E}(\operatorname{Var}[Y(t)\mid H_{1t},H_{2t},\ldots,H_{c-1,t}]) \\
& + \sum_{j=2}^{c-1}\operatorname{E}(\operatorname{Var}[\operatorname{E}[Y(t)\mid H_{1t},H_{2t},\ldots,H_{jt}] \mid H_{1t},H_{2t},\ldots,H_{j-1,t}]) \\
& + \operatorname{Var}(\operatorname{E}[Y(t)\mid H_{1t}]).\end{align}

The decomposition is not unique. It depends on the order of the conditioning in the sequential decomposition.

The square of the correlation and explained (or informational) variation

In cases where (Y,X) are such that the conditional expected value is linear; that is, in cases where

\operatorname{E}(Y \mid X) = a X + b,

it follows from the bilinearity of covariance that

a=\frac{\operatorname{Cov}(Y,X)}{\operatorname{Var}(X)} \qquad \text{and} \qquad b=\operatorname{E}(Y)-\frac{\operatorname{Cov}(Y,X)}{\operatorname{Var}(X)}\operatorname{E}(X),

and the explained component of the variance divided by the total variance is just the square of the correlation between Y and X; that is, in such cases,

\frac{\operatorname{Var}(\operatorname{E}(Y \mid X))}{\operatorname{Var}(Y)} = \operatorname{Corr}(X, Y)^2.

One example of this situation is when (X,Y) have a bivariate normal (Gaussian) distribution.
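In the bivariate normal case this can be checked by simulation; the sketch below (arbitrary means, standard deviations, and correlation, using NumPy) estimates the explained fraction of the variance and compares it with the squared correlation:

    import numpy as np

    rng = np.random.default_rng(0)
    rho, sigma_x, sigma_y, mu_x, mu_y = 0.6, 2.0, 3.0, 1.0, -1.0

    cov = [[sigma_x**2, rho * sigma_x * sigma_y],
           [rho * sigma_x * sigma_y, sigma_y**2]]
    x, y = rng.multivariate_normal([mu_x, mu_y], cov, size=500_000).T

    # For a bivariate normal, E[Y | X] is linear: a*X + b with a = Cov(X, Y)/Var(X).
    a = np.cov(x, y)[0, 1] / np.var(x)
    b = y.mean() - a * x.mean()
    cond_mean = a * x + b

    print(np.var(cond_mean) / np.var(y))   # explained fraction, approximately rho**2 = 0.36
    print(np.corrcoef(x, y)[0, 1] ** 2)    # squared correlation, approximately 0.36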

More generally, when the conditional expectation \operatorname{E}(Y\mid X) is a non-linear function of X, the explained (or informational) variation is

\iota_{Y\mid X} = \frac{\operatorname{Var}(\operatorname{E}(Y \mid X))}{\operatorname{Var}(Y)} = \operatorname{Corr}(\operatorname{E}(Y \mid X), Y)^2,

which can be estimated as the R squared from a non-linear regression of Y on X, using data drawn from the joint distribution of (X,Y).
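To illustrate, here is a sketch with a constructed model in which the conditional mean is known and non-linear, \operatorname{E}(Y\mid X)=X^2 (hypothetical, chosen only so both expressions for \iota can be evaluated directly):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 500_000
    x = rng.normal(0.0, 1.0, n)
    y = x**2 + rng.normal(0.0, 0.5, n)   # E[Y | X] = X**2 by construction

    cond_mean = x**2                      # the true, non-linear conditional mean

    print(np.var(cond_mean) / np.var(y))           # Var(E[Y|X]) / Var(Y)
    print(np.corrcoef(cond_mean, y)[0, 1] ** 2)    # Corr(E[Y|X], Y)**2
    print(np.corrcoef(x, y)[0, 1] ** 2)            # ordinary Corr(X, Y)**2, close to 0

The first two numbers agree (about 0.89 here), while the ordinary squared correlation between X and Y is close to zero, which is why the linear formula of the previous paragraph is not applicable in this non-linear case.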

When \operatorname{E}(Y\mid X) has a Gaussian distribution (and is an invertible function of X), or Y itself has a (marginal) Gaussian distribution, this explained component of variation sets a lower bound on the mutual information:

\operatorname{I}(Y; X) \geq \ln\left([1 - \iota_{Y \mid X}]^{-1/2}\right).

Higher moments

A similar law for the third central moment \mu_3 says

\mu_3(Y)=\operatorname{E}\left(\mu_3(Y \mid X)\right) + \mu_3(\operatorname{E}(Y \mid X)) + 3\operatorname{Cov}(\operatorname{E}(Y \mid X), \operatorname{Var}(Y \mid X)).

For higher cumulants, a generalization exists. See law of total cumulance.
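The third-moment identity above can also be checked numerically; in the sketch below, X is a two-point random variable and Y is conditionally exponential (arbitrary choices made only for the check), so all the conditional moments on the right-hand side are available in closed form:

    import random

    random.seed(2)

    # X is 0 with probability p and 1 otherwise; Y | X = i ~ Exponential(rate lam[i]).
    p, lam = 0.4, (1.0, 0.25)

    def exp_mean(r): return 1 / r
    def exp_var(r): return 1 / r**2
    def exp_mu3(r): return 2 / r**3          # third central moment of an exponential

    w = (p, 1 - p)
    m = [exp_mean(r) for r in lam]
    v = [exp_var(r) for r in lam]
    m_bar = sum(wi * mi for wi, mi in zip(w, m))
    v_bar = sum(wi * vi for wi, vi in zip(w, v))

    # Right-hand side: E[mu3(Y|X)] + mu3(E[Y|X]) + 3*Cov(E[Y|X], Var(Y|X)).
    rhs = sum(wi * exp_mu3(r) for wi, r in zip(w, lam))
    rhs += sum(wi * (mi - m_bar) ** 3 for wi, mi in zip(w, m))
    rhs += 3 * sum(wi * (mi - m_bar) * (vi - v_bar) for wi, mi, vi in zip(w, m, v))

    # Left-hand side: Monte Carlo estimate of the third central moment of Y.
    N = 1_000_000
    ys = [random.expovariate(lam[0] if random.random() < p else lam[1]) for _ in range(N)]
    y_bar = sum(ys) / N
    lhs = sum((y - y_bar) ** 3 for y in ys) / N

    print(lhs, rhs)  # the estimate converges to the exact value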

References

  1. Neil A. Weiss, A Course in Probability, Addison-Wesley, 2005, pages 385–386.
  2. Joseph K. Blitzstein and Jessica Hwang, Introduction to Probability.
  3. Mahler, Howard C.; Dean, Curtis Gary (2001). "Chapter 8: Credibility". Foundations of Casualty Actuarial Science (4th ed.). pp. 525–526. ISBN 978-0-96247-622-8. http://people.stat.sfu.ca/~cltsai/ACMA315/Ch8_Credibility.pdf. Retrieved June 25, 2015.
  4. Bowsher, C.G. and P.S. Swain, "Identifying sources of variation and the flow of information in biochemical networks", PNAS, May 15, 2012, 109 (20), E1320–E1328.
  5. Neil A. Weiss, A Course in Probability, Addison-Wesley, 2005, pages 380–383.