In probability theory and statistics, the sum of independent binomial random variables is itself a binomial random variable if all of the component variables share the same success probability. If the success probabilities differ, the probability distribution of the sum is not binomial.[1] The binomial sum variance inequality states that the variance of the sum of binomially distributed random variables is always less than or equal to the variance of a binomial variable with the same n and p parameters; the lack of uniformity in success probabilities across independent trials leads to a smaller variance.[2][3][4][5][6] The inequality is a special case of a more general theorem involving the expected value of convex functions.[7] In some statistical applications, the standard binomial variance estimator can be used even when the component probabilities differ, though with a variance estimate that is biased upward.
Consider the sum, Z, of two independent binomial random variables, X ~ B(m0, p0) and Y ~ B(m1, p1), where Z = X + Y. Then, the variance of Z is less than or equal to its variance under the assumption that p0 = p1, that is, if Z had a binomial distribution.[8] Symbolically,
\operatorname{Var}(Z) \leqslant E[Z]\left(1 - \tfrac{E[Z]}{m_0 + m_1}\right)
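The inequality can be checked numerically for a concrete pair of binomial variables; the parameter values below are arbitrary illustrative choices.

```python
# Numerical check of the inequality for the sum of two independent
# binomial variables X ~ B(m0, p0) and Y ~ B(m1, p1).
m0, p0 = 10, 0.3
m1, p1 = 15, 0.7

ez = m0 * p0 + m1 * p1                            # E[Z] = E[X] + E[Y]
var_z = m0 * p0 * (1 - p0) + m1 * p1 * (1 - p1)   # by independence
binom_bound = ez * (1 - ez / (m0 + m1))           # variance if Z were binomial

assert var_z <= binom_bound
print(var_z, binom_bound)                         # 5.25 vs. 6.21
```

Here the true variance of the sum (5.25) falls strictly below the binomial bound (6.21), because p0 and p1 differ.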
We wish to prove that
\operatorname{Var}(Z) \leqslant E[Z]\left(1 - \tfrac{E[Z]}{m_0 + m_1}\right)
If Z has a binomial distribution with parameters n and p, then the expected value of Z is given by E[Z] = np and the variance of Z is given by Var(Z) = np(1 − p). Letting n = m0 + m1 and substituting E[Z] for np gives
\operatorname{Var}(Z) = E[Z]\left(1 - \tfrac{E[Z]}{m_0 + m_1}\right)

Since X and Y are independent, the variance of the sum equals the sum of the variances, so

\operatorname{Var}(Z) = E[X]\left(1 - \tfrac{E[X]}{m_0}\right) + E[Y]\left(1 - \tfrac{E[Y]}{m_1}\right)
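The decomposition of Var(Z) into the two component variances, rewritten in terms of expected values, can be verified numerically; the parameter values are arbitrary examples.

```python
# Writing Var(X) = m0*p0*(1 - p0) in terms of E[X] = m0*p0 gives
# E[X]*(1 - E[X]/m0), and likewise for Y; the two forms must agree.
m0, p0 = 12, 0.25
m1, p1 = 8, 0.6

ex, ey = m0 * p0, m1 * p1
lhs = ex * (1 - ex / m0) + ey * (1 - ey / m1)         # expectation form
direct = m0 * p0 * (1 - p0) + m1 * p1 * (1 - p1)      # np(1-p) form

assert abs(lhs - direct) < 1e-12
```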
In order to prove the theorem, it is therefore sufficient to prove that
E[X]\left(1 - \tfrac{E[X]}{m_0}\right) + E[Y]\left(1 - \tfrac{E[Y]}{m_1}\right) \leqslant E[Z]\left(1 - \tfrac{E[Z]}{m_0 + m_1}\right)
Substituting E[X] + E[Y] for E[Z] gives
E[X]\left(1 - \tfrac{E[X]}{m_0}\right) + E[Y]\left(1 - \tfrac{E[Y]}{m_1}\right) \leqslant (E[X] + E[Y])\left(1 - \tfrac{E[X] + E[Y]}{m_0 + m_1}\right)
Expanding both sides gives

E[X] - \tfrac{E[X]^2}{m_0} + E[Y] - \tfrac{E[Y]^2}{m_1} \leqslant E[X] + E[Y] - \tfrac{(E[X] + E[Y])^2}{m_0 + m_1}

Subtracting E[X] and E[Y] from both sides yields

-\tfrac{E[X]^2}{m_0} - \tfrac{E[Y]^2}{m_1} \leqslant -\tfrac{(E[X] + E[Y])^2}{m_0 + m_1}

Multiplying by −1 reverses the inequality:

\tfrac{E[X]^2}{m_0} + \tfrac{E[Y]^2}{m_1} \geqslant \tfrac{(E[X] + E[Y])^2}{m_0 + m_1}

Expanding the right-hand side,

\tfrac{E[X]^2}{m_0} + \tfrac{E[Y]^2}{m_1} \geqslant \tfrac{E[X]^2 + 2E[X]E[Y] + E[Y]^2}{m_0 + m_1}

Multiplying both sides by m_0 m_1 (m_0 + m_1) gives

(m_0 m_1 + m_1^2)E[X]^2 + (m_0 m_1 + m_0^2)E[Y]^2 \geqslant m_0 m_1 \left(E[X]^2 + 2E[X]E[Y] + E[Y]^2\right)

which simplifies to

m_1^2 E[X]^2 - 2 m_0 m_1 E[X] E[Y] + m_0^2 E[Y]^2 \geqslant 0

that is,

(m_1 E[X] - m_0 E[Y])^2 \geqslant 0

which always holds, completing the proof.
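The key intermediate step of the proof, E[X]²/m0 + E[Y]²/m1 ≥ (E[X] + E[Y])²/(m0 + m1), can be spot-checked over a small grid of values; the grid is an arbitrary illustrative choice.

```python
# Spot-check: ex**2/m0 + ey**2/m1 >= (ex + ey)**2 / (m0 + m1),
# which the proof reduces to (m1*ex - m0*ey)**2 >= 0.
import itertools

checked = 0
for m0, m1 in itertools.product(range(1, 6), repeat=2):
    for ex in [0.0, 0.5, 1.3, m0 * 0.9]:
        for ey in [0.0, 0.7, 2.1, m1 * 0.9]:
            lhs = ex**2 / m0 + ey**2 / m1
            rhs = (ex + ey) ** 2 / (m0 + m1)
            # Equality holds exactly when m1*ex == m0*ey.
            assert lhs >= rhs - 1e-12
            checked += 1
print(checked, "cases verified")
```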
Although this proof was developed for the sum of two variables, it generalizes readily to sums of more than two. Additionally, if the individual success probabilities are known, then the variance is known to take the form
\operatorname{Var}(Z) = n\bar{p}(1 - \bar{p}) - n s^2,

where

s^2 = \tfrac{1}{n}\sum_{i=1}^{n} (p_i - \bar{p})^2

and \bar{p} is the mean of the p_i. When all of the success probabilities are equal (p_i = \bar{p}), s^2 = 0 and the expression reduces to the ordinary binomial variance.
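This closed form agrees with the variance of the sum computed directly, term by term, as the sum of the Bernoulli variances p_i(1 − p_i); the probabilities below are arbitrary examples.

```python
# Compare n*p_bar*(1 - p_bar) - n*s2 with the direct (Poisson-binomial)
# variance sum(p_i * (1 - p_i)) for an arbitrary probability vector.
p = [0.1, 0.4, 0.4, 0.9, 0.65]
n = len(p)
p_bar = sum(p) / n
s2 = sum((pi - p_bar) ** 2 for pi in p) / n   # population variance of the p_i

var_closed = n * p_bar * (1 - p_bar) - n * s2
var_direct = sum(pi * (1 - pi) for pi in p)

assert abs(var_closed - var_direct) < 1e-12
```

The correction term n·s² is exactly the amount by which heterogeneity in the p_i shrinks the variance below the binomial value n·p̄(1 − p̄).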
The inequality can be useful in the context of multiple testing, where many statistical hypothesis tests are conducted within a particular study. Each test can be treated as a Bernoulli variable with its own success probability. Consider the total number of positive tests as a random variable denoted by S. This quantity is important in the estimation of false discovery rates (FDR), which quantify uncertainty in the test results. If the null hypothesis is true for some tests and the alternative hypothesis is true for others, then the success probabilities are likely to differ between the two groups. However, by the variance inequality, if the tests are independent, the variance of S will be no greater than it would be under a binomial distribution with the same mean.
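A Monte Carlo sketch of this setting, with illustrative probabilities for a mix of null and alternative tests, shows the sample variance of S staying below the binomial bound.

```python
# S counts positives across independent tests with unequal success
# probabilities; its variance should not exceed n*p_bar*(1 - p_bar).
# The mix of probabilities and the number of replications are illustrative.
import random

random.seed(0)
probs = [0.05] * 80 + [0.8] * 20   # hypothetical nulls vs. alternatives
n = len(probs)
p_bar = sum(probs) / n

draws = []
for _ in range(20000):
    draws.append(sum(random.random() < p for p in probs))

mean_s = sum(draws) / len(draws)
var_s = sum((s - mean_s) ** 2 for s in draws) / (len(draws) - 1)

# True variance is sum(p*(1-p)) = 7.0; the binomial bound is 16.0.
assert var_s <= n * p_bar * (1 - p_bar)
```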